.TH DOC2TXT 1
.SH NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
\- extract printable text from Microsoft documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B doc2ps
[
.I file.doc
]
.br
.B wdoc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.IB mtpt /WordDocument
.br
.B aux/msexceltables
[
.B -qaDnt
] [
.B -d
.I delim
] [
.B -c
.I column-range
] [
.B -w
.I worksheet-range
]
.IB mtpt /Workbook
.SH DESCRIPTION
.I Doc2txt
is an
.IR rc (1)
script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document
and write it on the standard output.
.I Doc2ps
is similar, but emits PostScript corresponding to the document.
.I Wdoc2txt
is similar to
.IR doc2txt ,
but uses
.IR plumb (1)
to send the output to a new
.IR acme (1)
window instead.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an MS Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TF "\fL-d \fIdelim"
.TP
.B -a
Attempt conversion of non-tabular sheets in the workbook (charts).
.TP
.BI -d " delim
Set the inter-field delimiter to the string
.IR delim ,
by default a single space.
.TP
.B -D
Enable debugging output.
.TP
.BI -c " range
.I Range
is a comma-separated list of column numbers and ranges.
The two ends of a range are separated by a dash.
Limit processing to just those columns named;
by default all columns are output.
.TP
.B -n
Disable field padding to column width.
.TP
.B -q
Disable quoting of textual fields (see
.IR quote (2)).
.TP
.B -t
Truncate fields to the column width.
.TP
.BI -w " range
.I Range
is a comma-separated list of worksheet numbers and ranges,
in the same syntax as the
.B -c
option above;
output is limited to the sheets named.
Suppressed chart pages are always included in the sheet count.
.SH EXAMPLE
Extract pieces of an MS Excel spreadsheet.
.PD 0
.IP
.EX
.SM
aux/olefs report.xls
aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
unmount /mnt/doc
.EE
.PD
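.PP
The range lists given to
.B -w
and
.B -c
in the command above can be illustrated with a short Python sketch.
This is not the parser used by
.IR msexceltables ;
the function name
.B parse_ranges
is hypothetical, and the sketch assumes that both ends of a dashed range are included.
.IP
.EX
```python
def parse_ranges(spec):
    """Expand a range list such as "1,7,9-14" into a sorted list of integers.

    Assumption (not confirmed by the manual): both ends of a dashed
    range are included in the selection.
    """
    selected = set()
    for item in spec.split(","):
        # partition() yields an empty dash when the item is a single number
        first, dash, last = item.partition("-")
        if dash:
            lo, hi = int(first), int(last)
            if lo > hi:
                raise ValueError("descending range: " + item)
            selected.update(range(lo, hi + 1))
        else:
            selected.add(int(first))
    return sorted(selected)


# The worksheet and column lists from the example command
print(parse_ranges("1,7,9-14"))
print(parse_ranges("3-5"))
```
.EE
.PP
Under that assumption the worksheet list
.B 1,7,9-14
names eight sheets and the column list
.B 3-5
names three columns.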
.SH SOURCE
.TF "\fL/sys/src/cmd/aux   "
.TP
.B /rc/bin
.BR doc2txt ,
.BR doc2ps ,
.BR wdoc2txt ,
and
.B xls2txt
.TP
.B /sys/src/cmd/aux
the others
.fi
.PD
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
at Microsoft's developer (MSDN) home page.
.br
``LAOLA Binary Structures'',
.B http://user.cs.tu-berlin.de/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.br
.B http://sc.openoffice.org/excelfileformat.pdf