.TH HTML 2
.SH NAME
parsehtml,
printitems,
validitems,
freeitems,
freedocinfo,
dimenkind,
dimenspec,
targetid,
targetname,
fromStr,
toStr
\- HTML parser
.SH SYNOPSIS
.nf
.PP
.ft L
#include <u.h>
#include <libc.h>
#include <html.h>
.ft P
.PP
.ta \w'\fLToken* 'u
.B
Item*	parsehtml(uchar* data, int datalen, Rune* src, int mtype,
.B
	int chset, Docinfo** pdi)
.PP
.B
void	printitems(Item* items, char* msg)
.PP
.B
int	validitems(Item* items)
.PP
.B
void	freeitems(Item* items)
.PP
.B
void	freedocinfo(Docinfo* d)
.PP
.B
int	dimenkind(Dimen d)
.PP
.B
int	dimenspec(Dimen d)
.PP
.B
int	targetid(Rune* s)
.PP
.B
Rune*	targetname(int targid)
.PP
.B
uchar*	fromStr(Rune* buf, int n, int chset)
.PP
.B
Rune*	toStr(uchar* buf, int n, int chset)
.SH DESCRIPTION
.PP
This library implements a parser for HTML 4.0 documents.
The parsed HTML is converted into an intermediate representation that
describes how the formatted HTML should be laid out.
.PP
.I Parsehtml
parses an entire HTML document contained in the buffer
.I data
and having length
.IR datalen .
The URL of the document should be passed in as
.IR src .
.I Mtype
is the media type of the document, which should be either
.B TextHtml
or
.BR TextPlain .
The character set of the document is described in
.IR chset ,
which can be one of
.BR US_Ascii ,
.BR ISO_8859_1 ,
.B UTF_8
or
.BR Unicode .
The return value is a linked list of
.B Item
structures, described in detail below.
As a side effect,
.BI * pdi
is set to point to a newly created
.B Docinfo
structure, containing information pertaining to the entire document.
.PP
The library expects two allocation routines to be provided by the
caller,
.B emalloc
and
.BR erealloc .
These routines are analogous to the standard malloc and realloc routines,
except that they should not return if the memory allocation fails.
In addition,
.B emalloc
is required to zero the memory.
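.PP
As a rough illustration of these requirements, the two routines might be
written as follows in portable C.
This is only a sketch: the parameter types are assumptions (the library's
header gives the real prototypes), and a Plan 9 program would more likely
report the failure with its own fatal-error routine than with
.BR exit .

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate n zeroed bytes; never returns if the allocation fails. */
void*
emalloc(size_t n)
{
	void *p;

	p = calloc(1, n ? n : 1);	/* calloc zeroes the memory */
	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		exit(1);
	}
	return p;
}

/* Resize p to n bytes; never returns if the allocation fails. */
void*
erealloc(void *p, size_t n)
{
	p = realloc(p, n ? n : 1);	/* any newly added bytes are not zeroed */
	if(p == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		exit(1);
	}
	return p;
}
```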
.PP
For debugging purposes,
.I printitems
may be called to display the contents of an item list; individual items may
be printed using the
.B %I
print verb, installed on the first call to
.IR parsehtml .
.I validitems
traverses the item list, checking that all of the pointers are valid.
It returns
.B 1
if everything is ok, and
.B 0
if an error was found.
Normally, one would not call these routines directly.
Instead, one sets the global variable
.I dbgbuild
and the library calls them automatically.
One can also set
.IR warn ,
to cause the library to print a warning whenever it finds a problem with the
input document, and
.IR dbglex ,
to print debugging information in the lexer.
.PP
When an item list is no longer needed, it should be freed with
.IR freeitems .
Then,
.I freedocinfo
should be called on the pointer returned in
.BI * pdi\f1.
.PP
.I Dimenkind
and
.I dimenspec
are provided to interpret the
.B Dimen
type, as described in the section
.IR "Dimension Specifications" .
.PP
Frame target names are mapped to integer ids via a global, permanent mapping.
To find the value for a given name, call
.IR targetid ,
which allocates a new id if the name hasn't been seen before.
The name of a given, known id may be retrieved using
.IR targetname .
The library predefines
.BR FTtop ,
.BR FTself ,
.B FTparent
and
.BR FTblank .
.PP
The library handles all text as Unicode strings (type
.BR Rune* ).
Character set conversion is provided by
.I fromStr
and
.IR toStr .
.I FromStr
takes
.I n
Unicode characters from
.I buf
and converts them to the character set described by
.IR chset .
.I ToStr
takes
.I n
bytes from
.IR buf ,
interpreted as belonging to character set
.IR chset ,
and converts them to a Unicode string.
Both routines null-terminate the result, and use
.B emalloc
to allocate space for it.
.SS Items
The return value of
.I parsehtml
is a linked list of variant structures,
with the generic portion described by the following definition:
.PP
.EX
.ta 6n +\w'Genattr* 'u
typedef struct Item Item;
struct Item
{
	Item*	next;
	int	width;
	int	height;
	int	ascent;
	int	anchorid;
	int	state;
	Genattr*	genattr;
	int	tag;
};
.EE
.PP
The field
.B next
points to the successor in the linked list of items, while
.BR width ,
.BR height ,
and
.B ascent
are intended for use by the caller as part of the layout process.
.BR Anchorid ,
if non-zero, gives the integer id assigned by the parser to the anchor that
this item is in (see section
.IR Anchors ).
.B State
is a collection of flags and values described as follows:
.PP
.EX
.ta 6n +\w'IFindentshift = 'u
enum
{
	IFbrk =	0x80000000,
	IFbrksp =	0x40000000,
	IFnobrk =	0x20000000,
	IFcleft =	0x10000000,
	IFcright =	0x08000000,
	IFwrap =	0x04000000,
	IFhang =	0x02000000,
	IFrjust =	0x01000000,
	IFcjust =	0x00800000,
	IFsmap =	0x00400000,
	IFindentshift =	8,
	IFindentmask =	(255<<IFindentshift),
	IFhangmask =	255
};
.EE
.PP
.B IFbrk
is set if a break is to be forced before placing this item.
.B IFbrksp
is set if a one-line space should be added to the break (in which case
.B IFbrk
is also set).
.B IFnobrk
is set if a break is not permitted before the item.
.B IFcleft
is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
before this item is placed, and
.B IFcright
is set for right floats.
In both cases,
.B IFbrk
is also set.
.B IFwrap
is set if the line containing this item is allowed to wrap.
.B IFhang
is set if this item hangs into the left indent.
.B IFrjust
is set if the line containing this item should be right justified,
and
.B IFcjust
is set for center justified lines.
.B IFsmap
is used to indicate that an image is a server-side map.
The low 8 bits, represented by
.BR IFhangmask ,
indicate the current hang into left indent, in tenths of a tabstop.
The next 8 bits, represented by
.B IFindentmask
and
.BR IFindentshift ,
indicate the current indent in tab stops.
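.PP
The packing of the two 8-bit values can be sketched in portable C as below.
The constants are copied from the enum above; the helper names
.BR stateindent ,
.B statehang
and
.B setindenthang
are hypothetical and not part of the library, and the state word is handled
as an unsigned value here so that the high flag bits stay well defined in
standard C.

```c
#include <assert.h>

/* Two of the flag bits, and the masks, as listed in the enum above. */
#define IFbrk	0x80000000u
#define IFbrksp	0x40000000u
enum
{
	IFindentshift = 8,
	IFindentmask = (255<<IFindentshift),
	IFhangmask = 255
};

/* Current indent, in tab stops. */
int
stateindent(unsigned int state)
{
	return (state & IFindentmask) >> IFindentshift;
}

/* Current hang into the left indent, in tenths of a tab stop. */
int
statehang(unsigned int state)
{
	return state & IFhangmask;
}

/* Store a new indent and hang in a state word, keeping the flag bits. */
unsigned int
setindenthang(unsigned int state, int indent, int hang)
{
	assert(indent >= 0 && indent <= 255);
	assert(hang >= 0 && hang <= 255);
	state &= ~(unsigned int)(IFindentmask | IFhangmask);
	return state | ((unsigned int)indent << IFindentshift) | (unsigned int)hang;
}
```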
.PP
The field
.B genattr
is an optional pointer to an auxiliary structure, described in the section
.IR "Generic Attributes" .
.PP
Finally,
.B tag
describes which variant type this item has.
It can have one of the values
.BR Itexttag ,
.BR Iruletag ,
.BR Iimagetag ,
.BR Iformfieldtag ,
.BR Itabletag ,
.B Ifloattag
or
.BR Ispacertag .
For each of these values, there is an additional structure defined, which
includes Item as an unnamed initial substructure, and then defines additional
fields.
.PP
Items of type
.B Itexttag
represent a piece of text, using the following structure:
.PP
.EX
.ta 6n +\w'Rune* 'u
struct Itext
{
	Item;
	Rune*	s;
	int	fnt;
	int	fg;
	uchar	voff;
	uchar	ul;
};
.EE
.PP
Here
.B s
is a null-terminated Unicode string of the actual characters making up this text item,
.B fnt
is the font number (described in the section
.IR "Font Numbers" ),
and
.B fg
is the RGB encoded color for the text.
.B Voff
measures the vertical offset from the baseline; subtract
.B Voffbias
to get the actual value (negative values represent a displacement down the page).
The field
.B ul
is the underline style:
.B ULnone
if no underline,
.B ULunder
for conventional underline, and
.B ULmid
for strike-through.
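.PP
The variant scheme can be imitated in portable C by making the generic part
the first member of each variant and casting once the tag has been checked.
The sketch below uses cut-down copies of
.B Item
and
.B Itext
holding only the fields it needs.
The tag values, the
.B Rune
typedef and the helper
.B textlength
are assumptions made for illustration; Plan 9 C permits the unnamed
.B Item;
member shown above, whereas standard C needs a named one.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the Plan 9 Rune type */

enum { Itexttag, Iruletag };	/* tag values assumed for illustration */

typedef struct Item Item;
struct Item
{
	Item*	next;
	int	tag;
};

typedef struct Itext Itext;
struct Itext
{
	Item	item;	/* generic part first, so Item* and Itext* coincide */
	Rune*	s;
};

/* Length of a null-terminated Rune string. */
static size_t
runelen(const Rune *s)
{
	size_t n;

	assert(s != NULL);
	for(n = 0; s[n] != 0; n++)
		;
	return n;
}

/* Total number of characters held by the text items in a list. */
size_t
textlength(Item *items)
{
	Item *it;
	size_t total;

	total = 0;
	for(it = items; it != NULL; it = it->next)
		if(it->tag == Itexttag)
			total += runelen(((Itext*)it)->s);
	return total;
}
```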
.PP
Items of type
.B Iruletag
represent a horizontal rule, as follows:
.PP
.EX
.ta 6n +\w'Dimen 'u
struct Irule
{
	Item;
	uchar	align;
	uchar	noshade;
	int	size;
	Dimen	wspec;
};
.EE
.PP
Here
.B align
is the alignment specification (described in the corresponding section),
.B noshade
is set if the rule should not be shaded,
.B size
is the height of the rule (as set by the size attribute),
and
.B wspec
is the desired width (see section
.IR "Dimension Specifications" ).
.PP
Items of type
.B Iimagetag
describe embedded images, for which the following structure is defined:
.PP
.EX
.ta 6n +\w'Iimage* 'u
struct Iimage
{
	Item;
	Rune*	imsrc;
	int	imwidth;
	int	imheight;
	Rune*	altrep;
	Map*	map;
	int	ctlid;
	uchar	align;
	uchar	hspace;
	uchar	vspace;
	uchar	border;
	Iimage*	nextimage;
};
.EE
.PP
Here
.B imsrc
is the URL of the image source,
.B imwidth
and
.BR imheight ,
if non-zero, contain the specified width and height for the image,
and
.B altrep
is the text to use as an alternative to the image, if the image is not displayed.
.BR Map ,
if set, points to a structure describing an associated client-side image map.
.B Ctlid
is reserved for use by the application, for handling animated images.
.B Align
encodes the alignment specification of the image.
.B Hspace
contains the number of pixels to pad the image with on either side, and
.B Vspace
the padding above and below.
.B Border
is the width of the border to draw around the image.
.B Nextimage
points to the next image in the document (the head of this list is
.BR Docinfo.images ).
.PP
For items of type
.BR Iformfieldtag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Formfield* 'u
struct Iformfield
{
	Item;
	Formfield*	formfield;
};
.EE
.PP
This adds a single field,
.BR formfield ,
which points to a structure describing a field in a form, described in section
.IR Forms .
.PP
For items of type
.BR Itabletag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Table* 'u
struct Itable
{
	Item;
	Table*	table;
};
.EE
.PP
.B Table
points to a structure describing the table, described in the section
.IR Tables .
.PP
For items of type
.BR Ifloattag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Ifloat* 'u
struct Ifloat
{
	Item;
	Item*	item;
	int	x;
	int	y;
	uchar	side;
	uchar	infloats;
	Ifloat*	nextfloat;
};
.EE
.PP
The
.B item
points to a single item (either a table or an image) that floats (the text of the
document flows around it), and
.B side
indicates the margin that this float sticks to; it is either
.B ALleft
or
.BR ALright .
.B X
and
.B y
are reserved for use by the caller; these are typically used for the coordinates
of the top of the float.
.B Infloats
is used by the caller to keep track of whether it has placed the float.
.B Nextfloat
is used by the caller to link together all of the floats that it has placed.
.PP
For items of type
.BR Ispacertag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Item; 'u
struct Ispacer
{
	Item;
	int	spkind;
};
.EE
.PP
.B Spkind
encodes the kind of spacer, and may be one of
.B ISPnull
(zero height and width),
.B ISPvline
(takes on height and ascent of the current font),
.B ISPhspace
(has the width of a space in the current font) and
.B ISPgeneral
(for all other purposes, such as between markers and lists).
.SS Generic Attributes
.PP
The genattr field of an item, if non-nil, points to a structure that holds
the values of attributes not specific to any particular
item type, as they occur on a wide variety of underlying HTML tags.
The structure is as follows:
.PP
.EX
.ta 6n +\w'SEvent* 'u
typedef struct Genattr Genattr;
struct Genattr
{
	Rune*	id;
	Rune*	class;
	Rune*	style;
	Rune*	title;
	SEvent*	events;
};
.EE
.PP
Fields
.BR id ,
.BR class ,
.B style
and
.BR title ,
when non-nil, contain values of correspondingly named attributes of the HTML tag
associated with this item.
.B Events
is a linked list of events (with corresponding scripted actions) associated with the item:
.PP
.EX
.ta 6n +\w'SEvent* 'u
typedef struct SEvent SEvent;
struct SEvent
{
	SEvent*	next;
	int	type;
	Rune*	script;
};
.EE
.PP
Here,
.B next
points to the next event in the list,
.B type
is one of
.BR SEonblur ,
.BR SEonchange ,
.BR SEonclick ,
.BR SEondblclick ,
.BR SEonfocus ,
.BR SEonkeypress ,
.BR SEonkeyup ,
.BR SEonload ,
.BR SEonmousedown ,
.BR SEonmousemove ,
.BR SEonmouseout ,
.BR SEonmouseover ,
.BR SEonmouseup ,
.BR SEonreset ,
.BR SEonselect ,
.B SEonsubmit
or
.BR SEonunload ,
and
.B script
is the text of the associated script.
.SS Dimension Specifications
.PP
Some structures include a dimension specification, used where
a number can be followed by a
.B %
or a
.B *
to indicate
percentage of total or relative weight.
This is encoded using the following structure:
.PP
.EX
.ta 6n +\w'int 'u
typedef struct Dimen Dimen;
struct Dimen
{
	int	kindspec;
};
.EE
.PP
Separate kind and spec values are extracted using
.I dimenkind
and
.IR dimenspec .
.I Dimenkind
returns one of
.BR Dnone ,
.BR Dpixels ,
.B Dpercent
or
.BR Drelative .
.B Dnone
means that no dimension was specified.
In all other cases,
.I dimenspec
should be called to find the absolute number of pixels, the percentage of total,
or the relative weight.
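.PP
A caller might use the two routines roughly as follows when turning a
specification into a pixel width.
Everything in this sketch beyond the names
.BR Dimen ,
.BR dimenkind ,
.B dimenspec
and the four kinds is an assumption: the bit layout of
.B kindspec
is invented here so that the example is self-contained (the real constants
are in the library's header and may differ), the local
.B dimenkind
and
.B dimenspec
are stand-ins for the library routines, and
.B mkdimen
and
.B resolvewidth
are hypothetical helpers.

```c
#include <assert.h>

/*
 * Assumed packing: kind in bits 29-30, spec in the low 29 bits.
 * The library's actual constants may differ.
 */
enum
{
	Dnone = 0,
	Dpixels = (1<<29),
	Dpercent = (2<<29),
	Drelative = (3<<29),
	Dkindmask = (3<<29),
	Dspecmask = ((1<<29)-1)
};

typedef struct Dimen Dimen;
struct Dimen
{
	int	kindspec;
};

/* Stand-ins for the library routines of the same names. */
int
dimenkind(Dimen d)
{
	return d.kindspec & Dkindmask;
}

int
dimenspec(Dimen d)
{
	return d.kindspec & Dspecmask;
}

/* Hypothetical helper: build a Dimen from a kind and a value. */
Dimen
mkdimen(int kind, int spec)
{
	Dimen d;

	assert(spec >= 0 && spec <= Dspecmask);
	d.kindspec = kind | spec;
	return d;
}

/*
 * Hypothetical helper: resolve a width against the available space.
 * Returns -1 when the caller must decide (no spec, or a relative weight).
 */
int
resolvewidth(Dimen d, int avail)
{
	switch(dimenkind(d)){
	case Dpixels:
		return dimenspec(d);
	case Dpercent:
		return avail * dimenspec(d) / 100;
	default:
		return -1;
	}
}
```

Relative weights are left to the caller because they can only be resolved
once the weights of all sibling rows or columns are known.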
.SS Background Specifications
.PP
It is possible to set the background of the entire document, and also
for some parts of the document (such as tables).
This is encoded as follows:
.PP
.EX
.ta 6n +\w'Rune* 'u
typedef struct Background Background;
struct Background
{
	Rune*	image;
	int	color;
};
.EE
.PP
.BR Image ,
if non-nil, is the URL of an image to use as the background.
If this is nil,
.B color
is used instead, as the RGB value for a solid fill color.
.SS Alignment Specifications
.PP
Certain items have alignment specifiers taken from the following
enumerated type:
.PP
.EX
.ta 6n
enum
{
	ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
	ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
};
.EE
.PP
These values correspond to the various alignment types named in the HTML 4.0
standard.
If an item has an alignment of
.B ALleft
or
.BR ALright ,
the library automatically encapsulates it inside a float item.
.PP
Tables, and the various rows, columns and cells within them, have a more
complex alignment specification, composed of separate vertical and
horizontal alignments:
.PP
.EX
.ta 6n +\w'uchar 'u
typedef struct Align Align;
struct Align
{
	uchar	halign;
	uchar	valign;
};
.EE
.PP
.B Halign
can be one of
.BR ALnone ,
.BR ALleft ,
.BR ALcenter ,
.BR ALright ,
.B ALjustify
or
.BR ALchar .
.B Valign
can be one of
.BR ALnone ,
.BR ALmiddle ,
.BR ALbottom ,
.B ALtop
or
.BR ALbaseline .
.SS Font Numbers
.PP
Text items have an associated font number (the
.B fnt
field), which is encoded as
.BR style*NumSize+size .
Here,
.B style
is one of
.BR FntR ,
.BR FntI ,
.B FntB
or
.BR FntT ,
for roman, italic, bold and typewriter font styles, respectively, and size is
.BR Tiny ,
.BR Small ,
.BR Normal ,
.B Large
or
.BR Verylarge .
The total number of possible font numbers is
.BR NumFnt ,
and the default font number is
.B DefFnt
(which is roman style, normal size).
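.PP
The encoding and its inverse can be sketched as follows.
The numeric values are an assumption (the constants are taken to count up
from zero in the order listed above),
.B NumStyle
is introduced here only to size the table, and
.BR mkfnt ,
.B fntstyle
and
.B fntsize
are hypothetical helpers, not library routines.

```c
#include <assert.h>

/* Values assumed: enumerated from zero in the order the text lists them. */
enum { FntR, FntI, FntB, FntT, NumStyle };
enum { Tiny, Small, Normal, Large, Verylarge, NumSize };
enum
{
	NumFnt = NumStyle*NumSize,
	DefFnt = FntR*NumSize + Normal
};

/* Combine a style and a size into a font number. */
int
mkfnt(int style, int size)
{
	assert(style >= 0 && style < NumStyle);
	assert(size >= 0 && size < NumSize);
	return style*NumSize + size;
}

/* Recover the style from a font number. */
int
fntstyle(int fnt)
{
	return fnt / NumSize;
}

/* Recover the size from a font number. */
int
fntsize(int fnt)
{
	return fnt % NumSize;
}
```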
.SS Document Info
.PP
Global information about an HTML page is stored in the following structure:
.PP
.EX
.ta 6n +\w'DestAnchor* 'u
typedef struct Docinfo Docinfo;
struct Docinfo
{
	// stuff from HTTP headers, doc head, and body tag
	Rune*	src;
	Rune*	base;
	Rune*	doctitle;
	Background	background;
	Iimage*	backgrounditem;
	int	text;
	int	link;
	int	vlink;
	int	alink;
	int	target;
	int	chset;
	int	mediatype;
	int	scripttype;
	int	hasscripts;
	Rune*	refresh;
	Kidinfo*	kidinfo;
	int	frameid;

	// info needed to respond to user actions
	Anchor*	anchors;
	DestAnchor*	dests;
	Form*	forms;
	Table*	tables;
	Map*	maps;
	Iimage*	images;
};
.EE
.PP
.B Src
gives the URL of the original source of the document,
and
.B base
is the base URL.
.B Doctitle
is the document's title, as set by a
.B <title>
element.
.B Background
is as described in the section
.IR "Background Specifications" ,
and
.B backgrounditem
is set to be an image item for the document's background image (if given as a URL),
or else nil.
.B Text
gives the default foreground text color of the document,
.B link
the unvisited hyperlink color,
.B vlink
the visited hyperlink color, and
.B alink
the color for highlighting hyperlinks (all in 24-bit RGB format).
.B Target
is the default target frame id.
.B Chset
and
.B mediatype
are as for the
.I chset
and
.I mtype
parameters to
.IR parsehtml .
.B Scripttype
is the type of any scripts contained in the document, and is always
.BR TextJavascript .
.B Hasscripts
is set if the document contains any scripts.
Scripting is currently unsupported.
.B Refresh
is the contents of a
.B "<meta http-equiv=Refresh ...>"
tag, if any.
.B Kidinfo
is set if this document is a frameset (see section
.IR Frames ).
.B Frameid
is this document's frame id.
.PP
.B Anchors
is a list of hyperlinks contained in the document,
and
.B dests
is a list of hyperlink destinations within the page (see the following section for details).
.BR Forms ,
.B tables
and
.B maps
are lists of the various forms, tables and client-side maps contained
in the document, as described in subsequent sections.
.B Images
is a list of all the image items in the document.
.SS Anchors
.PP
The library builds two lists for all of the
.B <a>
elements (anchors) in a document.
Each anchor is assigned a unique anchor id within the document.
For anchors which are hyperlinks (the
.B href
attribute was supplied), the following structure is defined:
.PP
.EX
.ta 6n +\w'Anchor* 'u
typedef struct Anchor Anchor;
struct Anchor
{
	Anchor*	next;
	int	index;
	Rune*	name;
	Rune*	href;
	int	target;
};
.EE
.PP
.B Next
points to the next anchor in the list (the head of this list is
.BR Docinfo.anchors ).
.B Index
is the anchor id; each item within this hyperlink is tagged with this value
in its
.B anchorid
field.
.B Name
and
.B href
are the values of the correspondingly named attributes of the anchor
(in particular, href is the URL to go to).
.B Target
is the value of the target attribute (if provided) converted to a frame id.
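.PP
Given an item's
.B anchorid ,
the matching hyperlink can be found by a linear search of the list, as in
the sketch below.
It uses a cut-down copy of
.B Anchor
with only the two fields the search needs, and
.B findanchor
is a hypothetical helper rather than a library routine.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Anchor Anchor;
struct Anchor
{
	Anchor*	next;
	int	index;
};

/* Find the hyperlink whose id matches an item's anchorid; NULL if none. */
Anchor*
findanchor(Anchor *anchors, int anchorid)
{
	Anchor *a;

	if(anchorid == 0)
		return NULL;	/* zero means the item is not inside an anchor */
	for(a = anchors; a != NULL; a = a->next)
		if(a->index == anchorid)
			return a;
	return NULL;
}
```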
.PP
Destinations within the document (anchors with the name attribute set)
are held in the
.B Docinfo.dests
list, using the following structure:
.PP
.EX
.ta 6n +\w'DestAnchor* 'u
typedef struct DestAnchor DestAnchor;
struct DestAnchor
{
	DestAnchor*	next;
	int	index;
	Rune*	name;
	Item*	item;
};
.EE
.PP
.B Next
is the next element of the list,
.B index
is the anchor id,
.B name
is the value of the name attribute, and
.B item
points to the item within the parsed document that should be considered
to be the destination.
.SS Forms
.PP
Any forms within a document are kept in a list, headed by
.BR Docinfo.forms .
The elements of this list are as follows:
.PP
.EX
.ta 6n +\w'Formfield* 'u
typedef struct Form Form;
struct Form
{
	Form*	next;
	int	formid;
	Rune*	name;
	Rune*	action;
	int	target;
	int	method;
	int	nfields;
	Formfield*	fields;
};
.EE
.PP
.B Next
points to the next form in the list.
.B Formid
is a serial number for the form within the document.
.B Name
is the value of the form's name or id attribute.
.B Action
is the value of any action attribute.
.B Target
is the value of the target attribute (if any) converted to a frame target id.
.B Method
is one of
.B HGet
or
.BR HPost .
.B Nfields
is the number of fields in the form, and
.B fields
is a linked list of the actual fields.
.PP
The individual fields in a form are described by the following structure:
.PP
.EX
.ta 6n +\w'Formfield* 'u
typedef struct Formfield Formfield;
struct Formfield
{
	Formfield*	next;
	int	ftype;
	int	fieldid;
	Form*	form;
	Rune*	name;
	Rune*	value;
	int	size;
	int	maxlength;
	int	rows;
	int	cols;
	uchar	flags;
	Option*	options;
	Item*	image;
	int	ctlid;
	SEvent*	events;
};
.EE
.PP
Here,
.B next
points to the next field in the list.
.B Ftype
is the type of the field, which can be one of
.BR Ftext ,
.BR Fpassword ,
.BR Fcheckbox ,
.BR Fradio ,
.BR Fsubmit ,
.BR Fhidden ,
.BR Fimage ,
.BR Freset ,
.BR Ffile ,
.BR Fbutton ,
.B Fselect
or
.BR Ftextarea .
.B Fieldid
is a serial number for the field within the form.
.B Form
points back to the form containing this field.
.BR Name ,
.BR value ,
.BR size ,
.BR maxlength ,
.B rows
and
.B cols
each contain the values of corresponding attributes of the field, if present.
.B Flags
contains per-field flags, of which
.B FFchecked
and
.B FFmultiple
are defined.
.B Image
is only used for fields of type
.BR Fimage ;
it points to an image item containing the image to be displayed.
.B Ctlid
is reserved for use by the caller, typically to store a unique id
of an associated control used to implement the field.
.B Events
is the same as the corresponding field of the generic attributes
associated with the item containing this field.
.B Options
is only used by fields of type
.BR Fselect ;
it consists of a list of possible options that may be selected for that
field, using the following structure:
.PP
.EX
.ta 6n +\w'Option* 'u
typedef struct Option Option;
struct Option
{
	Option*	next;
	int	selected;
	Rune*	value;
	Rune*	display;
};
.EE
.PP
.B Next
points to the next element of the list.
.B Selected
is set if this option is to be displayed initially.
.B Value
is the value to send when the form is submitted if this option is selected.
.B Display
is the string to display on the screen for this option.
.SS Tables
.PP
The library builds a list of all the tables in the document,
headed by
.BR Docinfo.tables .
Each element of this list has the following format:
.PP
.EX
.ta 6n +\w'Tablecell*** 'u
typedef struct Table Table;
struct Table
{
	Table*	next;
	int	tableid;
	Tablerow*	rows;
	int	nrow;
	Tablecol*	cols;
	int	ncol;
	Tablecell*	cells;
	int	ncell;
	Tablecell***	grid;
	Align	align;
	Dimen	width;
	int	border;
	int	cellspacing;
	int	cellpadding;
	Background	background;
	Item*	caption;
	uchar	caption_place;
	Lay*	caption_lay;
	int	totw;
	int	toth;
	int	caph;
	int	availw;
	Token*	tabletok;
	uchar	flags;
};
.EE
.PP
.B Next
points to the next element in the list of tables.
.B Tableid
is a serial number for the table within the document.
.B Rows
is an array of row specifications (described below) and
.B nrow
is the number of elements in this array.
Similarly,
.B cols
is an array of column specifications, and
.B ncol
the size of this array.
.B Cells
is a list of all cells within the table (structure described below)
and
.B ncell
is the number of elements in this list.
Note that a cell may span multiple rows and/or columns, thus
.B ncell
may be smaller than
.BR nrow*ncol .
.B Grid
is a two-dimensional array of cells within the table; the cell
at row
.B i
and column
.B j
is
.BR Table.grid[i][j] .
A cell that spans multiple rows and/or columns will
be referenced by
.B grid
multiple times; however, it will occur only once in
.BR cells .
.B Align
gives the alignment specification for the entire table,
and
.B width
gives the requested width as a dimension specification.
.BR Border ,
.B cellspacing
and
.B cellpadding
give the values of the corresponding attributes for the table,
and
.B background
gives the requested background for the table.
.B Caption
is a linked list of items to be displayed as the caption of the
table, either above or below depending on whether
.B caption_place
is
.B ALtop
or
.BR ALbottom .
Most of the remaining fields are reserved for use by the caller,
except
.BR tabletok ,
which is reserved for internal use.
The type
.B Lay
is not defined by the library; the caller can provide its
own definition.
.PP
The
.B Tablecol
structure is defined for use by the caller.
The library ensures that the correct number of these
is allocated, but leaves them blank.
The fields are as follows:
.PP
.EX
.ta 6n +\w'Point 'u
typedef struct Tablecol Tablecol;
struct Tablecol
{
	int	width;
	Align	align;
	Point	pos;
};
.EE
.PP
The rows in the table are specified as follows:
.PP
.EX
.ta 6n +\w'Background 'u
typedef struct Tablerow Tablerow;
struct Tablerow
{
	Tablerow*	next;
	Tablecell*	cells;
	int	height;
	int	ascent;
	Align	align;
	Background	background;
	Point	pos;
	uchar	flags;
};
.EE
.PP
.B Next
is only used during parsing; it should be ignored by the caller.
.B Cells
provides a list of all the cells in a row, linked through their
.B nextinrow
fields (see below).
.BR Height ,
.B ascent
and
.B pos
are reserved for use by the caller.
.B Align
is the alignment specification for the row, and
.B background
is the background to use, if specified.
.B Flags
is used by the parser; ignore this field.
.PP
The individual cells of the table are described as follows:
.PP
.EX
.ta 6n +\w'Background 'u
typedef struct Tablecell Tablecell;
struct Tablecell
{
	Tablecell*	next;
	Tablecell*	nextinrow;
	int	cellid;
	Item*	content;
	Lay*	lay;
	int	rowspan;
	int	colspan;
	Align	align;
	uchar	flags;
	Dimen	wspec;
	int	hspec;
	Background	background;
	int	minw;
	int	maxw;
	int	ascent;
	int	row;
	int	col;
	Point	pos;
};
.EE
.PP
.B Next
is used to link together the list of all cells within a table
.RB ( Table.cells ),
whereas
.B nextinrow
is used to link together all the cells within a single row
.RB ( Tablerow.cells ).
.B Cellid
provides a serial number for the cell within the table.
.B Content
is a linked list of the items to be laid out within the cell.
.B Lay
is reserved for the user to describe how these items have
been laid out.
.B Rowspan
and
.B colspan
are the number of rows and columns spanned by this cell,
respectively.
.B Align
is the alignment specification for the cell.
.B Flags
is some combination of
.BR TFparsing ,
.B TFnowrap
and
.B TFisth
or'd together.
Here
.B TFparsing
is used internally by the parser, and should be ignored.
.B TFnowrap
means that the contents of the cell should not be
wrapped if they don't fit the available width;
rather, the table should be expanded if need be
(this is set when the nowrap attribute is supplied).
.B TFisth
means that the cell was created by the
.B <th>
element (rather than the
.B <td>
element),
indicating that it is a header cell rather than a data cell.
.B Wspec
provides a suggested width as a dimension specification,
and
.B hspec
provides a suggested height in pixels.
.B Background
gives a background specification for the individual cell.
.BR Minw ,
.BR maxw ,
.B ascent
and
.B pos
are reserved for use by the caller during layout.
.B Row
and
.B col
give the indices of the row and column of the top left-hand
corner of the cell within the table grid.
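.PP
Because a spanning cell appears in
.B grid
more than once, a caller walking the grid can use
.B row
and
.B col
to act on each cell only at its top left-hand corner.
The sketch below shows one way to do this, using cut-down copies of
.B Table
and
.B Tablecell
with only the fields involved.
The helper
.B eachcell
is hypothetical, and the check for empty grid slots is a precaution rather
than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Tablecell Tablecell;
struct Tablecell
{
	int	rowspan;
	int	colspan;
	int	row;
	int	col;
};

typedef struct Table Table;
struct Table
{
	int	nrow;
	int	ncol;
	Tablecell***	grid;
};

/*
 * Call f once for every cell, in row-major order of its top
 * left-hand corner. Returns the number of cells visited.
 */
int
eachcell(Table *t, void (*f)(Tablecell*, void*), void *arg)
{
	int i, j, n;
	Tablecell *c;

	assert(t != NULL);
	n = 0;
	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++){
			c = t->grid[i][j];
			if(c == NULL || c->row != i || c->col != j)
				continue;	/* empty slot, or a spanned repeat */
			if(f != NULL)
				f(c, arg);
			n++;
		}
	return n;
}
```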
.SS Client-side Maps
.PP
The library builds a list of client-side maps, headed by
.BR Docinfo.maps ,
and having the following structure:
.PP
.EX
.ta 6n +\w'Rune* 'u
typedef struct Map Map;
struct Map
{
	Map*	next;
	Rune*	name;
	Area*	areas;
};
.EE
.PP
.B Next
points to the next element in the list,
.B name
is the name of the map (used to bind it to an image), and
.B areas
is a list of the areas within the image that comprise the map,
using the following structure:
.PP
.EX
.ta 6n +\w'Dimen* 'u
typedef struct Area Area;
struct Area
{
	Area*	next;
	int	shape;
	Rune*	href;
	int	target;
	Dimen*	coords;
	int	ncoords;
};
.EE
.PP
.B Next
points to the next element in the map's list of areas.
.B Shape
describes the shape of the area, and is one of
.BR SHrect ,
.B SHcircle
or
.BR SHpoly .
.B Href
is the URL associated with this area in its role as
a hypertext link, and
.B target
is the target frame it should be loaded in.
.B Coords
is an array of coordinates for the shape, and
.B ncoords
is the size of this array (number of elements).
.SS Frames
.PP
If the
.B Docinfo.kidinfo
field is set, the document is a frameset.
In this case, it is typical for
.I parsehtml
to return nil, as a document which is a frameset should have no actual
items that need to be laid out (such will appear only in subsidiary documents).
It is possible that items will be returned by a malformed document; the caller
should check for this and free any such items.
.PP
The
.B Kidinfo
structure itself reflects the fact that framesets can be nested within a document.
It is defined as follows:
.PP
.EX
.ta 6n +\w'Kidinfo* 'u
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo*	next;
	int	isframeset;

	// fields for "frame"
	Rune*	src;
	Rune*	name;
	int	marginw;
	int	marginh;
	int	framebd;
	int	flags;

	// fields for "frameset"
	Dimen*	rows;
	int	nrows;
	Dimen*	cols;
	int	ncols;
	Kidinfo*	kidinfos;
	Kidinfo*	nextframeset;
};
.EE
.PP
.B Next
is only used if this structure is part of a containing frameset; it points to the next
element in the list of children of that frameset.
.B Isframeset
is set when this structure represents a frameset; if clear, it is an individual frame.
.PP
Some fields are used only for framesets.
.B Rows
is an array of dimension specifications for rows in the frameset, and
.B nrows
is the length of this array.
.B Cols
is the corresponding array for columns, of length
.BR ncols .
.B Kidinfos
points to a list of components contained within this frameset, each
of which may be a frameset or a frame.
.B Nextframeset
is only used during parsing, and should be ignored.
.PP
The remaining fields are used if the structure describes a frame, not a frameset.
.B Src
provides the URL for the document that should be initially loaded into this frame.
Note that this may be a relative URL, in which case it should be interpreted
using the containing document's URL as the base.
.B Name
gives the name of the frame, typically supplied via a name attribute in the HTML.
If no name was given, the library allocates one.
.BR Marginw ,
.B marginh
and
.B framebd
are the values of the marginwidth, marginheight and frameborder attributes, respectively.
.B Flags
can contain some combination of the following:
.B FRnoresize
(the frame had the noresize attribute set, and the user should not be allowed to resize it),
.B FRnoscroll
(the frame should not have any scroll bars),
.B FRhscroll
(the frame should have a horizontal scroll bar),
.B FRvscroll
(the frame should have a vertical scroll bar),
.B FRhscrollauto
(the frame should be automatically given a horizontal scroll bar if its contents
would not otherwise fit), and
.B FRvscrollauto
(the frame gets a vertical scrollbar only if required).
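.PP
Since framesets nest, code that visits every frame is naturally recursive.
The sketch below counts the individual frames beneath a
.B Kidinfo ,
using a cut-down copy of the structure with only the three linking fields.
The helper
.B countframes
is hypothetical; a real caller would typically load the document named by
each frame's
.B src
at the point where this example merely counts.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo*	next;
	int	isframeset;
	Kidinfo*	kidinfos;
};

/* Count the individual frames at or below k, following sibling links. */
int
countframes(Kidinfo *k)
{
	int n;

	n = 0;
	for(; k != NULL; k = k->next){
		if(k->isframeset)
			n += countframes(k->kidinfos);	/* descend into nested frameset */
		else
			n++;	/* an individual frame */
	}
	return n;
}
```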
.SH SOURCE
.B /sys/src/libhtml
.SH SEE ALSO
.IR fmt (1)
.PP
W3C World Wide Web Consortium,
``HTML 4.01 Specification''.
.SH BUGS
The entire HTML document must be loaded into memory before
any of it can be parsed.