Subversion Repositories tendra.SVN

Rev

Rev 6 | Details | Compare with Previous | Last modification | View Log | RSS feed

Rev Author Line No. Line
6 7u83 1
<?xml version="1.0" standalone="no"?>
2
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
3
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd">
4
 
5
<!--
6
  $Id$
7
-->
8
 
9
<book>
10
  <bookinfo>
11
    <title>The sid users' guide</title>
12
 
13
    <corpauthor>The TenDRA Project</corpauthor>
14
 
15
    <author>
16
      <firstname>Jeroen</firstname>
17
 
18
      <surname>Ruigrok van der Werven</surname>
19
    </author>
20
 
21
    <authorinitials>JRvdW</authorinitials>
22
 
23
    <pubdate>2006</pubdate>
24
 
25
    <copyright>
26
      <year>2006</year>
27
 
28
      <holder>The TenDRA Project</holder>
29
    </copyright>
30
 
31
    <copyright>
32
      <year>1998</year>
33
 
34
      <holder>DERA</holder>
35
    </copyright>
36
  </bookinfo>
37
 
38
  <chapter id="introduction">
39
    <title>Introduction</title>
40
 
41
    <para>This document describes how to use the <code>sid</code> parser
42
      generator. It was written for <code>sid</code> version 1.9. The
43
      main features of each version of <code>sid</code> are listed
44
      below:</para>
45
 
46
    <itemizedlist>
47
      <listitem>
48
        <para><code>sid</code> 1.0 - this was the first version of
49
          <code>sid</code> to support attribute grammars. The output
50
          language was C.</para>
51
      </listitem>
52
 
53
      <listitem>
54
        <para><code>sid</code> 1.1 - this was a bug fix version of
55
          <code>sid</code> 1.0.</para>
56
      </listitem>
57
 
58
      <listitem>
59
        <para><code>sid</code> 1.2 - this version of <code>sid</code> added
60
          predicates, exception handling, improved inlining options, and a
61
          grammar test pseudo-language.</para>
62
      </listitem>
63
 
64
      <listitem>
65
        <para><code>sid</code> 1.3 - this version of <code>sid</code> added
66
          anonymous rules, and a better syntax for the C specific
67
          information.</para>
68
      </listitem>
69
 
70
      <listitem>
71
        <para><code>sid</code> 1.4 - this was a bug fix version of
72
          <code>sid</code> 1.3.</para>
73
      </listitem>
74
 
75
      <listitem>
76
        <para><code>sid</code> 1.5 - this was a bug fix version of
77
          <code>sid</code> 1.4. The command line option syntax changed in
78
          this release as well.</para>
79
      </listitem>
80
 
81
      <listitem>
82
        <para><code>sid</code> 1.6 - this version of <code>sid</code>
83
          changed the input syntax, added scoped rules, and removed basics
84
          (replacing them with terminal symbols). It also added a stricter
85
          ISO C language.</para>
86
      </listitem>
87
 
88
      <listitem>
89
        <para><code>sid</code> 1.7 - The syntax of the actions file changed
90
          slightly in this version. The error messages and recovery from
91
          syntax errors were also improved in this version.  This version
92
          also added explicit call by reference support (rather than the
93
          inconsistent function call semantics of earlier versions). The
94
          command line options changed in this version, to support language
95
          specific options, and the <code>strict-ansi-c</code> language was
96
          dropped. Non-local variables were also added.</para>
97
      </listitem>
98
 
99
      <listitem>
100
        <para><code>sid</code> 1.8 - Initialisers were added for non-local
101
          variables. Assignment was added.</para>
102
      </listitem>
103
 
104
      <listitem>
105
        <para><code>sid</code> 1.9 - this was a bug fix version of
106
          <code>sid</code> 1.8.</para>
107
      </listitem>
108
    </itemizedlist>
109
 
110
    <para><code>sid</code> turns specifications of languages into
111
      programs that recognise those languages. One of the aims of
112
      <code>sid</code> was to separate the specification of the
113
      language to be recognised from the language that the recogniser
114
      program is written in. For this reason, input to <code>sid</code>
115
      is split into two components: output language independent
116
      information, and output language dependent information.</para>
117
 
118
    <para>At present, <code>sid</code> will only output programs in C
119
      (either ISO or pre-ISO), but it is designed so that adding new
120
      output languages should be fairly simple. There is one other
121
      pseudo-language: the test language. This is used for testing
122
      grammars and the transforms, but will not output a parser.</para>
123
  </chapter>
124
 
125
  <chapter id="grammar">
126
    <title>Grammars</title>
127
 
128
    <sect1 id="parsing">
129
      <title>Parsing</title>
130
 
131
      <para>There are two phases in traditional language recognition that
132
        are relevant to <code>sid</code>: the first is lexical analysis
133
        (breaking the input up into terminal symbols); the second is
134
        syntax analysis or parsing (ensuring that the terminal symbols in
135
        the input are in the correct order).</para>
136
 
137
      <para><code>sid</code> currently does very little to help with
138
        lexical analysis. It is possible to use <code>sid</code> to
139
        produce the lexical analyser, but <code>sid</code> provides no
140
        real support for it at present. For now, the programmer is
141
        expected to write the lexical analyser, or use another tool to
142
        produce it.</para>
143
 
144
      <para>The lexical analyser should break the input up into a series
145
        of terminals. Each of these terminals is allocated a number. In
146
        <code>sid</code>, these numbers range from zero to the maximum
147
        terminal number (specified in the grammar description). The
148
        terminals may also have data associated with them (e.g. the value
149
        of a number), known as the attributes of the terminal, or the
150
        results of the terminal.</para>
151
 
152
      <para><code>sid</code> generates the parser. The parser is a program
153
        that reads the sequence of terminals from the lexical analyser,
154
        and ensures that they form a valid input in the language being
155
        recognised. If the input is not valid, then the parser will fail
156
        (<code>sid</code> provides mechanisms to allow the parser to
157
        recover from errors).</para>
158
    </sect1>
159
 
160
    <sect1 id="context-free">
161
      <title>Context free grammars</title>
162
 
163
      <para>This section provides a brief introduction to grammars. It is
164
        not intended to be an in-depth guide to grammars, more an
165
        introduction which defines some terminology.</para>
166
 
167
      <para><code>sid</code> works with a subset of what are known as
168
        context free grammars. A context free grammar consists of a set
169
        of input symbols (known as terminals), a set of rules
170
        (descriptions of what are legal forms in the language, also known
171
        as non-terminals), and an entry point (the rule from which the
172
        grammar starts).</para>
173
 
174
      <para>A rule is defined as a series of alternatives (throughout this
175
        document the definition of a rule is called a production - this
176
        may conflict with some other uses of the term). Each alternative
177
        consists of a sequence of items. An item can be a terminal or a
178
        rule. As an example (using the <code>sid</code> notation, which
179
        now looks unlike the conventional syntax for grammars), a comma
180
        separated list of integers could be specified as:
181
 
182
        <programlisting>
183
list-of-integers = {
184
integer ;
185
||
186
integer ;
187
comma ;
188
list-of-integers ;
189
} ;
190
        </programlisting></para>
191
 
192
      <para>This production contains two alternatives: the first matches the
193
        terminal <code>integer</code>; the second matches the sequence of
194
        terminals <code>integer</code> followed by <code>comma</code>,
195
        followed by another list of integers.</para>
196
 
197
 
198
      <para>There is much documentation available on context free grammars
199
        (and other types of formal grammar). The reader is advised to
200
        find an alternative source for more information.</para>
201
    </sect1>
202
 
203
    <sect1 id="sid-grammar">
204
      <title>sid grammars</title>
205
 
206
      <para><code>sid</code> grammars are based upon a subset of context
207
        free grammars, known as LL(1) grammars. The main property of such
208
        grammars is that the parser can always tell which alternative it
209
        is going to parse next by looking at the current terminal symbol.
210
        <code>sid</code> does a number of transforms to turn grammars
211
        that are not in this form into it (although it cannot turn all
212
        possible grammars into this form). It also provides facilities
213
        that allow the user to alter the control flow of the parser.</para>
214
 
215
      <para><code>sid</code> makes the following changes to the context
216
        free grammars described above:</para>
217
 
218
      <orderedlist>
219
        <listitem><para>There may be more than one entry point to the
220
          grammar.</para></listitem>
221
 
222
        <listitem><para>As well as being a rule or a terminal, an item may be
223
          an action, a predicate or an identity. An action is just a user
224
          supplied function. A predicate is a cross between a basic and an
225
          action (it is a user defined function but it may alter the control
226
          flow). An identity is like an assignment in a conventional
227
          programming language.</para></listitem>
228
 
229
        <listitem><para>Rules may take parameters and return results (as may
230
          actions; terminals may only return results). These may be
231
          passed between items using names.</para></listitem>
232
 
233
        <listitem><para>Each rule can have an exception handler associated
234
          with it.  Exception handlers are used for error recovery when the
235
          input being parsed does not match the grammar.</para></listitem>
236
 
237
        <listitem><para>Rules may be anonymous.</para></listitem>
238
 
239
        <listitem><para>Rules may have non-local names associated with them.
240
          These names are in scope for that rule and any rules that are
241
          defined inside it. The value of each non-local name is saved on
242
          entry to the rule, and restored when the rule is
243
          exited.</para></listitem>
244
      </orderedlist>
245
    </sect1>
246
  </chapter>
247
 
248
  <chapter id="overview">
249
    <title>Overview</title>
250
 
251
    <para><code>sid</code> takes the input grammar and applies several
252
      transformations to it, before it produces the parser. These
253
      transformations allow the language descriptions to be written in
254
      a slightly more natural form than would otherwise be
255
      necessary.</para>
256
 
257
    <sect1 id="left-recursion">
258
      <title>Left recursion elimination</title>
259
 
260
      <para>The first transformation is to eliminate any left recursive
261
        productions, replacing them with right recursive ones. This
262
        transforms a set of rules of the form:
263
 
264
        <programlisting>
265
Ai = Aj Bji, Ci
266
        </programlisting>
267
 
268
        into:
269
 
270
        <programlisting>
271
Ai = Cj Xji
272
Xji = Bjk Xki, Yji
273
        </programlisting>
274
 
275
        where <code>Yji</code> is the identity function if <code>i</code>
276
        equals <code>j</code>, and an error otherwise. In order for this
277
        transformation to work, the following conditions must hold:
278
 
279
        <orderedlist>
280
          <listitem>
281
            <para>The parameter type for all <code>Ai</code> must be the
282
              same.</para>
283
          </listitem>
284
 
285
          <listitem>
286
            <para>The result type for all <code>Ai</code> must be the
287
              same.</para>
288
          </listitem>
289
 
290
          <listitem>
291
            <para>The exception handlers for all <code>Ai</code> must be the
292
              same.</para>
293
          </listitem>
294
 
295
          <listitem>
296
            <para>The parameters to each left recursive call to some
297
              <code>Aj</code> must be exactly the formal parameters to the
298
              calling rule <code>Ai</code>.</para>
299
          </listitem>
300
 
301
          <listitem>
302
            <para>Any non-local name referenced by any rule must be in scope
303
              for all rules.</para>
304
          </listitem>
305
 
306
          <listitem>
307
            <para>A rule may not define non-local variables unless it is the
308
              only entry point into the cycle (i.e. there is only one named
309
              rule in the cycle).</para>
310
          </listitem>
311
        </orderedlist>
312
 
313
        <code>sid</code> will report an error if these conditions are not
314
        met.</para>
315
    </sect1>
316
 
317
    <sect1 id="factoring">
318
      <title>Factoring</title>
319
 
320
      <para>The second major transformation is to factor the grammar. It is
321
        possible that this phase could go on for ever, so there is a limit to
322
        the number of rules that can be generated. This limit may be changed
323
        (see the <link linkend="invoke">invocation section</link>). In
324
        practice it is rare for this to happen.</para>
325
 
326
      <para>The factoring phase tries to increase the number of language
327
        specifications that <code>sid</code> can produce a parser for, by
328
        factoring out common initial items in alternatives, e.g.:
329
 
330
        <programlisting>
331
X = {
332
a ; b ;
333
||  a ; c ;
334
} ;
335
        </programlisting>
336
 
337
        would be transformed into something like:
338
 
339
        <programlisting>
340
X = {
341
a ; X1 ;
342
} ;
343
 
344
X1 = {
345
b ;
346
||  c ;
347
} ;
348
        </programlisting></para>
349
 
350
      <para>It will also expand calls to rules at the beginning of
351
        alternatives if this is necessary to allow the parser to be produced
352
        (this is the phase that the limit is needed for). The expansions are
353
        done in the following cases:
354
 
355
        <orderedlist>
356
          <listitem>
357
            <para>If the rule is see-through (i.e. there is an expansion of
358
              the rule that does not contain any terminals or predicates) and
359
              its first set contains terminals in the first set of the rest of
360
              the alternative.</para>
361
          </listitem>
362
 
363
          <listitem>
364
            <para>If the rule has a predicate in its first set.</para>
365
          </listitem>
366
 
367
          <listitem>
368
            <para>If the rule has terminals in its first set that are also in
369
              the first set of another alternative that does not begin with
370
              the same rule.</para>
371
          </listitem>
372
        </orderedlist></para>
373
 
374
      <para>If the rule being expanded contains an exception handler, and it
375
        is not identical to the exception handler in the enclosing rule, then
376
        an error will occur. Similarly if the rule being expanded defines any
377
        non-local variables then an error will occur.</para>
378
    </sect1>
379
 
380
    <sect1 id="optimise">
381
      <title>Optimisations</title>
382
 
383
      <para>After these two transformations, <code>sid</code> performs some
384
        optimisations on the grammar, such as reducing multiple copies of
385
        identical rules, removing tail recursion, inlining all basic rules,
386
        inlining single alternative rules, inling rules used only once, and
387
        inlining everything that can be inlined. All of the inlining is
388
        optional.</para>
389
 
390
      <para>After these optimisations, <code>sid</code> checks the language
391
        description to ensure that it is possible to produce a parser for it,
392
        and if so it outputs the parser.</para>
393
    </sect1>
394
  </chapter>
395
 
396
  <chapter id="invoke">
397
    <title>Invocation</title>
398
 
399
    <para><code>sid</code> should be invoked in the following manner:
400
 
401
      <programlisting>
402
sid options input-files output-files
403
      </programlisting></para>
404
 
405
      <para>The options are described later. The input files should be a
406
        number of input file names dependent upon the output language. The
407
        first input file is the <code>sid</code> grammar file. In the case of
408
        either C dialects there should be one other input file that provides C
409
        specific information to <code>sid</code>. The number of output files
410
        is also language specific. At present, two output files should be
411
        specified for the C languages. The first should be a <code>.c</code>
412
        file into which the parser is written; the second should be a
413
        <code>.h</code> file into which the terminal definitions and external
414
        function declarations are written.</para>
415
 
416
    <para>The options list should consist of zero or more of the following
417
      options. There are short forms for each of these options as well; see
418
      the <code>sid</code> manual page for more information on
419
      invocation.</para>
420
 
421
    <itemizedlist>
422
      <listitem>
423
        <para>--dump-file <filename>FILE</filename></para>
424
 
425
        <para>This option causes intermediate dumps of the grammar to be
426
          written to the named file. The format of the dump files is similar
427
          to the format of the grammar specification, with the following
428
          exceptions:</para>
429
 
430
        <orderedlist>
431
          <listitem>
432
            <para>Predicates are written with the predicate result replaced
433
              by the predicate identifier (this will always be zero), and
434
              the result is followed by a <code>?</code> to indicate that it
435
              was a predicate. As an example, the predicate:
436
 
437
            <programlisting>
438
( b, ? ) = &lt;pred&gt; ( a )
439
            </programlisting>
440
 
441
            would be printed out as:
442
 
443
            <programlisting>
444
( b : Type1T, 0 : Type2T ) ? = &lt;pred&gt; ( a : Type3T )
445
            </programlisting></para>
446
          </listitem>
447
 
448
          <listitem>
449
            <para>Items that are considered to be inlinable are prefixed by
450
              a <code>+</code>. Items that are tail calls which will be
451
              eliminated are prefixed by a <code>*</code>.</para>
452
          </listitem>
453
 
454
          <listitem>
455
            <para>Nested rules are written at the outer level, with names of
456
              the form
457
              <code>outer-rule::....::inner-rule</code>.</para>
458
          </listitem>
459
 
460
          <listitem>
461
            <para>Types are provided on call parameter and result
462
              tuples.</para>
463
          </listitem>
464
 
465
          <listitem>
466
            <para>Inline rules are given a generated name, and are written
467
              out as a call to the generated rule (and a definition
468
              elsewhere).</para>
469
          </listitem>
470
        </orderedlist>
471
      </listitem>
472
 
473
      <listitem>
474
        <para>--factor-limit <token>LIMIT</token></para>
475
 
476
        <para>This option limits the number of rules that can be created
477
          during the factorisation process. It is probably best not to change
478
          this.</para>
479
      </listitem>
480
 
481
      <listitem>
482
        <para>--help</para>
483
 
484
        <para>This option writes a summary of the command line options to the
485
          standard error stream.</para>
486
      </listitem>
487
 
488
      <listitem>
489
        <para>--inline <token>INLINES</token></para>
490
 
491
        <para>This option controls what inlining will be done in the output
492
          parser. The inlines argument should be a comma separated list of the
493
          following words:
494
 
495
          <itemizedlist>
496
            <listitem>
497
              <para><code>SINGLES</code></para>
498
 
499
              <para>This causes single alternative rules to be inlined. This
500
                inlining is no longer performed as a modification to the
501
                grammar (it was in version 1.0).</para>
502
            </listitem>
503
 
504
            <listitem>
505
              <para><code>BASICS</code></para>
506
 
507
              <para>This causes rules that contain only basics (and no
508
                exception handlers or empty alternatives) to be inlined. The
509
                restriction on exception handlers and empty alternatives is
510
                rather arbitrary, and may be changed later.</para>
511
            </listitem>
512
 
513
            <listitem>
514
              <para><code>TAIL</code></para>
515
 
516
              <para>This causes tail recursive calls to be inlined. Without
517
                this, tail recursion elimination will not be
518
                performed.</para>
519
            </listitem>
520
 
521
            <listitem>
522
              <para><code>OTHER</code></para>
523
 
524
              <para>This causes other calls to be inlined wherever possible.
525
                Unless the <code>MULTI</code> inlining is also specified, this
526
                will be done only for productions that are called
527
                once.</para>
528
            </listitem>
529
 
530
            <listitem>
531
              <para><code>MULTI</code></para>
532
 
533
              <para>This causes calls to be inlined, even if the rule being
534
                called is called more than once. Turning this inlining on
535
                implies <code>OTHER</code>. Similarly turning off
536
                <code>OTHER</code> inlining will turn off <code>MULTI</code>
537
                inlining. For grammars of any size, this is probably best
538
                avoided; if used the generated parser may be huge (e.g. a C
539
                grammar has produced a file that was several hundred MB in
540
                size).</para>
541
            </listitem>
542
 
543
            <listitem>
544
              <para><code>ALL</code></para>
545
 
546
              <para>This turns on all inlining.</para>
547
            </listitem>
548
          </itemizedlist>
549
 
550
          In addition, prefixing a word with <code>NO</code> turns off that
551
          inlining phase. The words may be given in any case.  They are
552
          evaluated in the order given, so:
553
 
554
          <programlisting>
555
--inline noall,singles
556
          </programlisting>
557
 
558
          would turn on single alternative rule inlining only, whilst:
559
 
560
          <programlisting>
561
--inline singles,noall
562
          </programlisting>
563
 
564
          would turn off all inlining. The default is as if <code>sid</code>
565
          were invoked with the option:
566
 
567
          <programlisting>
568
--inline noall,basics,tail
569
          </programlisting></para>
570
      </listitem>
571
 
572
      <listitem>
573
        <para id="lang">--language <token>LANGUAGE</token></para>
574
 
575
        <para>This option specifies the output language. Currently this should
576
          be one of <code>ansi-c</code>, <code>pre-ansi-c</code> and
577
          <code>test</code>. The default is <code>ansi-c</code>.</para>
578
 
579
        <para>The <code>ansi-c</code> and <code>pre-ansi-c</code> languages
580
          are basically the same. The only difference is that
581
          <code>ansi-c</code> initially uses function prototypes, and
582
          <code>pre-ansi-c</code> doesn't. The C language specific options
583
          are:
584
 
585
          <itemizedlist>
586
            <listitem>
587
              <para><code>prototypes, proto, no-prototypes,
588
                no-proto</code></para>
589
 
590
              <para>These enable or disable the use of function prototypes.
591
                By default this is enabled for <code>ansi-c</code> and
592
                disabled for <code>pre-ansi-c</code>.</para>
593
            </listitem>
594
 
595
            <listitem>
596
              <para><code>numeric-ids, numeric, no-numeric-ids,
597
                no-numeric</code></para>
598
 
599
              <para>These enable or disable the use of numeric identifiers.
600
                Numeric identifiers replace the identifier name with a number,
601
                which is mainly of use in stopping identifier names getting
602
                too long. The disadvantage is that the code becomes less
603
                readable, and more difficult to debug. Numeric identifiers are
604
                not used by default.</para>
605
            </listitem>
606
 
607
            <listitem>
608
              <para id="casts"><code>casts, cast, no-casts,
609
                no-cast</code></para>
610
 
611
              <para>These enable or disable casting of action and assignment
612
                operator immutable parameters. If enabled, a parameter is cast
613
                to its own type when it is substituted into the action. This
614
                will cause some compilers to complain about attempts to modify
615
                the parameter (which can help pick out attempts at mutating
616
                parameters that should not be mutated). The disadvantage is
617
                that not all compilers will reject attempts at mutation, and
618
                that ISO C doesn't allow casting to structure and union types,
619
                which means that some code may be illegal. Parameter casting
620
                is disabled by default.</para>
621
            </listitem>
622
 
623
            <listitem>
624
              <para><code>unreachable-macros, unreachable-macro,
625
                unreachable-comments, unreachable-comment</code></para>
626
 
627
              <para>These choose whether unreachable code is marked by a macro
628
                or a comment. The default is to mark unreachable code with a
629
                comment <code>/*UNREACHED*/</code>, however a macro
630
                <code>UNREACHED ;</code> may be used instead, if
631
                desired.</para>
632
            </listitem>
633
          </itemizedlist>
634
 
635
          The <code>test</code> language only takes one input file, and
636
          produces no output file. It may be used to check that a grammar is
637
          valid. In conjunction with the dump file, it may be used to check
638
          the transformations that would be applied to the grammar. There are
639
          no language specific options for the <code>test</code>
640
          language.</para>
641
      </listitem>
642
 
643
      <listitem>
644
        <para>--show-errors</para>
645
 
646
        <para>This option writes a copy of the current error messages to the
647
          standard output. See the manual entry for more details about
648
          changing the error message content.</para>
649
      </listitem>
650
 
651
      <listitem>
652
        <para>--switch <token>OPTION</token></para>
653
 
654
        <para>This passes through <token>OPTION</token> as a language specific
655
          option. The valid options are described
656
          <link linkend="lang">above</link>.</para>
657
      </listitem>
658
 
659
      <listitem>
660
        <para>--tab-width <token>NUMBER</token></para>
661
 
662
        <para>This option specifies the number of spaces that a tab occupies.
663
          It defaults to 8. It is only used when indenting output.</para>
664
      </listitem>
665
 
666
      <listitem>
667
        <para>--version</para>
668
 
669
        <para>This option causes the version number and supported languages to
670
          be written to the standard error stream.</para>
671
      </listitem>
672
    </itemizedlist>
673
  </chapter>
674
 
675
  <chapter id="grammar-file">
676
    <title>The sid grammar file</title>
677
 
678
    <para>The <code>sid</code> grammar file should always be the first input
679
      file specified on the command line. It provides an output language
680
      independent description of the language to be recognised. The file is
681
      split up into a number of different sections, each of which begins with
682
      a section header. All sections must be included, although it is possible
683
      to leave most of them empty. The following sections exist at present:
684
      type declaration, terminal declaration, rule definition, and grammar
685
      entry points. They must appear in that order. The sections are detailed
686
      below, after the lexical conventions.</para>
687
 
688
    <sect1 id="grammar-lex">
689
      <title>Lexical conventions</title>
690
 
691
      <para>Almost all non-printable characters (but definitely space, tab,
692
        newline and form-feed) are considered to be white space, and are
693
        ignored other than to separate other tokens. In addition, two comment
694
        forms are recognised: the C comment syntax, where comments begin with
695
        <code>/*</code>, and end with <code>*/</code>, although unlike C these
696
        comments nest properly; and the C++ line comment syntax, where
697
        comments begin with <code>//</code>, and are terminated at the end of
698
        the line. Comments are treated in the same way as white space
699
        characters.</para>
700
 
701
      <para>Section headers are enclosed in percent characters, and are case
702
        insensitive. The following section headers are currently
703
        recognised:
704
 
705
        <programlisting>
706
%types%
707
%terminals%
708
%productions%
709
%entry%
710
        </programlisting></para>
711
 
712
      <para>Identifiers must begin with a letter, an underscore or a hyphen.
713
        This may be followed by zero or more letters, digits, underscores and
714
        hyphens. Identifiers are case sensitive. The following are all legal
715
        identifiers:
716
 
717
        <programlisting>
718
expression      mult-expr       plus_expr       expr-1
719
        </programlisting></para>
720
 
721
      <para>Identifiers are split into two namespaces: local names have one
722
        space; types, actions, rules, non-local names and terminals share the
723
        other namespace, so it is not possible to have an identifier that is a
724
        type as well as being a rule for example.</para>
725
 
726
      <para>The following symbols are also used:
727
 
728
        <programlisting>
729
&amp;   ;       =       [       :       ]       !       ,
730
||      $       ?       {       }       (       )       &lt;
731
&gt;    -&gt;   ::      ##
732
        </programlisting></para>
733
 
734
        <para>All other characters are illegal.</para>
735
    </sect1>
736
 
737
    <sect1 id="type-decl">
738
      <title>The type declaration section</title>
739
 
740
      <para>The first section is the type declaration section. This begins
741
        with the types section header, followed by the names of all of the
742
        types used by the grammar. Each type name should be terminated by a
743
        semicolon. An example of the type declaration section follows:
744
 
745
        <programlisting>
746
%types%
747
NumberT ;
748
StringT ;
749
        </programlisting>
750
 
751
        This declares two types: <code>NumberT</code> and
752
        <code>StringT</code>. There is no requirement for the type names to
753
        resemble names in the target language (in fact this should be avoided,
754
        as it is possible to use many different target languages).  All types
755
        used in the grammar must be declared here. Similarly, all types
756
        declared here must be used in the grammar.</para>
757
    </sect1>
758
 
759
    <sect1 id="term-decl">
760
      <title>The terminal declaration section</title>
761
 
762
      <para>After the type declaration section comes the terminal declaration
763
        section. This section declares the terminals that will be used by the
764
        grammar. A terminal is a recogniser for a symbol in the input alphabet
765
        of the language to be recognised. It is possible to declare terminals
766
        that are not used by the grammar.</para>
767
 
768
      <para>The section begins with its section header, followed by the
769
        declarations of the terminals. Each terminal declaration begins with
770
        the name of the terminal being defined, followed by its type, and
771
        terminated by a semicolon. If the terminal is not used in the grammar,
772
        the declaration should be preceded by a <code>!</code> symbol.</para>
773
 
774
      <para>A type (for terminals, rules and actions) is written as a colon,
775
        followed by the parameter tuple, followed by the <code>-&gt;</code>
776
        symbol, followed by the result tuple. If the type is zero-tuple to
777
        zero-tuple, then the type may be omitted. A tuple consists of a comma
778
        separated sequence of name-type pairs (with the name and type being
779
        separated by a colon), surrounded by parentheses. For parameter
780
        tuples, the type may be suffixed by a <code>&amp;</code> symbol,
781
        indicating that call by reference should be used (the default is call
782
        by copy). For declarations, the names should be omitted. For
783
        terminals, the parameter type must be the zero-tuple.</para>
784
 
785
      <para>The simplest type of terminal declaration is as follows:
786
 
787
        <programlisting>
788
terminal1 ;
789
        </programlisting></para>
790
 
791
      <para>This means the same as:
792
 
793
        <programlisting>
794
terminal1 : () -&gt; () ;
795
        </programlisting></para>
796
 
797
      <para>An example of a more complex terminal declaration is:
798
 
799
        <programlisting>
800
terminal2 : () -&gt; ( :StringT ) ;
801
        </programlisting></para>
802
 
803
      <para>If these terminals were not to be used in the grammar, they should
804
        be declared as:
805
 
806
      <programlisting>
807
!terminal1 ;
808
!terminal2 : () -&gt; ( :StringT ) ;
809
      </programlisting></para>
810
    </sect1>
811
 
812
    <sect1 id="rule-defn">
813
      <title>The rule definition section</title>
814
 
815
      <para>The rule definition section follows the terminal declaration
816
        section. It begins with the section header, followed by the
817
        definitions and declarations of all of the rules used in the grammar,
818
        and the declarations of all of the actions used in the grammar.</para>
819
 
820
      <para>Rule declarations look identical to terminal declarations, e.g.:
821
 
822
        <programlisting>
823
rule1 : ( :NumberT ) -&gt; ( :NumberListT ) ;
824
rule2 ;
825
        </programlisting></para>
826
 
827
      <para>Action declarations are similar, although the names are surrounded
828
        by angle brackets, e.g.:
829
 
830
        <programlisting>
831
&lt;action1&gt; : ( :StringT &amp; ) -&gt; () ;
832
&lt;action2&gt; ;
833
        </programlisting></para>
834
 
835
      <para>A declaration (or a definition) may be prefixed with the
836
        <code>::</code> symbol. This symbol forces the definition into the
837
        outermost scope. Scopes are described later on.</para>
838
 
839
      <para>A rule definition (called a production) looks something like the
840
        following:
841
 
842
        <programlisting>
843
add-expr : () -&gt; ( v : NumberT ) = {
844
v1 = mul-expr ;
845
plus ;
846
v2 = expr ;
847
v  = &lt;add&gt; ( v1, v2 ) ;
848
||
849
v1 = mul-expr ;
850
minus ;
851
v2 = expr ;
852
v  = &lt;subtract&gt; ( v1, v2 ) ;
853
##
854
v = &lt;ex&gt; ;
855
} ;
856
        </programlisting></para>
857
 
858
      <para>The production begins with the rule name, followed by the
859
        parameter and result names and types (in this case, the rule name is
860
        <code>add-expr</code>, there are no parameters, and there is one
861
        result name <code>v</code> of type <code>NumberT</code>). This may
862
        optionally be followed by local declarations (there are none here -
863
        they are described later).</para>
864
 
865
      <para>The left hand side of the rule is followed by the <code>=</code>
866
        symbol, a list of alternatives surrounded by curly braces, and is
867
        terminated by a semicolon. The alternatives are separated by the
868
        <code>||</code> symbol, and the last alternative may be separated from
869
        its predecessor (there must be one) using the <code>##</code> symbol;
870
        if this is the case, then this alternative is the exception handler
871
        for the production (otherwise it has no exception handler).</para>
872
 
873
      <para>An alternative may match the empty string, using the symbol
874
        <code>$</code> and the terminator symbol <code>;</code>, i.e.:
875
 
876
        <programlisting>
877
$ ;
878
        </programlisting>
879
 
880
        unless it is an exception handler alternative (in which case it must
881
        do something), or a sequence of items. The empty string is only valid
882
        if the production has no results. If you want to match an empty string
883
        in a production that has a result, it is necessary to use an action
884
        (or identity) to provide a result.</para>
885
 
886
      <para>An item is an identity, or a call to a (possibly anonymous) rule,
887
        a terminal, an action or a predicate. An identity looks like an
888
        assignment in conventional programming languages:
889
 
890
        <programlisting>
891
( a, b, c ) = ( d, e, f ) ;
892
        </programlisting></para>
893
 
894
      <para>Each tuple must contain the same number of names. In the case of a
895
        one-tuple, the parentheses may be dropped, e.g.:
896
 
897
        <programlisting>
898
a = d ;
899
        </programlisting></para>
900
 
901
      <para>Note that this is a binding operation, not an assignment.  Each
902
        name on the left hand side must be a new name. It is illegal to
903
        redeclare a name that is already in scope. It is possible to assign to
904
        a name which is already in scope by prefixing that name with the
905
        <code>&amp;</code> symbol, e.g.:
906
 
907
        <programlisting>
908
( a, &amp;b, c ) = ( d, e, f ) ;
909
        </programlisting>
910
 
911
        would assign to the name <code>b</code>, which must have been
912
        previously defined (it may be a parameter; if it is a call by
913
        reference parameter, then the change will propagate outside to the
914
        calling rule).</para>
915
 
916
      <para>It is also possible to use the <code>!</code> symbol in the result
917
        tuple to ignore results, e.g.:
918
 
919
        <programlisting>
920
( a, !, b, ! ) = ( c, d, e, f ) ;
921
        </programlisting></para>
922
 
923
      <para>This is not particularly useful in an identity, but may be more
924
        useful in a call to a rule, terminal or action. A call to a terminal
925
        or rule looks like a call to a function in a conventional programming
926
        language, e.g.:
927
 
928
        <programlisting>
929
( a, b ) = rule1 ( c, d ) ;
930
( e, f ) = terminal1 () ;
931
        </programlisting></para>
932
 
933
      <para>Calls to actions have the same form, except that action names are
934
        surrounded by angle brackets, e.g.:
935
 
936
        <programlisting>
937
( g, h, i ) = &lt;action1&gt; ( a, e ) ;
938
        </programlisting></para>
939
 
940
      <para>In addition, one of the names in the result tuple of the call to
941
        the action may be the predicate result symbol <code>?</code>, in which
942
        case the action is used as a predicate (more details on predicates are
943
        given later).</para>
944
 
945
      <para>When calling a rule, terminal or action, it is necessary to have
946
        declared it (or in the case of a rule declared or defined it) before
947
        the call.</para>
948
 
949
      <para>If a rule or action is being invoked, and it takes one or more
950
        call by reference parameters, then the corresponding arguments should
951
        be prefixed by the <code>&amp;</code> symbol, e.g.:
952
 
953
        <programlisting>
954
length = &lt;string-length&gt; ( &amp;string ) ;
955
        </programlisting></para>
956
 
957
      <para>If the rule, terminal or action has the zero-tuple as a result,
958
        then only the right hand side of the definition is required, e.g.:
959
 
960
        <programlisting>
961
rule2 ( a, b ) ;
962
terminal2 () ;
963
&lt;action2&gt; ( c, d ) ;
964
        </programlisting></para>
965
 
966
      <para>If the rule, terminal or action has the zero-tuple as a parameter,
967
        then the parameter tuple may be omitted, e.g.:
968
 
969
        <programlisting>
970
( a, b ) = rule3 ;
971
terminal3 ;
972
c = &lt;action3&gt; ;
973
        </programlisting></para>
974
 
975
      <para>In older versions of <code>sid</code>, it used to be possible to
976
        have ambiguous items, e.g.:
977
 
978
        <programlisting>
979
a = b ;
980
        </programlisting>
981
 
982
        where <code>b</code> was both a rule and a name. As local names may
983
        not shadow non-local and global names, then this is no longer a
984
        problem.</para>
985
 
986
      <para>In each case, the data flow through the rule is indicated using
987
        names. In the previous example of a production, both alternatives have
988
        the same data flow: the call to <code>mul-expr</code> returns one
989
        value, which is stored in the name <code>v1</code>, and the call to
990
        <code>expr</code> returns one value, which is stored in the name
991
        <code>v2</code>. Both of these names are passed to the action
992
        (<code>add</code> in the first alternative, <code>subtract</code> in
993
        the second), which returns a value into the name <code>v</code>
994
        (presumably the sum or difference of the previous two values). The
995
        name <code>v</code> is the name of the result, so this will be
996
        returned as the result of the rule. The exception handler (which is
997
        invoked if something fails) calls the action <code>ex</code> to
998
        produce the result <code>v</code>.</para>
999
 
1000
      <para>It is necessary that the types of the data flow through the
1001
        production are correct. It is also necessary to define all of the
1002
        result names for the production in each of the alternatives in that
1003
        production.</para>
1004
 
1005
      <para>An anonymous rule is written in the same way as the body of a
1006
        normal rule, e.g.:
1007
 
1008
        <programlisting>
1009
list : () -&gt; ( l : ListT ) = {
1010
n = number ;
1011
/* START ANONYMOUS RULE */ {
1012
? = &lt;at-eof&gt; ;
1013
l = &lt;make-list&gt; ( n ) ;
1014
||
1015
comma ;
1016
l1 = list ;
1017
l  = &lt;cons&gt; ( n, l1 ) ;
1018
##
1019
l = &lt;com-or-eof&gt; ( n ) ;
1020
} ; /* END ANONYMOUS RULE */
1021
} ;
1022
        </programlisting></para>
1023
 
1024
      <para>An anonymous rule is always inlined.</para>
1025
 
1026
      <para>The rule name may be followed by a sequence of definitions and
1027
        declarations surrounded by the <code>[</code> and <code>]</code>
1028
        symbols (which are followed by the rest of the rule). In this case,
1029
        the definitions are local to the rule, e.g.:
1030
 
1031
        <programlisting>
1032
x-list [
1033
x = {
1034
terminal1 ;
1035
terminal2 ;
1036
||
1037
terminal3 ;
1038
} ;
1039
] = {
1040
x ;
1041
||
1042
x ;
1043
x-list ;
1044
} ;
1045
        </programlisting></para>
1046
 
1047
      <para>In this case, the rule <code>x</code> would be local to the rule
1048
        <code>x-list</code> and no other rule would be able to use it.  In
1049
        error messages, the name of the rule <code>x</code> would be written
1050
        as <code>x-list::x</code>. All declarations and definitions that occur
1051
        inside of the <code>[</code> and <code>]</code> symbols have the scope
1052
        of the enclosing rule, unless they are preceded by the <code>::</code>
1053
        symbol, in which case they have global scope.  This is particularly
1054
        necessary for actions, as actions can only be defined with global
1055
        scope.</para>
1056
 
1057
      <para>It is also possible to define non-local names. These are declared
1058
        as an identifier (the name), followed by the <code>:</code> symbol,
1059
        followed by another identifier (its type), in a similar manner to an
1060
        entry in a type tuple. Non-local names are not allowed at the
1061
        outermost level (so they may not be prefixed with the <code>::</code>
1062
        symbol either). When a non-local name is defined, it is in scope for
1063
        all of the rules in its scope that are defined after it is, plus its
1064
        defining rule.</para>
1065
 
1066
      <para>Non-local names have their values saved on entry to their defining
1067
        rule, and the value will be restored when the rule is exited. This
1068
        includes exiting the rule tail recursively or because of an exception
1069
        (if the rule has an exception handler, the non-local name will not be
1070
        restored until the exception handler has exited). In almost all other
1071
        respects non-local names are the same as local names. An example
1072
        follows:
1073
 
1074
        <programlisting>
1075
rule1 [
1076
name1 : Type1T ;
1077
rule1_1 = {
1078
&lt;action1&gt; ( name1 ) ;
1079
rule2 ( name1 ) ;
1080
} ;
1081
] = {
1082
&lt;action2&gt; ( &amp;name1 ) ;
1083
rule1_1 ;
1084
} ;
1085
        </programlisting></para>
1086
 
1087
      <para>It is possible to associate an initialiser with a non-local name,
1088
        by following the type name with a <code>=</code> symbol and the action
1089
        name in angle brackets, e.g.:
1090
 
1091
        <programlisting>
1092
rule1 [
1093
name1 : Type1T = &lt;action1&gt; ;
1094
] = {
1095
// ....
1096
} ;
1097
        </programlisting></para>
1098
 
1099
      <para>In this case the action is called at the start of the rule to
1100
        initialise the non-local name. The action should return an object of
1101
        the same type as the non-local name. Normally, the action takes no
1102
        parameters, however it may take one parameter of the same type as the
1103
        non-local name (or a reference to that type), in which case it will be
1104
        given the saved value of the non-local name as an argument (this may
1105
        be used to build a stack automatically for example).</para>
1106
    </sect1>
1107
 
1108
    <sect1 id="entry-point">
1109
      <title>The grammar entry points section</title>
1110
 
1111
      <para>The final section lists the entry points to the grammar. It begins
1112
        with the section header, followed by a comma separated list of rule
1113
        names, terminated by a semicolon, e.g.:
1114
 
1115
        <programlisting>
1116
%entry% expr ;
1117
        </programlisting></para>
1118
 
1119
      <para>If you are going to use a rule as an entry point into the grammar
1120
        (i.e. you wish to call the associated function), you must list it in
1121
        the entry points list. If not, the function may not exist.</para>
1122
    </sect1>
1123
  </chapter>
1124
 
1125
  <chapter id="info-file">
1126
    <title>The C information file</title>
1127
 
1128
    <para>The grammar specification itself is not sufficient to produce a
1129
      parser. There also needs to be output language specific information to
1130
      allow the parser to interface with the program it is to be part of. In
1131
      the case of the C output routines, <code>sid</code> needs to know the
1132
      following information:
1133
 
1134
      <orderedlist>
1135
        <listitem>
1136
          <para>What code should precede and succeed the automatically
1137
            generated code.</para>
1138
        </listitem>
1139
 
1140
        <listitem>
1141
          <para>How to map the <code>sid</code> identifiers into C
1142
            identifiers.</para>
1143
        </listitem>
1144
 
1145
        <listitem>
1146
          <para>How to do assignments for each type.</para>
1147
        </listitem>
1148
 
1149
        <listitem>
1150
          <para>How to get the current terminal number.</para>
1151
        </listitem>
1152
 
1153
        <listitem>
1154
          <para>How to get the result of the current terminal.</para>
1155
        </listitem>
1156
 
1157
        <listitem>
1158
          <para>How to advance the lexical analyser, to get the next
1159
            terminal.</para>
1160
        </listitem>
1161
 
1162
        <listitem>
1163
          <para>What the actions are defined as, and how to pass parameters to
1164
            them.</para>
1165
        </listitem>
1166
 
1167
        <listitem>
1168
          <para>How to save and restore the current terminal when an error
1169
            occurs.</para>
1170
        </listitem>
1171
      </orderedlist></para>
1172
 
1173
    <para>Eventually almost all of this should be user suppliable. At the
1174
      moment, some of the information is supplied by the user in the C
1175
      information file, some through macros, and some is built in.
1176
      <code>sid</code> currently gets the information as follows:
1177
 
1178
      <orderedlist>
1179
        <listitem>
1180
          <para>The C information file has a header and a trailer section,
1181
            which define code that precedes and succeeds the code that
1182
            <code>sid</code> generates.</para>
1183
        </listitem>
1184
 
1185
        <listitem>
1186
          <para>The C information file has a section that allows the user to
1187
            specify mappings from <code>sid</code> identifiers into C
1188
            identifiers. These are only valid for the following types of
1189
            identifiers: types, functions (implementations of rules) and
1190
            terminals. For other identifier types (or when no mapping is
1191
            supplied), <code>sid</code> uses some default rules:</para>
1192
 
1193
          <para>Firstly, <code>sid</code> applies a transform to the
1194
            <code>sid</code> identifier name, to make it a legal C identifier.
1195
            At present this maps <code>_</code> to <code>__</code>,
1196
            <code>-</code> to <code>_H</code> and <code>:</code> (this occurs
1197
            in scoped names) to <code>_C</code>. All other characters are
1198
            left unmodified. This transform cannot be changed.</para>
1199
 
1200
          <para><code>sid</code> also puts a prefix before all identifiers, to
1201
            try to prevent clashes (and also to make automatically generated
1202
            - i.e. numeric - identifiers legal). These prefixes can be
1203
            redefined for each class of identifier, in the C information file.
1204
            They should be chosen so as not to clash with any other
1205
            identifiers (i.e. no other identifiers should begin with that
1206
            prefix).</para>
1207
 
1208
          <para>By default, the following prefixes are used:
1209
 
1210
            <table id="identifier-prefixes">
1211
              <title>Identifier prefixes</title>
1212
 
1213
              <tgroup cols="2">
1214
                <thead>
1215
                  <row>
1216
                    <entry>Prefix</entry>
1217
 
1218
                    <entry>Meaning</entry>
1219
                  </row>
1220
                </thead>
1221
 
1222
                <tbody>
1223
                  <row>
1224
                    <entry><code>ZT</code></entry>
1225
 
1226
                    <entry>This prefix is used before type identifiers, for
1227
                      the type name itself.</entry>
1228
                  </row>
1229
 
1230
                  <row>
1231
                    <entry><code>ZR</code></entry>
1232
 
1233
                    <entry>This prefix is used before rule identifiers, for
1234
                      the rule's implementation function.</entry>
1235
                  </row>
1236
 
1237
                  <row>
1238
                    <entry><code>ZL</code></entry>
1239
 
1240
                    <entry>This prefix is used before rule identifiers, for
1241
                      the rule's label when tail recursion is being
1242
                      eliminated. In this case, a number is added to the
1243
                      suffix before the identifier name, to prevent clashes
1244
                      when a rule is inlined twice in the same function.  It
1245
                      is also used before other labels that are automatically
1246
                      generated and are just numbered.</entry>
1247
                  </row>
1248
 
1249
                  <row>
1250
                    <entry><code>ZI</code></entry>
1251
 
1252
                    <entry>This prefix is used before name identifiers used as
1253
                      parameters to functions, or in normal usage. It is also
1254
                      used by non-local names (which doesn't cause a problem
1255
                      as they always occur scoped, and local names never
1256
                      do).</entry>
1257
                  </row>
1258
 
1259
                  <row>
1260
                    <entry><code>ZO</code></entry>
1261
 
1262
                    <entry>This prefix is used before name identifiers used as
1263
                      results of functions. Results are passed as reference
1264
                      parameters, and this suffix is used then. Another
1265
                      identifier with the <code>ZI</code> prefix is also used
1266
                      within the function, and the type reference assignment
1267
                      operator is used at the end of the function to assign
1268
                      the results to the reference parameters.</entry>
1269
                  </row>
1270
 
1271
                  <row>
1272
                    <entry><code>ZB</code></entry>
1273
 
1274
                    <entry>This prefix is used before the terminal symbol
1275
                      names in the generated header file.</entry>
1276
                  </row>
1277
                </tbody>
1278
              </tgroup>
1279
            </table>
1280
          </para>
1281
        </listitem>
1282
 
1283
        <listitem>
1284
          <para>Normally, <code>sid</code> will do assignments using the C
1285
            assignment operator. Sometimes, this will not do the right thing,
1286
            so the user can define a set of assignment operations for any type
1287
            in the C information file.</para>
1288
        </listitem>
1289
 
1290
        <listitem>
1291
          <para><code>sid</code> expects the <code>CURRENT_TERMINAL</code>
1292
            macro to be defined, and its definition should return an integer
1293
            that is the current terminal. The macro should be an expression,
1294
            not a statement.</para>
1295
        </listitem>
1296
 
1297
        <listitem>
1298
          <para>It is necessary to define how to extract the results of all
1299
            terminals in the C information file (if a terminal doesn't return
1300
            anything, then it is not necessary to define how to get the
1301
            result).</para>
1302
        </listitem>
1303
 
1304
        <listitem>
1305
          <para><code>sid</code> expects the <code>ADVANCE_LEXER</code> macro
1306
            to be defined, and its definition should cause the lexical
1307
            analyser to read a new token. The new terminal number should be
1308
            accessible through the <code>CURRENT_TERMINAL</code> macro. On
1309
            entry into the parser <code>CURRENT_TERMINAL</code> should give
1310
            the first terminal number.</para>
1311
        </listitem>
1312
 
1313
        <listitem>
1314
          <para>All actions, and their parameter and result names are defined
1315
            in the C information file.</para>
1316
        </listitem>
1317
 
1318
        <listitem>
1319
          <para><code>sid</code> expects the <code>SAVE_LEXER</code> and
1320
            <code>RESTORE_LEXER</code> macros to be defined. The first is
1321
            called with an argument which is the error terminal value. The
1322
            macro should save the current terminal's value, and set the
1323
            current terminal to be the error terminal value. The second macro
1324
            is called without arguments, and should restore the saved value of
1325
            the current terminal.  <code>SAVE_LEXER</code> will never be
1326
            called more than once without a call to
1327
            <code>RESTORE_LEXER</code>, so the save stack only needs one
1328
            element.</para>
1329
        </listitem>
1330
      </orderedlist></para>
1331
 
1332
    <para>The remainder of this section describes the layout of the C
1333
      information file. The lexical conventions are described first, followed
1334
      by a description of the sections in the order in which they should
1335
      appear. Unlike the <code>sid</code> grammar file, not all sections are
1336
      mandatory.</para>
1337
 
1338
    <sect1 id="info-lex">
1339
      <title>Lexical conventions</title>
1340
 
1341
      <para>The lexical conventions of the C information file are very similar
1342
        to those of the <code>sid</code> grammar file. There is a second class
1343
        of identifier: the C identifier, which is a subset of the valid
1344
        <code>sid</code> identifiers; there is also the C code block.</para>
1345
 
1346
      <para>A C code block begins with <code>@{</code> and is terminated by
1347
        <code>@}</code>. The code block consists of all of the characters
1348
        between the start and end of the code block, subject to substitutions.
1349
        All substitutions begin with the <code>@</code> character. The
1350
        following substitutions are recognised:
1351
 
1352
        <itemizedlist>
1353
          <listitem>
1354
            <para><code>@@</code></para>
1355
 
1356
            <para>This substitutes the <code>@</code> character itself.</para>
1357
          </listitem>
1358
 
1359
          <listitem><para><code>@:</code><token>label</token></para>
1360
 
1361
            <para>This form marks a label, which will be substituted for in
1362
              the output code. This is necessary, because an action may be
1363
              inlined into the same function more than once. If this happens,
1364
              then without doing label substitution there would be two
1365
              identical labels in the same scope. With label substitution,
1366
              this problem is avoided. In general, all references to a label
1367
              within an action should be prefixed with <code>@:</code>. This
1368
              substitution may not be used in header and trailer code
1369
              blocks.</para>
1370
          </listitem>
1371
 
1372
          <listitem>
1373
            <para><code>@</code><token>identifier</token></para>
1374
 
1375
            <para>This form marks a parameter or result identifier
1376
              substitution. If parameter and result identifiers are not
1377
              prefixed with an <code>@</code> character, then they will not be
1378
              substituted. It is an error if the identifier is not a parameter
1379
              or a result. Header and trailer code blocks have no parameters
1380
              or results, so it is always an error to use identifier
1381
              substitution in them. It is an error if any of the result
1382
              identifiers are not substituted at least once.</para>
1383
 
1384
              <para>Result identifiers may be assigned to using this form of
1385
                identifier substitution, but parameter identifiers may not be
1386
                (nor may there address be taken - they are immutable). To try
1387
                to prevent this, parameters that are substituted may be cast
1388
                to their own type, which makes them unmodifiable in ISO C (see
1389
                the notes on the <link linkend="casts"><code>casts</code>
1390
                language specific option</link>).</para>
1391
          </listitem>
1392
 
1393
          <listitem>
1394
            <para><code>@&amp;</code><token>identifier</token></para>
1395
 
1396
            <para>This form marks a parameter identifier whose address is to
1397
              be substituted, but whose contents will not be modified. The
1398
              effects of modifying the identifier are undefined. It is an
1399
              error to use this in parameter assignment operator
1400
              definitions.</para>
1401
          </listitem>
1402
 
1403
          <listitem>
1404
            <para><code>@=</code><token>identifier</token></para>
1405
 
1406
            <para>This form marks a parameter identifier that will be
1407
              modified. For this to be useful, the parameter should be a call
1408
              by reference parameter, so that the effect of the modification
1409
              will be propagated. This substitution is only valid in actions
1410
              (assignment operators are not allowed to modify their
1411
              parameters).</para>
1412
          </listitem>
1413
 
1414
          <listitem>
1415
            <para><code>@!</code></para>
1416
 
1417
            <para>This form marks an exception raise. In the generated code, a
1418
              jump to the current exception handler will be substituted. This
1419
              substitution is only valid in actions.</para>
1420
          </listitem>
1421
 
1422
          <listitem>
1423
            <para><code>@.</code></para>
1424
 
1425
            <para>This form marks an attempt to access the current terminal.
1426
              This substitution is only valid in actions.</para>
1427
          </listitem>
1428
 
1429
          <listitem>
1430
            <para><code>@&gt;</code></para>
1431
 
1432
            <para>This form marks an attempt to advance the lexical analyser.
1433
              This substitution is only valid in actions.</para>
1434
          </listitem>
1435
        </itemizedlist></para>
1436
 
1437
      <para>All other forms are illegal. Note that in the case of labels and
1438
        identifiers, no white space is allowed between the <code>@:</code>,
1439
        <code>@</code>, <code>@&amp;</code> or <code>@=</code> and the
1440
        identifier name. An example of a code block is:
1441
 
1442
        <programlisting>
1443
@{
1444
/* A code block */
1445
{
1446
int i ;
1447
if ( @param ) {
1448
@! ;
1449
}
1450
@result = 0 ;
1451
for ( i = 0 ; i &lt; 100 ; i++ ) {
1452
printf ( "{%d}\n", i ) ;
1453
@result += i ;
1454
}
1455
@=param += @result ;
1456
if ( @. == TOKEN_SEMI ) {
1457
@&gt; ;
1458
}
1459
}
1460
@}
1461
        </programlisting></para>
1462
    </sect1>
1463
 
1464
    <sect1 id="prefixes">
1465
      <title>The prefixes section</title>
1466
 
1467
      <para>The first section in the C information file is the prefix
1468
        definition section. This section is optional. It begins with the
1469
        section header, followed by a list of prefix definitions. A prefix
1470
        definition begins with the prefix name, followed by a <code>=</code>
1471
        symbol, followed by a C identifier that is the new prefix, and
1472
        terminated by a semicolon. The following example shows all of the
1473
        prefix names, and their default values:
1474
 
1475
        <programlisting>
1476
%prefixes%
1477
type            = ZT ;
1478
function        = ZR ;
1479
label           = ZL ;
1480
input           = ZI ;
1481
output          = ZO ;
1482
terminal        = ZB ;
1483
        </programlisting></para>
1484
    </sect1>
1485
 
1486
    <sect1 id="maps">
1487
      <title>The maps section</title>
1488
 
1489
      <para>The section that follows the prefixes section is the maps section.
1490
        This section is also optional. It begins with its section header,
1491
        followed by a list of identifier mappings. An identifier mapping
1492
        begins with a <code>sid</code> identifier (either a type, a rule or a
1493
        terminal), followed by the <code>-&gt;</code> symbol, followed by the
1494
        C identifier it is to be mapped to, and terminated by a semicolon. An
1495
        example follows:
1496
 
1497
        <programlisting>
1498
%maps%
1499
NumberT         -&gt; unsigned ;
1500
calculator      -&gt; calculator ;
1501
        </programlisting></para>
1502
 
1503
      <para>Note that it is not possible to map type identifiers to be
1504
        arbitrary C types. It will be necessary to <code>typedef</code> or
1505
        macro define the type name in the C file.</para>
1506
 
1507
      <para>It is recommended that all types, terminals and entry point rules
1508
        have their names mapped in this section, although this is not
1509
        necessary. If the names are not mapped, they will have funny names in
1510
        the rest of the program.</para>
1511
    </sect1>
1512
 
1513
    <sect1 id="header">
1514
      <title>The header section</title>
1515
 
1516
      <para>After the maps section comes the header section. This begins with
1517
        the section header, followed by a code block, followed by a comma,
1518
        followed by a second code block, and terminated with a semicolon. The
1519
        first code block will be inserted at the beginning of the generated
1520
        parser file; the second code block will be inserted at the start of
1521
        the generated header file. An example is:
1522
 
1523
        <programlisting>
1524
%header% @{
1525
#include "lexer.h"
1526
 
1527
LexerT token ;
1528
 
1529
#define CURRENT_TERMINAL        token.t
1530
#define ADVANCE_LEXER           next_token ()
1531
 
1532
extern void terminal_error () ;
1533
extern void syntax_error () ;
1534
@}, @{
1535
@} ;
1536
        </programlisting></para>
1537
    </sect1>
1538
 
1539
    <sect1 id="assign">
1540
      <title>The assignments section</title>
1541
 
1542
      <para>The assignments section follows the header section. This section
1543
        is optional. Normally, assignment between two identifiers will be done
1544
        using the C assignment operator. In some cases this will not do the
1545
        correct thing, and it is necessary to do the assignment differently.
1546
        All types for which this applies should have an entry in the
1547
        assignments section. The section begins with its header, followed by
1548
        definitions for each type that needs its own assignment operator. Each
1549
        definition should have one parameter, and one result. The action's
1550
        name should be the name of the type. An example follows:
1551
 
1552
        <programlisting>
1553
%assignments%
1554
 
1555
ListT : ( l1 ) -&gt; ( l2 ) = @{
1556
if ( @l2.head = @l1.head ) {
1557
@l2.tail = @l1.tail ;
1558
} else {
1559
@l2.tail = &amp;( @l2.head ) ;
1560
}
1561
@} ;
1562
        </programlisting></para>
1563
 
1564
      <para>If a type has an assignment operator defined, it must also have a
1565
        parameter assignment operator type defined and a result assignment
1566
        operator defined (more precisely it must have either no assignment
1567
        operations defined, or all three assignment operations
1568
        defined).</para>
1569
    </sect1>
1570
 
1571
    <sect1 id="param-assign">
1572
      <title>The parameter assignments section</title>
1573
 
1574
      <para>The parameter assignments section is very similar to the
1575
        assignments section (which it follows), and is also optional. If a
1576
        type has an assignment section entry, it must have a parameter
1577
        assignment entry as well.</para>
1578
 
1579
      <para>The parameter assignment operator is used in function calls to
1580
        ensure that the object is copied correctly: if no parameter assignment
1581
        operator is provided for a type, the standard C call by copy mechanism
1582
        is used; if a parameter assignment operator is provided for a type,
1583
        then the address of the object is passed by the calling function, and
1584
        the called function declares a local of the same type, and uses the
1585
        parameter assignment operator to copy the object (this should be
1586
        remembered when passing parameters to entry points that have arguments
1587
        of a type that has a parameter assignment operator defined).</para>
1588
 
1589
      <para>The difference between the parameter assignment operator and the
1590
        assignment operator is that the parameter identifier to the parameter
1591
        assignment operator is a pointer to the object being manipulated,
1592
        rather than the object itself. An example reference assignment section
1593
        is:
1594
 
1595
        <programlisting>
1596
%parameter-assignments%
1597
 
1598
ListT : ( l1 ) -&gt; ( l2 ) = @{
1599
if ( @l2.head = @l1-&gt;head ) {
1600
@l2.tail = @l1-&gt;tail ;
1601
} else {
1602
@l2.tail = &amp;( @l2.head ) ;
1603
}
1604
@} ;
1605
        </programlisting></para>
1606
    </sect1>
1607
 
1608
    <sect1 id="result-assign">
1609
      <title>The result assignments section</title>
1610
 
1611
      <para>The result assignments section is very similar to the assignments
1612
        section and the parameter assignments section (which it follows), and
1613
        is also optional. If a type has an assignment section entry, it must
1614
        also have a result assignment entry. The only difference between the
1615
        two is that the result identifier of the result assignment operation
1616
        is a pointer to the object being manipulated, rather than the object
1617
        itself. Result assignments are only used when the results of a rule
1618
        are assigned back through the reference parameters passed into the
1619
        function. An example result assignment section is:
1620
 
1621
        <programlisting>
1622
%result-assignments%
1623
 
1624
ListT : ( l1 ) -&gt; ( l2 ) = @{
1625
if ( @l2-&gt;head = @l1.head ) {
1626
@l2-&gt;tail = @l1.tail ;
1627
} else {
1628
@l2-&gt;tail = &amp;( @l2-&gt;head ) ;
1629
}
1630
@} ;
1631
        </programlisting></para>
1632
    </sect1>
1633
 
1634
    <sect1 id="term-result">
1635
      <title>The terminal result extraction section</title>
1636
 
1637
      <para>The terminal result extraction section follows the reference
1638
        assignment section. It defines how to extract the results from
1639
        terminals. The section begins with its section header, followed by the
1640
        terminal extraction definitions.</para>
1641
 
1642
      <para>There must be a definition for every terminal in the grammar that
1643
        returns a result. It is an error to include a definition for a
1644
        terminal that doesn't return a result. The result of the definition
1645
        should be the same as the result of the terminal. An example of the
1646
        terminal result extraction section follows:
1647
 
1648
        <programlisting>
1649
%terminals%
1650
 
1651
number : () -&gt; ( n ) = @{
1652
@n = token.u.number ;
1653
@} ;
1654
 
1655
identifier : () -&gt; ( i ) = @{
1656
@i = token.u.identifier ;
1657
@} ;
1658
 
1659
string : () -&gt; ( s ) = @{
1660
@s = token.u.string ;
1661
@} ;
1662
        </programlisting></para>
1663
    </sect1>
1664
 
1665
    <sect1 id="action-defn">
1666
      <title>The action definition section</title>
1667
 
1668
      <para>The action definition section follows the terminal result
1669
        extractor definition section. The format is similar to the previous
1670
        sections: the section header followed by definitions for all of the
1671
        actions. An action definition has the following form:
1672
 
1673
        <programlisting>
1674
&lt;action-name&gt; : ( parameters ) -&gt; ( results ) = code-block ;
1675
        </programlisting></para>
1676
 
1677
      <para>This is similar to the form of all previous definitions, except
1678
        that the name is surrounded in angle brackets. What follows is also
1679
        true of the other definitions as well (unless they state
1680
        otherwise).</para>
1681
 
1682
      <para>The <code>action-name</code> is a <code>sid</code> identifier that
1683
        is the name of the action being defined; <code>parameters</code> is a
1684
        comma separated list of C identifiers that will be the names of the
1685
        parameters passed to the action, and <code>results</code> is a comma
1686
        separated list of C identifiers that will be the names of the result
1687
        parameters passed to the action. The <code>code-block</code> is the C
1688
        code that defines the action. It is expected that this will assign a
1689
        valid result to each of the result identifier names.</para>
1690
 
1691
      <para>The parameter and result tuples have the same form as in the
1692
        language independent file, except that the types are optional.  Like
1693
        the language independent file, if the type of an action is zero-tuple
1694
        to zero-tuple, then the type can be omitted, e.g.:
1695
 
1696
        <programlisting>
1697
&lt;action&gt; = @{ /* .... */ @} ;
1698
        </programlisting></para>
1699
 
1700
      <para>An example action definition section is:
1701
 
1702
        <programlisting>
1703
%actions%
1704
 
1705
&lt;add&gt; : ( v1, v2 ) -&gt; ( v3 ) = @{
1706
@v3 = @v1 + @v2 ;
1707
@} ;
1708
 
1709
&lt;subtract&gt; : ( v1 : NumberT, v2 : NumberT ) -&gt; ( v3 : NumberT ) = @{
1710
@v3 = @v1 - @v2 ;
1711
@} ;
1712
 
1713
&lt;multiply&gt; : ( v1 : NumberT, v2 ) -&gt; ( v3 ) = @{
1714
@v3 = @v1 * @v2 ;
1715
@} ;
1716
 
1717
&lt;divide&gt; : ( v1, v2 ) -&gt; ( v3 : NumberT ) = @{
1718
@v3 = @v1 / @v2 ;
1719
@} ;
1720
 
1721
&lt;print&gt; : ( v ) -&gt; () = @{
1722
printf ( "%u\n", @v ) ;
1723
@} ;
1724
 
1725
&lt;error&gt; = @{
1726
fprintf ( stderr, "ERROR\n" ) ;
1727
exit ( EXIT_FAILURE ) ;
1728
@} ;
1729
        </programlisting></para>
1730
 
1731
      <para>Do not define static variables in action definitions; if you do,
1732
        you will get unexpected results. If you wish to use static variables
1733
        in actions definitions, then define them in the header block.</para>
1734
    </sect1>
1735
 
1736
    <sect1 id="trailer">
1737
      <title>The trailer section</title>
1738
 
1739
      <para>After the action definition section comes the trailer section.
1740
        This has the same form as the header section. An example is:
1741
 
1742
        <programlisting>
1743
%trailer% @{
1744
int main ()
1745
{
1746
next_token () ;
1747
calculator ( NULL ) ;
1748
return 0 ;
1749
}
1750
@}, @{
1751
@} ;
1752
        </programlisting></para>
1753
 
1754
      <para>The code blocks will be appended to the generated parser, and the
1755
        generated header file respectively.</para>
1756
    </sect1>
1757
  </chapter>
1758
 
1759
  <chapter id="predicate">
1760
    <title>Predicates</title>
1761
 
1762
    <para>Predicates provide the user with a mechanism for altering the
1763
      control flow in a manner that terminals alone cannot do.</para>
1764
 
1765
    <para>During the factorisation process, rules that begin with predicates
1766
      are expanded if necessary to ensure that predicates that may be used to
1767
      select which alternative to go down always begin the alternative, e.g.:
1768
 
1769
      <programlisting>
1770
rule1 = {
1771
rule2 ;
1772
/* .... */
1773
||
1774
/* .... */
1775
} ;
1776
 
1777
rule2 = {
1778
? = &lt;predicate&gt; ;
1779
/* .... */
1780
||
1781
/* .... */
1782
} ;
1783
      </programlisting>
1784
 
1785
      would be expanded into:
1786
 
1787
      <programlisting>
1788
rule1 = {
1789
? = predicate ;
1790
/* .... */
1791
/* .... */
1792
||
1793
/* .... */
1794
/* .... */
1795
||
1796
/* .... */
1797
} ;
1798
      </programlisting></para>
1799
 
1800
    <para>Also, if a predicate is used to select which alternative to use, it
1801
      must be the first thing in the alternative, so the following would not
1802
      be allowed:
1803
 
1804
      <programlisting>
1805
rule = {
1806
&lt;action&gt; ;
1807
? = &lt;predicate&gt; ;
1808
/* .... */
1809
||
1810
/* .... */
1811
} ;
1812
      </programlisting></para>
1813
 
1814
    <para>When predicates begin a rule, they are executed (in some arbitrary
1815
      order) until one of them returns true. The alternative that this
1816
      predicate begins is then selected. If no predicates return true, then
1817
      one of the remaining alternatives is selected based upon the current
1818
      terminal (or an error occurs).</para>
1819
 
1820
    <para>It is important that predicates do not contain dependencies upon the
1821
      order of evaluation. In practice, predicates are likely to be simple, so
1822
      this shouldn't be a problem.</para>
1823
 
1824
    <para>When predicates are used within an alternative, they behave like
1825
      terminals. If they evaluate to true, then parsing continues.  If they
1826
      evaluate to false, then an exception is raised.</para>
1827
  </chapter>
1828
 
1829
  <chapter id="exception">
1830
    <title>Error handling</title>
1831
 
1832
    <para>If the input given to the parser is valid, then the parser will not
1833
      need to produce any errors. Unfortunately this is not always the case,
1834
      so <code>sid</code> provides a mechanism for handling errors.</para>
1835
 
1836
    <para>When an error occurs, an exception is raised. This passes control to
1837
      the nearest enclosing exception handler. If there is no exception
1838
      handler at all, the entry point function will return with the current
1839
      terminal set to the error value.</para>
1840
 
1841
    <para>An exception handler is just an alternative that is executed when a
1842
      terminal or predicate fails. This should obviate the need to rely upon
1843
      language specific mechanisms (such as <code>setjmp</code> and
1844
      <code>longjmp</code>) for error recovery.</para>
1845
  </chapter>
1846
 
1847
  <chapter id="call-by-ref">
1848
    <title>Call by reference</title>
1849
 
1850
    <para>The default behaviour of <code>sid</code> is to do argument passing
1851
      using call by copy semantics, and to not allow mutation of parameters of
1852
      rules and actions (however inlined rules, and rules created during
1853
      factoring have call by reference parameters). However it is possible to
1854
      give rule and action parameters call by reference semantics, using the
1855
      <code>&amp;</code> symbol in the type specification (as described
1856
      earlier). It is also possible to mutate parameters of actions, using the
1857
      <code>@=</code> substitution in the action body (also described
1858
      earlier). It is important to do the correct substitutions in action
1859
      definitions, as <code>sid</code> uses this information to decide where
1860
      it can optimise the output code.</para>
1861
 
1862
    <para>If a call by copy parameter is mutated, then <code>sid</code> will
1863
      introduce a new temporary variable and copy the parameter into it - this
1864
      temporary will then be mutated. Similar code will be output for rules
1865
      that have call by copy parameters that are mutated (e.g. as a call by
1866
      reference argument to an action that mutates its parameters).</para>
1867
  </chapter>
1868
 
1869
  <chapter id="call-entry">
1870
    <title>Calling entry points</title>
1871
 
1872
    <para>When calling a function that implements an entry point rule, it
1873
      should be called with the rule's parameters as the first arguments,
1874
      followed by the addresses of the rule's results as the remaining
1875
      arguments. The parameters should have their addresses passed if they are
1876
      of a type that has a parameter assignment operator defined, or if the
1877
      parameter is a call by reference parameter.</para>
1878
 
1879
    <para>For example, given the following rule:
1880
 
1881
      <programlisting>
1882
rule1 : ( :Type1T, :Type2T, :Type3T &amp; ) -&gt; ( :Type4T ) ;
1883
      </programlisting>
1884
 
1885
      where <code>Type2T</code> has a parameter assignment operator defined,
1886
      and <code>rule1</code> is mapped to <code>rule1</code> (and the type
1887
      names are mapped to themselves), the call would be something like:
1888
 
1889
      <programlisting>
1890
Type1T a = make_type1 () ;
1891
Type2T b = make_type2 () ;
1892
Type3T c = make_type3 () ;
1893
Type4T d ;
1894
 
1895
rule1 ( a, b, &amp;c, &amp;d ) ;
1896
      </programlisting></para>
1897
  </chapter>
1898
 
1899
  <chapter id="glossary">
1900
    <title>Glossary</title>
1901
 
1902
    <para>This section describes some of the terms used in the
1903
      <code>sid</code> documentation.</para>
1904
 
1905
    <glossary id="sid-glossary">
1906
      <glossentry>
1907
        <glossterm>Alternative</glossterm>
1908
 
1909
        <glossdef>
1910
          <para>An alternative is a sequence of items.</para>
1911
        </glossdef>
1912
      </glossentry>
1913
 
1914
      <glossentry>
1915
        <glossterm>Exception handler</glossterm>
1916
 
1917
        <glossdef>
1918
          <para>An exception handler is a special type of alternative.  Each
1919
            rule may have at most one exception handler. An exception handler
1920
            is invoked if the current terminal does not match any of the
1921
            expected terminals, if a predicate fails, or if an action raises
1922
            an exception, within the scope of the exception handler.</para>
1923
        </glossdef>
1924
      </glossentry>
1925
 
1926
      <glossentry>
1927
        <glossterm>Expansion</glossterm>
1928
 
1929
        <glossdef>
1930
          <para>This is the process of physically inlining a rule into another
1931
            rule. It is done during the factoring process to turn the grammar
1932
            into a form that a parser can be produced for. See the entry for
1933
            factoring.</para>
1934
        </glossdef>
1935
      </glossentry>
1936
 
1937
      <glossentry>
1938
        <glossterm>Factoring</glossterm>
1939
 
1940
        <glossdef>
1941
          <para>This is one of the transforms that <code>sid</code> performs
1942
            on the grammar. See the <link linkend="overview">overview
1943
            section</link> for a description of the factoring process.</para>
1944
        </glossdef>
1945
      </glossentry>
1946
 
1947
      <glossentry>
1948
        <glossterm>First set</glossterm>
1949
 
1950
        <glossdef>
1951
          <para>The first set of a rule (or alternative) is the set of
1952
            terminals and predicates that can start that rule (or
1953
            alternative).</para>
1954
        </glossdef>
1955
      </glossentry>
1956
 
1957
      <glossentry>
1958
        <glossterm>Follow set</glossterm>
1959
 
1960
        <glossdef>
1961
          <para>The follow set of a rule is the set of terminals and
1962
            predicates that can follow the rule in any of its
1963
            invocations.</para>
1964
        </glossdef>
1965
      </glossentry>
1966
 
1967
      <glossentry>
1968
        <glossterm>Inlining</glossterm>
1969
 
1970
        <glossdef>
1971
          <para>This is the process of outputting the code for parsing one
1972
            rule within the function that parses another rule. This is
1973
            normally done as part of the output process. Expansion is a form
1974
            of inlining performed during the factoring process, but the
1975
            inlining is done by modifying the grammar, rather than as part of
1976
            the output phase.</para>
1977
        </glossdef>
1978
      </glossentry>
1979
 
1980
      <glossentry>
1981
        <glossterm>Item</glossterm>
1982
 
1983
        <glossdef>
1984
          <para>An item is the equivalent of a statement in a conventional
1985
            programming language. It can be an invocation of a rule, terminal,
1986
            action or predicate, or an identity operation (assignment).</para>
1987
        </glossdef>
1988
      </glossentry>
1989
 
1990
      <glossentry>
1991
        <glossterm>Name</glossterm>
1992
 
1993
        <glossdef>
1994
          <para>A name is an identifier that is used to pass information
1995
            between rules and actions. Local names are defined within a rule,
1996
            and only exist within the rule itself. Non-local names are defined
1997
            in a rule's scoped definitions section and exists in all of the
1998
            rules in that scope. Non-local rules are also saved and restored
1999
            across calls to the rule that defines them.</para>
2000
        </glossdef>
2001
      </glossentry>
2002
 
2003
      <glossentry>
2004
        <glossterm>Recursion</glossterm>
2005
 
2006
        <glossdef>
2007
          <para>Recursion is where a rule invokes itself. Direct recursion is
2008
            where the rule invokes itself from one of its own alternatives;
2009
            indirect recursion is where a rule invokes another rule (which
2010
            invokes another rule etc.) which eventually invokes the original
2011
            rule.</para>
2012
 
2013
          <para>Left recursion is a form of recursion where all of the
2014
            recursive calls occur as the first item in an alternative. It is
2015
            not possible to produce a parser for a grammar that contains left
2016
            recursions, so <code>sid</code> turns left recursions into right
2017
            recursions. This process is known as left recursion
2018
            elimination.</para>
2019
 
2020
          <para>Right recursion is a form of recursion where all of the
2021
            recursive calls occur as the last item in an alternative.</para>
2022
        </glossdef>
2023
      </glossentry>
2024
 
2025
      <glossentry>
2026
        <glossterm>Production</glossterm>
2027
 
2028
        <glossdef>
2029
          <para>See rule.</para>
2030
        </glossdef>
2031
      </glossentry>
2032
 
2033
      <glossentry>
2034
        <glossterm>Rule</glossterm>
2035
 
2036
        <glossdef>
2037
          <para>A rule is a sequence of alternatives. A rule may contain a
2038
            special alternative that is used as an exception handler.  A rule
2039
            is also referred to as a production; this term is normally used
2040
            when talking about the definition of a rule.</para>
2041
        </glossdef>
2042
      </glossentry>
2043
 
2044
      <glossentry>
2045
        <glossterm>See-through</glossterm>
2046
 
2047
        <glossdef>
2048
          <para>A rule is said to be see-through if there is an expansion of
2049
            the rule that does not contain any terminals or predicates.</para>
2050
        </glossdef>
2051
      </glossentry>
2052
    </glossary>
2053
  </chapter>
2054
 
2055
  <chapter id="errors">
2056
    <title>Understanding error messages</title>
2057
 
2058
    <para>This section tries to explain what some of the error messages that
2059
      are reported by the <code>sid</code> transforms mean. It does not
2060
      contain descriptions of messages like "type 'some type' is unknown", as
2061
      these should be self-explanatory.</para>
2062
 
2063
    <sect1 id="left-errors">
2064
      <title>Left recursion elimination errors</title>
2065
 
2066
      <para><errortext>The parameter or result types of the left recursive
2067
        calls in the following productions do not match:
2068
        PRODUCTIONS</errortext>: This means that there is a set of rules which
2069
        call each other left recursively (i.e. the first item in some of the
2070
        alternatives in each rule is a call to another rule in the set), and
2071
        they do not all have the same parameter and result types, e.g.:
2072
 
2073
        <programlisting>
2074
rule1 : ( a : Type1T, b : Type1T, c : Type2T, d : Type2T ) -&gt; () = {
2075
rule2 ( a, b ) ;
2076
||
2077
terminal1 ;
2078
} ;
2079
 
2080
rule2 : ( a : Type1T, b : Type2T ) -&gt; () = {
2081
rule1 ( a, a, b, b ) ;
2082
||
2083
terminal2 ;
2084
} ;
2085
        </programlisting></para>
2086
 
2087
      <para><errortext>The exception handlers in the left recursion involving
2088
        the following productions do not match: PRODUCTIONS</errortext>: This
2089
        means that there is a set of productions which call each other left
2090
        recursively (i.e. the first item in an alternative is a call to
2091
        another production in the set), and they do not all have the same
2092
        exception handler, e.g.:
2093
 
2094
        <programlisting>
2095
rule1 = {
2096
rule2 ;
2097
||
2098
terminal1 ;
2099
##
2100
&lt;action1&gt; ;
2101
} ;
2102
 
2103
rule2 = {
2104
rule1 ;
2105
||
2106
terminal2 ;
2107
##
2108
&lt;action2&gt; ;
2109
} ;
2110
        </programlisting></para>
2111
 
2112
      <para>It is quite likely that when using exception handlers, it may be
2113
        necessary to do the left recursion elimination manually to ensure that
2114
        the exception handlers occur at the correct place.</para>
2115
 
2116
      <para><errortext>The argument names of the left recursive calls in the
2117
        following productions do not match: PRODUCTIONS</errortext>: This
2118
        means that there is a set of productions which call each other left
2119
        recursively (i.e. the first item in an alternative is a call to
2120
        another production in the set), and the arguments to one of the left
2121
        recursive calls are not the same as the parameters of the calling
2122
        rule, e.g.:
2123
 
2124
        <programlisting>
2125
rule1 : ( a : Type1T, b : Type1T ) -&gt; () = {
2126
rule1 ( b, a ) ;
2127
||
2128
terminal1 ;
2129
} ;
2130
        </programlisting></para>
2131
 
2132
 
2133
      <para><errortext>A non-local name in the rule 'RULE' is not in scope in
2134
        the rule 'RULE' in the left recursive cycle involving the following
2135
        productions: PRODUCTIONS</errortext>: This means that there is a set
2136
        of productions which call each other left recursively (i.e. the first
2137
        item in an alternative is a call to another production in the set),
2138
        and the first named rule uses a non-local name that is not in scope in
2139
        the second named rule, e.g.:
2140
 
2141
        <programlisting>
2142
rule1 [
2143
name1 : Type1T ;
2144
rule1_1 [
2145
name1_1 : Type1T ;
2146
] = {
2147
rule1 ;
2148
&lt;action1_1&gt; ( name1_1 ) ;
2149
||
2150
terminal1 ;
2151
} ;
2152
] = {
2153
terminal2 ;
2154
||
2155
rule1_1 ;
2156
&lt;action1&gt; ( name1 ) ;
2157
} ;
2158
        </programlisting></para>
2159
 
2160
      <para><errortext>The rule 'RULE' declares non-local names in the left
2161
        recursive cycle with more than one entry point involving the following
2162
        productions: PRODUCTIONS</errortext>: This means that there is a set
2163
        of productions which call each other left recursively (i.e. the first
2164
        item in an alternative is a call to another production in the set),
2165
        and the named rule defines non-local variables even though it is not
2166
        the only entry point to the cycle, e.g.:
2167
 
2168
        <programlisting>
2169
rule1 [
2170
name1 : Type1T ;
2171
rule1_1 = {
2172
&lt;action1_1&gt; ( name1 ) ;
2173
} ;
2174
] = {
2175
terminal1 ;
2176
rule1_1 ;
2177
||
2178
rule2 ;
2179
&lt;action1&gt; ( name1 ) ;
2180
} ;
2181
 
2182
rule2 = {
2183
rule1 ;
2184
&lt;action2&gt; ;
2185
||
2186
terminal2 ;
2187
} ;
2188
        </programlisting></para>
2189
 
2190
      <para>No cycle termination for the left recursive set involving the
2191
        following rules: <token>RULES</token>: This means that there is a set
2192
        of productions which call each other left recursively (i.e.  the first
2193
        item in an alternative is a call to another production in the set),
2194
        and they do not contain an alternative that begins with a non-left
2195
        recursive call, e.g.:
2196
 
2197
        <programlisting>
2198
rule1 = {
2199
rule2 ;
2200
||
2201
rule3 ;
2202
} ;
2203
 
2204
rule2 = {
2205
rule1 ;
2206
||
2207
rule3 ;
2208
} ;
2209
 
2210
rule3 = {
2211
rule1 ;
2212
||
2213
rule2 ;
2214
} ;
2215
        </programlisting></para>
2216
    </sect1>
2217
 
2218
    <sect1 id="first-errors">
2219
      <title>First set computation errors</title>
2220
 
2221
      <para>Cannot compute first set for <token>PRODUCTION</token>: This means
2222
        that <code>sid</code> cannot compute the set of terminals and
2223
        predicates that may start the production. This is normally because
2224
        there is a recursive call (or cycle) that contains no terminals, e.g.:
2225
 
2226
        <programlisting>
2227
rule1 = {
2228
&lt;action1&gt; ;
2229
rule1 ;
2230
||
2231
terminal1 ;
2232
} ;
2233
        </programlisting></para>
2234
 
2235
      <para>This is not removed by the left recursion elimination phase, as
2236
        the call is not the leftmost item in the alternative.</para>
2237
 
2238
      <para>Can see through to predicate '<token>PREDICATE</token>' in
2239
        production <token>PRODUCTION</token>: This means that there is a
2240
        predicate that isn't the first item in its alternative, but is
2241
        preceded only by see-through items, e.g.:
2242
 
2243
        <programlisting>
2244
rule1 = {
2245
&lt;action1&gt; ;
2246
? = &lt;predicate&gt; ;
2247
||
2248
terminal1 ;
2249
} ;
2250
        </programlisting></para>
2251
 
2252
 
2253
      <para>Can see through to predicates in rule '<token>RULE</token>' in
2254
        production <token>PRODUCTION</token>: This means that the first rule
2255
        has at least one predicate in its first set, and the second rule calls
2256
        it in a position where it is not the first item in the alternative and
2257
        is preceded only by see-through items, e.g.:
2258
 
2259
        <programlisting>
2260
rule1 = {
2261
? = &lt;predicate&gt; ;
2262
||
2263
terminal1 ;
2264
} ;
2265
 
2266
rule2 = {
2267
&lt;action&gt; ;
2268
rule1 ;
2269
||
2270
terminal2 ;
2271
} ;
2272
        </programlisting></para>
2273
 
2274
      <para>The rule '<token>RULE</token>' has all terminals in its first set
2275
        and has a redundant see-through alternative: This means that the
2276
        rule's first set (the set of all terminals that can start the rule)
2277
        includes all possible input terminals, and the rule also has a
2278
        see-through alternative. The see-through alternative will never be
2279
        used, as one of the other alternatives will always be chosen.</para>
2280
    </sect1>
2281
 
2282
    <sect1 id="factor-errors">
2283
      <title>Factoring errors</title>
2284
 
2285
      <para>Too many productions (<token>NUMBER</token>) created during
2286
        factorisation: This normally means that <code>sid</code> cannot factor
2287
        the grammar. You will need to rewrite the offending part.
2288
        Unfortunately there is no easy way to do this. Start by looking at the
2289
        dump file for a set of rules that seem to have been expanded a
2290
        lot.</para>
2291
 
2292
      <para>The rule '<token>RULE</token>' cannot be expanded into
2293
        '<token>RULE</token>' as the exception handlers don't match: When
2294
        <code>sid</code> performs factoring, it needs to expand calls to
2295
        certain rules into the rules that calls them (this is described in the
2296
        <link linkend="overview">overview section</link>). If the called rule
2297
        has an exception handler and it is not the same as the exception
2298
        handler of the calling rule, then the expansion will fail.</para>
2299
 
2300
      <para>The rule '<token>RULE</token>' cannot be expanded into
2301
        '<token>RULE</token>' as it contains non-local name definitions: When
2302
        <code>sid</code> performs factoring, it needs to expand calls to
2303
        certain rules into the rules that calls them (this is described in the
2304
        <link linkend="overview">overview section</link>). If the called rule
2305
        defines any non-local names, then the expansion will fail.</para>
2306
    </sect1>
2307
 
2308
    <sect1 id="check-errors">
2309
      <title>Checking errors</title>
2310
 
2311
      <para><errortext>Collision of terminal(s) TERMINALS in rule
2312
        'RULE'</errortext>: This error means that more than one alternative in
2313
        the named rule begins with the named terminals, e.g.:
2314
 
2315
        <programlisting>
2316
rule1 = {
2317
&lt;action1&gt; ;
2318
terminal1 ;
2319
||
2320
terminal1 ;
2321
} ;
2322
        </programlisting></para>
2323
 
2324
      <para>Normally, the factoring process will remove the problem, but when
2325
        something like the above happens to stop the factoring occurring, this
2326
        error will be produced.</para>
2327
 
2328
      <para><errortext>Collision of predicate 'PREDICATE' in rule
2329
        'RULE'</errortext>: This error occurs when more than one alternative
2330
        in the named rule begins with the named predicate, e.g.:
2331
 
2332
        <programlisting>
2333
rule1 = {
2334
( a, ? ) = &lt;predicate&gt; ;
2335
&lt;action1&gt; ( a ) ;
2336
||
2337
( ?, b ) = &lt;predicate&gt; ;
2338
&lt;action2&gt; ( b ) ;
2339
} ;
2340
        </programlisting></para>
2341
 
2342
      <para>Again, it is normally the case that the factoring process will
2343
        remove this problem, but if the same predicate uses different
2344
        predicate results in different alternatives, this error will be
2345
        produced.</para>
2346
 
2347
      <para>The terminal(s) <token>TERMINALS</token> can start rule
2348
        '<token>RULE</token>' which is see-through, and the same terminal(s)
2349
        may appear in the following situations: <token>ALTERNATIVES</token>:
2350
        This means that there are one or more terminals that can start the
2351
        named rule (which is see-through), and may also follow it, e.g.:
2352
 
2353
        <programlisting>
2354
rule1 = {
2355
terminal1 ;
2356
||
2357
$ ;
2358
} ;
2359
 
2360
rule2 = {
2361
rule1 ;
2362
terminal1 ;
2363
||
2364
terminal2 ;
2365
} ;
2366
        </programlisting></para>
2367
 
2368
      <para>The alternatives listed are the alternatives which call the rule,
2369
        and contain (some of) the named terminals after the call. The call is
2370
        highlighted.</para>
2371
 
2372
      <para>The predicate(s) <token>PREDICATES</token> can start rule
2373
        '<token>RULE</token>' which is see-through and the same predicate(s)
2374
        may appear in the following situations: <token>ALTERNATIVES</token>:
2375
        This means that there are one or more predicates that can start the
2376
        named rule (which is see-through), and may also follow it, e.g.:
2377
 
2378
        <programlisting>
2379
rule1 = {
2380
? = &lt;predicate&gt; ;
2381
||
2382
$ ;
2383
} ;
2384
 
2385
rule2 = {
2386
terminal1 ;
2387
rule1 ;
2388
? = &lt;predicate&gt; ;
2389
||
2390
terminal2 ;
2391
} ;
2392
        </programlisting></para>
2393
 
2394
      <para>The alternatives listed are the alternatives which call the rule,
2395
        and contain (some of) the named predicates after the call.  The call
2396
        is highlighted.</para>
2397
 
2398
      <para>The rule '<token>RULE</token>' contains more than one see-through
2399
        alternative: This error occurs if the rule has more than one
2400
        alternative that doesn't need to read a terminal or a predicate, e.g.:
2401
 
2402
        <programlisting>
2403
rule1 = {
2404
&lt;action1&gt; ;
2405
||
2406
&lt;action2&gt; ;
2407
} ;
2408
        </programlisting></para>
2409
    </sect1>
2410
  </chapter>
2411
</book>