WebSVN – tendra.SVN – /branches/tendra4/doc/utilities/sid.html

<!-- Crown Copyright (c) 1998 -->
<HTML>
<HEAD>
<TITLE>sid users' guide</TITLE>
</HEAD>
<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000FF" VLINK="#400080" ALINK="#FF0000">
<H1>sid Users' Guide</H1>
<H3>January 1998</H3>
<IMG SRC="../images/no_next.gif" ALT="next section">
<IMG SRC="../images/no_prev.gif" ALT="previous section">
<IMG SRC="../images/no_top.gif" ALT="current document">
<A HREF="../index.html"><IMG SRC="../images/home.gif" ALT="TenDRA home page">
</A>
<IMG SRC="../images/no_index.gif" ALT="document index"><P>
<HR>
<DL>
<DT><A HREF="#intro"><B>1</B> - Introduction</A><DD>
<DT><A HREF="#grammar"><B>2</B> - Grammars</A><DD>
<DL>
<DT><A HREF="#parse"><B>2.1</B> - Parsing</A><DD>
<DT><A HREF="#context-free"><B>2.2</B> - Context free grammars</A><DD>
<DT><A HREF="#sid-grammar"><B>2.3</B> - <CODE>sid</CODE> grammars</A><DD>
</DL>
<DT><A HREF="#overview"><B>3</B> - Overview</A><DD>
<DL>
<DT><A HREF="#left-recursion"><B>3.1</B> - Left recursion elimination</A><DD>
<DT><A HREF="#factoring"><B>3.2</B> - Factoring</A><DD>
<DT><A HREF="#optimise"><B>3.3</B> - Optimisations</A><DD>
</DL>
<DT><A HREF="#invoke"><B>4</B> - Invocation</A><DD>
<DT><A HREF="#grammar-file"><B>5</B> - The <CODE>sid</CODE> grammar
file</A><DD>
<DL>
<DT><A HREF="#grammar-lex"><B>5.1</B> - Lexical conventions</A><DD>
<DT><A HREF="#type-decl"><B>5.2</B> - The type declaration section</A><DD>
<DT><A HREF="#term-decl"><B>5.3</B> - The terminal declaration section</A><DD>
<DT><A HREF="#rule-defn"><B>5.4</B> - The rule definition section</A><DD>
<DT><A HREF="#entry-point"><B>5.5</B> - The grammar entry points section</A><DD>
</DL>
<DT><A HREF="#info-file"><B>6</B> - The C information file</A><DD>
<DL>
<DT><A HREF="#info-lex"><B>6.1</B> - Lexical conventions</A><DD>
<DT><A HREF="#prefixes"><B>6.2</B> - The prefixes section</A><DD>
<DT><A HREF="#maps"><B>6.3</B> - The maps section</A><DD>
<DT><A HREF="#header"><B>6.4</B> - The header section</A><DD>
<DT><A HREF="#assign"><B>6.5</B> - The assignments section</A><DD>
<DT><A HREF="#param-assign"><B>6.6</B> - The parameter assignments
section</A><DD>
<DT><A HREF="#result-assign"><B>6.7</B> - The result assignments section</A><DD>
<DT><A HREF="#term-result"><B>6.8</B> - The terminal result extraction
section</A><DD>
<DT><A HREF="#action-defn"><B>6.9</B> - The action definition section</A><DD>
<DT><A HREF="#trailer"><B>6.10</B> - The trailer section</A><DD>
</DL>
<DT><A HREF="#predicate"><B>7</B> - Predicates</A><DD>
<DT><A HREF="#exception"><B>8</B> - Error handling</A><DD>
<DT><A HREF="#call-by-ref"><B>9</B> - Call by reference</A><DD>
<DT><A HREF="#call-entry"><B>10</B> - Calling entry points</A><DD>
<DT><A HREF="#glossary"><B>11</B> - Glossary</A><DD>
<DT><A HREF="#errors"><B>12</B> - Understanding error messages</A><DD>
<DL>
<DT><A HREF="#left-errors"><B>12.1</B> - Left recursion elimination
errors</A><DD>
<DT><A HREF="#first-errors"><B>12.2</B> - First set computation errors</A><DD>
<DT><A HREF="#factor-errors"><B>12.3</B> - Factoring errors</A><DD>
<DT><A HREF="#check-errors"><B>12.4</B> - Checking errors</A><DD>
</DL>
</DL>

<HR>
<H2><A NAME="intro">1. Introduction</A></H2>
<P>
This document describes how to use the <CODE>sid</CODE> parser generator.
It was written for <CODE>sid</CODE> version 1.9.  The main features
of each version of <CODE>sid</CODE> are listed below: 
<UL>

<LI><CODE>sid</CODE> 1.0 - this was the first version of <CODE>sid</CODE>
to support attribute grammars.  The output language was C. 
<LI><CODE>sid</CODE> 1.1 - this was a bug fix version of <CODE>sid</CODE>
1.0. 
<LI><CODE>sid</CODE> 1.2 - this version of <CODE>sid</CODE> added
predicates, exception handling, improved inlining options, and a 
grammar test pseudo-language. 
<LI><CODE>sid</CODE> 1.3 - this version of <CODE>sid</CODE> added
anonymous rules, and a better syntax for the C specific information.

<LI><CODE>sid</CODE> 1.4 - this was a bug fix version of <CODE>sid</CODE>
1.3. 
<LI><CODE>sid</CODE> 1.5 - this was a bug fix version of <CODE>sid</CODE>
1.4.  The command line option syntax changed in this release as well.

<LI><CODE>sid</CODE> 1.6 - this version of <CODE>sid</CODE> changed
the     input syntax, added scoped rules, and removed basics (replacing
them with terminal symbols).  It also added a stricter ISO C language.

<LI><CODE>sid</CODE> 1.7 - The syntax of the actions file changed
slightly in this version.  The error messages and recovery from  
syntax errors were also improved in this version.  This version  
also added explicit call by reference support (rather than the   
inconsistent function call semantics of earlier versions).  The  
command line options changed in this version, to support language
specific options, and the <CODE>strict-ansi-c</CODE> language was
dropped.  Non-local variables were also added. 
<LI><CODE>sid</CODE> 1.8 - Initialisers were added for non-local 
variables.  Assignment was added. 
<LI><CODE>sid</CODE> 1.9 - this was a bug fix version of <CODE>sid</CODE>
1.8. 
</UL>
</P>
<P>
<CODE>sid</CODE> turns specifications of languages into programs that
recognise those languages.  One of the aims of <CODE>sid</CODE> was
to separate the specification of the language to be recognised from
the language that the recogniser program is written in.  For this
reason, input to <CODE>sid</CODE> is split into two components: output
language independent information, and output language dependent information.
</P>
<P>
At present, <CODE>sid</CODE> will only output programs in C (either
ISO or pre-ISO), but it is designed so that adding new output languages
should be fairly simple.  There is one other pseudo-language: the
test language.  This is used for testing grammars and the transforms,
but will not output a parser. 
</P>

<HR>
<H2><A NAME="grammar">2. Grammars</A></H2>

<H3><A NAME="parse">2.1. Parsing</A></H3>
<P>
There are two phases in traditional language recognition that are
relevant to <CODE>sid</CODE>: the first is lexical analysis (breaking
the input up into terminal symbols); the second is syntax analysis
or parsing (ensuring that the terminal symbols in the input are in
the correct order). 
</P>
<P>
<CODE>sid</CODE> currently does very little to help with lexical analysis.
It is possible to use <CODE>sid</CODE> to produce the lexical analyser,
but <CODE>sid</CODE> provides no real support for it at present. 
For now, the programmer is expected to write the lexical analyser,
or use another tool to produce it. 
</P>
<P>
The lexical analyser should break the input up into a series of terminals.
Each of these terminals is allocated a number.  In 
<CODE>sid</CODE>, these numbers range from zero to the maximum terminal
number (specified in the grammar description).  The terminals may
also have data associated with them (e.g. the value of a number),
known as the attributes of the terminal, or the results of the terminal.
</P>
<P>
<CODE>sid</CODE> generates the parser.  The parser is a program that
reads the sequence of terminals from the lexical analyser, and ensures
that they form a valid input in the language being recognised.  If
the input is not valid, then the parser will fail (<CODE>sid</CODE>
provides mechanisms to allow the parser to recover from errors). 
</P>

<H3><A NAME="context-free">2.2. Context free grammars</A></H3>
<P>
This section provides a brief introduction to grammars.  It is not
intended to be an in-depth guide to grammars, more an introduction
which defines some terminology. 
</P>
<P>
<CODE>sid</CODE> works with a subset of what are known as context
free grammars.  A context free grammar consists of a set of input
symbols (known as terminals), a set of rules (descriptions of what
are legal forms in the language, also known as non-terminals), and
an entry point (the rule from which the grammar starts). 
</P>
<P>
A rule is defined as a series of alternatives (throughout this document
the definition of a rule is called a production - this may conflict
with some other uses of the term).  Each alternative consists of a
sequence of items.  An item can be a terminal or a rule.  As an example
(using the <CODE>sid</CODE> notation, which now looks unlike the conventional
syntax for grammars), a comma separated list of integers could be
specified as: 
<PRE>
        list-of-integers = {
                integer ;
            ||
                integer ;
                comma ;
                list-of-integers ;
        } ;
</PRE>
This production contains two alternatives: the first matches the terminal
<CODE>integer</CODE>; the second matches the sequence of terminals
<CODE>integer</CODE> followed by <CODE>comma</CODE>, followed by another
list of integers. 
</P>
<P>
There is much documentation available on context free grammars (and
other types of formal grammar).  The reader is advised to find an
alternative source for more information. 
</P>

<H3><A NAME="sid-grammar">2.3. <CODE>sid</CODE> grammars</A></H3>
<P>
<CODE>sid</CODE> grammars are based upon a subset of context free
grammars, known as LL(1) grammars.  The main property of such grammars
is that the parser can always tell which alternative it is going to
parse next by looking at the current terminal symbol.  <CODE>sid</CODE>
does a number of transforms to turn grammars that are not in this
form into it (although it cannot turn all possible grammars into this
form). It also provides facilities that allow the user to alter the
control flow of the parser. 
</P>
<P>
<CODE>sid</CODE> makes the following changes to the context free grammars
described above: 
<OL>
<LI>There may be more than one entry point to the grammar. 
<LI>As well as being a rule or a terminal, an item may be an action,
a predicate or an identity.  An action is just a user supplied function.
A predicate is a cross between a basic and an action (it is a user
defined function but it may alter the control flow).  An identity
is like an assignment in a conventional programming language. 
<LI>Rules may take parameters and return results (as may actions;
terminals may only return results).  These may be passed between items
using names. 
<LI>Each rule can have an exception handler associated with it. Exception
handlers are used for error recovery when the input being parsed does
not match the grammar. 
<LI>Rules may be anonymous. 
<LI>Rules may have non-local names associated with them.  These names
are in scope for that rule and any rules that are defined inside it.
The value of each non-local name is saved on entry to the rule, and
restored when the rule is exited. 
</OL>
</P>

<HR>
<H2><A NAME="overview">3. Overview</A></H2>
<P>
<CODE>sid</CODE> takes the input grammar and applies several transformations
to it, before it produces the parser.  These transformations allow
the language descriptions to be written in a slightly more natural
form than would otherwise be necessary. 
</P>

<H3><A NAME="left-recursion">3.1. Left recursion elimination</A></H3>
<P>
The first transformation is to eliminate any left recursive productions,
replacing them with right recursive ones.  This transforms a set of
rules of the form: 
<PRE>
        Ai = Aj Bji, Ci
</PRE>
into: 
<PRE>
        Ai = Cj Xji
        Xji = Bjk Xki, Yji
</PRE>
where <CODE>Yji</CODE> is the identity function if <CODE>i</CODE>
equals 
<CODE>j</CODE>, and an error otherwise.  In order for this transformation
to work, the following conditions must hold: 
<OL>

<LI>The parameter type for all <CODE>Ai</CODE> must be the same. 
<LI>The result type for all <CODE>Ai</CODE> must be the same. 
<LI>The exception handlers for all <CODE>Ai</CODE> must be the same.

<LI>The parameters to each left recursive call to some <CODE>Aj</CODE>
must be exactly the formal parameters to the calling rule <CODE>Ai</CODE>.

<LI>Any non-local name referenced by any rule must be in scope for
all rules. 
<LI>A rule may not define non-local variables unless it is the only
entry point into the cycle (i.e. there is only one named rule in the
cycle). 
</OL>
<CODE>sid</CODE> will report an error if these conditions are not
met. 
</P>

<H3><A NAME="factoring">3.2. Factoring</A></H3>
<P>
The second major transformation is to factor the grammar.  It is possible
that this phase could go on for ever, so there is a limit to the number
of rules that can be generated.  This limit may be changed (see the
<A HREF="#invoke">invocation section</A>).  In practice it is rare
for this to happen. 
</P>
<P>
The factoring phase tries to increase the number of language specifications
that <CODE>sid</CODE> can produce a parser for, by factoring out common
initial items in alternatives, e.g.: 
<PRE>
        X = {
                a ; b ;
            ||  a ; c ;
        } ;
</PRE>
would be transformed into something like: 
<PRE>
        X = {
                a ; X1 ;
        } ;

        X1 = {
                b ;
            ||  c ;
        } ;
</PRE>
It will also expand calls to rules at the beginning of alternatives
if this is necessary to allow the parser to be produced (this is the
phase that the limit is needed for).  The expansions are done in the
following cases: 
<OL>

<LI>If the rule is see-through (i.e. there is an expansion of the
rule that does not contain any terminals or predicates) and its first
set contains terminals in the first set of the rest of the alternative.

<LI>If the rule has a predicate in its first set. 
<LI>If the rule has terminals in its first set that are also in the
first set of another alternative that does not begin with the same
rule. 
</OL>
If the rule being expanded contains an exception handler, and it is
not identical to the exception handler in the enclosing rule, then
an error will occur.  Similarly if the rule being expanded defines
any non-local variables then an error will occur. 
</P>

<H3><A NAME="optimise">3.3. Optimisations</A></H3>
<P>
After these two transformations, <CODE>sid</CODE> performs some optimisations
on the grammar, such as reducing multiple copies of identical rules,
removing tail recursion, inlining all basic rules, inlining single
alternative rules, inling rules used only once, and inlining everything
that can be inlined.  All of the inlining is optional. 
</P>
<P>
After these optimisations, <CODE>sid</CODE> checks the language description
to ensure that it is possible to produce a parser for it, and if so
it outputs the parser. 
</P>

<HR>
<H2><A NAME="invoke">4. Invocation</A></H2>
<P>
<CODE>sid</CODE> should be invoked in the following manner: 
<PRE>
        sid <I>options input-files output-files</I>
</PRE>
The options are described later.  The input files should be a number
of input file names dependent upon the output language.  The first
input file is the <CODE>sid</CODE> grammar file.  In the case of either
C dialects there should be one other input file that provides C specific
information to <CODE>sid</CODE>.  The number of output files is also
language specific.  At present, two output files should be specified
for the C languages.  The first should be a <CODE>.c</CODE> file into
which the parser is written; the second should be a <CODE>.h</CODE>
file into which the terminal definitions and external function declarations
are written. 
</P>
<P>
The options list should consist of zero or more of the following options.
There are short forms for each of these options as well; see the 
<CODE>sid</CODE> manual page for more information on invocation. 
<DL>

<DT><B>--dump-file <I>FILE</I></B>
<DD>
This option causes intermediate dumps of the grammar to be written
to the named file.  The format of the dump files is similar to the
format of the grammar specification, with the following exceptions:
<OL>

<LI>Predicates are written with the predicate result replaced by the
predicate identifier (this will always be zero), and the result is
followed by a <CODE>?</CODE> to indicate that it was a predicate.
As an example, the predicate: 
<PRE>
        ( b, ? ) = &lt;pred&gt; ( a )
</PRE>
would be printed out as: 
<PRE>
        ( b : Type1T, 0 : Type2T ) ? = &lt;pred&gt; ( a : Type3T )
</PRE>

<LI>Items that are considered to be inlinable are prefixed by a 
<CODE>+</CODE>.  Items that are tail calls which will be eliminated
are prefixed by a <CODE>*</CODE>. 
<LI>Nested rules are written at the outer level, with names of the
form <CODE>outer-rule::....::inner-rule</CODE>. 
<LI>Types are provided on call parameter and result tuples. 
<LI>Inline rules are given a generated name, and are written out as
a call to the generated rule (and a definition elsewhere). 
</OL>
<P>

<DT><B>--factor-limit <I>LIMIT</I></B>
<DD>
This option limits the number of rules that can be created during
the factorisation process.  It is probably best not to change this.
<P>

<DT><B>--help</B>
<DD>
This option writes a summary of the command line options to the standard
error stream. 
<P>

<DT><B>--inline <I>INLINES</I></B>
<DD>
This option controls what inlining will be done in the output parser.
The inlines argument should be a comma separated list of the following
words: 
<P>
<DL>

<DT><CODE>SINGLES</CODE>
<DD>
This causes single alternative rules to be inlined.  This inlining
is no longer performed as a modification to the grammar (it was in
version 1.0). 
<DT><CODE>BASICS</CODE>
<DD>
This causes rules that contain only basics (and no exception handlers
or empty alternatives) to be inlined.  The restriction on exception
handlers and empty alternatives is rather arbitrary, and may be changed
later. 
<DT><CODE>TAIL</CODE>
<DD>
This causes tail recursive calls to be inlined.  Without this, tail
recursion elimination will not be performed. 
<DT><CODE>OTHER</CODE>
<DD>
This causes other calls to be inlined wherever possible.  Unless the
<CODE>MULTI</CODE> inlining is also specified, this will be done only
for productions that are called once. 
<DT><CODE>MULTI</CODE>
<DD>
This causes calls to be inlined, even if the rule being called is
called more than once.  Turning this inlining on implies <CODE>OTHER</CODE>.
Similarly turning off <CODE>OTHER</CODE> inlining will turn off 
<CODE>MULTI</CODE> inlining.  For grammars of any size, this is probably
best avoided; if used the generated parser may be huge (e.g. a C grammar
has produced a file that was several hundred MB in size). 
<DT><CODE>ALL</CODE>
<DD>
This turns on all inlining. 
</DL>
</P>
<P>
In addition, prefixing a word with <CODE>NO</CODE> turns off that
inlining phase.  The words may be given in any case.  They are evaluated
in the order given, so: 
<PRE>
        --inline noall,singles
</PRE>
would turn on single alternative rule inlining only, whilst: 
<PRE>
        --inline singles,noall
</PRE>
would turn off all inlining.  The default is as if <CODE>sid</CODE>
were invoked with the option: 
<PRE>
        --inline noall,basics,tail
</PRE>
</P>

<DT><A NAME="lang"><B>--language <I>LANGUAGE</I></B></A>
<DD>
This option specifies the output language.  Currently this should
be one of <CODE>ansi-c</CODE>, <CODE>pre-ansi-c</CODE> and <CODE>test</CODE>.
The default is <CODE>ansi-c</CODE>. 
<P>
The <CODE>ansi-c</CODE> and <CODE>pre-ansi-c</CODE> languages are
basically the same.  The only difference is that <CODE>ansi-c</CODE>
initially uses function prototypes, and <CODE>pre-ansi-c</CODE> doesn't.
The C language specific options are: 
</P>
<P>
<DL>

<DT><CODE>prototypes, proto, no-prototypes, no-proto</CODE>
<DD>
These enable or disable the use of function prototypes.  By default
this is enabled for <CODE>ansi-c</CODE> and disabled for <CODE>pre-ansi-c</CODE>.
<P>

<DT><CODE>numeric-ids, numeric, no-numeric-ids, no-numeric</CODE>
<DD>
These enable or disable the use of numeric identifiers.  Numeric identifiers
replace the identifier name with a number, which is mainly of use
in stopping identifier names getting too long.  The disadvantage is
that the code becomes less readable, and more difficult to debug.
Numeric identifiers are not used by default. 
<P>

<DT><A NAME="casts"><CODE>casts, cast, no-casts, no-cast</CODE></A>
<DD>
These enable or disable casting of action and assignment operator
immutable parameters.  If enabled, a parameter is cast to its own
type when it is substituted into the action.  This will cause some
compilers to complain about attempts to modify the parameter (which
can help pick out attempts at mutating parameters that should not
be mutated).  The disadvantage is that not all compilers will reject
attempts at mutation, and that ISO C doesn't allow casting to structure
and union types, which means that some code may be illegal.  Parameter
casting is disabled by default. 
<P>

<DT><CODE>unreachable-macros, unreachable-macro, unreachable-comments,
unreachable-comment</CODE>
<DD>
These choose whether unreachable code is marked by a macro or a comment.
The default is to mark unreachable code with a comment 
<CODE>/*UNREACHED*/</CODE>, however a macro <CODE>UNREACHED ;</CODE>
may be used instead, if desired. 
</DL>
</P>
<P>
The <CODE>test</CODE> language only takes one input file, and produces
no output file.  It may be used to check that a grammar is valid.
In conjunction with the dump file, it may be used to check the transformations
that would be applied to the grammar.  There are no language specific
options for the <CODE>test</CODE> language. 
</P>

<DT><B>--show-errors</B>
<DD>
This option writes a copy of the current error messages to the standard
output.  See the manual entry for more details about changing the
error message content. 
<P>

<DT><B>--switch <I>OPTION</I></B>
<DD>
This passes through <I>OPTION</I> as a language specific option. 
The valid options are described <A HREF="#lang">above</A>. 
<P>

<DT><B>--tab-width <I>NUMBER</I></B>
<DD>
This option specifies the number of spaces that a tab occupies.  It
defaults to 8.  It is only used when indenting output. 
<P>

<DT><B>--version</B>
<DD>
This option causes the version number and supported languages to be
written to the standard error stream. 
</DL>

<HR>
<H2><A NAME="grammar-file">5. The <CODE>sid</CODE> grammar file</A></H2>
<P>
The <CODE>sid</CODE> grammar file should always be the first input
file specified on the command line.  It provides an output language
independent description of the language to be recognised.  The file
is split up into a number of different sections, each of which begins
with a section header.  All sections must be included, although it
is possible to leave most of them empty.  The following sections exist
at present: type declaration, terminal declaration, rule definition,
and grammar entry points.  They must appear in that order.  The sections
are detailed below, after the lexical conventions. 
</P>

<H3><A NAME="grammar-lex">5.1. Lexical conventions</A></H3>
<P>
Almost all non-printable characters (but definitely space, tab, newline
and form-feed) are considered to be white space, and are ignored other
than to separate other tokens.  In addition, two comment forms are
recognised: the C comment syntax, where comments begin with 
<CODE>/*</CODE>, and end with <CODE>*/</CODE>, although unlike C these
comments nest properly; and the C++ line comment syntax, where comments
begin with <CODE>//</CODE>, and are terminated at the end of the line.
Comments are treated in the same way as white space characters. 
</P>
<P>
Section headers are enclosed in percent characters, and are case insensitive.
The following section headers are currently recognised: 
<PRE>
        %types%
        %terminals%
        %productions%
        %entry%
</PRE>
</P>
<P>
Identifiers must begin with a letter, an underscore or a hyphen. 
This may be followed by zero or more letters, digits, underscores
and hyphens. Identifiers are case sensitive.  The following are all
legal identifiers: 
<PRE>
        expression      mult-expr       plus_expr       expr-1
</PRE>
Identifiers are split into two namespaces: local names have one space;
types, actions, rules, non-local names and terminals share the other
namespace, so it is not possible to have an identifier that is a type
as well as being a rule for example. 
</P>
<P>
The following symbols are also used: 
<PRE>
        &amp;   ;       =       [       :       ]       !       ,
        ||      $       ?       {       }       (       )       &lt;
        &gt;    -&gt;   ::      ##
</PRE>
All other characters are illegal. 
</P>

<H3><A NAME="type-decl">5.2. The type declaration section</A></H3>
<P>
The first section is the type declaration section.  This begins with
the types section header, followed by the names of all of the types
used by the grammar.  Each type name should be terminated by a semicolon.
An example of the type declaration section follows: 
<PRE>
        %types%
        NumberT ;
        StringT ;
</PRE>
This declares two types: <CODE>NumberT</CODE> and <CODE>StringT</CODE>.
There is no requirement for the type names to resemble names in the
target language (in fact this should be avoided, as it is possible
to use many different target languages).  All types used in the grammar
must be declared here.  Similarly, all types declared here must be
used in the grammar. 
</P>

<H3><A NAME="term-decl">5.3. The terminal declaration section</A></H3>
<P>
After the type declaration section comes the terminal declaration
section.  This section declares the terminals that will be used by
the grammar.  A terminal is a recogniser for a symbol in the input
alphabet of the language to be recognised.  It is possible to declare
terminals that are not used by the grammar. 
</P>
<P>
The section begins with its section header, followed by the declarations
of the terminals.  Each terminal declaration begins with the name
of the terminal being defined, followed by its type, and terminated
by a semicolon.  If the terminal is not used in the grammar, the declaration
should be preceded by a <CODE>!</CODE> symbol. 
</P>
<P>
A type (for terminals, rules and actions) is written as a colon, followed
by the parameter tuple, followed by the <CODE>-&gt;</CODE>
symbol, followed by the result tuple.  If the type is zero-tuple to
zero-tuple, then the type may be omitted.  A tuple consists of a comma
separated sequence of name-type pairs (with the name and type being
separated by a colon), surrounded by parentheses.  For parameter tuples,
the type may be suffixed by a <CODE>&amp;</CODE> symbol, indicating
that call by reference should be used (the default is call by copy).
For declarations, the names should be omitted.  For terminals, the
parameter type must be the zero-tuple. 
</P>
<P>
The simplest type of terminal declaration is as follows: 
<PRE>
        terminal1 ;
</PRE>
This means the same as: 
<PRE>
        terminal1 : () -&gt; () ;
</PRE>
An example of a more complex terminal declaration is: 
<PRE>
        terminal2 : () -&gt; ( :StringT ) ;
</PRE>
If these terminals were not to be used in the grammar, they should
be declared as: 
<PRE>
        !terminal1 ;
        !terminal2 : () -&gt; ( :StringT ) ;
</PRE>
</P>

<H3><A NAME="rule-defn">5.4. The rule definition section</A></H3>
<P>
The rule definition section follows the terminal declaration section.
It begins with the section header, followed by the definitions and
declarations of all of the rules used in the grammar, and the declarations
of all of the actions used in the grammar. 
</P>
<P>
Rule declarations look identical to terminal declarations, e.g.: 
<PRE>
        rule1 : ( :NumberT ) -&gt; ( :NumberListT ) ;
        rule2 ;
</PRE>
Action declarations are similar, although the names are surrounded
by angle brackets, e.g.: 
<PRE>
        &lt;action1&gt; : ( :StringT &amp; ) -&gt; () ;
        &lt;action2&gt; ;
</PRE>
A declaration (or a definition) may be prefixed with the <CODE>::</CODE>
symbol.  This symbol forces the definition into the outermost scope.
Scopes are described later on. 
</P>
<P>
A rule definition (called a production) looks something like the following:
<PRE>
        add-expr : () -&gt; ( v : NumberT ) = {
                v1 = mul-expr ;
                plus ;
                v2 = expr ;
                v  = &lt;add&gt; ( v1, v2 ) ;
            ||
                v1 = mul-expr ;
                minus ;
                v2 = expr ;
                v  = &lt;subtract&gt; ( v1, v2 ) ;
            ##
                v = &lt;ex&gt; ;
        } ;
</PRE>
The production begins with the rule name, followed by the parameter
and result names and types (in this case, the rule name is 
<CODE>add-expr</CODE>, there are no parameters, and there is one result
name <CODE>v</CODE> of type <CODE>NumberT</CODE>).  This may optionally
be followed by local declarations (there are none here - they are
described later). 
</P>
<P>
The left hand side of the rule is followed by the <CODE>=</CODE> symbol,
a list of alternatives surrounded by curly braces, and is terminated
by a semicolon.  The alternatives are separated by the <CODE>||</CODE>
symbol, and the last alternative may be separated from its predecessor
(there must be one) using the <CODE>##</CODE> symbol; if this is the
case, then this alternative is the exception handler for the production
(otherwise it has no exception handler). 
</P>
<P>
An alternative may match the empty string, using the symbol <CODE>$</CODE>
and the terminator symbol <CODE>;</CODE>, i.e.: 
<PRE>
        $ ;
</PRE>
unless it is an exception handler alternative (in which case it must
do something), or a sequence of items.  The empty string is only valid
if the production has no results.  If you want to match an empty string
in a production that has a result, it is necessary to use an action
(or identity) to provide a result. 
</P>
<P>
An item is an identity, or a call to a (possibly anonymous) rule,
a terminal, an action or a predicate.  An identity looks like an assignment
in conventional programming languages: 
<PRE>
        ( a, b, c ) = ( d, e, f ) ;
</PRE>
Each tuple must contain the same number of names.  In the case of
a one-tuple, the parentheses may be dropped, e.g.: 
<PRE>
        a = d ;
</PRE>
Note that this is a binding operation, not an assignment.  Each name
on the left hand side must be a new name.  It is illegal to redeclare
a name that is already in scope.  It is possible to assign to a name
which is already in scope by prefixing that name with the <CODE>&amp;</CODE>
symbol, e.g.: 
<PRE>
        ( a, &amp;b, c ) = ( d, e, f ) ;
</PRE>
would assign to the name <CODE>b</CODE>, which must have been previously
defined (it may be a parameter; if it is a call by reference parameter,
then the change will propagate outside to the calling rule). 
</P>
<P>
It is also possible to use the <CODE>!</CODE> symbol in the result
tuple to ignore results, e.g.: 
<PRE>
        ( a, !, b, ! ) = ( c, d, e, f ) ;
</PRE>
This is not particularly useful in an identity, but may be more useful
in a call to a rule, terminal or action.  A call to a terminal or
rule looks like a call to a function in a conventional programming
language, e.g.: 
<PRE>
        ( a, b ) = rule1 ( c, d ) ;
        ( e, f ) = terminal1 () ;
</PRE>
Calls to actions have the same form, except that action names are
surrounded by angle brackets, e.g.: 
<PRE>
        ( g, h, i ) = &lt;action1&gt; ( a, e ) ;
</PRE>
In addition, one of the names in the result tuple of the call to the
action may be the predicate result symbol <CODE>?</CODE>, in which
case the action is used as a predicate (more details on predicates
are given later). 
</P>
<P>
When calling a rule, terminal or action, it is necessary to have declared
it (or in the case of a rule declared or defined it) before the call.
</P>
<P>
If a rule or action is being invoked, and it takes one or more call
by reference parameters, then the corresponding arguments should be
prefixed by the <CODE>&amp;</CODE> symbol, e.g.: 
<PRE>
        length = &lt;string-length&gt; ( &amp;string ) ;
</PRE>
If the rule, terminal or action has the zero-tuple as a result, then
only the right hand side of the definition is required, e.g.: 
<PRE>
        rule2 ( a, b ) ;
        terminal2 () ;
        &lt;action2&gt; ( c, d ) ;
</PRE>
If the rule, terminal or action has the zero-tuple as a parameter,
then the parameter tuple may be omitted, e.g.: 
<PRE>
        ( a, b ) = rule3 ;
        terminal3 ;
        c = &lt;action3&gt; ;
</PRE>
In older versions of <CODE>sid</CODE>, it used to be possible to have
ambiguous items, e.g.: 
<PRE>
        a = b ;
</PRE>
where <CODE>b</CODE> was both a rule and a name.  As local names may
not shadow non-local and global names, then this is no longer a problem.
</P>
<P>
In each case, the data flow through the rule is indicated using names.
In the previous example of a production, both alternatives have the
same data flow: the call to <CODE>mul-expr</CODE> returns one value,
which is stored in the name <CODE>v1</CODE>, and the call to <CODE>expr</CODE>
returns one value, which is stored in the name <CODE>v2</CODE>.  Both
of these names are passed to the action (<CODE>add</CODE> in the first
alternative, <CODE>subtract</CODE> in the second), which returns a
value into the name <CODE>v</CODE> (presumably the sum or difference
of the previous two values).  The name <CODE>v</CODE> is the name
of the result, so this will be returned as the result of the rule.
The exception handler (which is invoked if something fails) calls
the action 
<CODE>ex</CODE> to produce the result <CODE>v</CODE>. 
</P>
<P>
It is necessary that the types of the data flow through the production
are correct.  It is also necessary to define all of the result names
for the production in each of the alternatives in that production.
</P>
<P>
An anonymous rule is written in the same way as the body of a normal
rule, e.g.: 
<PRE>
        list : () -&gt; ( l : ListT ) = {
                n = number ;
                /* START ANONYMOUS RULE */ {
                        ? = &lt;at-eof&gt; ;
                        l = &lt;make-list&gt; ( n ) ;
                    ||
                        comma ;
                        l1 = list ;
                        l  = &lt;cons&gt; ( n, l1 ) ;
                    ##
                        l = &lt;com-or-eof&gt; ( n ) ;
                } ; /* END ANONYMOUS RULE */
        } ;
</PRE>
An anonymous rule is always inlined. 
</P>
<P>
The rule name may be followed by a sequence of definitions and declarations
surrounded by the <CODE>[</CODE> and <CODE>]</CODE> symbols (which
are followed by the rest of the rule).  In this case, the definitions
are local to the rule, e.g.: 
<PRE>
        x-list [
                x = {
                        terminal1 ;
                        terminal2 ;
                    ||
                        terminal3 ;
                } ;
        ] = {
                x ;
            ||
                x ;
                x-list ;
        } ;
</PRE>
In this case, the rule <CODE>x</CODE> would be local to the rule 
<CODE>x-list</CODE> and no other rule would be able to use it.  In
error messages, the name of the rule <CODE>x</CODE> would be written
as 
<CODE>x-list::x</CODE>.  All declarations and definitions that occur
inside of the <CODE>[</CODE> and <CODE>]</CODE> symbols have the scope
of the enclosing rule, unless they are preceded by the <CODE>::</CODE>
symbol, in which case they have global scope.  This is particularly
necessary for actions, as actions can only be defined with global
scope. 
</P>
<P>
It is also possible to define non-local names.  These are declared
as an identifier (the name), followed by the <CODE>:</CODE> symbol,
followed by another identifier (its type), in a similar manner to
an entry in a type tuple.  Non-local names are not allowed at the
outermost level (so they may not be prefixed with the <CODE>::</CODE>
symbol either).  When a non-local name is defined, it is in scope
for all of the rules in its scope that are defined after it is, plus
its defining rule. 
</P>
<P>
Non-local names have their values saved on entry to their defining
rule, and the value will be restored when the rule is exited.  This
includes exiting the rule tail recursively or because of an exception
(if the rule has an exception handler, the non-local name will not
be restored until the exception handler has exited).  In almost all
other respects non-local names are the same as local names.  An example
follows: 
<PRE>
        rule1 [
                name1 : Type1T ;
                rule1_1 = {
                        &lt;action1&gt; ( name1 ) ;
                        rule2 ( name1 ) ;
                } ;
        ] = {
                &lt;action2&gt; ( &amp;name1 ) ;
                rule1_1 ;
        } ;
</PRE>
It is possible to associate an initialiser with a non-local name,
by following the type name with a <CODE>=</CODE> symbol and the action
name in angle brackets, e.g.: 
<PRE>
        rule1 [
                name1 : Type1T = &lt;action1&gt; ;
        ] = {
                // ....
        } ;
</PRE>
In this case the action is called at the start of the rule to initialise
the non-local name.  The action should return an object of the same
type as the non-local name.  Normally, the action takes no parameters,
however it may take one parameter of the same type as the non-local
name (or a reference to that type), in which case it will be given
the saved value of the non-local name as an argument (this may be
used to build a stack automatically for example). 
</P>

<H3><A NAME="entry-point">5.5. The grammar entry points section</A></H3>
<P>
The final section lists the entry points to the grammar.  It begins
with the section header, followed by a comma separated list of rule
names, terminated by a semicolon, e.g.: 
<PRE>
        %entry% expr ;
</PRE>
If you are going to use a rule as an entry point into the grammar
(i.e. you wish to call the associated function), you must list it
in the entry points list.  If not, the function may not exist. 
</P>

<HR>
<H2><A NAME="info-file">6. The C information file</A></H2>
<P>
The grammar specification itself is not sufficient to produce a parser.
There also needs to be output language specific information to allow
the parser to interface with the program it is to be part of.  In
the case of the C output routines, <CODE>sid</CODE> needs to know
the following information: 
<OL>

<LI>What code should precede and succeed the automatically generated
code. 
<LI>How to map the <CODE>sid</CODE> identifiers into C identifiers.

<LI>How to do assignments for each type. 
<LI>How to get the current terminal number. 
<LI>How to get the result of the current terminal. 
<LI>How to advance the lexical analyser, to get the next terminal.

<LI>What the actions are defined as, and how to pass parameters to
them. 
<LI>How to save and restore the current terminal when an error occurs.

</OL>
Eventually almost all of this should be user suppliable.  At the moment,
some of the information is supplied by the user in the C information
file, some through macros, and some is built in.  <CODE>sid</CODE>
currently gets the information as follows: 
</P>
<P>
1.  The C information file has a header and a trailer section, which
define code that precedes and succeeds the code that <CODE>sid</CODE>
generates. 
</P>
<P>
2.  The C information file has a section that allows the user to specify
mappings from <CODE>sid</CODE> identifiers into C identifiers.  These
are only valid for the following types of identifiers: types, functions
(implementations of rules) and terminals.  For other identifier types
(or when no mapping is supplied), <CODE>sid</CODE> uses some default
rules: 
<P>
Firstly, <CODE>sid</CODE> applies a transform to the <CODE>sid</CODE>
identifier name, to make it a legal C identifier.  At present this
maps 
<CODE>_</CODE> to <CODE>__</CODE>, <CODE>-</CODE> to <CODE>_H</CODE>
and 
<CODE>:</CODE> (this occurs in scoped names) to <CODE>_C</CODE>. 
All other characters are left unmodified.  This transform cannot be
changed. 
</P>
<P>
<CODE>sid</CODE> also puts a prefix before all identifiers, to try
to prevent clashes (and also to make automatically generated - i.e.
numeric - identifiers legal).  These prefixes can be redefined for
each class of identifier, in the C information file.  They should
be chosen so as not to clash with any other identifiers (i.e. no other
identifiers should begin with that prefix). 
</P>
<P>
By default, the following prefixes are used: 
</P>
<P>
<DL>

<DT><CODE>ZT</CODE>
<DD>
This prefix is used before type identifiers, for the type name itself.

<DT><CODE>ZR</CODE>
<DD>
This prefix is used before rule identifiers, for the rule's implementation
function. 
<DT><CODE>ZL</CODE>
<DD>
This prefix is used before rule identifiers, for the rule's label
when tail recursion is being eliminated.  In this case, a number is
added to the suffix before the identifier name, to prevent clashes
when a rule is inlined twice in the same function.  It is also used
before other labels that are automatically generated and are just
numbered. 
<DT><CODE>ZI</CODE>
<DD>
This prefix is used before name identifiers used as parameters to
functions, or in normal usage.  It is also used by non-local names
(which doesn't cause a problem as they always occur scoped, and local
names never do). 
<DT><CODE>ZO</CODE>
<DD>
This prefix is used before name identifiers used as results of functions.
Results are passed as reference parameters, and this suffix is used
then.  Another identifier with the <CODE>ZI</CODE> prefix is also
used within the function, and the type reference assignment operator
is used at the end of the function to assign the results to the reference
parameters. 
<DT><CODE>ZB</CODE>
<DD>
This prefix is used before the terminal symbol names in the generated
header file. 
</DL>
</P>

<P>
3.  Normally, <CODE>sid</CODE> will do assignments using the C assignment
operator.  Sometimes, this will not do the right thing, so the user
can define a set of assignment operations for any type in the C information
file. 
</P>
<P>
4.  <CODE>sid</CODE> expects the <CODE>CURRENT_TERMINAL</CODE> macro
to be defined, and its definition should return an integer that is
the current terminal.  The macro should be an expression, not a statement.
</P>
<P>
5.  It is necessary to define how to extract the results of all terminals
in the C information file (if a terminal doesn't return anything,
then it is not necessary to define how to get the result). 
</P>
<P>
6.  <CODE>sid</CODE> expects the <CODE>ADVANCE_LEXER</CODE> macro
to be defined, and its definition should cause the lexical analyser
to read a new token.  The new terminal number should be accessible
through the 
<CODE>CURRENT_TERMINAL</CODE> macro.  On entry into the parser 
<CODE>CURRENT_TERMINAL</CODE> should give the first terminal number.
</P>
<P>
7.  All actions, and their parameter and result names are defined
in the C information file. 
</P>
<P>
8.  <CODE>sid</CODE> expects the <CODE>SAVE_LEXER</CODE> and 
<CODE>RESTORE_LEXER</CODE> macros to be defined.  The first is called
with an argument which is the error terminal value.  The macro should
save the current terminal's value, and set the current terminal to
be the error terminal value.  The second macro is called without arguments,
and should restore the saved value of the current terminal. 
<CODE>SAVE_LEXER</CODE> will never be called more than once without
a call to <CODE>RESTORE_LEXER</CODE>, so the save stack only needs
one element. 
</P>
<P>
The remainder of this section describes the layout of the C information
file.  The lexical conventions are described first, followed by a
description of the sections in the order in which they should appear.
Unlike the <CODE>sid</CODE> grammar file, not all sections are mandatory.
</P>

<H3><A NAME="info-lex">6.1. Lexical conventions</A></H3>
<P>
The lexical conventions of the C information file are very similar
to those of the <CODE>sid</CODE> grammar file.  There is a second
class of identifier: the C identifier, which is a subset of the valid
<CODE>sid</CODE> identifiers; there is also the C code block. 
</P>
<P>
A C code block begins with <CODE>@{</CODE> and is terminated by 
<CODE>@}</CODE>.  The code block consists of all of the characters
between the start and end of the code block, subject to substitutions.
All substitutions begin with the <CODE>@</CODE> character.  The following
substitutions are recognised: 
</P>
<P>
<DL>

<DT><CODE>@@</CODE>
<DD>
This substitutes the <CODE>@</CODE> character itself. 
<P>

<DT><CODE>@:</CODE><I>label</I>
<DD>
This form marks a label, which will be substituted for in the output
code.  This is necessary, because an action may be inlined into the
same function more than once.  If this happens, then without doing
label substitution there would be two identical labels in the same
scope. With label substitution, this problem is avoided.  In general,
all references to a label within an action should be prefixed with
<CODE>@:</CODE>.  This substitution may not be used in header and
trailer code blocks. 
<P>

<DT><CODE>@</CODE><I>identifier</I>
<DD>
This form marks a parameter or result identifier substitution.  If
parameter and result identifiers are not prefixed with an <CODE>@</CODE>
character, then they will not be substituted.  It is an error if the
identifier is not a parameter or a result.  Header and trailer code
blocks have no parameters or results, so it is always an error to
use identifier substitution in them.  It is an error if any of the
result identifiers are not substituted at least once. 
<P>
Result identifiers may be assigned to using this form of identifier
substitution, but parameter identifiers may not be (nor may there
address be taken - they are immutable).  To try to prevent this, parameters
that are substituted may be cast to their own type, which makes them
unmodifiable in ISO C (see the notes on the 
<A HREF="#casts"><CODE>casts</CODE> language specific option</A>).
</P>

<DT><CODE>@&amp;</CODE><I>identifier</I>
<DD>
This form marks a parameter identifier whose address is to be substituted,
but whose contents will not be modified.  The effects of modifying
the identifier are undefined.  It is an error to use this in parameter
assignment operator definitions. 
<P>

<DT><CODE>@=</CODE><I>identifier</I>
<DD>
This form marks a parameter identifier that will be modified.  For
this to be useful, the parameter should be a call by reference parameter,
so that the effect of the modification will be propagated.  This substitution
is only valid in actions (assignment operators are not allowed to
modify their parameters). 
<P>

<DT><CODE>@!</CODE>
<DD>
This form marks an exception raise.  In the generated code, a jump
to the current exception handler will be substituted.  This substitution
is only valid in actions. 
<P>

<DT><CODE>@.</CODE>
<DD>
This form marks an attempt to access the current terminal.  This substitution
is only valid in actions. 
<P>

<DT><CODE>@&gt;</CODE>
<DD>
This form marks an attempt to advance the lexical analyser.  This
substitution is only valid in actions. 
</DL>
</P>
<P>
All other forms are illegal.  Note that in the case of labels and
identifiers, no white space is allowed between the <CODE>@:</CODE>,
<CODE>@</CODE>, <CODE>@&amp;</CODE> or <CODE>@=</CODE> and the identifier
name.  An example of a code block is: 
<PRE>
        @{
                /* A code block */
                {
                        int i ;
                        if ( @param ) {
                                @! ;
                        }
                        @result = 0 ;
                        for ( i = 0 ; i &lt; 100 ; i++ ) {
                                printf ( &quot;{%d}\n&quot;, i ) ;
                                @result += i ;
                        }
                        @=param += @result ;
                        if ( @. == TOKEN_SEMI ) {
                                @&gt; ;
                        }
                }
        @}
</PRE>
</P>

<H3><A NAME="prefixes">6.2. The prefixes section</A></H3>
<P>
The first section in the C information file is the prefix definition
section.  This section is optional.  It begins with the section header,
followed by a list of prefix definitions.  A prefix definition begins
with the prefix name, followed by a <CODE>=</CODE> symbol, followed
by a C identifier that is the new prefix, and terminated by a semicolon.
The following example shows all of the prefix names, and their default
values: 
<PRE>
        %prefixes%
        type            = ZT ;
        function        = ZR ;
        label           = ZL ;
        input           = ZI ;
        output          = ZO ;
        terminal        = ZB ;
</PRE>
</P>

<H3><A NAME="maps">6.3. The maps section</A></H3>
<P>
The section that follows the prefixes section is the maps section.
This section is also optional.  It begins with its section header,
followed by a list of identifier mappings.  An identifier mapping
begins with a 
<CODE>sid</CODE> identifier (either a type, a rule or a terminal),
followed by the <CODE>-&gt;</CODE> symbol, followed by the C identifier
it is to be mapped to, and terminated by a semicolon.  An example
follows: 
<PRE>
        %maps%
        NumberT         -&gt; unsigned ;
        calculator      -&gt; calculator ;
</PRE>
Note that it is not possible to map type identifiers to be arbitrary
C types.  It will be necessary to <CODE>typedef</CODE> or macro define
the type name in the C file. 
</P>
<P>
It is recommended that all types, terminals and entry point rules
have their names mapped in this section, although this is not necessary.
If the names are not mapped, they will have funny names in the rest
of the program. 
</P>

<H3><A NAME="header">6.4. The header section</A></H3>
<P>
After the maps section comes the header section.  This begins with
the section header, followed by a code block, followed by a comma,
followed by a second code block, and terminated with a semicolon.
The first code block will be inserted at the beginning of the generated
parser file; the second code block will be inserted at the start of
the generated header file.  An example is: 
<PRE>
        %header% @{
        #include &quot;lexer.h&quot;

        LexerT token ;

        #define CURRENT_TERMINAL        token.t
        #define ADVANCE_LEXER           next_token ()

        extern void terminal_error () ;
        extern void syntax_error () ;
        @}, @{
        @} ;
</PRE>
</P>

<H3><A NAME="assign">6.5. The assignments section</A></H3>
<P>
The assignments section follows the header section.  This section
is optional.  Normally, assignment between two identifiers will be
done using the C assignment operator.  In some cases this will not
do the correct thing, and it is necessary to do the assignment differently.
All types for which this applies should have an entry in the assignments
section.  The section begins with its header, followed by definitions
for each type that needs its own assignment operator.  Each definition
should have one parameter, and one result.  The action's name should
be the name of the type.  An example follows: 
<PRE>
        %assignments%

        ListT : ( l1 ) -&gt; ( l2 ) = @{
                if ( @l2.head = @l1.head ) {
                        @l2.tail = @l1.tail ;
                } else {
                        @l2.tail = &amp;( @l2.head ) ;
                }
        @} ;
</PRE>
If a type has an assignment operator defined, it must also have a
parameter assignment operator type defined and a result assignment
operator defined (more precisely it must have either no assignment
operations defined, or all three assignment operations defined). 
</P>

<H3><A NAME="param-assign">6.6. The parameter assignments section</A></H3>
<P>
The parameter assignments section is very similar to the assignments
section (which it follows), and is also optional.  If a type has an
assignment section entry, it must have a parameter assignment entry
as well. 
</P>
<P>
The parameter assignment operator is used in function calls to ensure
that the object is copied correctly: if no parameter assignment operator
is provided for a type, the standard C call by copy mechanism is used;
if a parameter assignment operator is provided for a type, then the
address of the object is passed by the calling function, and the called
function declares a local of the same type, and uses the parameter
assignment operator to copy the object (this should be remembered
when passing parameters to entry points that have arguments of a type
that has a parameter assignment operator defined). 
</P>
<P>
The difference between the parameter assignment operator and the assignment
operator is that the parameter identifier to the parameter assignment
operator is a pointer to the object being manipulated, rather than
the object itself.  An example reference assignment section is: 
<PRE>
        %parameter-assignments%

        ListT : ( l1 ) -&gt; ( l2 ) = @{
                if ( @l2.head = @l1-&gt;head ) {
                        @l2.tail = @l1-&gt;tail ;
                } else {
                        @l2.tail = &amp;( @l2.head ) ;
                }
        @} ;
</PRE>
</P>

<H3><A NAME="result-assign">6.7. The result assignments section</A></H3>
<P>
The result assignments section is very similar to the assignments
section and the parameter assignments section (which it follows),
and is also optional.  If a type has an assignment section entry,
it must also have a result assignment entry.  The only difference
between the two is that the result identifier of the result assignment
operation is a pointer to the object being manipulated, rather than
the object itself. Result assignments are only used when the results
of a rule are assigned back through the reference parameters passed
into the function.  An example result assignment section is: 
<PRE>
        %result-assignments%

        ListT : ( l1 ) -&gt; ( l2 ) = @{
                if ( @l2-&gt;head = @l1.head ) {
                        @l2-&gt;tail = @l1.tail ;
                } else {
                        @l2-&gt;tail = &amp;( @l2-&gt;head ) ;
                }
        @} ;
</PRE>
</P>

<H3><A NAME="term-result">6.8. The terminal result extraction section</A></H3>
<P>
The terminal result extraction section follows the reference assignment
section.  It defines how to extract the results from terminals.  The
section begins with its section header, followed by the terminal extraction
definitions. 
</P>
<P>
There must be a definition for every terminal in the grammar that
returns a result.  It is an error to include a definition for a terminal
that doesn't return a result.  The result of the definition should
be the same as the result of the terminal.  An example of the terminal
result extraction section follows: 
<PRE>
        %terminals%

        number : () -&gt; ( n ) = @{
                @n = token.u.number ;
        @} ;

        identifier : () -&gt; ( i ) = @{
                @i = token.u.identifier ;
        @} ;

        string : () -&gt; ( s ) = @{
                @s = token.u.string ;
        @} ;
</PRE>
</P>

<H3><A NAME="action-defn">6.9. The action definition section</A></H3>
<P>
The action definition section follows the terminal result extractor
definition section.  The format is similar to the previous sections:
the section header followed by definitions for all of the actions.
An action definition has the following form: 
<PRE>
        &lt;action-name&gt; : ( parameters ) -&gt; ( results ) = code-block ;
</PRE>
This is similar to the form of all previous definitions, except that
the name is surrounded in angle brackets.  What follows is also true
of the other definitions as well (unless they state otherwise). 
</P>
<P>
The <CODE>action-name</CODE> is a <CODE>sid</CODE> identifier that
is the name of the action being defined; <CODE>parameters</CODE> is
a comma separated list of C identifiers that will be the names of
the parameters passed to the action, and <CODE>results</CODE> is a
comma separated list of C identifiers that will be the names of the
result parameters passed to the action.  The <CODE>code-block</CODE>
is the C code that defines the action.  It is expected that this will
assign a valid result to each of the result identifier names. 
</P>
<P>
The parameter and result tuples have the same form as in the language
independent file, except that the types are optional.  Like the language
independent file, if the type of an action is zero-tuple to zero-tuple,
then the type can be omitted, e.g.: 
<PRE>
        &lt;action&gt; = @{ /* .... */ @} ;
</PRE>
An example action definition section is: 
<PRE>
        %actions%

        &lt;add&gt; : ( v1, v2 ) -&gt; ( v3 ) = @{
                @v3 = @v1 + @v2 ;
        @} ;

        &lt;subtract&gt; : ( v1 : NumberT, v2 : NumberT ) -&gt; ( v3 : NumberT ) = @{
                @v3 = @v1 - @v2 ;
        @} ;

        &lt;multiply&gt; : ( v1 : NumberT, v2 ) -&gt; ( v3 ) = @{
                @v3 = @v1 * @v2 ;
        @} ;

        &lt;divide&gt; : ( v1, v2 ) -&gt; ( v3 : NumberT ) = @{
                @v3 = @v1 / @v2 ;
        @} ;

        &lt;print&gt; : ( v ) -&gt; () = @{
                printf ( &quot;%u\n&quot;, @v ) ;
        @} ;

        &lt;error&gt; = @{
                fprintf ( stderr, &quot;ERROR\n&quot; ) ;
                exit ( EXIT_FAILURE ) ;
        @} ;
</PRE>
Do not define static variables in action definitions; if you do, you
will get unexpected results.  If you wish to use static variables
in actions definitions, then define them in the header block. 
</P>

<H3><A NAME="trailer">6.10. The trailer section</A></H3>
<P>
After the action definition section comes the trailer section.  This
has the same form as the header section.  An example is: 
<PRE>
        %trailer% @{
        int main ()
        {
                next_token () ;
                calculator ( NULL ) ;
                return 0 ;
        }
        @}, @{
        @} ;
</PRE>
The code blocks will be appended to the generated parser, and the
generated header file respectively. 
</P>

<HR>
<H2><A NAME="predicate">7. Predicates</A></H2>
<P>
Predicates provide the user with a mechanism for altering the control
flow in a manner that terminals alone cannot do. 
</P>
<P>
During the factorisation process, rules that begin with predicates
are expanded if necessary to ensure that predicates that may be used
to select which alternative to go down always begin the alternative,
e.g.: 
<PRE>
        rule1 = {
                rule2 ;
                /* .... */
            ||
                /* .... */
        } ;

        rule2 = {
                ? = &lt;predicate&gt; ;
                /* .... */
            ||
                /* .... */
        } ;
</PRE>
would be expanded into: 
<PRE>
        rule1 = {
                ? = predicate ;
                /* .... */
                /* .... */
            ||
                /* .... */
                /* .... */
            ||
                /* .... */
        } ;
</PRE>
Also, if a predicate is used to select which alternative to use, it
must be the first thing in the alternative, so the following would
not be allowed: 
<PRE>
        rule = {
                &lt;action&gt; ;
                ? = &lt;predicate&gt; ;
                /* .... */
            ||
                /* .... */
        } ;
</PRE>
When predicates begin a rule, they are executed (in some arbitrary
order) until one of them returns true.  The alternative that this
predicate begins is then selected.  If no predicates return true,
then one of the remaining alternatives is selected based upon the
current terminal (or an error occurs). 
</P>
<P>
It is important that predicates do not contain dependencies upon the
order of evaluation.  In practice, predicates are likely to be simple,
so this shouldn't be a problem. 
</P>
<P>
When predicates are used within an alternative, they behave like terminals.
If they evaluate to true, then parsing continues.  If they evaluate
to false, then an exception is raised. 
</P>

<HR>
<H2><A NAME="exception">8. Error handling</A></H2>
<P>
If the input given to the parser is valid, then the parser will not
need to produce any errors.  Unfortunately this is not always the
case, so 
<CODE>sid</CODE> provides a mechanism for handling errors. 
</P>
<P>
When an error occurs, an exception is raised.  This passes control
to the nearest enclosing exception handler.  If there is no exception
handler at all, the entry point function will return with the current
terminal set to the error value. 
</P>
<P>
An exception handler is just an alternative that is executed when
a terminal or predicate fails.  This should obviate the need to rely
upon language specific mechanisms (such as <CODE>setjmp</CODE> and
<CODE>longjmp</CODE>) for error recovery. 
</P>

<HR>
<H2><A NAME="call-by-ref">9. Call by reference</A></H2>
<P>
The default behaviour of <CODE>sid</CODE> is to do argument passing
using call by copy semantics, and to not allow mutation of parameters
of rules and actions (however inlined rules, and rules created during
factoring have call by reference parameters).  However it is possible
to give rule and action parameters call by reference semantics, using
the 
<CODE>&amp;</CODE> symbol in the type specification (as described
earlier).  It is also possible to mutate parameters of actions, using
the <CODE>@=</CODE> substitution in the action body (also described
earlier).  It is important to do the correct substitutions in action
definitions, as <CODE>sid</CODE> uses this information to decide where
it can optimise the output code. 
</P>
<P>
If a call by copy parameter is mutated, then <CODE>sid</CODE> will
introduce a new temporary variable and copy the parameter into it
- this temporary will then be mutated.  Similar code will be output
for rules that have call by copy parameters that are mutated (e.g.
as a call by reference argument to an action that mutates its parameters).
</P>

<HR>
<H2><A NAME="call-entry">10. Calling entry points</A></H2>
<P>
When calling a function that implements an entry point rule, it should
be called with the rule's parameters as the first arguments, followed
by the addresses of the rule's results as the remaining arguments.
The parameters should have their addresses passed if they are of a
type that has a parameter assignment operator defined, or if the parameter
is a call by reference parameter. 
</P>
<P>
For example, given the following rule: 
<PRE>
        rule1 : ( :Type1T, :Type2T, :Type3T &amp; ) -&gt; ( :Type4T ) ;
</PRE>
where <CODE>Type2T</CODE> has a parameter assignment operator defined,
and <CODE>rule1</CODE> is mapped to <CODE>rule1</CODE> (and the type
names are mapped to themselves), the call would be something like:
<PRE>
        Type1T a = make_type1 () ;
        Type2T b = make_type2 () ;
        Type3T c = make_type3 () ;
        Type4T d ;

        rule1 ( a, b, &amp;c, &amp;d ) ;
</PRE>
</P>

<HR>
<H2><A NAME="glossary">11. Glossary</A></H2>
<P>
This section describes some of the terms used in the <CODE>sid</CODE>
documentation. 
<DL>

<DT><B>Alternative</B>
<DD>
An alternative is a sequence of items. 
<P>

<DT><B>Exception handler</B>
<DD>
An exception handler is a special type of alternative.  Each rule
may have at most one exception handler.  An exception handler is invoked
if the current terminal does not match any of the expected terminals,
if a predicate fails, or if an action raises an exception, within
the scope of the exception handler. 
<P>

<DT><B>Expansion</B>
<DD>
This is the process of physically inlining a rule into another rule.
It is done during the factoring process to turn the grammar into a
form that a parser can be produced for.  See the entry for factoring.
<P>

<DT><B>Factoring</B>
<DD>
This is one of the transforms that <CODE>sid</CODE> performs on the
grammar. See the <A HREF="#overview">overview section</A> for a description
of the factoring process. 
<P>

<DT><B>First set</B>
<DD>
The first set of a rule (or alternative) is the set of terminals and
predicates that can start that rule (or alternative). 
<P>

<DT><B>Follow set</B>
<DD>
The follow set of a rule is the set of terminals and predicates that
can follow the rule in any of its invocations. 
<P>

<DT><B>Inlining</B>
<DD>
This is the process of outputting the code for parsing one rule within
the function that parses another rule.  This is normally done as part
of the output process.  Expansion is a form of inlining performed
during the factoring process, but the inlining is done by modifying
the grammar, rather than as part of the output phase. 
<P>

<DT><B>Item</B>
<DD>
An item is the equivalent of a statement in a conventional programming
language.  It can be an invocation of a rule, terminal, action or
predicate, or an identity operation (assignment). 
<P>

<DT><B>Name</B>
<DD>
A name is an identifier that is used to pass information between rules
and actions.  Local names are defined within a rule, and only exist
within the rule itself.  Non-local names are defined in a rule's scoped
definitions section and exists in all of the rules in that scope.
Non-local rules are also saved and restored across calls to the rule
that defines them. 
<P>

<DT><B>Recursion</B>
<DD>
Recursion is where a rule invokes itself.  Direct recursion is where
the rule invokes itself from one of its own alternatives; indirect
recursion is where a rule invokes another rule (which invokes another
rule etc.) which eventually invokes the original rule. 
</P>
<P>
Left recursion is a form of recursion where all of the recursive calls
occur as the first item in an alternative.  It is not possible to
produce a parser for a grammar that contains left recursions, so 
<CODE>sid</CODE> turns left recursions into right recursions.  This
process is known as left recursion elimination. 
</P>
<P>
Right recursion is a form of recursion where all of the recursive
calls occur as the last item in an alternative. 
</P>

<DT><B>Production</B>
<DD>
See rule. 
<P>

<DT><B>Rule</B>
<DD>
A rule is a sequence of alternatives.  A rule may contain a special
alternative that is used as an exception handler.  A rule is also
referred to as a production; this term is normally used when talking
about the definition of a rule. 
<P>

<DT><B>See-through</B>
<DD>
A rule is said to be see-through if there is an expansion of the rule
that does not contain any terminals or predicates. 
</DL>
</P>

<HR>
<H2><A NAME="errors">12. Understanding error messages</A></H2>
<P>
This section tries to explain what some of the error messages that
are reported by the <CODE>sid</CODE> transforms mean.  It does not
contain descriptions of messages like &quot;type 'some type' is unknown&quot;,
as these should be self-explanatory. 
</P>

<H3><A NAME="left-errors">12.1. Left recursion elimination errors</A></H3>

<P>
<B>The parameter or result types of the left recursive calls in the
following productions do not match: <I>PRODUCTIONS</I></B>: This means
that there is a set of rules which call each other left recursively
(i.e. the first item in some of the alternatives in each rule is a
call to another rule in the set), and they do not all have the same
parameter and result types, e.g.: 
<PRE>
        rule1 : ( a : Type1T, b : Type1T, c : Type2T, d : Type2T ) -&gt; () = {
                rule2 ( a, b ) ;
            ||
                terminal1 ;
        } ;

        rule2 : ( a : Type1T, b : Type2T ) -&gt; () = {
                rule1 ( a, a, b, b ) ;
            ||
                terminal2 ;
        } ;
</PRE>
</P>

<P>
<B>The exception handlers in the left recursion involving the following
productions do not match: <I>PRODUCTIONS</I></B>: This means that
there is a set of productions which call each other left recursively
(i.e. the first item in an alternative is a call to another production
in the set), and they do not all have the same exception handler,
e.g.: 
<PRE>
        rule1 = {
                rule2 ;
            ||
                terminal1 ;
            ##
                &lt;action1&gt ;
        } ;

        rule2 = {
                rule1 ;
            ||
                terminal2 ;
            ##
                &lt;action2&gt ;
        } ;
</PRE>
It is quite likely that when using exception handlers, it may be necessary
to do the left recursion elimination manually to ensure that the exception
handlers occur at the correct place. 
</P>

<P>
<B>The argument names of the left recursive calls in the following
productions do not match: <I>PRODUCTIONS</I></B>: This means that
there is a set of productions which call each other left recursively
(i.e. the first item in an alternative is a call to another production
in the set), and the arguments to one of the left recursive calls
are not the same as the parameters of the calling rule, e.g.: 
<PRE>
        rule1 : ( a : Type1T, b : Type1T ) -&gt; () = {
                rule1 ( b, a ) ;
            ||
                terminal1 ;
        } ;
</PRE>
</P>

<P>
<B>A non-local name in the rule '<I>RULE</I>' is not in scope in the
rule '<I>RULE</I>' in the left recursive cycle involving the following
productions: <I>PRODUCTIONS</I></B>: This means that there is a set
of productions which call each other left recursively (i.e. the first
item in an alternative is a call to another production in the set),
and the first named rule uses a non-local name that is not in scope
in the second named rule, e.g.: 
<PRE>
        rule1 [
                name1 : Type1T ;
                rule1_1 [
                        name1_1 : Type1T ;
                ] = {
                        rule1 ;
                        &lt;action1_1&gt; ( name1_1 ) ;
                  ||
                        terminal1 ;
                } ;
        ] = {
                terminal2 ;
          ||
                rule1_1 ;
                &lt;action1&gt; ( name1 ) ;
        } ;
</PRE>
</P>

<P>
<B>The rule '<I>RULE</I>' declares non-local names in the left recursive
cycle with more than one entry point involving the following productions:
<I>PRODUCTIONS</I></B>: This means that there is a set of productions
which call each other left recursively (i.e. the first item in an
alternative is a call to another production in the set), and the named
rule defines non-local variables even though it is not the only entry
point to the cycle, e.g.: 
<PRE>
        rule1 [
                name1 : Type1T ;
                rule1_1 = {
                        &lt;action1_1&gt; ( name1 ) ;
                } ;
        ] = {
                terminal1 ;
                rule1_1 ;
            ||
                rule2 ;
                &lt;action1&gt; ( name1 ) ;
        } ;

        rule2 = {
                rule1 ;
                &lt;action2&gt ;
            ||
                terminal2 ;
        } ;
</PRE>
</P>

<P>
<B>No cycle termination for the left recursive set involving the following
rules: <I>RULES</I></B>: This means that there is a set of productions
which call each other left recursively (i.e. the first item in an
alternative is a call to another production in the set), and they
do not contain an alternative that begins with a non-left recursive
call, e.g.: 
<PRE>
        rule1 = {
                rule2 ;
            ||
                rule3 ;
        } ;

        rule2 = {
                rule1 ;
            ||
                rule3 ;
        } ;

        rule3 = {
                rule1 ;
            ||
                rule2 ;
        } ;
</PRE>
</P>

<H3><A NAME="first-errors">12.2. First set computation errors</A></H3>

<P>
<B>Cannot compute first set for <I>PRODUCTION</I></B>: This means
that <CODE>sid</CODE> cannot compute the set of terminals and predicates
that may start the production.  This is normally because there is
a recursive call (or cycle) that contains no terminals, e.g.: 
<PRE>
        rule1 = {
                &lt;action1&gt; ;
                rule1 ;
            ||
                terminal1 ;
        } ;
</PRE>
This is not removed by the left recursion elimination phase, as the
call is  not the leftmost item in the alternative. 
</P>

<P>
<B>Can see through to predicate '<I>PREDICATE</I>' in production 
<I>PRODUCTION</I></B>: This means that there is a predicate that isn't
the first item in its alternative, but is preceded only by see-through
items, e.g.: 
<PRE>
        rule1 = {
                &lt;action1&gt; ;
                ? = &lt;predicate&gt; ;
            ||
                terminal1 ;
        } ;
</PRE>
</P>

<P>
<B>Can see through to predicates in rule '<I>RULE</I>' in production
<I>PRODUCTION</I></B>: This means that the first rule has at least
one predicate in its first set, and the second rule calls it in a
position where it is not the first item in the alternative and is
preceded only by see-through items, e.g.: 
<PRE>
        rule1 = {
                ? = &lt;predicate&gt; ;
            ||
                terminal1 ;
        } ;

        rule2 = {
                &lt;action&gt; ;
                rule1 ;
            ||
                terminal2 ;
        } ;
</PRE>
</P>

<P>
<B>The rule '<I>RULE</I>' has all terminals in its first set and has
a redundant see-through alternative</B>: This means that the rule's
first set (the set of all terminals that can start the rule) includes
all possible input terminals, and the rule also has a see-through
alternative.  The see-through alternative will never be used, as one
of the other alternatives will always be chosen. 
</P>

<H3><A NAME="factor-errors">12.3. Factoring errors</A></H3>

<P>
<B>Too many productions (<I>NUMBER</I>) created during factorisation</B>:
This normally means that <CODE>sid</CODE> cannot factor the grammar.
You will need to rewrite the offending part.  Unfortunately there
is no easy way to do this.  Start by looking at the dump file for
a set of rules that seem to have been expanded a lot. 
</P>

<P>
<B>The rule '<I>RULE</I>' cannot be expanded into '<I>RULE</I>' as
the exception handlers don't match</B>: When <CODE>sid</CODE> performs
factoring, it needs to expand calls to certain rules into the rules
that calls them (this is described in the 
<A HREF="#overview">overview section</A>).  If the called rule has
an exception handler and it is not the same as the exception handler
of the calling rule, then the expansion will fail. 
</P>

<P>
<B>The rule '<I>RULE</I>' cannot be expanded into '<I>RULE</I>' as
it contains non-local name definitions</B>: When <CODE>sid</CODE>
performs factoring, it needs to expand calls to certain rules into
the rules that calls them (this is described in the 
<A HREF="#overview">overview section</A>).  If the called rule defines
any non-local names, then the expansion will fail. 
</P>

<H3><A NAME="check-errors">12.4. Checking errors</A></H3>

<P>
<B>Collision of terminal(s) <I>TERMINALS</I> in rule '<I>RULE</I>'</B>:
This error means that more than one alternative in the named rule
begins with the named terminals, e.g.: 
<PRE>
        rule1 = {
                &lt;action1&gt; ;
                terminal1 ;
            ||
                terminal1 ;
        } ;
</PRE>
Normally, the factoring process will remove the problem, but when
something like the above happens to stop the factoring occurring,
this error will be produced. 
</P>

<P>
<B>Collision of predicate '<I>PREDICATE</I>' in rule '<I>RULE</I>'</B>:
This error occurs when more than one alternative in the named rule
begins with the named predicate, e.g.: 
<PRE>
        rule1 = {
                ( a, ? ) = &lt;predicate&gt; ;
                &lt;action1&gt; ( a ) ;
            ||
                ( ?, b ) = &lt;predicate&gt; ;
                &lt;action2&gt; ( b ) ;
        } ;
</PRE>
Again, it is normally the case that the factoring process will remove
this problem, but if the same predicate uses different predicate results
in different alternatives, this error will be produced. 
</P>

<P>
<B>The terminal(s) <I>TERMINALS</I> can start rule '<I>RULE</I>' which
is see-through, and the same terminal(s) may appear in the following
situations: <I>ALTERNATIVES</I></B>: This means that there are one
or more terminals that can start the named rule (which is see-through),
and may also follow it, e.g.: 
<PRE>
        rule1 = {
                terminal1 ;
            ||
                $ ;
        } ;

        rule2 = {
                rule1 ;
                terminal1 ;
            ||
                terminal2 ;
        } ;
</PRE>
The alternatives listed are the alternatives which call the rule,
and contain (some of) the named terminals after the call.  The call
is highlighted. 
</P>

<P>
<B>The predicate(s) <I>PREDICATES</I> can start rule '<I>RULE</I>'
which is see-through and the same predicate(s) may appear in the following
situations: <I>ALTERNATIVES</I></B>: This means that there are one
or more predicates that can start the named rule (which is see-through),
and may also follow it, e.g.: 
<PRE>
        rule1 = {
                ? = &lt;predicate&gt; ;
            ||
                $ ;
        } ;

        rule2 = {
                terminal1 ;
                rule1 ;
                ? = &lt;predicate&gt; ;
            ||
                terminal2 ;
        } ;
</PRE>
The alternatives listed are the alternatives which call the rule,
and contain (some of) the named predicates after the call.  The call
is highlighted. 
</P>

<P>
<B>The rule '<I>RULE</I>' contains more than one see-through alternative</B>:
This error occurs if the rule has more than one alternative that doesn't
need to read a terminal or a predicate, e.g.: 
<PRE>
        rule1 = {
                &lt;action1&gt; ;
            ||
                &lt;action2&gt; ;
        } ;
</PRE>
</P>

<HR>
<P><I>Part of the <A HREF="../index.html">TenDRA Web</A>.<BR>Crown
Copyright &copy; 1998.</I></P>
</BODY>
</HTML>
Subversion Repositories tendra.SVN

(root)/branches/tendra4/doc/utilities/sid.html – Rev 2