P.T. Breuer and J.P. Bowen

(10 August 1994)


Welcome to PRECCX v2.42

PRECCX stands for PREttier Compiler Compiler (eXtended). PRECCX converts context-grammar definition scripts (with a .y extension) into ANSI C code scripts (with a .c extension) that can in turn be compiled into working parsers, interpreters or compilers using a standard ANSI C compiler.

Copyright notice

PRECCX v2.42, Copyright 1989-1994, P.T. Breuer


PRECCX is a compiler compiler which converts PRECCX context-grammar definition scripts (with a .y extension) into ANSI C code scripts (with a .c extension). The output code compiles under the Free Software Foundation's GNU ANSI C compiler, gcc, and will compile with other fully compliant ANSI compilers. PRECCX is coded in ANSI C generated by PRECCX itself from its own definition script, and is fully portable.

PRECCX extends the facilities of the Unix yacc utility and its PC equivalents, including those based on GNU's bison, by allowing

 * infinite lookahead in place of yacc's 1-token lookahead;
 * incremental compilation and linking of scripts arranged in separate modules;
 * full support for extended BNF descriptions, which makes for clearer and more efficient definition scripts;
 * parameterized grammar definitions, allowing context-dependent grammars;
 * variables which stand for grammars as parameters, allowing `macros' to replace repeated grammar constructions;
 * NEW in 2.42: synthesized attributes which may be seamlessly used as parse parameters during the remainder of the parse in which they were synthesized. This is a powerful linguistic device.

Lexical analysers such as lex and flex are suitable pre-filters for PRECCX. They are called in the same way that yacc calls them -- whenever a new token is required -- and lexers constructed for yacc should be `plug compatible' with PRECCX. There is also a trivial default lexer built into the utility which passes characters through to PRECCX as tokens.

It might be thought that infinite lookahead is the big technical advance offered by PRECCX, but the parameterized grammars are more significant. The higher-order capability offered by parameterization and the modular arrangement of scripts make parser specifications much more maintainable. The infinite lookahead makes the semantics declarative -- a non-terminal can be replaced by its definition without altering the semantics, a definition which is self-consistent is consistent in the context of the rest of the script, and so on.

* NEW * Attributes can now be synthesized `on the fly' and passed seamlessly into the parser, extending PRECCX's declarative programming paradigm from parameterized attribute grammar specifications to fully cover mixed and/or synthesized attributed grammars too.

* NEW * PRECCX now compiles almost wholly into C at the back end instead of interpreting for a virtual machine, with a consequent increase in robustness.


INSTALLATION INSTRUCTIONS.

a) Put the following files where your C compiler will find them:

   NEEDED BY YOU

   - HEADERS
       cc.h    /* precc header file (no inherited attributes) */
       ccx.h   /* precc header file (for inherited attributes) */

   - LIBRARY
       libcc1.a      libcc2.a      libcc4.a        (UNIX)
       preccx1C.lib  preccx2C.lib  preccx4C.lib    (DOS)
       preccx1L.lib  preccx2L.lib  preccx4L.lib    (DOS)

   I can't help you with the placing because it will be specific to your installation and your compiler. But I have the headers in \home\include for DOS and ~/include for UNIX, and the libraries in \home\c\lib for DOS and ~/c/lib for UNIX.

b) Link or copy one of the libraries to

       libcc.a      (UNIX)
       preccx.lib   (DOS)

   You should choose libcc1.a/preccx1C.lib if you are going to use 1-byte TOKENs (this is normal) and the Compact memory model for C, and another accordingly if you have a larger size or different model in mind.

c) - EXECUTABLES
       preccx[.exe]   /* the precc executable */

   Put this somewhere where your operating system will find it. That has to be in a directory named in your PATH variable. Or make a .BAT file (DOS) or .sh file (UNIX) which contains the single line

       preccx %1 %2   (DOS)
       preccx $1 $2   (UNIX)

   replacing `preccx' with the exact location of the preccx[.exe] executable. The .BAT/.sh file should then be placed somewhere in the PATH instead. I have a .BAT file in \bin\bat.

d) Read the literature:

       preccx.1    (UNIX: use nroff -man preccx.1 | more)
                   (DOS: use more < preccx.1)

   Also provided are a long technical paper in ASCII and .DVI versions. The latter should be viewable using dviscr, dvivga or other publicly available DVI previewers.

       preccx.t[xt]
       preccx.dvi

YOU ARE READY TO GO.



PRECCX - PREttier Compiler Compiler eXtended v2.42

2.42 is in the 2.x line of PRECCX utilities, which extend the 1.x line with conteXt-dependent and higher-order parsing over plain infinite lookahead (LA) parsing. The 2.2x subfamily essentially marks the change to a more `standard' LEX interface (see the HISTORY section). The 2.3x subfamily consolidates certain forward and backward compatibility changes. The 2.4x subfamily has gradually introduced full support for synthesized attributes and bidirectional data-flow between synthesized and inherited parser attributes. The 2.42 version marks a change in the back end to allow almost full compilation into C instead of run-time interpretation in a virtual machine.



PRECCX is intended to extend the Unix yacc utility, so it may be wise to get to know (or learn) about that first. But the technology is entirely different, which leads to some fundamental differences in the way that definition scripts have to be written.

One _can_ convert yacc scripts to PRECCX ones quite simply (see below) in general, with some particular points of difficulty (again, see below), but why the big differences? Couldn't PRECCX have been written to use a scripting language exactly like yacc's? Well, no. The differences are essential because otherwise PRECCX would be restricted to yacc-style semantics. PRECCX scripts cannot be converted to yacc scripts because of the extra expressiveness of the semantics involved. PRECCX does not build finite state machines.

But PRECCX is upwardly compatible with yacc. As far as possible, the PRECCX scripting language has been designed to look like an extension of yacc's, with the result that PRECCX can be thought of as yacc with parameters, arbitrarily complex compound expressions, infinite lookahead and a neater way of dealing with attributes. But the fundamental differences mean that subexpressions cannot be translated across independently of their context (yacc scripts are heavily context dependent), and some special features of yacc do not translate easily, such as precedence declarations, because they depend vitally on yacc semantics.


Yacc scripts all have the pattern

    a : b c d e ...
      | f g h i ...
      | ...
      ;

where the b, c, d, etc. may be actions, terminals or non-terminals. The way to write this for PRECCX will be

    @ a = ...
    @   | f g h i ...
    @   | b c d e ..

and the re-ordering is required to make the longest pattern come first in the definition for PRECCX, whereas it will normally appear last in the yacc script.

PRECCX is `infinite lookahead', so it will investigate each branch of the grammar to the maximum depth. But it is often the case that grammars do not have explicit termination markers (such as an ENDIF or ENDBLOCK), and then an initial segment of the token stream may satisfy the grammar specification just as well as the full stream does. To preclude this possibility, the potentially longest matches must come first in PRECCX definition scripts, to force the longest matches to be sought first. So one must write

    @ a = b c
    @   | b

where in yacc one would have written

    a : b
      | b c
      ;

(incidentally, @ a = b [c] is better style in PRECCX: more concise, clearer and more efficient). Yacc may well have advised of a shift/reduce conflict in such scripts (this depends on the context). In those terms, PRECCX can be understood to always shift (look for more tokens) instead of reduce (jump to another rule with what it's got), and it will backtrack if it eventually discovers that another interpretation is required.
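The ordering rule can be seen in a small hand-written backtracking parser of the kind PRECCX builds. This is an invented sketch (the names `match', `parse_longest_first' and so on are not PRECCX output): with the longest alternative first, the whole input "bc" is consumed; with the shortest first, the parse succeeds after "b" and leaves the "c" unread.

```c
#include <stdio.h>

/* Minimal hand-written backtracking parsers, invented for illustration.
 * Each consumes characters from s at position *pos, returning 1 on success. */

static int match(const char *s, int *pos, char c) {
    if (s[*pos] == c) { (*pos)++; return 1; }
    return 0;
}

/* a = b c | b   (longest alternative first, PRECCX style) */
static int parse_longest_first(const char *s, int *pos) {
    int save = *pos;
    if (match(s, pos, 'b') && match(s, pos, 'c')) return 1;
    *pos = save;                        /* backtrack over the failed branch */
    return match(s, pos, 'b');
}

/* a = b | b c   (shortest alternative first) */
static int parse_shortest_first(const char *s, int *pos) {
    int save = *pos;
    if (match(s, pos, 'b')) return 1;   /* succeeds early, never tries b c */
    *pos = save;
    return match(s, pos, 'b') && match(s, pos, 'c');
}

int consumed_longest(const char *s)  { int p = 0; parse_longest_first(s, &p);  return p; }
int consumed_shortest(const char *s) { int p = 0; parse_shortest_first(s, &p); return p; }
```

On input "bc" the longest-first parser consumes both tokens, while the shortest-first one stops after one token and reports a merely partial parse, exactly the hazard described above.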


The yacc built-in `error' construct has to be rendered by hand for PRECCX, since resynchronising the input stream after an error is problematic. A two-token resync would be rendered

    @ error = ? ?
    @ {: printf("...."); :}

and a skip to the end of the block would be

    @ error = >ENDBLOCK<*
    @ {: printf("...."); :}

assuming that ENDBLOCK matches a token value returned by the yylex() tokeniser. The >x<* construct means `not-an-x as often as necessary'.

PRECCX does not jump automatically to the error handler. The handler has to be installed using the following notation:

    @ foo !{error} bar gum

and then the error parser will be invoked on an attempt to backtrack through the `!' (pronounced `cut') point. The construction is logically the same as

    @ foo ! { bar gum | error }

but has the advantage of being expressed without the trailing part of the specification having to be known, and it also performs some space-saving technical manoeuvres with the C call stack.

The bare `!' mark induces a call to the default error handler on an attempt to backtrack through it. This handler is called btk_error() and is a C function (see the file on_error.c) that has direct access to the token buffer. It can be replaced either by linking in a new function or by changing the C macro that calls it. This macro is BTK_ERROR(x) (the x always has the value -1). See the section ERROR TRAPPING for more details.


PRECCX also differs in the way that it declares terminals. The yacc declaration

    %token FOO

is equivalent to a PRECCX definition

    @ _FOO_ = <FOO>

if thereafter yacc references to FOO are replaced by PRECCX references to _FOO_.


There is at present no equivalent for the declaration of yacc precedences and associativity. Instead, these have to be coded explicitly for PRECCX using the preferred ordering.


When the -old switch is used, PRECCX partially supports the yacc method of accessing attributes in actions using numerical references, e.g.

    @ sum = summand <'+'> summand {: total = $1+$3; :}

but PRECCX 2.42 no longer fully supports the style. In particular, references to $0 or lower are not tolerated (they were valid in earlier versions). The correct way to handle attributes now is using names, and the named attributes can then be dereferenced in the action, e.g.

    @ sum = summand\x <'+'> summand\y {: total = $x+$y; :}

and attributes should be used whenever possible instead of side-effecting actions. The intent in the above is to create a new attribute whose value is the sum of the summands:

    @ sum = summand\x <'+'> summand\y {@ $x+$y @}

PRECCX 2.42 no longer supports the yacc method of assigning an attribute using an action. In previous versions of PRECCX the yacc $$ notation was supported:

    @ sum = summand <'+'> summand {: $$ = $1+$3; :}   /* WRONG */

but it is no longer. Attach an attribute instead, as above.


You probably don't want to know this! But there is a difference in PRECCX between the time at which the parse occurs and the time at which actions are executed. The parse occurs first, and the actions are `built' during this phase and executed either at the end of the parse, or at an explicit `!' command in the parse definition. This is in contrast to yacc, where the parse and the execution of actions are interleaved -- but then PRECCX has to be able to backtrack across actions, and therefore cannot execute them as soon as they are encountered. I think the complication of having to remember that the two phases are distinct is more than compensated for by the infinite lookahead that it allows.

The distinction means that actions cannot be used to alter the parameters to the parse directly. Don't worry: PRECCX does pass values between the two phases correctly, so the definition

    @ foo(n) = ... {: a(n+1); :} ...

uses the n in the action that was the parameter to the parse. But one _CANNOT_ alter the n during the action and have it passed back to the parse. It is simply too late. All that happens is that a local copy of n is altered, so

    @ foo(n) = ... {: n=n+1 :} ... foo(n) ...

does *not* recurse with n+1 as the parameter to foo.

    @ foo(n) = ... foo(n+1) ...

is the correct way to pass altered parameters.
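The "local copy" behaviour is exactly C's pass-by-value rule, which the following plain-C sketch (names invented for illustration) mirrors: the action gets its own copy of n, so incrementing it cannot reach back into the parse.

```c
/* Plain C analogue of an action altering a parse parameter:
 * the callee receives a copy, so the caller's n is untouched. */

long action(long n) {
    n = n + 1;            /* alters only a local copy */
    return n;
}

long demo(void) {
    long n = 5;
    long seen = action(n);  /* the action sees and returns 6 ... */
    (void)seen;
    return n;               /* ... but the caller's n is still 5 */
}
```

This is why `foo(n+1)' in the grammar, not `n=n+1' in an action, is the way to pass an altered parameter onward.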


The equivalent to the yacc %union declaration is found in the capacity to set the type of values that PRECCX manipulates as attached attributes at runtime. Use

    # define VALUE whateveryoulike

provided sizeof(whateveryoulike) equals sizeof(long). In other words, PRECCX does not really cater for anything other than (long) integer values as attributes. For structures rather than unions, this means passing addresses rather than values, unless the structures are exceptionally brief. This fits in with the general philosophy of C, but it is a bit restrictive. You will have to make your own memory handler if you are going to build any really substantially structured attributes, and then pass around handles to the structures. PRECCX only guarantees to "unbuild" the attribute it sees if it backtracks, not whatever it points to if it is a handle, so you will have to do your own memory deallocation too. But you would have had to do that anyway in any substantial C utility.
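The handle-passing style might look like the following sketch. The struct and function names are invented for illustration; in a real script the building and dereferencing would happen inside actions, and the deallocation is, as noted above, your own responsibility.

```c
#include <stdlib.h>
#include <string.h>

/* Invented sketch: a structured attribute travels as a pointer
 * cast to a long-sized handle, since VALUEs are long-sized. */

struct node { int tag; char text[32]; };

long make_node(int tag, const char *text) {      /* build, return a handle */
    struct node *n = malloc(sizeof *n);
    n->tag = tag;
    strncpy(n->text, text, sizeof n->text - 1);
    n->text[sizeof n->text - 1] = '\0';
    return (long)n;                              /* handle fits in a VALUE */
}

int node_tag(long handle) {                      /* dereference the handle */
    return ((struct node *)handle)->tag;
}

void free_node(long handle) {                    /* your own deallocation */
    free((struct node *)handle);
}
```

Note that casting a pointer to long assumes, as the text says, that the two are the same size under your model of C.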



Each grammar definition may be parameterized with contexts. For example, `n' is the context in the following definition:

    @ decl(n) = space(n) expression <'\\n'> decl(n+1)*

This defines a grammar term decl(n) which expresses the idea that it starts n spaces in from the left-hand margin. It is followed by terms decl(n+1) which start a little further in. Some languages determine whether a declaration is local (and to what) or global in scope by relative indentation, and this is how to express that kind of constraint.

Note that it will be necessary to cast parameters to the type PARAM (long) if they are not of the same size (as long) under your model of C. E.g.

    @ decl1 = decl((PARAM)1)

This is rarely necessary.


PRECCX 2.42 can synthesize attributes on the fly. An attribute is built by following the clause for which it is the attribute with an opening `{@', followed by the expression for the attribute, followed by a closing `@}'. The expression _MUST_ _NOT_ be side-effecting, because PRECCX may execute the expression more than once if it backtracks. E.g.

    @ foo = bar gum {@ 1 @}
    @ | nay {@ 2 @}

attaches the attribute 1 to the first clause and 2 to the second. Attributes already attached to the terms of the clause may be referenced and then dereferenced as follows:

    @ arfarf = arf\x arf\y {@ $x+$y @}

The dereferencing $ in front of the x in $x is only necessary to ensure proper casting of types (from PARAM to VALUE) in all circumstances. It will usually not be required, but it is safer to use it. The x can and should always be used as a parameter without the $. E.g.

    @ bowwow = bow\x wow(x)

This is where the real power of synthesized attributes comes in. An attribute synthesized during the parse can be used as a parameter in the remainder of the parse. This makes it possible, for example, to identify a single token:

    @ foo = ?\x what(x)

whereas otherwise a construction like

    @ foo = <'a'> what('a')
    @ | <'b'> what('b')
    @ | ...

would have been necessary. Note that the attributes can be passed into actions too:

    @ foo = ?\x {: printf("%c",(int)$x); :}

but remember that actions are not executed until later. In particular, it is no use expecting an action to alter an attribute value.
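A rough C analogue of the synthesized-attribute flow in `@ sum = summand\x <'+'> summand\y {@ $x+$y @}' is a recursive-descent parser in which each parse function returns its synthesized attribute. This is an invented sketch, not PRECCX-generated code:

```c
/* Invented sketch: each parse function returns its synthesized
 * attribute, the way \x and \y name attributes in a PRECCX clause. */

static const char *p;          /* cursor into the input */

static long summand(void) {    /* summand = one digit; attribute = value */
    long v = *p - '0';
    p++;
    return v;
}

long sum(const char *input) {  /* sum = summand <'+'> summand {@ $x+$y @} */
    long x, y;
    p = input;
    x = summand();
    p++;                       /* consume the '+' */
    y = summand();
    return x + y;              /* the synthesized attribute */
}
```

Since the "attribute expression" here is just `x + y' with no side effects, re-running it after backtracking would be harmless, which is the point of the side-effect prohibition above.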


PRECCX has infinite lookahead and backtracking in place of the yacc 1-token lookahead. This means that PRECCX parsers distinguish correctly between sentences of the form `foo bah gum' and `foo bah NAY' on a single pass. If you cannot imagine why one should want to decide between the two, think about `if ... then' and `if ... then ... else'. One can write the grammar definition down straight away in PRECCX as

    @ statement1 = <'i'> <'f'> boolexpr
    @ <'t'> <'h'> <'e'> <'n'> statement
    @ [ <'e'> <'l'> <'s'> <'e'> statement ]

but this is much harder to do for yacc-style parsers.


Complex compound expressions like

    explain {{this | that} {several | no} times}+

are legal almost anywhere within PRECCX definition scripts. The definition can be substituted for the definee anywhere in a script except in the parameter list of a higher-order parser application. Grouping parentheses may be required.


Parts of a script can be PRECCX'ed separately, compiled separately, and then linked together later, which makes maintenance and version control easy. Suppose that you have written a monolithic script mono.y of some 500 definitions, commencing with some

    # define ...
    ...
    # include ...

definitions for the C precompiler, and terminating with a MAIN(foobie) declaration. You can cut this script into four:

    mono.h     --- the heading declarations
    1sthalf.y  --- first 250 definitions
    2ndhalf.y  --- second 250 definitions
    monomain.y --- the MAIN declaration

and place the instruction

    # include "mono.h"

at the head of each of the .y files. Then run PRECCX over each of these:

    preccx 1sthalf.y 1sthalf.c
    preccx 2ndhalf.y 2ndhalf.c
    preccx monomain.y monomain.c

compile each:

    gcc -ansi -Wall -c 1sthalf.c 2ndhalf.c monomain.c

then link:

    gcc -o mono 1sthalf.o 2ndhalf.o monomain.o

The ordering and placement of the definitions in the files is not important.


You may notice it yourself, but I'll say it now. PRECCX is fast, typically taking two to five seconds to compile scripts of several hundred lines. And it builds fast parsers too.


PRECCX `macros' may be defined in a script, simply by defining one parser as a context for another. For example,

    @ optional(parser) = parser | {}

may be defined (this particular example is an equivalent for the built-in [parser] construct). And then the construct

    @ ice_cream(flavour) = tub(flavour) optional(sauce)

may be used. The `macros' are really ordinary grammar definitions which just happen to take other grammars as parameters. You may find that you have to cast these parameters to be the same length as all the others, if your model of C uses different-sized pointers for function addresses than long. The cast is only required when you introduce a grammar name as a constant:

    @ ice_cream(flavour) = tub(flavour) optional((PARAM)sauce)

and you may also find that you have to declare

    extern PARSER sauce;

somewhere above the line, just to let C know what is going on.
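The underlying mechanism is a function pointer travelling as an ordinary long-sized parameter. The sketch below (names invented; PARAM and PARSER are mocked up, and the pointer-to-long round trip is implementation-defined in C, which is exactly why the manual warns about pointer sizes) shows an `optional(p) = p | {}' written by hand:

```c
/* Invented sketch of a higher-order parser parameter. */

typedef long PARAM;
typedef int (*PARSER)(const char **);      /* a parser advances a cursor */

static int sauce(const char **s) {         /* matches the literal "sauce" */
    if ((*s)[0]=='s' && (*s)[1]=='a' && (*s)[2]=='u' &&
        (*s)[3]=='c' && (*s)[4]=='e') { *s += 5; return 1; }
    return 0;
}

int optional(PARAM parser, const char **s) {  /* optional(p) = p | {} */
    PARSER q = (PARSER)parser;             /* recover the grammar from PARAM */
    q(s);                                  /* try it; succeed either way */
    return 1;
}

int demo_optional(const char *input) {     /* returns characters consumed */
    const char *s = input;
    optional((PARAM)sauce, &s);            /* the cast from the text above */
    return (int)(s - input);
}
```

On "sauce" the optional parser consumes five characters; on anything else it consumes none and still succeeds, matching the `| {}' empty alternative.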


Look at the UNIX man pages for yacc(1), lex(1), and gcc(1L).



SYNOPSIS

    preccx [options] [ file.y [ file.c ] ]

PRECCX can also be used as a stdin-to-stdout filter:

    preccx [options] < file.y > file.c

It is better to use command-line file names, however; there is then no possibility of the console or keyboard being misidentified for error messages and interrupts. If file.c is omitted, stdout is used. If file.y is omitted too, then stdin and stdout are used. It is sometimes useful to run PRECCX in stdin-to-stdout mode (interactive mode) in order to debug a complicated definition.

The command-line options alter the sizes of internal PRECCX buffers and tables. You may have to increase these if PRECCX runs out of space when compiling a script.

    -rNN    Read token buffer size in Kb (10)
    -pNN    internal Program size in Kb (16)
    -vNN    Valued attribute buffer size in Kb (8)
    -fNN    context Frame buffer size in Kb (16)

See the section on LIMITS, STACKS AND BUFFERS. In addition, the -old switch in 2.42 supports the use of the old-style yacc-a-like numerical references to attributes within actions. See the section on ACTIONS.



A specification in a grammar definition script may look like:

    @ expr = var {<'+'>|<'-'>} expr
    @ | <'('> expr <')'>

The `@' is an `attention mark'. Every line which does not begin with an `@' is passed through to the output unchanged, so arbitrary C code can be embedded in a PRECCX script. This makes PRECCX scripts literate in the sense made popular by Donald Knuth. Comments must therefore be delimited by C comment marks, `/*' and `*/'. A sequence of lines each of which begins with `@' is read as continuous input by PRECCX. There must be one blank line on either side of each group of lines beginning with `@'.



`*' is a postfix operator which means `zero or more times'. It can be attached to any atomic PRECCX expression -- that is, anything which looks like a single thing to PRECCX: a literal, a definition name, a group in braces, a ..., but not an unbracketed non-trivial sequence or alternation.

Example (1):

    @ boring = <'z'>*

valid inputs:

    zzzzzzzz
    zzzzzz
    (nothing)

Example (2):

    @ identifier1 = alpha {alpha | numeric}*
    @ alpha = (isalpha)
    @ numeric = (isdigit)

valid inputs:

    identifier1
    isalpha
    go123on

The `*' may be followed by an integer C expression in order to define a specific number of repetitions.

Example (3):

    @ spaces(n) = space*n
    @ space = (isspace)


`+' is a postfix operator which means `one or more times'. It is equivalent to `*' in the sense that one may always substitute a+ by a a*, but it is sometimes more concise or revealing to use this form.

Example (1):

    @ boring = <'z'>+
    @ | :/*empty*/:

valid inputs:

    zzzzzzzzzz
    zzzz
    (nothing)

`[ ... ]'

`[ ... ]' is an outfix operator which means `optionally'. Syntactically, it acts as a bracket. Actions can be captured within too.

Example (1):

    @ integer = [ <'+'>|<'-'> ]
    @ unsigned_int
    @ [ {<'E'>|<'e'>} [<'+'>] unsigned_int ]
    @ unsigned_int = (isdigit)+

valid inputs:

    -100e2
    1234567890
    +456
    321E+6

The [ foo ] construct is equivalent to { foo | }. The foo* construct is equivalent to [ foo+ ].

`{ ... }'

`{ ... }' are the grouping brackets for PRECCX expressions.

Example (1):

    @ identifier2 = {(isalpha) | <'_'>} {(isalpha) | (isdigit) | <'_'>}*

valid inputs:

    a123_
    _______
    g_l_e_e_0

`] ... ['

`] ... [' (anti-brackets) hide an expression, causing it to be required but ignored. This has different effects in the middle of an expression and at the end of a PRECCX expression. The principal intended use is to require trailing context which is not parsed.

Example (1):

    @ word = {alphanum | <'\''> | <'-'>}*
    @ ] separator | punctuation | stop | EOF [
    @ sentence = word
    @ {[punctuation] separator word}*
    @ stop
    @ stop = <'.'>
    @ punctuation = <','> | <';'> | <':'> | <' '> <'-'>
    @ separator = (isspace)

But when used early in a sequence, the effect is of a demand for `parallel' parsing. The hidden context must be reparsed by later sequents and hence must satisfy two parse requirements simultaneously.

Example (2):

    @ loweralpha = ](islower)[ (isalpha)

valid inputs:

    a
    b
    c

Note that the syntax is sometimes ambiguous, and care must be taken to split up definitions involving anti-brackets in order to disambiguate it for PRECCX. The definition

    @ foobie = ] a [ b ] c [

is either

    @ foobie = ]{ a [ b ] c}[

or

    @ foobie = {] a [} b {] c [}

and no particular interpretation is guaranteed (by me). Use grouping brackets if in doubt. (OK, I'll come clean. The first interpretation ought to be the one PRECCX uses always, since it involves the longest match down the left-hand side of the parse tree.)


`?' stands for any token (except the special 0 token which the yylex() lexical analyzer should use to signal a break).

Example (1):

    @ error = ?* {: printf("something happened on line %d\n",yylineno); :}


`^' means `beginning of line'.

Example (1):

    @ foo = {^ | separator} <'F'> <'O'> <'O'> ]separator|EOF[


`$' means `(a) match the special 0 TOKEN'. This token is usually returned by yylex() to denote end of line or EOF, or some other break or termination condition. So `$' is the place to perform special actions.

`$' means also `(b) prepare to append more input'. This is `append' in distinction to `overwrite'. After a `$' match, tokens are appended to the input buffer as though no interruption had occurred in normal processing, except that the 0 token is written to match the position of the `$'. This action should be compared with that of the `$!' construct.

Example (1):

    @ EOL = $ :V(1)=' ';:


`!' means cut, or `execute all pending actions now'. The input buffer is reset so that the current TOKEN becomes first. Backtracking across the ! position is disabled, and an error is generated if it is attempted.

Example (1):

    @ EOL = $ ! {: printf("no hope of backtracking\n"); :}


`$!' is short for `$ !'.

`( ... )'

`(foo)', where foo is a BOOLEAN-valued predicate on tokens, means `match a token satisfying foo'. Foo may be defined as an int-typed, 1-or-0-valued C function elsewhere in the script.

Example (1):

    @ name = (myisalpha)+

    BOOLEAN myisalpha(c)
    TOKEN c;
    { return(isalpha(c)||(c=='_')); }

`) ... ('

`)...(', placed round a C expression of BOOLEAN type, indicates a logical test condition.

Example (1):

    @ linefrom(n) = )n==0( {?\x {: printf("%c",$x); :} }*
    @ | )n<80( ? linefrom(n-1)

`< ... >'

`<...>' may be placed around literals for a match. Variables may occur in the literal, which may be any C expression.

Example (1):

    # define COLON ':'
    @ twocolons = <COLON> <':'>

Example (2):

    @ encrypted(x) = <rot13(x)>

`> ... <'

`>...<' may be placed around C expressions of TOKEN type to mean `not a (particular literal)'.

Example (1):

    @ string = <'"'> strchar* <'"'>
    @ strchar= <'\\'> ?
    @ | >'"'<


`|' means `or', and is placed between alternate phrases of the grammar.

Example (1):

    @ a_or_b = a
    @ | b


Simple conjunction indicates sequence.

Example (1):

    @ abc = a b c

is the term denoting an expression consisting of an `a expression' followed by a `b expression' followed by a `c expression'. An example of a full PRECCX script will be found in the section USAGE.


A default do-nothing tokeniser is provided in the PRECCX library and will be automatically linked in unless you specify a different yylex() routine to the C compiler. There is nothing to worry about here. If you do nothing yourself, you will get a working parser out of a PRECCX script immediately. But if you particularly want to put your own tokeniser on the input, then you do that by

 1. naming it `yylex()';
 2. making it return a TOKEN when called;
 3. placing its object module or source code file ahead of the `-lcc' argument when you use the C compiler. For example:

        gcc -ansi -o foo foo.c mylex.c -L $PRECCDIR -lcc

    and it will be linked in instead of the default.

Exact details of what yylex() should do:

 A) (Important) yylex() should signal EOF by setting yytchar to EOF and returning with the value 0, which yylex() routines generated by lex(1) do not seem to get right. Under normal conditions, it should
    1) return a nonzero TOKEN and set yytchar to something other than EOF;
    2) set yylval to the attribute VALUE of the token, e.g. the value of the integer for an INT token, the character itself for a CHAR token, and so on.
    The 0 return code is a special TOKEN only matched by the PRECCX constructs `!' and `$' and `$!' (and `$$', for EOF).

 B) yylex() should set yylen to the length of the string corresponding to the returned TOKEN (this is not currently required by PRECCX).

 C) yylex() should set yylloc to point to the string (this is not currently required by PRECCX).

 D) yylex() should increment yylineno when a new line is deemed to have been input. PRECCX uses this information in the default error messages.
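A minimal character-pass-through yylex() following the contract above might look like this sketch. The globals TOKEN, VALUE, yytchar, yylval and yylineno normally come from the PRECCX headers; here they are mocked up so the fragment is self-contained, and input is drawn from a string rather than stdin, purely for illustration.

```c
#include <stdio.h>

/* Mock-ups of names normally supplied by the PRECCX headers. */
typedef char TOKEN;
typedef long VALUE;

int   yytchar;                 /* EOF is signalled through here */
VALUE yylval;                  /* attribute of the returned token */
int   yylineno = 1;

static const char *input;      /* mock input source (would be stdin) */

void set_input(const char *s) { input = s; }

TOKEN yylex(void) {
    int c = *input ? (unsigned char)*input++ : EOF;
    if (c == EOF) {
        yytchar = EOF;         /* A) signal EOF ...                    */
        return 0;              /* ... and return the special 0 TOKEN   */
    }
    yytchar = c;
    yylval = (VALUE)c;         /* the character itself is the attribute */
    if (c == '\n')
        yylineno++;            /* D) count lines for error messages     */
    return (TOKEN)c;
}
```

This is essentially what the built-in default lexer does; a real replacement would also set yylen and yylloc if it recognises multi-character tokens.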


The way to compile a C source code file `foo.c' generated by PRECCX into an executable `foo' is to use an incantation like:

    gcc -Wall -ansi -o foo foo.c -L <PRECCX dir> -lcc

(under UNIX). The command line will vary with particular installations and configurations. In DOS (under Turbo C), I find that it is important to select the `assume code segment not the same as data segment' switch for the compiler and linker. This is especially important if several different modules are compiled and linked together. Note that the default call stack size in DOS is only 4K, and this is altered to 32K+ by PRECCX executables during the main() routine they set up. The size can be varied by setting C_STACKSIZE (default 0x7FFFF). See the section LIMITS, STACKS AND BUFFERS.


The following macro may be set in the grammar definition script, above the # include lines for cc.h or ccx.h:

    # define TOKEN tokentype    (default char)

This defines the space reserved for each incoming token in the parser which PRECCX builds. You must choose a preccx link library that matches the size of TOKEN: use libccn.a (or preccxnM.lib, where M is the memory model) for n-byte TOKENs. You can change the TOKEN type seen by PRECCX by #defining it differently:

    # define TOKEN short

(you may want a wider range of TOKENs than the 256 possibilities afforded by an 8-bit char, and `#define TOKEN short' is sometimes useful). Any integer type is valid.


    # define VALUE valuetype    (default char*)

This defines the space reserved for each value on the runtime stack manipulated by the runtime program which PRECCX attaches to the parser. There is no good reason for changing this to a type which is shorter than long int (or char far *), because the actual space used will be a union type which is at least as long as this. Nor is it possible to change the size beyond that of long (it is not the largest type passed as a value by C, but PRECCX cannot handle any larger one). So you can only use VALUE to switch to using structure or union pointers, or to change the name of the value type without changing its size.


The PRECCX 2.4x series provides for explicit error trapping using labeled cut marks in scripts. For example:

    @ top = !{skip} foo
    @ skip = ?* $ top

    MAIN(top)

defines a top parser with a default fallthrough to a parser that silently skips a line and then retries top. Actually, in this case, the call of top in skip is unnecessary, because PRECCX always retries its top-level parser continuously until input is exhausted, but this style of specification does keep the parse inside a single invocation of the top-level parser, which is ideologically cleaner!

Failing explicit error directions, PRECCX will use its defaults. The default error action may be intercepted at low level by #defining an ON_ERROR(x) macro in C. There are currently three error values reported by PRECCX:

    x=0   means a partial (but successful) parse;
    x=1   means an unsuccessful parse;
    x=-1  means an attempt to backtrack across a `cut' (`!').

By default, PRECCX calls the ON_ERROR(x) macro when these errors arise. Then it attempts to reparse the remaining input. You might redefine what the macro does. For example:

    #define ON_ERROR(x) switch(x){\
        case 0: printf("ow!\n"); break;\
        case 1: printf("ouch!\n"); break;\
        case -1:printf("zowie!\n");break;\
        }

The x=0 value might arise from a specification like <'a'>* and an input like "abb". The remaining input will be "bb". The x=1 value might arise from a specification like <'a'> <'a'> and an input "abb". The remaining input will be "abb". The x=-1 value might arise from a specification like <'a'> ! <'a'> and an input "abb". The remaining input will be "bb".

The default error macro supplied with PRECCX simply prints an error message and the portion of the string beyond the (TOKEN *)maxp pointer, which is generally accurate for error location. It points to the deepest successful penetration into the incoming string. For your information, the pointer (TOKEN *)pstr always gives the unparsed TOKEN string, of which (TOKEN *)maxp will be an end-segment. These will not necessarily make good reading, since the pointer is into PRECCX's buffer and that is only guaranteed to be populated between pstr and maxp. The pointer *yylval may be set by the lexer to show more detail, but support is limited. You can determine what tokens are in the buffer by looking at the *yybuffer pointer (or the PRECCX buffer[]) and then attempt to reconstruct where you are from that snapshot.

The error routine macro is initially set to

    ON_ERROR(x) = switch(x) {
        case 0 : ZER_ERROR(0); break;
        case 1 : BAD_ERROR(1); break;
        case -1: BTK_ERROR(-1);break;
        }

If you want to try and resync the parse at an error, a sensible thing to do would be to (rewrite ZER_ERROR or BAD_ERROR or BTK_ERROR to) skip a token at maxp, and rerun the parse from there. You would have to read the code of the run() function defined in cc.c to make sense of this, but you might try:

    get1token();                /* skip a TOKEN */
    tok=the_top_level_parser(); /* run the parse again */
    if(GOODSTATUS(tok)){        /* the parse succeeded, so .. */
      pc=0;pc=p_evaluate();     /* run the pending actions */
    } else printf("At least I tried!\n");

Using a counter to set a maximal number of resync attempts in a single line would also be sensible! You can avoid all BAD_ERROR(1) calls by making sure that the top-level parser has a failsafe fallthrough to a ?* parser, with some kind of error action attached.

The version 2.x series of PRECCX introduced the error BTK_ERROR(-1), which traps an attempt to backtrack across a cut. The ZER_ERROR(0), BAD_ERROR(1) and BTK_ERROR(-1) defaults are what you get with ON_ERROR(0), ON_ERROR(1) and ON_ERROR(-1) respectively. The default values of these C macros are respectively zer_error(), bad_error() and btk_error(), the three functions defined in the on_error.c module.


You may include the lines

    #define BEGIN mybegincode
    #define END   myendcode

for C code to be executed at either end of a top level parse attempt. This means that BEGIN will be re-executed if the top level parser resyncs after an error, and your code should take account of that (most likely by installing and using a counter).


The parser generated from a PRECCX script will ordinarily signal valid input by absorbing it silently, and signal invalid input by rejecting it and spouting an error message. This is a standard style for compiler-compilers. To get the parser to do anything else, you must decorate the definition script with ACTIONs. So now for the horrors of synthetic attributes.

To make a parser do anything significant, you need either to get it to synthesize a data structure, or get it to generate outputs. Either way, you need to scatter actions through the grammar definition script. Actions are pieces of C code (terminated by a semi-colon) placed between a pair of braced colons (`{: ... :}') in the grammar definition script. For example:

    @ addexpr = expr\x <'+'> expr\y {: printf("%d",$x+$y); :}

is not unreasonable. `Values' attached to each term of a PRECCX expression are an easy way to think of what is going on. Note that literals (like <'+'>) have their attached value generated by the yylex() token analyzer which feeds PRECCX. The VALUE yylval should be set by yylex() when it returns a TOKEN, and this will be used as the attached value.

Side-effecting actions need a little explanation. Because PRECCX is an infinite look-ahead parser, it cannot execute actions at the same time as it reads input. It might have to backtrack across its parse later, and, whilst it might deconstruct data structures built up in the parse, it is certainly impossible to undo writes to stdout which might have occurred. So PRECCX builds a program as it parses. When the parse finishes correctly, the program is executed by an internal engine, but if the parse is unsuccessful or has to be backtracked, the program is `unbuilt' without its actions ever being executed. This program is a linear sequence of the C code actions which have been specified in the PRECCX definition file.
Thus the specification:

    @ abc = a b c  {: printf("D"); :}
    @ a   = <'a'>  {: printf("A"); :}
    @ b   = <'b'>  {: printf("B"); :}
    @ c   = <'c'>  {: printf("C"); :}

will, upon receiving input "abc", generate the program

    printf("A");printf("B");printf("C");printf("D");

to be executed later. Thus actions attached to a sequence expression may be thought of as occurring immediately after the actions attached to sub-expressions, and so on down. That explanation should enable you to generate side-effects in the correct sequence.


In earlier versions of PRECCX, the call_mode (default 0/AUTO) parameter determined the way PRECCX handled its internal stack of attached values, using either automatic control or allowing client control. In 2.42 and above, the call_mode parameter is obsolete. The stack has been superseded by compiled code. PRECCX now operates in only one mode under normal circumstances. However, the -old switch on the command line will enable at least partial support for yacc-style numerical references $1, $2, $3, etc., to attached attributes.


PRECCX grammar description files conventionally have the .y suffix, and should follow the format:

    # define TOKEN ...        (default = char)
    # define VALUE ...        (default = char*)
    # define BEGIN ...        (default nothing)
    # define END ...          (default nothing)
    # define ON_ERROR(x) ...  (defaults to standard)
    # include "ccx.h"         (or wherever the ccx.h file has gone)

    @ first definition  {: attached action; :}
    @ ...
    @ ...

    MAIN(name of entry clause)

The cc.h header file may be used instead of ccx.h in scripts which consist only of unparametrized definitions and terms.


The standard sizes for the token buffer and interpreted program stack inside PRECCX are respectively

    * READBUFFERSIZE
    * PROGRAMSIZE

defined in the header files cc.h and ccx.h. These can be changed in the module which contains the MAIN(...) parser declaration. The READBUFFERSIZE limits the number of TOKENs PRECCX can accept between cut (`!') operations, and the PROGRAMSIZE limits the total number of TOKENs and ACTIONs. This number is about twice the number of TOKENs seen (assuming one action and one passed value per TOKEN), and thus limits the `line length' too. Note that the cost in space of a TOKEN itself may be small, but the token stack requires a parallel stack of token VALUEs, supplied by the yylex() lexer through the yylval variable.

PRECCX no longer uses a runtime VALUE stack for the attributes attached to grammar components, but the parameter governing the size is retained for compatibility. It is

    * STACKSIZE

and the default value is 0.

The only other limit imposed on PRECCX is the size of the C runtime stack. This can also be set in the MAIN(...) module, by defining

    * C_STACKSIZE

You should avoid recursive calls in favour of the * and + constructions whenever space is tight. Each library call costs about 20 bytes of C stack space.
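As a sketch, the overrides might look like this in the module that carries the MAIN(...) declaration. The numeric values here are arbitrary illustrations, not recommended settings:

```c
/* Example only: override the default sizes before including the
   kernel header, in the module that carries MAIN(...).  The numbers
   are arbitrary illustrations, not recommendations. */
# define READBUFFERSIZE 8192    /* TOKENs acceptable between cuts */
# define PROGRAMSIZE    16384   /* total TOKENs + ACTIONs         */
# define C_STACKSIZE    65536   /* C runtime stack for the parse  */
# include "ccx.h"
```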



The following script defines a simple +/- calculator:

    # define TOKEN char
    # define VALUE int
    # include "cc.h"
    # include <ctype.h>

    int acc = 0;

    @ digit    = (isdigit)\x {: acc=10*acc+$x-'0'; :}
    @ posint   = digit posint
    @          | digit !     {: acc = 0; :}
    @                        {@ acc @}
    @ anyint   = <'-'> posint\x {@ -$x @}
    @          | posint
    @ atom     = <'('> expr\x <')'>
    @                        {@ $x @}
    @          | anyint
    @ expr     = atom\x sign_sum\y
    @                        {@ $x+$y @}
    @          | atom
    @ sign_sum = <'-'> atom\x sign_sum\y
    @                        {@ -$x+$y @}
    @          | <'-'> atom\x {@ -$x @}
    @          | <'+'> atom\x sign_sum\y
    @                        {@ $x+$y @}
    @          | <'+'> atom

    MAIN(expr)

This script must be passed through PRECCX:

    preccx calculator.y calculator.c

and then compiled, using the PRECCX kernel library in libcc.a:

    gcc -Wall -ansi -o calculator calculator.c -L ... -lcc

The three dots stand for the directory in which the PRECCX library file libcc.a is placed.

Note that by default the attached attribute is that of the last term in a clause, so no {@ ... @} is needed in some places where it might have been thought to be required. Also note that it would have been more efficient to use an optional following action and write

    @ expr = atom\x { sign_sum\y {@ ... @} | {@ $x @} }

instead of

    @ expr = atom\x sign_sum\y {@ ... @}
    @      | atom

because the latter expression will build a parser which needlessly checks twice for atom when no sign_sum follows.


For an example of a parser which makes essential use of parameters, the following definition of a parser which accepts only the fibonacci sequence as input may be instructive:

    # define TOKEN char
    # define VALUE char*
    # include "ccx.h"
    # include <math.h>

    # define INT(x)   (int)(x)
    # define DIV(m,n) INT(INT(m)/INT(n))
    # define MOD(m,n) INT(INT(m)%INT(n))
    # define LOG10(n) INT(log10(DBLE(n)))
    # define DBLE(n)  (double)(n)
    # define TEN      DBLE(10)
    # define FIRSTDIGIT(n) \
              (0!=n)?DIV((n),pow(TEN,DBLE(LOG10(n)))):0
    # define LASTDIGITS(n) \
              (0!=n)?MOD((n),pow(TEN,DBLE(LOG10(n)))):0

    MAIN(fibs)

    @ fibs      = fib((PARAM)1,(PARAM)1)\k
    @             {: printf("%d terms OK\n",(int)$k); :}
    @ fib(a,b)  = number(a) <','> fib(b,a+b)\k {@ $k+1 @}
    @           | <'.'> <'.'>
    @             {: printf("Next terms are %d,%d,..\n",(int)a,(int)b); :}
    @             {@ 0 @}
    @ number(n) = digit(n)
    @           | digit(FIRSTDIGIT(n)) number(LASTDIGITS(n))
    @ digit(n)  = <n+'0'>    /* rep. of 1 digit n */

The following are some example inputs and responses:

    1,1,2,3,5,..
    5 terms OK
    Next terms are 8,13,..

    1,1,2,3,5,8,13,21,34,51,85,..
    error: failed parse: probable error at <>1,85,..


PRECCX macros (see Section MACROS) are higher-order parsers. In principle one may define a separated-list macro:

    @ sep_list(p,q) = p {q p}*

and use it as follows:

    @ mysep     = <','>
    @ mylist(p) = sep_list(p,mysep)
    @ items     = mylist(item)

but you may find that you have to

1) declare a parser before you use it as a parameter:

    extern PARSER mysep, item;

2) cast the parser to the PARAM type (used by PRECCX for all parameters) as you use it for the first time:

    # define CONST(x) (PARAM)(x)

    @ mylist(p) = sep_list(p,CONST(mysep))
    @ items     = mylist(CONST(item))

This is only necessary if sizeof(short(*)()) is not the same as sizeof(long) in your C model.


PRECCX is intended to be both easy and convenient to use, but a compiler compiler cannot be understood in one minute. There is a directory of example files included with the PRECCX distribution. Have a look at the *.y files there to get more of the feel. (Sorry, but I can't put everything you need to know in here).



The following files may be found in the PRECCX distribution -- which is not to say that all of them are in it, or that this list is complete. I just want to let you know what some of the files you can see are:

    preccx        PRECCX executable
    preccx.y      Main PRECCX definition in its own language
    preccx.c      PRECCX C source code (generated by PRECCX from preccx.y)
    preccx.h      PRECCX header file, needed only to construct PRECCX
    lex.y         Lexer definition
    lex.c         Lexer C source code (generated by PRECCX from lex.y)
    c.y           C parser definition
    c.c           C parser source code (generated by PRECCX from c.y)
    ccdata.c      The global data used by PRECCX
    ccx.c         Source code of the PRECCX 2.x kernel operations, needed
                  to make ccx.o, included in libcc.a
    cc.c          Source code of the unparametrized PRECCX 1.x kernel
                  operations, needed to make cc.o, included in libcc.a
    ccx.h         Header file of the PRECCX 2.x kernel operations, needed
                  by every code constructed by PRECCX
    cc.h          Header file of the unparametrized PRECCX 1.x kernel
                  operations, an alternative to ccx.h if you do not use
                  parameterized definitions
    common.c      Source code of the kernel common to both 1.x and 2.x
    engine.c      The PRECCX runtime engine
    yystuff.c     Default lexer which allows you to escape newlines
    on_error.c    Default error routines
    atexit.c      Termination routines
    libcc.a       The UNIX library containing cc.o, ccx.o and yystuff.o,
                  etc., needed to compile an executable from code built
                  by PRECCX. Actually a link to one of:
    libcc1.a      1-byte TOKEN library
    libcc2.a      2-byte TOKEN library
    libcc4.a      4-byte TOKEN library
    preccx.lib    The DOS library containing cc.o, ccx.o and yystuff.o,
                  etc., needed to compile an executable from code built
                  by PRECCX. Actually a copy of one of:
    preccx1c.lib  1-byte TOKEN library, compact memory model
    preccx2c.lib  2-byte TOKEN library, compact memory model
    preccx4c.lib  4-byte TOKEN library, compact memory model
    preccx1l.lib  1-byte TOKEN library, large memory model
    preccx2l.lib  2-byte TOKEN library, large memory model
    preccx4l.lib  4-byte TOKEN library, large memory model
    Makefile      The makefile for PRECCX, which calls:
    Makefile.dos  The makefile for PRECCX under DOS
    Makefile.hpu  The makefile for PRECCX under HP-UX
    Makefile.syv  The makefile for PRECCX under System V UNIX
    test.y        Simple test script for PRECCX
    test.c        C output from the test.y script
    test          The test parser built by
                  `gcc -ansi -o test test.c -L ... -lcc'


PRECCX COMPILER COMPILER Copyright Peter T. Breuer, 1989, 1992, 1994. 3, Arthur St. Cambridge CB4 3BX <ptb@{,,,}> All rights reserved. In particular, you may not distribute for profit or cost the source code of the kernel libraries, or the description of PRECCX in its own scripting language, or its source code, without my permission. You may not make copies of the PRECCX executables and libraries, except for your own use and for the purpose of making backups, and you definitely may not sell any copies. See the licensing agreement which accompanies the package for full details.


Peter Breuer

    Programming Research Group,
    Oxford University Computing Laboratory,
    Wolfson Building, Parks Road,
    Oxford OX1 3QD, UK.

    DIT, Escuela Superior de Ingenieros de Telecomunicacion,
    Universidad Politecnica de Madrid,
    Ciudad Universitaria, Madrid E-28040, SPAIN.

    <ptb@{,,,}>

Original man page also hacked by Jonathan Bowen <>.


This executable readme created by P.T. Breuer using Dave Harris' freeware DRC utility, David's Readme Compiler, available from SIMTEL and mirrors at numerous archive sites. PRECCX has been developed under Turbo-C 3.0 and Borland C 2.0 for 386 PCs under MS DOS 5.0 and PC DOS 6.3, and under GNU's gcc C compiler for UNIX on numerous platforms and operating systems. Blame them. The PRECCX.EXE executable for DOS has been compressed using Fabrice Bellard's freeware LZEXE utility, available from SIMTEL and mirrors as LZEXE91.ZIP .


1. On Sun3s, the gcc compiler still complains that printf is being redefined. I don't know why. If anyone finds the right compiler switch to magic this away, please tell me! For the hp300 series, the switch is -D__hp9000s300, if that's any clue? Mind you, on the HPs I get complaints about __fls being assumed to be int (it is). I presume all these reflect mess-ups in the gcc configuration. On my Sun SPARC I also get complaints about strlen being redefined, but then that gcc configuration is _definitely_ crocked.

2. (Cured Mar 10 1992 in v1.1)

3. (Not cured but irrelevant, as multiple libraries are now used, April 1992) There is no way to change the TOKEN and VALUE types compiled into the libraries, other than trivially. The size must remain the same.

4. (Cured Mar 17 1992 in v1.2)

5. (Cured July 10 1992 in v2.21, by using dynamic allocation in main) This is a perennial one. PRECCX allocates all stacks and buffers statically, and there are one or two I don't watch for overflow on at every possible overflow point. One day I will switch to dynamic allocation -- or at least command line parameters which determine the sizes, like yacc. See NOTES and LIMITS. (NB - should have been finally banished in v2.42 since all stacks are guarded there.)

6. Cured in April 1993 in v2.40: failure to set a mark at some cut points resulted in fewer backtrack errors than there ought to have been.

7. Cured in August 1994 in v2.41: raw literal strings and chars were not always recognized as valid parser arguments. Easy one.

8. Cured finally in August 1994 in v2.42, I hope: several reports over the years all eventually traced to a failure to realign the read buffer when it ought to have been, resulting in overflow about 2K tokens later. Symptoms are an overwriting of the program area, with an "illegal instruction" report issued from the engine module. Let me know if the impossible happens again.

Please report problems to <>.


v2_01 : March 1992
: Original issuable release of precc*x*. With parameterized grammars
: now, but still with line-at-a-time lexer requirements. Preccx
: asks the lexer for a string of tokens terminated by a zero token,
: and offers the yybuffer location for them to be written to.

v2_02 : March 1992
: Minor corrections (but necessary ones) to `!' and `!$' functionality.

v2_10 : Wed Jun 10 08:53:56 1992
: Major revision of the library routines. They now use argument counts
: instead of the P_STOP terminator to demarcate the list of parameters
: in multi argument calls. This means that the P_STOP value is now valid
: as a parameter (hooray, I always use MAXINT-7 in my programs).
:
: Bug fixes: more corrections to `!' and `!$' functionality.
:
: Improvements: * rewrite of bracket counting algorithm in preamble.c
: to return more info (number of args) in support of the new style
: libraries, and do it better.
: * added yylen to lexer code.

v2_20 : Thu Jun 11 21:07:09 1992
: Major revision of the lexer interface. Switched over to
: token-at-a-time calls to yylex(), a la yacc, for compatibility. This
: is extremely inefficient, technically, of course! I'll release the
: default lexer code so that it can be figured out.
:
: Lexers now return the TOKEN value when called, and put the VALUE in
: yylval. They should do their own buffering if they want to be
: efficient. They still have to shift yylineno for themselves, and set
: yylen, and yylloc (the string location for error calls), though I
: can't make head or tail of the documentation on the latter in
: bison/yacc, so I don't really know what it's for.
:
: Experiment: * began to allow the p[q,r](x,y) syntax to distinguish
: meta-parameters from ordinary ones. I suppose it makes sense as a
: convention, even though I don't do anything with it at present.

v2_21 : Sun Jun 21 17:44:52 1992
: Bug fixes to new-style default lexer (of course).
:
: Improvements: * internal stacks now created dynamically.
:
: * new syntax, p*n for exactly-n repeats of p. This
: required support in the kernel. The n can be a C expression and
: catches local parameters thrown from the parser definitions.
:
: The documentation has been brought up to date and improved.

ver 2_22 : Sun Jul 12 17:11:03 1992
: Making stack allocation user definable. Use STACKSIZE, C_STACKSIZE,
: READBUFFERSIZE, FRAMEBUFFERSIZE in the main module (see cc.h).
: Final lexer fixes.

ver 2_23 : Sat Jul 25 21:39:54 1992
: Going back to ANSI code for MSDOS from Turbo-C.
: The optimization default has been turned to off (because the expected
: deficiency finally surfaced, and this will be fixed soon - I
: have to include the instruction cache in the saved frame).
: Shortened cc.h and ccx.h for users so that internals not exported.

ver 2_30 : Mon Aug 24 14:31:32 1992
: Changes in source code for Unix and other compilers, to help
: with portability and compatibility concerns. One bugfix in libraries.
: Added a trivial atexit.c module to satisfy systems without atexit().
: Changed SUCCESS to 1 and FAILURE to 0, for forward compatibility
: with the monad model of parsing. This is the reason for the new
: release number.
: Now supporting the monadic `a\x b(x)' syntax, but not all the
: functionality (wait for 2.31). The `a[b]' syntax is withdrawn, as
: it's purely decorative, and slows down the parse noticeably.
: Corrected a bug in some0n() and another in the push macro
: which meant that changes in MAXPROGRAMSIZE weren't seen by ccx.h.
: I introduced globals to contain all the user-definable numbers, just
: in case they need to change dynamically in future. The struct precc_data
: contains them all.

ver 2_31 : Mon Mar 19 1993
: Various minor internal changes for compatibility and better
: program comprehensibility.

ver 2_32 : August 1994
: ditto.

ver 2_40 : Apr 25 1993
: Implemented the `a\x b' syntax correctly at last. Various cleanups
: of the precc.y script to support this, particularly in the management
: of local environments (which really should be handled as inherited
: parameters, but aren't, for bootstrap reasons).
: Corrected a bug in the implementation of `!' which prevented
: recognition of many backtrack errors through it.
: Split the precc.y script into three: precc.y, lex.y, c.y .
: Introduced the `!{foo}' construct, which causes reentry at parser
: foo in case of a backtrack error through that point.
: The buffer-sizes in the precc utility itself can now be set on
: the command line, as well as by C macros in clients.
: patch: August 1994. Altered sources to reduce compiler warnings.
: Mended a bug in findbrkt which meant that strings and quotes were
: not recognized in parser args. Made #line N "source" directives
: appear in emitted code.

ver 2_41 beta : August 1994
: Moved synthetic attribute construction into the compile stage.
: The C stack is now being used for attribute passing, and C is looking
: after the frame shifts, not precc. Synthesized attributes can now
: be passed as inherited parameters at parse time. The old attribute
: stack has been discarded (STACKSIZE=0 by default) and call_mode is
: obsolete. Optimization also obsolete because shifting handled by C.
: Synthetic attributes should be constructed within @...@ .
: E.g. foo = bar gum @hum@ . I added encryption to get around the problem
: that naked zeros couldn't be constructed before. Now there is no
: restriction and I think the encryption is invisible.
: Named synthetic attributes should be dereferenced using the $foo
: syntax: bar = gum\foo {: print($foo); :} . This does a cast.
: The old $1 $2 syntax is supported, but only if you use the -old
: switch to precc, and the generated code is horrible. It should still
: be more robust than before, however. The $0 reference is DISALLOWED
: now. $$ is now meaningless and should be replaced with $1, if at all.
: Actions cannot make changes to these variables any longer.
: Further cleanup in the bit of preccx.y I couldn't understand before.
: Fixed: a huge bug that has been there forever. The read buffer is
: now always flushed on _successfully_ passing a cut mark
: (!). It wasn't, with resulting overflows, before. The buffer is now
: also being watched for overflow. Cleaned up pstr/maxp/buffer code.
: Removed: #line N directives from emitted code. Too confusing!

ver 2_42 : September 1994
: Minor code changes to pass ANSI lint and manual
: text cleanups. Code is now clean if -w-pro flag is set to avoid
: warnings about "function used with no prototype". This only happens
: because I use foo(); instead of foo(void); style declarations.
: Decided on {: :} and {@ @} syntax for actions and attributes
: respectively, but it is not strictly enforced yet and a little
: inefficient unless C compiler optimization is used.

This HTML document was generated automatically by Jonathan Bowen on Tue Oct 18 12:07:08 BST 1994 with minor manual corrections.