• Why don't compiler writers adhere to the dragon book recommendation of

    From Roger L Costello@21:1/5 to All on Sat Jun 25 12:58:25 2022
    Hi Folks,

    Page 101-102 of the dragon book recommends having one lexer rule for both keyword and identifiers (symtab = symbol table):

    [a-zA-Z][a-zA-Z0-9]* { Action: check symtab for the lexeme.
    If the symtab says the lexeme
    is a keyword, return the keyword.
    Else, if the symtab says the lexeme
    is an existing identifier, return ID
    and a pointer to the symtab entry.
    Else, add the identifier to the symtab
    and return ID and a pointer to the
    new symtab entry. }

    The book says the advantages of this approach are:

    1. Easy to add new keywords. Simply initialize the symtab with the new keywords. No changes to the lexer rules.

    2. Smaller transition diagram. Instead of several hundred states, there are less than a hundred states in the transition diagram.

    Those seem like pretty compelling reasons for having one lexer rule for both keywords and identifiers.

    So why don't compiler (lexer) writers follow that recommendation?

    Consider:

    Page 86-90 of the book "flex & bison" shows the lexer for MySQL. The lexer has a rule for every keyword.

    Pag 228 of "Introduction to Compiling Techniques" show a lexer for VSL (Very Simple Language). Again, the lexer has a rule for every keyword.

    /Roger
    [I think the answer to a lot of these questions contains the phrase "64K PDP-11." In
    the 1970s a thousand state DFA took a lot of memory. Today it's barely a rounding error.
    In flex&bison the main reason I did it the way I did is that it's a pedagogical example
    but it also meant the symbol table didn't need special cases for keywords. There is also
    the PL/I situation where keywords are only keywords in certain contexts so you can say
    IF IF = GOTO; THEN ELSE = BEGIN; ELSE END = IF;
    -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Roger L Costello on Sat Jun 25 13:01:40 2022
    On Saturday, June 25, 2022 at 9:41:32 AM UTC-7, Roger L Costello wrote:

    (snip)

    Page 101-102 of the dragon book recommends having one lexer rule for both keyword and identifiers (symtab = symbol table):

    (and our moderator says)
    [I think the answer to a lot of these questions contains the phrase "64K PDP-11." ...]

    The lost art of small memory compilers.

    Before the PDP-11 there were big computers like the IBM 704, where Fortran originated, and when core was $1/bit or more. (The 704 with core was an upgrade from the 701, using CRTs for main memory.) Big ones had 32K
    words, but I think the Fortran compiler ran in 8K or 16K words.

    After big computers got bigger, then we had minicomputers like the PDP-11
    and Data General Nova and Eclipse, and some others, with 32K or so bytes.

    And when minicomputers got bigger, everything happened again with microcomputers, and again compilers had to fit. I do remember swapping
    floppy disks for passes of a Fortran and Pascal compiler for MS-DOS 2.0.

    Some years ago, the Hercules group was trying to get gcc running
    on an emulated IBM S/370 running MVS, with an 8M region.
    (Out of the 16M byte address space, MVS takes up about half.)
    But you can't run gcc in 8M bytes.

    When I remember S/370 and OS/VS2, the usual region was 300K,
    which we thought was big.

    And now, we can barely run a system with 4G main memory,
    such as the Macbook Air that I am writing this on.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Sun Jun 26 00:09:46 2022
    Two parts to my answer.

    First, if they are using the Yacc++ my partner and I wrote, they are
    following that rule. Why? Because we built it into the product.

    You simply declare keywords like this:

    keyword if, then, else, begin, end, for, while, do, defun, let;
    substring keyword subroutine, function, procedure; // these match any
    unique substrings.
    sensitive keyword EoF; // only accepts that exact spelling (not eof or EOF) keyword amp "&" // keywords that are not strictly alphanumeric
    inline keyword ge ".ge.", le ".le.", eq ".eq."; // don't put these in the symbol table but instead make rules for them...
    ...

    It also incorporates the C-typedef hack if you want that. It also deals
    with the PL/I issue of context-sensitive keywords, where the word sometimes
    is and sometimes is not a keyword. . All of this is built-in.

    It does that by using the idea from Frank Deremer of separating the scanner from the screener. The keyword processing is done in the screener (using
    the symbol table), and because it uses the symbol table, the parser can add
    (or remove) entries from being keywords--although that's not the technique
    used for PL/I-style context-sensitive ones but is the right approach for C-style typedefs.

    So, I would say the Dragon book is correct. Not because it makes the
    tables smaller, but because it says cognitive space. You don't want to be reinventing the wheel for all those variations. You just want to declare
    what you mean and have the tool do the "right" thing.

    -----

    Now, the reverse side of the coin. ANTLR's lexer may be separate from the parser, but doesn't have the concept of a screener. You cannot get between
    the lexer and the parser to do "magic". Thus, you have to use explicit
    keyword tokens and if you want case insensitive ones, use regular
    expressions (character classes) to specify them, same thing for substring keywords (and it doesn't warn you if your substrings are ambiguous).

    So, while I normally laud Terence for his work on ANTLR. He gets far more right than wrong. This is a case where he got it wrong. (BTW, JavaCC
    which borrows heavily from ANTLRs implementation, suffers the same way).

    Now, being fair, ANTLR4 does have a unique feature in that regard in that
    if you write a keyword or any token (in quotes) in your grammar, it is processed only when that rule is being recognized. That's part of how
    ANTLR4 is more PEG-like. That is a different solution to the PL/I keywords
    as identifiers issue than what we support. And as far as I can tell, it
    even overrides the longest-match problem. Thus, if you have a place in
    your grammar where you want only ">" to be recognized and not ">>", simply
    put ">" in your parser at that location.

    -- ******************************************************************************

    Chris Clark email:
    [email protected]
    Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to [email protected] on Sun Jun 26 06:17:49 2022
    On 2022-06-25, gah4 <[email protected]> wrote:
    Some years ago, the Hercules group was trying to get gcc running
    on an emulated IBM S/370 running MVS, with an 8M region.
    (Out of the 16M byte address space, MVS takes up about half.)
    But you can't run gcc in 8M bytes.

    When I remember S/370 and OS/VS2, the usual region was 300K,
    which we thought was big.

    And now, we can barely run a system with 4G main memory,
    such as the Macbook Air that I am writing this on.

    My TXR Lisp hits a peak memory footprint of around 17 megabytes during
    the build of its compiler and standard library (on a 32 bit GNU/Linux
    system). If the CONFIG_SMALL_MEM build time option is used, it can get
    down to 11. That's a total footprint including all the code and data
    areas, such as mappings for the C library (which is a pig nowadays), as reported by common tools like top. (CONFIG_SMALL_MEM just adjusts the
    sizes of some static arrays used by the garbage collector and sets
    certain thresholds differently.)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [I think we should stop here before I start telling you what we did
    on a 4K PDP-8. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans-Peter Diettrich@21:1/5 to All on Sun Jun 26 10:20:25 2022
    On 6/25/22 10:01 PM, gah4 wrote:

    And now, we can barely run a system with 4G main memory,
    such as the Macbook Air that I am writing this on.

    Squeeze your system and run WinXP in a 1G VM ;-)

    DoDi

    P.S.: Of course Linux were a better choice ;-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)