Forum: >>> Magnum BBS <<<

Why don't compiler writers adhere to the dragon book recommendation of

From Roger L Costello@21:1/5 to All on Sat Jun 25 12:58:25 2022

Hi Folks,

Page 101-102 of the dragon book recommends having one lexer rule for both keyword and identifiers (symtab = symbol table):

[a-zA-Z][a-zA-Z0-9]* { Action: check symtab for the lexeme.
If the symtab says the lexeme
is a keyword, return the keyword.
Else, if the symtab says the lexeme
is an existing identifier, return ID
and a pointer to the symtab entry.
Else, add the identifier to the symtab
and return ID and a pointer to the
new symtab entry. }

The book says the advantages of this approach are:

1. Easy to add new keywords. Simply initialize the symtab with the new keywords. No changes to the lexer rules.

2. Smaller transition diagram. Instead of several hundred states, there are less than a hundred states in the transition diagram.

Those seem like pretty compelling reasons for having one lexer rule for both keywords and identifiers.

So why don't compiler (lexer) writers follow that recommendation?

Consider:

Page 86-90 of the book "flex & bison" shows the lexer for MySQL. The lexer has a rule for every keyword.

Pag 228 of "Introduction to Compiling Techniques" show a lexer for VSL (Very Simple Language). Again, the lexer has a rule for every keyword.

/Roger
[I think the answer to a lot of these questions contains the phrase "64K PDP-11." In
the 1970s a thousand state DFA took a lot of memory. Today it's barely a rounding error.
In flex&bison the main reason I did it the way I did is that it's a pedagogical example
but it also meant the symbol table didn't need special cases for keywords. There is also
the PL/I situation where keywords are only keywords in certain contexts so you can say
IF IF = GOTO; THEN ELSE = BEGIN; ELSE END = IF;
-John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to Roger L Costello on Sat Jun 25 13:01:40 2022

On Saturday, June 25, 2022 at 9:41:32 AM UTC-7, Roger L Costello wrote:

(snip)

Page 101-102 of the dragon book recommends having one lexer rule for both keyword and identifiers (symtab = symbol table):

(and our moderator says)

[I think the answer to a lot of these questions contains the phrase "64K PDP-11." ...]

The lost art of small memory compilers.

Before the PDP-11 there were big computers like the IBM 704, where Fortran originated, and when core was $1/bit or more. (The 704 with core was an upgrade from the 701, using CRTs for main memory.) Big ones had 32K
words, but I think the Fortran compiler ran in 8K or 16K words.

After big computers got bigger, then we had minicomputers like the PDP-11
and Data General Nova and Eclipse, and some others, with 32K or so bytes.

And when minicomputers got bigger, everything happened again with microcomputers, and again compilers had to fit. I do remember swapping
floppy disks for passes of a Fortran and Pascal compiler for MS-DOS 2.0.

Some years ago, the Hercules group was trying to get gcc running
on an emulated IBM S/370 running MVS, with an 8M region.
(Out of the 16M byte address space, MVS takes up about half.)
But you can't run gcc in 8M bytes.

When I remember S/370 and OS/VS2, the usual region was 300K,
which we thought was big.

And now, we can barely run a system with 4G main memory,
such as the Macbook Air that I am writing this on.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to All on Sun Jun 26 00:09:46 2022

Two parts to my answer.

First, if they are using the Yacc++ my partner and I wrote, they are
following that rule. Why? Because we built it into the product.

You simply declare keywords like this:

keyword if, then, else, begin, end, for, while, do, defun, let;
substring keyword subroutine, function, procedure; // these match any
unique substrings.
sensitive keyword EoF; // only accepts that exact spelling (not eof or EOF) keyword amp "&" // keywords that are not strictly alphanumeric
inline keyword ge ".ge.", le ".le.", eq ".eq."; // don't put these in the symbol table but instead make rules for them...
...

It also incorporates the C-typedef hack if you want that. It also deals
with the PL/I issue of context-sensitive keywords, where the word sometimes
is and sometimes is not a keyword. . All of this is built-in.

It does that by using the idea from Frank Deremer of separating the scanner from the screener. The keyword processing is done in the screener (using
the symbol table), and because it uses the symbol table, the parser can add
(or remove) entries from being keywords--although that's not the technique
used for PL/I-style context-sensitive ones but is the right approach for C-style typedefs.

So, I would say the Dragon book is correct. Not because it makes the
tables smaller, but because it says cognitive space. You don't want to be reinventing the wheel for all those variations. You just want to declare
what you mean and have the tool do the "right" thing.

-----

Now, the reverse side of the coin. ANTLR's lexer may be separate from the parser, but doesn't have the concept of a screener. You cannot get between
the lexer and the parser to do "magic". Thus, you have to use explicit
keyword tokens and if you want case insensitive ones, use regular
expressions (character classes) to specify them, same thing for substring keywords (and it doesn't warn you if your substrings are ambiguous).

So, while I normally laud Terence for his work on ANTLR. He gets far more right than wrong. This is a case where he got it wrong. (BTW, JavaCC
which borrows heavily from ANTLRs implementation, suffers the same way).

Now, being fair, ANTLR4 does have a unique feature in that regard in that
if you write a keyword or any token (in quotes) in your grammar, it is processed only when that rule is being recognized. That's part of how
ANTLR4 is more PEG-like. That is a different solution to the PL/I keywords
as identifiers issue than what we support. And as far as I can tell, it
even overrides the longest-match problem. Thus, if you have a place in
your grammar where you want only ">" to be recognized and not ">>", simply
put ">" in your parser at that location.

-- ******************************************************************************

Chris Clark email:
[email protected]
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kaz Kylheku@21:1/5 to [email protected] on Sun Jun 26 06:17:49 2022

On 2022-06-25, gah4 <[email protected]> wrote:

Some years ago, the Hercules group was trying to get gcc running
on an emulated IBM S/370 running MVS, with an 8M region.
(Out of the 16M byte address space, MVS takes up about half.)
But you can't run gcc in 8M bytes.

When I remember S/370 and OS/VS2, the usual region was 300K,
which we thought was big.

And now, we can barely run a system with 4G main memory,
such as the Macbook Air that I am writing this on.

My TXR Lisp hits a peak memory footprint of around 17 megabytes during
the build of its compiler and standard library (on a 32 bit GNU/Linux
system). If the CONFIG_SMALL_MEM build time option is used, it can get
down to 11. That's a total footprint including all the code and data
areas, such as mappings for the C library (which is a pig nowadays), as reported by common tools like top. (CONFIG_SMALL_MEM just adjusts the
sizes of some static arrays used by the garbage collector and sets
certain thresholds differently.)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[I think we should stop here before I start telling you what we did
on a 4K PDP-8. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Hans-Peter Diettrich@21:1/5 to All on Sun Jun 26 10:20:25 2022

On 6/25/22 10:01 PM, gah4 wrote:

And now, we can barely run a system with 4G main memory,
such as the Macbook Air that I am writing this on.

Squeeze your system and run WinXP in a 1G VM ;-)

DoDi

P.S.: Of course Linux were a better choice ;-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	128:25:41
Calls:	12,470
Files:	15,200
Messages:	6,538,464

Why don't compiler writers adhere to the dragon book recommendation of

Who's Online

System Info