Forum: >>> Magnum BBS <<<

How do you create a grammar for a multi-language language?

From Roger L Costello@21:1/5 to All on Thu Mar 3 13:57:55 2022

Hello Compiler Experts!

Suppose you are creating a grammar for a language that hosts other languages. For example, the (parent) language hosts the regular expression language and the XPath language. How do you create a grammar for a multi-language language? I can imagine two approaches:

1. Create the grammar for the parent language and copy and paste into it the grammars of the hosted languages. Copy-and-paste doesn't sound appealing.

2. Create a grammar just for the parent language. Then, create a parsing pipeline: parse the input first with the grammar for the parent language, then parse the input with the grammar for one hosted language, then parse the input with the grammar for the second hosted language, and so forth. I have no idea how this would work; e.g., how would an abstract syntax tree be constructed?

How do you create a grammar for a multi-language language?

/Roger

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Roger L Costello@21:1/5 to All on Sat Mar 5 22:29:37 2022

Kartik Agaram wrote:

* I'd re-ask the question in the context of the concrete languages you're considering.

I am working with a team to develop a new language and our current thinking is that the language will use the XPath language for navigating through documents and the CSS language for expressing sets of match/action pairs.

This idea of multi-language languages is very common within the XML community. For example, the XSLT language uses (hosts) the XPath language. Here is an excerpt to illustrate:

select="/Bookstore/Book[1]/Title"

The expression /Bookstore/Book[1]/Title is an XPath expression, the other
parts are XSLT. So, the format of the select statement is:

select="XPath"

See how XSLT hosts XPath? That is, one language (XSLT) is using another language (XPath).

Another XML language called XML Schema hosts two languages: the regex language and the XPath language.

How do compiler experts create grammars for multi-language languages?

/Roger

From: Kartik Agaram <[email protected]>
Sent: Saturday, March 5, 2022 4:58 PM
Subject: [EXT] Re: How do you create a grammar for a multi-language language?

The first way is unlikely to work. Inlining one grammar into another is overwhelmingly likely to result in garbage. If you're lucky it'll recognize
one of the languages and ignore the productions of the other. More likely
it'll recognize a single language that's an uncontrollable melange of both. In the worst case it'll seem to work but fail in strange corner cases.

In general, I think you under-estimate the difficulty of this problem, and how easy it is for two languages to be fundamentally incompatible. The first question isn't "what's the cleanest way to do this for arbitrary languages?" but "is it even possible for these two languages?" followed by "is it worth
the trouble?"

So I'd re-ask the question in the context of the concrete languages you're considering.
[XSLT is written in XML. Conceptually at least, first you parse the XML using an XML grammar, then you parse the XSLT using the XSLT grammar. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kartik Agaram@21:1/5 to All on Sat Mar 5 16:55:48 2022

I am working with a team to develop a new language and our current thinking is
that the language will use the XPath language for navigating through documents
and the CSS language for expressing sets of match/action pairs.

That feels like a much more tractable problem. Don't even think of it
as a multi-language language. It's a language you're designing (so you
have lots of room for maneuver) around XPath.

This idea of multi-language languages is very common within the XML community.
For example, the XSLT language uses (hosts) the XPath language. Here is an excerpt to illustrate:

select="/Bookstore/Book[1]/Title"

The expression /Bookstore/Book[1]/Title is an XPath expression, the other parts are XSLT. So, the format of the select statement is:

select="XPath"

See how XSLT hosts XPath? That is, one language (XSLT) is using another language (XPath).

Yeah, putting a second language inside the string literals of the
first is a very common approach. All the tooling for the outer
language can remain oblivious of the second. You don't even need to
design a new language, this approach works with any existing one, and
you can focus on implementing a library to process XPath.

If you do want to design another language, however, you can make things a lot nicer. For example:

- any URL is a valid literal in the Red programming language: https://github.com/red/docs/blob/master/en/datatypes.adoc
- the JSX extension of JavaScript used in the React framework permits component literals that look a lot like some sort of markup language: https://reactjs.org/docs/introducing-jsx.html

Kartik
http://akkartik.name/about

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to Roger L Costello on Sat Mar 5 21:10:46 2022

On Saturday, March 5, 2022 at 1:50:24 PM UTC-8, Roger L Costello wrote:

Hello Compiler Experts!

Suppose you are creating a grammar for a language that hosts other languages. For example, the (parent) language hosts the regular expression language and the XPath language. How do you create a grammar for a multi-language language?

A little it depends on how the two go together.

A not so unusual way to write compilers years ago, was recursive descent
for statements, and operator precedence for expressions. The recursive
descent parser calls the operator precedence parser when it needs an
expression to be parsed.

Some mixed languages can be parsed separately.

The C preprocessor, originally a separate pass, but now usually implemented together with the rest of the C compiler, processes its statements, and passes everything else through. It does that well enough, that it is commonly used with Fortran. (The traditional version, not the newer one.)

PHP is designed to recognize its syntax, and ignore everything else, which
is normally html, but could be another language.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to All on Sun Mar 6 15:37:13 2022

At first glance, I misread the question and thought about using
inheritance to do so. We used that in Yacc++ (the one from compiler
resources, not the one that comes with your C++ compiler) to solve
certain multi-language problems.

However, you are actually solving a "simpler" problem. And, the
standard approach to that is to embed the second language as a
"string" in the outer language. Many languages (and their compilers/interpreters) do that. That's exactly what your XSLT case
does. The XPATH code is simply a string in the XSLT language, and the
XSLT language doesn't attempt to parse it. It simply hands the code
off to an XPATH parser when in knows the string is XPATH code.

Now, the only problem with that has to do with nested strings. If
your inner language has strings, that probably gives you the
quoting/backslash problem. Many shells have this issue where to nest
strings one needs to add backslashes (and double existing backslashes) resulting in a terrible counting problem as to when one has enough
(and not too many) of them.

foo = "a \"nested\" string, worse one with an \\\"extra\\\" level of
nesting, imaging doing that \\\\\"more than once\\\\\" and getting
them to line up, recursive \\s are a pain"
letsPlay = "whose newline \\n is it \n" // a game where the rules are
made up and the formatting doesn't matter (or does it)

An alternative to that is often "raw strings" where the outer language
has a syntax where the outer delimiter says within it, ignore quotes
and backslashes etc, and accept anything up to the matching closing
outer delimiter. The best form of these allows one to "tag" the
delimiter with a key that the closing delimiter must include and
match. That way you can use the key to keep the closing delimiter
from being "reserved" in the inner text. Something like this:

foo = """xyzzy
flk" \n \" \\\ js // none of these are special and used verbatim
sdflkjs
ssfs ''''' xlkjxlvj // not a closing """ because it doesn't have
the key xyzzy
sflsfj
''''xyzzy; // now we closed the string

Note that matching the outer delimiter plus the tag might be beyond
the capacity of your lexer, especially if it only does regular
expressions.

A different alternative is to make "strings" that are special in your
language and switch grammars. For example, I like / as a string
delimiter for regular expressions.
Thus /regex/ is a "string" in the outer language, it doesn't parse its contents, but it also knows that the string is a regular expression
and should be handled as one.
Here is an example of that.

foo = /ab.c*([abc]|xb+)*/;

We actually use a variant of that in Yacc++. We treat {code} as a
string in Yacc++. We don't parse the code. We send it to the C++ or
C# (et al) compiler. However, we have modified the lexer (using an LR
rule rather than just a regular expression) so that we do a little bit
of parsing to allow nested { } pairs in the code and comments. That
takes a little work to do. In Yacc++, we made that easy by merging
regular expressions and LR rules into a unified model, but most lexer
and parser generators don't do that. (A shame in my opinion, but
that's just *my* *opinion*.)

foo : bar { if (x == y) { a = b } /* a C style comment with a /*
inside, C comments don't nest per spec */
while (a < b) { a = a << 1; // C++ comment with a } inside
} // end while
} /* close code */ bletch;

The disadvantage of doing that in the lexer is that we cannot
conveniently use { } pairs in the "outer language" syntax. Not
without lexer states or the equivalent.

But, notice in all these methods we don't actually do (much) parsing
of the string. We don't try to make "one" grammar out of them,
because what's really hard to do (unless the languages are quite
similar, and especially have the same tokenizing rules) is to embed
the second language as something other than some kind of string. It
can be done, but it can be a lot of work and fragile.

-- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Hans-Peter Diettrich@21:1/5 to Roger L Costello on Sun Mar 6 12:23:23 2022

On 3/3/22 2:57 PM, Roger L Costello wrote:

How do you create a grammar for a multi-language language?

A grammar is not required. In 1971 I saw a program source that could
tell the compiler it was compiled with - Fortran or Algol.

DoDi
[I've seen obfuscated programs that will work as shell scripts or
perl programs but I don't think that was what he was asking about.
-John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to Christopher F Clark on Sun Mar 6 14:36:01 2022

On Sunday, March 6, 2022 at 9:06:34 AM UTC-8, Christopher F Clark wrote:

(snip)

However, you are actually solving a "simpler" problem. And, the
standard approach to that is to embed the second language as a
"string" in the outer language. Many languages (and their compilers/interpreters) do that. That's exactly what your XSLT case
does. The XPATH code is simply a string in the XSLT language, and the
XSLT language doesn't attempt to parse it. It simply hands the code
off to an XPATH parser when in knows the string is XPATH code.

That works when the outer language is parsed first.

For PHP, and I believe the C preprocessor, the inner language is parsed
first, so the parser has to ignore everything, including quoting, in
the outer language.

I don't believe I ever tried preprocessor statements inside C string
constants, but as far as I know, it works.

One has to be careful to avoid ambiguities between the two,
or are created with the combination of the two languages.
[PHP effectively treats material between ?> and <?php as an instruction
to print the material as if it were a quoted string.

C says that the input is tokenized before it does the preprocessor
phase, so it does not look inside quoted strings. The # and ##
preprocessor operators allow some preprocessor time creation of quoted
strings. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Roger L Costello@21:1/5 to Chris Clark on Sun Mar 6 23:32:13 2022

Chris Clark wrote:

the standard approach to that is to embed the
second language as a "string" in the outer language.
Many languages (and their compilers/interpreters)
do that. That's exactly what your XSLT case
does. The XPATH code is simply a string in the
XSLT language, and the XSLT language doesn't
attempt to parse it. It simply hands the code
off to an XPATH parser when in knows the string
is XPATH code.

Okay, so there would be one grammar for XSLT, a second (independent) grammar for XPath. The grammar for XSLT just treats the XPath portions as strings. The grammar for XPath ignores the XSLT portions. So the input is processed in a pipeline fashion.

Is that correct?

But, but, but, ... how would an Abstract Syntax Tree be constructed when the input is processed in a pipeline?

/Roger
[I believe the answer is "incrementally." You can run the parsers sequentially or you can run them as coroutines. I don't think it makes much difference unless
the intermediate version is extremely huge, but that's rarely a problem with XML. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to All on Sun Mar 6 16:50:08 2022

On Sunday, March 6, 2022 at 3:22:57 PM UTC-8, gah4 wrote:

(snip)

I don't believe I ever tried preprocessor statements inside C string constants, but as far as I know, it works.

(snip)

[PHP effectively treats material between ?> and <?php as an instruction
to print the material as if it were a quoted string.

I suppose it works that way.

C says that the input is tokenized before it does the preprocessor
phase, so it does not look inside quoted strings. The # and ##
preprocessor operators allow some preprocessor time creation of quoted strings. -John]

It seems that the traditional C compiler, sometimes used with other
languages, such as Fortran, has different parsing and tokenizing rules.

gcc -E --traditional quote.c

will process files with a preprocessor statement inside quotes.

On the other hand, the result is likely not what was wanted,
at least not for C. Among others, it puts out lines like:

# 6 "quote.c" 2

which then end up inside the string.

On the other hand, if you language uses ' for other than strings,
or for two uses, then it should work.

One of the stranger things that I have known for about 50 years,
is direct access I/O in IBM Fortran IV. There are statements like:

WRITE(1'N) X, Y, Z

where N is the record in the direct access file. The compiler also
accepts string (Hollerith) constants with apostrophes.
(But not in direct access I/O statements.)

In any case, the --traditional C preprocessor is commonly used with Fortran, where I suspect C tokenizing could cause problems.

Some C preprocessors, but not the current ISO C version, and it
seems not gcc -E --traditional, will substitute preprocessor symbols
inside quoted strings.

As the OP notes, mixed parsing can have surprising effects!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Robin Vowels@21:1/5 to All on Mon Mar 7 13:39:33 2022

From: "gah4" <[email protected]>
Sent: Sunday, March 06, 2022 4:10 PM

The C preprocessor, originally a separate pass, but now usually implemented together with the rest of the C compiler, processes its statements, and passes
everything else through. It does that well enough, that it is commonly used with Fortran. (The traditional version, not the newer one.)

IBM's PL/I has a preprocessor that accepts a subset of PL/I.
The finished text is then passed to the compiler.

XPL processes such text as it goes, handling the text processing
before passing it to the compiler.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Hans-Peter Diettrich@21:1/5 to Hans-Peter Diettrich on Mon Mar 7 05:08:33 2022

On 3/6/22 12:23 PM, Hans-Peter Diettrich wrote:

I don't think that was what he was asking about.

The question is too unspecific for me. A (traditional) grammar covers
the parser part, while the lexer is specified differently. Languages
based on the same lexer can be merged into one grammar, no problem so
far. But if the second language shall apply to a single token
(string...) of the primary language then both languages are independent
and can not share a single grammar. Like a document can contain parts of several natural languages, where each language applies only to specific
parts of the documents and is subject to special lexer rules (RTL/LTR reading...).

In C/C++ and its preprocessor we have a special construct where the preprocessor tokens are refined/extended by the C/C++ lexer. And there
were discussions whether the preprocessor should look into string
literals (Siemens?), or whether the preprocessor can generate C comments (Microsoft). A separate preprocessor run at least allows for the latter,
as the preprocessed source code is lexed again and can build tokens
differently from the tokens in the original source code.

My conclusion:
A single (formal) grammar can not contain multiple languages. Unless you specify that e.g. statements and expressions in a programming language
shall be considered subject to different languages. Such nitpicking is
not worth further thoughts :-(

DoDi

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to Hans-Peter Diettrich on Sun Mar 6 21:22:17 2022

On Sunday, March 6, 2022 at 8:43:43 PM UTC-8, Hans-Peter Diettrich wrote:

(snip)

My conclusion:
A single (formal) grammar can not contain multiple languages. Unless you specify that e.g. statements and expressions in a programming language
shall be considered subject to different languages. Such nitpicking is
not worth further thoughts :-(

It would be complicated for compiled languages.

TeX allows one to change, character by character in the input, which characters are letters, and so used in a control sequence name.

LaTeX macros, to allow for internal names that don't conflict with any
user defined names, puts an @ sign in them, after changing @ to a letter.
Then, just before going into user code, changes @ back to not a letter. (Specifically, it is other.) The lexer can't read too far ahead, as the character
codes might change at any time.

It is also interesting to see how languages without reserved
words, keep track of which words have the keyword meaning,
and which are ordinary names.
[Back in the 1970s there were a bunch of extendible languages like EL/1 and IMP72 where you could add and change syntax on the fly. They all died since
it meant that in practice no two programs were written in the same language
and they were unreadable. Now we have overloading so you understand the
syntax but you don't know what it means. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to All on Tue Mar 8 21:46:44 2022

It is also interesting to see how languages without reserved
words, keep track of which words have the keyword meaning,
and which are ordinary names.

This is a topic close to my heart. There are actually a variety of
methods that work relatively well in this regard.

-----------------------------------------------

First, recognizing keywords, if you are using a language (unlike the
original FORTRAN where spaces were not significant), it is useful to
follow a model proposed by Frank Deremer, with lexing divided into a
scanner and a screener. The scanner has only the responsibility of
dividing tokens into discrete entities and labelling those which have
fixed types, so keyword recognition is not part of the scanner to the
scanner they are just identifiers. The screener then checks the
identifier to see if it is a keyword.

In most languages, that lookup can easily be done by preloading the
symbol table with the identifiers that are keywords. This is not
inefficient, because even if the identifier is not a keyword, you need
a unique symbol table entry for each identifier in any case, so you
are doing that lookup anyway.

The only wrinkle here is the case (which we have in Yacc++) where you
want your keywords to be case insensitive and your identifier to be
case sensitive. You do get an extra lookup in that case, but only the
first time an identifier is seen. Once, it is seen, you have marked
the symbol table entry with whether it is a keyword or not.

-----------------------------------------------

Now, the inverse problem. For keywords that are reserved words.
There is nothing more to do. However, if you only want your keywords
to be recognized as keywords in specific contexts, then you need the
second part of the solution. This is a [set of] parser rule[s] that
turns keywords back into identifiers if they aren't being used in a
context where they have a special meaning.

This rule looks something like this:

identifier: IDENTIFIER | IF | THEN | ELSE | WHILE | DO | ... ;

That is, you define an identifier non-terminal that matches either an IDENTIFIER token or any of your KEYWORD[ token]s. And, when you want
an identifier you use the non-terminal rather than token in your rule.

Now, this may cause "conflicts" or "ambiguities" and you resolve that
by figuring out which keywords are special in that context and making
an alternate idenitiier_in_this_context rule, which omits the
conflicting keywords from its list. (And, by the way, this is one
reason to use a parser generator, because it will report those
conflicts to you and tell you when you have some keyword that is still
an issue in some context. The checking that a parser generator does
helps you not make mistakes.) And, in most cases the parsers
lookahead can figure out whether you mean the keyword or the
identifier so that you rarely have to make those extra rules that
exclude specific keywords from the list.

-----------------------------------------------

The main place one gets in trouble is when you have a [set of]
keyword[s] that is/are optional before a list of identifiers. In that
case, your identifier list cannot start with one of those keywords.
However, a better solution (if you are designing your own language) is
to put the optional keywords before a mandatory one.

For example, in Yacc++ we have a KEYWORD definition, and a variation
on it for SUBSTRING keywords. But we don't use SUBSTRING as a
declaration by itself, so the rule for the declaration of a list of
keywords looks like:

keyword_declaration:
SUBSTRING? KEYWORD identifier ("=" number)? (","? identifier ("="
number)?)* ";" ;

If we did it in the other order, we couldn't allow substring as the
first identifier in the list.
That is:

keyword_declaration:
KEYWORD SUBSTRING? identifier_but_not_substring ("=" number)? (","?
identifier ("=" number)?)* ";" ;

And, note, if we had hand-written a recursive descent parser, we
wouldn't have been warned of that edge case. Either we would have
mis-parsed it or we would have that weird exception in our language.
Neither is ideal in my point of view.

And, if you go through it carefully, you can see that you can make a
much more complex set of rules that covers the second case, so that if
you saw the SUBSTRING keyword or your "substring" identifier was
followed by an equals or a comma or a semi-colon it was still legal.
Something to introduce subtle bugs in users' programs where a
seemingly innocuous change causes something to go from legal (and
meaning one thing) to illegal or meaning something else.

Or as our wise moderator might put it, you start creating a language
where the user is no longer certain what their code means.

-----------------------------------------------

By the way, we used both parts in Yacc++. We do have a small number
of reserved words, like yy_eof which we use to communicate between
lexer and the parser to indicate the end of the input. You cannot use
that identifier for any other purpose in our grammars. However, we
allow keywords like TOKEN to be used also as names of tokens or
non-terminals and we can tell when you are using it to declare a token
versus using it as the name of a token or non-terminal. So, most of
our keywords are just that context sensitive keywords that only have
reserved meanings in certain contexts. Otherwise, they are just
identifiers. I guess one can tell that before starting compiler
resources, my partner and I had spent a fair amount of time writing
and using compilers in PL/I dialects.

-- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------ [If anyone wants to know how to find the tokens in old space-insensitive Fortran
I can tell you, but it's as ugly as you might imagine. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to our moderator on Wed Mar 9 00:31:25 2022

(snip, our moderator wrote)

[If anyone wants to know how to find the tokens in old space-insensitive Fortran
I can tell you, but it's as ugly as you might imagine. -John]

As well as I know, the original 704 Fortran compiler has been lost to history, but the Fortran II compiler is still around.

Note, though, that fixed-form is still in the Fortran standard, and still has the
space-insensitivity. More popular for new programs, free-form is not space insensitive.

Also, to make it easier for programmers, some space is still optional in free-form Fortran, for old and new keywords. For example, both GO TO
and GOTO are allowed, I suspect with separate entries into the parsing
tables. PL/I also has both GO TO and GOTO statements.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Robin Vowels@21:1/5 to All on Thu Mar 10 10:00:22 2022

From: "gah4" <[email protected]>
Sent: Wednesday, March 09, 2022 7:31 PM

Also, to make it easier for programmers, some space is still optional in free-form Fortran, for old and new keywords. For example, both GO TO
and GOTO are allowed, I suspect with separate entries into the parsing tables. PL/I also has both GO TO and GOTO statements.

GO and GOTO are keywords in both FORTRAN and PL/I,
and they are not reserved.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Robin Vowels@21:1/5 to All on Thu Mar 10 09:55:46 2022

[If anyone wants to know how to find the tokens in old space-insensitive Fortran
I can tell you, but it's as ugly as you might imagine. -John]

No it's not.
See A. H. J. Sale, "The Classification of FORTRAN Statements", Computer Journal,
Vol. 14 No. 1, 1971.
[Having written production Fortran compilers, I can assure you that to
do it right is quite ugly. I looked at the article and he missed some things, and he made some unrealistic assumptions that a compiler would accept nothing beyond what was in the Basic Fortran standard. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Robin Vowels@21:1/5 to All on Thu Mar 10 11:59:42 2022

From: "Christopher F Clark" <[email protected]>
Sent: Wednesday, March 09, 2022 6:46 AM

First, recognizing keywords, if you are using a language (unlike the
original FORTRAN where spaces were not significant), it is useful to
follow a model proposed by Frank Deremer, with lexing divided into a
scanner and a screener. The scanner has only the responsibility of
dividing tokens into discrete entities and labelling those which have
fixed types, so keyword recognition is not part of the scanner to the
scanner they are just identifiers. The screener then checks the
identifier to see if it is a keyword.

A similar approach is used in DEUCE PL/I. (more in a moment).

In XPL, the scanner recognises keywords (which are reserved words)
and classifies them differently from identifiers.
The other objects (integers, strings, operators, parentheses, etc)
are classified appropriately.

Returning to DEUCE PL/I, the scanner classifies indentifiers
without making any distinction between identifiers and keywords.
Other objects are classified as above. The symbol table
already contains the keywords. All new identifiers are entered
into the symbol table.

In the semantics section, initial recognition is on groupings of
identifiers (again keywords are not recognised as such).
Once a phrase is recognised, the opening keywords in the
statement are compared with permissible keyword combinations,
without resorting to the symbol table. For example, PUT EDIT ( ...
PUT LIST ( ... and PUT DATA etc are identified after comparison with
the first word, and comparison with the second word is sufficient to
identify the kind of statement.
At this point, it can be said that the keywords are not reserved.
Avoiding looking in the symbol table for keywords saves time.

In most languages, that lookup can easily be done by preloading the
symbol table with the identifiers that are keywords. This is not inefficient, because even if the identifier is not a keyword, you need
a unique symbol table entry for each identifier in any case, so you
are doing that lookup anyway.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Our moderator on Thu Mar 10 07:07:25 2022

Our moderator wrote

[Having written production Fortran compilers, I can assure you that to
do it right is quite ugly.

I can confirm that.

Looking at how a recent compiler does that, I can slightly rephrase https://gcc.gnu.org/wiki/GFortranHacking :

The front end's action can be grouped into four phases:

Parsing: This converts source code into a stream of tokens which
describe the language. Because Fortran does not have reserved
keywords, the gfortran runs a series of matchers against code
trying to find one that matches a statement. On failing a match
an error message may be queued, and another matcher tried. If all
attempts at matching fail, the error queue is dumped to the user.

Resolution: This resolves things left over from the parsing phase,
such as types of expressions, and compile-time simplification of
constants. Many errors are issued in this phase. At the end of
this phase, the abstract syntax tree is finished.

[...]

This works (sort of) but has the drawback that cleaning up after an
error is error-prone (hah!) so there are a lot of ice-after-invalid
(internal compiler error after invalid code) issues that have emerged
over the years.
[As I recall, first I did a pass to fold continuation cards and mark
quoted and hollerith strings, then a pass to look for an equal sign
not inside parens and not followed by an comma, which said it was an
assignment statement, and then the parsing was straightforward. Eacn non-assignment statement starts with a keyword and it is easy to tell
from the syntax when to look for another keyword. Repeat for the
statement that follows a logical IF with a special case for the
statement numbers in an arithmetic IF. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Shift
  Fri Jul 31 06:46:34 2026
  from Leeds, England via SSH
- Centurion
  Fri Jul 31 00:59:56 2026
  from Berea, Ohio via Telnet
- Rixter
  Fri Jul 31 00:00:46 2026
  from Madison, Nc via Telnet
- Bob Worm
  Thu Jul 30 20:01:55 2026
  from Wales, Uk via Telnet
- Rixter
  Thu Jul 30 14:17:17 2026
  from Madison, Nc via Telnet
- Krenn
  Thu Jul 30 13:16:49 2026
  from Sydney, Nsw via Telnet
- Bob Worm
  Thu Jul 30 09:03:28 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:47:34 2026
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	108:28:18
Calls:	12,461
Calls today:	3
Files:	15,197
Messages:	6,538,101

How do you create a grammar for a multi-language language?

Who's Online

Recent Visitors

System Info