• How do you create a grammar for a multi-language language?

    From Roger L Costello@21:1/5 to All on Thu Mar 3 13:57:55 2022
    Hello Compiler Experts!

    Suppose you are creating a grammar for a language that hosts other languages. For example, the (parent) language hosts the regular expression language and the XPath language. How do you create a grammar for a multi-language language? I can imagine two approaches:

    1. Create the grammar for the parent language and copy and paste into it the grammars of the hosted languages. Copy-and-paste doesn't sound appealing.

    2. Create a grammar just for the parent language. Then, create a parsing pipeline: parse the input first with the grammar for the parent language, then parse the input with the grammar for one hosted language, then parse the input with the grammar for the second hosted language, and so forth. I have no idea how this would work; e.g., how would an abstract syntax tree be constructed?

    How do you create a grammar for a multi-language language?

    /Roger

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Roger L Costello@21:1/5 to All on Sat Mar 5 22:29:37 2022
    Kartik Agaram wrote:


    * I'd re-ask the question in the context of the concrete languages you're considering.

    I am working with a team to develop a new language and our current thinking is that the language will use the XPath language for navigating through documents and the CSS language for expressing sets of match/action pairs.

    This idea of multi-language languages is very common within the XML community. For example, the XSLT language uses (hosts) the XPath language. Here is an excerpt to illustrate:

    select="/Bookstore/Book[1]/Title"

    The expression /Bookstore/Book[1]/Title is an XPath expression, the other
    parts are XSLT. So, the format of the select statement is:

    select="XPath"

    See how XSLT hosts XPath? That is, one language (XSLT) is using another language (XPath).

    Another XML language called XML Schema hosts two languages: the regex language and the XPath language.

    How do compiler experts create grammars for multi-language languages?

    /Roger

    From: Kartik Agaram <[email protected]>
    Sent: Saturday, March 5, 2022 4:58 PM
    Subject: [EXT] Re: How do you create a grammar for a multi-language language?

    The first way is unlikely to work. Inlining one grammar into another is overwhelmingly likely to result in garbage. If you're lucky it'll recognize
    one of the languages and ignore the productions of the other. More likely
    it'll recognize a single language that's an uncontrollable melange of both. In the worst case it'll seem to work but fail in strange corner cases.

    In general, I think you under-estimate the difficulty of this problem, and how easy it is for two languages to be fundamentally incompatible. The first question isn't "what's the cleanest way to do this for arbitrary languages?" but "is it even possible for these two languages?" followed by "is it worth
    the trouble?"

    So I'd re-ask the question in the context of the concrete languages you're considering.
    [XSLT is written in XML. Conceptually at least, first you parse the XML using an XML grammar, then you parse the XSLT using the XSLT grammar. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kartik Agaram@21:1/5 to All on Sat Mar 5 16:55:48 2022
    I am working with a team to develop a new language and our current thinking is
    that the language will use the XPath language for navigating through documents
    and the CSS language for expressing sets of match/action pairs.

    That feels like a much more tractable problem. Don't even think of it
    as a multi-language language. It's a language you're designing (so you
    have lots of room for maneuver) around XPath.

    This idea of multi-language languages is very common within the XML community.
    For example, the XSLT language uses (hosts) the XPath language. Here is an excerpt to illustrate:

    select="/Bookstore/Book[1]/Title"

    The expression /Bookstore/Book[1]/Title is an XPath expression, the other parts are XSLT. So, the format of the select statement is:

    select="XPath"

    See how XSLT hosts XPath? That is, one language (XSLT) is using another language (XPath).

    Yeah, putting a second language inside the string literals of the
    first is a very common approach. All the tooling for the outer
    language can remain oblivious of the second. You don't even need to
    design a new language, this approach works with any existing one, and
    you can focus on implementing a library to process XPath.

    If you do want to design another language, however, you can make things a lot nicer. For example:

    - any URL is a valid literal in the Red programming language: https://github.com/red/docs/blob/master/en/datatypes.adoc
    - the JSX extension of JavaScript used in the React framework permits component literals that look a lot like some sort of markup language: https://reactjs.org/docs/introducing-jsx.html

    Kartik
    http://akkartik.name/about

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Roger L Costello on Sat Mar 5 21:10:46 2022
    On Saturday, March 5, 2022 at 1:50:24 PM UTC-8, Roger L Costello wrote:
    Hello Compiler Experts!

    Suppose you are creating a grammar for a language that hosts other languages. For example, the (parent) language hosts the regular expression language and the XPath language. How do you create a grammar for a multi-language language?

    A little it depends on how the two go together.

    A not so unusual way to write compilers years ago, was recursive descent
    for statements, and operator precedence for expressions. The recursive
    descent parser calls the operator precedence parser when it needs an
    expression to be parsed.

    Some mixed languages can be parsed separately.

    The C preprocessor, originally a separate pass, but now usually implemented together with the rest of the C compiler, processes its statements, and passes everything else through. It does that well enough, that it is commonly used with Fortran. (The traditional version, not the newer one.)

    PHP is designed to recognize its syntax, and ignore everything else, which
    is normally html, but could be another language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Sun Mar 6 15:37:13 2022
    At first glance, I misread the question and thought about using
    inheritance to do so. We used that in Yacc++ (the one from compiler
    resources, not the one that comes with your C++ compiler) to solve
    certain multi-language problems.

    However, you are actually solving a "simpler" problem. And, the
    standard approach to that is to embed the second language as a
    "string" in the outer language. Many languages (and their compilers/interpreters) do that. That's exactly what your XSLT case
    does. The XPATH code is simply a string in the XSLT language, and the
    XSLT language doesn't attempt to parse it. It simply hands the code
    off to an XPATH parser when in knows the string is XPATH code.

    Now, the only problem with that has to do with nested strings. If
    your inner language has strings, that probably gives you the
    quoting/backslash problem. Many shells have this issue where to nest
    strings one needs to add backslashes (and double existing backslashes) resulting in a terrible counting problem as to when one has enough
    (and not too many) of them.

    foo = "a \"nested\" string, worse one with an \\\"extra\\\" level of
    nesting, imaging doing that \\\\\"more than once\\\\\" and getting
    them to line up, recursive \\s are a pain"
    letsPlay = "whose newline \\n is it \n" // a game where the rules are
    made up and the formatting doesn't matter (or does it)

    An alternative to that is often "raw strings" where the outer language
    has a syntax where the outer delimiter says within it, ignore quotes
    and backslashes etc, and accept anything up to the matching closing
    outer delimiter. The best form of these allows one to "tag" the
    delimiter with a key that the closing delimiter must include and
    match. That way you can use the key to keep the closing delimiter
    from being "reserved" in the inner text. Something like this:

    foo = """xyzzy
    flk" \n \" \\\ js // none of these are special and used verbatim
    sdflkjs
    ssfs ''''' xlkjxlvj // not a closing """ because it doesn't have
    the key xyzzy
    sflsfj
    ''''xyzzy; // now we closed the string

    Note that matching the outer delimiter plus the tag might be beyond
    the capacity of your lexer, especially if it only does regular
    expressions.

    A different alternative is to make "strings" that are special in your
    language and switch grammars. For example, I like / as a string
    delimiter for regular expressions.
    Thus /regex/ is a "string" in the outer language, it doesn't parse its contents, but it also knows that the string is a regular expression
    and should be handled as one.
    Here is an example of that.

    foo = /ab.c*([abc]|xb+)*/;

    We actually use a variant of that in Yacc++. We treat {code} as a
    string in Yacc++. We don't parse the code. We send it to the C++ or
    C# (et al) compiler. However, we have modified the lexer (using an LR
    rule rather than just a regular expression) so that we do a little bit
    of parsing to allow nested { } pairs in the code and comments. That
    takes a little work to do. In Yacc++, we made that easy by merging
    regular expressions and LR rules into a unified model, but most lexer
    and parser generators don't do that. (A shame in my opinion, but
    that's just *my* *opinion*.)

    foo : bar { if (x == y) { a = b } /* a C style comment with a /*
    inside, C comments don't nest per spec */
    while (a < b) { a = a << 1; // C++ comment with a } inside
    } // end while
    } /* close code */ bletch;

    The disadvantage of doing that in the lexer is that we cannot
    conveniently use { } pairs in the "outer language" syntax. Not
    without lexer states or the equivalent.

    But, notice in all these methods we don't actually do (much) parsing
    of the string. We don't try to make "one" grammar out of them,
    because what's really hard to do (unless the languages are quite
    similar, and especially have the same tokenizing rules) is to embed
    the second language as something other than some kind of string. It
    can be done, but it can be a lot of work and fragile.

    -- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans-Peter Diettrich@21:1/5 to Roger L Costello on Sun Mar 6 12:23:23 2022
    On 3/3/22 2:57 PM, Roger L Costello wrote:
    How do you create a grammar for a multi-language language?

    A grammar is not required. In 1971 I saw a program source that could
    tell the compiler it was compiled with - Fortran or Algol.

    DoDi
    [I've seen obfuscated programs that will work as shell scripts or
    perl programs but I don't think that was what he was asking about.
    -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Christopher F Clark on Sun Mar 6 14:36:01 2022
    On Sunday, March 6, 2022 at 9:06:34 AM UTC-8, Christopher F Clark wrote:

    (snip)

    However, you are actually solving a "simpler" problem. And, the
    standard approach to that is to embed the second language as a
    "string" in the outer language. Many languages (and their compilers/interpreters) do that. That's exactly what your XSLT case
    does. The XPATH code is simply a string in the XSLT language, and the
    XSLT language doesn't attempt to parse it. It simply hands the code
    off to an XPATH parser when in knows the string is XPATH code.

    That works when the outer language is parsed first.

    For PHP, and I believe the C preprocessor, the inner language is parsed
    first, so the parser has to ignore everything, including quoting, in
    the outer language.

    I don't believe I ever tried preprocessor statements inside C string
    constants, but as far as I know, it works.

    One has to be careful to avoid ambiguities between the two,
    or are created with the combination of the two languages.
    [PHP effectively treats material between ?> and <?php as an instruction
    to print the material as if it were a quoted string.

    C says that the input is tokenized before it does the preprocessor
    phase, so it does not look inside quoted strings. The # and ##
    preprocessor operators allow some preprocessor time creation of quoted
    strings. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Roger L Costello@21:1/5 to Chris Clark on Sun Mar 6 23:32:13 2022
    Chris Clark wrote:

    the standard approach to that is to embed the
    second language as a "string" in the outer language.
    Many languages (and their compilers/interpreters)
    do that. That's exactly what your XSLT case
    does. The XPATH code is simply a string in the
    XSLT language, and the XSLT language doesn't
    attempt to parse it. It simply hands the code
    off to an XPATH parser when in knows the string
    is XPATH code.

    Okay, so there would be one grammar for XSLT, a second (independent) grammar for XPath. The grammar for XSLT just treats the XPath portions as strings. The grammar for XPath ignores the XSLT portions. So the input is processed in a pipeline fashion.

    Is that correct?

    But, but, but, ... how would an Abstract Syntax Tree be constructed when the input is processed in a pipeline?

    /Roger
    [I believe the answer is "incrementally." You can run the parsers sequentially or you can run them as coroutines. I don't think it makes much difference unless
    the intermediate version is extremely huge, but that's rarely a problem with XML. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to All on Sun Mar 6 16:50:08 2022
    On Sunday, March 6, 2022 at 3:22:57 PM UTC-8, gah4 wrote:

    (snip)

    I don't believe I ever tried preprocessor statements inside C string constants, but as far as I know, it works.
    (snip)

    [PHP effectively treats material between ?> and <?php as an instruction
    to print the material as if it were a quoted string.

    I suppose it works that way.

    C says that the input is tokenized before it does the preprocessor
    phase, so it does not look inside quoted strings. The # and ##
    preprocessor operators allow some preprocessor time creation of quoted strings. -John]

    It seems that the traditional C compiler, sometimes used with other
    languages, such as Fortran, has different parsing and tokenizing rules.

    gcc -E --traditional quote.c

    will process files with a preprocessor statement inside quotes.

    On the other hand, the result is likely not what was wanted,
    at least not for C. Among others, it puts out lines like:

    # 6 "quote.c" 2

    which then end up inside the string.

    On the other hand, if you language uses ' for other than strings,
    or for two uses, then it should work.

    One of the stranger things that I have known for about 50 years,
    is direct access I/O in IBM Fortran IV. There are statements like:

    WRITE(1'N) X, Y, Z

    where N is the record in the direct access file. The compiler also
    accepts string (Hollerith) constants with apostrophes.
    (But not in direct access I/O statements.)

    In any case, the --traditional C preprocessor is commonly used with Fortran, where I suspect C tokenizing could cause problems.

    Some C preprocessors, but not the current ISO C version, and it
    seems not gcc -E --traditional, will substitute preprocessor symbols
    inside quoted strings.

    As the OP notes, mixed parsing can have surprising effects!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robin Vowels@21:1/5 to All on Mon Mar 7 13:39:33 2022
    From: "gah4" <[email protected]>
    Sent: Sunday, March 06, 2022 4:10 PM

    The C preprocessor, originally a separate pass, but now usually implemented together with the rest of the C compiler, processes its statements, and passes
    everything else through. It does that well enough, that it is commonly used with Fortran. (The traditional version, not the newer one.)

    IBM's PL/I has a preprocessor that accepts a subset of PL/I.
    The finished text is then passed to the compiler.

    XPL processes such text as it goes, handling the text processing
    before passing it to the compiler.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans-Peter Diettrich@21:1/5 to Hans-Peter Diettrich on Mon Mar 7 05:08:33 2022
    On 3/6/22 12:23 PM, Hans-Peter Diettrich wrote:
    I don't think that was what he was asking about.

    The question is too unspecific for me. A (traditional) grammar covers
    the parser part, while the lexer is specified differently. Languages
    based on the same lexer can be merged into one grammar, no problem so
    far. But if the second language shall apply to a single token
    (string...) of the primary language then both languages are independent
    and can not share a single grammar. Like a document can contain parts of several natural languages, where each language applies only to specific
    parts of the documents and is subject to special lexer rules (RTL/LTR reading...).

    In C/C++ and its preprocessor we have a special construct where the preprocessor tokens are refined/extended by the C/C++ lexer. And there
    were discussions whether the preprocessor should look into string
    literals (Siemens?), or whether the preprocessor can generate C comments (Microsoft). A separate preprocessor run at least allows for the latter,
    as the preprocessed source code is lexed again and can build tokens
    differently from the tokens in the original source code.

    My conclusion:
    A single (formal) grammar can not contain multiple languages. Unless you specify that e.g. statements and expressions in a programming language
    shall be considered subject to different languages. Such nitpicking is
    not worth further thoughts :-(

    DoDi

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Hans-Peter Diettrich on Sun Mar 6 21:22:17 2022
    On Sunday, March 6, 2022 at 8:43:43 PM UTC-8, Hans-Peter Diettrich wrote:

    (snip)
    My conclusion:
    A single (formal) grammar can not contain multiple languages. Unless you specify that e.g. statements and expressions in a programming language
    shall be considered subject to different languages. Such nitpicking is
    not worth further thoughts :-(

    It would be complicated for compiled languages.

    TeX allows one to change, character by character in the input, which characters are letters, and so used in a control sequence name.

    LaTeX macros, to allow for internal names that don't conflict with any
    user defined names, puts an @ sign in them, after changing @ to a letter.
    Then, just before going into user code, changes @ back to not a letter. (Specifically, it is other.) The lexer can't read too far ahead, as the character
    codes might change at any time.

    It is also interesting to see how languages without reserved
    words, keep track of which words have the keyword meaning,
    and which are ordinary names.
    [Back in the 1970s there were a bunch of extendible languages like EL/1 and IMP72 where you could add and change syntax on the fly. They all died since
    it meant that in practice no two programs were written in the same language
    and they were unreadable. Now we have overloading so you understand the
    syntax but you don't know what it means. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Tue Mar 8 21:46:44 2022
    It is also interesting to see how languages without reserved
    words, keep track of which words have the keyword meaning,
    and which are ordinary names.

    This is a topic close to my heart. There are actually a variety of
    methods that work relatively well in this regard.

    -----------------------------------------------

    First, recognizing keywords, if you are using a language (unlike the
    original FORTRAN where spaces were not significant), it is useful to
    follow a model proposed by Frank Deremer, with lexing divided into a
    scanner and a screener. The scanner has only the responsibility of
    dividing tokens into discrete entities and labelling those which have
    fixed types, so keyword recognition is not part of the scanner to the
    scanner they are just identifiers. The screener then checks the
    identifier to see if it is a keyword.

    In most languages, that lookup can easily be done by preloading the
    symbol table with the identifiers that are keywords. This is not
    inefficient, because even if the identifier is not a keyword, you need
    a unique symbol table entry for each identifier in any case, so you
    are doing that lookup anyway.

    The only wrinkle here is the case (which we have in Yacc++) where you
    want your keywords to be case insensitive and your identifier to be
    case sensitive. You do get an extra lookup in that case, but only the
    first time an identifier is seen. Once, it is seen, you have marked
    the symbol table entry with whether it is a keyword or not.

    -----------------------------------------------

    Now, the inverse problem. For keywords that are reserved words.
    There is nothing more to do. However, if you only want your keywords
    to be recognized as keywords in specific contexts, then you need the
    second part of the solution. This is a [set of] parser rule[s] that
    turns keywords back into identifiers if they aren't being used in a
    context where they have a special meaning.

    This rule looks something like this:

    identifier: IDENTIFIER | IF | THEN | ELSE | WHILE | DO | ... ;

    That is, you define an identifier non-terminal that matches either an IDENTIFIER token or any of your KEYWORD[ token]s. And, when you want
    an identifier you use the non-terminal rather than token in your rule.

    Now, this may cause "conflicts" or "ambiguities" and you resolve that
    by figuring out which keywords are special in that context and making
    an alternate idenitiier_in_this_context rule, which omits the
    conflicting keywords from its list. (And, by the way, this is one
    reason to use a parser generator, because it will report those
    conflicts to you and tell you when you have some keyword that is still
    an issue in some context. The checking that a parser generator does
    helps you not make mistakes.) And, in most cases the parsers
    lookahead can figure out whether you mean the keyword or the
    identifier so that you rarely have to make those extra rules that
    exclude specific keywords from the list.

    -----------------------------------------------

    The main place one gets in trouble is when you have a [set of]
    keyword[s] that is/are optional before a list of identifiers. In that
    case, your identifier list cannot start with one of those keywords.
    However, a better solution (if you are designing your own language) is
    to put the optional keywords before a mandatory one.

    For example, in Yacc++ we have a KEYWORD definition, and a variation
    on it for SUBSTRING keywords. But we don't use SUBSTRING as a
    declaration by itself, so the rule for the declaration of a list of
    keywords looks like:

    keyword_declaration:
    SUBSTRING? KEYWORD identifier ("=" number)? (","? identifier ("="
    number)?)* ";" ;

    If we did it in the other order, we couldn't allow substring as the
    first identifier in the list.
    That is:

    keyword_declaration:
    KEYWORD SUBSTRING? identifier_but_not_substring ("=" number)? (","?
    identifier ("=" number)?)* ";" ;

    And, note, if we had hand-written a recursive descent parser, we
    wouldn't have been warned of that edge case. Either we would have
    mis-parsed it or we would have that weird exception in our language.
    Neither is ideal in my point of view.

    And, if you go through it carefully, you can see that you can make a
    much more complex set of rules that covers the second case, so that if
    you saw the SUBSTRING keyword or your "substring" identifier was
    followed by an equals or a comma or a semi-colon it was still legal.
    Something to introduce subtle bugs in users' programs where a
    seemingly innocuous change causes something to go from legal (and
    meaning one thing) to illegal or meaning something else.

    Or as our wise moderator might put it, you start creating a language
    where the user is no longer certain what their code means.

    -----------------------------------------------

    By the way, we used both parts in Yacc++. We do have a small number
    of reserved words, like yy_eof which we use to communicate between
    lexer and the parser to indicate the end of the input. You cannot use
    that identifier for any other purpose in our grammars. However, we
    allow keywords like TOKEN to be used also as names of tokens or
    non-terminals and we can tell when you are using it to declare a token
    versus using it as the name of a token or non-terminal. So, most of
    our keywords are just that context sensitive keywords that only have
    reserved meanings in certain contexts. Otherwise, they are just
    identifiers. I guess one can tell that before starting compiler
    resources, my partner and I had spent a fair amount of time writing
    and using compilers in PL/I dialects.

    -- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------ [If anyone wants to know how to find the tokens in old space-insensitive Fortran
    I can tell you, but it's as ugly as you might imagine. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to our moderator on Wed Mar 9 00:31:25 2022
    (snip, our moderator wrote)

    [If anyone wants to know how to find the tokens in old space-insensitive Fortran
    I can tell you, but it's as ugly as you might imagine. -John]

    As well as I know, the original 704 Fortran compiler has been lost to history, but the Fortran II compiler is still around.

    Note, though, that fixed-form is still in the Fortran standard, and still has the
    space-insensitivity. More popular for new programs, free-form is not space insensitive.

    Also, to make it easier for programmers, some space is still optional in free-form Fortran, for old and new keywords. For example, both GO TO
    and GOTO are allowed, I suspect with separate entries into the parsing
    tables. PL/I also has both GO TO and GOTO statements.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robin Vowels@21:1/5 to All on Thu Mar 10 10:00:22 2022
    From: "gah4" <[email protected]>
    Sent: Wednesday, March 09, 2022 7:31 PM

    Also, to make it easier for programmers, some space is still optional in free-form Fortran, for old and new keywords. For example, both GO TO
    and GOTO are allowed, I suspect with separate entries into the parsing tables. PL/I also has both GO TO and GOTO statements.

    GO and GOTO are keywords in both FORTRAN and PL/I,
    and they are not reserved.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robin Vowels@21:1/5 to All on Thu Mar 10 09:55:46 2022
    [If anyone wants to know how to find the tokens in old space-insensitive Fortran
    I can tell you, but it's as ugly as you might imagine. -John]

    No it's not.
    See A. H. J. Sale, "The Classification of FORTRAN Statements", Computer Journal,
    Vol. 14 No. 1, 1971.
    [Having written production Fortran compilers, I can assure you that to
    do it right is quite ugly. I looked at the article and he missed some things, and he made some unrealistic assumptions that a compiler would accept nothing beyond what was in the Basic Fortran standard. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robin Vowels@21:1/5 to All on Thu Mar 10 11:59:42 2022
    From: "Christopher F Clark" <[email protected]>
    Sent: Wednesday, March 09, 2022 6:46 AM

    First, recognizing keywords, if you are using a language (unlike the
    original FORTRAN where spaces were not significant), it is useful to
    follow a model proposed by Frank Deremer, with lexing divided into a
    scanner and a screener. The scanner has only the responsibility of
    dividing tokens into discrete entities and labelling those which have
    fixed types, so keyword recognition is not part of the scanner to the
    scanner they are just identifiers. The screener then checks the
    identifier to see if it is a keyword.

    A similar approach is used in DEUCE PL/I. (more in a moment).

    In XPL, the scanner recognises keywords (which are reserved words)
    and classifies them differently from identifiers.
    The other objects (integers, strings, operators, parentheses, etc)
    are classified appropriately.

    Returning to DEUCE PL/I, the scanner classifies indentifiers
    without making any distinction between identifiers and keywords.
    Other objects are classified as above. The symbol table
    already contains the keywords. All new identifiers are entered
    into the symbol table.

    In the semantics section, initial recognition is on groupings of
    identifiers (again keywords are not recognised as such).
    Once a phrase is recognised, the opening keywords in the
    statement are compared with permissible keyword combinations,
    without resorting to the symbol table. For example, PUT EDIT ( ...
    PUT LIST ( ... and PUT DATA etc are identified after comparison with
    the first word, and comparison with the second word is sufficient to
    identify the kind of statement.
    At this point, it can be said that the keywords are not reserved.
    Avoiding looking in the symbol table for keywords saves time.

    In most languages, that lookup can easily be done by preloading the
    symbol table with the identifiers that are keywords. This is not inefficient, because even if the identifier is not a keyword, you need
    a unique symbol table entry for each identifier in any case, so you
    are doing that lookup anyway.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Our moderator on Thu Mar 10 07:07:25 2022
    Our moderator wrote

    [Having written production Fortran compilers, I can assure you that to
    do it right is quite ugly.

    I can confirm that.

    Looking at how a recent compiler does that, I can slightly rephrase https://gcc.gnu.org/wiki/GFortranHacking :

    The front end's action can be grouped into four phases:

    Parsing: This converts source code into a stream of tokens which
    describe the language. Because Fortran does not have reserved
    keywords, the gfortran runs a series of matchers against code
    trying to find one that matches a statement. On failing a match
    an error message may be queued, and another matcher tried. If all
    attempts at matching fail, the error queue is dumped to the user.

    Resolution: This resolves things left over from the parsing phase,
    such as types of expressions, and compile-time simplification of
    constants. Many errors are issued in this phase. At the end of
    this phase, the abstract syntax tree is finished.

    [...]

    This works (sort of) but has the drawback that cleaning up after an
    error is error-prone (hah!) so there are a lot of ice-after-invalid
    (internal compiler error after invalid code) issues that have emerged
    over the years.
    [As I recall, first I did a pass to fold continuation cards and mark
    quoted and hollerith strings, then a pass to look for an equal sign
    not inside parens and not followed by an comma, which said it was an
    assignment statement, and then the parsing was straightforward. Eacn non-assignment statement starts with a keyword and it is easy to tell
    from the syntax when to look for another keyword. Repeat for the
    statement that follows a logical IF with a special case for the
    statement numbers in an arithmetic IF. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)