• Why does the lexer convert text integer lexemes to binary integers? I t

    From Roger L Costello@21:1/5 to All on Thu Jul 14 10:25:24 2022
    Hi Folks,

    A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic expressions. When the lexer encounters an integer lexeme, it casts the lexeme to a binary integer and returns the value to the parser. The lexer contains a rule that looks something like this:

    {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }

    But, but, but, ...

    Countless times on this list I have been told: Keep the lexer simple!

    By converting the lexeme to an integer, the lexer has assumed that the parser needs/wants a binary integer, not a text number. How does the lexer know what the parser needs/wants? That seems like knowledge the lexer shouldn't have if the lexer is to be simple. Further, even if one parser needs/wants a binary integer value, that parser might be swapped out at a later date and replaced with a different parser that wants the text number.

    It seems to me that the lexer should return to the parser the text number and it is the responsibility of the parser to convert the value to an integer data type if it desires.

    What do you think?

    /Roger
    [I think the lexer should provide the tokens that the parser needs. If
    integers are always handled as numbers, convert them, if not, don't.
    If the parser does one and later changes to do the other, you can
    change the lexer, too. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Roger L Costello on Thu Jul 14 10:03:18 2022
    On Thursday, July 14, 2022 at 8:10:56 AM UTC-7, Roger L Costello wrote:

    A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic expressions. When the lexer encounters an integer lexeme, it casts the lexeme to a binary integer and returns the value to the parser. The lexer contains a rule that looks something like this:

    {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }

    A common example is an RPN or algebraic calculator.
    (Which evaluates numerical expressions.)

    It is a nice simple example, compared to a complete programming language.

    For algebraic, you get into precedence and such, so real parsing problems,
    but in a very simple, and easy to follow, case.

    The other things, is that it isn't so easy to return character strings in C. You would malloc() it and copy the value over, and then hope that
    somewhere later it is free()d. (I suspect that is done more in Java.)

    In the STEP processor that I previously wrote about, the only return
    type is character string, and the processor keeps track of allocation.
    Macros that do arithmetic convert to numerical type, evaluate an
    expression, and convert back to characters.

    The code for the processor itself, as usual for compilers, is written
    using itself, has a macro to evaluate a constant integer
    expression. That is useful, as Fortran 66 (and Fortran 77)
    don't do that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to [email protected] on Thu Jul 14 16:38:32 2022
    On Thu, 14 Jul 2022 10:25:24 +0000, Roger L Costello
    <[email protected]> wrote:

    Hi Folks,

    A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic >expressions. When the lexer encounters an integer lexeme, it casts the lexeme >to a binary integer and returns the value to the parser. The lexer contains a >rule that looks something like this:

    {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }

    But, but, but, ...

    Countless times on this list I have been told: Keep the lexer simple!

    By converting the lexeme to an integer, the lexer has assumed that the parser >needs/wants a binary integer, not a text number. How does the lexer know what >the parser needs/wants?

    Presumably because the same developer that wrote the parser also wrote
    the rules for the lexer.

    In many (actually most) cases, the binary representation of an integer
    can be stored in less space than the text representation. When lex
    and yacc were created, memory was small (and extremely expensive) and
    storing things /small/ was very important.

    This is also why compilers "intern" symbols and most strings so as to
    store only one copy of any given text regardless of how many times it
    is seen.


    That seems like knowledge the lexer shouldn't have if
    the lexer is to be simple. Further, even if one parser needs/wants a binary >integer value, that parser might be swapped out at a later date and replaced >with a different parser that wants the text number.

    That's a valid point, but swapping "components" in that way is rarely,
    if ever, done. IME developers treat the lexer and parser combination
    as a single input subsystem.


    It seems to me that the lexer should return to the parser the text number and >it is the responsibility of the parser to convert the value to an integer data >type if it desires.

    What do you think?

    The general problem is performance. The place to decide what class a
    token falls into is where you contact the text.

    In principle the lexer could be as simple as strtok() - just
    separating the tokens, returning a pointer directly to its input
    buffer, and punting all further processing to the parser.

    But ...

    The lexer has already scanned the text just to locate the end of it.
    If it then simply gives the text to the parser, the parser must scan
    the text AGAIN to verify that it represents a number (or whatever the
    parser happens to be looking for at that point).

    Then the text gets scanned YET AGAIN to convert it to a number.

    That is 3 scans of the text if the parser does the work versus 2 scans
    if the lexer does the work.


    Now you may think numeric constants are short and occur rarely enough
    that this shouldn't matter ... and you'd be correct.

    But ...

    Remember that other things have to be recognized and classified as
    well: in particular language reserved words and symbols, and all kinds
    of user chosen names. So, a (possibly case sensitive) dictionary of
    already seen unique text must be maintained.

    Searching and maintaining the dictionary will involve yet more passes
    over the text: for indexing/hashing, for comparison, and for copying
    if the text is new and needs to be preserved.
    [These passes are unavoidable and must be done regardless of who does
    the work ... but they need to be mentioned nonetheless.]


    Moving (at least gross) input classification into the lexer avoids at
    least one pass over the entirety of the input. When considered
    against the length of the average input, that can add up to a LOT of
    avoided work.


    /Roger
    [I think the lexer should provide the tokens that the parser needs. If >integers are always handled as numbers, convert them, if not, don't.
    If the parser does one and later changes to do the other, you can
    change the lexer, too. -John]
    +1

    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jan Ziak@21:1/5 to George Neuner on Fri Jul 15 03:02:07 2022
    On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
    In many (actually most) cases, the binary representation of an integer
    can be stored in less space than the text representation.

    The output of a command such as (cd /usr/src/linux; grep --only-matching --recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.

    Binary, fixed-width, representation of integers is statistically more space-efficient compared to implicit-width textual representation only if the text representation of the integers includes (for example): a plain 32/64-bit pointer to the start of the text, a plain 16/32/64-bit relative/absolute
    offset to the start of the number in a character array, the [length of the textual form of the number] in an explicit form.

    Binary, fixed-width, representation of integers is more likely to be more space-efficient than [their textual representation with an implicit length] if the source of the integers doesn't originate from a human hand typing digits
    on a keyboard. For example, when the source of the integers is a
    high-precision measurement device, is something with a physical counterpart (such as: the GPS coordinates of objects spread across Earth's surface), etc.

    -atom
    [It's still hard to imagine that the size difference would matter in a compiler.
    If you're logging a million values a second and saving it to an archive, well, that's different. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Spiros Bousbouras@21:1/5 to George Neuner on Fri Jul 15 07:08:17 2022
    On Thu, 14 Jul 2022 16:38:32 -0400
    George Neuner <[email protected]> wrote:
    On Thu, 14 Jul 2022 10:25:24 +0000, Roger L Costello
    <[email protected]> wrote:

    Hi Folks,

    A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
    expressions. When the lexer encounters an integer lexeme, it casts the lexeme
    to a binary integer and returns the value to the parser. The lexer contains a
    rule that looks something like this:

    {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }

    But, but, but, ...

    Countless times on this list I have been told: Keep the lexer simple!

    By converting the lexeme to an integer, the lexer has assumed that the parser
    needs/wants a binary integer, not a text number. How does the lexer know what
    the parser needs/wants?

    Presumably because the same developer that wrote the parser also wrote
    the rules for the lexer.

    [...]

    It seems to me that the lexer should return to the parser the text number and
    it is the responsibility of the parser to convert the value to an integer data
    type if it desires.

    What do you think?

    The general problem is performance. The place to decide what class a
    token falls into is where you contact the text.

    In principle the lexer could be as simple as strtok() - just
    separating the tokens, returning a pointer directly to its input
    buffer, and punting all further processing to the parser.

    But ...

    The lexer has already scanned the text just to locate the end of it.
    If it then simply gives the text to the parser, the parser must scan
    the text AGAIN to verify that it represents a number (or whatever the
    parser happens to be looking for at that point).

    Then the text gets scanned YET AGAIN to convert it to a number.

    That is 3 scans of the text if the parser does the work versus 2 scans
    if the lexer does the work.

    I don't see why it has to be this way. The lexer can return a structure
    with components START , END , KIND which says that a token of kind , for example , "integer number" was found in the input stream starting at START
    and ending at END .The parser wouldn't have to verify again that it is
    a number and would simply go over the characters from START to END a 2nd
    time if it needs to convert the number to a different format. But if the
    parser knows that a number at this point is a syntax error then it can
    simply signal that and save the extra processing required for conversion
    i.e. call atoi() or whatever.

    [...]

    Moving (at least gross) input classification into the lexer avoids at
    least one pass over the entirety of the input. When considered
    against the length of the average input, that can add up to a LOT of
    avoided work.

    I don't think Roger was suggesting that the lexer should not do gross input classification (like recognise that a token is a number) but rather that perhaps it should not convert the number.
    [In flex lexers you can't count on text of the token being available
    after the action routine is done, so you have to make a copy and keep
    track of the copy. That is usually way more work than whatever else
    you might do with it. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Roger L Costello on Fri Jul 15 14:41:33 2022
    On 2022-07-14, Roger L Costello <[email protected]> wrote:
    Hi Folks,

    A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic expressions. When the lexer encounters an integer lexeme, it casts the lexeme to a binary integer and returns the value to the parser. The lexer contains a rule that looks something like this:

    {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }

    But, but, but, ...

    Countless times on this list I have been told: Keep the lexer simple!

    By converting the lexeme to an integer, the lexer has assumed that the parser needs/wants a binary integer, not a text number.

    By the facts alone that you have something called "yylval" with a "yylval.intval" member, you're encoding a tight integration with a
    parser, from which these things are coming.

    There is no "yylval" global variable in a lex-generated program.

    How does the lexer know what
    the parser needs/wants? That seems like knowledge the lexer shouldn't have if the lexer is to be simple.

    That's like asking, how does the ethernet driver in Linux know that
    the protocol stack wants a "struct sk_buff *"?

    That seems like knowledge the driver shouldn't have if it is to be
    simple, and usable in any operating system.

    Sometimes you just make integration decisions: you make the pieces agree
    on some convenient data structure for data interchange and move on,
    tossing aside concerns like whether a given piece can be reused in any conceivable software system with the greatest possible ease.

    Further, even if one parser needs/wants a binary
    integer value, that parser might be swapped out at a later date and replaced with a different parser that wants the text number.

    Chances are that will never happen; write for today.

    A lexer which preserves the original lexemes could be useful for
    software such as code reformatters.

    If you strongly suspect that you're working on some language that will eventually have tooling that includes a code reformatter, then maybe
    plan for that in the scanning/parsing.

    (Right? I mean you clearly don't want a source-to-source beautifier to
    be rewriting 0xFF as 255, except as carefully controlled option.)

    A lexer could prepare a token structure which has all the information;
    a converted semantic item, and the lexeme.

    It seems to me that the lexer should return to the parser the text number and it is the responsibility of the parser to convert the value to an integer data
    type if it desires.

    But how does the parser know what the parser's client wants? What if
    the parser's client wants a syntax tree with the textual lexemes, and
    not objects like binary integers?

    The same argument can be repeated here that the parser should just
    produce a parse tree with literal tokens, punting the conversion problem
    higher up.

    What do you think?

    How about this: a given token like INTEGER is processed in one place in
    the lexer, right? One lexing rule recognizes the integer and can convert
    the textual integer to a semantic, computational integer.

    Whereas in the parser's grammar, an INTEGER symbol could appear in
    multiple places.

    Would you rather add code to multiple places to convert a textual
    integer to a value, or do it in one place, and just have the value
    readily available in those dozens of places?

    That's the main driver of why we convert semantic values at the lexical
    level in the lex/yacc stack: so that if we have a rule like

    whatever : foo INTEGER ':' INTEGER

    we can just use $2 and $4 to refer to integer values, and not have
    to do:

    {
    int x = get_integer($2);
    int y = get_integer($4);

    $$ = function_of(x, y);

    free_lexeme($2); /* textual lexemes need dynamic alloc! */
    free_lexeme($4);
    }

    versus:

    { $$ = function_of($2, $4); }

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jan Ziak@21:1/5 to John on Fri Jul 15 10:50:21 2022
    On Friday, July 15, 2022 at 6:09:05 PM UTC+2, John wrote:
    It's still hard to imagine that the size difference would matter in a
    compiler.
    If you're logging a million values a second and saving it to an archive, well,
    that's different. -John

    Size matters if the compiler is built around data caches as its core principle (which means that not only the lexical phase is utilizing caches, but also means that optimizations are being cached as well). Retrieving cached data
    from a storage device (such as: DDR4 DRAM, NVMe SSD) is limited by current hardware constraints to approximately 10 GB/s in mainstream PCs, while the speed of computing a SHA-256 hash is limited to approximately 2 GB/s per CPU core. It is easy to imagine sketches of a compiler based on such a core principle - it is hard to create a complete "fully fledged" compiler based on such a core principle.

    Btw: Starting a sentence with "It's still hard to imagine that ..." pretty
    much ensures that a counter-example will be sent in response.

    -atom
    [Surely you're aware that the best way to get an answer to a question on
    the Internet is to post a wrong answer. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From [email protected]@21:1/5 to Roger L Costello on Sat Jul 16 05:32:58 2022
    On Thursday, 14 July 2022 at 11:10:56 UTC-4, Roger L Costello wrote:
    [...]
    It seems to me that the lexer should return to the parser the text number
    and
    it is the responsibility of the parser to convert the value to an integer
    data
    type if it desires.

    What do you think?

    The division of the job into lexing and parsing is *not* an important separation of concerns. Both of these are written at the same time, by the same person, and the requirements of the parser feed into the lexer in many detailed ways. Their specifications are highly coupled and, I expect, almost always written by the same person pretty much simultaneously.

    Instead, this division -- a great big line that divides one part of a context-free grammar from the other -- is an annoying practical concession
    that we make to improve performance in time and size (smaller tables, and optimization for text), and to get around the limitations in our tools (LR(1) is a lot less useful when the (1) is a character).

    This is specifically why all the answers here are giving you less than compelling practical justifications for parsing numbers in the lexer and
    nobody seems to mind. There really is no "should" and "should not" w.r.t. the division between lexing and parsing except what is practical.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Sun Jul 17 13:10:52 2022
    You are asking the wrong question. You are optimizing at the wrong level. Stop.

    Not that long ago, I wrote an article on Quora about this exact phenomenon.

    https://www.quora.com/What-do-most-programmers-do-when-optimizing-code-that-is-essentially-wrong/answer/Christopher-F-Clark-1?ch=10&oid=337787257&share=2807f9fb&srid=20qA&target_type=answer

    You are focusing on the trivial, the irrelevant. It is unlikely that
    having the lexer convert integers (or floats or quaternions) into a binary representation is a sufficiently expensive operation to make sense fretting about it. And I mean that both in terms of runtime and time spent thinking about it. Just do whatever lexer you are following from as an example does
    and assume they made the right choice. The only time you should pay
    attention to it, is once you have something working correctly and you have measured its performance and determined that the code in question is
    actually a bottleneck that is significantly impacting performance. Then,
    you have a reason to revisit that choice.

    And, when you do, you are as likely to find that the call to atoi is as
    much a problem as having the lexer do it. atoi is going to rescan those
    digits causing twice as much work as you would have done by building the numeric representation while you lexed it. Of course, that alternative has
    its own issues. It's clearly more complex code in the lexer.

    But, again, you shouldn't be worrying about any of that, until you have isolated it as an actual problem. Otherwise, figure out something simple
    that works. Having the lexer call atoi seems pretty simple. It is also
    likely to work over a rather large range of inputs. So, you haven't likely bought yourself any problems by doing that.

    Now, if you are doing SQL (or PL/I) float decimal numbers or handling of BIGINTs (arbitrarily large integers) or reduced rationals, you might want something more complex. But that's isn't the problem in this case.

    Don't spend your time worrying about details until you know they are
    important details.

    Kind regards,
    Chris

    -- ******************************************************************************

    Chris Clark email:
    [email protected]
    Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to [email protected] on Sun Jul 17 16:52:20 2022
    On Fri, 15 Jul 2022 03:02:07 -0700 (PDT), Jan Ziak
    <[email protected]> wrote:

    On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
    In many (actually most) cases, the binary representation of an integer
    can be stored in less space than the text representation.

    The output of a command such as (cd /usr/src/linux; grep --only-matching >--recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.

    Binary, fixed-width, representation of integers is statistically more >space-efficient compared to implicit-width textual representation only if the >text representation of the integers includes (for example): a plain 32/64-bit >pointer to the start of the text, a plain 16/32/64-bit relative/absolute >offset to the start of the number in a character array, the [length of the >textual form of the number] in an explicit form.

    ???

    Decimal 100 is 3 characters in text, but 1 byte in binary.
    65535 is 5 characters, but 2 bytes.
    2147483648 is 11 characters, but 4 bytes.

    0x101010 is 8 characters, 1 byte.
    052 is 2 characters, 1 byte.

    This dichotomy holds for almost all numbers: regardless of what number
    base is used for the text form, the binary form will be shorter.


    Binary, fixed-width, representation of integers is more likely to be more >space-efficient than [their textual representation with an implicit length] if >the source of the integers doesn't originate from a human hand typing digits >on a keyboard. For example, when the source of the integers is a >high-precision measurement device, is something with a physical counterpart >(such as: the GPS coordinates of objects spread across Earth's surface), etc.

    The source of the value is not relevant - only the form.

    -atom

    You completely misunderstood/misinterpreted what I wrote.

    I said the binary form CAN BE stored in less space than the text ... I
    said NOTHING about being restricted to fixed sized binary integers. It
    should be obvious to anyone that storing 42 into a 64-bit integer is
    not saving space.

    YMMV,
    George
    [I think the point was that variable length strings are more common than variable length binary integers, but as a message earlier today pointed out, it's not likely to make any difference in practical compilers. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to [email protected] on Sun Jul 17 18:01:39 2022
    On Sun, 17 Jul 2022 16:52:20 -0400, George Neuner
    <[email protected]> wrote:

    0x101010 is 8 characters, 1 byte.

    Duh! Of course that should be b101010, 7 characters and not
    representable in C. Oh well.

    George
    [It's a perfectly good integer but I think we're done beating the dead horse. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to [email protected] on Sun Jul 17 20:39:50 2022
    On Sunday, July 17, 2022 at 10:29:53 AM UTC-7, [email protected] wrote:
    You are asking the wrong question. You are optimizing at the wrong level. Stop.

    (snip)

    You are focusing on the trivial, the irrelevant. It is unlikely that
    having the lexer convert integers (or floats or quaternions) into a binary representation is a sufficiently expensive operation to make sense fretting about it.

    Not so long ago, I wrote about the whole idea of separating lexing and parsing was overoptimizing. It was needed many years ago, when computer memories
    were in kilobytes, not yet megabytes or gigabytes. But got almost no comment.

    Converting to integer has lots of problems, especially in the case of overflow.

    In some cases, and even for some people, separating lexing and parsing might make it easier to write and/or understand. Those are important.

    But maybe not.

    The STEP processor, which I have written about before, mostly doesn't
    separate them. There are some built-in rules related to where macro
    matches can start and end, and those are similar to the results of
    some lexical analysis. Some of those rules speed up processing, which
    was more important 45 years ago than now.

    I suspect that I believe now that they could be separate, but implemented
    by the same program. That would allow them to be separated where
    convenient for the user, and also not separated where it is easier.
    [In my experience separating the lexer from the parser makes it a lot easier
    to deal with common lexical situations like skipping white space and comments. You could certainly do that in a combined scheme but I'm not sure it would end up any simpler. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to George Neuner on Mon Jul 18 05:44:05 2022
    George Neuner <[email protected]> schrieb:
    On Fri, 15 Jul 2022 03:02:07 -0700 (PDT), Jan Ziak
    <[email protected]> wrote:

    On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
    In many (actually most) cases, the binary representation of an integer
    can be stored in less space than the text representation.

    The output of a command such as (cd /usr/src/linux; grep --only-matching >>--recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.

    Binary, fixed-width, representation of integers is statistically more >>space-efficient compared to implicit-width textual representation only if the >>text representation of the integers includes (for example): a plain 32/64-bit >>pointer to the start of the text, a plain 16/32/64-bit relative/absolute >>offset to the start of the number in a character array, the [length of the >>textual form of the number] in an explicit form.

    ???

    Decimal 100 is 3 characters in text, but 1 byte in binary.

    ... if it is known that the number is in the range that can be
    represented by a single byte. If it is stored into a four-byte
    integer, 100 takes four bytes (obviously). If there is a length
    specification, it might even be longer.

    But of course our moderator was right when he wrote

    [...]

    but as a message earlier today pointed out,
    it's not likely to make any difference in practical compilers. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to or moderator on Tue Jul 19 16:39:36 2022
    On Monday, July 18, 2022 at 9:30:51 AM UTC-7, gah4 wrote:

    (snip, or moderator wrote)

    [In my experience separating the lexer from the parser makes it a lot easier to deal with common lexical situations like skipping white space and comments.
    You could certainly do that in a combined scheme but I'm not sure it would end
    up any simpler. -John]

    Interesting. As I previously noted, STEP mostly doesn't do a separate lexical analysis.

    It does, however, do three things before the macros see the input: convert multiple
    blanks to a single blank, pass apostrophed strings through whole, and remove comments delimited by double quotes.

    Apostrophed strings are slightly more interesting. Internal double apostrophes are converted to single apostrophes, and the delimiting apostrophes are converted to a special character that isn't an input character.

    One of my projects 45 years ago, was to write macros to recognize the
    syntax of IBM OS/360 Fortran IV. Direct access I/O statements use
    a single apostrophe to delimit the record number:

    READ(1'N) X,Y,Z

    There is no way to write macros for that syntax after the previous processing.

    Much fun figuring out all the strange things done in programming language syntax over the years.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ev. Drikos@21:1/5 to All on Thu Jul 21 13:41:25 2022
    On 18/07/2022 06:39, gah4 wrote:
    [In my experience separating the lexer from the parser makes it a lot easier to deal with common lexical situations like skipping white space and comments.
    You could certainly do that in a combined scheme but I'm not sure it would end
    up any simpler. -John]

    Maybe not simpler, but it won't be necessarily more complex. I've just transferred an example for SQL from my old desktop. The FSA/GLR parser
    built can parse ie this command, without a scanner, unambiguously in the simulator:

    SELECT ALL FIRST,LAST FROM USERS;

    Below, there are four BNF rules. The first has in the right part the
    required lookahead operator '-:' as the grammar allows consecutive IDs.

    The second rule may be seen as a meta-rule that will be expanded. That
    is, the Builder will make the 'dirty' job for the programmer and attach
    the separator to each element listed in the 4th BNF rule below, which in
    turn must have only one symbol in each alternative (in the right side).

    IMHO, this grammar doesn't look very complex, but others may see it so.

    Regards,
    Ev. Drikos

    ------------------------------------------------------------------------

    <identifier body> ::=
    <identifier start> [ <identifier part> ... ] -: <identifier part>

    (<can be followed by separator>) ::=
    ( <can be followed by separator> ) <separator>
    | ( <can be followed by separator> )

    <separator> ::=
    { <comment> | <white spaces> }...

    <can be followed by separator> ::=
    <regular identifier>
    | <unsigned numeric literal>
    ..
    ...
    | <left brace>
    | <right brace>

    [I don't see how this grammar will allow a comment before the statement. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ev. Drikos@21:1/5 to Ev. Drikos on Fri Jul 22 12:29:40 2022
    On 21/07/2022 13:41, Ev. Drikos wrote:
    ...

    [I don't see how this grammar will allow a comment before the statement. -John]


    The where-used list of the separator contains four regular rules. Which
    means I've added them explicitly, as shown at the end of the message.

    What I suspect may be restrictive is that the lookahead operator accepts
    only the next token, simply one letter in a scannerless grammar. Yet, in
    SQL the (Unicode) delimited identifier doesn't end ie with a repeated
    symbol as the identifier body of a regular identifier.

    Here is the grammar for further inspection and error report: https://gist.github.com/drikosev/3104c3cb79b0b4d8f41da6162f4c65d3


    Regards,
    Ev. Drikos


    -----------------------------------------------------------------------
    <direct SQL statement> ::=
    [ <separator> ] <directly executable statement> <semicolon>

    <binary string literal>::= X<quote>[{<hexit><hexit>}...]<quote>
    [{<separator><quote>[{<hexit><hexit>}...]<quote>}...]


    <national character string literal> ::= (too long) ...

    <Unicode character string literal> ::= (too long) ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From luser droog@21:1/5 to All on Thu Jul 21 14:16:02 2022
    On Wednesday, July 20, 2022 at 7:45:07 PM UTC-5, gah4 wrote:
    On Monday, July 18, 2022 at 9:30:51 AM UTC-7, gah4 wrote:

    (snip, or moderator wrote)
    [In my experience separating the lexer from the parser makes it a lot easier
    to deal with common lexical situations like skipping white space and comments.
    You could certainly do that in a combined scheme but I'm not sure it would end
    up any simpler. -John]
    Interesting. As I previously noted, STEP mostly doesn't do a separate lexical analysis.

    It does, however, do three things before the macros see the input: convert multiple
    blanks to a single blank, pass apostrophed strings through whole, and remove comments delimited by double quotes.

    Apostrophed strings are slightly more interesting. Internal double apostrophes
    are converted to single apostrophes, and the delimiting apostrophes are converted to a special character that isn't an input character.

    One of my projects 45 years ago, was to write macros to recognize the
    syntax of IBM OS/360 Fortran IV. Direct access I/O statements use
    a single apostrophe to delimit the record number:

    READ(1'N) X,Y,Z

    There is no way to write macros for that syntax after the previous processing.

    Much fun figuring out all the strange things done in programming language syntax over the years.

    This approach appears to offer a very nice simplification for most Algol-style languages. But removing the white space entirely makes it harder (or impossible)
    to parse languages like Python and Haskell which use the "offside rule" to interpret the white space as delimiting multi-line constructs.

    I haven't solved the above completely, but I've been building my parser combinators with
    an eye towards supporting significant white space in the syntax analysis, while mostly ignoring it.

    It's parsers all the way down, but the parsers are designed to operate over lists,
    so the infrastructure is agnostic as to the actual type of the elements of the list.
    So, you can build the lexical analysis layer as a graph of parsers that work on lists of integers (characters) and produce Symbol objects. The syntax analysis layer can then be built as a graph of parsers that work on lists of Symbols.

    The symbol object has an extra slot for stashing extra data, so the white space can be captured and then hidden from the syntax analysis (unless some handler or predicate function wants to peek in there).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)