• An ingenious scanning problem!

    From Roger L Costello@21:1/5 to All on Sat May 28 19:03:09 2022
    Hi Folks,

    Page 115-116 of the Flex User's Manual [1] has an ingenious problem posed by Jan Kort on 05 September 1998.

    Here's a description of the problem:

    The input contains this: TEST1\n

    That is, the input contains the string TEST1 followed by a newline.

    The Flex scanner contains a rule that matches on the input:

    TEST1\n { action }

    Here's a description of its action:

    Of the text that was matched, I only
    want the first 5 characters (i.e., TEST1).

    That action is expressed in Flex this way:

    TEST1\n { yyless(5); }

    In the scanner there are also two newline rules.

    One newline rule matches on a newline at the beginning of a line (i.e., empty line):

    ^\n { action }

    The other newline rule matches on a newline at the end of a line:

    \n { action }

    Question: After the action for TEST1\n is executed, which newline rule is executed?

    The answer depends on the meaning of yyless(). Here are two possible meanings for yyless():

    (a) The scanner is backing up in the input stream.
    (b) The scanner is pushing new characters onto the input stream.

    If the meaning of yyless() is (a), then the second newline rule will be executed.

    If the meaning of yyless() is (b), then the first newline rule will be executed.

    So which does Flex do?

    ............................

    Answer: (b)

    Flex treats the newline from TEST1\n as a newline at the beginning of a line (i.e., empty line). Flex executes this rule:

    ^\n { action }

    Such an ingenious problem!

    Comments?

    /Roger

    [1] https://epaperpress.com/lexandyacc/download/flex.pdf
    [This is a good example of the reasons you don't want your lexer to be too clever. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Wed Jun 1 13:34:22 2022
    I will skip most of Roger's posting prior to the question:

    Question: After the action for TEST1\n is executed, which newline rule is executed?

    The answer depends on the meaning of yyless(). Here are two possible meanings for yyless():

    (a) The scanner is backing up in the input stream.
    (b) The scanner is pushing new characters onto the input stream.

    I find the Flex choice of "b" to be semantically incorrect and a bug.
    because you cannot get functionality "a" if you do that. However, I
    suspect that the bug is relatively intractable. It may be quite hard
    to back up the input stream and recreate the proper lexer state,
    especially since I suspect that newlines are not part of the lexer
    state but an additional "widget" on the side.

    Note, that an alternate solution is to have what PCRE calls lookahead zero-width assertions. I think the original LEX had them for this
    case.

    That is, you don't use yyless for this case, but instead you write a rule like:

    TEST1 / \n

    where the "/" is an operator in PCRE it would look something like this

    TEST1 (?=\n)

    In either case, the \n is "peeked" at but not consumed as part of the
    token and not otherwise processed at all. So, you don't have to
    "backup" the stream as you never moved it forward over the \n. It
    stays in the lookahead part of the input stream.

    The other alternative is to not use yyless and simply deal with the \n
    as part of the actions for the TEST1\n rule, making the rule do the
    actions of both TEST1 and \n. Now in this case where it is just one
    character and that character is also a token, that probably works ok.
    However, if \n# (for example) is a special token, you have gotten
    yourself into a mess as you probably need to make TEST1\n# a rule
    also, and you can see where this is headed.

    But this all goes to my point about not making the lexer complicated.
    It is probably a flaw in the language design if you need lookahead
    past the end of the token to determine what to tokenize it as. And,
    note several designs by famous people in the compiler field have made
    that mistake. Two examples come immediately to mind.

    Yacc itself needs identifier: as a special token to deal with the fact
    that semicolons are optional at the end of a rule. It makes the
    language LR(2) otherwise. (In our (Compiler Resources) version of
    Yacc++, we disallowed rules that don't end with semicolons.) Note, I
    suspect most versions don't allow comments in between the identifier
    and the colon (they probably don't even allow whitespace).

    In Pascal 1..2 need lookahead to recognize that it is an int a ..
    operator and another int rather than 1.2 a floating point ("real")
    number. Better would have been to require spaces (as in 1 .. 2) in
    that case.

    Note, I even regret the way our version of Yacc++ handles code blocks
    in braces. It is the same kind of quagmire. I cannot easily use
    braces in other places because of it. I am looking at having to
    develop an even more complicated scheme to make that work, making the
    corner case even more convoluted.

    But when I rail against hand-written recursive descent parsers and not
    using tools and living within the limits of what can express simply,
    this is the kind of errors I am referring to. Without the guardrails
    of a tool telling you that your language design has gone off into the
    reeds, you will make those kind of errors and your language will be
    harder to evolve and have corner cases that you don't even know that
    are there. Enough preaching. I will get off my soapbox.

    Kind regards,
    Chris

    -- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------ [Lex and flex both have trailing context. The manual says that if it
    can tell at compile time that the text matched by trailing context is
    fixed length, it's cheap, otherwise it's very expensive. It also has
    REJECT in an action which tells it to pretend that the pattern didn't
    match and try something else. That's also very expsnsive, and
    potentially very confusing. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)