Forum: >>> Magnum BBS <<<

An ingenious scanning problem!

From Roger L Costello@21:1/5 to All on Sat May 28 19:03:09 2022

Hi Folks,

Page 115-116 of the Flex User's Manual [1] has an ingenious problem posed by Jan Kort on 05 September 1998.

Here's a description of the problem:

The input contains this: TEST1\n

That is, the input contains the string TEST1 followed by a newline.

The Flex scanner contains a rule that matches on the input:

TEST1\n { action }

Here's a description of its action:

Of the text that was matched, I only
want the first 5 characters (i.e., TEST1).

That action is expressed in Flex this way:

TEST1\n { yyless(5); }

In the scanner there are also two newline rules.

One newline rule matches on a newline at the beginning of a line (i.e., empty line):

^\n { action }

The other newline rule matches on a newline at the end of a line:

\n { action }

Question: After the action for TEST1\n is executed, which newline rule is executed?

The answer depends on the meaning of yyless(). Here are two possible meanings for yyless():

(a) The scanner is backing up in the input stream.
(b) The scanner is pushing new characters onto the input stream.

If the meaning of yyless() is (a), then the second newline rule will be executed.

If the meaning of yyless() is (b), then the first newline rule will be executed.

So which does Flex do?

............................

Answer: (b)

Flex treats the newline from TEST1\n as a newline at the beginning of a line (i.e., empty line). Flex executes this rule:

^\n { action }

Such an ingenious problem!

Comments?

/Roger

[1] https://epaperpress.com/lexandyacc/download/flex.pdf
[This is a good example of the reasons you don't want your lexer to be too clever. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to All on Wed Jun 1 13:34:22 2022

I will skip most of Roger's posting prior to the question:

Question: After the action for TEST1\n is executed, which newline rule is executed?

The answer depends on the meaning of yyless(). Here are two possible meanings for yyless():

(a) The scanner is backing up in the input stream.
(b) The scanner is pushing new characters onto the input stream.

I find the Flex choice of "b" to be semantically incorrect and a bug.
because you cannot get functionality "a" if you do that. However, I
suspect that the bug is relatively intractable. It may be quite hard
to back up the input stream and recreate the proper lexer state,
especially since I suspect that newlines are not part of the lexer
state but an additional "widget" on the side.

Note, that an alternate solution is to have what PCRE calls lookahead zero-width assertions. I think the original LEX had them for this
case.

That is, you don't use yyless for this case, but instead you write a rule like:

TEST1 / \n

where the "/" is an operator in PCRE it would look something like this

TEST1 (?=\n)

In either case, the \n is "peeked" at but not consumed as part of the
token and not otherwise processed at all. So, you don't have to
"backup" the stream as you never moved it forward over the \n. It
stays in the lookahead part of the input stream.

The other alternative is to not use yyless and simply deal with the \n
as part of the actions for the TEST1\n rule, making the rule do the
actions of both TEST1 and \n. Now in this case where it is just one
character and that character is also a token, that probably works ok.
However, if \n# (for example) is a special token, you have gotten
yourself into a mess as you probably need to make TEST1\n# a rule
also, and you can see where this is headed.

But this all goes to my point about not making the lexer complicated.
It is probably a flaw in the language design if you need lookahead
past the end of the token to determine what to tokenize it as. And,
note several designs by famous people in the compiler field have made
that mistake. Two examples come immediately to mind.

Yacc itself needs identifier: as a special token to deal with the fact
that semicolons are optional at the end of a rule. It makes the
language LR(2) otherwise. (In our (Compiler Resources) version of
Yacc++, we disallowed rules that don't end with semicolons.) Note, I
suspect most versions don't allow comments in between the identifier
and the colon (they probably don't even allow whitespace).

In Pascal 1..2 need lookahead to recognize that it is an int a ..
operator and another int rather than 1.2 a floating point ("real")
number. Better would have been to require spaces (as in 1 .. 2) in
that case.

Note, I even regret the way our version of Yacc++ handles code blocks
in braces. It is the same kind of quagmire. I cannot easily use
braces in other places because of it. I am looking at having to
develop an even more complicated scheme to make that work, making the
corner case even more convoluted.

But when I rail against hand-written recursive descent parsers and not
using tools and living within the limits of what can express simply,
this is the kind of errors I am referring to. Without the guardrails
of a tool telling you that your language design has gone off into the
reeds, you will make those kind of errors and your language will be
harder to evolve and have corner cases that you don't even know that
are there. Enough preaching. I will get off my soapbox.

Kind regards,
Chris

-- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------ [Lex and flex both have trailing context. The manual says that if it
can tell at compile time that the text matched by trailing context is
fixed length, it's cheap, otherwise it's very expensive. It also has
REJECT in an action which tells it to pretend that the pattern didn't
match and try something else. That's also very expsnsive, and
potentially very confusing. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Shift
  Fri Jul 31 06:46:34 2026
  from Leeds, England via SSH
- Centurion
  Fri Jul 31 00:59:56 2026
  from Berea, Ohio via Telnet
- Rixter
  Fri Jul 31 00:00:46 2026
  from Madison, Nc via Telnet
- Bob Worm
  Thu Jul 30 20:01:55 2026
  from Wales, Uk via Telnet
- Rixter
  Thu Jul 30 14:17:17 2026
  from Madison, Nc via Telnet
- Krenn
  Thu Jul 30 13:16:49 2026
  from Sydney, Nsw via Telnet
- Bob Worm
  Thu Jul 30 09:03:28 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:47:34 2026
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	109:57:55
Calls:	12,461
Calls today:	3
Files:	15,197
Messages:	6,538,128

An ingenious scanning problem!

Who's Online

Recent Visitors

System Info