Forum: >>> Magnum BBS <<<

Flex is the most powerful lexical analysis language in the world. True

From Roger L Costello@21:1/5 to All on Wed May 4 11:22:34 2022

Hi Folks,

1. A lexical analysis language that exclusively provides regular expressions for scanning input can only process regular languages.

(a) True
(b) False

2. Flex provides, in addition to regular expressions, states and a pushdown stack. This greatly expands the set of languages that can be processed.

(a) True
(b) False

3. Because Flex provides states and a pushdown stack, Flex lexers can process context-free languages.

(a) True
(b) False

4. No other lexical analysis language provides states and a pushdown stack.

(a) True
(b) False

5. Flex is the most powerful lexical analysis language in the world.

(a) True
(b) False

/Roger
[I think that you could easily graft a state stack into any lexer that has start states.
Also, tools like Antlr combine the lexer and parser generators, so they're at least as
powerful as flex. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tom Shields@21:1/5 to All on Wed May 4 14:14:48 2022

FYI: RE/flex [https://github.com/Genivia/RE-flex] provides a superset
of flex functionality, generates a better implementation of a scanner
in C++ than does flex, and, most importantly, is actively maintained.

Tom Shields

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to All on Thu May 5 15:20:07 2022

First, of all, I think Flex is a relatively fine lexer generator with
some good properties and reasons why it is popular, but claiming it is
"the most powerful lexical analysis language in the world" is way too
far overstating it.

It is not even as powerful as some other lexical analysis languages
and even exploiting its power often requires things that cannot be
expressed in Flex alone nor can they be done in ways that are simple
and easy to reason about. So, that part of the statement is patently
false. I will answer the question, but go beyond simple, yes and no
responses to show the actual issues and not some over-simplification
of the problem.

1. A lexical analysis language that exclusively provides regular expressions for scanning input can only process regular languages.

True if you mean automata theoretic regular expressions, i.e. not
including PCRE extensions, like back references and captures and
zero-width assertions. However, if you include such extensions (or
other ones I will mention below), you can escape those limits and get
to larger language classes, including ones that are Turing
Complete--not that Turing Machine power cannot be abused and result in
a lexer who you cannot figure out what the tokens are without solving
the Halting Problem. So, what one really wants is a nicely constrained
but larger lexing language that is still easy to reason about. There
are tools that approximate that. None are perfect yet (if perfection
is even achievable and not merely a limit to be approached but never
reached).

2. Flex provides, in addition to regular expressions, states and a pushdown stack. This greatly expands the set of languages that can be processed.
3. Because Flex provides states and a pushdown stack, Flex lexers can process
context-free languages.

True and true. States by themselves don't add anything. The
intersection or union of two regular expression languages is still
regular. However, a pushdown automata can allow you to process a
context free language. Now, not having used that feature myself, I
don't know whether one has to "program" the automata by hand or one
can express it "naturally" as context free rules. I suspect the
latter, but don't know for sure.

In comparison, in 1986, when we built the lexer for our version of
Yacc++, we integrated regular expressions and LR rules, so that one
can simply write yacc-like rules with recursion to express
[deterministic, i.e. ELR] grammars (with regular expressions on the
right hand side). To my mind, the notation being unified and used
both in the lexer and the parser is easier to comprehend and within
reason is much easier to reason about. It makes things like nested
comments into an "obvious" recursive lexical rule.

We also have lexer classes which allow one to do lexer states (for
context sensitivity) with inheritance between them and you can change
their states (which class is being lexed) via rules in the lexer or
parser, doing the things that Flex does.

Terence Parr liked our idea and eventually incorporated it into ANTLR (originally doing an LL version of it), the current version in ANTLR4
is more like a PEG (parsing expression grammar) with extensions to
handle simple (aka direct) left-recursion and precedence.

Moreover, he has some simple lexer extensions that look quite FLEX
like (rules for changing "state" in a controlled fashion including pushing/popping from a stack). Being more constrained, it is actually
easier to reason about. You aren't trying to analyze arbitrary C
(Java) code.

4. No other lexical analysis language provides states and a pushdown stack. 5. Flex is the most powerful lexical analysis language in the world.

False and false. See the preceding paragraphs. Not only do they do
so. They do so in a manner which is more easy to reason about.

Moreover, syntactic (and semantic) predicates (an innovation Terence
introduced and we borrowed later) allow one to express languages that
are larger than the context free family. In fact, there are proofs
that one can use them to express any Turing Complete language. They
are typically implemented using backtracking.

However, Bryan Forward took the concept of predicates and developed
packrat parsing as a way of implemented PEGs efficiently, effectively
linear. And, the notation is only minorly different than context free grammars, with the main difference being that the or (alternative) is
treated as ordered, if the first rule matches apply it and don't even
try the following rules. As I noted above, this nicely solves certain precedence problems. (And, as I said, ANTLR4 incorporates this idea
to positive effect.)

This work has further been extended by Alexander Okhotin into a
variety of "Conjuctive" and "Boolean" grammars. These again have
Turing Machine power, but are usually simpler to understand and reason
about than most Turing Machines.

In a different direction researchers like Quinn-Tyler Jackson have
developed Adaptive Grammars (see the relevant Wikipeida article).
This is a different solution to the problem of making a constrained
version of Turing Machine power. Notably they are well designed to
handle "type checking" with the equivalent of the C typedef hack, but
without it being a hack but a formalism that can give clear semantics.
Note that this is traditionally very hard to do in context free
grammars--and as far as I can tell, most people writing lexers and
parsers flounder trying to incorporate even simple type checking into
their grammar, by using the wrong tool (context free grammars) to do
so.

There are other directions that have been pursued also. Scanner-less
parser generators, which merge the lexing and parsing into one phase.
GLR parsers (and presumably lexers) that deal with ambiguity by making
forests rather than just trees/DAGs. Parsing combinators can be used
for lexing also. There are probably even concepts that I have never
read about or have read about and forgotten.

Many of these advances have been incorporated into lexer generators.
They are [almost] all more powerful lexical analysis languages than
what Flex provides. Most of them are more clear and easier to reason
about also.

So, pardon my knee-jerk reaction to your question. It is simply that
most people aren't aware of all the different technologies that are
available. Flex is fine, but it is certainly not a be-all-end-all
solution. It wouldn't even be the solution I would reach for first.

Kind regards,
Chris

-- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Roger L Costello@21:1/5 to All on Fri May 6 11:16:05 2022

Thank you Chris for that superb explanation!

Flex is fine, but it is certainly not a be-all-end-all
solution. It wouldn't even be the solution I would
reach for first.

May I ask, what would be the solution that you would reach for first?

/Roger
[I'm not Chris but the answer is the one that does the job. You don't need
a flame thrower if you only need to light the BBQ. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to [email protected] on Fri May 6 11:00:16 2022

On Wed, 4 May 2022 11:22:34 +0000, Roger L Costello
<[email protected]> wrote:

Hi Folks,
1. A lexical analysis language that exclusively provides regular expressions >for scanning input can only process regular languages.
2. Flex provides, in addition to regular expressions, states and a pushdown >stack. This greatly expands the set of languages that can be processed.
3. Because Flex provides states and a pushdown stack, Flex lexers can process >context-free languages.
4. No other lexical analysis language provides states and a pushdown stack. >5. Flex is the most powerful lexical analysis language in the world.

[I think that you could easily graft a state stack into any lexer that has start states.
Also, tools like Antlr combine the lexer and parser generators, so they're at least as
powerful as flex. -John]

+1 John.

Roger, if you hadn't already asked some more interesting questions, I
would suspect this 'test' was homework.

Flex is powerful, but it certainly is not alone. As John's response
hinted, there are (plenty of other) tools that more or less are
equivalent. And not all of them are based on LR. https://en.wikipedia.org/wiki/Comparison_of_parser_generators

Not to mention that programming languages which tend to actually be
used also tend to be [relatively] easily parsed using LL(k).

LR is useful AS AN IMPLEMENTATION TECHNIQUE, but in general if your
language is complex enough to really /require/ (G)LR or PEG parsing,
it probably is too complicated to be used by average programmers.

YMMV,
George

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to George Neuner on Fri May 6 14:30:36 2022

On Friday, May 6, 2022 at 9:14:54 AM UTC-7, George Neuner wrote:

(snip)

Not to mention that programming languages which tend to actually be
used also tend to be [relatively] easily parsed using LL(k).

An important part of a programming language is that people can understand it.

I suspect it isn't hard to design a language that computers can easily
parse, but people can't. Your lexer only needs to be good enough for
actual programming languages.

As with BBQs, that doesn't stop people from trying.
[Take a look at Postscript, which is trivial to tokenize and parse since
it's a stream of tokens in RPN order, but making sense of it
by humans is a challenge. Or, of course, m4. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to All on Sat May 7 13:15:58 2022

Roger, since you asked, I will answer what solution I would reach for
(and give advice on what I think others should reach for).

First, John's advice is straight on. You don't need a flamethrower in
most cases, and in fact having one invites one to abuse it. I think
Purdue has a series of videos on the fastest way to light a barbeque
which end with completely melting the bbq in a matter of seconds using
liquid oxygen. Fast, but probably the wrong solution for grilling
burgers and hot dogs.

Next, if you already have a lexer, I would NOT change the technology
using it, at least not by much, unless one had a specific reason to do otherwise. I might switch from original LEX to Flex (and original
yacc to Bison), but that would be one of the few exceptions to rule.
Now, if I had issues I would switch, but I would first attempt to see
if there are workarounds.

in my last three projects, there was already a lexer-parser
combinartion in use. So, in my last three projects (and four
implementations) we used: Bison + Flex, ANTLR, Parser-RS, and JAVACC.
In the last two, I don't even know what lexer was used as I never
touched it other than to add keywords.

And that goes to an important point. Your lexer *should be* almost
trivially simple (i.e. regular expressions only and not complicated
ones). You rarely want to solve problems at the lexical level. You
are much less likely to get good error reporting if you do. In most
cases, your parser should be simple also. You might want LR parsing
for expressions, but otherwise you want your grammar to be LL(1) (with
the if-then-else hack).

And, all else being equal, if you don't have a lexer-parser
combination you can reuse. I would pick somewhat based upon
programming language, since most tools are relatively tied to one
language even when they support more than one.

I haven't decided on my favorite for Rust yet, parser-RS isn't bad,
Nom is also popular.

For Java, I would go with ANTLR4. And, overall, I would say that is
my current favorite despite a few nits.

For C++ or C#, I would use the Yacc++ we wrote even though it needs
some tweaking to catch up to ANTLR at this point. I prefer our
solution to keywords to what ANTLR has and indirect left recursion
(i.e. parsing expressions with lists of expressions as a first
argument) doesn't work right in ANTLR.

If someone else was paying for it, I would investigate the DMS Toolkit
for Semantic Designs, because they have done most of the work to make
GLR practical.

If I really wanted to solve types in my grammar, I would look into
"meta-ess" by Jackson. I don't know how available that is.

And, if I were using Scheme or Lisp, I would look into Racket.

-- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to All on Sat May 7 13:10:50 2022

(snip, I wrote)

An important part of a programming language is that people can understand it.

(snip)

[Take a look at Postscript, which is trivial to tokenize and parse since
it's a stream of tokens in RPN order, but making sense of it
by humans is a challenge. Or, of course, m4. -John]

Reminds me of the origin of RISC, when (I believe) IBM noticed that
compilers were using a small subset of the available instructions, and that much less programming was being done in assembly.

But okay, was Postscript supposed to be written by people,
or programs?

As with code generated by compilers, it has to be understood
by people at least once, but even then, only step by step, and not
(usually) the whole program at once.

But yes you can write unreadable Postscript. Once, some years ago,
we (me and some others) needed to redefine def.

[Actually, RISC was at Berkeley, and IBM's project was the 801. But
yes, they noticed compilers used only a fraction of the S/360
instruction set so they made a minimal design that supported only what
their state-of-the art compiler used. RISC was sort of the same but
they used the mediocre PCC compiler which is why they had register
windows to compensate for PCC's weak register allocation. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Roger L Costello@21:1/5 to Another question if I may. You on Sun May 8 13:34:03 2022

Thank you again Chris. Terrific information.

Another question if I may. You wrote:

And that goes to an important point. Your lexer *should be* almost
trivially simple (i.e. regular expressions only and not complicated
ones). You rarely want to solve problems at the lexical level. You
are much less likely to get good error reporting if you do. In most
cases, your parser should be simple also.

For a while now I have been (for fun) working on building a parser for
parsing XML documents. I have experimented with making the lexer
simple and with making the parser simple. If I make the lexer simple,
then the parser is complex. If I make the lexer complex (using lots of
states and making heavy use of Flex's pushdown stack) then the parser
is simple. It doesn't seem possible to make both the lexer and parser
simple.

There are lots of "conditional rules" in XML. For example, in XML the
& is called an "XML entity." Since the & is a reserved symbol, XML documents need to use & instead of &. An XML parser is to convert
& to &. However, if the & is in certain contexts -- within a
comment or within a CDATA section -- then the & is not converted.
Thus, there is conditional processing:

IF (& is in a comment or in a CDATA section) THEN
OUTPUT(&)
ELSE
OUTPUT(&)

Flex's states/stack mechanism is ideally suited for conditional
processing like this. From the section on Start Conditions in the Flex
manual: "flex provides a mechanism for conditionally activating
rules."

So while it would be great to have a simple lexer, I am leaning
towards dealing with the conditional rules in XML using the Flex
states/stack mechanism rather than dealing with the conditional rules
in Bison. In other words, I am leaning towards a complex lexer.

I am interested in hearing your thoughts on this.

You don't need a flamethrower

My apologies. It wasn't my intent to throw a flame. But in hindsight I
can see that I should have worded things much better. I will do better
in the future.

/Roger

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Centurion
  Fri Jul 31 00:59:56 2026
  from Berea, Ohio via Telnet
- Rixter
  Fri Jul 31 00:00:46 2026
  from Madison, Nc via Telnet
- Bob Worm
  Thu Jul 30 20:01:55 2026
  from Wales, Uk via Telnet
- Rixter
  Thu Jul 30 14:17:17 2026
  from Madison, Nc via Telnet
- Krenn
  Thu Jul 30 13:16:49 2026
  from Sydney, Nsw via Telnet
- Bob Worm
  Thu Jul 30 09:03:28 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:47:34 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:36:06 2026
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	107:00:49
Calls:	12,460
Calls today:	2
Files:	15,197
Messages:	6,538,087

Flex is the most powerful lexical analysis language in the world. True

Who's Online

Recent Visitors

System Info