• Flex is the most powerful lexical analysis language in the world. True

    From Roger L Costello@21:1/5 to All on Wed May 4 11:22:34 2022
    Hi Folks,

    1. A lexical analysis language that exclusively provides regular expressions for scanning input can only process regular languages.

    (a) True
    (b) False

    2. Flex provides, in addition to regular expressions, states and a pushdown stack. This greatly expands the set of languages that can be processed.

    (a) True
    (b) False

    3. Because Flex provides states and a pushdown stack, Flex lexers can process context-free languages.

    (a) True
    (b) False

    4. No other lexical analysis language provides states and a pushdown stack.

    (a) True
    (b) False

    5. Flex is the most powerful lexical analysis language in the world.

    (a) True
    (b) False

    /Roger
    [I think that you could easily graft a state stack into any lexer that has start states.
    Also, tools like Antlr combine the lexer and parser generators, so they're at least as
    powerful as flex. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tom Shields@21:1/5 to All on Wed May 4 14:14:48 2022
    FYI: RE/flex [https://github.com/Genivia/RE-flex] provides a superset
    of flex functionality, generates a better implementation of a scanner
    in C++ than does flex, and, most importantly, is actively maintained.

    Tom Shields

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Thu May 5 15:20:07 2022
    First, of all, I think Flex is a relatively fine lexer generator with
    some good properties and reasons why it is popular, but claiming it is
    "the most powerful lexical analysis language in the world" is way too
    far overstating it.

    It is not even as powerful as some other lexical analysis languages
    and even exploiting its power often requires things that cannot be
    expressed in Flex alone nor can they be done in ways that are simple
    and easy to reason about. So, that part of the statement is patently
    false. I will answer the question, but go beyond simple, yes and no
    responses to show the actual issues and not some over-simplification
    of the problem.

    1. A lexical analysis language that exclusively provides regular expressions for scanning input can only process regular languages.

    True if you mean automata theoretic regular expressions, i.e. not
    including PCRE extensions, like back references and captures and
    zero-width assertions. However, if you include such extensions (or
    other ones I will mention below), you can escape those limits and get
    to larger language classes, including ones that are Turing
    Complete--not that Turing Machine power cannot be abused and result in
    a lexer who you cannot figure out what the tokens are without solving
    the Halting Problem. So, what one really wants is a nicely constrained
    but larger lexing language that is still easy to reason about. There
    are tools that approximate that. None are perfect yet (if perfection
    is even achievable and not merely a limit to be approached but never
    reached).

    2. Flex provides, in addition to regular expressions, states and a pushdown stack. This greatly expands the set of languages that can be processed.
    3. Because Flex provides states and a pushdown stack, Flex lexers can process
    context-free languages.

    True and true. States by themselves don't add anything. The
    intersection or union of two regular expression languages is still
    regular. However, a pushdown automata can allow you to process a
    context free language. Now, not having used that feature myself, I
    don't know whether one has to "program" the automata by hand or one
    can express it "naturally" as context free rules. I suspect the
    latter, but don't know for sure.

    In comparison, in 1986, when we built the lexer for our version of
    Yacc++, we integrated regular expressions and LR rules, so that one
    can simply write yacc-like rules with recursion to express
    [deterministic, i.e. ELR] grammars (with regular expressions on the
    right hand side). To my mind, the notation being unified and used
    both in the lexer and the parser is easier to comprehend and within
    reason is much easier to reason about. It makes things like nested
    comments into an "obvious" recursive lexical rule.

    We also have lexer classes which allow one to do lexer states (for
    context sensitivity) with inheritance between them and you can change
    their states (which class is being lexed) via rules in the lexer or
    parser, doing the things that Flex does.

    Terence Parr liked our idea and eventually incorporated it into ANTLR (originally doing an LL version of it), the current version in ANTLR4
    is more like a PEG (parsing expression grammar) with extensions to
    handle simple (aka direct) left-recursion and precedence.

    Moreover, he has some simple lexer extensions that look quite FLEX
    like (rules for changing "state" in a controlled fashion including pushing/popping from a stack). Being more constrained, it is actually
    easier to reason about. You aren't trying to analyze arbitrary C
    (Java) code.

    4. No other lexical analysis language provides states and a pushdown stack. 5. Flex is the most powerful lexical analysis language in the world.

    False and false. See the preceding paragraphs. Not only do they do
    so. They do so in a manner which is more easy to reason about.

    Moreover, syntactic (and semantic) predicates (an innovation Terence
    introduced and we borrowed later) allow one to express languages that
    are larger than the context free family. In fact, there are proofs
    that one can use them to express any Turing Complete language. They
    are typically implemented using backtracking.

    However, Bryan Forward took the concept of predicates and developed
    packrat parsing as a way of implemented PEGs efficiently, effectively
    linear. And, the notation is only minorly different than context free grammars, with the main difference being that the or (alternative) is
    treated as ordered, if the first rule matches apply it and don't even
    try the following rules. As I noted above, this nicely solves certain precedence problems. (And, as I said, ANTLR4 incorporates this idea
    to positive effect.)

    This work has further been extended by Alexander Okhotin into a
    variety of "Conjuctive" and "Boolean" grammars. These again have
    Turing Machine power, but are usually simpler to understand and reason
    about than most Turing Machines.

    In a different direction researchers like Quinn-Tyler Jackson have
    developed Adaptive Grammars (see the relevant Wikipeida article).
    This is a different solution to the problem of making a constrained
    version of Turing Machine power. Notably they are well designed to
    handle "type checking" with the equivalent of the C typedef hack, but
    without it being a hack but a formalism that can give clear semantics.
    Note that this is traditionally very hard to do in context free
    grammars--and as far as I can tell, most people writing lexers and
    parsers flounder trying to incorporate even simple type checking into
    their grammar, by using the wrong tool (context free grammars) to do
    so.

    There are other directions that have been pursued also. Scanner-less
    parser generators, which merge the lexing and parsing into one phase.
    GLR parsers (and presumably lexers) that deal with ambiguity by making
    forests rather than just trees/DAGs. Parsing combinators can be used
    for lexing also. There are probably even concepts that I have never
    read about or have read about and forgotten.

    Many of these advances have been incorporated into lexer generators.
    They are [almost] all more powerful lexical analysis languages than
    what Flex provides. Most of them are more clear and easier to reason
    about also.

    So, pardon my knee-jerk reaction to your question. It is simply that
    most people aren't aware of all the different technologies that are
    available. Flex is fine, but it is certainly not a be-all-end-all
    solution. It wouldn't even be the solution I would reach for first.

    Kind regards,
    Chris

    -- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Roger L Costello@21:1/5 to All on Fri May 6 11:16:05 2022
    Thank you Chris for that superb explanation!

    Flex is fine, but it is certainly not a be-all-end-all
    solution. It wouldn't even be the solution I would
    reach for first.

    May I ask, what would be the solution that you would reach for first?

    /Roger
    [I'm not Chris but the answer is the one that does the job. You don't need
    a flame thrower if you only need to light the BBQ. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to [email protected] on Fri May 6 11:00:16 2022
    On Wed, 4 May 2022 11:22:34 +0000, Roger L Costello
    <[email protected]> wrote:

    Hi Folks,
    1. A lexical analysis language that exclusively provides regular expressions >for scanning input can only process regular languages.
    2. Flex provides, in addition to regular expressions, states and a pushdown >stack. This greatly expands the set of languages that can be processed.
    3. Because Flex provides states and a pushdown stack, Flex lexers can process >context-free languages.
    4. No other lexical analysis language provides states and a pushdown stack. >5. Flex is the most powerful lexical analysis language in the world.

    [I think that you could easily graft a state stack into any lexer that has start states.
    Also, tools like Antlr combine the lexer and parser generators, so they're at least as
    powerful as flex. -John]

    +1 John.

    Roger, if you hadn't already asked some more interesting questions, I
    would suspect this 'test' was homework.

    Flex is powerful, but it certainly is not alone. As John's response
    hinted, there are (plenty of other) tools that more or less are
    equivalent. And not all of them are based on LR. https://en.wikipedia.org/wiki/Comparison_of_parser_generators

    Not to mention that programming languages which tend to actually be
    used also tend to be [relatively] easily parsed using LL(k).

    LR is useful AS AN IMPLEMENTATION TECHNIQUE, but in general if your
    language is complex enough to really /require/ (G)LR or PEG parsing,
    it probably is too complicated to be used by average programmers.

    YMMV,
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to George Neuner on Fri May 6 14:30:36 2022
    On Friday, May 6, 2022 at 9:14:54 AM UTC-7, George Neuner wrote:

    (snip)

    Not to mention that programming languages which tend to actually be
    used also tend to be [relatively] easily parsed using LL(k).

    An important part of a programming language is that people can understand it.

    I suspect it isn't hard to design a language that computers can easily
    parse, but people can't. Your lexer only needs to be good enough for
    actual programming languages.

    As with BBQs, that doesn't stop people from trying.
    [Take a look at Postscript, which is trivial to tokenize and parse since
    it's a stream of tokens in RPN order, but making sense of it
    by humans is a challenge. Or, of course, m4. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Sat May 7 13:15:58 2022
    Roger, since you asked, I will answer what solution I would reach for
    (and give advice on what I think others should reach for).

    First, John's advice is straight on. You don't need a flamethrower in
    most cases, and in fact having one invites one to abuse it. I think
    Purdue has a series of videos on the fastest way to light a barbeque
    which end with completely melting the bbq in a matter of seconds using
    liquid oxygen. Fast, but probably the wrong solution for grilling
    burgers and hot dogs.

    Next, if you already have a lexer, I would NOT change the technology
    using it, at least not by much, unless one had a specific reason to do otherwise. I might switch from original LEX to Flex (and original
    yacc to Bison), but that would be one of the few exceptions to rule.
    Now, if I had issues I would switch, but I would first attempt to see
    if there are workarounds.

    in my last three projects, there was already a lexer-parser
    combinartion in use. So, in my last three projects (and four
    implementations) we used: Bison + Flex, ANTLR, Parser-RS, and JAVACC.
    In the last two, I don't even know what lexer was used as I never
    touched it other than to add keywords.

    And that goes to an important point. Your lexer *should be* almost
    trivially simple (i.e. regular expressions only and not complicated
    ones). You rarely want to solve problems at the lexical level. You
    are much less likely to get good error reporting if you do. In most
    cases, your parser should be simple also. You might want LR parsing
    for expressions, but otherwise you want your grammar to be LL(1) (with
    the if-then-else hack).

    And, all else being equal, if you don't have a lexer-parser
    combination you can reuse. I would pick somewhat based upon
    programming language, since most tools are relatively tied to one
    language even when they support more than one.

    I haven't decided on my favorite for Rust yet, parser-RS isn't bad,
    Nom is also popular.

    For Java, I would go with ANTLR4. And, overall, I would say that is
    my current favorite despite a few nits.

    For C++ or C#, I would use the Yacc++ we wrote even though it needs
    some tweaking to catch up to ANTLR at this point. I prefer our
    solution to keywords to what ANTLR has and indirect left recursion
    (i.e. parsing expressions with lists of expressions as a first
    argument) doesn't work right in ANTLR.

    If someone else was paying for it, I would investigate the DMS Toolkit
    for Semantic Designs, because they have done most of the work to make
    GLR practical.

    If I really wanted to solve types in my grammar, I would look into
    "meta-ess" by Jackson. I don't know how available that is.

    And, if I were using Scheme or Lisp, I would look into Racket.

    -- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to All on Sat May 7 13:10:50 2022
    (snip, I wrote)

    An important part of a programming language is that people can understand it.

    (snip)

    [Take a look at Postscript, which is trivial to tokenize and parse since
    it's a stream of tokens in RPN order, but making sense of it
    by humans is a challenge. Or, of course, m4. -John]

    Reminds me of the origin of RISC, when (I believe) IBM noticed that
    compilers were using a small subset of the available instructions, and that much less programming was being done in assembly.

    But okay, was Postscript supposed to be written by people,
    or programs?

    As with code generated by compilers, it has to be understood
    by people at least once, but even then, only step by step, and not
    (usually) the whole program at once.

    But yes you can write unreadable Postscript. Once, some years ago,
    we (me and some others) needed to redefine def.

    [Actually, RISC was at Berkeley, and IBM's project was the 801. But
    yes, they noticed compilers used only a fraction of the S/360
    instruction set so they made a minimal design that supported only what
    their state-of-the art compiler used. RISC was sort of the same but
    they used the mediocre PCC compiler which is why they had register
    windows to compensate for PCC's weak register allocation. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Roger L Costello@21:1/5 to Another question if I may. You on Sun May 8 13:34:03 2022
    Thank you again Chris. Terrific information.

    Another question if I may. You wrote:

    And that goes to an important point. Your lexer *should be* almost
    trivially simple (i.e. regular expressions only and not complicated
    ones). You rarely want to solve problems at the lexical level. You
    are much less likely to get good error reporting if you do. In most
    cases, your parser should be simple also.

    For a while now I have been (for fun) working on building a parser for
    parsing XML documents. I have experimented with making the lexer
    simple and with making the parser simple. If I make the lexer simple,
    then the parser is complex. If I make the lexer complex (using lots of
    states and making heavy use of Flex's pushdown stack) then the parser
    is simple. It doesn't seem possible to make both the lexer and parser
    simple.

    There are lots of "conditional rules" in XML. For example, in XML the
    &amp; is called an "XML entity." Since the & is a reserved symbol, XML documents need to use &amp; instead of &. An XML parser is to convert
    &amp; to &. However, if the &amp; is in certain contexts -- within a
    comment or within a CDATA section -- then the &amp; is not converted.
    Thus, there is conditional processing:

    IF (&amp; is in a comment or in a CDATA section) THEN
    OUTPUT(&amp;)
    ELSE
    OUTPUT(&)

    Flex's states/stack mechanism is ideally suited for conditional
    processing like this. From the section on Start Conditions in the Flex
    manual: "flex provides a mechanism for conditionally activating
    rules."

    So while it would be great to have a simple lexer, I am leaning
    towards dealing with the conditional rules in XML using the Flex
    states/stack mechanism rather than dealing with the conditional rules
    in Bison. In other words, I am leaning towards a complex lexer.

    I am interested in hearing your thoughts on this.

    You don't need a flamethrower

    My apologies. It wasn't my intent to throw a flame. But in hindsight I
    can see that I should have worded things much better. I will do better
    in the future.

    /Roger

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)