• Error correction -- was State-of-the-art algorithms for lexical analysi

    From Christopher F Clark@21:1/5 to All on Tue Jun 7 09:22:13 2022
    I wouldn't do it as part of lexical analysis per se, but approximate
    matching (Levershtein distances and related measures) can be
    optimized.

    Some of the seminal work in that area was done by Wu-Mamber. They
    effectively shown how to make NFAs that are insensitive to errors like insertions and deletions. Like many excellent advances, the idea is wonderfully simple once you have seen it.

    I would start with this wiki page if interested:

    https://en.wikipedia.org/wiki/Agrep

    Notably Villa Laurkkari had done a variant of his TRE tool which
    handles PCRE extensions to include approximate matches.

    You might start with his Master'sThesis:

    https://laurikari.net/ville/regex-submatch.pdf

    The FREJ (Fuzzy regular expressions for Java) library mentioned has
    its own idiosyncratic notation, but is essentially a subset of PCRE
    and cannot do anything that PCRE notation cannot.

    -----------

    All of this gets more use in the computational biology area. They
    divide regular expressions into two portions. The fixed portion (i.e.
    parts that can match only finite strings (no closures, no + or *, but
    . and ?) are called Network Expressions) and the portions that can
    match indefinite strings (parts with closures + or * operators) are
    called separators. One can argue about {m,n} constructs (counted
    repetitons), if they are long enough calling them spacers is more
    efficient.

    Michella Becchi did lots of work in this area, especially some related
    to efficient hardware processing of spacers including counted spacers.
    You might start here:

    https://regex.wustl.edu/indexd80ad80a.html?title=Regular_Expression_Processor

    At Intel, she was interning with me there while at WashU (St Louis),
    we did some other work on "Bushy" v. "Stringy" regular expressions
    based upon insights from studying the Aho-Corasick algorithm. After a
    spacer (actually forming the tail of the spacer to be precise), there
    tends to be a "bushy" part where the regular expression has high
    fan-out (and the fail edges of the AC algorithm tend to return to that section). It tends to be short and wide. Once the regular expression
    has been trimmed to one (or a few) candidates, the fan-out essentially
    becomes linear and it is long and narrow. It can be desirable to
    optimize these sections separately possibly even using different representations. I don't think we ever published a paper about that,
    although you can infer it from some of the patents Intel holds in that
    area, where we optimized hardware based upon that distinction.

    You can also see it in LL/LR parsing algorithms. Wherever you have an
    embedded non-terminal, you get a bushy state as you process the
    "first" items of that non-terminal. If the grammar is LL(1), that set
    of bushy states is 1 long, if LL(2) 2 bushy states, etc.

    A similar effect applies to captures and back-references. In fact,
    that's my next "project" (should I ever find I have spare time).
    Captures (and back-references) work like non-terminals. The regular
    expression in the capture is the definition of the non-terminal. The
    main distinction is that the capture and its back-reference need to
    "match" the same string. You can do that either by comparing them or
    if you are really ambitious, building a "stringy" finite machine on
    the fly, maybe using Boyer-Moore techniques to optimize it. Suffix
    tries might be useful there.

    --------

    There is very interesting work in this area. However, I wouldn't use
    much of it for lexical analysis. the reason being simple. Tokens in
    lexical analysis all touch (abut) each other. The closest you get to
    this area (aside from error correction) is dealing with whitespace and
    comments which you can treat like spacers, but it is really overkill
    for that. I suppose, if you have an indentation sensitive language,
    you might look at the counted spacer work Michella did for
    inspiration, but again it is likely overkill.

    This work made sense when optimizing Snort (or maybe something ClamAV
    like, but that got all morphed into comparing hashes ("signatures")).
    Snort really abuses regular expressions to do long matches (and
    grammar things) but without tokenizing. So, this applies. However,
    anyone writing a "compiler" that way should have their head examined,
    because it makes a slow expensive mess out of it.

    Kind regards,
    Chris

    -- ****************************************************************************** Chris Clark email: [email protected] Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)