A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic expressions. When the lexer encounters an integer lexeme, it casts the lexeme to a binary integer and returns the value to the parser. The lexer contains a rule that looks something like this:
{INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
Hi Folks,
A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic >expressions. When the lexer encounters an integer lexeme, it casts the lexeme >to a binary integer and returns the value to the parser. The lexer contains a >rule that looks something like this:
{INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
But, but, but, ...
Countless times on this list I have been told: Keep the lexer simple!
By converting the lexeme to an integer, the lexer has assumed that the parser >needs/wants a binary integer, not a text number. How does the lexer know what >the parser needs/wants?
That seems like knowledge the lexer shouldn't have if
the lexer is to be simple. Further, even if one parser needs/wants a binary >integer value, that parser might be swapped out at a later date and replaced >with a different parser that wants the text number.
It seems to me that the lexer should return to the parser the text number and >it is the responsibility of the parser to convert the value to an integer data >type if it desires.
What do you think?
/Roger+1
[I think the lexer should provide the tokens that the parser needs. If >integers are always handled as numbers, convert them, if not, don't.
If the parser does one and later changes to do the other, you can
change the lexer, too. -John]
In many (actually most) cases, the binary representation of an integer
can be stored in less space than the text representation.
On Thu, 14 Jul 2022 10:25:24 +0000, Roger L Costello
<[email protected]> wrote:
Hi Folks,
A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
expressions. When the lexer encounters an integer lexeme, it casts the lexeme
to a binary integer and returns the value to the parser. The lexer contains a
rule that looks something like this:
{INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
But, but, but, ...
Countless times on this list I have been told: Keep the lexer simple!
By converting the lexeme to an integer, the lexer has assumed that the parser
needs/wants a binary integer, not a text number. How does the lexer know what
the parser needs/wants?
Presumably because the same developer that wrote the parser also wrote
the rules for the lexer.
It seems to me that the lexer should return to the parser the text number and
it is the responsibility of the parser to convert the value to an integer data
type if it desires.
What do you think?
The general problem is performance. The place to decide what class a
token falls into is where you contact the text.
In principle the lexer could be as simple as strtok() - just
separating the tokens, returning a pointer directly to its input
buffer, and punting all further processing to the parser.
But ...
The lexer has already scanned the text just to locate the end of it.
If it then simply gives the text to the parser, the parser must scan
the text AGAIN to verify that it represents a number (or whatever the
parser happens to be looking for at that point).
Then the text gets scanned YET AGAIN to convert it to a number.
That is 3 scans of the text if the parser does the work versus 2 scans
if the lexer does the work.
Moving (at least gross) input classification into the lexer avoids at
least one pass over the entirety of the input. When considered
against the length of the average input, that can add up to a LOT of
avoided work.
Hi Folks,
A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic expressions. When the lexer encounters an integer lexeme, it casts the lexeme to a binary integer and returns the value to the parser. The lexer contains a rule that looks something like this:
{INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
But, but, but, ...
Countless times on this list I have been told: Keep the lexer simple!
By converting the lexeme to an integer, the lexer has assumed that the parser needs/wants a binary integer, not a text number.
How does the lexer know what
the parser needs/wants? That seems like knowledge the lexer shouldn't have if the lexer is to be simple.
Further, even if one parser needs/wants a binary
integer value, that parser might be swapped out at a later date and replaced with a different parser that wants the text number.
It seems to me that the lexer should return to the parser the text number and it is the responsibility of the parser to convert the value to an integer data
type if it desires.
What do you think?
It's still hard to imagine that the size difference would matter in acompiler.
If you're logging a million values a second and saving it to an archive, well,
that's different. -John
[...]and
It seems to me that the lexer should return to the parser the text number
it is the responsibility of the parser to convert the value to an integerdata
type if it desires.
What do you think?
On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
In many (actually most) cases, the binary representation of an integer
can be stored in less space than the text representation.
The output of a command such as (cd /usr/src/linux; grep --only-matching >--recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.
Binary, fixed-width, representation of integers is statistically more >space-efficient compared to implicit-width textual representation only if the >text representation of the integers includes (for example): a plain 32/64-bit >pointer to the start of the text, a plain 16/32/64-bit relative/absolute >offset to the start of the number in a character array, the [length of the >textual form of the number] in an explicit form.
Binary, fixed-width, representation of integers is more likely to be more >space-efficient than [their textual representation with an implicit length] if >the source of the integers doesn't originate from a human hand typing digits >on a keyboard. For example, when the source of the integers is a >high-precision measurement device, is something with a physical counterpart >(such as: the GPS coordinates of objects spread across Earth's surface), etc.
-atom
0x101010 is 8 characters, 1 byte.
You are asking the wrong question. You are optimizing at the wrong level. Stop.
You are focusing on the trivial, the irrelevant. It is unlikely that
having the lexer convert integers (or floats or quaternions) into a binary representation is a sufficiently expensive operation to make sense fretting about it.
On Fri, 15 Jul 2022 03:02:07 -0700 (PDT), Jan Ziak
<[email protected]> wrote:
On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
In many (actually most) cases, the binary representation of an integer
can be stored in less space than the text representation.
The output of a command such as (cd /usr/src/linux; grep --only-matching >>--recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.
Binary, fixed-width, representation of integers is statistically more >>space-efficient compared to implicit-width textual representation only if the >>text representation of the integers includes (for example): a plain 32/64-bit >>pointer to the start of the text, a plain 16/32/64-bit relative/absolute >>offset to the start of the number in a character array, the [length of the >>textual form of the number] in an explicit form.
???
Decimal 100 is 3 characters in text, but 1 byte in binary.
but as a message earlier today pointed out,
it's not likely to make any difference in practical compilers. -John]
[In my experience separating the lexer from the parser makes it a lot easier to deal with common lexical situations like skipping white space and comments.
You could certainly do that in a combined scheme but I'm not sure it would end
up any simpler. -John]
[In my experience separating the lexer from the parser makes it a lot easier to deal with common lexical situations like skipping white space and comments.
You could certainly do that in a combined scheme but I'm not sure it would end
up any simpler. -John]
...
[I don't see how this grammar will allow a comment before the statement. -John]
On Monday, July 18, 2022 at 9:30:51 AM UTC-7, gah4 wrote:
(snip, or moderator wrote)
[In my experience separating the lexer from the parser makes it a lot easierInteresting. As I previously noted, STEP mostly doesn't do a separate lexical analysis.
to deal with common lexical situations like skipping white space and comments.
You could certainly do that in a combined scheme but I'm not sure it would end
up any simpler. -John]
It does, however, do three things before the macros see the input: convert multiple
blanks to a single blank, pass apostrophed strings through whole, and remove comments delimited by double quotes.
Apostrophed strings are slightly more interesting. Internal double apostrophes
are converted to single apostrophes, and the delimiting apostrophes are converted to a special character that isn't an input character.
One of my projects 45 years ago, was to write macros to recognize the
syntax of IBM OS/360 Fortran IV. Direct access I/O statements use
a single apostrophe to delimit the record number:
READ(1'N) X,Y,Z
There is no way to write macros for that syntax after the previous processing.
Much fun figuring out all the strange things done in programming language syntax over the years.
| Sysop: | Keyop |
|---|---|
| Location: | Huddersfield, West Yorkshire, UK |
| Users: | 715 |
| Nodes: | 16 (0 / 16) |
| Uptime: | 166:41:37 |
| Calls: | 12,096 |
| Calls today: | 4 |
| Files: | 15,003 |
| Messages: | 6,517,808 |