On 11/23/2022 12:33 PM, Bart wrote:
On 23/11/2022 14:38, James Harris wrote:
Any advice on what node types should be in an IR tree? There seem to
be many options and I am not sure where to strike the balance.
My comments will be about an AST. I don't know whether you are using
'IR' as a synonym for the AST; I think it's generally used for the next
stage beyond AST.
Yes, typically they are different stages.
Typically the AST is a tree structured representation of the program.
IR is typically either a stack machine (often also called a "bytecode"
or "IL");
Something along the lines of 3AC (Three Address Code) or SSA (Static
Single Assignment).
In my compilers, I have mostly ended up going from the AST from a stack
machine IL; and then from this IL into a 3AC IR.
There is a difference in that while an AST is usually tree structured, something like a 3AC IR is typically organized into "basic blocks"
containing a series of operations operating on temporaries or similar.
I have mostly ended up keeping the stack IL as:
AST -> Stack: Easy
AST -> 3AC/SSA: Not as Easy
Along with:
Stack -> 3AC / SSA: Relatively easy;
Serializing an IR in SSA form: Massive pain.
As for AST structure, I have tried a few major strategies:
S-Expressions: Used in my early VM projects (Scheme based);
Made a return for most of my "BGBScript 1.x" VMs (~ 2006..2013).
XML Based, used in the first version of BGBScript;
Still used in BGBCC, which was a fork off the first VM.
Later on, BGBScript switched back to S-Expressions, using a core VM
based on my original Scheme interpreter.
JSON style: Used in my BGBScript2 VM (~ 2015..2018). This technically
worked well, and had a "good compromise" between XML nodes and
S-Expressions. But, this VM was short lived (mostly used in my
"BGBTech2" 3D engine, and mostly died off along with it).
Where, syntactically:
BGBScript was along similar lines to JavaScript and ActionScript.
BGBScript2 was a constrained BGBScript, and mostly went over to a more Java-like syntax (but retains some elements of its predecessors).
BGBCC was originally a case of, when I was younger, it didn't seem like
that huge of a stretch to go from a "terrible JavaScript knock off" over
to trying to compile C (I was pretty wrong about this, but did
eventually end up with a C compiler...).
Currently I am working on my "BJX2 CPU ISA" project, which mostly uses
BGBCC (my C compiler, but also compiles a variant of BGBScript2). In its present form, BGBCC is using a similar structure to that of the dictionary-object approach, but still representing things in terms of
XML abstractions.
In this case, nodes consist of a collection of key/value pairs, where:
Each 'key' is a 16-bit tag (4-bit type, 12-bit symbol ID or index);
Each 'value' is 64 bits, with a meaning depending on the 4b type.
Each node only directly holds 14 key/value pairs, if the node needs more
than this, it expands out B-Tree style, so:
1 node, 1 level, 14 keys;
15 nodes, 2 levels, 196 keys;
211 nodes, 3 levels, 2744 keys;
...
All the keys are kept sorted, so access time is "O(log2 n)".
Most calls to get/set a key in a node supply both a name and a pointer
to a variable to cache the ID number. Typically, the ID is primary, but
the string is used if the ID number hasn't been initialized yet
(originally, everything used strings, but this was not ideal for
performance).
Also originally, nodes were organized into linked lists (with internal linkage), but this has been eliminated.
But, it sort of awkwardly fakes the original DOM style interface mostly
because otherwise I would have needed to rewrite both the parser and
compiler to use the dictionary objects directly.
There is some trickery, say, something at the AST level like:
<if>
<cond>
<binary op=">">
<left> <ref name="x"/> </left>
<right> <int value="0"/> </right>
</binary>
</cond>
<then> ... </then>
</if>
Is internally represented as:
{ tag: #if,
cond:
{tag: #binary, op: ">",
left: {tag: #ref, name: "S"},
right: {tag: #int, value: 0}}
then: { ... }
}
With named keys holding a node implicitly assumed to represent a
"virtual node" as far as the XML is concerned.
There are no arrays, rather array-like cases are handled via
index-number keys and some additional hackery to deal with nodes with a
large number of children.
For nodes which represent a body section containing an ordered list of
child nodes, these are given a sequentially assigned index number
(almost always, the newest node is added to the end of the list).
For a small to moderate number of indexed child nodes, key ranges could be:
000..3FF: Named Keys
400..7FF: Radix Space
800..FFF: Indexed Keys (reverse numbered)
Where, say, the "radix" hack could be used to significantly extend the
max number of indexed keys (via dividing the index space into multiple
levels of nested nodes).
Though, a case could be made for skipping the whole XML thing...
A case could also be made for using tagged references as values rather
than type-tagging the key.
My guess is that it's best to keep the number of different node types
reasonably small. That should mean less code is needed to walk the
tree. But when does that number become too small?
For example, consider expressions.
There /could/ be one node type for "+" and another for "*", as in a
node with these fields:
+
left
right
Alternatively, both operators could be handled by a single node type
of "dyadic operation". Such a node would have four fields:
nodetype = dyadic
operator (e.g. "+")
left
right
The problem is, there are loads of ways to do this, they will all work.
There is no right or wrong way.
I use both of the above methods across three active compilers. With the first, its easy to dispatch directly the "+" node when you need to deal
with "add".
With the second, you have to dispatch first on "dyadic" and then on "operator". But it's not a big deal.
I used a "binary" node for all of the binary operators.
That could be further generalised by having one type of node which
would be usable for both monadic and dyadic operators:
nodetype = operation
operator
number of arguments: 1 or 2
arguments
Further, why limit the number of arguments to 1 or 2? If one were to
allow an arbitrary number of arguments then such a type of node could
also be used for function calls.
How practical this is depends on your language, the arrangements for
matching an arbitrary length list to the node.
Currently I allow 3 operands for one language, and 2 for another.
Remember this not just about expressions, it has to be able to represent anything.
For variable-length lists, any of the operands (although usually just
the first) is interpreted as the root of a linked list. (With some
compiler versions in dynamic code, each operand could directly be a list.)
In fact, that form is effectively an s expression of the form
(operator arguments)
So for expressions, which of the above schemes is best to use for
nodes in an IR tree?
(A potential fly in the ointment is whether such a node is suitable
for functions which modify some of their parameters.)
Then there are other (non-expression) nodes.
To allow for things like code hoisting I think I need the tree
initially to be at a fairly high level, containing things such as
loops and selections though it would be lowered later to use labels
and conditional branches.
So what other node types are needed? One could have
program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch
Is that enough? Too many? Too few? What types are missing?
My ASTs are unusual (although I only discovered this recently) in that
they only represent executable code. That is, the bodies of functions,
or init data for a variable.
Others use the AST also for non-executable elements, like function declarations. I couldn't see the point of this, unless this is a
scripting where declarations /are/ executed from the top of the module.
However even without declaration nodes, I use either 130 or 200 node
types (depending on whether ADD, SUB etc have dedicated tags, or grouped under one BINOP tag). My C compiler uses 80.
But you must have already done all this in your compiler; didn't that
use an AST?
My ASTs hold "nearly everything", though for C, things like struct
bodies and typedefs are folded off into their own special lists.
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)