The only good arguments I have heard wrt big architectural register
files have to do with things like register windows and/or optimizing
the CALL/RET interface.
But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the
same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.
IIRC the Mill had actually 2 categories of "second-class registers":
the stack and the scratch registers.
I think you can get similar benefits with "cache-line sized" memory
operations that load/store several registers at a time (assuming you
have good enough store-to-load forwarding). Or even fold those
loads and stores into some kind of CALL/RET instructions, which can let you
start the control-flow part of the CALL before the stores, and similarly
start the loads before the control-flow part of the RET is done.
Stefan
Stefan Monnier writes:
> The only good arguments I have heard wrt big architectural register
> files have to do with things like register windows and/or optimizing
> the CALL/RET interface.
> But even there, it justifies only additional "second-class registers",
> i.e. where the set of immediately addressable registers can still be the
> same size as usual (e.g. 16 or 32), but you can quickly push some of
> those to some kind of "stack" and then pull them back in.
Not efficiently. You would have to wait until the last instruction
has written back its result, then make the switch, and only then start
reading registers from instructions behind the SAVE/RESTORE
instruction. Each SAVE and each RESTORE would cost several cycles
even on an in-order machine. Not what the mechanism was designed for.
> I think you can get similar benefits with "cache-line sized" memory
> operations that load/store several registers at a time (assuming you
> have good enough store-to-load forwarding).
ARM A64's load pair and store pair instructions.
> Or even fold those
> loads&stores into some kind of CALL/RET instructions, which can let you
> start the control-flow part of the CALL before the stores, and similarly
> start the loads before the control flow part of the RET is done.
In an OoO machine with correct predictions (the usual case), control
flow often runs far ahead of functional-unit processing and retirement
(and only retirement is architectural execution). Any stores on the
predicted control flow will be speculatively performed as soon as
their source data is available, and the same goes for loads, with
(non)aliases being predicted. Plus, really modern machines can often
achieve 0-cycle store-to-load forwarding. All of this makes
mechanisms like register windows and IA-64's register stack
unnecessary.
- anton
> Not efficiently. You would have to wait until the last instruction
> has written back its result, then make the switch, and only then start
> reading registers from instructions behind the SAVE/RESTORE
> instruction.
What you write is INVARIABLY true if SW is the one that has to do
this work. The previous value has to leave the register before the
new value arrives to be written.
> The only good arguments I have heard wrt big architectural register
> files have to do with things like register windows and/or optimizing
> the CALL/RET interface.
> But even there, it justifies only additional "second-class registers",
> i.e. where the set of immediately addressable registers can still be the
> same size as usual (e.g. 16 or 32), but you can quickly push some of
> those to some kind of "stack" and then pull them back in.
> IIRC the Mill had actually 2 categories of "second-class registers":
> the stack and the scratch registers.
The Mill "belt" (I assume this is what you call the "stack") corresponds to
the "first-class registers": all computations take operands from the belt
and push results to the belt. When a function is called, it sees a fresh
belt with only its "in" arguments; when the function returns, it leaves its
"out" results pushed on the belt that the caller sees.
That's right. The belt is the closest that the Mill has to "first-class
registers", tho in a sense it addresses elements of the forwarding
network more than "registers".
What I meant by "stack" is the place where scratch registers and
in-flight belt values get pushed/popped when you enter/leave a function,
which gives you a kind of "register window" functionality.
IIRC the Mill
documents it as living in memory but the moment when stack elements
actually reach memory was never clearly specified, so I assume the
idea was that it could be kept in "second class registers" (IIRC
that was managed by the "spiller" you refer to).
Of course, in a traditional CPU, the top of the stack is kept in the L1
cache and only ever touches the higher levels of the memory hierarchy
when cache pressure is very high or upon context switches, so the L1
cache plays a similar role.
Not sure if a set of "second class
registers" dedicated to storing the top of the stack (like SPARC has,
and the Mill seemed to want to have) can be made to be faster and/or
lower power than the L1 cache to justify the effort.
On 2025-07-18 18:29, Stefan Monnier wrote:
> That's right. The belt is the closest that the Mill has to "first-class
> registers", tho in a sense it addresses elements of the forwarding
> network more than "registers".
Yes. Values on the belt do not stay at fixed addresses ("register
numbers", "names") but move to new addresses (larger offsets from the
"top") as computations push more results on the belt. As I understand
it, the HW implementation is indeed like a renaming/forwarding network,
although the "names" (offsets from the "top") are known to the compiler
because of the static instruction scheduling and known latencies of all
operations.
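The positional addressing described above can be illustrated with a toy model. This is only a sketch: the `Belt` class, the belt length of 8, the argument order on the callee's belt, and the `call` helper are all assumptions made for illustration, not a description of the actual Mill hardware or ABI.

```python
from collections import deque


class Belt:
    """Toy belt: a fixed-length queue addressed by distance from the front."""

    def __init__(self, length=8, initial=()):
        # Position 0 always holds the most recently pushed value.
        self._slots = deque(maxlen=length)
        for value in initial:
            self.push(value)

    def push(self, value):
        # The new result takes position 0, every older value moves one
        # position further back, and the oldest value falls off the end.
        self._slots.appendleft(value)

    def get(self, position):
        return self._slots[position]


def call(function, args, length=8):
    """Run `function` on a fresh belt that holds only its arguments."""
    callee_belt = Belt(length, initial=args)
    return function(callee_belt)


def add_then_double(belt):
    # Arguments were pushed in order, so the last argument is at position 0.
    belt.push(belt.get(0) + belt.get(1))  # sum is now at position 0
    belt.push(belt.get(0) * 2)            # doubled sum at 0, sum at 1
    return belt.get(0)


caller_belt = Belt(length=4)
caller_belt.push(3)
caller_belt.push(4)
result = call(add_then_double, [caller_belt.get(1), caller_belt.get(0)])
caller_belt.push(result)  # the callee's result lands on the caller's belt
```

Note that the callee never sees or disturbs the caller's belt positions; only the returned value is pushed on the caller's side, which is the property that makes the model resemble a register window.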
> What I meant by "stack" is the place where scratch registers and
> in-flight belt values get pushed/popped when you enter/leave a function,
> which gives you a kind of "register window" functionality.
Ah, apologies for my wrong assumption. (But calling that a "stack" risks
confusion with the normal SW stack in memory, which is certainly still
needed in a Mill processor, for example to pass function arguments by
reference.)
I agree that the Mill architecture, with its "new belt for each call"
and "new scratch-pad for each call" programmer's model, is similar to
register-window designs.
On Fri, 18 Jul 2025 20:17:23 +0000, Niklas Holsti wrote:
> I agree that the Mill architecture, with its "new belt for each call"
> and "new scratch-pad for each call" programmer's model, is similar to
> register-window designs.
Allow me to disagree: in the My 66000 ABI, up to 50% of subroutines
do not need any preserved registers, and save time by not saving
and restoring them. Many of these do not need stack space (local
variables) either, saving even more time. These subroutines can all be
performed in the temporary registers provided by the ABI.
Here I would disagree with the new belt per subroutine, as it
costs too much (time and energy).
When a subroutine DOES need saving and restoring of registers,
and allocation of stack space, this can all be done in a single
instruction each for the prologue (ENTER) and the epilogue (EXIT).
Here I would agree with the new belt per subroutine.
I do agree with some of what Mill does, including placing the
preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".
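To make the "single instruction for prologue and epilogue" idea concrete, here is a toy model of what such a pair of operations has to accomplish. It is a sketch only: the `ToyCpu` class, the `enter`/`exit` method signatures, the register numbering, and the downward-growing word-indexed stack are all hypothetical and are not taken from the My 66000 specification.

```python
class ToyCpu:
    """Toy register file plus a downward-growing stack of words."""

    def __init__(self, num_regs=32, stack_words=64):
        self.regs = [0] * num_regs
        self.stack = [0] * stack_words
        self.sp = stack_words  # index one past the top; the stack grows down

    def enter(self, first, last, local_words):
        """Prologue in one step: save regs first..last, reserve locals."""
        for reg in range(first, last + 1):
            self.sp -= 1
            self.stack[self.sp] = self.regs[reg]
        self.sp -= local_words

    def exit(self, first, last, local_words):
        """Epilogue in one step: release locals, restore regs first..last."""
        self.sp += local_words
        for reg in range(last, first - 1, -1):
            self.regs[reg] = self.stack[self.sp]
            self.sp += 1


cpu = ToyCpu()
cpu.regs[16:19] = [111, 222, 333]   # caller's values in preserved registers
sp_before = cpu.sp

cpu.enter(16, 18, local_words=4)    # non-leaf subroutine: one prologue step
cpu.regs[16:19] = [0, 0, 0]         # the body is free to clobber them
cpu.exit(16, 18, local_words=4)     # one epilogue step restores everything
```

A subroutine that lives entirely in temporary registers would simply never call `enter` or `exit`, which is the cheap case the post describes for up to half of all subroutines.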
On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:
> I do agree with some of what Mill does, including placing the
> preserved registers in memory where they cannot be damaged.
> My 66000 calls this mode of operation "safe stack".
This sounds like an idea worth stealing, although no doubt the way I
would attempt to copy it would be a failure which removed all the
usefulness of it.
For one thing, I don't have a stack for calling subroutines, or any other
purpose.
But I could easily add a feature where a mode is turned on, and instead of
using the registers, it works off of a workspace pointer, like the TI 9900.
The trouble is, though, that this would be an extremely slow mode. When
registers are _saved_, they're already saved to memory, as I can't think
of anywhere else to save them. (There might be multiple sets of registers,
for things like SMT, but *not* for user vs supervisor or anything like
that.)
So I've probably completely misunderstood you here.
John Savard
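For readers unfamiliar with the workspace-pointer scheme mentioned above, here is a toy model of it. On the TI 9900 family the 16 general registers live in memory starting at the address held in the workspace pointer (WP), and a context switch stores the old WP in the new workspace's R13. The `WorkspaceMachine` class and its method names are hypothetical, and the sketch leaves out the saved program counter and status word (R14 and R15).

```python
class WorkspaceMachine:
    """Toy model: 16 registers that live in memory at WP + 2 * n."""

    NUM_REGS = 16
    WORD_BYTES = 2

    def __init__(self, memory_words=256):
        self.memory = [0] * memory_words  # backing store, one entry per word
        self.wp = 0                       # workspace pointer, a byte address

    def _index(self, reg):
        if not 0 <= reg < self.NUM_REGS:
            raise ValueError("register number out of range")
        return (self.wp + self.WORD_BYTES * reg) // self.WORD_BYTES

    def read(self, reg):
        return self.memory[self._index(reg)]

    def write(self, reg, value):
        self.memory[self._index(reg)] = value & 0xFFFF

    def switch_workspace(self, new_wp):
        """Switch register sets by changing one pointer; nothing is copied."""
        old_wp = self.wp
        self.wp = new_wp
        self.write(13, old_wp)  # link back to the previous workspace

    def return_workspace(self):
        self.wp = self.read(13)


machine = WorkspaceMachine()
machine.wp = 32
machine.write(1, 0x1234)        # R1 in the outer workspace
machine.switch_workspace(96)
machine.write(1, 0xBEEF)        # R1 in the inner workspace, a different word
inner_r1 = machine.read(1)
machine.return_workspace()
outer_r1 = machine.read(1)
```

The model shows why switching is cheap and why every register access is slow: "saving" the registers is just a pointer change, but each `read` or `write` is a memory access.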
I have harped on you for a while to start development of your compiler.
One of the first things a compiler needs to do is to develop its means
to call subroutines and return back. This requires a philosophy of
passing arguments, returning results, dealing with recursion, and dealing
with TRY-THROW-CATCH SW-defined exception handling. I KNOW of nobody who
does this without some kind of stack.
There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access to local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.
In most cases you have either one or the other but not both. E.g. in
C we don't have nested functions, and in JavaScript functions are
heap-allocated objects.
Other than GNU C (with its support for nested functions), which other
language has this weird combination of features?
Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.
On 8/30/2025 1:22 PM, Stefan Monnier wrote:
> Better skip the redirection and make function pointers take up 2 words
> (address of the code plus address of the context/environment/GOT), so
> there's no dynamic allocation involved.
FDPIC typically always uses the normal pointer width, just with more indirection:
Load target function pointer from GOT;
Save off current GOT pointer to stack;
Load code pointer from function pointer;
Load GOT pointer from function pointer;
Call function;
Reload previous GOT pointer.
It, errm, kinda sucks...
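The two representations discussed above (a one-word pointer to a descriptor blob versus a two-word pointer carrying code and context together) can be contrasted in a small model. This is a sketch under stated assumptions: `Descriptor`, `Machine`, `call_indirect`, and `call_fat` are hypothetical names, the dictionary `heap` stands in for memory, and the load counter only models the extra dereference of the descriptor, not a real FDPIC ABI.

```python
from typing import Any, Callable, NamedTuple


class Descriptor(NamedTuple):
    code: Callable[..., Any]  # stands in for the code address
    context: Any              # stands in for the GOT/environment address


class Machine:
    def __init__(self, got=None):
        self.got = got   # models the dedicated GOT/context register
        self.loads = 0   # counts simulated loads through a function pointer


def call_indirect(machine, heap, pointer, *args):
    """One-word function pointer: `pointer` is the address of a descriptor."""
    saved_got = machine.got          # save the caller's GOT pointer
    descriptor = heap[pointer]
    machine.loads += 2               # load code pointer, then GOT pointer
    machine.got = descriptor.context
    try:
        return descriptor.code(machine, *args)
    finally:
        machine.got = saved_got      # reload the previous GOT pointer


def call_fat(machine, fat_pointer, *args):
    """Two-word function pointer: code and context travel together."""
    saved_got = machine.got
    machine.got = fat_pointer.context
    try:
        return fat_pointer.code(machine, *args)
    finally:
        machine.got = saved_got


def add_offset(machine, x):
    # The callee finds its data through the context register.
    return x + machine.got["offset"]


caller_got = {"offset": 0}
machine = Machine(got=caller_got)
heap = {0x1000: Descriptor(add_offset, {"offset": 10})}

via_descriptor = call_indirect(machine, heap, 0x1000, 5)
via_fat_pointer = call_fat(machine, Descriptor(add_offset, {"offset": 100}), 5)
```

The same `Descriptor` shape also covers nested functions: if `context` holds the enclosing frame's variables instead of a GOT, the pair is a closure, and no executable stack or heap trampoline is needed. The cost is that function pointers no longer fit in a single word.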