Forum: >>> Magnum BBS <<<

Re: Register windows

From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jul 17 16:53:59 2025

On Thu, 17 Jul 2025 16:20:13 +0000, Stefan Monnier wrote:

The only good arguments I have heard wrt big architectural register
files has to do with things like Register-Windows and/or optimizing
CALL/RET interface.

But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the
same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.
IIRC the Mill had actually 2 categories of "second-class registers":
the stack and the scratch registers.

I think you can get similar benefits with "cache-line sized" memory operations that load/store several registers at a time (assuming you
have good enough store-to-load forwarding). Or even fold those
loads&stores into some kind of CALL/RET instructions, which can let you
start the control-flow part of the CALL before the stores, and similarly start the loads before the control flow part of the RET is done.

Yes, that is what we ended up doing in My 66000:
ENTER saves the preserved registers (and generally Ret Address)
EXIT restores the preserved registers and transfers control back
....to caller

Both ENTER and EXIT can perform 1-32 reads/writes to sequential
memory addresses and add/sub to SP and optionally setup FP.

ENTER and EXIT are within the called subroutine.

By using ENTER and EXIT, the preserved registers can be written
to memory the normal LD/ST instructions have no access, so bad
array indexes cannot harm the call/return contract, eliminating
ROP attacks.

I am happier with this than the register windows stuff I had to
deal with in SPARC.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jul 17 19:38:07 2025

On Thu, 17 Jul 2025 17:38:35 +0000, Anton Ertl wrote:

Stefan Monnier <[email protected]> writes:

The only good arguments I have heard wrt big architectural register
files has to do with things like Register-Windows and/or optimizing
CALL/RET interface.

But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the >>same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.

Not efficiently. You would have to wait until the last instruction
has written back its result, then make the switch, and only then start reading registers from instructions behind the SAVE/RESTORE
instruction.

What you write is INVARIABLY true if SW is the one that has to do
this work. The previous value has to leave the register before the
new value arrives to be written.

It is not true at all if HW is the one doing the work. HW can send out
cache line reads for the new data, while the (soon to be) previous
values sit in RF--then--as data arrives, HW can readout the previous
value while writing in the new values, shipping the previous values
to memory as they fill cache line buffers.

So, how does HW remember where the registers were in memory--that my
friends is called Architecture (not Arch 101, but Arch 315).

And that is why My 66000 has HW perform context switching--CPU
continues to run while the front end gets the new data ready--
then switches to the new (interrupting) thread in "just a few
cycles" including all the Program Status Line stuff which includes
Root pointers, priority, privilege, ASID,...

Oh, and when done this way--when control arrives at dispatcher/handler
it is already re-entrant (no Interrupt/Exception Disable),...

Each SAVE and each RESTORE would cost several cycles
even on an in-order machine. Not what the mechanism was designed for.

I think you can get similar benefits with "cache-line sized" memory >>operations that load/store several registers at a time (assuming you
have good enough store-to-load forwarding).

ARM A64's load pair and store pair instructions.

Or even fold those
loads&stores into some kind of CALL/RET instructions, which can let you >>start the control-flow part of the CALL before the stores, and similarly >>start the loads before the control flow part of the RET is done.

In an OoO machine with correct predictions (the usual case), control
flow often runs far ahead of functional-unit processing and retirement

Hundreds of instructions--or 15-20 conditional branches.

(and only retirement is architectural execution). Any stores on the predicted control flow will be speculatively performed as soon as
their source data is available,

Where the word "speculation" places a requirement on the CPU to
buffer the stored result but no actually update cache/memory
until the ST is retired. ST-to-LD forwarding CAN happen.

I could argue that this is not what the word "performed" is generally
regarded as implying. Instead, I would use the term "conditionally
stored" as is what we documented in Mc 88120 (1992) and what the
conditional cache did. While sitting in the CC, ST-to-LD forwarding
was performed; and after retirement ST.data migrates to the cache/
memory.

and the same goes for loads, with (non)aliases being predicted. Plus really modern machines often can
achieve 0-cycle store-to-load forwarding. All of this makes
mechanisms like register windows and IA-64's register stack
unnecessary.

Along with 0-cycle MOVs, and 0-cycle BC-misprediction recovery.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Thu Jul 17 16:18:47 2025

Not efficiently. You would have to wait until the last instruction
has written back its result, then make the switch, and only then start
reading registers from instructions behind the SAVE/RESTORE
instruction.

What you write is INVARIABLY true if SW is the one that has to do
this work. The previous value has to leave the register before the
new value arrives to be written.

With register renaming that is not a requirement any more: the new value
is sent to another physical register.
I'm pretty sure you know that, so I wonder what it is that you meant.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to Stefan Monnier on Fri Jul 18 18:11:51 2025

On 2025-07-17 19:20, Stefan Monnier wrote:

The only good arguments I have heard wrt big architectural register
files has to do with things like Register-Windows and/or optimizing
CALL/RET interface.

But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the
same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.
IIRC the Mill had actually 2 categories of "second-class registers":
the stack and the scratch registers.

As I understood the Mill (from a few years ago, possibly changed since):

The Mill "belt" (I assume this is what you call the "stack") corresponds
to the "first-class registers": all computations take operands from the
belt and push results to the belt. When a function is called, it sees a
fresh belt with only its "in" arguments; when the function returns, it
leaves its "out" results pushed on the belt that the caller sees.

The Mill "scratch" memory is a function-local piece of fast (on-chip)
RAM, allocated on function entry and deallocated on return, with
function-local addressing. It is like a register file in that it cannot
be addressed indirectly, only by static (but function-local) addresses,
but like a stack frame in being local to a function and having a
variable size, depending on the function.

The necessary physical data transfers and renamings are handled by the "spiller" HW component which has internal fast data buffering and uses
cache or main RAM if the internal buffering overflows.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Fri Jul 18 11:29:00 2025

The Mill "belt" (I assume this is what you call the "stack") corresponds to the "first-class registers": all computations take operands from the belt
and push results to the belt. When a function is called, it sees a fresh
belt with only its "in" arguments; when the function returns, it leaves its "out" results pushed on the belt that the caller sees.

That's right. The belt is the closest that the Mill has to "first-class registers", tho in a sense it addresses elements of the forwarding
network more than "registers".

What I meant by "stack" is the place where scratch registers and
in-flight belt values get pushed/popped when you enter/leave a function,
which give you a kind of "register window" functionality. IIRC the Mill documents it as living in memory but the moment when stack elements
actually reach memory was never clearly specified, so I assume the
idea was that it could be kept in "second class registers" (IIRC
that was managed by the "spiller" you refer to).

Of course, in a traditional CPU, the top of the stack is kept in the L1
cache and only ever touches the higher levels of the memory hierarchy
when cache pressure is very high or upon context switches, so the L1
cache plays a similar role. Not sure if a set of "second class
registers" dedicated to storing the top of the stack (like SPARC has,
and the Mill seemed to want to have) can be made to be faster and/or
lower power than the L1 cache to justify the effort.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to Stefan Monnier on Fri Jul 18 23:17:23 2025

On 2025-07-18 18:29, Stefan Monnier wrote:

The Mill "belt" (I assume this is what you call the "stack") corresponds to >> the "first-class registers": all computations take operands from the belt
and push results to the belt. When a function is called, it sees a fresh
belt with only its "in" arguments; when the function returns, it leaves its >> "out" results pushed on the belt that the caller sees.

That's right. The belt is the closest that the Mill has to "first-class registers", tho in a sense it addresses elements of the forwarding
network more than "registers".

Yes. Values on the belt do not stay at fixed addresses ("register
numbers", "names") but move to new addresses (larger offsets from the
"top") as computations push more results on the belt. As I understand
it, the HW implementation is indeed like a renaming/forwarding network, although the "names" (offsets from the "top") are known to the compiler
because of the static instruction sheduling and known latencies of all operations.

What I meant by "stack" is the place where scratch registers and
in-flight belt values get pushed/popped when you enter/leave a function, which give you a kind of "register window" functionality.

Ah, apologies for my wrong assumption. (But calling that a "stack" risks confusion with the normal SW stack in memory, which is certainly still
needed in a Mill processor, for example to pass function arguments by reference.)

I agree that the Mill architecture, with its "new belt for each call"
and "new scratch-pad for each call" programmer's model, is similar to register-window designs.

IIRC the Mill
documents it as living in memory but the moment when stack elements
actually reach memory was never clearly specified, so I assume the
idea was that it could be kept in "second class registers" (IIRC
that was managed by the "spiller" you refer to).

As I understood it, the HW implementation of the Mill's
function-specific belt and function-specific scratch-pad is not meant to
be visible to a normal application program running in a Mill processor.
A function executing in a normal program cannot access the belt of its
caller, nor the scratch-pad of its caller.

For debuggers and other system tools that need such visiblity the intent
was to provide a dedicated (HW-dependent) interface or "service",
essentially an interface to the spiller HW. But the structure of the
spiller and its buffers and memory areas is not part of the visible Mill architecture, rather it is specific to each implementation ("model") of
a Mill processor core.

Of course, in a traditional CPU, the top of the stack is kept in the L1
cache and only ever touches the higher levels of the memory hierarchy
when cache pressure is very high or upon context switches, so the L1
cache plays a similar role.

A similar role from the performance point of view, yes, with the
difference that data in the L1 cache is accessed via normal load/store instructions, while the spiller data structures would not be accessible
in that way.

As I understood it, the spiller would have some private internal buffers
for your "stack", and would use main memory (via the caches) as backup
when the private buffers overflowed. Thus the most recently spilled data
would be either in the spiller's private buffers or in the L1 cache.

Not sure if a set of "second class
registers" dedicated to storing the top of the stack (like SPARC has,
and the Mill seemed to want to have) can be made to be faster and/or
lower power than the L1 cache to justify the effort.

I am not a HW guy, but I think the answer is yes. However, speed is not
the only advanage of such "second class" registers: they can also make
for a simpler programmer's model where the second-class registers are
hidden, implicit, and need not (and perhaps cannot) be named in the instructions.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Niklas Holsti on Sun Jul 20 17:28:37 2025

On Fri, 18 Jul 2025 20:17:23 +0000, Niklas Holsti wrote:

On 2025-07-18 18:29, Stefan Monnier wrote:

The Mill "belt" (I assume this is what you call the "stack") corresponds >>> to
the "first-class registers": all computations take operands from the
belt
and push results to the belt. When a function is called, it sees a fresh >>> belt with only its "in" arguments; when the function returns, it leaves
its
"out" results pushed on the belt that the caller sees.

That's right. The belt is the closest that the Mill has to "first-class
registers", tho in a sense it addresses elements of the forwarding
network more than "registers".

Yes. Values on the belt do not stay at fixed addresses ("register
numbers", "names") but move to new addresses (larger offsets from the
"top") as computations push more results on the belt. As I understand
it, the HW implementation is indeed like a renaming/forwarding network, although the "names" (offsets from the "top") are known to the compiler because of the static instruction sheduling and known latencies of all operations.

What I meant by "stack" is the place where scratch registers and
in-flight belt values get pushed/popped when you enter/leave a function,
which give you a kind of "register window" functionality.

Ah, apologies for my wrong assumption. (But calling that a "stack" risks confusion with the normal SW stack in memory, which is certainly still
needed in a Mill processor, for example to pass function arguments by reference.)

I agree that the Mill architecture, with its "new belt for each call"
and "new scratch-pad for each call" programmer's model, is similar to register-window designs.

Allows me to disagree: in My 66000 ABI, up to 50% of subroutines
do not need any preserved registers, and save time by not saving
and restoring them, many of these do not need stack space (local
variables) saving even more time. These subroutines can all be
performed in the temporary registers provided by ABI.

Here I would disagree with the new belt per subroutine as it
cost too much (time and energy).

When a subroutine DOES need saving and restoring of registers,
and allocation of stack space, this can all be done in a single
instruction for prologue (ENTER) an d epilogue (EXIT).

Here I would agree with the new belt per subroutine.

I do agree with some of what Mill does, including placing the
preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to All on Sun Jul 20 22:27:20 2025

On Sun, 20 Jul 2025 17:28:37 +0000, [email protected] (MitchAlsup1)
wrote:

On Fri, 18 Jul 2025 20:17:23 +0000, Niklas Holsti wrote:

On 2025-07-18 18:29, Stefan Monnier wrote:

The Mill "belt" (I assume this is what you call the "stack") corresponds >>>> to
the "first-class registers": all computations take operands from the
belt
and push results to the belt. When a function is called, it sees a fresh >>>> belt with only its "in" arguments; when the function returns, it leaves >>>> its
"out" results pushed on the belt that the caller sees.

That's right. The belt is the closest that the Mill has to "first-class >>> registers", tho in a sense it addresses elements of the forwarding
network more than "registers".

Yes. Values on the belt do not stay at fixed addresses ("register
numbers", "names") but move to new addresses (larger offsets from the
"top") as computations push more results on the belt. As I understand
it, the HW implementation is indeed like a renaming/forwarding network,
although the "names" (offsets from the "top") are known to the compiler
because of the static instruction sheduling and known latencies of all
operations.

What I meant by "stack" is the place where scratch registers and
in-flight belt values get pushed/popped when you enter/leave a function, >>> which give you a kind of "register window" functionality.

Ah, apologies for my wrong assumption. (But calling that a "stack" risks
confusion with the normal SW stack in memory, which is certainly still
needed in a Mill processor, for example to pass function arguments by
reference.)

I agree that the Mill architecture, with its "new belt for each call"
and "new scratch-pad for each call" programmer's model, is similar to
register-window designs.

Allows me to disagree: in My 66000 ABI, up to 50% of subroutines
do not need any preserved registers, and save time by not saving
and restoring them, many of these do not need stack space (local
variables) saving even more time. These subroutines can all be
performed in the temporary registers provided by ABI.

Here I would disagree with the new belt per subroutine as it
cost too much (time and energy).

[Take with salt. Hope I got this right ... it is how I have understood
it based on Ivan's presentations. Details might be incorrect or may
have been misunderstood.]

Saving and restoring the Mill's belt costs between nothing (best case)
and very little. The belt is the CPU's bypass network, and many of
its elements are simply data sitting in the various FU output latches.

Every possible location has a unique name, and there is a naming unit
which keeps track of the location of everything. Every location name
includes an id for the call frame to which the data belongs.

When a new function is entered, the frame id changes and so no data on
the existing belt will match. Presto! belt empty. When the function
returns, the caller's frame id is restored and data having that frame
id again becomes visible.

If/when FUs need to overwrite existing data in their output latches,
there is a "spiller" unit which moves the data to a new location
within its own internal memory. Nothing ever gets moved back - the
spiller's memory is connected directly to the bypass network and the
spiller will respond if any location name matches something it is
currently storing.

The spiller memory, of course, can fill up and then data gets pushed
out to RAM (and eventually pulled in again as the call stack unwinds).
If data is not in the spiller when it is needed again, it is treated
as-if a load miss and the pipeline is stalled. The size of the
spiller's memory is one of the many configuration parameters used to
define a new chip.

The spiller acts like an associative cache limited to existing belt
data. When a frame is no longer needed, any spiller locations which
held its data become available for reuse.

There are, in fact, 2 levels of spiller. Ivan has said that - in his experience - the 1st level spiller typically is able to hold data for
many frames, so most operations involving it will run at core speed.

When a subroutine DOES need saving and restoring of registers,
and allocation of stack space, this can all be done in a single
instruction for prologue (ENTER) an d epilogue (EXIT).

Here I would agree with the new belt per subroutine.

I do agree with some of what Mill does, including placing the
preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to All on Mon Jul 21 12:11:20 2025

On 2025-07-20 20:28, MitchAlsup1 wrote:

On Fri, 18 Jul 2025 20:17:23 +0000, Niklas Holsti wrote:

On 2025-07-18 18:29, Stefan Monnier wrote:

[ snip ]

I agree that the Mill architecture, with its "new belt for each call"
and "new scratch-pad for each call" programmer's model, is similar to
register-window designs.

Allows me to disagree:

I'm not sure what exactly you disagree with: with the statement that the
Mill belt is "similar" to register-window designs? or with the Mill's
belt design being a good idea in general? From your later text it seems
to be the latter, which I did not claim.

in My 66000 ABI, up to 50% of subroutines
do not need any preserved registers, and save time by not saving
and restoring them, many of these do not need stack space (local
variables) saving even more time. These subroutines can all be
performed in the temporary registers provided by ABI.

Good, but what is the conclusion? That the My 66000 is better than register-window machines, or better than the Mill, or something else?

The only register-window machine I am familiar with is the SPARC. The
SPARC CALL instruction does not by itself open a new register window,
and the SPARC compilers I have used can generate code for many
subroutines without using the SAVE/RESTORE instructions that do invoke
the register windowing, just as you do on the My 66000. However, the
SPARC compilers, with limited global optimization, did this only for
leaf subroutines that do not call any other subroutines.

Here I would disagree with the new belt per subroutine as it
cost too much (time and energy).

As George Neuner explained in his reply, the cost is usually zero or
small. The same holds for the SPARC SAVE instruction, depending on HW
resources and SW structure (also true for the Mill case).

When a subroutine DOES need saving and restoring of registers,
and allocation of stack space, this can all be done in a single
instruction for prologue (ENTER) and epilogue (EXIT).

That seems quite similar to the SPARC case, where use of SAVE/RESTORE is optional and done only when either a new register window is needed or a
stack frame is needed, with SAVE ~ ENTER and RESTORE ~ EXIT. However,
I'm sure the details of the instructions and their HW implementations
differ, and so may the performance.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Mon Jul 21 15:42:21 2025

On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:

I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".

This sounds like an idea worth stealing, although no doubt the way I
would attempt to copy it would be a failure which removed all the
usefulness of it.

For one thing, I don't have a stack for calling subroutines, or any other purpose.

But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.

The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
of anywhere else to save them. (There might be multiple sets of registers,
for things like SMT, but *not* for user vs supervisor or anything like
that.)

So I've probably completely misunderstood you here.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to All on Thu Aug 21 21:48:03 2025

John Savard <[email protected]d> posted:

On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:

I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".

This sounds like an idea worth stealing, although no doubt the way I
would attempt to copy it would be a failure which removed all the
usefulness of it.

For one thing, I don't have a stack for calling subroutines, or any other purpose.

But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.

The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
that.)

In reverse order:
If TI 9900 used their registers like a write-back cache, then typical access would be fast and efficient. When the register pointer is altered, the old
file is written en-massé and a new file is read in en-massé {possibly with some buffering to lessen visible cycle count} ... but I digress.

{Conceptually}
My 66000 uses this concept for its GPRs and for its Thread State but only at context switch time, not for subroutine calls and returns. HW saves and restores Thread State and Registers on context switches so that the CPU
never has to Disable Interrupts (it can, it just doesn't have to). {/Conceptually}
I bracketed the above with 'Conceptually' because it is completely
reasonable to envision a Core big enough to have 4 copies of Thread
State and Register files, and bank switch between them. The important properties are that the switch delivers control reentrantly, HOW any
given implementation does that is NOT architecture--that it does IS architecture.

I specifically left how many registers are preserved to SW per CALL
because up to 50% need 0, and few % require more than 4. This appears
to indicate that SPARC using 8 was overkill ... but I digress again.

Safe Stack is a stack used for preserving the ABI contract between caller
and callee even in the face of buffer overruns, RoP, and other malicious program behavior. SS places the return address and the preserved registers
in an area of memory where LD and ST instructions have no access (RWE = 000) but ENTER, EXIT, and RET do. This was done in such a way that correct code
runs both with SS=on and SS=off, so the compiler does not have to know.

Only CALL, CALX, RET, ENTER, and EXIT are aware of the existence of SS
and only in HW implementations.

I have harped on you for a while to start development of your compiler.
One of the first things a compiler needs to do is to develop its means
to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, dealing with TRY- THROW-CATCH SW defined exception handling. I KNOW of nobody who does this without some kind of stack.

I happen to use 2 such stacks mostly to harden the environment at low
cost to malicious attack vectors. It comes with benefits: Lines removed
from SS do not migrate to L2 or even DRAM, they can be discarded at
end-of-use, reducing memory traffic; the SW contract between Caller and
Callee is guaranteed even in the face of malicious code; it can be used
as a debug tool to catch malicious code. ...

NOTE: malicious code can still damage data*, just not the preserved regs,
the return address, guaranteeing that control returns to the instruction following CALL. And all without adding a single instruction to the CALL
RET instruction sequence.

(*) memory

So I've probably completely misunderstood you here.

Not the first time ...

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to MitchAlsup on Sat Aug 23 08:51:34 2025

MitchAlsup <[email protected]d> schrieb:

I have harped on you for a while to start development of your compiler.
One of the first things a compiler needs to do is to develop its means
to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, dealing with TRY- THROW-CATCH SW defined exception handling. I KNOW of nobody who does this without some kind of stack.

There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

I think we discussed this for My 66000 some time ago, but there
is no resolution as yet.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Fri Aug 29 17:07:03 2025

There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.

In most cases you have either one of the other but not both. E.g. in
C we don't have nested functions, and in Javascript functions are heap-allocated objects.

Other than GNU C (with its support for nested functions), which other
language has this weird combination of features?

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Waldek Hebisch@21:1/5 to Stefan Monnier on Sat Aug 30 15:36:46 2025

Stefan Monnier <[email protected]> wrote:

There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.

In most cases you have either one of the other but not both. E.g. in
C we don't have nested functions, and in Javascript functions are heap-allocated objects.

Other than GNU C (with its support for nested functions), which other language has this weird combination of features?

Well, more precisely:
- function pointer is supposed to take the same space as a single
machine address
- function pointer is supposed to be directly invokable, that is
point to machine code of the function
- one wants to support nested functions
- there is no garbage collector, one does not want to introduce extra
stack and one does not want to leak storage allocated to nested
functions.

To explain more:
- arguably in "safe" C data pointers should consist
of 3 machine words, such pointer have place for extra data needed
for nested functions.
- some calling conventions introduce extra indirection, that is
function pointer point to a data structure containing address
of machine code and extra data needed by nested functions.
Function call puts extra data in dedicated machine register and
then transfers control via address contained in function data
structure. IIUC IBM AIX uses such approach.
- one could create trampolines in a separate area of memory. In
such case there is trouble with dealocating no longer needed
trampolines. This trouble can be resolved by using GC. Or
by using a parallel stack dedicated to trampolines.

Concerning languages, any language which has nested functions and
wants seamless cooperation with C needs to resolve the problem.
That affects Pascal, Ada, PL/I. That is basicaly most classic
non-C languages. IIUC several "higher level" languages resolve
the trouble by combination of parallel stack and/or GC. But
when language want to compete with efficiency of C and does not
want GC, then trampolines allocated on machine stack may be the
only choice (on register starved machine parallel stack may be
too expensive). AFAIK GNU Ada uses (or used) trampolines
allocated on machine stack.

--
Waldek Hebisch

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Sat Aug 30 14:22:32 2025

Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.

Better skip the redirection and make function pointers take up 2 words
(address of the code plus address of the context/environment/GOT), so
there's no dynamic allocation involved.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Stefan Monnier on Sun Aug 31 05:36:53 2025

Stefan Monnier <[email protected]> writes:

AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.

...

Other than GNU C (with its support for nested functions), which other >language has this weird combination of features?

Forth has single-cell (cell=machine word) execution tokens (~function pointers). In contrast to C, where in theory you could have bigger
function pointers and only ABIs and legacy code encourage function
pointers that fit in a single machine word, in a Forth you would have
to make all cells bigger (i.e., all integers and all addresses) if you
want more space for xts.

Gforth adds closures, that of course have to be represented by
single-cell execution tokens that behave like other execution tokens.
But Gforth has the advantage that xts are not represented by addresses
of machine code, and instead there is one indirection level between
the xt and the machine code. The reason for that is that xts not only represent colon definitions (~C functions), but also variables,
constants, and words defined with create...does> (somewhat
closure-like, but "statically" allocated). So Gforth implements
closures using an extension of that mechanism, see Section 4 of [ertl&paysan18].

However, there are Forth systems that implement xts as addresses of
machine code, and if they implemented closures, they would need to use
run-time code generation.

- anton

@InProceedings{ertl&paysan18,
author = {M. Anton Ertl and Bernd Paysan},
title = {Closures --- the {Forth} way},
crossref = {euroforth18},
pages = {17--30},
url = {https://www.complang.tuwien.ac.at/papers/ertl%26paysan.pdf},
url2 = {http://www.euroforth.org/ef18/papers/ertl.pdf},
slides-url = {http://www.euroforth.org/ef18/papers/ertl-slides.pdf},
video = {https://wiki.forth-ev.de/doku.php/events:ef2018:closures},
OPTnote = {refereed},
abstract = {In Forth 200x, a quotation cannot access a local
defined outside it, and therefore cannot be
parameterized in the definition that produces its
execution token. We present Forth closures; they
lift this restriction with minimal implementation
complexity. They are based on passing parameters on
the stack when producing the execution token. The
programmer has to explicitly manage the memory of
the closure. We show a number of usage examples.
We also present the current implementation, which
takes 109~source lines of code (including some extra
features). The programmer can mechanically convert
lexical scoping (accessing a local defined outside)
into code using our closures, by applying assignment
conversion and flat-closure conversion. The result
can do everything one expects from closures,
including passing Knuth's man-or-boy test and living
beyond the end of their enclosing definitions.}
}

@Proceedings{euroforth18,
title = {34th EuroForth Conference},
booktitle = {34th EuroForth Conference},
year = {2018},
key = {EuroForth'18},
url = {http://www.euroforth.org/ef18/papers/proceedings.pdf}
}
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to All on Sun Aug 31 16:21:38 2025

BGB <[email protected]> posted:

On 8/30/2025 1:22 PM, Stefan Monnier wrote:

Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.

Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so there's no dynamic allocation involved.

FDPIC typically always uses the normal pointer width, just with more indirection:
Load target function pointer from GOT;
Save off current GOT pointer to stack;
Load code pointer from function pointer;
Load GOT pointer from function pointer;
Call function;
Reload previous GOT pointer.

My 66000 can indirect through GOT so the above sequence is::

CALX [ip,,GOT[n]-.]

and references to GOT are like above (functions) or (extern) as::

LDD Rp,[ip,,GOT[n]-.]

Each linked module gets its own GOT.

It, errm, kinda sucks...

Bad ISA makes many things suck--whereas good ISA does not.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet
- Centurion
  Mon Jul 27 22:50:42 2026
  from Berea, Ohio via Telnet
- Ataricrypt
  Mon Jul 27 19:19:17 2026
  from England via Telnet
- Bob Worm
  Mon Jul 27 15:19:55 2026
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	51:55:40
Calls:	12,445
Calls today:	5
Files:	15,192
Messages:	6,537,256

Re: Register windows

Who's Online

Recent Visitors

System Info