• Re: Register windows

    From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jul 17 16:53:59 2025
    On Thu, 17 Jul 2025 16:20:13 +0000, Stefan Monnier wrote:

    The only good arguments I have heard wrt big architectural register
    files has to do with things like Register-Windows and/or optimizing
    CALL/RET interface.

    But even there, it justifies only additional "second-class registers",
    i.e. where the set of immediately addressable registers can still be the
    same size as usual (e.g. 16 or 32), but you can quickly push some of
    those to some kind of "stack" and then pull them back in.
    IIRC the Mill had actually 2 categories of "second-class registers":
    the stack and the scratch registers.

    I think you can get similar benefits with "cache-line sized" memory operations that load/store several registers at a time (assuming you
    have good enough store-to-load forwarding). Or even fold those
    loads&stores into some kind of CALL/RET instructions, which can let you
    start the control-flow part of the CALL before the stores, and similarly start the loads before the control flow part of the RET is done.

    Yes, that is what we ended up doing in My 66000:
    ENTER saves the preserved registers (and generally Ret Address)
    EXIT restores the preserved registers and transfers control back
    ....to caller

    Both ENTER and EXIT can perform 1-32 reads/writes to sequential
    memory addresses and add/sub to SP and optionally setup FP.

    ENTER and EXIT are within the called subroutine.

    By using ENTER and EXIT, the preserved registers can be written
    to memory the normal LD/ST instructions have no access, so bad
    array indexes cannot harm the call/return contract, eliminating
    ROP attacks.

    I am happier with this than the register windows stuff I had to
    deal with in SPARC.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jul 17 19:38:07 2025
    On Thu, 17 Jul 2025 17:38:35 +0000, Anton Ertl wrote:

    Stefan Monnier <[email protected]> writes:
    The only good arguments I have heard wrt big architectural register
    files has to do with things like Register-Windows and/or optimizing
    CALL/RET interface.

    But even there, it justifies only additional "second-class registers",
    i.e. where the set of immediately addressable registers can still be the >>same size as usual (e.g. 16 or 32), but you can quickly push some of
    those to some kind of "stack" and then pull them back in.

    Not efficiently. You would have to wait until the last instruction
    has written back its result, then make the switch, and only then start reading registers from instructions behind the SAVE/RESTORE
    instruction.

    What you write is INVARIABLY true if SW is the one that has to do
    this work. The previous value has to leave the register before the
    new value arrives to be written.

    It is not true at all if HW is the one doing the work. HW can send out
    cache line reads for the new data, while the (soon to be) previous
    values sit in RF--then--as data arrives, HW can readout the previous
    value while writing in the new values, shipping the previous values
    to memory as they fill cache line buffers.

    So, how does HW remember where the registers were in memory--that my
    friends is called Architecture (not Arch 101, but Arch 315).

    And that is why My 66000 has HW perform context switching--CPU
    continues to run while the front end gets the new data ready--
    then switches to the new (interrupting) thread in "just a few
    cycles" including all the Program Status Line stuff which includes
    Root pointers, priority, privilege, ASID,...

    Oh, and when done this way--when control arrives at dispatcher/handler
    it is already re-entrant (no Interrupt/Exception Disable),...

    Each SAVE and each RESTORE would cost several cycles
    even on an in-order machine. Not what the mechanism was designed for.

    I think you can get similar benefits with "cache-line sized" memory >>operations that load/store several registers at a time (assuming you
    have good enough store-to-load forwarding).

    ARM A64's load pair and store pair instructions.

    Or even fold those
    loads&stores into some kind of CALL/RET instructions, which can let you >>start the control-flow part of the CALL before the stores, and similarly >>start the loads before the control flow part of the RET is done.

    In an OoO machine with correct predictions (the usual case), control
    flow often runs far ahead of functional-unit processing and retirement

    Hundreds of instructions--or 15-20 conditional branches.

    (and only retirement is architectural execution). Any stores on the predicted control flow will be speculatively performed as soon as
    their source data is available,

    Where the word "speculation" places a requirement on the CPU to
    buffer the stored result but no actually update cache/memory
    until the ST is retired. ST-to-LD forwarding CAN happen.

    I could argue that this is not what the word "performed" is generally
    regarded as implying. Instead, I would use the term "conditionally
    stored" as is what we documented in Mc 88120 (1992) and what the
    conditional cache did. While sitting in the CC, ST-to-LD forwarding
    was performed; and after retirement ST.data migrates to the cache/
    memory.

    and the same goes for loads, with (non)aliases being predicted. Plus really modern machines often can
    achieve 0-cycle store-to-load forwarding. All of this makes
    mechanisms like register windows and IA-64's register stack
    unnecessary.

    Along with 0-cycle MOVs, and 0-cycle BC-misprediction recovery.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Thu Jul 17 16:18:47 2025
    Not efficiently. You would have to wait until the last instruction
    has written back its result, then make the switch, and only then start
    reading registers from instructions behind the SAVE/RESTORE
    instruction.

    What you write is INVARIABLY true if SW is the one that has to do
    this work. The previous value has to leave the register before the
    new value arrives to be written.

    With register renaming that is not a requirement any more: the new value
    is sent to another physical register.
    I'm pretty sure you know that, so I wonder what it is that you meant.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Stefan Monnier on Fri Jul 18 18:11:51 2025
    On 2025-07-17 19:20, Stefan Monnier wrote:
    The only good arguments I have heard wrt big architectural register
    files has to do with things like Register-Windows and/or optimizing
    CALL/RET interface.

    But even there, it justifies only additional "second-class registers",
    i.e. where the set of immediately addressable registers can still be the
    same size as usual (e.g. 16 or 32), but you can quickly push some of
    those to some kind of "stack" and then pull them back in.
    IIRC the Mill had actually 2 categories of "second-class registers":
    the stack and the scratch registers.

    As I understood the Mill (from a few years ago, possibly changed since):

    The Mill "belt" (I assume this is what you call the "stack") corresponds
    to the "first-class registers": all computations take operands from the
    belt and push results to the belt. When a function is called, it sees a
    fresh belt with only its "in" arguments; when the function returns, it
    leaves its "out" results pushed on the belt that the caller sees.

    The Mill "scratch" memory is a function-local piece of fast (on-chip)
    RAM, allocated on function entry and deallocated on return, with
    function-local addressing. It is like a register file in that it cannot
    be addressed indirectly, only by static (but function-local) addresses,
    but like a stack frame in being local to a function and having a
    variable size, depending on the function.

    The necessary physical data transfers and renamings are handled by the "spiller" HW component which has internal fast data buffering and uses
    cache or main RAM if the internal buffering overflows.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jul 18 11:29:00 2025
    The Mill "belt" (I assume this is what you call the "stack") corresponds to the "first-class registers": all computations take operands from the belt
    and push results to the belt. When a function is called, it sees a fresh
    belt with only its "in" arguments; when the function returns, it leaves its "out" results pushed on the belt that the caller sees.

    That's right. The belt is the closest that the Mill has to "first-class registers", tho in a sense it addresses elements of the forwarding
    network more than "registers".

    What I meant by "stack" is the place where scratch registers and
    in-flight belt values get pushed/popped when you enter/leave a function,
    which give you a kind of "register window" functionality. IIRC the Mill documents it as living in memory but the moment when stack elements
    actually reach memory was never clearly specified, so I assume the
    idea was that it could be kept in "second class registers" (IIRC
    that was managed by the "spiller" you refer to).

    Of course, in a traditional CPU, the top of the stack is kept in the L1
    cache and only ever touches the higher levels of the memory hierarchy
    when cache pressure is very high or upon context switches, so the L1
    cache plays a similar role. Not sure if a set of "second class
    registers" dedicated to storing the top of the stack (like SPARC has,
    and the Mill seemed to want to have) can be made to be faster and/or
    lower power than the L1 cache to justify the effort.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Stefan Monnier on Fri Jul 18 23:17:23 2025
    On 2025-07-18 18:29, Stefan Monnier wrote:
    The Mill "belt" (I assume this is what you call the "stack") corresponds to >> the "first-class registers": all computations take operands from the belt
    and push results to the belt. When a function is called, it sees a fresh
    belt with only its "in" arguments; when the function returns, it leaves its >> "out" results pushed on the belt that the caller sees.

    That's right. The belt is the closest that the Mill has to "first-class registers", tho in a sense it addresses elements of the forwarding
    network more than "registers".


    Yes. Values on the belt do not stay at fixed addresses ("register
    numbers", "names") but move to new addresses (larger offsets from the
    "top") as computations push more results on the belt. As I understand
    it, the HW implementation is indeed like a renaming/forwarding network, although the "names" (offsets from the "top") are known to the compiler
    because of the static instruction sheduling and known latencies of all operations.


    What I meant by "stack" is the place where scratch registers and
    in-flight belt values get pushed/popped when you enter/leave a function, which give you a kind of "register window" functionality.


    Ah, apologies for my wrong assumption. (But calling that a "stack" risks confusion with the normal SW stack in memory, which is certainly still
    needed in a Mill processor, for example to pass function arguments by reference.)

    I agree that the Mill architecture, with its "new belt for each call"
    and "new scratch-pad for each call" programmer's model, is similar to register-window designs.


    IIRC the Mill
    documents it as living in memory but the moment when stack elements
    actually reach memory was never clearly specified, so I assume the
    idea was that it could be kept in "second class registers" (IIRC
    that was managed by the "spiller" you refer to).


    As I understood it, the HW implementation of the Mill's
    function-specific belt and function-specific scratch-pad is not meant to
    be visible to a normal application program running in a Mill processor.
    A function executing in a normal program cannot access the belt of its
    caller, nor the scratch-pad of its caller.

    For debuggers and other system tools that need such visiblity the intent
    was to provide a dedicated (HW-dependent) interface or "service",
    essentially an interface to the spiller HW. But the structure of the
    spiller and its buffers and memory areas is not part of the visible Mill architecture, rather it is specific to each implementation ("model") of
    a Mill processor core.


    Of course, in a traditional CPU, the top of the stack is kept in the L1
    cache and only ever touches the higher levels of the memory hierarchy
    when cache pressure is very high or upon context switches, so the L1
    cache plays a similar role.


    A similar role from the performance point of view, yes, with the
    difference that data in the L1 cache is accessed via normal load/store instructions, while the spiller data structures would not be accessible
    in that way.

    As I understood it, the spiller would have some private internal buffers
    for your "stack", and would use main memory (via the caches) as backup
    when the private buffers overflowed. Thus the most recently spilled data
    would be either in the spiller's private buffers or in the L1 cache.


    Not sure if a set of "second class
    registers" dedicated to storing the top of the stack (like SPARC has,
    and the Mill seemed to want to have) can be made to be faster and/or
    lower power than the L1 cache to justify the effort.


    I am not a HW guy, but I think the answer is yes. However, speed is not
    the only advanage of such "second class" registers: they can also make
    for a simpler programmer's model where the second-class registers are
    hidden, implicit, and need not (and perhaps cannot) be named in the instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Niklas Holsti on Sun Jul 20 17:28:37 2025
    On Fri, 18 Jul 2025 20:17:23 +0000, Niklas Holsti wrote:

    On 2025-07-18 18:29, Stefan Monnier wrote:
    The Mill "belt" (I assume this is what you call the "stack") corresponds >>> to
    the "first-class registers": all computations take operands from the
    belt
    and push results to the belt. When a function is called, it sees a fresh >>> belt with only its "in" arguments; when the function returns, it leaves
    its
    "out" results pushed on the belt that the caller sees.

    That's right. The belt is the closest that the Mill has to "first-class
    registers", tho in a sense it addresses elements of the forwarding
    network more than "registers".


    Yes. Values on the belt do not stay at fixed addresses ("register
    numbers", "names") but move to new addresses (larger offsets from the
    "top") as computations push more results on the belt. As I understand
    it, the HW implementation is indeed like a renaming/forwarding network, although the "names" (offsets from the "top") are known to the compiler because of the static instruction sheduling and known latencies of all operations.


    What I meant by "stack" is the place where scratch registers and
    in-flight belt values get pushed/popped when you enter/leave a function,
    which give you a kind of "register window" functionality.


    Ah, apologies for my wrong assumption. (But calling that a "stack" risks confusion with the normal SW stack in memory, which is certainly still
    needed in a Mill processor, for example to pass function arguments by reference.)

    I agree that the Mill architecture, with its "new belt for each call"
    and "new scratch-pad for each call" programmer's model, is similar to register-window designs.

    Allows me to disagree: in My 66000 ABI, up to 50% of subroutines
    do not need any preserved registers, and save time by not saving
    and restoring them, many of these do not need stack space (local
    variables) saving even more time. These subroutines can all be
    performed in the temporary registers provided by ABI.

    Here I would disagree with the new belt per subroutine as it
    cost too much (time and energy).

    When a subroutine DOES need saving and restoring of registers,
    and allocation of stack space, this can all be done in a single
    instruction for prologue (ENTER) an d epilogue (EXIT).

    Here I would agree with the new belt per subroutine.

    I do agree with some of what Mill does, including placing the
    preserved registers in memory where they cannot be damaged.
    My 66000 calls this mode of operation "safe stack".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Sun Jul 20 22:27:20 2025
    On Sun, 20 Jul 2025 17:28:37 +0000, [email protected] (MitchAlsup1)
    wrote:

    On Fri, 18 Jul 2025 20:17:23 +0000, Niklas Holsti wrote:

    On 2025-07-18 18:29, Stefan Monnier wrote:
    The Mill "belt" (I assume this is what you call the "stack") corresponds >>>> to
    the "first-class registers": all computations take operands from the
    belt
    and push results to the belt. When a function is called, it sees a fresh >>>> belt with only its "in" arguments; when the function returns, it leaves >>>> its
    "out" results pushed on the belt that the caller sees.

    That's right. The belt is the closest that the Mill has to "first-class >>> registers", tho in a sense it addresses elements of the forwarding
    network more than "registers".


    Yes. Values on the belt do not stay at fixed addresses ("register
    numbers", "names") but move to new addresses (larger offsets from the
    "top") as computations push more results on the belt. As I understand
    it, the HW implementation is indeed like a renaming/forwarding network,
    although the "names" (offsets from the "top") are known to the compiler
    because of the static instruction sheduling and known latencies of all
    operations.


    What I meant by "stack" is the place where scratch registers and
    in-flight belt values get pushed/popped when you enter/leave a function, >>> which give you a kind of "register window" functionality.


    Ah, apologies for my wrong assumption. (But calling that a "stack" risks
    confusion with the normal SW stack in memory, which is certainly still
    needed in a Mill processor, for example to pass function arguments by
    reference.)

    I agree that the Mill architecture, with its "new belt for each call"
    and "new scratch-pad for each call" programmer's model, is similar to
    register-window designs.

    Allows me to disagree: in My 66000 ABI, up to 50% of subroutines
    do not need any preserved registers, and save time by not saving
    and restoring them, many of these do not need stack space (local
    variables) saving even more time. These subroutines can all be
    performed in the temporary registers provided by ABI.

    Here I would disagree with the new belt per subroutine as it
    cost too much (time and energy).

    [Take with salt. Hope I got this right ... it is how I have understood
    it based on Ivan's presentations. Details might be incorrect or may
    have been misunderstood.]


    Saving and restoring the Mill's belt costs between nothing (best case)
    and very little. The belt is the CPU's bypass network, and many of
    its elements are simply data sitting in the various FU output latches.

    Every possible location has a unique name, and there is a naming unit
    which keeps track of the location of everything. Every location name
    includes an id for the call frame to which the data belongs.

    When a new function is entered, the frame id changes and so no data on
    the existing belt will match. Presto! belt empty. When the function
    returns, the caller's frame id is restored and data having that frame
    id again becomes visible.

    If/when FUs need to overwrite existing data in their output latches,
    there is a "spiller" unit which moves the data to a new location
    within its own internal memory. Nothing ever gets moved back - the
    spiller's memory is connected directly to the bypass network and the
    spiller will respond if any location name matches something it is
    currently storing.

    The spiller memory, of course, can fill up and then data gets pushed
    out to RAM (and eventually pulled in again as the call stack unwinds).
    If data is not in the spiller when it is needed again, it is treated
    as-if a load miss and the pipeline is stalled. The size of the
    spiller's memory is one of the many configuration parameters used to
    define a new chip.

    The spiller acts like an associative cache limited to existing belt
    data. When a frame is no longer needed, any spiller locations which
    held its data become available for reuse.

    There are, in fact, 2 levels of spiller. Ivan has said that - in his experience - the 1st level spiller typically is able to hold data for
    many frames, so most operations involving it will run at core speed.


    When a subroutine DOES need saving and restoring of registers,
    and allocation of stack space, this can all be done in a single
    instruction for prologue (ENTER) an d epilogue (EXIT).

    Here I would agree with the new belt per subroutine.

    I do agree with some of what Mill does, including placing the
    preserved registers in memory where they cannot be damaged.
    My 66000 calls this mode of operation "safe stack".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to All on Mon Jul 21 12:11:20 2025
    On 2025-07-20 20:28, MitchAlsup1 wrote:
    On Fri, 18 Jul 2025 20:17:23 +0000, Niklas Holsti wrote:

    On 2025-07-18 18:29, Stefan Monnier wrote:

    [ snip ]

    I agree that the Mill architecture, with its "new belt for each call"
    and "new scratch-pad for each call" programmer's model, is similar to
    register-window designs.

    Allows me to disagree:

    I'm not sure what exactly you disagree with: with the statement that the
    Mill belt is "similar" to register-window designs? or with the Mill's
    belt design being a good idea in general? From your later text it seems
    to be the latter, which I did not claim.

    in My 66000 ABI, up to 50% of subroutines
    do not need any preserved registers, and save time by not saving
    and restoring them, many of these do not need stack space (local
    variables) saving even more time. These subroutines can all be
    performed in the temporary registers provided by ABI.

    Good, but what is the conclusion? That the My 66000 is better than register-window machines, or better than the Mill, or something else?

    The only register-window machine I am familiar with is the SPARC. The
    SPARC CALL instruction does not by itself open a new register window,
    and the SPARC compilers I have used can generate code for many
    subroutines without using the SAVE/RESTORE instructions that do invoke
    the register windowing, just as you do on the My 66000. However, the
    SPARC compilers, with limited global optimization, did this only for
    leaf subroutines that do not call any other subroutines.

    Here I would disagree with the new belt per subroutine as it
    cost too much (time and energy).

    As George Neuner explained in his reply, the cost is usually zero or
    small. The same holds for the SPARC SAVE instruction, depending on HW
    resources and SW structure (also true for the Mill case).

    When a subroutine DOES need saving and restoring of registers,
    and allocation of stack space, this can all be done in a single
    instruction for prologue (ENTER) and epilogue (EXIT).

    That seems quite similar to the SPARC case, where use of SAVE/RESTORE is optional and done only when either a new register window is needed or a
    stack frame is needed, with SAVE ~ ENTER and RESTORE ~ EXIT. However,
    I'm sure the details of the instructions and their HW implementations
    differ, and so may the performance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Mon Jul 21 15:42:21 2025
    On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:

    I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
    My 66000 calls this mode of operation "safe stack".

    This sounds like an idea worth stealing, although no doubt the way I
    would attempt to copy it would be a failure which removed all the
    usefulness of it.

    For one thing, I don't have a stack for calling subroutines, or any other purpose.

    But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.

    The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
    of anywhere else to save them. (There might be multiple sets of registers,
    for things like SMT, but *not* for user vs supervisor or anything like
    that.)

    So I've probably completely misunderstood you here.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Thu Aug 21 21:48:03 2025
    John Savard <[email protected]d> posted:

    On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:

    I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
    My 66000 calls this mode of operation "safe stack".

    This sounds like an idea worth stealing, although no doubt the way I
    would attempt to copy it would be a failure which removed all the
    usefulness of it.

    For one thing, I don't have a stack for calling subroutines, or any other purpose.

    But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.

    The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
    of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
    that.)

    In reverse order:
    If TI 9900 used their registers like a write-back cache, then typical access would be fast and efficient. When the register pointer is altered, the old
    file is written en-massé and a new file is read in en-massé {possibly with some buffering to lessen visible cycle count} ... but I digress.

    {Conceptually}
    My 66000 uses this concept for its GPRs and for its Thread State but only at context switch time, not for subroutine calls and returns. HW saves and restores Thread State and Registers on context switches so that the CPU
    never has to Disable Interrupts (it can, it just doesn't have to). {/Conceptually}
    I bracketed the above with 'Conceptually' because it is completely
    reasonable to envision a Core big enough to have 4 copies of Thread
    State and Register files, and bank switch between them. The important properties are that the switch delivers control reentrantly, HOW any
    given implementation does that is NOT architecture--that it does IS architecture.

    I specifically left how many registers are preserved to SW per CALL
    because up to 50% need 0, and few % require more than 4. This appears
    to indicate that SPARC using 8 was overkill ... but I digress again.

    Safe Stack is a stack used for preserving the ABI contract between caller
    and callee even in the face of buffer overruns, RoP, and other malicious program behavior. SS places the return address and the preserved registers
    in an area of memory where LD and ST instructions have no access (RWE = 000) but ENTER, EXIT, and RET do. This was done in such a way that correct code
    runs both with SS=on and SS=off, so the compiler does not have to know.

    Only CALL, CALX, RET, ENTER, and EXIT are aware of the existence of SS
    and only in HW implementations.

    I have harped on you for a while to start development of your compiler.
    One of the first things a compiler needs to do is to develop its means
    to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, dealing with TRY- THROW-CATCH SW defined exception handling. I KNOW of nobody who does this without some kind of stack.

    I happen to use 2 such stacks mostly to harden the environment at low
    cost to malicious attack vectors. It comes with benefits: Lines removed
    from SS do not migrate to L2 or even DRAM, they can be discarded at
    end-of-use, reducing memory traffic; the SW contract between Caller and
    Callee is guaranteed even in the face of malicious code; it can be used
    as a debug tool to catch malicious code. ...

    NOTE: malicious code can still damage data*, just not the preserved regs,
    the return address, guaranteeing that control returns to the instruction following CALL. And all without adding a single instruction to the CALL
    RET instruction sequence.

    (*) memory

    So I've probably completely misunderstood you here.

    Not the first time ...

    John Savard


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sat Aug 23 08:51:34 2025
    MitchAlsup <[email protected]d> schrieb:

    I have harped on you for a while to start development of your compiler.
    One of the first things a compiler needs to do is to develop its means
    to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, dealing with TRY- THROW-CATCH SW defined exception handling. I KNOW of nobody who does this without some kind of stack.

    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    I think we discussed this for My 66000 some time ago, but there
    is no resolution as yet.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Aug 29 17:07:03 2025
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one of the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other
    language has this weird combination of features?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Stefan Monnier on Sat Aug 30 15:36:46 2025
    Stefan Monnier <[email protected]> wrote:
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one of the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other language has this weird combination of features?

    Well, more precisely:
    - function pointer is supposed to take the same space as a single
    machine address
    - function pointer is supposed to be directly invokable, that is
    point to machine code of the function
    - one wants to support nested functions
    - there is no garbage collector, one does not want to introduce extra
    stack and one does not want to leak storage allocated to nested
    functions.

    To explain more:
    - arguably in "safe" C data pointers should consist
    of 3 machine words, such pointer have place for extra data needed
    for nested functions.
    - some calling conventions introduce extra indirection, that is
    function pointer point to a data structure containing address
    of machine code and extra data needed by nested functions.
    Function call puts extra data in dedicated machine register and
    then transfers control via address contained in function data
    structure. IIUC IBM AIX uses such approach.
    - one could create trampolines in a separate area of memory. In
    such case there is trouble with dealocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.

    Concerning languages, any language which has nested functions and
    wants seamless cooperation with C needs to resolve the problem.
    That affects Pascal, Ada, PL/I. That is basicaly most classic
    non-C languages. IIUC several "higher level" languages resolve
    the trouble by combination of parallel stack and/or GC. But
    when language want to compete with efficiency of C and does not
    want GC, then trampolines allocated on machine stack may be the
    only choice (on register starved machine parallel stack may be
    too expensive). AFAIK GNU Ada uses (or used) trampolines
    allocated on machine stack.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Aug 30 14:22:32 2025
    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words
    (address of the code plus address of the context/environment/GOT), so
    there's no dynamic allocation involved.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Sun Aug 31 05:36:53 2025
    Stefan Monnier <[email protected]> writes:
    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.
    ...
    Other than GNU C (with its support for nested functions), which other >language has this weird combination of features?

    Forth has single-cell (cell=machine word) execution tokens (~function pointers). In contrast to C, where in theory you could have bigger
    function pointers and only ABIs and legacy code encourage function
    pointers that fit in a single machine word, in a Forth you would have
    to make all cells bigger (i.e., all integers and all addresses) if you
    want more space for xts.

    Gforth adds closures, that of course have to be represented by
    single-cell execution tokens that behave like other execution tokens.
    But Gforth has the advantage that xts are not represented by addresses
    of machine code, and instead there is one indirection level between
    the xt and the machine code. The reason for that is that xts not only represent colon definitions (~C functions), but also variables,
    constants, and words defined with create...does> (somewhat
    closure-like, but "statically" allocated). So Gforth implements
    closures using an extension of that mechanism, see Section 4 of [ertl&paysan18].

    However, there are Forth systems that implement xts as addresses of
    machine code, and if they implemented closures, they would need to use
    run-time code generation.

    - anton

    @InProceedings{ertl&paysan18,
    author = {M. Anton Ertl and Bernd Paysan},
    title = {Closures --- the {Forth} way},
    crossref = {euroforth18},
    pages = {17--30},
    url = {https://www.complang.tuwien.ac.at/papers/ertl%26paysan.pdf},
    url2 = {http://www.euroforth.org/ef18/papers/ertl.pdf},
    slides-url = {http://www.euroforth.org/ef18/papers/ertl-slides.pdf},
    video = {https://wiki.forth-ev.de/doku.php/events:ef2018:closures},
    OPTnote = {refereed},
    abstract = {In Forth 200x, a quotation cannot access a local
    defined outside it, and therefore cannot be
    parameterized in the definition that produces its
    execution token. We present Forth closures; they
    lift this restriction with minimal implementation
    complexity. They are based on passing parameters on
    the stack when producing the execution token. The
    programmer has to explicitly manage the memory of
    the closure. We show a number of usage examples.
    We also present the current implementation, which
    takes 109~source lines of code (including some extra
    features). The programmer can mechanically convert
    lexical scoping (accessing a local defined outside)
    into code using our closures, by applying assignment
    conversion and flat-closure conversion. The result
    can do everything one expects from closures,
    including passing Knuth's man-or-boy test and living
    beyond the end of their enclosing definitions.}
    }

    @Proceedings{euroforth18,
    title = {34th EuroForth Conference},
    booktitle = {34th EuroForth Conference},
    year = {2018},
    key = {EuroForth'18},
    url = {http://www.euroforth.org/ef18/papers/proceedings.pdf}
    }
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Sun Aug 31 16:21:38 2025
    BGB <[email protected]> posted:

    On 8/30/2025 1:22 PM, Stefan Monnier wrote:
    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so there's no dynamic allocation involved.


    FDPIC typically always uses the normal pointer width, just with more indirection:
    Load target function pointer from GOT;
    Save off current GOT pointer to stack;
    Load code pointer from function pointer;
    Load GOT pointer from function pointer;
    Call function;
    Reload previous GOT pointer.

    My 66000 can indirect through GOT so the above sequence is::

    CALX [ip,,GOT[n]-.]

    and references to GOT are like above (functions) or (extern) as::

    LDD Rp,[ip,,GOT[n]-.]

    Each linked module gets its own GOT.

    It, errm, kinda sucks...

    Bad ISA makes many things suck--whereas good ISA does not.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)