• Re: register sets

    From Stephen Fuld@21:1/5 to Robert Finch on Wed Apr 16 23:26:57 2025
    On 4/16/2025 8:42 PM, Robert Finch wrote:
    Working on the Qupls3/StarkCPU core it looks like there will be enough resources to support two sets of registers. The extra set of registers
    comes for free for the register file as the BRAMs can support them. The
    only increase is in the RAT. The issue I have to trade-off on now is
    which of the four operating modes gets its own set of registers while
    the other three share a set. However, the first eight registers will be shared between all modes so that arguments can be passed between them.
    The ARM does this. My thought is that the application / user mode gets
    its own register set, and the rest of the system shares the other set.
    That way there is no need to save and restore the app registers when
    calling the system.

    Another thought is to not include float registers for anything other
    than apps. It would save 32 regs per mode, possibly allowing three
    register sets to be provided.
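The banking scheme described above can be sketched as a mapping from (mode, architectural register) to a row index. This is only an illustration: the mode numbering, the figure of 32 integer registers per set, and the function names are assumptions, not details of Qupls3/StarkCPU. In a renamed core the index would more likely select a RAT entry than a physical register, since the post says the RAT is where the cost lands.

```python
NUM_ARCH_REGS = 32   # assumed architectural integer registers per set
NUM_SHARED = 8       # first eight registers are common to all modes (from the post)
USER_MODE = 0        # hypothetical numbering: 0 = app/user, 1..3 = system modes


def bank_for_mode(mode: int) -> int:
    """User mode gets bank 0; the three system modes share bank 1."""
    if not 0 <= mode <= 3:
        raise ValueError("mode must be 0..3")
    return 0 if mode == USER_MODE else 1


def regfile_index(mode: int, arch_reg: int) -> int:
    """Map (mode, architectural register) to a row of the banked file."""
    if not 0 <= arch_reg < NUM_ARCH_REGS:
        raise ValueError("arch_reg out of range")
    if arch_reg < NUM_SHARED:
        return arch_reg                      # rows 0..7 are visible in every mode
    private = NUM_ARCH_REGS - NUM_SHARED     # 24 private registers per bank
    return NUM_SHARED + bank_for_mode(mode) * private + (arch_reg - NUM_SHARED)
```

Under these assumptions the two sets need 8 + 2 * 24 = 56 rows, and a system call can pass arguments in registers 0..7 without any copying because those rows are the same in every mode.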


    Not to mention speeding up context switches as you don't need to
    save/restore the FP registers for those levels that don't have them, and
    if only one level does have them, no need to save them if the switch is
    to a level that doesn't have them, as they then can't be clobbered.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Thu Apr 17 13:35:36 2025
    Stephen Fuld <[email protected]d> writes:
    On 4/16/2025 8:42 PM, Robert Finch wrote:
    Working on the Qupls3/StarkCPU core it looks like there will be enough
    resources to support two sets of registers. The extra set of registers
    comes for free for the register file as the BRAMs can support them. The
    only increase is in the RAT. The issue I have to trade-off on now is
    which of the four operating modes gets its own set of registers while
    the other three share a set. However, the first eight registers will be
    shared between all modes so that arguments can be passed between them.
    The ARM does this. My thought is that the application / user mode gets
    its own register set, and the rest of the system shares the other set.
    That way there is no need to save and restore the app registers when
    calling the system.

    Another thought is to not include float registers for anything other
    than apps. It would save 32 regs per mode, possibly allowing three
    register sets to be provided.


    Not to mention speeding up context switches as you don't need to
    save/restore the FP registers for those levels that don't have them, and
    if only one level does have them, no need to save them if the switch is
    to a level that doesn't have them, as they then can't be clobbered.

    Many modern CPUs including intel/amd have mechanisms that the OS
    can use to determine if floating point registers have been used
    since the user process was dispatched, including a trap to the
    OS on the first floating point use. This allows them to avoid
    saving and restoring the FP registers during context switches.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Apr 17 18:26:38 2025
    On Thu, 17 Apr 2025 13:35:36 +0000, Scott Lurndal wrote:

    Stephen Fuld <[email protected]d> writes:
    On 4/16/2025 8:42 PM, Robert Finch wrote:
    Working on the Qupls3/StarkCPU core it looks like there will be enough
    resources to support two sets of registers. The extra set of registers
    comes for free for the register file as the BRAMs can support them. The
    only increase is in the RAT. The issue I have to trade-off on now is
    which of the four operating modes gets its own set of registers while
    the other three share a set. However, the first eight registers will be
    shared between all modes so that arguments can be passed between them.
    The ARM does this. My thought is that the application / user mode gets
    its own register set, and the rest of the system shares the other set.
    That way there is no need to save and restore the app registers when
    calling the system.

    Another thought is to not include float registers for anything other
    than apps. It would save 32 regs per mode, possibly allowing three
    register sets to be provided.


    Not to mention speeding up context switches as you don't need to
    save/restore the FP registers for those levels that don't have them, and
    if only one level does have them, no need to save them if the switch is
    to a level that doesn't have them, as they then can't be clobbered.

    Many modern CPUs including intel/amd have mechanisms that the OS
    can use to determine if floating point registers have been used
    since the user process was dispatched, including a trap to the
    OS on the first floating point use. This allows them to avoid
    saving and restoring the FP registers during context switches.

    The converse point is that the OS may want to use vector instructions
    to move data around (Disk-cache to User-buffer) and thus have to have
    access to those registers anyway.

    But, yes, if you have 14 different kinds of register files, you need
    something close to 13 of them under flags of control to thin the
    work at context switch time.

  • From MitchAlsup1@21:1/5 to Robert Finch on Fri Apr 18 17:12:46 2025
    On Fri, 18 Apr 2025 1:56:24 +0000, Robert Finch wrote:

    On 2025-04-17 2:26 p.m., MitchAlsup1 wrote:
    On Thu, 17 Apr 2025 13:35:36 +0000, Scott Lurndal wrote:

    Stephen Fuld <[email protected]d> writes:
    On 4/16/2025 8:42 PM, Robert Finch wrote:
    Working on the Qupls3/StarkCPU core it looks like there will be enough
    resources to support two sets of registers. The extra set of registers
    comes for free for the register file as the BRAMs can support them. The
    only increase is in the RAT. The issue I have to trade-off on now is
    which of the four operating modes gets its own set of registers while
    the other three share a set. However, the first eight registers will be
    shared between all modes so that arguments can be passed between them.
    The ARM does this. My thought is that the application / user mode gets
    its own register set, and the rest of the system shares the other set.
    That way there is no need to save and restore the app registers when
    calling the system.

    Another thought is to not include float registers for anything other
    than apps. It would save 32 regs per mode, possibly allowing three
    register sets to be provided.


    Not to mention speeding up context switches as you don't need to
    save/restore the FP registers for those levels that don't have them, and
    if only one level does have them, no need to save them if the switch is
    to a level that doesn't have them, as they then can't be clobbered.

    I have several examples where My 66000 with only 32-GPRs compiles
    to fewer instructions than RISC-V with 32+32 registers.

    I have other examples where My 66000 does not need spill/fill code
    and RISC-V does, too.

    Many modern CPUs including intel/amd have mechanisms that the OS
    can use to determine if floating point registers have been used
    since the user process was dispatched, including a trap to the
    OS on the first floating point use.  This allows them to avoid
    saving and restoring the FP registers during context switches.

    The converse point is that the OS may want to use vector instructions
    to move data around (Disk-cache to User-buffer) and thus have to have
    access to those registers anyway.

    But, yes, if you have 14 different kinds of register files, you need
    something close to 13 of them under flags of control to thin the
    work at context switch time.

    I like having the extra register files, it is just a personal
    programming convenience. It reduces the pressure on the general-purpose register file.

    Universal constants also reduce pressure, both on the GPRs and on
    the number of instructions executed.

    It is also a matter of encoding the register selection in
    a 32-bit instruction. I did not want to waste more than five bits on the register selection. It is really a 52?-register machine, but the
    instruction encodings do not allow all registers for all instructions.
    One can use the move instruction to swap any registers.

    But it makes the compiler more complex as it has to deal with different architectural register files. I suppose one could code the compiler for
    a machine with a flat 96 registers by surrounding operations with move instructions. It is code bloat but I do not know if it would affect performance.
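To get a feel for the code bloat in question, here is a rough sketch of lowering an operation on flat register numbers when only one file is directly encodable. Everything specific here is hypothetical: three files of 32 registers, the rule that an ALU op can only name file 0, the scratch registers r29..r31, and the `x<n>` syntax for a flat register number. It is not the Qupls3 encoding.

```python
FILE_SIZE = 32            # registers reachable through one 5-bit field
NUM_FILES = 3             # flat 96-register view (assumed: three files of 32)
SCRATCH = (29, 30, 31)    # hypothetical scratch registers reserved in file 0


def lower(op, dst, src1, src2):
    """Lower one three-operand op on flat registers into encodable instructions."""
    for r in (dst, src1, src2):
        if not 0 <= r < NUM_FILES * FILE_SIZE:
            raise ValueError("flat register out of range")
    out = []
    srcs = []
    for scratch, src in zip(SCRATCH, (src1, src2)):
        if src < FILE_SIZE:
            srcs.append(src)                         # already encodable
        else:
            out.append(f"move r{scratch}, x{src}")   # bring operand into file 0
            srcs.append(scratch)
    d = dst if dst < FILE_SIZE else SCRATCH[2]
    out.append(f"{op} r{d}, r{srcs[0]}, r{srcs[1]}")
    if dst >= FILE_SIZE:
        out.append(f"move x{dst}, r{d}")             # copy the result back out
    return out
```

In this model the worst case turns one instruction into four (two moves in, the op, one move out). Whether that hurts performance would depend on how cheaply the core handles moves, for example if they can be eliminated at rename, which the sketch does not model.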

    Thinking of including a register exchange instruction.

    Changing the capital R in RISC into a lower case r ?!?

  • From John Savard@21:1/5 to Scott Lurndal on Tue Jul 15 04:56:53 2025
    On Thu, 17 Apr 2025 13:35:36 +0000, Scott Lurndal wrote:

    Many modern CPUs including intel/amd have mechanisms that the OS can use
    to determine if floating point registers have been used since the user process was dispatched, including a trap to the OS on the first floating point use. This allows them to avoid saving and restoring the FP
    registers during context switches.

    That certainly is helpful. But if a user program happens not to have
    done any floating-point computations during a particular time-slice,
    but did perform some earlier which it will later continue, it could be
    that while the FP registers don't need to be saved, they will still
    need to be restored on returning to that task from the old copy.

    So a mechanism like that still has to be used carefully, and it
    doesn't just make the whole problem disappear.

    John Savard

  • From John Savard@21:1/5 to Robert Finch on Tue Jul 15 04:49:55 2025
    On Wed, 16 Apr 2025 23:42:20 -0400, Robert Finch wrote:

    Another thought is to not include float registers for anything other
    than apps. It would save 32 regs per mode, possibly allowing three
    register sets to be provided.

    I had to re-read this to understand it.

    Some architectures in the past did speed up context-switching between
    user programs and the operating system by giving the operating system
    its own set of registers.

    And the operating system doesn't need to do floating-point math, only
    user programs that do calculations. So what could go wrong with this
    great idea?

    The first thing that comes to mind is that even a computer with a GUI,
    not merely a time-sharing mainframe, may have multiple user programs
    running at once. So when a real-time clock interrupt hits, and it's
    time to rotate from one compute-bound user program to another, those floating-point registers *also* need to be saved to memory.

    So you will probably need an outer ring of the OS that doesn't use
    the special OS-specific set of registers.

    John Savard

  • From Scott Lurndal@21:1/5 to John Savard on Tue Jul 15 14:10:13 2025
    John Savard <[email protected]d> writes:
    On Wed, 16 Apr 2025 23:42:20 -0400, Robert Finch wrote:

    Another thought is to not include float registers for anything other
    than apps. It would save 32 regs per mode, possibly allowing three
    register sets to be provided.

    I had to re-read this to understand it.

    Some architectures in the past did speed up context-switching between
    user programs and the operating system by giving the operating system
    its own set of registers.

    And the operating system doesn't need to do floating-point math, only
    user programs that do calculations. So what could go wrong with this
    great idea?

    The first thing that comes to mind is that even a computer with a GUI,
    not merely a time-sharing mainframe, may have multiple user programs
    running at once. So when a real-time clock interrupt hits, and it's
    time to rotate from one compute-bound user program to another, those
    floating-point registers *also* need to be saved to memory.

    A problem that was solved decades ago. Disable FP when a new task/thread/process is dispatched; trap the FP accesses, enable FP
    and set OS state indicating that the FP registers are in use; subsequent accesses won't be trapped. The FP registers only need to be saved
    if they were actually used by the current thread.

    The kernel doesn't use FP at all, so they don't need to be saved
    on all kernel entries.
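The disable/trap/enable sequence described above can be modeled in a few lines. This is a toy simulation, not any real kernel's code: the class and method names are invented, and the choice to reload the task's saved copy inside the trap handler is an assumption about where the restore happens.

```python
NUM_FP_REGS = 32


class Task:
    def __init__(self, name):
        self.name = name
        self.fp_save_area = [0.0] * NUM_FP_REGS  # in-memory copy of FP state
        self.fp_in_use = False                   # OS flag: FP touched since dispatch


class Cpu:
    def __init__(self):
        self.fp_regs = [0.0] * NUM_FP_REGS
        self.fp_enabled = False
        self.current = None
        self.traps = self.saves = self.restores = 0

    def dispatch(self, task):
        prev = self.current
        if prev is not None and prev.fp_in_use:
            prev.fp_save_area = list(self.fp_regs)   # save only if actually used
            self.saves += 1
        self.current = task
        task.fp_in_use = False
        self.fp_enabled = False                      # FP disabled on every dispatch

    def _fp_trap(self):
        # First FP access since dispatch: reload state, enable FP, note the use.
        self.traps += 1
        self.fp_regs = list(self.current.fp_save_area)
        self.restores += 1
        self.current.fp_in_use = True
        self.fp_enabled = True

    def fp_write(self, reg, value):
        if not self.fp_enabled:
            self._fp_trap()
        self.fp_regs[reg] = value

    def fp_read(self, reg):
        if not self.fp_enabled:
            self._fp_trap()
        return self.fp_regs[reg]
```

In this model a task that never touches FP costs no save and no restore, and a task that does use FP pays one trap per dispatch, after which its accesses run untrapped.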

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jul 15 17:14:20 2025
    On Tue, 15 Jul 2025 4:49:55 +0000, John Savard wrote:

    On Wed, 16 Apr 2025 23:42:20 -0400, Robert Finch wrote:

    Another thought is to not include float registers for anything other
    than apps. It would save 32 regs per mode, possibly allowing three
    register sets to be provided.

    I had to re-read this to understand it.

    Some architectures in the past did speed up context-switching between
    user programs and the operating system by giving the operating system
    its own set of registers.

    And the operating system doesn't need to do floating-point math, only
    user programs that do calculations. So what could go wrong with this
    great idea?

    OSs often need to move big chunks of memory (disk cache -> application)
    and thus need the registers that are most powerful for doing moves,
    that is, the SIMD registers (at least a few of them).

    The first thing that comes to mind is that even a computer with a GUI,
    not merely a time-sharing mainframe, may have multiple user programs
    running at once. So when a real-time clock interrupt hits, and it's
    time to rotate from one compute-bound user program to another, those floating-point registers *also* need to be saved to memory.

    So you will probably need an outer ring of the OS that doesn't use
    the special OS-specific set of registers.

    Each GuestOS gets to decide for himself when to rotate the work queues.
    The higher-ups just decide on which RTIs go to which GuestOSs.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jul 15 17:16:55 2025
    On Tue, 15 Jul 2025 4:56:53 +0000, John Savard wrote:

    On Thu, 17 Apr 2025 13:35:36 +0000, Scott Lurndal wrote:

    Many modern CPUs including intel/amd have mechanisms that the OS can use
    to determine if floating point registers have been used since the user
    process was dispatched, including a trap to the OS on the first floating
    point use. This allows them to avoid saving and restoring the FP
    registers during context switches.

    That certainly is helpful. But if a user program happens not to have
    done any floating-point computations during a particular time-slice,
    but did perform some earlier which it will later continue, it could be
    that while the FP registers don't need to be saved, they will still
    need to be restored on returning to that task from the old copy.

    John, you misunderstand, the HW flag bits keep track of whether
    the FP or SIMD registers have been used "THIS TIME SLICE".

    So a mechanism like that still has to be used carefully, and it
    doesn't just make the whole problem disappear.

    It almost does once you understand.
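A small model of per-time-slice flag bits may help with the scenario being argued over. It is a sketch under assumptions: the flag is set on writes only, registers are reloaded eagerly at every switch, and the file names and sizes are made up. Real hardware may set the flag on any use and may combine it with the trap scheme from earlier in the thread.

```python
FILES = ("fp", "simd")   # hypothetical optional register files
REGS_PER_FILE = 4        # kept small for brevity


class Task:
    def __init__(self):
        # Memory copy of each register file; stays valid until overwritten.
        self.saved = {f: [0] * REGS_PER_FILE for f in FILES}


class Core:
    def __init__(self):
        self.regs = {f: [0] * REGS_PER_FILE for f in FILES}
        self.dirty = set()    # flag bits: files written during THIS time slice
        self.task = None
        self.saves = 0

    def write(self, file, idx, value):
        self.regs[file][idx] = value
        self.dirty.add(file)            # hardware sets the flag as a side effect

    def switch_to(self, nxt):
        if self.task is not None:
            for f in self.dirty:        # save only what this slice modified
                self.task.saved[f] = list(self.regs[f])
                self.saves += 1
        for f in FILES:                 # reload from the still-valid memory copy
            self.regs[f] = list(nxt.saved[f])
        self.dirty.clear()              # flags start clean for the new slice
        self.task = nxt
```

A task that wrote FP registers in one slice and does no FP work in the next is saved once, at the end of the first slice. Later slices skip the save because the memory copy is still current, and the reload brings the old values back, which is the case John raised.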

    John Savard

  • From Anton Ertl@21:1/5 to Robert Finch on Sat Jul 19 16:37:15 2025
    Robert Finch <[email protected]> writes:
    I am wondering if there is a way to tag each register individually with
    the process id (app id) and involve the register renamer such that there
    is no need to save / restore the register context.

    Simultaneous multi-threading (SMT) tags each microinstruction and
    instruction with a (hardware) thread id; two hardware threads may
    belong to different processes. How registers are handled probably
    depends on the actual implementation, not sure if any implementation
    tags it, but they all can tell which physical register belongs to
    which hardware thread (if any). And as long as you are not using more
    hardware threads than your hardware has available, there is no need to
    switch contexts, and therefore no need to save and restore it.
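One way to picture "they all can tell which physical register belongs to which hardware thread" is a rename table per hardware thread over a shared pool of physical registers. The sketch below is a simplification and not a description of any particular SMT implementation: the sizes and names are invented, and the old mapping is freed immediately instead of at retirement.

```python
class RenamedCore:
    def __init__(self, hw_threads=2, arch_regs=4, phys_regs=16):
        if phys_regs <= hw_threads * arch_regs:
            raise ValueError("need spare physical registers for renaming")
        self.phys = [0] * phys_regs
        free = list(range(phys_regs))
        # One RAT per hardware thread: architectural reg -> physical reg.
        self.rat = [[free.pop(0) for _ in range(arch_regs)]
                    for _ in range(hw_threads)]
        self.free = free

    def write(self, tid, arch_reg, value):
        new = self.free.pop(0)              # allocate a fresh physical register
        old = self.rat[tid][arch_reg]
        self.phys[new] = value
        self.rat[tid][arch_reg] = new
        self.free.append(old)               # simplification: freed at once

    def read(self, tid, arch_reg):
        return self.phys[self.rat[tid][arch_reg]]

    def owner(self, phys_reg):
        """Hardware thread whose RAT maps this physical register, or None."""
        for tid, table in enumerate(self.rat):
            if phys_reg in table:
                return tid
        return None
```

No register carries a tag here. Ownership falls out of which thread's table points at the register, and two threads can hold different values for the same architectural register at the same time with nothing saved or restored.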

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Jul 19 20:02:22 2025
    On Sat, 19 Jul 2025 16:37:15 +0000, Anton Ertl wrote:

    Robert Finch <[email protected]> writes:
    I am wondering if there is a way to tag each register individually with
    the process id (app id) and involve the register renamer such that there
    is no need to save / restore the register context.

    Simultaneous multi-threading (SMT) tags each microinstruction and
    instruction with a (hardware) thread id; two hardware threads may
    belong to different processes. How registers are handled probably
    depends on the actual implementation, not sure if any implementation
    tags it, but they all can tell which physical register belongs to
    which hardware thread (if any). And as long as you are not using more hardware threads than your hardware has available, there is no need to
    switch contexts, and therefore no need to save and restore it.

    Correct, you tag the SMT instruction through the pipeline not the
    registers in the files.

    Also note:: SMT requires that the RM and FPflags be piped through
    the FUs.

    - anton
