• Interrupts on the Concertina II

    From Quadibloc@21:1/5 to All on Wed Jan 17 17:33:19 2024
    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    The Concertina II has a potentially large machine state which most
    programs do not use. There are vector registers, of the huge
    kind found in the Cray I. There are banks of 128 registers to
    supplement the banks of 32 registers.

    One obvious step in addressing this is for programs that don't
    use these registers to run without access to those registers.
    If this is indicated in the PSW, then the interrupt routine will
    know what it needs to save and restore.

    A more elaborate and more automated method is also possible.

    Let us imagine the computer speeds up interrupts by having a
    second bank of registers that interrupt routines use. But two
    register banks aren't enough, as many user programs are running
    concurrently.

    Here is how I envisage the sequence of events in response to
    an interrupt could work:

    1) The computer, at the beginning of an area of memory
    sufficient to hold all the contents of the computer's
    registers, including the PSW and program counter, places
    a _restore status_.

    2) The computer switches to the interrupt register bank,
    and places a pointer to the restore status in one of the
    registers, according to a known convention.

    3) As the interrupt service routine runs, the computer,
    separately and in the background, saves the registers of the
    interrupted program into memory. Once this is complete, the
    _restore status_ value in memory is changed to reflect this.

    4) The restore status value has _two_ uses.

    One is, obviously, that when returning from an interrupt,
    there will be a 'return from interrupt' routine that will
    either just switch register banks, if the registers aren't
    saved yet, or re-fill the registers that are actually in
    use (the restore status also indicating what the complement
    of registers available to the interrupted program was) from
    memory.

    The other is that the restore status can be tested. If the
    main register set isn't saved yet, then it's too soon after
    the interrupt to *switch to another user program* which also
    would use the main register set, but with a different set
    of saved values.

    5) Some other factors complicate this.

    There may be multiple sets of user program registers to
    facilitate SMT.

    The standard practice in an operating system is to leave
    the privileged interrupt service routine as quickly as
    possible, and continue handling the interrupt in an
    unprivileged portion of the operating system. However, the
    return from interrupt instruction is obviously privileged,
    as it allows one to put an arbitrary value from memory into
    the Program Status Word, including one that would place
    the computer into a privileged state after the return.

    That last is not unique to the Concertina II, however. So
    the obvious solution of allowing the kernel to call
    unprivileged subroutines - which terminate in a supervisor
    call rather than a normal return - has been found.
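
    The sequence in steps 1) to 4) can be modelled as a small state machine. The sketch below is only an illustration under stated assumptions: the names RestoreStatus, SaveArea and Core are hypothetical, the background save is collapsed into a single call, and register banks are plain Python lists.

```python
from dataclasses import dataclass, field
from enum import Enum


class RestoreStatus(Enum):
    PENDING = 0  # background save still in progress
    SAVED = 1    # interrupted program's registers are all in memory


@dataclass
class SaveArea:
    # The restore status sits at the beginning of the save area (step 1).
    status: RestoreStatus = RestoreStatus.PENDING
    regs: list = field(default_factory=list)


class Core:
    def __init__(self, main_regs):
        self.main = list(main_regs)  # main (user) register bank
        self.bank = "main"
        self.area = None

    def take_interrupt(self):
        self.area = SaveArea()       # step 1: place the restore status
        self.bank = "interrupt"      # step 2: switch banks, hand over a pointer
        return self.area

    def background_save_done(self):
        # Step 3: modelled as one call; real hardware would do this gradually.
        self.area.regs = list(self.main)
        self.area.status = RestoreStatus.SAVED

    def may_switch_user_program(self):
        # Second use of the restore status (step 4).
        return self.area.status is RestoreStatus.SAVED

    def switch_user_program(self, new_regs):
        if not self.may_switch_user_program():
            raise RuntimeError("main bank not saved yet")
        self.main = list(new_regs)

    def return_from_interrupt(self):
        # First use of the restore status (step 4): reload only if saved.
        if self.area.status is RestoreStatus.SAVED:
            self.main = list(self.area.regs)
        self.bank = "main"
```

    In this model a quick return, taken before the save finishes, costs only a bank switch, while a switch to another user program is refused until the status reads SAVED.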

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Quadibloc on Wed Jan 17 18:02:23 2024
    Quadibloc writes:
    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    Something needs to preserve state, either the hardware or
    the software. Most risc processors lean towards the latter,
    generally for good reason - one may not need to save
    all the state if the interrupt handler only touches part of it.


    The Concertina II has a potentially large machine state which most
    programs do not use. There are vector registers, of the huge
    kind found in the Cray I. There are banks of 128 registers to
    supplement the banks of 32 registers.

    One obvious step in addressing this is for programs that don't
    use these registers to run without access to those registers.
    If this is indicated in the PSW, then the interrupt routine will
    know what it needs to save and restore.

    Just like x86 floating point.


    A more elaborate and more automated method is also possible.

    Let us imagine the computer speeds up interrupts by having a
    second bank of registers that interrupt routines use. But two
    register banks aren't enough, as many user programs are running
    concurrently.

    Here is how I envisage the sequence of events in response to
    an interrupt could work:

    1) The computer, at the beginning of an area of memory
    sufficient to hold all the contents of the computer's
    registers, including the PSW and program counter, places
    a _restore status_.

    Slow DRAM or special SRAMs? The former will add
    considerable latency to an interrupt, the latter costs
    area (on a per-hardware-thread basis) and floorplanning
    issues.

    Best is to save the minimal amount of state in hardware
    and let software deal with the rest, perhaps with
    hints from the hardware (e.g. a bit that indicates
    whether the FPRs were modified since the last context
    switch, etc).
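
    A minimal sketch of that hint-bit idea, assuming a single "FPRs modified" flag supplied by hardware. The function name save_context and the dictionary layout are hypothetical and chosen only for illustration.

```python
def save_context(ctx, int_regs, fp_regs, fp_dirty):
    """Save integer registers always, FPRs only when the hint bit is set.

    Returns the number of registers actually written to the save area.
    """
    ctx["int"] = list(int_regs)
    written = len(ctx["int"])
    if fp_dirty:
        # Hardware says the FPRs changed since the last context switch.
        ctx["fp"] = list(fp_regs)
        written += len(ctx["fp"])
    return written
```

    When the bit is clear, the copy already held in the save area is still valid, so the software skips the FP traffic entirely.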

  • From EricP@21:1/5 to Quadibloc on Wed Jan 17 15:35:56 2024
    Quadibloc wrote:
    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    It needs to save the portion of the machine state overwritten by that interrupt. Often this is a small subset of the whole state because many interrupts only need a few integer registers, maybe 6 or 7.
    Additional integer state will be saved and restored as needed by the call ABI so does not need to be done for every interrupt by the handler prologue.

    This allows the OS to only save and restore the full state on thread
    switch which only happens after something significant occurs that
    changes the highest priority thread on a particular core.
    This occurs much less than the frequency of interrupts.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    The Concertina II has a potentially large machine state which most
    programs do not use. There are vector registers, of the huge
    kind found in the Cray I. There are banks of 128 registers to
    supplement the banks of 32 registers.

    I have trouble imagining what an interrupt handler might use vectors for.

    Some OS's deal with this by specifying that drivers can only use integers. (Graphics drivers get special dispensation but they run in special context.)

    One obvious step in addressing this is for programs that don't
    use these registers to run without access to those registers.
    If this is indicated in the PSW, then the interrupt routine will
    know what it needs to save and restore.

    A more elaborate and more automated method is also possible.

    Let us imagine the computer speeds up interrupts by having a
    second bank of registers that interrupt routines use. But two
    register banks aren't enough, as many user programs are running
    concurrently.

    A consequence of this approach is that the second bank of interrupt
    registers are architecturally visible, which means they tie down
    resources like physical registers for all those interrupt registers
    even when not in use.

    And since there are multiple priority interrupt levels each with a bank,
    you quickly wind up with hundreds of physical registers mostly sitting
    around doing nothing but tied to other contexts.
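
    To put a rough number on that, here is a back-of-the-envelope sketch. The figures of 8 priority levels and 32 registers per bank are assumptions picked for illustration, not taken from any Concertina II description.

```python
def banked_registers(priority_levels, registers_per_bank):
    """Architectural registers reserved for interrupt banks alone."""
    return priority_levels * registers_per_bank


# Assumed figures for illustration only: 8 levels, 32 registers per bank.
reserved = banked_registers(8, 32)
```

    Under these assumptions the interrupt banks alone pin down 256 registers, before counting any of the larger 128-register banks.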

  • From MitchAlsup1@21:1/5 to EricP on Wed Jan 17 22:17:03 2024
    EricP wrote:

    Quadibloc wrote:
    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    It needs to save the portion of the machine state overwritten by that interrupt. Often this is a small subset of the whole state because many interrupts only need a few integer registers, maybe 6 or 7.

    Would an interrupt handler not run fewer instructions if its register
    state was seeded with pointers of use to the device(s) being serviced ??

    Additional integer state will be saved and restored as needed by the call ABI so does not need to be done for every interrupt by the handler prologue.

    This allows the OS to only save and restore the full state on thread
    switch which only happens after something significant occurs that
    changes the highest priority thread on a particular core.

    Doesn't running a SoftIRQ (or DPC) require a full register state ??
    And don't most device interrupts need to SoftIRQ ??
    {{Yes, I see timer interrupts not needing so much of that}}

    This occurs much less than the frequency of interrupts.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    The Concertina II has a potentially large machine state which most
    programs do not use. There are vector registers, of the huge
    kind found in the Cray I. There are banks of 128 registers to
    supplement the banks of 32 registers.

    I have trouble imagining what an interrupt handler might use vectors for.

    Memory to memory move from Disk Cache to User Buffer.
    SoftIRQ might use Vector arithmetic to verify CRC, Encryption, ...

    Some OS's deal with this by specifying that drivers can only use integers. (Graphics drivers get special dispensation but they run in special context.)

  • From MitchAlsup1@21:1/5 to Quadibloc on Wed Jan 17 22:11:46 2024
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    State needs to be saved, whether SW or HW does the save is a free
    variable.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    Also note:: the ABA problem can happen when an interrupt transpires
    in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
    before transferring control to the interrupt handler.

    The Concertina II has a potentially large machine state which most
    programs do not use. There are vector registers, of the huge
    kind found in the Cray I. There are banks of 128 registers to
    supplement the banks of 32 registers.

    In the past, each register set could be guarded by an in-use bit
    and save/restore could be avoided when not in use.

    Then again, My 66000 only has 64 registers, and one reason for this
    is to keep context switch times minimal, and the SW path from one
    context to the next minimal.

    One obvious step in addressing this is for programs that don't
    use these registers to run without access to those registers.

    Or not have them to begin with:: Take Vector Registers:: My 66000
    Virtual Vector Method (VVM) enables HW to vectorize a loop with
    the property that if an interrupt or exception transpires, the
    loop collapses into Scalar mode and the interrupt handler remains
    blissfully unaware; the instruction raising the exception is at
    a precise point. When control returns the rest of one loop runs in
    Scalar mode and when control transfers back to the top of the loop
    the loop is re-vectorized. This costs 2 instructions (VEC and LOOP)
    and 6 bits of state in the PSL.

    If this is indicated in the PSW, then the interrupt routine will
    know what it needs to save and restore.

    I use PSL instead of PSW because the amount of state is a cache line
    not a word, doubleword, or quadword. But space here is at a premium.

    A more elaborate and more automated method is also possible.

    Let us imagine the computer speeds up interrupts by having a
    second bank of registers that interrupt routines use. But two
    register banks aren't enough, as many user programs are running
    concurrently.

    You are ignoring several interesting facts with modern interrupt
    systems.

    a) each GuestOS has its own interrupt table(s) and the Hypervisor
    has its own interrupt tables.

    b) multiprocessing is a given. There are situations where an interrupt
    is sent from a device and a number of cores can respond. The proper
    core is the one running at the lowest priority level, and the core
    that gets there first.

    In My 66000 case, the interrupt is recognized by the core detecting a
    write to the raised interrupt register, going out and fetching the
    interrupt #. This bus transaction can return "here is the interrupt
    #" or it can respond with "someone else already got it". In the latter
    case, the core goes on doing its current thing. In the former case,
    core responds with "I am going to handle this one", and the interrupt controller acknowledges the interrupt.
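
    The claim protocol just described can be sketched as a toy model. Everything named here (InterruptController, fetch, ALREADY_TAKEN) is hypothetical, and the separate "I am going to handle this one" message and its acknowledgement are folded into the fetch for brevity.

```python
ALREADY_TAKEN = None  # stands in for "someone else already got it"


class InterruptController:
    def __init__(self):
        self.raised = []        # interrupt numbers waiting for a core
        self.acknowledged = []  # (core_id, interrupt number) pairs

    def raise_interrupt(self, number):
        self.raised.append(number)

    def fetch(self, core_id):
        """Answer a core's fetch: an interrupt number, or ALREADY_TAKEN."""
        if not self.raised:
            return ALREADY_TAKEN
        number = self.raised.pop(0)
        # The first core to arrive claims it; the controller acknowledges.
        self.acknowledged.append((core_id, number))
        return number
```

    A core that receives ALREADY_TAKEN simply carries on with its current thread, so losing the race costs it one bus transaction.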

    Until the interrupt # returns, the core continues processing whatever
    it was doing, and then the core goes out and fetches the 5 cache lines
    of the interrupt dispatcher thread, as these lines arrive, they displace
    the current thread lines. So the restore happens before the save, the
    save after the reload, and arriving data pushes out current data.

    When control arrives at interrupt dispatcher it has a complete set of
    registers (minus 2 used to tell the dispatcher what interrupt it is
    to handle.) So, the receiving thread has its SP, its FP, its Root
    pointer, its ASID, and for all practical purposes it begins running
    as if the first instruction at the dispatcher saw the 30 registers
    the last instruction left behind the last time this thread ran. Thus,
    it can hold any variety of pointers to various data structures the
    OS deems relevant.

    The interrupt dispatcher is 4 instructions long, the first checks
    that the interrupt is in bounds, the second directs control out of
    the dispatcher if the interrupt is not within bounds, the third transfers
    control to the handler (CALL), and when the handler returns, control
    attempts to return to the interrupted thread.
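
    As a rough Python analogue of that dispatcher (not My 66000 code; the function name, the handler table and the returned strings are all assumptions made for illustration):

```python
def dispatch(interrupt_number, handler_table):
    """Bounds-check the interrupt number, call its handler, then return."""
    # Instructions 1 and 2: check bounds and leave if the check fails.
    if not 0 <= interrupt_number < len(handler_table):
        return "out_of_bounds"
    # Instruction 3: CALL the handler for this interrupt.
    handler_table[interrupt_number]()
    # After the handler returns: attempt to resume the interrupted thread.
    return "return_to_interrupted"
```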

    Many times the handler SoftIRQs (or DPCs) cleanup handlers. The
    SuperVisor Return instruction (SVR) checks if there are scheduled
    threads above the thread we are attempting to return control to,
    and transfers control to them before transferring control to
    the interrupted thread. There is no need for SW to sort this out.
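
    The choice SVR makes can be sketched as follows. This is a guess at the selection rule only: threads are modelled as (priority, name) tuples, a larger number is assumed to mean higher priority, and svr_next is a hypothetical name.

```python
def svr_next(interrupted, scheduled):
    """Pick the thread that receives control on a supervisor return.

    Threads are (priority, name) tuples; a larger priority wins (assumed).
    """
    higher = [t for t in scheduled if t[0] > interrupted[0]]
    if higher:
        # A scheduled thread outranks the interrupted one: run it first.
        return max(higher, key=lambda t: t[0])
    return interrupted
```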

    Here is how I envisage the sequence of events in response to
    an interrupt could work:

    1) The computer, at the beginning of an area of memory
    sufficient to hold all the contents of the computer's
    registers, including the PSW and program counter, places
    a _restore status_.

    I found no reason and no rationale to assign any of this to
    any specific place in memory.

    Instead, the entire Supervisor "Stack" is in a set of control
    registers, and from those control registers we can find any
    thread and all state associated with them.

    The Supervisor Stack contains 4 CRs that point at {HV, GuestHV,
    GuestOS, Application}, a 2-bit "who is currently in charge" field, a
    6-bit priority, an interrupt table pointer, and a dispatcher.
    It is (also) 1 cache line long.

    2) The computer switches to the interrupt register bank,
    and places a pointer to the restore status in one of the
    registers, according to a known convention.

    3) As the interrupt service routine runs, the computer,
    separately and in the background, saves the registers of the
    interrupted program into memory. Once this is complete, the
    _restore status_ value in memory is changed to reflect this.

    4) The restore status value has _two_ uses.

    One is, obviously, that when returning from an interrupt,
    there will be a 'return from interrupt' routine that will
    either just switch register banks, if the registers aren't
    saved yet, or re-fill the registers that are actually in
    use (the restore status also indicating what the complement
    of registers available to the interrupted program was) from
    memory.

    You are going to have a lot of complications with SoftIRQ/DPCs
    doing it this way.

    The other is that the restore status can be tested. If the
    main register set isn't saved yet, then it's too soon after
    the interrupt to *switch to another user program* which also
    would use the main register set, but with a different set
    of saved values.

    5) Some other factors complicate this.

    There may be multiple sets of user program registers to
    facilitate SMT.

    The standard practice in an operating system is to leave
    the privileged interrupt service routine as quickly as
    possible, and continue handling the interrupt in an
    unprivileged portion of the operating system.

    This is the SoftIRQs and DPCs.

    However, the
    return from interrupt instruction is obviously privileged,

    Why ?? If you are not "in" an interrupt it can raise OPERATION
    exception WITHOUT BEING PRIVILEGED !! {{Hint: you don't want
    a privileged thread to perform a return from interrupt unless
    you are IN an interrupt EITHER.}}

    as it allows one to put an arbitrary value from memory into
    the Program Status Word, including one that would place
    the computer into a privileged state after the return.

    That last is not unique to the Concertina II, however. So
    the obvious solution of allowing the kernel to call
    unprivileged subroutines - which terminate in a supervisor
    call rather than a normal return - has been found.

    How does privilege get restored on return ??

    John Savard

  • From Scott Lurndal@21:1/5 to [email protected] on Wed Jan 17 23:55:45 2024
    MitchAlsup1 writes:
    EricP wrote:

    Quadibloc wrote:
    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    It needs to save the portion of the machine state overwritten by that
    interrupt. Often this is a small subset of the whole state because many
    interrupts only need a few integer registers, maybe 6 or 7.

    Would an interrupt handler not run fewer instructions if its register
    state was seeded with pointers of use to the device(s) being serviced ??

    I doubt that would have any effect one way or the other on the
    number of instructions executed by the handler (a difference of one
    instruction isn't significant).


    Additional integer state will be saved and restored as needed by the call ABI
    so does not need to be done for every interrupt by the handler prologue.

    This allows the OS to only save and restore the full state on thread
    switch which only happens after something significant occurs that
    changes the highest priority thread on a particular core.

    Doesn't running a SoftIRQ (or DPC) require a full register state ??

    Those are run in a separate kernel thread not the interrupt
    handler. They have full context, provided by the thread.

    And don't most device interrupts need to SoftIRQ ??

    Some do, some don't. It really depends on the interrupt
    and the operating software (i.e. the linux kernel stack,
    which is structured around PCIe semantics).


    I have trouble imagining what an interrupt handler might use vectors for.

    Memory to memory move from Disk Cache to User Buffer.

    That's DMA - no cpu intervention required. If you're referring
    to a "Soft" disk cache maintained by the kernel (in unix, the
    buffer cache or file cache), a dedicated DMA engine that can offload
    such transfers would be a better solution than using
    vector registers in an interrupt handler which has no business
    transferring bulk data.

    SoftIRQ might use Vector arithmetic to verify CRC, Encryption, ...

    Both of those are offloaded to on-chip accelerators. Much better
    use of area.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Jan 18 00:01:33 2024
    Chris M. Thomasson wrote:

    On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    State needs to be saved, whether SW or HW does the save is a free variable.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    Also note:: the ABA problem can happen when an interrupt transpires
    in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
    before transferring control to the interrupt handler.
    [...]

    Just to be clear an interrupt occurring within the hardware
    implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
    should not affect the outcome of the CAS. Actually, it should not happen
    at all, right? CAS does not have any spurious failures.

    ABA failure happens BECAUSE one uses the value of data to decide if
    something appeared ATOMIC. The CAS instruction (itself and all variants)
    is ATOMIC, but the setup to CAS is non-ATOMIC, because the original value
    to be compared was fetched without any ATOMIC indicator, and someone else
    can alter it before CAS. If more than 1 thread alters the location, it
    can (seldom) end up with the same data value as the suspended thread
    thought it should be.

    CAS is ATOMIC, the code leading to CAS was not and this opens up the
    hole.

    Note:: CAS functionality implemented with LL/SC does not suffer ABA
    because the core monitors the LL address until the SC is performed.
    It is an address-based comparison, not a data-value-based one.
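
    The difference can be shown with a small single-threaded simulation. This is a sketch under assumptions: the Cell class and aba_demo are hypothetical, and a version counter bumped on every store stands in for the hardware monitor on the LL address.

```python
class Cell:
    def __init__(self, value):
        self.value = value
        self.version = 0  # bumped on every store; models the LL monitor

    def store(self, value):
        self.value = value
        self.version += 1

    def cas(self, expected, new):
        # Data-value comparison: cannot tell A from A -> B -> A.
        if self.value != expected:
            return False
        self.store(new)
        return True

    def load_linked(self):
        return self.value, self.version

    def store_conditional(self, reservation, new):
        # Address-style comparison: any intervening store breaks it.
        if self.version != reservation:
            return False
        self.store(new)
        return True


def aba_demo():
    cas_cell, llsc_cell = Cell("A"), Cell("A")

    # The suspended thread reads the location, then gets interrupted.
    seen = cas_cell.value
    _, reservation = llsc_cell.load_linked()

    # Meanwhile other threads change A -> B -> A on both cells.
    for cell in (cas_cell, llsc_cell):
        cell.store("B")
        cell.store("A")

    # CAS is fooled by the matching value; SC notices the lost reservation.
    return cas_cell.cas(seen, "C"), llsc_cell.store_conditional(reservation, "C")
```

    Here aba_demo() returns (True, False): the CAS succeeds despite the intervening writes, while the store-conditional fails and forces a retry.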

  • From EricP@21:1/5 to All on Wed Jan 17 21:00:06 2024
    MitchAlsup1 wrote:
    EricP wrote:

    Quadibloc wrote:
    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    It needs to save the portion of the machine state overwritten by that
    interrupt. Often this is a small subset of the whole state because many
    interrupts only need a few integer registers, maybe 6 or 7.

    Would an interrupt handler not run fewer instructions if its register
    state was seeded with pointers of use to the device(s) being serviced ??

    It would save a couple of load immediate instructions and
    costs a set of preserved registers for each interrupt level.
    Not worth the trouble.

    Additional integer state will be saved and restored as needed by the
    call ABI so does not need to be done for every interrupt by the
    handler prologue.

    This allows the OS to only save and restore the full state on thread
    switch which only happens after something significant occurs that
    changes the highest priority thread on a particular core.

    Doesn't running a SoftIRQ (or DPC) require a full register state ??
    And don't most device interrupts need to SoftIRQ ??
    {{Yes, I see timer interrupts not needing so much of that}}

    No, not if you layer the software appropriately.

    The ISR prologue saves the non-preserved register subset (say R0 to R7).
    The First Level Interrupt Handler (FLIH) determines whether to
    restore the saved register subset or jump into the OS.
    If it decides to jump into the OS then R0:R7 are already saved on the stack, and since this was a FLIH that stack must be the prior thread kernel stack.
    So that leaves R8:R31 as still containing the prior thread's values.

    You call whatever routines you like, when they return to this routine
    R8:R31 will still contain the prior thread's data. Only when you decide
    to switch threads do you need to spill R8:R31, into the thread header
    context save area, plus any float, simd or vector registers,
    and then save the kernel stack pointer there so you can find the values
    you pushed in the prologue (if you need to edit the thread context).

    You then switch thread header pointers to the new thread's.

    To load the next thread you pick up from the new thread context R8:R31
    and float, simd, vector registers, and the kernel stack pointer.
    Its R0:R7 remain on the kernel stack where they were left when saved.

    You can now return to the ISR epilogue which pops R0:R7 for this new
    thread and REI Return from Exception or Interrupt to run the new thread.
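
    The layering above can be sketched as a toy model. It assumes 32 integer registers and omits float, SIMD and vector state as well as the saved kernel stack pointer. The class and method names (Thread, Cpu, isr_prologue, switch_to, isr_epilogue) are hypothetical.

```python
class Thread:
    def __init__(self, name):
        self.name = name
        self.kernel_stack = []    # holds R0:R7 pushed by the ISR prologue
        self.saved_r8_r31 = None  # context save area in the thread header


class Cpu:
    def __init__(self, thread):
        self.regs = [0] * 32
        self.current = thread

    def isr_prologue(self):
        # Save only the non-preserved subset on the prior thread's stack.
        self.current.kernel_stack.append(self.regs[0:8])

    def switch_to(self, new_thread):
        # Only a thread switch pays for spilling R8:R31.
        self.current.saved_r8_r31 = self.regs[8:32]
        self.current = new_thread
        self.regs[8:32] = new_thread.saved_r8_r31

    def isr_epilogue(self):
        # Pop R0:R7 of whichever thread is current now, then "REI".
        self.regs[0:8] = self.current.kernel_stack.pop()
```

    An interrupt that returns straight to the interrupted thread touches 8 registers in this model; the other 24 move only when switch_to runs.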

    This occurs much less than the frequency of interrupts.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    The Concertina II has a potentially large machine state which most
    programs do not use. There are vector registers, of the huge
    kind found in the Cray I. There are banks of 128 registers to
    supplement the banks of 32 registers.

    I have trouble imagining what an interrupt handler might use vectors for.

    Memory to memory move from Disk Cache to User Buffer.
    SoftIRQ might use Vector arithmetic to verify CRC, Encryption, ...

    I was thinking of device interrupt service routines but yeah
    the DPC/SoftIRQ routines might do this.

    I would have those routines that do want this manually save and restore
    any non-integer registers. There may be other sync issues to deal with
    (pending exceptions intended for the prior thread context).

    A set of utility Hardware Abstraction Layer (HAL) subroutines could
    handle this for each platform.

    Some OS's deal with this by specifying that drivers can only use
    integers. (Graphics drivers get special dispensation but they run in
    special context.)

  • From Quadibloc@21:1/5 to EricP on Thu Jan 18 06:22:30 2024
    On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    It needs to save the portion of the machine state overwritten by that interrupt. Often this is a small subset of the whole state because many interrupts only need a few integer registers, maybe 6 or 7.
    Additional integer state will be saved and restored as needed by the call ABI so does not need to be done for every interrupt by the handler prologue.

    Yes, that is true in many cases.

    I have trouble imagining what an interrupt handler might use vectors for.

    However, you're apparently forgetting one very important case.

    What if the interrupt is a *real-time clock* interrupt, and what is going
    to happen is that the computer will _not_ return immediately from that interrupt to the interrupted program, but instead, regarding it as compute-bound, proceed to a different program possibly even belonging
    to another user?

    So you're quite correct that the problem does not _always_ arise. But it
    does arise on occasion.

    John Savard

  • From EricP@21:1/5 to Quadibloc on Thu Jan 18 10:31:22 2024
    Quadibloc wrote:
    On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.
    It needs to save the portion of the machine state overwritten by that
    interrupt. Often this is a small subset of the whole state because many
    interrupts only need a few integer registers, maybe 6 or 7.
    Additional integer state will be saved and restored as needed by the call ABI
    so does not need to be done for every interrupt by the handler prologue.

    Yes, that is true in many cases.

    I have trouble imagining what an interrupt handler might use vectors for.

    However, you're apparently forgetting one very important case.

    What if the interrupt is a *real-time clock* interrupt, and what is going
    to happen is that the computer will _not_ return immediately from that interrupt to the interrupted program, but instead, regarding it as compute-bound, proceed to a different program possibly even belonging
    to another user?

    So you're quite correct that the problem does not _always_ arise. But it
    does arise on occasion.

    John Savard

    No, I'm not forgetting that. Return from Exception or Interrupt (REI)
    has two possible paths, return to what it was doing before or jump into
    the OS and do more processing. On many platforms this particular piece
    of code is long, complicated, and riddled with race conditions.

    But designing an REI mechanism that makes this sequence
    simple, efficient and fast is a separate issue.

    For interrupts there are two main pieces of code, the Interrupt Service
    Routine (ISR), and post processing routine DPC/SoftIrq.

    The ISR for a particular device is called by the OS in response
    to a hardware priority interrupt. The ISR may decide it needs further processing but does not want to block other interrupts while doing it
    so ISR posts a request for deferred post processing.

    There also can be many restrictions on what an ISR is allowed to do
    because the OS designers did not want to, say, force every ISR to
    sync with the slow x87 FPU just in case someone wanted to use it.

    I would not assume that anything other than integer registers would be available in an ISR.

    In a post processing DPC/SoftIrq routine it might be possible but again
    there can be limitations. What you don't ever want to happen is to hang
    the cpu waiting to sync with a piece of hardware so you can save its state,
    as might happen if it was a co-processor. You also don't want to have to
    save any state just in case a post routine might want to do something,
    but rather save/restore the state on demand and just what is needed.
    So it really depends on the device and the platform.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Jan 19 12:10:56 2024
    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    State needs to be saved, whether SW or HW does the save is a free
    variable.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    Also note:: the ABA problem can happen when an interrupt transpires
    in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
    before transferring control to the interrupt handler.
    [...]

    Just to be clear an interrupt occurring within the hardware
    implementation of a CAS operation (e.g., lock cmpxchg over on Intel)
    should not affect the outcome of the CAS. Actually, it should not
    happen at all, right? CAS does not have any spurious failures.

    ABA failure happens BECAUSE one uses the value of data to decide if
    something appeared ATOMIC. The CAS instruction (itself and all variants)
    is ATOMIC, the setup to CAS is non-ATOMIC, because the original value
    to be compared was fetched without any ATOMIC indicator, and someone else
    can alter it before CAS. If more than 1 thread alters the location, it
    can (seldom) end up with the same data value as the suspended thread
    thought it should be.

    CAS is ATOMIC, the code leading to CAS was not and this opens up the hole.
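
    The hole can be replayed deterministically on a single thread. The
    following is a minimal sketch in C11, assuming a stack of three nodes
    named by index (A, B, C); the names next_of, cas_head and aba_replay
    are made up for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NIL (-1)

/* Hypothetical node pool: nodes are named by index, next_of[i] links them. */
static int next_of[3];          /* 0 = A, 1 = B, 2 = C */
static _Atomic int head;

/* Value-based CAS on head: succeeds iff head still holds `expected`. */
static bool cas_head(int expected, int desired) {
    return atomic_compare_exchange_strong(&head, &expected, desired);
}

/* Replays the ABA interleaving on one thread; returns the final head. */
int aba_replay(void) {
    enum { A = 0, B = 1, C = 2 };
    next_of[A] = B; next_of[B] = C; next_of[C] = NIL;
    atomic_store(&head, A);                 /* head -> A -> B -> C */

    /* "Thread 1" starts a pop: these reads are not part of the CAS. */
    int old_head = atomic_load(&head);      /* A */
    int new_head = next_of[old_head];       /* B */

    /* Thread 1 is suspended. "Thread 2" pops A, pops B, pushes A back. */
    atomic_store(&head, next_of[A]);        /* head -> B -> C */
    atomic_store(&head, next_of[B]);        /* head -> C; B is now free */
    next_of[A] = atomic_load(&head);
    atomic_store(&head, A);                 /* head -> A -> C */

    /* Thread 1 resumes. head == A again, so the CAS succeeds ... */
    bool ok = cas_head(old_head, new_head);
    assert(ok);
    /* ... and head now names B, which is no longer on the list. */
    return atomic_load(&head);
}
```

    The CAS itself behaved atomically. The damage comes from the value it
    compared against having been fetched earlier, outside the atomic step.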

    Note:: CAS functionality implemented with LL/SC does not suffer ABA
    because the core monitors the LL address until the SC is performed.
    It is an address-based comparison, not a data-value-based one.

    Yes but an equal point of view is that LL/SC only emulates atomic and
    uses the cache line ownership grab while "locked" to detect possible
    interference and infer potential change.

    Note that if LL/SC is implemented with temporary line pinning
    (as might be done to guarantee forward progress and prevent ping-pong)
    then it cannot be interfered with, and CAS and atomic-fetch-op sequences
    are semantically identical to the equivalent single instructions
    (which may also be implemented with temporary line pinning if their
    data must move from cache through the core and back).

    Also LL/SC as implemented on Alpha, MIPS, Power, ARM, RISC-V don't allow
    any other location loads or stores between them so really aren't useful
    for detecting ABA because detecting it requires monitoring two memory
    locations for change.

    The classic example is the singly linked list with items head->A->B->C.
    Detecting ABA requires monitoring if either head or head->Next changes,
    which LL/SC cannot do as reading head->Next cancels the lock on head.

    x86 has cmpxchg8b and ARM has double wide LL/SC which can be used to
    implement CASD atomic-double-wide-compare-and-swap. The first word holds
    the head pointer and the second word holds a generation counter whose
    change is used to infer that head->Next might have changed.
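
    A minimal sketch of the generation-counter idea in C11 follows. As an
    assumption for the example, the "double-wide" word is modelled as one
    64-bit atomic holding a 32-bit node index and a 32-bit generation
    count; pack, update_head and stale_cas_rejected are made-up names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Head word = {generation : 32, node index : 32}, swapped as one unit. */
static _Atomic uint64_t tagged_head;

static uint64_t pack(uint32_t gen, uint32_t idx) {
    return ((uint64_t)gen << 32) | idx;
}
static uint32_t idx_of(uint64_t w) { return (uint32_t)w; }
static uint32_t gen_of(uint64_t w) { return (uint32_t)(w >> 32); }

/* Every successful update installs a new index AND bumps the generation. */
static bool update_head(uint64_t seen, uint32_t new_idx) {
    uint64_t desired = pack(gen_of(seen) + 1, new_idx);
    return atomic_compare_exchange_strong(&tagged_head, &seen, desired);
}

/* Returns true if the stale CAS was (correctly) rejected. */
bool stale_cas_rejected(void) {
    enum { A = 0, B = 1, C = 2 };
    atomic_store(&tagged_head, pack(0, A));

    uint64_t seen = atomic_load(&tagged_head);  /* thread 1 snapshot */

    /* Thread 2: pop A, pop B, push A. Index is A again, generation is 3. */
    bool ok = update_head(atomic_load(&tagged_head), B);
    ok = ok && update_head(atomic_load(&tagged_head), C);
    ok = ok && update_head(atomic_load(&tagged_head), A);
    assert(ok);
    assert(idx_of(atomic_load(&tagged_head)) == idx_of(seen));

    /* Thread 1 resumes: same index, different generation, so the CAS fails. */
    return !update_head(seen, B);
}
```

    The counter can still wrap, so this narrows the window rather than
    closing it in principle; with 32 bits the wrap is rarely a practical
    concern.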

  • From Quadibloc@21:1/5 to EricP on Sat Jan 20 18:54:33 2024
    On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    It needs to save the portion of the machine state overwritten by that
    interrupt. Often this is a small subset of the whole state because many
    interrupts only need a few integer registers, maybe 6 or 7.
    Additional integer state will be saved and restored as needed by the call
    ABI so does not need to be done for every interrupt by the handler prologue.

    Having been so concerned by the large machine state of the Concertina
    II, parts of which were rarely used, and not realizing the conventional
    approach was entirely adequate... I missed what was the biggest flaw in
    interrupts on the Concertina II.

    Because in some important ways it is patterned after the IBM System/360,
    it shares its biggest problem with interrupts.

    On the System/360, it is a *convention* that the last few general
    registers, registers 11, 12, 13, 14, and 15 or so, are used as base
    registers. A base register *must* be properly set up before a program can
    write any data to memory.

    So one can't just have an interrupt behave like on an 8-bit
    microprocessor, saving only the program counter and the status bits, and
    leaving any registers to be saved by software. At least some of the
    general registers have to be saved, and set up with new starting values,
    for the interrupt routine to be able to save anything else, if need be.

    Of course, the System/360 was able to solve this problem, so it's not
    intractable. But the fact that the System/360 solved it by saving all
    sixteen general registers, and then loading them from an area in memory
    allocated to that interrupt type, is what fooled me into thinking I
    would need to automatically save _everything_. It didn't save the
    floating-point registers - software did, if the need was to move to
    a different user's program, and saving the state in two separate pieces
    by two different parts of the OS did not cause hopeless confusion.

    John Savard

  • From MitchAlsup1@21:1/5 to EricP on Sun Jan 21 20:54:05 2024
    EricP wrote:

    There also can be many restrictions on what an ISR is allowed to do
    because the OS designers did not want to, say, force every ISR to
    sync with the slow x87 FPU just in case someone wanted to use it.

    What about all the architectures that are not x86 and do not need to synch
    to FP, Vectors, SIMD, ..... ?? Why are they constrained by the one badly
    designed long life architecture ??

    I would not assume that anything other than integer registers would be
    available in an ISR.

    This is quite reasonable: as long as you have a sufficient number that
    the ISR can be written in some HLL without a bunch of flags to the compiler.

    In a post processing DPC/SoftIrq routine it might be possible but again
    there can be limitations. What you don't ever want to happen is to hang
    the cpu waiting to sync with a piece of hardware so you can save its state,
    as might happen if it was a co-processor. You also don't want to have to
    save any state just in case a post routine might want to do something,
    but rather save/restore the state on demand and just what is needed.
    So it really depends on the device and the platform.

    As long as there is not more than one flag to clue the compiler in,
    I am on board.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun Jan 21 20:55:40 2024
    Chris M. Thomasson wrote:

    On 1/17/2024 4:01 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    State needs to be saved, whether SW or HW does the save is a free
    variable.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    Also note:: the ABA problem can happen when an interrupt transpires
    in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
    before transferring control to the interrupt handler.
    [...]

    Just to be clear an interrupt occurring within the hardware
    implementation of a CAS operation (e.g., lock cmpxchg over on Intel)
    should not affect the outcome of the CAS. Actually, it should not
    happen at all, right? CAS does not have any spurious failures.

    ABA failure happens BECAUSE one uses the value of data to decide if
    something appeared ATOMIC. The CAS instruction (itself and all variants)
    is ATOMIC, the setup to CAS is non-ATOMIC, because the original value
    to be compared was fetched without any ATOMIC indicator, and someone else
    can alter it before CAS. If more than 1 thread alters the location, it
    can (seldom) end up with the same data value as the suspended thread
    thought it should be.

    Yup. Fwiw, some years ago I actually tried to BURN a CAS by creating
    several rogue threads that would alter the CAS target using random
    numbers at full speed ahead. The interesting part is that forward
    progress was damaged for sure, but still occurred. It did not livelock
    on me. Interesting.


    CAS is ATOMIC, the code leading to CAS was not and this opens up the hole.

    Indeed.


    Note:: CAS functionality implemented with LL/SC does not suffer ABA
    because the core monitors the LL address until the SC is performed.
    It is an address-based comparison, not a data-value-based one.

    Exactly. Actually, I asked myself if I just wrote a stupid question to
    you. Sorry Mitch... ;^)


    No need to be sorry, this is a NG dedicated to making people think, and
    then after they have expressed what they thought, to correct and refine
    what they think and how.

  • From MitchAlsup1@21:1/5 to EricP on Sun Jan 21 20:58:51 2024
    EricP wrote:

    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    State needs to be saved, whether SW or HW does the save is a free
    variable.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    Also note:: the ABA problem can happen when an interrupt transpires
    in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
    before transferring control to the interrupt handler.
    [...]

    Just to be clear an interrupt occurring within the hardware
    implementation of a CAS operation (e.g., lock cmpxchg over on Intel)
    should not affect the outcome of the CAS. Actually, it should not
    happen at all, right? CAS does not have any spurious failures.

    ABA failure happens BECAUSE one uses the value of data to decide if
    something appeared ATOMIC. The CAS instruction (itself and all variants)
    is ATOMIC, the setup to CAS is non-ATOMIC, because the original value
    to be compared was fetched without any ATOMIC indicator, and someone else
    can alter it before CAS. If more than 1 thread alters the location, it
    can (seldom) end up with the same data value as the suspended thread
    thought it should be.

    CAS is ATOMIC, the code leading to CAS was not and this opens up the hole.
    Note:: CAS functionality implemented with LL/SC does not suffer ABA
    because the core monitors the LL address until the SC is performed.
    It is an address-based comparison, not a data-value-based one.

    Yes but an equal point of view is that LL/SC only emulates atomic and
    uses the cache line ownership grab while "locked" to detect possible
    interference and infer potential change.

    Which, BTW, opens up a different side channel ...

    Note that if LL/SC is implemented with temporary line pinning
    (as might be done to guarantee forward progress and prevent ping-pong)
    then it cannot be interfered with, and CAS and atomic-fetch-op sequences
    are semantically identical to the equivalent single instructions
    (which may also be implemented with temporary line pinning if their
    data must move from cache through the core and back).

    Line pinning requires a NAK in the coherence protocol. As far as I know,
    only My 66000 interconnect protocol has such a NAK.

    Also LL/SC as implemented on Alpha, MIPS, Power, ARM, RISC-V don't allow
    any other location loads or stores between them so really aren't useful
    for detecting ABA because detecting it requires monitoring two memory
    locations for change.

    The classic example is the singly linked list with items head->A->B->C.
    Detecting ABA requires monitoring if either head or head->Next changes,
    which LL/SC cannot do as reading head->Next cancels the lock on head.

    Detecting ABA requires one to monitor addresses not data values.

    x86 has cmpxchg8b and ARM has double wide LL/SC which can be used to
    implement CASD atomic-double-wide-compare-and-swap. The first word holds
    the head pointer and the second word holds a generation counter whose
    change is used to infer that head->Next might have changed.

  • From MitchAlsup1@21:1/5 to Quadibloc on Sun Jan 21 21:02:01 2024
    Quadibloc wrote:

    On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
    Quadibloc wrote:

    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    It needs to save the portion of the machine state overwritten by that
    interrupt. Often this is a small subset of the whole state because many
    interrupts only need a few integer registers, maybe 6 or 7.
    Additional integer state will be saved and restored as needed by the call ABI
    so does not need to be done for every interrupt by the handler prologue.

    Having been so concerned by the large machine state of the Concertina
    II, parts of which were rarely used, and not realizing the conventional
    approach was entirely adequate... I missed what was the biggest flaw in
    interrupts on the Concertina II.

    The second to last word is misleading and unnecessary::

    interrupts on Concertina II.

    That implies there will be only one and it already exists.

    Because in some important ways it is patterned after the IBM System/360,
    it shares its biggest problem with interrupts.

    On the System/360, it is a *convention* that the last few general
    registers, registers 11, 12, 13, 14, and 15 or so, are used as base
    registers. A base register *must* be properly set up before a program can
    write any data to memory.

    Captain Obvious strikes again...

    So one can't just have an interrupt behave like on an 8-bit
    microprocessor, saving only the program counter and the status bits, and
    leaving any registers to be saved by software. At least some of the
    general registers have to be saved, and set up with new starting values,
    for the interrupt routine to be able to save anything else, if need be.

    Once HW starts saving "a few" it might as well save "enough" of them to
    matter.

    Of course, the System/360 was able to solve this problem, so it's not
    intractable. But the fact that the System/360 solved it by saving all
    sixteen general registers, and then loading them from an area in memory
    allocated to that interrupt type, is what fooled me into thinking I
    would need to automatically save _everything_. It didn't save the
    floating-point registers - software did, if the need was to move to
    a different user's program, and saving the state in two separate pieces
    by two different parts of the OS did not cause hopeless confusion.

    John Savard

  • From MitchAlsup1@21:1/5 to BGB on Sun Jan 21 21:18:06 2024
    BGB wrote:

    On 1/20/2024 12:54 PM, Quadibloc wrote:
    On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:

    So one can't just have an interrupt behave like on an 8-bit
    microprocessor, saving only the program counter and the status bits, and
    leaving any registers to be saved by software. At least some of the
    general registers have to be saved, and set up with new starting values,
    for the interrupt routine to be able to save anything else, if need be.


    IIRC, saving off PC, some flags bits, swapping the stack registers, and
    doing a computed branch relative to a control register (via bit-slicing,
    *). This is effectively the interrupt mechanism I am using on a 64-bit ISA.

    And sounds like the interrupt mechanism for an 8-bit µprocessor...

    *: For a table that is generally one-off in the kernel or similar, it
    doesn't ask much to mandate that it has a certain alignment. And if the
    required alignment is larger than the size of the table, you have just
    saved yourself needing an adder...
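
    The alignment point can be sketched in C. The table geometry here (8
    entries of 16 bytes, 256-byte alignment) and the name vector_entry are
    assumptions chosen for the example, not the actual layout of any ISA
    discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 8 entry points of 16 bytes each, a 128-byte table. */
#define ENTRY_SHIFT 4u
#define ENTRY_COUNT 8u
#define TABLE_ALIGN 256u   /* required alignment, larger than the table */

/* Entry address formed by bit-slicing: OR the offset into the zero low bits. */
uint64_t vector_entry(uint64_t table_base, unsigned cause) {
    assert((table_base & (TABLE_ALIGN - 1)) == 0);  /* low bits are zero */
    assert(cause < ENTRY_COUNT);
    return table_base | ((uint64_t)cause << ENTRY_SHIFT);
}
```

    Because the low bits of the base are known to be zero, the OR gives
    the same result as base + offset with no carry chain, which is the
    adder being saved.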

    In a modern system where you have several HyperVisors and a multiplicity
    of GuestOSs, a single interrupt table is unworkable looking forward.
    What you want and need is every GuestOS to have its own table, and
    every HyperVisor have its own table, some kind of routing mechanism to
    route device interrupts to the correct table, and inform appropriate
    cores of raised and enabled interrupts. All these tables have to be
    concurrently available continuously and simultaneously. The old fixed
    mapping will no longer work efficiently--you can make them work with
    a near-Herculean amount of careful programming.

    Or you can look at the problem from a modern viewpoint and fix the
    model so the above is manifest.

    If anything, it is a little simpler than the mechanism used on some
    8-bit systems, which would have needed a mechanism to push these values
    to the stack, and restore them from the stack.

    Do you think your mechanism would work "well" with 1024 cores in your
    system ??

    Having side-channels that allow these values to "magically appear" in
    certain SPRs is at least simpler, though the cost of the logic itself
    could be more up for debate.

    Once you have an FMAC FPU none of the interrupt logic adds any area
    to any core.

    Of course, the System/360 was able to solve this problem, so it's not
    intractable. But the fact that the System/360 solved it by saving all
    sixteen general registers, and then loading them from an area in memory
    allocated to that interrupt type, is what fooled me into thinking I
    would need to automatically save _everything_. It didn't save the
    floating-point registers - software did, if the need was to move to
    a different user's program, and saving the state in two separate pieces
    by two different parts of the OS did not cause hopeless confusion.


    For ISR convenience, it would make sense to have, say, two SPR's or CR's
    designated for ISR use to squirrel off values from GPRs to make the
    prolog/epilog easier. Had considered this, but not done so (in this
    case, first thing the ISR does is save a few registers to the ISR stack
    to free them up to get the rest of the registers saved "more properly",
    then has to do a secondary reload to get these registers back into the
    correct place).

    Or, you could take the point of view that your architecture makes context
    switching easy (like 1 instruction from application 1 to application 2)
    and when you do this the rest of the model pretty much drops in for free.

    Assuming interrupts aren't too common, then it isn't a huge issue, and
    seemingly a majority of the clock-cycles spent on interrupt entry mostly
    have to do with L1 misses (since typically pretty much nothing the ISR
    touches is in the L1 cache; and in my implementation, ISR's may not
    share L1 cache lines with non-ISR code; so basically it is an
    architecturally required "L1 miss" storm).

    The number of cycles between raising the interrupt at the device and the
    servicing of the interrupt by the ISR (prior to scheduling a SoftIRQ/DPC)
    is the key number. All these little pieces of state that are obtained
    1 at a time by running code not in the ICache, along with the manipulations
    on the TLBs, ... get in the way of your goal.

    Only real way to avoid a lot of the L1 misses would be to have multiple
    sets of banked registers or similar, but, this is not cheap for the
    hardware... Similarly, the ISR would need to do as little, and touch as
    little memory, as is possible to perform its task.

    You can read up thread-state from DRAM and continue what you are working
    on until they arrive, and when they arrive, you ship out the old values
    as if the core state was a cache in its own right. Thus, you continue to
    make progress on the current thread until you have the state needed to
    run the ISR thread and when it arrives you have everything you need to
    proceed. ...

    John Savard

  • From Scott Lurndal@21:1/5 to [email protected] on Sun Jan 21 21:58:53 2024
    [email protected] (MitchAlsup1) writes:


    In a modern system where you have several HyperVisors and a multiplicity
    of GuestOSs, a single interrupt table is unworkable looking forward.
    What you want and need is every GuestOS to have its own table, and
    every HyperVisor have its own table, some kind of routing mechanism to
    route device interrupts to the correct table, and inform appropriate
    cores of raised and enabled interrupts. All these tables have to be
    concurrently available continuously and simultaneously. The old fixed
    mapping will no longer work efficiently--you can make them work with
    a near-Herculean amount of careful programming.

    For an extant implementation thereof, see

    GICv3 Architecture Specification.

    https://documentation-service.arm.com/static/6012f2e54ccc190e5e681256

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon Jan 22 02:07:22 2024
    Chris M. Thomasson wrote:

    On 1/21/2024 12:58 PM, MitchAlsup1 wrote:

    Detecting ABA requires one to monitor addresses not data values.

    Not 100% true.

    IBM's original ABA problem was encountered when a background task
    (once a week or once a month) was swapped out to disk the instruction
    prior to CAS, and when it came back the data comparison register
    matched the memory data, but the value to be swapped in had no
    relationship with the current linked list structure. Machine crashed.

    Without knowing the address, how can this particular problem be
    rectified ??

  • From MitchAlsup1@21:1/5 to BGB on Mon Jan 22 01:22:53 2024
    BGB wrote:

    On 1/21/2024 3:18 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 1/20/2024 12:54 PM, Quadibloc wrote:
    On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:

    So one can't just have an interrupt behave like on an 8-bit
    microprocessor, saving only the program counter and the status bits, and
    leaving any registers to be saved by software. At least some of the
    general registers have to be saved, and set up with new starting values,
    for the interrupt routine to be able to save anything else, if need be.


    IIRC, saving off PC, some flags bits, swapping the stack registers,
    and doing a computed branch relative to a control register (via
    bit-slicing, *). This is effectively the interrupt mechanism I am
    using on a 64-bit ISA.

    And sounds like the interrupt mechanism for an 8-bit µprocessor...


    It was partly a simplification of the design from the SH-4, which was a
    32-bit CPU mostly used in embedded systems (and in the Sega Dreamcast...).

    Though, the SH-4 did bank out half the registers, which was a feature
    that ended up being dropped for cost-saving reasons.


    *: For a table that is generally one-off in the kernel or similar, it
    doesn't ask much to mandate that it has a certain alignment. And if
    the required alignment is larger than the size of the table, you have
    just saved yourself needing an adder...

    In a modern system where you have several HyperVisors and a multiplicity
    of GuestOSs, a single interrupt table is unworkable looking forward.
    What you want and need is every GuestOS to have its own table, and
    every HyperVisor have its own table, some kind of routing mechanism to
    route device interrupts to the correct table, and inform appropriate
    cores of raised and enabled interrupts. All these tables have to be
    concurrently available continuously and simultaneously. The old fixed
    mapping will no longer work efficiently--you can make them work with
    a near-Herculean amount of careful programming.

    Or you can look at the problem from a modern viewpoint and fix the
    model so the above is manifest.


    Presumably, only the actual "bare metal" layer has an actual
    hardware-level interrupt table, and all of the "guest" tables are faked
    in software?...

    Much like with MMU:
    Only the base level needs to actually handle TLB miss events, and
    everything else (nested translation, etc), can be left to software
    emulation.

    Name a single ISA that fakes the TLB ?? (and has an MMU)

    If anything, it is a little simpler than the mechanism used on some
    8-bit systems, which would have needed a mechanism to push these
    values to the stack, and restore them from the stack.

    Do you think your mechanism would work "well" with 1024 cores in your
    system ??


    Number of cores should not matter that much.

    Exactly !! but then try running 1024 cores under differing GuestOSs, and
    HyperVisors under one set of system-wide Tables !!

    Presumably, each core gets its own ISR stack, which should not have any
    reason to need to interact with each other.

    I presume an interrupt can be serviced by any number of cores.
    I presume that there are a vast number of devices. Each device assigned
    to a few GuestOSs.
    I presume the core that services the interrupt (ISR) is running the same
    GuestOS under the same HyperVisor that initiated the device.
    I presume the core that services the interrupt was of the lowest priority
    of all the cores then running that GuestOS.
    I presume the core that services the interrupt wasted no time in doing so.

    And the GuestOS decides on how its ISR stack is {formatted, allocated,
    used, serviced, ...} which can be different for each GuestOS.

    For extra speed, maybe the ISR stacks could be mapped to some sort of
    core-local SRAM. This hasn't been done yet though.

    Caches either work or they don't.

    Wasting cycles fetching instructions, translations, and data are genuine
    overhead that can be avoided if one treats thread-state as a cache.

    If the interrupt occurs often enough to matter, its instructions, data,
    and translations will be in the cache hierarchy.

    HW that knows what it is doing can start fetching these things even
    BEFORE it can execute the first instruction on behalf of the interrupt
    dispatcher. SW can NEVER do any of this prior to starting to run
    instructions.

    Idea here being probably the SRAM region could have a special address
    range, and any access to this region would be invisible to any other
    cores (and it need not have backing in external RAM).

    One could maybe debate the cost of giving each core 4K or 8K of
    dedicated local SRAM though merely for "slightly faster interrupt
    handling".

    Yech: end of debate.....

    Having side-channels that allow these values to "magically appear" in
    certain SPRs is at least simpler, though the cost of the logic itself
    could be more up for debate.

    Once you have an FMAC FPU none of the interrupt logic adds any area
    to any core.


    I don't have conventional FMA because it would have had too much cost
    and latency.

    Because you are measuring from the wrong implementation technology,
    using macros ill suited to the problem at hand.


    For ISR convenience, it would make sense to have, say, two SPR's or
    CR's designated for ISR use to squirrel off values from GPRs to make
    the prolog/epilog easier. Had considered this, but not done so (in
    this case, first thing the ISR does is save a few registers to the ISR
    stack to free them up to get the rest of the registers saved "more
    properly", then has to do a secondary reload to get these registers
    back into the correct place).

    Or, you could take the point of view that your architecture makes context
    switching easy (like 1 instruction from application 1 to application 2)
    and when you do this the rest of the model pretty much drops in for free.


    This would cost well more, on the hardware side, than having two
    non-specialized CRs and being like "ISR's are allowed to stomp these at
    will, nothing else may use them".

    The sequencer is surprisingly small. Everything else already exists and
    is just waiting around for a signal to capture this or emit that.


    Some other tasks can be handled with the microsecond timer and a loop.

    Say:
    //void DelayUsec(int usec);
    DelayUsec:
      CPUID 30
      ADD R4, R0, R6
    .L0:
      CPUID 30
      CMPQGT R0, R6
      BT .L0
      RTS
    Which would create a certain amount of delay (in microseconds) relative
    to when the function is called.
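
    For readers unfamiliar with the ISA above, the same busy-wait can be
    sketched in portable C. This is an illustration only: read_usec_timer
    is a hypothetical stand-in for the microsecond timer read, here backed
    by C11 timespec_get, which is a wall clock and may be adjusted
    underneath you.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the microsecond timer read by CPUID 30. */
uint64_t read_usec_timer(void) {
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);   /* C11; wall clock, not monotonic */
    assert(base == TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Busy-wait until at least `usec` microseconds have elapsed. */
void delay_usec(uint32_t usec) {
    uint64_t deadline = read_usec_timer() + usec;
    while (read_usec_timer() < deadline) {
        /* spin: every iteration pays the full cost of one timer read */
    }
}
```

    The granularity of the delay is bounded below by the cost of one timer
    read, so a slow read makes short delays coarse.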

    CPUID in Opteron's time frame took 200-600 cycles. Do you really want to
    talk to your timer with those kinds of delay ??

  • From David Brown@21:1/5 to Scott Lurndal on Mon Jan 22 08:16:39 2024
    On 17/01/2024 19:02, Scott Lurndal wrote:
    Quadibloc <[email protected]d> writes:
    When a computer receives an interrupt signal, it needs to save
    the complete machine state, so that upon return from the
    interrupt, the program thus interrupted is in no way affected.

    This is because interrupts can happen at any time, and thus
    programs don't prepare for them or expect them. Any disturbance
    to the contents of any register would risk causing programs to
    crash.

    Something needs to preserve state, either the hardware or
    the software. Most risc processors lean towards the latter,
    generally for good reason - one may not need to save
    all the state if the interrupt handler only touches part of it.


    The Concertina II has a potentially large machine state which most
    programs do not use. There are vector registers, of the huge
    kind found in the Cray I. There are banks of 128 registers to
    supplement the banks of 32 registers.

    One obvious step in addressing this is for programs that don't
    use these registers to run without access to those registers.
    If this is indicated in the PSW, then the interrupt routine will
    know what it needs to save and restore.

    Just like x86 floating point.

    Also ARM floating point (at least, on the 32-bit Cortex-M ARM families).



    A more elaborate and more automated method is also possible.

    Let us imagine the computer speeds up interrupts by having a
    second bank of registers that interrupt routines use. But two
    register banks aren't enough, as many user programs are running
    concurrently.

    Here is how I envisage the sequence of events in response to
    an interrupt could work:

    1) The computer, at the beginning of an area of memory
    sufficient to hold all the contents of the computer's
    registers, including the PSW and program counter, places
    a _restore status_.

    Slow DRAM or special SRAMs? The former will add
    considerable latency to an interrupt, the latter costs
    area (on a per-hardware-thread basis) and floorplanning
    issues.

    An SRAM block sufficient to hold a small number of copies of registers,
    even for ISAs with lots of registers, would be small compared to
    typical cache blocks. Indeed, it could be considered as a kind of
    dedicated cache.


    Best is to save the minimal amount of state in hardware
    and let software deal with the rest, perhaps with
    hints from the hardware (e.g. a bit that indicates
    whether the FPRs were modified since the last context
    switch, etc).

    A combined effort sounds good to me.
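
    A minimal sketch of that combined approach, assuming a hardware hint bit
    that is set on the first FPR write after a context switch. The names
    cpu_state_t, saved_ctx_t, fp_dirty and ctx_save are hypothetical and do not
    come from any real ISA; the register counts are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_GPR 32
#define NUM_FPR 32

/* Hypothetical live machine state. fp_dirty stands in for a hardware
 * hint bit: set when an FPR is written after the last save. */
typedef struct {
    uint64_t gpr[NUM_GPR];
    uint64_t fpr[NUM_FPR];
    bool     fp_dirty;
} cpu_state_t;

/* Hypothetical per-thread save area kept by the OS. */
typedef struct {
    uint64_t gpr[NUM_GPR];
    uint64_t fpr[NUM_FPR];
} saved_ctx_t;

/* Always save the integer registers; save the FPRs only when the hint
 * bit says they changed. If the bit is clear, the copy already in ctx
 * is still current. Returns the number of 64-bit words copied. */
static size_t ctx_save(cpu_state_t *cpu, saved_ctx_t *ctx)
{
    size_t words = NUM_GPR;

    memcpy(ctx->gpr, cpu->gpr, sizeof ctx->gpr);
    if (cpu->fp_dirty) {
        memcpy(ctx->fpr, cpu->fpr, sizeof ctx->fpr);
        cpu->fp_dirty = false;
        words += NUM_FPR;
    }
    return words;
}
```

    The point of the sketch is only the cost difference: a thread that never
    touches the FPRs pays for half as many words on each save.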

  • From EricP@21:1/5 to All on Mon Jan 22 10:00:20 2024
    MitchAlsup1 wrote:
    EricP wrote:

    There also can be many restrictions on what an ISR is allowed to do
    because the OS designers did not want to, say, force every ISR to
    sync with the slow x87 FPU just in case someone wanted to use it.

    What about all the architectures that are not x86 and do not need to synch
    to FP, Vectors, SIMD, ..... ?? Why are they constrained by the one badly designed long life architecture ??

    This is about minimizing what it saves *by default*.

    I am not assuming a vector coprocessor would be halted by interrupts.
    Not automatically halting for interrupts is one reason to have a coprocessor.

    Also I am not assuming that halting these devices is free.

    It is also about an OS *by design* discouraging people from putting code
    which requires a large state save and restore into latency sensitive
    device drivers that can affect the whole system performance.

    If you really insist on using SIMD in a driver then
    (a) don't put it in an ISR, put it in a post routine,
    (b) use utility routines to manually save and restore that state.

    I would not assume that anything other than integer registers would be
    available in an ISR.

    This is quite reasonable: as long as you have a sufficient number that
    the ISR can be written in some HLL without a bunch of flags to the
    compiler.

    Yes, an OS can have a different ABI for ISR routines and everything else.
    ISR level routines would have a different declaration as 99% of the time
    a small save/restore set is sufficient.

    In a post processing DPC/SoftIrq routine it might be possible but again
    there can be limitations. What you don't ever want to happen is to hang
    the cpu waiting to sync with a piece of hardware so you can save its
    state, as might happen if it was a co-processor. You also don't want to have to
    save any state just in case a post routine might want to do something,
    but rather save/restore the state on demand and just what is needed.
    So it really depends on the device and the platform.

    As long as there is not more than one flag to clue the compiler in,
    I am on board.

    I would do it with declarations (routine attributes) as it is less
    error prone, just like MS C has stdcall, cdecl calling conventions.

    void __isrcall MyDeviceIsr (IoDevice_t *dev, etc);

    This would just be for an ISR routine and the few routines it calls.
    The driver post routines could use a standard ABI.

    The isrcall attribute changes the ABI to be R0:R7 are not preserved,
    R8:R31 are preserved. Also there is no need for a frame pointer
    as variable allocations are not allowed, neither are exceptions.
    It could also do things like change the stack pointer to be in
    R7 instead of R31 (just pointing out the possibilities).

    The interrupt prologue saves R0:R7 and loops calling the ISR for
    each device receiving an interrupt at that interrupt priority level.
    After all are serviced it checks if it needs post processing.
    If not then it executes the epilogue to restore R0:R7 and REI's.
    If it does then it saves R8:R15 to comply with the standard ABI
    and jumps into the OS Dispatcher which flushes the post routines.
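
    The prologue/epilogue flow described above can be sketched in portable C.
    This only models the control flow and the register counts; io_device_t,
    device_isr, irq_exit_t and interrupt_entry are hypothetical names, and a
    real implementation would be written in assembly with the __isrcall ABI.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device record; fields are illustrative only. */
typedef struct io_device {
    bool pending;     /* device is asserting an interrupt at this level */
    bool wants_post;  /* its ISR will ask for a post (DPC/SoftIrq) routine */
    int  serviced;    /* number of times the ISR has run */
} io_device_t;

/* Stand-in for an __isrcall routine, which may clobber only R0:R7.
 * Returns true if post processing is required. */
static bool device_isr(io_device_t *dev)
{
    dev->pending = false;
    dev->serviced++;
    return dev->wants_post;
}

typedef struct {
    int  regs_saved;          /* registers saved on this interrupt */
    bool entered_dispatcher;  /* true if post processing was needed */
} irq_exit_t;

/* Model of the interrupt entry path for one priority level. */
static irq_exit_t interrupt_entry(io_device_t *devs, size_t n)
{
    irq_exit_t out = { 8, false };  /* prologue has saved R0:R7 */
    bool need_post = false;

    /* Call the ISR of every device interrupting at this level. */
    for (size_t i = 0; i < n; i++)
        if (devs[i].pending && device_isr(&devs[i]))
            need_post = true;

    if (need_post) {
        out.regs_saved += 8;        /* also save R8:R15 for the standard ABI */
        out.entered_dispatcher = true;
    }
    return out;                     /* otherwise: restore R0:R7 and REI */
}
```

    In the common case only eight registers are saved; the second eight are
    paid for only when a post routine actually has to run.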

  • From Scott Lurndal@21:1/5 to [email protected] on Mon Jan 22 16:09:54 2024
    [email protected] (MitchAlsup1) writes:
    BGB wrote:


    Much like with MMU:
    Only the base level needs to actually handle TLB miss events, and
    everything else (nested translation, etc), can be left to software
    emulation.

    Name a single ISA that fakes the TLB ?? (and has an MMU)

    MIPS?


    Presumably, each core gets its own ISR stack, which should not have any
    reason to need to interact with each other.

    I presume an interrupt can be serviced by any number of cores.

    Or restricted to a specific set of cores (i.e. those currently
    owned by the target guest).

    The guest OS will generally specify the target virtual core (or set of
    cores) for a specific interrupt. The Hypervisor and/or hardware needs
    to deal with the case where the interrupt arrives while the target
    guest core isn't currently scheduled on a physical core (and poke
    the kernel to schedule the guest optionally). Such as recording
    the pending interrupt and optionally notifying the hypervisor that
    there is a pending guest interrupt so it can schedule the guest
    core(s) on physical cores to handle the interrupt.

    I presume that there are a vast number of devices. Each device assigned
    to a few GuestOSs.

    Or, with SR-IOV, virtual functions are assigned to specific guests
    and all interrupts are MSI-X messages from the device to the
    interrupt controller (LAPIC, GIC, etc).

    Dealing with inter-processor interrupts in a multicore guest can also
    be tricky; either trapped by the hypervisor or there must be hardware
    support in the interrupt controller to notify the hypervisor that a
    pending guest IPI interrupt has arrived. ARM started with the former
    behavior, but added a mechanism to handle direct injection of
    interprocessor interrupts by the guest, without hypervisor intervention
    (assuming the guest core is currently scheduled on a physical core,
    otherwise the hypervisor gets notified that there is a pending interrupt
    for a non-scheduled guest core).

    I presume the core that services the interrupt (ISR) is running the same
    GuestOS under the same HyperVisor that initiated the device.

    Generally a safe assumption. Note that the guest core may not be
    resident on any physical core when the guest interrupt arrives.

    I presume the core that services the interrupt was of the lowest priority
    of all the cores then running that GuestOS.
    I presume the core that services the interrupt wasted no time in doing so.

    And the GuestOS decides on how its ISR stack is {formatted, allocated, used,
    serviced, ...} which can be different for each GuestOS.

    To a certain extent, the format of the ISR stack is hardware defined,
    and the rest is completely up to the guest. ARM for example,
    saves the current PC into a system register (ELR_ELx) and switches
    the stack pointer. Everything else is up to the software interrupt
    handler to save/restore. I see little benefit in hardware doing
    any state saving other than that.



    If the interrupt occurs often enough to matter, its instructions, data,
    and translations will be in the cache hierarchy.

    Although there has been a great deal of work mitigating the
    number of interrupts (setting interrupt thresholds, RSS,
    polling (DPDK, ODP), etc)

    I don't see any advantages to all the fancy hardware interrupt
    proposals from either of you.

  • From EricP@21:1/5 to All on Mon Jan 22 11:24:48 2024
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:
    Just to be clear an interrupt occurring within the hardware
    implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
    should not affect the outcome of the CAS. Actually, it should not
    happen at all, right? CAS does not have any spurious failures.

    ABA failure happens BECAUSE one uses the value of data to decide if
    something appeared ATOMIC. The CAS instruction (itself and all variants)
    is ATOMIC, but the setup to CAS is non-ATOMIC, because the original value
    to be compared was fetched without any ATOMIC indicator, and someone else
    can alter it before CAS. If more than 1 thread alters the location,
    it can (seldom) end up with the same data value as the suspended thread
    thought it should be.

    CAS is ATOMIC, the code leading to CAS was not and this opens up the
    hole.

    Note:: CAS functionality implemented with LL/SC does not suffer ABA
    because the core monitors the LL address until the SC is performed.
    It is an address-based comparison, not a data-value-based one.

    Yes but an equal point of view is that LL/SC only emulates atomic and
    uses the cache line ownership grab while "locked" to detect possible
    interference and infer potential change.

    Which, BTW, opens up a different side channel ...

    How so? The location has to be inside the same virtual space.

    Note that if LL/SC is implemented with temporary line pinning
    (as might be done to guarantee forward progress and prevent ping-pong)
    then it cannot be interfered with, and CAS and atomic-fetch-op sequences
    are semantically identical to the equivalent single instructions
    (which may also be implemented with temporary line pinning if their
    data must move from cache through the core and back).

    Line pinning requires a NAK in the coherence protocol. As far as I know,
    only My 66000 interconnect protocol has such a NaK.

    Not necessarily, provided it is time limited (few tens of clocks).

    Also I suspect the worst case latency for moving a line ownership
    could be quite large (lots of queues and cache levels to traverse),
    and main memory can be many hundreds of clocks away.

    So the cache protocol should already be long latency tolerant
    and adding some 10's of clocks shouldn't really matter.

    Also LL/SC as implemented on Alpha, MIPS, Power, ARM, RISC-V don't allow
    any other location loads or stores between them so really aren't useful
    for detecting ABA because detecting it requires monitoring two memory
    locations for change.

    The classic example is the single linked list with items head->A->B->C
    Detecting ABA requires monitoring if either head or head->Next change
    which LL/SC cannot do as reading head->Next cancels the lock on head.

    Detecting ABA requires one to monitor addresses not data values.

    It is a method for reading a pair of addresses, and knowing that
    neither of them has changed between those two steps,
    proceeding to update the first address.

    It requires monitoring a first address while reading a second address,
    and then updating the first address (releasing the monitor),
    and using any update to the first address between those three steps to
    infer there might have been a change to the second and blocking the update.

    Which none of the LL/SC guarantee you can do.

    x86 has cmpxchg8b and ARM has double wide LL/SC which can be used to
    implement CASD atomic-double-wide-compare-and-swap. The first word holds
    the head pointer and the second word holds a generation counter whose
    change is used to infer that head->Next might have changed.
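
    A minimal sketch of the generation-counter idea using C11 atomics. As a
    simplifying assumption it packs a 32-bit node index and a 32-bit
    generation count into one 64-bit word, so an ordinary 64-bit
    compare-exchange stands in for a true double-wide CAS. The names pool,
    head, pack, push and pop are hypothetical, and node reclamation is not
    addressed.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NIL  0xFFFFFFFFu   /* "null" node index */
#define POOL 8             /* arbitrary pool size for the sketch */

typedef struct {
    _Atomic uint32_t next; /* index of next node, or NIL */
    int value;
} node_t;

static node_t pool[POOL];

/* High 32 bits: generation counter. Low 32 bits: head node index. */
static _Atomic uint64_t head = NIL;

static uint64_t pack(uint32_t gen, uint32_t idx)
{
    return ((uint64_t)gen << 32) | idx;
}
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t gen_of(uint64_t h) { return (uint32_t)(h >> 32); }

static void push(uint32_t idx)
{
    uint64_t old = atomic_load(&head);
    uint64_t next_head;
    do {
        pool[idx].next = idx_of(old);
        next_head = pack(gen_of(old) + 1, idx);
    } while (!atomic_compare_exchange_weak(&head, &old, next_head));
}

/* Returns the popped node index, or NIL if the stack is empty. */
static uint32_t pop(void)
{
    uint64_t old = atomic_load(&head);
    uint64_t next_head;
    do {
        if (idx_of(old) == NIL)
            return NIL;
        /* Even if this next value is stale and the same index has been
         * pushed back since, the generation differs and the CAS fails. */
        next_head = pack(gen_of(old) + 1, pool[idx_of(old)].next);
    } while (!atomic_compare_exchange_weak(&head, &old, next_head));
    return idx_of(old);
}
```

    Every successful update bumps the generation, so a head that went
    A -> B -> A no longer compares equal to the value a suspended thread
    read earlier. With only 32 bits the counter can wrap, which is one
    reason real implementations prefer a full-width second word.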

  • From MitchAlsup1@21:1/5 to EricP on Mon Jan 22 19:19:55 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:
    Just to be clear an interrupt occurring within the hardware
    implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
    should not affect the outcome of the CAS. Actually, it should not
    happen at all, right? CAS does not have any spurious failures.

    ABA failure happens BECAUSE one uses the value of data to decide if
    something appeared ATOMIC. The CAS instruction (itself and all variants)
    is ATOMIC, but the setup to CAS is non-ATOMIC, because the original value
    to be compared was fetched without any ATOMIC indicator, and someone else
    can alter it before CAS. If more than 1 thread alters the location,
    it can (seldom) end up with the same data value as the suspended thread
    thought it should be.

    CAS is ATOMIC, the code leading to CAS was not and this opens up the
    hole.

    Note:: CAS functionality implemented with LL/SC does not suffer ABA
    because the core monitors the LL address until the SC is performed.
    It is an address-based comparison, not a data-value-based one.

    Yes but an equal point of view is that LL/SC only emulates atomic and
    uses the cache line ownership grab while "locked" to detect possible
    interference and infer potential change.

    Which, BTW, opens up a different side channel ...

    How so? The location has to be inside the same virtual space.

    Anything that changes the amount of time something takes opens up a
    side channel. Whether data can flow through the channel is a different
    story. Holding a line changes the bounds on the time taken.

    Note that if LL/SC is implemented with temporary line pinning
    (as might be done to guarantee forward progress and prevent ping-pong)
    then it cannot be interfered with, and CAS and atomic-fetch-op sequences
    are semantically identical to the equivalent single instructions
    (which may also be implemented with temporary line pinning if their
    data must move from cache through the core and back).

    Line pinning requires a NAK in the coherence protocol. As far as I know,
    only My 66000 interconnect protocol has such a NaK.

    Not necessarily, provided it is time limited (few tens of clocks).

    That I will grant.

    Also I suspect the worst case latency for moving a line ownership
    could be quite large (lots of queues and cache levels to traverse),
    and main memory can be many hundreds of clocks away.

    Figure 100-cycles as a loaded system average.

    So the cache protocol should already be long latency tolerant
    and adding some 10's of clocks shouldn't really matter.

    But does 100+cycles ??

    Also LL/SC as implemented on Alpha, MIPS, Power, ARM, RISC-V don't allow
    any other location loads or stores between them so really aren't useful
    for detecting ABA because detecting it requires monitoring two memory
    locations for change.

    The classic example is the single linked list with items head->A->B->C
    Detecting ABA requires monitoring if either head or head->Next change
    which LL/SC cannot do as reading head->Next cancels the lock on head.

    Detecting ABA requires one to monitor addresses not data values.

    It is a method for reading a pair of addresses, and knowing that
    neither of them has changed between those two steps,
    proceeding to update the first address.

    It requires monitoring a first address while reading a second address,
    and then updating the first address (releasing the monitor),
    and using any update to the first address between those three steps to
    infer there might have been a change to the second and blocking the update.

    Which none of the LL/SC guarantee you can do.

    Right, LL/SC is a single container synchronization model.

    x86 has cmpxchg8b and ARM has double wide LL/SC which can be used to
    implement CASD atomic-double-wide-compare-and-swap. The first word holds
    the head pointer and the second word holds a generation counter whose
    change is used to infer that head->Next might have changed.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Jan 22 19:15:49 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:


    Much like with MMU:
    Only the base level needs to actually handle TLB miss events, and
    everything else (nested translation, etc), can be left to software
    emulation.

    Name a single ISA that fakes the TLB ?? (and has an MMU)

    MIPS?

    Even R2000 has a TLB, it is a SW serviced TLB, but the "zero overhead
    on hit" part is present.

    Presumably, each core gets its own ISR stack, which should not have any
    reason to need to interact with each other.

    I presume an interrupt can be serviced by any number of cores.

    Or restricted to a specific set of cores (i.e. those currently
    owned by the target guest).

    Even that gets tricky when you (or the OS) virtualizes cores.

    The guest OS will generally specify the target virtual core (or set of cores)

    Yes, set of cores.

    for a specific interrupt. The Hypervisor and/or hardware needs
    to deal with the case where the interrupt arrives while the target
    guest core isn't currently scheduled on a physical core (and poke
    the kernel to schedule the guest optionally). Such as recording
    the pending interrupt and optionally notifying the hypervisor that
    there is a pending guest interrupt so it can schedule the guest
    core(s) on physical cores to handle the interrupt.

    That is the routing I was talking about.

    I presume that there are a vast number of devices. Each device assigned
    to a few GuestOSs.

    Or, with SR-IOV, virtual functions are assigned to specific guests
    and all interrupts are MSI-X messages from the device to the
    interrupt controller (LAPIC, GIC, etc).

    In my case, the interrupt controller merely sets bits in the interrupt
    table, the watching cores watch for changes to its pending interrupt
    register (64-bits). Said messages come up from PCIe as MSI-X messages,
    and are directed to the interrupt controller over in the Memory Controller (L3).

    Dealing with inter-processor interrupts in a multicore guest can also
    be tricky;

    Core sends MSI-X message to interrupt controller and the rest happens
    no different than a device interrupt.

    either trapped by the hypervisor or there must be hardware
    support in the interrupt controller to notify the hypervisor that a
    pending guest IPI interrupt has arrived. ARM started with the former
    behavior, but added a mechanism to handle direct injection of
    interprocessor interrupts by the guest, without hypervisor intervention
    (assuming the guest core is currently scheduled on a physical core,
    otherwise the hypervisor gets notified that there is a pending interrupt
    for a non-scheduled guest core).

    I presume the core that services the interrupt (ISR) is running the same
    GuestOS under the same HyperVisor that initiated the device.

    Generally a safe assumption. Note that the guest core may not be
    resident on any physical core when the guest interrupt arrives.

    Which is why its table has to be present at all times--even if the threads
    are not. When one or more threads from that GuestOS are activated, the
    pending interrupt will be serviced.

    I presume the core that services the interrupt was of the lowest priority
    of all the cores then running that GuestOS.
    I presume the core that services the interrupt wasted no time in doing so.

    And the GuestOS decides on how its ISR stack is {formatted, allocated, used,
    serviced, ...} which can be different for each GuestOS.

    To a certain extent, the format of the ISR stack is hardware defined,
    and the rest is completely up to the guest. ARM for example,
    saves the current PC into a system register (ELR_ELx) and switches
    the stack pointer. Everything else is up to the software interrupt
    handler to save/restore. I see little benefit in hardware doing
    any state saving other than that.



    If the interrupt occurs often enough to matter, its instructions, data,
    and translations will be in the cache hierarchy.

    Although there has been a great deal of work mitigating the
    number of interrupts (setting interrupt thresholds, RSS,
    polling (DPDK, ODP), etc)

    I don't see any advantages to all the fancy hardware interrupt
    proposals from either of you.

    I understand.

  • From MitchAlsup1@21:1/5 to BGB on Mon Jan 22 20:31:08 2024
    BGB wrote:

    On 1/22/2024 10:09 AM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:


    In my case, the use of Soft TLB is not strictly required, as the OS may
    opt-in to use a hardware page-walker "if it exists", with TLB Miss
    interrupts mostly happening if no hardware page walker exists (or if
    there is not a valid page in the page table).

    Has anyone done a SW refill TLB implementation that has both Hypervisor
    and Supervisor page <nested> translations ??

    This seems to me a bad idea as HV would end up having to manipulate
    GuestOS mappings {Because you cannot allow GuestOS to see HV mappings}.

    {{Aside:: At one time I was enamored with SW TLB refill and one could
    reduce TLB refill penalty by allocating a "big enough" secondary hashed
    TLB (1MB+). When HV + GuestOS came about, I saw the futility of it all}}


    The guest OS will generally specify the target virtual core (or set of cores)
    for a specific interrupt. The Hypervisor and/or hardware needs
    to deal with the case where the interrupt arrives while the target
    guest core isn't currently scheduled on a physical core (and poke
    the kernel to schedule the guest optionally). Such as recording
    the pending interrupt and optionally notifying the hypervisor that
    there is a pending guest interrupt so it can schedule the guest
    core(s) on physical cores to handle the interrupt.


    I am guessing maybe my assumed approach of always routing all of the
    external hardware interrupts to a specific core, is not typical then?...

    Say, only Core=0 or Core=1, will get the interrupts.

    What do you think happens when there are thousands of cores and thousands
    of disks, hundreds of Gigabit Ethernet controllers, where the number of
    interrupts per second is larger than 1 or 2 cores can manage ??

    <snip>

    So, a mechanism to swap a pair of stack-pointer registers seemed like a necessary evil.


    With a Soft-TLB, it is also basically required to fall back to physical addressing for ISR's (and with HW page-walking, if virtual-memory could
    exist in ISRs, it would likely be necessary to jump over to a different
    set of page-tables from the usermode program).

    Danger Will Robinson, Danger


    In my case, I had not been arguing for any fancy interrupt handling in hardware...

    In my case, MSI-X interrupts are routed to MC(L3) where each message sets
    up to 2 bits, one demarking the unique interrupt, and the other merging
    interrupts of a priority level into a second single bit. The setting of
    this second bit is SNOOPed by cores to decide if they should attempt to
    recognize an interrupt. Cores not associated with that interrupt table
    do not see that interrupt; but those that are do. Thus, there are no
    pre-assigned cores to service interrupts.

    The most fancy part of my interrupt mechanism, is that one can encode
    the ID of a core into the value passed to a "TRAPA", and it will
    redirect the interrupt to that specific core.


    But, this mechanism currently has the limitations of a 4-bit field, so
    going beyond ~ 15 cores is going to require a nesting scheme and
    bouncing IPI's across multiple cores.

    Danger Will Robinson, Danger !!

    Though, if needed, I could tweak the format slightly in this case, and
    maybe expand the Core-ID for IPI's to 8-bits, albeit limiting it to 16
    unique IPI interrupt types.

    I have 512 unique interrupts per priority level. There are 64 priority
    levels.
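
    One possible software model of such a table, as a rough sketch: one
    pending bit per interrupt (512 per level, 64 levels) plus a 64-bit
    summary word with one merged bit per priority level. The layout, the
    names irq_table_t, irq_post and irq_take, and the choice that a higher
    level number means higher priority are all assumptions made for
    illustration, not details of the actual design.

```c
#include <stdint.h>

#define LEVELS          64
#define IRQS_PER_LEVEL  512
#define WORDS_PER_LEVEL (IRQS_PER_LEVEL / 64)

typedef struct {
    uint64_t summary;                          /* bit L: level L has work */
    uint64_t pending[LEVELS][WORDS_PER_LEVEL]; /* one bit per interrupt */
} irq_table_t;

/* What an arriving message does: set the unique bit and the merged bit. */
static void irq_post(irq_table_t *t, unsigned level, unsigned irq)
{
    t->pending[level][irq / 64] |= UINT64_C(1) << (irq % 64);
    t->summary |= UINT64_C(1) << level;
}

/* Claim the lowest-numbered pending interrupt at the highest pending
 * level. Returns level * 512 + irq, or -1 if nothing is pending. */
static int irq_take(irq_table_t *t)
{
    for (int level = LEVELS - 1; level >= 0; level--) {
        if (!(t->summary & (UINT64_C(1) << level)))
            continue;
        for (unsigned w = 0; w < WORDS_PER_LEVEL; w++) {
            uint64_t bits = t->pending[level][w];
            if (bits == 0)
                continue;
            unsigned b = 0;
            while (!(bits & (UINT64_C(1) << b)))
                b++;
            t->pending[level][w] &= ~(UINT64_C(1) << b);

            /* Drop the merged bit once the level is empty. */
            uint64_t any = 0;
            for (unsigned k = 0; k < WORDS_PER_LEVEL; k++)
                any |= t->pending[level][k];
            if (any == 0)
                t->summary &= ~(UINT64_C(1) << level);

            return level * IRQS_PER_LEVEL + (int)(w * 64 + b);
        }
    }
    return -1;
}
```

    A core only has to watch the single summary word; the per-interrupt bits
    are consulted after a level is known to have something pending. In
    hardware the claim would of course need to be atomic across cores.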

  • From Scott Lurndal@21:1/5 to [email protected] on Mon Jan 22 20:40:38 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:


    Or restricted to a specific set of cores (i.e. those currently
    owned by the target guest).

    Even that gets tricky when you (or the OS) virtualizes cores.

    Oh, indeed. It's helpful to have good hardware support. The
    ARM GIC, for example, helps eliminate hypervisor interaction
    during normal guest interrupt handling (aside from scheduling the
    guest on a host core).



    In my case, the interrupt controller merely sets bits in the interrupt
    table, the watching cores watch for changes to its pending interrupt
    register (64-bits). Said messages come up from PCIe as MSI-X messages,

    The interrupt space for MSI-X messages is 32-bits. Implementations
    may support fewer than 2**32 interrupts - ours support 2**24 distinct
    interrupt vectors.

    and are directed to the interrupt controller over in the Memory Controller
    (L3).

    Dealing with inter-processor interrupts in a multicore guest can also
    be tricky;

    Core sends MSI-X message to interrupt controller and the rest happens
    no different than a device interrupt.

    Not necessarily, particularly if the guest isn't resident on any
    core at the time the interrupt is received.



    I presume the core that services the interrupt (ISR) is running the same
    GuestOS under the same HyperVisor that initiated the device.

    Generally a safe assumption. Note that the guest core may not be
    resident on any physical core when the guest interrupt arrives.

    Which is why its table has to be present at all times--even if the threads
    are not. When one or more threads from that GuestOS are activated, the
    pending interrupt will be serviced.

    Yes, but the hypervisor needs to be notified by the hardware when the table
    is updated and the target guest VCPU isn't currently scheduled
    on any core so that it can decide to schedule the guest (which may,
    for instance, have been parked because it executed a WFI, PAUSE
    or MWAIT instruction).

  • From Scott Lurndal@21:1/5 to BGB on Mon Jan 22 22:15:29 2024
    BGB <[email protected]> writes:
    On 1/22/2024 10:09 AM, Scott Lurndal wrote:

    I am guessing maybe my assumed approach of always routing all of the
    external hardware interrupts to a specific core, is not typical then?...

    Say, only Core=0 or Core=1, will get the interrupts.

    Maybe on a microcontroller :-).

    On a desktop or server system (particularly the latter), the kernel
    may distribute interrupts however it likes. Network card RSS
    (Receive Side Scaling) requires being able to distribute interrupts
    over a set of (or all) cores. Any time you make a core "special"
    all kinds of new usage constraints arise (not to mention reduced
    fault tolerance).




    Trying to route actual HW interrupts into virtual guest OS's seems like
    a pain.

    Check out the ARM GICv3/v4 implementation to see how it does
    this. It has evolved over time to where you see it now. Originally,
    they only provided a set of CPU system registers to the hypervisor
    that allowed the hypervisor to inject interrupts into the guest. The
    hypervisor handled all interrupts itself, then queued them
    (in a set of one or more List Registers) to the guest. When the
    hypervisor dispatched the guest on the core, it would get
    an interrupt and read the same interrupt ack register that
    the hypervisor uses but the hardware would, for the guest
    access, access one of the list registers and announce that
    interrupt to the guest. The guest would end the interrupt
    just like a bare-metal os by writing the interrupt number
    to an interrupt END system register, which would drop
    the running interrupt priority (for nested interrupts).
    If the interrupt was level sensitive, unmasked and the
    highest priority pending interrupt, the guest would
    get another interrupt (wash, rinse, repeat).

    Lots of trips (even low cost on AArch64) between exception
    levels.

    So, they've added a capability (only for message signaled
    interrupts) to deliver the MSI interrupt directly to the
    guest - if the target guest core isn't resident, the hardware
    will ring a doorbell for the hypervisor. Once the HV makes
    the guest resident on the CPU, it will take any pending
    interrupts recorded for that virtual CPU, in order of
    interrupt priority.
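
    The record-and-doorbell behaviour can be modelled very roughly as below.
    This is not the GIC programming interface; vcpu_t, post_virq and
    take_virq are hypothetical names, the sketch handles only 32 virtual
    interrupts, and treating a lower bit number as higher priority is just
    a convention chosen here.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     resident;  /* vCPU currently scheduled on a physical core */
    uint32_t pending;   /* bit n set: virtual interrupt n is recorded */
} vcpu_t;

/* Record an interrupt for the vCPU (virq must be below 32). Returns
 * true when the hypervisor's doorbell has to be rung because the
 * target is not resident. */
static bool post_virq(vcpu_t *v, unsigned virq)
{
    v->pending |= UINT32_C(1) << virq;
    return !v->resident;
}

/* Called on behalf of the guest once it is running: returns the next
 * recorded interrupt in priority order, or -1 if there is none or the
 * vCPU is not resident. */
static int take_virq(vcpu_t *v)
{
    if (!v->resident)
        return -1;
    for (int n = 0; n < 32; n++) {
        uint32_t bit = UINT32_C(1) << n;
        if (v->pending & bit) {
            v->pending &= ~bit;
            return n;
        }
    }
    return -1;
}
```

    The hypervisor is involved only when post_virq reports that a doorbell
    is needed; once the guest is resident, delivery proceeds without it.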

    The final enhancement (GICV4.1) adds the ability to issue
    virtual inter-processor interrupts between virtual CPU's
    without hypervisor intervention (other than making the
    guest vcpus resident on real cores).


    As I see it, the main limiting factor for interrupt performance is not
    the instructions to save and restore the registers, but rather the L1
    misses that result from doing so.

    If the interrupt is happening at a rate where the L1
    cache miss is significant, then the device probably needs to be
    redesigned to reduce the number of interrupts (e.g.
    interrupt coalescing), use DMA, or do more work per interrupt,
    or poll the completion status from the driver rather
    than waiting for the interrupt.
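
    As an illustration of interrupt coalescing, a device-side counter might
    look like the sketch below: an interrupt is raised only after a batch of
    completions, or after a completion has waited too long. The names
    coalescer_t, coalesce_completion and coalesce_tick, and the threshold and
    timeout parameters, are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    unsigned threshold;  /* completions needed to raise an interrupt */
    unsigned timeout;    /* max ticks a queued completion may wait */
    unsigned queued;     /* completions not yet signalled */
    unsigned waited;     /* ticks since the oldest queued completion */
} coalescer_t;

/* Called per completed request. Returns true if an interrupt fires. */
static bool coalesce_completion(coalescer_t *c)
{
    c->queued++;
    if (c->queued >= c->threshold) {
        c->queued = 0;
        c->waited = 0;
        return true;
    }
    return false;
}

/* Called per timer tick. Returns true if the timeout forces an
 * interrupt for completions that never filled a batch. */
static bool coalesce_tick(coalescer_t *c)
{
    if (c->queued == 0)
        return false;
    if (++c->waited >= c->timeout) {
        c->queued = 0;
        c->waited = 0;
        return true;
    }
    return false;
}
```

    With a threshold of 4, ten back-to-back completions produce two
    interrupts instead of ten, and the timeout bounds the added latency for
    the two left over.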

  • From EricP@21:1/5 to All on Mon Jan 22 18:01:54 2024
    MitchAlsup1 wrote:
    BGB wrote:

    On 1/22/2024 10:09 AM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:


    In my case, the use of Soft TLB is not strictly required, as the OS
    may opt-in to use a hardware page-walker "if it exists", with TLB Miss
    interrupts mostly happening if no hardware page walker exists (or if
    there is not a valid page in the page table).

    Has anyone done a SW refill TLB implementation that has both Hypervisor
    and Supervisor page <nested> translations ??

    This seems to me a bad idea as HV would end up having to manipulate
    GuestOS mappings {Because you cannot allow GuestOS to see HV mappings}.

    I actually pondered something like this to eliminate the two-level table
    walk in virtual machines. I was thinking that the HV might propagate its
    PTE entries into the GuestOS PTE entries, then mark them (somehow)
    so they trap to the HV if GuestOS tries to look at them.
    But it got complicated and never really went anywhere.

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    {{Aside:: At one time I was enamored with SW TLB refill and one could
    reduce TLB refill penalty by allocating a "big enough" secondary hashed
    TLB (1MB+). When HV + GuestOS came about, I saw the futility of it all}}

    I also wondered if a hashed/inverted page table could help here.
    But that also went nowhere. The separate bottom-up walkers looked best.

  • From MitchAlsup1@21:1/5 to EricP on Tue Jan 23 00:54:54 2024
    EricP wrote:

    MitchAlsup1 wrote:
    BGB wrote:

    On 1/22/2024 10:09 AM, Scott Lurndal wrote:
    MitchAlsup1 writes:


    In my case, the use of Soft TLB is not strictly required, as the OS
    may opt-in to use a hardware page-walker "if it exists", with TLB Miss
    interrupts mostly happening if no hardware page walker exists (or if
    there is not a valid page in the page table).

    Has anyone done a SW refill TLB implementation that has both Hypervisor
    and Supervisor page <nested> translations ??

    This seems to me a bad idea as HV would end up having to manipulate
    GuestOS mappings {Because you cannot allow GuestOS to see HV mappings}.

    I actually pondered something like this to eliminate the two-level table
    walk in virtual machines. I was thinking that the HV might propagate its
    PTE entries into the GuestOS PTE entries, then mark them (somehow)
    so they trap to the HV if GuestOS tries to look at them.
    But it got complicated and never really went anywhere.

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A Density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).
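
    To put rough numbers on the saving, here is a small Python sketch that
    counts memory reads for one nested translation. The function name and the
    "cached levels" parameters are assumptions for illustration. The 25 in
    the post matches 5 x 5 host-table reads; this accounting also counts the
    guest PTE reads and the final host walk, which gives 35 for two 5-level
    tables.

```python
def nested_walk_reads(guest_levels, host_levels, guest_cached=0, host_cached=0):
    """Count memory reads for one nested (two-dimensional) translation.

    guest_cached / host_cached are the number of outer levels the
    accelerator lets the walker skip in each table (0 = nothing cached).
    This is a simplified model, not a description of any real MMU.
    """
    host_walk = host_levels - host_cached      # reads per GuestPA -> RealPA walk
    guest_steps = guest_levels - guest_cached  # guest PTEs still to be read
    # Each remaining guest PTE lives at a guest-physical address, so it costs
    # one host walk plus the read of the guest PTE itself. The final
    # guest-physical address then needs one more host walk.
    return guest_steps * (host_walk + 1) + host_walk


if __name__ == "__main__":
    print(nested_walk_reads(5, 5))                                 # no caching
    print(nested_walk_reads(5, 5, guest_cached=4, host_cached=4))  # outer levels cached
```

    Under these assumptions, caching the four outer levels of both tables
    cuts the walk from 35 reads to 3.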

    {{Aside:: At one time I was enamored with SW TLB refill and one could
    reduce TLB refill penalty by allocating a "big enough" secondary hashed
    TLB (1MB+). When HV + GuestOS came about, I saw the futility of it all}}

    I also wondered if a hashed/inverted page table could help here.
    But that also went nowhere. The separate bottom-up walkers looked best.

    Best I could do was two tables, one mapping application to GuestPA, the
    other mapping GuestPA to RealPA. If the former missed, GuestOS fixed its
    own table, if the latter, HV fixed its own table.
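
    The two-table split can be sketched in Python as follows. The flat
    dictionaries, the 4 KiB page size and the exception names are assumptions
    for illustration; the point is only that each kind of miss goes to the
    owner of the table that missed.

```python
PAGE = 4096  # assumed page size


class GuestFault(Exception):
    """Miss in the application -> GuestPA table: GuestOS fixes its own table."""


class HostFault(Exception):
    """Miss in the GuestPA -> RealPA table: HV fixes its own table."""


def translate(vaddr, guest_table, host_table):
    """Translate an application address through both tables.

    guest_table maps virtual page number -> guest-physical page number.
    host_table maps guest-physical page number -> real page number.
    """
    vpn, offset = divmod(vaddr, PAGE)
    if vpn not in guest_table:
        raise GuestFault(vpn)   # GuestOS never sees HV mappings
    gpn = guest_table[vpn]
    if gpn not in host_table:
        raise HostFault(gpn)    # HV never edits GuestOS mappings
    return host_table[gpn] * PAGE + offset
```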

  • From EricP@21:1/5 to All on Tue Jan 23 14:33:56 2024
    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A Density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).

    An idea I had here was to allow the OS more explicit control
    for the invalidates of the interior nodes caches.

    On x86/x64 the interior cache invalidation had to be backwards compatible,
    so the INVLPG instruction has to guess what besides the main TLB needs to
    be invalidated, and it has to do so in a conservative (i.e. paranoid) manner.
    So it tosses these interior PTEs just in case, which means they
    have to be reloaded on the next TLB miss.

    The OS knows which paging levels it is recycling memory for and
    can provide a finer grain control for these TLB invalidates.
    The INVLPG and INVPCID instructions need a control bit mask allowing OS
    to invalidate just the TLB levels it is changing for a virtual address.
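
    A Python sketch of what such a level mask could look like. This models
    the proposal only: no existing INVLPG or INVPCID form takes a mask like
    this, and the 4-level layout and all names here are hypothetical.

```python
LEVELS = 4          # assumed: level 0 = leaf TLB entry, 1..3 = interior nodes
BITS_PER_LEVEL = 9
PAGE_SHIFT = 12


def tag(vaddr, level):
    """Address bits that identify the entry cached for `level`."""
    return vaddr >> (PAGE_SHIFT + BITS_PER_LEVEL * level)


class TranslationCaches:
    """Leaf TLB plus one interior-node cache per paging level (as tag sets)."""

    def __init__(self):
        self.levels = [set() for _ in range(LEVELS)]

    def fill(self, vaddr):
        """Record that every level of the walk for `vaddr` is now cached."""
        for level in range(LEVELS):
            self.levels[level].add(tag(vaddr, level))

    def invalidate(self, vaddr, level_mask):
        """Drop only the levels whose bit is set in `level_mask`."""
        for level in range(LEVELS):
            if level_mask & (1 << level):
                self.levels[level].discard(tag(vaddr, level))

    def cached_levels(self, vaddr):
        """List the levels that still hold an entry covering `vaddr`."""
        return [lv for lv in range(LEVELS) if tag(vaddr, lv) in self.levels[lv]]
```

    An OS that only changed a leaf PTE would pass a mask of 0b0001 and keep
    the interior entries, so the next miss on that address needs one PTE read.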

    And for OS debugging purposes, all these HW TLB tables need to be readable
    and writable by some means (as control registers or whatever).
    Because when something craps out, what's in memory may not be the same
    as what was loaded into HW some time ago. A debugger should be able to
    look into and manipulate these HW structures.

  • From MitchAlsup1@21:1/5 to EricP on Tue Jan 23 21:09:45 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A Density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).

    An idea I had here was to allow the OS more explicit control
    for the invalidates of the interior nodes caches.

    The interior nodes, stored in the CAM, retain their physical address, and
    are snooped, so no invalidation is required. ANY write to them is seen and
    the entry invalidates itself.
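
    A Python sketch of the self-invalidating behaviour. The class, the
    64-byte line size and the dictionary standing in for the CAM are
    assumptions for illustration, not details of My 66000.

```python
LINE = 64  # assumed snoop granularity in bytes


class SnoopedNodeCache:
    """Interior-node cache whose entries remember where they were read from."""

    def __init__(self):
        self.entries = {}  # tag -> (physical address of the PTE, PTE value)

    def fill(self, tag, pte_pa, pte_value):
        self.entries[tag] = (pte_pa, pte_value)

    def lookup(self, tag):
        entry = self.entries.get(tag)
        return None if entry is None else entry[1]

    def snoop_write(self, written_pa):
        """Called for every observed write; drops entries on the same line.

        Returns the number of entries that invalidated themselves.
        """
        stale = [t for t, (pa, _) in self.entries.items()
                 if pa // LINE == written_pa // LINE]
        for t in stale:
            del self.entries[t]
        return len(stale)
```

    Because the write itself removes the stale entry, software in this model
    never issues a separate invalidate for interior nodes.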

    On x86/x64 the interior cache invalidation had to be backwards compatible,
    so the INVLPG instruction has to guess what besides the main TLB needs to
    be invalidated, and it has to do so in a conservative (i.e. paranoid) manner.
    So it tosses these interior PTEs just in case, which means they
    have to be reloaded on the next TLB miss.

    The OS knows which paging levels it is recycling memory for and
    can provide a finer grain control for these TLB invalidates.
    The INVLPG and INVPCID instructions need a control bit mask allowing OS
    to invalidate just the TLB levels it is changing for a virtual address.

    OS or HV does not need to bother in My 66000.

    And for OS debugging purposes, all these HW TLB tables need to be readable
    and writable by some means (as control registers or whatever).
    Because when something craps out, what's in memory may not be the same
    as what was loaded into HW some time ago. A debugger should be able to
    look into and manipulate these HW structures.

    All control registers, including the TLB CAMs, are accessible via MMI/O
    accesses. So a remote core can decide what a crashed core was doing at
    the instant of the crash.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jan 25 17:01:24 2024
    Scott Lurndal wrote:

    MitchAlsup1 writes:
    Scott Lurndal wrote:


    Or restricted to a specific set of cores (i.e. those currently
    owned by the target guest).

    Even that gets tricky when you (or the OS) virtualizes cores.

    Oh, indeed. It's helpful to have good hardware support. The
    ARM GIC, for example, helps eliminate hypervisor interaction
    during normal guest interrupt handling (aside from scheduling the
    guest on a host core).



    In my case, the interrupt controller merely sets bits in the interrupt
    table, the watching cores watch for changes to its pending interrupt
    register (64-bits). Said messages come up from PCIe as MSI-X messages,

    The interrupt space for MSI-X messages is 32-bits. Implementations
    may support fewer than 2**32 interrupts - ours support 2**24 distinct
    interrupt vectors.

    My 66000 supports 2^16 tables of 2^15 distinct interrupts (non vectored)
    per table.

    and are directed to the interrupt controller over in the Memory Controller
    (L3).

    Dealing with inter-processor interrupts in a multicore guest can also
    be tricky;

    Core sends MSI-X message to interrupt controller and the rest happens
    no different than a device interrupt.

    Not necessarily, particularly if the guest isn't resident on any
    core at the time the interrupt is received.

    When an interrupt is registered (recognized as raised and enabled)
    and the receiving GuestOS is not running on any core, the interrupt
    remains pending until some core context switches to that GuestOS.


    I presume the core that services the interrupt (ISR) is running the same
    GuestOS under the same HyperVisor that initiated the device.

    Generally a safe assumption. Note that the guest core may not be
    resident on any physical core when the guest interrupt arrives.

    Which is why its table has to be present at all times--even if the threads
    are not. When one or more threads from that GuestOS are activated, the
    pending interrupt will be serviced.

    Yes, but the hypervisor needs to be notified by the hardware when the
    table is updated and the target guest VCPU isn't currently scheduled
    on any core so that it can decide to schedule the guest (which may,
    for instance, have been parked because it executed a WFI, PAUSE
    or MWAIT instruction).

  • From Scott Lurndal@21:1/5 to [email protected] on Thu Jan 25 17:48:00 2024
    MitchAlsup1 writes:

    Not necessarily, particularly if the guest isn't resident on any
    core at the time the interrupt is received.

    When an interrupt is registered (recognized as raised and enabled)
    and the receiving GuestOS is not running on any core, the interrupt
    remains pending until some core context switches to that GuestOS.

    It is useful to notify the hypervisor of that, so that it can
    schedule the guest.
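
    The two behaviours discussed here, an interrupt staying pending in the
    guest's table and the hypervisor being told when the guest is not
    resident, can be sketched together in Python. The classes, the 64-bit
    pending mask per table and the notification list are assumptions for
    illustration, not the mechanism of My 66000 or the ARM GIC.

```python
class GuestInterruptTable:
    """Per-guest table that stays in memory even when the guest is parked."""

    def __init__(self):
        self.pending = 0       # assumed 64-bit pending mask
        self.resident = False  # is any thread of this guest on a core?


class InterruptController:
    def __init__(self):
        self.tables = {}
        self.hv_notifications = []  # guests the hypervisor may want to schedule

    def add_guest(self, guest_id):
        self.tables[guest_id] = GuestInterruptTable()

    def post(self, guest_id, bit):
        """Record an interrupt; tell the hypervisor if the guest is parked."""
        table = self.tables[guest_id]
        table.pending |= 1 << bit
        if not table.resident:
            self.hv_notifications.append(guest_id)

    def schedule(self, guest_id):
        """A core switches to the guest and picks up whatever was pending."""
        table = self.tables[guest_id]
        table.resident = True
        delivered, table.pending = table.pending, 0
        return delivered
```

    Without the notification step, a guest parked on WFI or MWAIT would hold
    its pending bit until something else caused it to be scheduled.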


    Yes, but the hypervisor needs to be notified by the hardware when the
    table is updated and the target guest VCPU isn't currently scheduled
    on any core so that it can decide to schedule the guest (which may,
    for instance, have been parked because it executed a WFI, PAUSE
    or MWAIT instruction).
