When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
The Concertina II has a potentially large machine state which most
programs do not use. There are vector registers, of the huge
kind found in the Cray I. There are banks of 128 registers to
supplement the banks of 32 registers.
One obvious step in addressing this is for programs that don't
use these registers to run without access to those registers.
If this is indicated in the PSW, then the interrupt routine will
know what it needs to save and restore.
A more elaborate and more automated method is also possible.
Let us imagine the computer speeds up interrupts by having a
second bank of registers that interrupt routines use. But two
register banks aren't enough, as many user programs are running
concurrently.
Here is how I envisage the sequence of events in response to
an interrupt could work:
1) The computer, at the beginning of an area of memory
sufficient to hold all the contents of the computer's
registers, including the PSW and program counter, places
a _restore status_.
Quadibloc wrote:
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
It needs to save the portion of the machine state overwritten by that interrupt. Often this is a small subset of the whole state because many interrupts only need a few integer registers, maybe 6 or 7.
Additional integer state will be saved and restored as needed by the call ABI so does not need to be done for every interrupt by the handler prologue.
This allows the OS to only save and restore the full state on thread
switch which only happens after something significant occurs that
changes the highest priority thread on a particular core.
This occurs much less than the frequency of interrupts.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
The Concertina II has a potentially large machine state which most
programs do not use. There are vector registers, of the huge
kind found in the Cray I. There are banks of 128 registers to
supplement the banks of 32 registers.
I have trouble imagining what an interrupt handler might use vectors for.
Some OS's deal with this by specifying that drivers can only use integers. (Graphics drivers get special dispensation but they run in special context.)
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
The Concertina II has a potentially large machine state which most
programs do not use. There are vector registers, of the huge
kind found in the Cray I. There are banks of 128 registers to
supplement the banks of 32 registers.
One obvious step in addressing this is for programs that don't
use these registers to run without access to those registers.
If this is indicated in the PSW, then the interrupt routine will
know what it needs to save and restore.
A more elaborate and more automated method is also possible.
Let us imagine the computer speeds up interrupts by having a
second bank of registers that interrupt routines use. But two
register banks aren't enough, as many user programs are running
concurrently.
Here is how I envisage the sequence of events in response to
an interrupt could work:
1) The computer, at the beginning of an area of memory
sufficient to hold all the contents of the computer's
registers, including the PSW and program counter, places
a _restore status_.
2) The computer switches to the interrupt register bank,
and places a pointer to the restore status in one of the
registers, according to a known convention.
3) As the interrupt service routine runs, the computer,
separately and in the background, saves the registers of the
interrupted program into memory. Once this is complete, the
_restore status_ value in memory is changed to reflect this.
4) The restore status value has _two_ uses.
One is, obviously, that when returning from an interrupt,
there will be a 'return from interrupt' routine that will
either just switch register banks, if the registers aren't
saved yet, or re-fill the registers that are actually in
use (the register status also indicating what the complement
of registers available to the interrupted program was) from
memory.
The other is that the restore status can be tested. If the
main register set isn't saved yet, then it's too soon after
the interrupt to *switch to another user program* which also
would use the main register set, but with a different set
of saved values.
5) Some other factors complicate this.
There may be multiple sets of user program registers to
facilitate SMT.
The standard practice in an operating system is to leave
the privileged interrupt service routine as quickly as
possible, and continue handling the interrupt in an
unprivileged portion of the operating system.
However, the
return from interrupt instruction is obviously privileged,
as it allows one to put an arbitrary value from memory into
the Program Status Word, including one that would place
the computer into a privileged state after the return.
That last is not unique to the Concertina II, however. So
the obvious solution of allowing the kernel to call
unprivileged subroutines - which terminate in a supervisor
call rather than a normal return - has been found.
John Savard
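The restore-status sequence in steps 1 to 4 above can be sketched as a small state machine. This is only an illustration under stated assumptions: the post does not define an encoding for the restore status or a layout for the save area, so the names `restore_status`, `save_area`, `reg_complement` and the two-value encoding are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding of the "restore status" word. */
enum restore_status {
    RS_SAVE_PENDING  = 0, /* background save still running; main bank is live */
    RS_SAVE_COMPLETE = 1  /* interrupted program's registers are all in memory */
};

/* Hypothetical save area: the status word sits at the start (step 1). */
struct save_area {
    volatile uint32_t status;
    uint32_t reg_complement; /* which register sets the program may use */
    uint64_t psw;
    uint64_t pc;
    uint64_t regs[32];
};

enum return_action { SWITCH_BANK_ONLY, RELOAD_FROM_MEMORY };

/* Step 1: at interrupt entry the status is placed before anything is saved. */
void interrupt_entry(struct save_area *a, uint32_t complement) {
    assert(a != NULL);
    a->reg_complement = complement;
    a->status = RS_SAVE_PENDING;
}

/* Step 3: the background save has finished, so the status is updated. */
void background_save_done(struct save_area *a) {
    assert(a != NULL);
    a->status = RS_SAVE_COMPLETE;
}

/* Step 4, first use: decide what 'return from interrupt' has to do. */
enum return_action return_from_interrupt(const struct save_area *a) {
    assert(a != NULL);
    return a->status == RS_SAVE_COMPLETE ? RELOAD_FROM_MEMORY
                                         : SWITCH_BANK_ONLY;
}

/* Step 4, second use: switching to another user program is only safe
   once the main register set has been saved. */
bool may_switch_user_program(const struct save_area *a) {
    assert(a != NULL);
    return a->status == RS_SAVE_COMPLETE;
}
```

In this sketch a scheduler would poll `may_switch_user_program()` before dispatching a different program, which is the "too soon after the interrupt" test described in step 4.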
EricP wrote:
Quadibloc wrote:
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
It needs to save the portion of the machine state overwritten by that
interrupt. Often this is a small subset of the whole state because many
interrupts only need a few integer registers, maybe 6 or 7.
Would an interrupt handler not run fewer instructions if its register
state was seeded with pointers of use to the device(s) being serviced ??
Additional integer state will be saved and restored as needed by the call ABI
so does not need to be done for every interrupt by the handler prologue.
This allows the OS to only save and restore the full state on thread
switch which only happens after something significant occurs that
changes the highest priority thread on a particular core.
Doesn't running a SoftIRQ (or DPC) require a full register state ??
And don't most device interrupts need to SoftIRQ ??
I have trouble imagining what an interrupt handler might use vectors for.
Memory to memory move from Disk Cache to User Buffer.
SoftIRQ might use Vector arithmetic to verify CRC, Encryption, ...
On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
Quadibloc wrote:[...]
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
State needs to be saved, whether SW or HW does the save is a free variable.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
Also note:: the ABA problem can happen when an interrupt transpires
in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
before transferring control to the interrupt handler.
Just to be clear an interrupt occurring within the hardware
implementation of a CAS operation (e.g., lock cmpxchg over on Intel)
should not affect the outcome of the CAS. Actually, it should not happen
at all, right? CAS does not have any spurious failures.
EricP wrote:
Quadibloc wrote:
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
It needs to save the portion of the machine state overwritten by that
interrupt. Often this is a small subset of the whole state because many
interrupts only need a few integer registers, maybe 6 or 7.
Would an interrupt handler not run fewer instructions if its register
state was seeded with pointers of use to the device(s) being serviced ??
Additional integer state will be saved and restored as needed by the
call ABI
so does not need to be done for every interrupt by the handler prologue.
This allows the OS to only save and restore the full state on thread
switch which only happens after something significant occurs that
changes the highest priority thread on a particular core.
Doesn't running a SoftIRQ (or DPC) require a full register state ??
And don't most device interrupts need to SoftIRQ ??
{{Yes, I see timer interrupts not needing so much of that}}
This occurs much less than the frequency of interrupts.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
The Concertina II has a potentially large machine state which most
programs do not use. There are vector registers, of the huge
kind found in the Cray I. There are banks of 128 registers to
supplement the banks of 32 registers.
I have trouble imagining what an interrupt handler might use vectors for.
Memory to memory move from Disk Cache to User Buffer.
SoftIRQ might use Vector arithmetic to verify CRC, Encryption, ...
Some OS's deal with this by specifying that drivers can only use
integers.
(Graphics drivers get special dispensation but they run in special
context.)
On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
Quadibloc wrote:
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
It needs to save the portion of the machine state overwritten by that
interrupt. Often this is a small subset of the whole state because many
interrupts only need a few integer registers, maybe 6 or 7.
Additional integer state will be saved and restored as needed by the call ABI
so does not need to be done for every interrupt by the handler prologue.
Yes, that is true in many cases.
I have trouble imagining what an interrupt handler might use vectors for.
However, you're apparently forgetting one very important case.
What if the interrupt is a *real-time clock* interrupt, and what is going
to happen is that the computer will _not_ return immediately from that interrupt to the interrupted program, but instead, regarding it as compute-bound, proceed to a different program possibly even belonging
to another user?
So you're quite correct that the problem does not _always_ arise. But it
does arise on occasion.
John Savard
Chris M. Thomasson wrote:
On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
Quadibloc wrote:[...]
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
State needs to be saved, whether SW or HW does the save is a free
variable.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
Also note:: the ABA problem can happen when an interrupt transpires
in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
before transferring control to the interrupt handler.
Just to be clear an interrupt occurring within the hardware
implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
should not affect the outcome of the CAS. Actually, it should not
happen at all, right? CAS does not have any spurious failures.
ABA failure happens BECAUSE one uses the value of data to decide if
something appeared ATOMIC. The CAS instruction (itself and all variants)
is ATOMIC, but the setup to CAS is non-ATOMIC, because the original value
to be compared was fetched without any ATOMIC indicator, and someone else
can alter it before CAS. If more than 1 thread alters the location, it
can (seldom) end up with the same data value as the suspended thread
thought it should be.
CAS is ATOMIC, the code leading to CAS was not and this opens up the hole.
Note:: CAS functionality implemented with LL/SC does not suffer ABA
because the core monitors the LL address until the SC is performed.
It is an address-based comparison, not a data-value-based one.
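The hole described above can be reproduced deterministically. The sketch below is an illustration only: it uses C11 `<stdatomic.h>` and plays out the two threads' steps in a fixed order on one thread, with a hypothetical `struct node` stack of A->B->C. The value-based CAS succeeds even though the list changed underneath it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node { struct node *next; };

/* Returns true if the CAS succeeded although the list was modified
   between the plain reads and the CAS (the ABA case). */
bool aba_demo(void) {
    struct node c = { NULL }, b = { &c }, a = { &b };
    _Atomic(struct node *) head = &a;            /* head -> A -> B -> C */

    /* "Thread 1" begins a pop: these reads carry no atomic guarantee. */
    struct node *old_head = atomic_load(&head);  /* A */
    struct node *new_head = old_head->next;      /* B */
    assert(old_head == &a && new_head == &b);

    /* Thread 1 is suspended here. "Thread 2" pops A, pops B, pushes A. */
    atomic_store(&head, &b);                     /* pop A */
    atomic_store(&head, &c);                     /* pop B */
    a.next = &c;
    atomic_store(&head, &a);                     /* push A: head -> A -> C */

    /* Thread 1 resumes: head holds the same VALUE (A), so CAS succeeds. */
    struct node *expected = old_head;
    bool ok = atomic_compare_exchange_strong(&head, &expected, new_head);

    /* head now points at B, which was already removed from the list. */
    return ok && atomic_load(&head) == &b;
}
```

The CAS itself behaved atomically; the stale `new_head` computed before it is what corrupts the list.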
Quadibloc wrote:
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
It needs to save the portion of the machine state overwritten by that interrupt. Often this is a small subset of the whole state because many interrupts only need a few integer registers, maybe 6 or 7.
Additional integer state will be saved and restored as needed by the call ABI so does not need to be done for every interrupt by the handler prologue.
There also can be many restrictions on what an ISR is allowed to do
because the OS designers did not want to, say, force every ISR to
sync with the slow x87 FPU just in case someone wanted to use it.
I would not assume that anything other than integer registers would be available in an ISR.
In a post processing DPC/SoftIrq routine it might be possible but again
there can be limitations. What you don't ever want to happen is to hang
the cpu waiting to sync with a piece of hardware so you can save its state, as might happen if it was a co-processor. You also don't want to have to
save any state just in case a post routine might want to do something,
but rather save/restore the state on demand and just what is needed.
So it really depends on the device and the platform.
On 1/17/2024 4:01 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
Quadibloc wrote:[...]
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
State needs to be saved, whether SW or HW does the save is a free
variable.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
Also note:: the ABA problem can happen when an interrupt transpires
in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
before transferring control to the interrupt handler.
Just to be clear an interrupt occurring within the hardware
implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
should not affect the outcome of the CAS. Actually, it should not
happen at all, right? CAS does not have any spurious failures.
ABA failure happens BECAUSE one uses the value of data to decide if
something appeared ATOMIC. The CAS instruction (itself and all variants)
is ATOMIC, but the setup to CAS is non-ATOMIC, because the original value
to be compared was fetched without any ATOMIC indicator, and someone else
can alter it before CAS. If more than 1 thread alters the location, it
can (seldom) end up with the same data value as the suspended thread
thought it should be.
Yup. Fwiw, some years ago I actually tried to BURN a CAS by creating
several rogue threads that would alter the CAS target using random
numbers at full speed ahead. The interesting part is that forward
progress was damaged for sure, but still occurred. It did not live lock
on me. Interesting.
CAS is ATOMIC, the code leading to CAS was not and this opens up the hole.
Indeed.
Note:: CAS functionality implemented with LL/SC does not suffer ABA
because the core monitors the LL address until the SC is performed.
It is an address-based comparison, not a data-value-based one.
Exactly. Actually, I asked myself if I just wrote a stupid question to
you. Sorry Mitch... ;^)
MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 1/17/2024 2:11 PM, MitchAlsup1 wrote:
Quadibloc wrote:[...]
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
State needs to be saved, whether SW or HW does the save is a free
variable.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
Also note:: the ABA problem can happen when an interrupt transpires
in the middle of an ATOMIC sequence. Thus, My 66000 fails the event
before transferring control to the interrupt handler.
Just to be clear an interrupt occurring within the hardware
implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
should not affect the outcome of the CAS. Actually, it should not
happen at all, right? CAS does not have any spurious failures.
ABA failure happens BECAUSE one uses the value of data to decide if
something appeared ATOMIC. The CAS instruction (itself and all variants)
is ATOMIC, but the setup to CAS is non-ATOMIC, because the original value
to be compared was fetched without any ATOMIC indicator, and someone else
can alter it before CAS. If more than 1 thread alters the location, it
can (seldom) end up with the same data value as the suspended thread
thought it should be.
CAS is ATOMIC, the code leading to CAS was not and this opens up the hole.
Note:: CAS functionality implemented with LL/SC does not suffer ABA
because the core monitors the LL address until the SC is performed.
It is an address-based comparison, not a data-value-based one.
Yes but an equal point of view is that LL/SC only emulates atomic and
uses the cache line ownership grab while "locked" to detect possible interference and infer potential change.
Note that if LL/SC is implemented with temporary line pinning
(as might be done to guarantee forward progress and prevent ping-pong)
then it cannot be interfered with, and CAS and atomic-fetch-op sequences
are semantically identical to the equivalent single instructions
(which may also be implemented with temporary line pinning if their
data must move from cache through the core and back).
Also LL/SC as implemented on Alpha, MIPS, Power, ARM, RISC-V don't allow
any other location loads or stores between them so really aren't useful
for detecting ABA because detecting it requires monitoring two memory locations for change.
The classic example is the singly linked list with items head->A->B->C.
Detecting ABA requires monitoring if either head or head->Next change
which LL/SC cannot do as reading head->Next cancels the lock on head.
x86 has cmpxchg8b and ARM has double wide LL/SC which can be used to implement CASD atomic-double-wide-compare-and-swap. The first word holds
the head pointer and the second word holds a generation counter whose
change is used to infer that head->Next might have changed.
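The generation-counter idea can be sketched without a true double-wide CAS by packing a 32-bit head index and a 32-bit counter into one 64-bit atomic word. This is a simplified stand-in for cmpxchg8b-style CASD, not the posters' code; `pack`, `tagged_update` and the use of indices rather than pointers are assumptions made to keep the example portable C11.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Low word: head (a node index here); high word: generation counter. */
static inline uint64_t pack(uint32_t head, uint32_t gen) {
    return ((uint64_t)gen << 32) | head;
}
static inline uint32_t head_of(uint64_t v) { return (uint32_t)v; }
static inline uint32_t gen_of(uint64_t v)  { return (uint32_t)(v >> 32); }

/* Install new_head only if BOTH head and generation still match the
   snapshot; every successful update bumps the generation. */
bool tagged_update(_Atomic uint64_t *slot, uint64_t snapshot,
                   uint32_t new_head) {
    assert(slot != 0);
    uint64_t desired = pack(new_head, gen_of(snapshot) + 1);
    return atomic_compare_exchange_strong(slot, &snapshot, desired);
}

/* Replays the A -> B -> C -> A sequence: the stale update is rejected. */
bool tagged_aba_demo(void) {
    _Atomic uint64_t slot = pack(1, 0);      /* head = node 1 ("A") */
    uint64_t stale = atomic_load(&slot);     /* thread 1's snapshot */

    /* Thread 2: pop A (head=2), pop B (head=3), push A back (head=1). */
    bool t2 = tagged_update(&slot, atomic_load(&slot), 2)
           && tagged_update(&slot, atomic_load(&slot), 3)
           && tagged_update(&slot, atomic_load(&slot), 1);

    /* Thread 1: head is 1 again, but the generation is now 3, not 0. */
    bool t1 = tagged_update(&slot, stale, 2);

    uint64_t now = atomic_load(&slot);
    return t2 && !t1 && head_of(now) == 1 && gen_of(now) == 3;
}
```

A 32-bit counter can in principle wrap, so this only makes a false match very unlikely rather than impossible.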
On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
Quadibloc wrote:
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
It needs to save the portion of the machine state overwritten by that
interrupt. Often this is a small subset of the whole state because many
interrupts only need a few integer registers, maybe 6 or 7.
Additional integer state will be saved and restored as needed by the call ABI
so does not need to be done for every interrupt by the handler prologue.
Having been so concerned by the large machine state of the Concertina
II, parts of which were rarely used, and not realizing the conventional approach was entirely adequate... I missed what was the biggest flaw in
interrupts on the Concertina II.
Because in some important ways it is patterned after the IBM System/360,
it shares its biggest problem with interrupts.
On the System/360, it is a *convention* that the last few general registers, registers 11, 12, 13, 14, and 15 or so, are used as base registers. A
base register *must* be properly set up before a program can write any data to memory.
So one can't just have an interrupt behave like on an 8-bit microprocessor, saving only the program counter and the status bits, and leaving any registers to be saved by software. At least some of the general registers have to be saved, and set up with new starting values, for the interrupt routine to be able to save anything else, if need be.
Of course, the System/360 was able to solve this problem, so it's not intractable. But the fact that the System/360 solved it by saving all
sixteen general registers, and then loading them from an area in memory allocated to that interrupt type, is what fooled me into thinking I
would need to automatically save _everything_. It didn't save the floating-point registers - software did, if the need was to move to
a different user's program, and saving the state in two separate pieces
by two different parts of the OS did not cause hopeless confusion.
John Savard
On 1/20/2024 12:54 PM, Quadibloc wrote:
On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
So one can't just have an interrupt behave like on an 8-bit microprocessor,
saving only the program counter and the status bits, and leaving any
registers to be saved by software. At least some of the general registers
have to be saved, and set up with new starting values, for the interrupt
routine to be able to save anything else, if need be.
IIRC, saving off PC, some flags bits, swapping the stack registers, and
doing a computed branch relative to a control register (via bit-slicing,
*). This is effectively the interrupt mechanism I am using on a 64-bit ISA.
*: For a table that is generally one-off in the kernel or similar, it
doesn't ask much to mandate that it has a certain alignment. And if the required alignment is larger than the size of the table, you have just
saved yourself needing an adder...
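The "no adder needed" remark can be checked with a few lines of C. This is a sketch under assumed numbers: the 256-byte alignment (`TABLE_ALIGN`) and the function name are hypothetical, chosen only to show that when the table's alignment is at least its size, the base and offset bits never overlap, so a bitwise OR gives the same address as an add.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_ALIGN 0x100u /* hypothetical: table aligned to 256 bytes */

/* Form a vector address by bit-slicing: the low bits of an aligned base
   are all zero, so OR-ing in the offset cannot produce a carry. */
uint64_t vector_by_slicing(uint64_t base, uint64_t offset) {
    assert((base & (TABLE_ALIGN - 1)) == 0); /* base must be aligned  */
    assert(offset < TABLE_ALIGN);            /* offset fits in table  */
    return base | offset;                    /* same result as base + offset */
}
```

In hardware the OR is just wiring the offset bits into the low end of the address, which is where the saving comes from.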
If anything, it is a little simpler than the mechanism used on some
8-bit systems, which would have needed a mechanism to push these values
to the stack, and restore them from the stack.
Having side-channels that allow these values to "magically appear" in
certain SPRs is at least simpler, though the cost of the logic itself
could be more up for debate.
Of course, the System/360 was able to solve this problem, so it's not
intractable. But the fact that the System/360 solved it by saving all
sixteen general registers, and then loading them from an area in memory
allocated to that interrupt type, is what fooled me into thinking I
would need to automatically save _everything_. It didn't save the
floating-point registers - software did, if the need was to move to
a different user's program, and saving the state in two separate pieces
by two different parts of the OS did not cause hopeless confusion.
For ISR convenience, it would make sense to have, say, two SPR's or CR's designated for ISR use to squirrel off values from GPRs to make the prolog/epilog easier. Had considered this, but not done so (in this
case, first thing the ISR does is save a few registers to the ISR stack
to free them up to get the rest of the registers saved "more properly",
then has to do a secondary reload to get these registers back into the correct place).
Assuming interrupts aren't too common, then it isn't a huge issue, and seemingly a majority of the clock-cycles spent on interrupt entry mostly
have to do with L1 misses (since typically pretty much nothing the ISR touches is in the L1 cache; and in my implementation, ISR's may not
share L1 cache lines with non-ISR code; so basically it is an
architecturally required "L1 miss" storm).
Only real way to avoid a lot of the L1 misses would be to have multiple
sets of banked registers or similar, but, this is not cheap for the hardware... Similarly, the ISR would need to do as little, and touch as little memory, as is possible to perform its task.
John Savard
In a modern system where you have several HyperVisors and a multiplicity
of GuestOSs, a single interrupt table is unworkable looking forward.
What you want and need is every GuestOS to have its own table, and
every HyperVisor have its own table, some kind of routing mechanism to
route device interrupts to the correct table, and inform appropriate
cores of raised and enabled interrupts. All these tables have to be
concurrently available continuously and simultaneously. The old fixed
mapping will no longer work efficiently--you can make them work with
a near-Herculean amount of careful programming.
On 1/21/2024 12:58 PM, MitchAlsup1 wrote:
Detecting ABA requires one to monitor addresses not data values.
Not 100% true.
On 1/21/2024 3:18 PM, MitchAlsup1 wrote:
BGB wrote:
On 1/20/2024 12:54 PM, Quadibloc wrote:
On Wed, 17 Jan 2024 15:35:56 -0500, EricP wrote:
So one can't just have an interrupt behave like on an 8-bit
microprocessor,
saving only the program counter and the status bits, and leaving any
registers to be saved by software. At least some of the general
registers
have to be saved, and set up with new starting values, for the interrupt
routine to be able to save anything else, if need be.
IIRC, saving off PC, some flags bits, swapping the stack registers,
and doing a computed branch relative to a control register (via
bit-slicing, *). This is effectively the interrupt mechanism I am
using on a 64-bit ISA.
And sounds like the interrupt mechanism for an 8-bit µprocessor...
It was partly a simplification of the design from the SH-4, which was a 32-bit CPU mostly used in embedded systems (and in the Sega Dreamcast...).
Though, the SH-4 did bank out half the registers, which was a feature
that ended up being dropped for cost-saving reasons.
*: For a table that is generally one-off in the kernel or similar, it
doesn't ask much to mandate that it has a certain alignment. And if
the required alignment is larger than the size of the table, you have
just saved yourself needing an adder...
In a modern system where you have several HyperVisors and a multiplicity
of GuestOSs, a single interrupt table is unworkable looking forward.
What you want and need is every GuestOS to have its own table, and
every HyperVisor have its own table, some kind of routing mechanism to
route device interrupts to the correct table, and inform appropriate
cores of raised and enabled interrupts. All these tables have to be
concurrently available continuously and simultaneously. The old fixed
mapping will no longer work efficiently--you can make them work with
a near-Herculean amount of careful programming.
Or you can look at the problem from a modern viewpoint and fix the
model so the above is manifest.
Presumably, only the actual "bare metal" layer has an actual
hardware-level interrupt table, and all of the "guest" tables are faked
in software?...
Much like with MMU:
Only the base level needs to actually handle TLB miss events, and
everything else (nested translation, etc), can be left to software
emulation.
If anything, it is a little simpler than the mechanism used on some
8-bit systems, which would have needed a mechanism to push these
values to the stack, and restore them from the stack.
Do you think your mechanism would work "well" with 1024 cores in your
system ??
Number of cores should not matter that much.
Presumably, each core gets its own ISR stack, which should not have any reason to need to interact with each other.
For extra speed, maybe the ISR stacks could be mapped to some sort of core-local SRAM. This hasn't been done yet though.
Idea here being probably the SRAM region could have a special address
range, and any access to this region would be invisible to any other
cores (and it need not need have backing in external RAM).
One could maybe debate the cost of giving each core 4K or 8K of
dedicated local SRAM though merely for "slightly faster interrupt handling".
Having side-channels that allow these values to "magically appear" in
certain SPRs is at least simpler, though the cost of the logic itself
could be more up for debate.
Once you have an FMAC FPU none of the interrupt logic adds any area
to any core.
I don't have conventional FMA because it would have had too much cost
and latency.
For ISR convenience, it would make sense to have, say, two SPR's or
CR's designated for ISR use to squirrel off values from GPRs to make
the prolog/epilog easier. Had considered this, but not done so (in
this case, first thing the ISR does is save a few registers to the ISR
stack to free them up to get the rest of the registers saved "more
properly", then has to do a secondary reload to get these registers
back into the correct place).
Or, you could take the point of view that your architecture makes context
switching easy (like 1 instruction from application 1 to application 2)
and when you do this the rest of the model pretty much drops in for free.
This would cost well more, on the hardware side, than having two non-specialized CRs and being like "ISR's are allowed to stomp these at
will, nothing else may use them".
Some other tasks can be handled with the microsecond timer and a loop.
Say:
//void DelayUsec(int usec);
DelayUsec:
CPUID 30
ADD R4, R0, R6
.L0:
CPUID 30
CMPQGT R0, R6
BT .L0
RTS
Which would create a certain amount of delay (in microseconds) relative
to when the function is called.
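For readers unfamiliar with that ISA, the same spin-until-deadline idea might look like this in portable C. It is only a rough equivalent, not a translation of the routine above: C11 `timespec_get` stands in for the microsecond counter read (an assumption), and it reads wall-clock time, so a real implementation would want a monotonic hardware counter instead.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the microsecond counter read; uses the C11 wall clock. */
static uint64_t now_usec(void) {
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);
    assert(base == TIME_UTC); /* timespec_get returns its base on success */
    (void)base;
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)(ts.tv_nsec / 1000);
}

/* Busy-wait until at least `usec` microseconds have elapsed. */
void delay_usec(uint32_t usec) {
    uint64_t deadline = now_usec() + usec; /* computed once, on entry */
    while (deadline > now_usec()) {
        /* spin: re-read the counter until the deadline has passed */
    }
}
```

As in the original, the deadline is computed once on entry and the loop only re-reads the counter and compares.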
Quadibloc writes:
When a computer receives an interrupt signal, it needs to save
the complete machine state, so that upon return from the
interrupt, the program thus interrupted is in no way affected.
This is because interrupts can happen at any time, and thus
programs don't prepare for them or expect them. Any disturbance
to the contents of any register would risk causing programs to
crash.
Something needs to preserve state, either the hardware or
the software. Most RISC processors lean towards the latter,
generally for good reason - one may not need to save
all the state if the interrupt handler only touches part of it.
The Concertina II has a potentially large machine state which most
programs do not use. There are vector registers, of the huge
kind found in the Cray I. There are banks of 128 registers to
supplement the banks of 32 registers.
One obvious step in addressing this is for programs that don't
use these registers to run without access to those registers.
If this is indicated in the PSW, then the interrupt routine will
know what it needs to save and restore.
Just like x86 floating point.
A more elaborate and more automated method is also possible.
Let us imagine the computer speeds up interrupts by having a
second bank of registers that interrupt routines use. But two
register banks aren't enough, as many user programs are running
concurrently.
Here is how I envisage the sequence of events in response to
an interrupt could work:
1) The computer, at the beginning of an area of memory
sufficient to hold all the contents of the computer's
registers, including the PSW and program counter, places
a _restore status_.
Slow DRAM or special SRAMs? The former will add
considerable latency to an interrupt, the latter costs
area (on a per-hardware-thread basis) and floorplanning
issues.
Best is to save the minimal amount of state in hardware
and let software deal with the rest, perhaps with
hints from the hardware (e.g. a bit that indicates
whether the FPRs were modified since the last context
switch, etc).
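A minimal sketch of that "hint from the hardware" idea, in C. Everything here is hypothetical: the register counts, the struct layouts, and the fpr_dirty flag (standing in for a hardware bit that is set on any FPR write) are assumptions for illustration, not any real architecture's save area.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_GPR 32   /* assumed register counts */
#define NUM_FPR 32

/* Hypothetical live machine state. */
struct cpu_state {
    uint64_t gpr[NUM_GPR];
    double   fpr[NUM_FPR];
    bool     fpr_dirty;   /* models a hardware bit: FPRs written since last switch */
};

/* Hypothetical per-thread save area. */
struct saved_context {
    uint64_t gpr[NUM_GPR];
    double   fpr[NUM_FPR];
    bool     fpr_valid;   /* fpr[] holds this thread's FP state */
};

/* Save only what is needed; returns the number of bytes copied. */
size_t context_save(struct cpu_state *cpu, struct saved_context *ctx)
{
    size_t copied = sizeof cpu->gpr;
    memcpy(ctx->gpr, cpu->gpr, sizeof cpu->gpr);
    if (cpu->fpr_dirty) {              /* skip the FPRs if untouched */
        memcpy(ctx->fpr, cpu->fpr, sizeof cpu->fpr);
        ctx->fpr_valid = true;
        cpu->fpr_dirty = false;
        copied += sizeof cpu->fpr;
    }
    return copied;
}

/* Reload a thread; FPRs are reloaded only if the thread ever used them. */
void context_restore(struct cpu_state *cpu, const struct saved_context *ctx)
{
    memcpy(cpu->gpr, ctx->gpr, sizeof cpu->gpr);
    if (ctx->fpr_valid)
        memcpy(cpu->fpr, ctx->fpr, sizeof cpu->fpr);
    cpu->fpr_dirty = false;            /* clean again until the next FPR write */
}
```

A thread that never touches the FPRs costs only the GPR copy on each switch, which is the saving the hint is meant to buy.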
EricP wrote:
There also can be many restrictions on what an ISR is allowed to do
because the OS designers did not want to, say, force every ISR to
sync with the slow x87 FPU just in case someone wanted to use it.
What about all the architectures that are not x86 and do not need to sync
to FP, Vectors, SIMD, ..... ?? Why are they constrained by the one badly
designed long life architecture ??
I would not assume that anything other than integer registers would be
available in an ISR.
This is quite reasonable: as long as you have a sufficient number that
the ISR can be written in some HLL without a bunch of flags to the
compiler.
In a post processing DPC/SoftIrq routine it might be possible but again
there can be limitations. What you don't ever want to happen is to hang
the cpu waiting to sync with a piece of hardware so you can save its
state, as might happen if it was a co-processor. You also don't want to have to
save any state just in case a post routine might want to do something,
but rather save/restore the state on demand and just what is needed.
So it really depends on the device and the platform.
As long as there is no more than one flag to clue the compiler in,
I am on board.
BGB wrote:
Much like with MMU:
Only the base level needs to actually handle TLB miss events, and
everything else (nested translation, etc), can be left to software
emulation.
Name a single ISA that fakes the TLB ?? (and has an MMU)
Presumably, each core gets its own ISR stack, which should not have any
reason to need to interact with each other.
I presume an interrupt can be serviced by any number of cores.
I presume that there are a vast number of devices. Each device assigned
to a few GuestOSs.
I presume the core that services the interrupt (ISR) is running the same
GuestOS under the same HyperVisor that initiated the device.
I presume the core that services the interrupt was of the lowest priority
of all the cores then running that GuestOS.
I presume the core that services the interrupt wasted no time in doing so.
And the GuestOS decides on how its ISR stack is {formatted, allocated, used,
serviced, ...} which can be different for each GuestOS.
If the interrupt occurs often enough to matter, its instructions, data,
and translations will be in the cache hierarchy.
EricP wrote:
MitchAlsup1 wrote:
Chris M. Thomasson wrote:
Just to be clear an interrupt occurring within the hardware
implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
should not affect the outcome of the CAS. Actually, it should not
happen at all, right? CAS does not have any spurious failures.
ABA failure happens BECAUSE one uses the value of data to decide if
something appeared ATOMIC. The CAS instruction (itself and all variants)
is ATOMIC; the setup to CAS is non-ATOMIC, because the original value
to be compared was fetched without any ATOMIC indicator, and someone else
can alter it before CAS. If more than 1 thread alters the location,
it can (seldom) end up with the same data value as the suspended thread
thought it should be.
CAS is ATOMIC, the code leading to CAS was not and this opens up the
hole.
Note:: CAS functionality implemented with LL/SC does not suffer ABA
because the core monitors the LL address until the SC is performed.
It is an address-based comparison, not a data-value-based one.
Yes but an equal point of view is that LL/SC only emulates atomic and
uses the cache line ownership grab while "locked" to detect possible
interference and infer potential change.
Which, BTW, opens up a different side channel ...
Note that if LL/SC is implemented with temporary line pinning
(as might be done to guarantee forward progress and prevent ping-pong)
then it cannot be interfered with, and CAS and atomic-fetch-op sequences
are semantically identical to the equivalent single instructions
(which may also be implemented with temporary line pinning if their
data must move from cache through the core and back).
Line pinning requires a NAK in the coherence protocol. As far as I know,
only My 66000 interconnect protocol has such a NAK.
Also LL/SC as implemented on Alpha, MIPS, Power, ARM, RISC-V don't allow
any other location loads or stores between them so really aren't useful
for detecting ABA because detecting it requires monitoring two memory
locations for change.
The classic example is the single linked list with items head->A->B->C
Detecting ABA requires monitoring if either head or head->Next change
which LL/SC cannot do as reading head->Next cancels the lock on head.
Detecting ABA requires one to monitor addresses not data values.
x86 has cmpxchg8b and ARM has double wide LL/SC which can be used to
implement CASD atomic-double-wide-compare-and-swap. The first word holds
the head pointer and the second word holds a generation counter whose
change is used to infer that head->Next might have changed.
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
Chris M. Thomasson wrote:
Just to be clear an interrupt occurring within the hardware
implementation of a CAS operation (e.g, lock cmpxchg over on Intel)
should not affect the outcome of the CAS. Actually, it should not
happen at all, right? CAS does not have any spurious failures.
ABA failure happens BECAUSE one uses the value of data to decide if
something appeared ATOMIC. The CAS instruction (itself and all variants)
is ATOMIC; the setup to CAS is non-ATOMIC, because the original value
to be compared was fetched without any ATOMIC indicator, and someone else
can alter it before CAS. If more than 1 thread alters the location,
it can (seldom) end up with the same data value as the suspended thread
thought it should be.
CAS is ATOMIC, the code leading to CAS was not and this opens up the
hole.
Note:: CAS functionality implemented with LL/SC does not suffer ABA
because the core monitors the LL address until the SC is performed.
It is an address-based comparison, not a data-value-based one.
Yes but an equal point of view is that LL/SC only emulates atomic and
uses the cache line ownership grab while "locked" to detect possible
interference and infer potential change.
Which, BTW, opens up a different side channel ...
How so? The location has to be inside the same virtual space.
Note that if LL/SC is implemented with temporary line pinning
(as might be done to guarantee forward progress and prevent ping-pong)
then it cannot be interfered with, and CAS and atomic-fetch-op sequences
are semantically identical to the equivalent single instructions
(which may also be implemented with temporary line pinning if their
data must move from cache through the core and back).
Line pinning requires a NAK in the coherence protocol. As far as I know,
only My 66000 interconnect protocol has such a NAK.
Not necessarily, provided it is time limited (few tens of clocks).
Also I suspect the worst case latency for moving a line ownership
could be quite large (lots of queues and cache levels to traverse),
and main memory can be many hundreds of clocks away.
So the cache protocol should already be long latency tolerant
and adding some 10's of clocks shouldn't really matter.
Also LL/SC as implemented on Alpha, MIPS, Power, ARM, RISC-V don't allow
any other location loads or stores between them so really aren't useful
for detecting ABA because detecting it requires monitoring two memory
locations for change.
The classic example is the single linked list with items head->A->B->C
Detecting ABA requires monitoring if either head or head->Next change
which LL/SC cannot do as reading head->Next cancels the lock on head.
Detecting ABA requires one to monitor addresses not data values.
It is a method for reading a pair of addresses, and knowing that
neither of them has changed between those two steps,
proceeding to update the first address.
It requires monitoring a first address while reading a second address,
and then updating the first address (releasing the monitor),
and using any update to the first address between those three steps to
infer there might have been a change to the second and blocking the update.
Which none of the LL/SC guarantee you can do.
x86 has cmpxchg8b and ARM has double wide LL/SC which can be used to
implement CASD atomic-double-wide-compare-and-swap. The first word holds
the head pointer and the second word holds a generation counter whose
change is used to infer that head->Next might have changed.
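The generation-counter scheme can be sketched with C11 atomics. This is an illustration under assumptions, not anyone's production code: instead of a true double-wide CAS on a pointer pair, it packs a 32-bit node index and a 32-bit generation count into one 64-bit word so an ordinary 64-bit compare-exchange covers both. The pool, NIL, and all function names are made up here.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NIL 0xFFFFFFFFu      /* "null" index (assumption) */
#define POOL_SIZE 16

struct node {
    _Atomic uint32_t next;   /* index of the next node, or NIL */
    int value;
};

static struct node pool[POOL_SIZE];

/* High 32 bits: generation counter. Low 32 bits: index of the top node. */
static _Atomic uint64_t head = NIL;

static uint64_t pack(uint32_t gen, uint32_t idx) { return ((uint64_t)gen << 32) | idx; }
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t gen_of(uint64_t h) { return (uint32_t)(h >> 32); }

/* Push node `idx`; every successful update bumps the generation. */
void push(uint32_t idx)
{
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        atomic_store(&pool[idx].next, idx_of(old));
        desired = pack(gen_of(old) + 1, idx);
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
}

/* Pop the top node; returns its index, or NIL if the stack is empty. */
uint32_t pop(void)
{
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        uint32_t idx = idx_of(old);
        if (idx == NIL)
            return NIL;
        /* This read of head->Next may be stale; the generation in `old`
         * makes the CAS below fail if head changed in the meantime. */
        desired = pack(gen_of(old) + 1, atomic_load(&pool[idx].next));
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
    return idx_of(old);
}
```

A pop, pop, push sequence by another thread can leave the same index at the head, but the generation will have moved on, so a CAS built from the old snapshot fails. The counter is only 32 bits here, so in principle it can wrap; a wider counter makes that less likely.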
MitchAlsup1 writes:
BGB wrote:
Much like with MMU:
Only the base level needs to actually handle TLB miss events, and
everything else (nested translation, etc), can be left to software
emulation.
Name a single ISA that fakes the TLB ?? (and has an MMU)
MIPS?
Presumably, each core gets its own ISR stack, which should not have any
reason to need to interact with each other.
I presume an interrupt can be serviced by any number of cores.
Or restricted to a specific set of cores (i.e. those currently
owned by the target guest).
The guest OS will generally specify the target virtual core (or set of cores)
for a specific interrupt. The Hypervisor and/or hardware needs
to deal with the case where the interrupt arrives while the target
guest core isn't currently scheduled on a physical core (and poke
the kernel to schedule the guest optionally). Such as recording
the pending interrupt and optionally notifying the hypervisor that
there is a pending guest interrupt so it can schedule the guest
core(s) on physical cores to handle the interrupt.
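The record-then-notify behavior described above might look roughly like this in C. It is a single-threaded model under assumptions: the struct vcpu fields, the 64-vector pending word, and the function names are all hypothetical, and a real implementation would need atomic updates and actual hardware or hypervisor hooks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-guest-core bookkeeping. */
struct vcpu {
    uint64_t pending;         /* one bit per interrupt vector 0..63 */
    bool     resident;        /* currently scheduled on a physical core */
    bool     wake_requested;  /* hypervisor has been asked to schedule it */
};

/* Record an interrupt. Returns true if the guest core can take it now;
 * otherwise the interrupt stays pending and the hypervisor is notified. */
bool post_interrupt(struct vcpu *v, unsigned vector)
{
    v->pending |= UINT64_C(1) << (vector & 63);
    if (v->resident)
        return true;
    v->wake_requested = true;   /* stand-in for the notification */
    return false;
}

/* The hypervisor places the guest core on a physical core. */
void schedule_in(struct vcpu *v)
{
    v->resident = true;
    v->wake_requested = false;
}

/* Take the lowest-numbered pending vector, or -1 if none is pending. */
int take_interrupt(struct vcpu *v)
{
    if (v->pending == 0)
        return -1;
    int vec = 0;
    while (!((v->pending >> vec) & 1))
        vec++;
    v->pending &= ~(UINT64_C(1) << vec);
    return vec;
}
```

The point of the split is that nothing is lost while the guest core is off-core: the bit stays set, and the only extra work is the wake request.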
I presume that there are a vast number of devices. Each device assigned
to a few GuestOSs.
Or, with SR-IOV, virtual functions are assigned to specific guests
and all interrupts are MSI-X messages from the device to the
interrupt controller (LAPIC, GIC, etc).
Dealing with inter-processor interrupts in a multicore guest can also
be tricky;
either they are trapped by the hypervisor, or there must be hardware
support in the interrupt controller to notify the hypervisor that a
pending guest IPI interrupt has arrived. ARM started with the former
behavior, but added a mechanism to handle direct injection of
interprocessor interrupts by the guest, without hypervisor intervention
(assuming the guest core is currently scheduled on a physical core;
otherwise the hypervisor gets notified that there is a pending interrupt
for a non-scheduled guest core).
I presume the core that services the interrupt (ISR) is running the same
GuestOS under the same HyperVisor that initiated the device.
Generally a safe assumption. Note that the guest core may not be
resident on any physical core when the guest interrupt arrives.
I presume the core that services the interrupt was of the lowest priority
of all the cores then running that GuestOS.
I presume the core that services the interrupt wasted no time in doing so.
And the GuestOS decides on how its ISR stack is {formatted, allocated, used,
serviced, ...} which can be different for each GuestOS.
To a certain extent, the format of the ISR stack is hardware defined,
and the rest is completely up to the guest. ARM for example,
saves the current PC into a system register (ELR_ELx) and switches
the stack pointer. Everything else is up to the software interrupt
handler to save/restore. I see little benefit in hardware doing
any state saving other than that.
If the interrupt occurs often enough to matter, its instructions, data,
and translations will be in the cache hierarchy.
Although there has been a great deal of work mitigating the
number of interrupts (setting interrupt thresholds, RSS,
polling (DPDK, ODP), etc)
I don't see any advantages to all the fancy hardware interrupt
proposals from either of you.
On 1/22/2024 10:09 AM, Scott Lurndal wrote:
MitchAlsup1 writes:
In my case, the use of Soft TLB is not strictly required, as the OS
may opt-in to use a hardware page-walker "if it exists", with TLB Miss
interrupts mostly happening if no hardware page walker exists (or if
there is not a valid page in the page table).
The guest OS will generally specify the target virtual core (or set of cores)
for a specific interrupt. The Hypervisor and/or hardware needs
to deal with the case where the interrupt arrives while the target
guest core isn't currently scheduled on a physical core (and poke
the kernel to schedule the guest optionally). Such as recording
the pending interrupt and optionally notifying the hypervisor that
there is a pending guest interrupt so it can schedule the guest
core(s) on physical cores to handle the interrupt.
I am guessing maybe my assumed approach of always routing all of the
external hardware interrupts to a specific core, is not typical then?...
Say, only Core=0 or Core=1, will get the interrupts.
So, a mechanism to swap a pair of stack-pointer registers seemed like a necessary evil.
With a Soft-TLB, it is also basically required to fall back to physical addressing for ISR's (and with HW page-walking, if virtual-memory could
exist in ISRs, it would likely be necessary to jump over to a different
set of page-tables from the usermode program).
In my case, I had not been arguing for any fancy interrupt handling in hardware...
The most fancy part of my interrupt mechanism, is that one can encode
the ID of a core into the value passed to a "TRAPA", and it will
redirect the interrupt to that specific core.
But, this mechanism currently has the limitations of a 4-bit field, so
going beyond ~ 15 cores is going to require a nesting scheme and
bouncing IPI's across multiple cores.
Though, if needed, I could tweak the format slightly in this case, and
maybe expand the Core-ID for IPI's to 8-bits, albeit limiting it to 16
unique IPI interrupt types.
Scott Lurndal wrote:
Or restricted to a specific set of cores (i.e. those currently
owned by the target guest).
Even that gets tricky when you (or the OS) virtualizes cores.
In my case, the interrupt controller merely sets bits in the interrupt
table, the watching cores watch for changes to its pending interrupt
register (64-bits). Said messages come up from PCIe as MSI-X messages,
and are directed to the interrupt controller over in the Memory Controller
(L3).
Dealing with inter-processor interrupts in a multicore guest can also
be tricky;
Core sends MSI-X message to interrupt controller and the rest happens
no different than a device interrupt.
I presume the core that services the interrupt (ISR) is running the same
GuestOS under the same HyperVisor that initiated the device.
Generally a safe assumption. Note that the guest core may not be
resident on any physical core when the guest interrupt arrives.
Which is why its table has to be present at all times--even if the threads
are not. When one or more threads from that GuestOS are activated, the
pending interrupt will be serviced.
On 1/22/2024 10:09 AM, Scott Lurndal wrote:
I am guessing maybe my assumed approach of always routing all of the
external hardware interrupts to a specific core, is not typical then?...
Say, only Core=0 or Core=1, will get the interrupts.
Trying to route actual HW interrupts into virtual guest OS's seems like
a pain.
As I see it, the main limiting factor for interrupt performance is not
the instructions to save and restore the registers, but rather the L1
misses that result from doing so.
BGB wrote:
On 1/22/2024 10:09 AM, Scott Lurndal wrote:
MitchAlsup1 writes:
In my case, the use of Soft TLB is not strictly required, as the OS
may opt-in to use a hardware page-walker "if it exists", with TLB Miss
interrupts mostly happening if no hardware page walker exists (or if
there is not a valid page in the page table).
Has anyone done a SW refill TLB implementation that has both Hypervisor
and Supervisor page <nested> translations ??
This seems to me a bad idea as HV would end up having to manipulate
GuestOS mappings {Because you cannot allow GuestOS to see HV mappings}.
{{Aside:: At one time I was enamored with SW TLB refill and one could
reduce TLB refill penalty by allocating a "big enough" secondary hashed
TLB (1MB+). When HV + GuestOS came about, I saw the futility of it all}}
MitchAlsup1 wrote:
BGB wrote:
On 1/22/2024 10:09 AM, Scott Lurndal wrote:
MitchAlsup1 writes:
In my case, the use of Soft TLB is not strictly required, as the OS
may opt-in to use a hardware page-walker "if it exists", with TLB Miss
interrupts mostly happening if no hardware page walker exists (or if
there is not a valid page in the page table).
Has anyone done a SW refill TLB implementation that has both Hypervisor
and Supervisor page <nested> translations ??
This seems to me a bad idea as HV would end up having to manipulate
GuestOS mappings {Because you cannot allow GuestOS to see HV mappings}.
I actually pondered something like this to eliminate the two-level table
walk in virtual machines. I was thinking that the HV might propagate its
PTE entries into the GuestOS PTE entries, then mark them (somehow)
so they trap to the HV if GuestOS tries to look at them.
But it got complicated and never really went anywhere.
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and hold the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
{{Aside:: At one time I was enamored with SW TLB refill and one could
reduce TLB refill penalty by allocating a "big enough" secondary hashed
TLB (1MB+). When HV + GuestOS came about, I saw the futility of it all}}
I also wondered if a hashed/inverted page table could help here.
But that also went nowhere. The separate bottom-up walkers looked best.
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and hold the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so only walk the parts not in your CAM. {And
then put them in your CAM.} A Density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
MitchAlsup1 wrote:
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and hold the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so only walk the parts not in your CAM. {And
then put them in your CAM.} A Density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
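To make the saving concrete, here is a small cost model in C of a walk that starts from the deepest cached interior node. It is a sketch under assumptions: a single 5-level table with 9 index bits per level and 4K pages (not the nested GuestOS + HV case), a tiny fully associative cache with round-robin replacement, and invented names throughout. It only counts PTE reads; it does not build real page tables.

```c
#include <stdint.h>

#define LEVELS         5    /* assumed 5-level table */
#define BITS_PER_LEVEL 9    /* assumed 512 entries per node */
#define PAGE_SHIFT     12   /* assumed 4K pages */
#define CACHE_ENTRIES  16

/* One cached interior node: the node reached after `depth` levels of
 * indexing, identified by the VA bits consumed so far. */
struct walk_entry { int valid; int depth; uint64_t prefix; };

static struct walk_entry cache[CACHE_ENTRIES];
static unsigned next_victim;   /* simple round-robin replacement */

/* VA bits that select the path through the top `depth` levels. */
static uint64_t prefix_of(uint64_t va, int depth)
{
    return va >> (PAGE_SHIFT + BITS_PER_LEVEL * (LEVELS - depth));
}

static int is_cached(uint64_t va, int depth)
{
    for (int i = 0; i < CACHE_ENTRIES; i++)
        if (cache[i].valid && cache[i].depth == depth &&
            cache[i].prefix == prefix_of(va, depth))
            return 1;
    return 0;
}

static void remember(uint64_t va, int depth)
{
    struct walk_entry *e = &cache[next_victim++ % CACHE_ENTRIES];
    e->valid = 1;
    e->depth = depth;
    e->prefix = prefix_of(va, depth);
}

/* Returns how many PTE reads from memory this walk needs. */
int walk_reads(uint64_t va)
{
    int start = 0;   /* 0 means start at the root */
    for (int d = LEVELS - 1; d >= 1; d--)        /* deepest match first */
        if (is_cached(va, d)) { start = d; break; }
    for (int d = start + 1; d <= LEVELS - 1; d++)
        remember(va, d);                          /* cache nodes just walked */
    return LEVELS - start;
}
```

In this model the first walk to a region costs all 5 reads, a second page under the same leaf table costs 1, and a page under a neighboring leaf table costs 2. A nested walk would apply the same idea to each of the two tables.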
An idea I had here was to allow the OS more explicit control
for the invalidates of the interior nodes caches.
On x86/x64 the interior cache invalidation had to be backwards compatible,
so the INVLPG instruction has to guess what besides the main TLB needs to
be invalidated, and it has to do so in a conservative (i.e. paranoid) manner.
So it tosses these interior PTE's just in case which means they
have to be reloaded on the next TLB miss.
The OS knows which paging levels it is recycling memory for and
can provide a finer grain control for these TLB invalidates.
The INVLPG and INVPCID instructions need a control bit mask allowing OS
to invalidate just the TLB levels it is changing for a virtual address.
And for OS debugging purposes, all these HW TLB tables need to be readable
and writable by some means (as control registers or whatever).
Because when something craps out, what's in memory may not be the same
as what was loaded into HW some time ago. A debugger should be able to
look into and manipulate these HW structures.
MitchAlsup1 writes:
Scott Lurndal wrote:
Or restricted to a specific set of cores (i.e. those currently
owned by the target guest).
Even that gets tricky when you (or the OS) virtualizes cores.
Oh, indeed. It's helpful to have good hardware support. The
ARM GIC, for example, helps eliminate hypervisor interaction
during normal guest interrupt handling (aside from scheduling the
guest on a host core).
In my case, the interrupt controller merely sets bits in the interrupt
table, the watching cores watch for changes to its pending interrupt
register (64-bits). Said messages come up from PCIe as MSI-X messages,
The interrupt space for MSI-X messages is 32-bits. Implementations
may support fewer than 2**32 interrupts - ours support 2**24 distinct
interrupt vectors.
and are directed to the interrupt controller over in the Memory Controller
(L3).
Dealing with inter-processor interrupts in a multicore guest can also
be tricky;
Core sends MSI-X message to interrupt controller and the rest happens
no different than a device interrupt.
Not necessarily, particularly if the guest isn't resident on any
core at the time the interrupt is received.
I presume the core that services the interrupt (ISR) is running the same
GuestOS under the same HyperVisor that initiated the device.
Generally a safe assumption. Note that the guest core may not be
resident on any physical core when the guest interrupt arrives.
Which is why its table has to be present at all times--even if the threads
are not. When one or more threads from that GuestOS are activated, the
pending interrupt will be serviced.
Yes, but the hypervisor needs to be notified by the hardware when the
table is updated and the target guest VCPU isn't currently scheduled
on any core so that it can decide to schedule the guest (which may,
for instance, have been parked because it executed a WFI, PAUSE
or MWAIT instruction).
Not necessarily, particularly if the guest isn't resident on any
core at the time the interrupt is received.
When an interrupt is registered (recognized as raised and enabled)
and the receiving GuestOS is not running on any core, the interrupt
remains pending until some core context switches to that GuestOS.
Yes, but the hypervisor needs to be notified by the hardware when the
table is updated and the target guest VCPU isn't currently scheduled
on any core so that it can decide to schedule the guest (which may,
for instance, have been parked because it executed a WFI, PAUSE
or MWAIT instruction).