This may be a silly idea... but it seems to be the sort of thing that
current concerns about computer security may be calling for.
It is typical for computers to have a privileged mode of operation,
wherein I/O operations and certain special changes to the state of the
computer are allowed that are barred to normal computational tasks.
For various reasons, miscreants have not been completely foiled by the
existence of this feature.
Some types of instruction that are required for normal computation are
still, to a certain extent, potentially harmful.
So I am thinking it might be useful to have, for example, two states
less privileged than the user state, and some mechanism for user
programs to call subroutines which are in that state until they return
- the return instruction being limited, sort of like a supervisor
call, so it can only return in a proper manner.
The first reduced-privilege state would not allow any branch
instructions, particularly conditional branches.
The second, in addition, would not allow any access to memory, only
allowing access to registers.
To use these states to aid in security, more is required.
For one thing, blocks of memory would need to be able to be marked as
not only containing code or data, but as containing code that runs at
one of these reduced privilege levels.
And then comes the payoff: a block of memory could be marked as
writeable, yet containing executable code, for things like
just-in-time compilation... provided the code runs at one of
these reduced privilege levels. This prevents the generation of code
containing branches or memory accesses, as desired, while allowing the
generation of computational sequences.
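As a rough illustration of the two reduced states described above, here is a minimal Python sketch of a checker that a loader or JIT might run over generated code. The toy instruction format, the opcode names and `first_violation` are all hypothetical, invented for this sketch; the proposal itself would presumably have the hardware trap on the offending instruction rather than scan the code ahead of time.

```python
from enum import IntEnum


class Level(IntEnum):
    USER = 0        # ordinary user state: everything below is allowed
    NO_BRANCH = 1   # first reduced state: no branch instructions
    REGS_ONLY = 2   # second reduced state: no branches, no memory access


# Hypothetical toy opcodes, grouped by the capability they need.
BRANCH_OPS = {"jmp", "beq", "bne", "call"}
MEMORY_OPS = {"load", "store"}


def first_violation(code, level):
    """Return the index of the first instruction not permitted at `level`.

    `code` is a list of tuples such as ("add", "r1", "r2", "r3").
    Returns None when every instruction is permitted.
    """
    for i, (op, *_operands) in enumerate(code):
        if level >= Level.NO_BRANCH and op in BRANCH_OPS:
            return i
        if level >= Level.REGS_ONLY and op in MEMORY_OPS:
            return i
    return None
```

A block tagged `REGS_ONLY` would then accept only pure register arithmetic, which is the "computational sequences" case; anything with a `load` or a `beq` would be rejected or would trap.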
John Savard
John Savard <[email protected]d> writes:
This may be a silly idea... but it seems to be the sort of thing that
current concerns about computer security may be calling for.
It is typical for computers to have a privileged mode of operation,
wherein I/O operations and certain special changes to the state of the
computer are allowed that are barred to normal computational tasks.
For various reasons, miscreants have not been completely foiled by the
existence of this feature.
Some types of instruction that are required for normal computation are
still, to a certain extent, potentially harmful.
So I am thinking it might be useful to have, for example, two states
less privileged than the user state, and some mechanism for user
programs to call subroutines which are in that state until they return
- the return instruction being limited, sort of like a supervisor
call, so it can only return in a proper manner.
There are already more than five security rings in most
processors.
Intel: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, VMX, Enclave,
SMM
AMD: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, SVM, SMM
ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1
(Kernel), EL0 (user)
<snip description of useless feature>
On Fri, 07 Jun 2024 12:03:03 -0600, John Savard wrote:
So I am thinking it might be useful to have, for example, two
states less privileged than the user state, and some mechanism
for user programs to call subroutines which are in that state
until they return - the return instruction being limited, sort
of like a supervisor call, so it can only return in a proper
manner.
User code normally ran at ring 4. This left 5, 6 and 7 available
for ordinary users to impose their own additional isolation on code
they didn't quite trust.
That was the next-generation kitchen-sink OS from the latter 1960s that
was taking so long to develop that Bell Labs pulled out of the project and
set about creating their own, much less ambitious OS instead, which they
initially called “UNICS” (to indicate it was the opposite of “MULTICS”).
Scott Lurndal wrote:
There are already more than five security rings in most
processors.
Intel: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, VMX, Enclave,
SMM
AMD: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, SVM, SMM
ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1
(Kernel), EL0 (user)
VAX had 4 modes, User, Supervisor, Executive, Kernel.
VMS used Super for debugger and the command language DCL,
Exec was mostly for the file system.
Kernel was for the core of the OS.
What they found was that not only did they not need 4 levels,
it was a pointless overhead to have to constantly switch between them.
(There is a pretty high penalty to switching modes, copying in args,
validating args, doing something usually simple, then switching back,
when it is all the OS's code anyway.)
I don't know what privileges Unix on VAX used but it was
probably 2 levels because PDP-11 had only 2 levels.
Alpha had 3 levels, User, Supervisor, and a higher third mode called
PAL for Privileged Architecture Library. It was supposed to be thought
of like microcode, privileged subroutines. Then PAL mode was used to
emulate the 4 levels that VMS expected when they ported it.
(I think PAL mode was a way to patent a feature that made the
ISA impossible to copy without their permission,
and therefore someone can't take DEC's executables and run them
on a clone processor, like what happened to IBM with Amdahl.)
WinNT was written to be portable so the lowest common denominator
is 2 levels, User and Super, and everything worked just fine.
On the other hand, if Multics hadn't been so late, and so closely tied
to expensive hardware that wasn't byte addressable and was already
running out of address bits, who knows how much of its other features
might have been more widely adopted.
On 6/8/2024 11:01 AM, EricP wrote:
Scott Lurndal wrote:
John Savard <[email protected]d> writes:
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
EricP wrote:
Alpha had 3 levels, User, Supervisor, and a higher third mode called
PAL for Privileged Architecture Library. It was supposed to be thought
of like microcode, privileged subroutines. Then PAL mode was used to
emulate the 4 levels that VMS expected when they ported it.
PAL was microcode in <fast> ROM in the native ISA.
(I think PAL mode was a way to patent a feature that made the
ISA impossible to copy without their permission,
and therefore someone can't take DEC's executables and run them
on a clone processor, like what happened to IBM with Amdahl.)
Worked real well for them !!
The Motorola 680x0 family was, I think, properly virtualizable in this
sense. Or maybe the 68020 and 68030 were, but the 68040 was not. I think
the Motorola engineers working on the ’040 asked if any customers were
interested in preserving the self-virtualization feature, and nobody
seemed to care.
On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
“Virtualization” was bandied about in the 1980s more as an idle,
theoretical concept rather than a practical one.
Are Supervisor Calls "branches" since they go to controlled entry
points??
How are you going to perform elementary functions {SIN, COS, EXP, LOG}?
A C compiler is an application running in a different process. Why
is a JIT "not like that" ??
The proper answer to hardware bugs is not adding software limitations,
nor software mitigations (what the hardware makers suggest), but to
fix the hardware.
On Sun, 09 Jun 2024 16:52:45 GMT, [email protected]
(Anton Ertl) wrote:
The proper answer to hardware bugs is not adding software limitations,
nor software mitigations (what the hardware makers suggest), but to
fix the hardware.
In the case of Spectre, fixing the hardware has a cost in performance.
So allowing the processor to run code with out-of-order execution
turned off for that code is a way to limit the performance loss to the untrusted code.
And this would work well on my Concertina II architecture, where VLIW features, such as the break bit, and extended register banks of 128
registers each, are present. Code can be generated that avoids
register hazards when run in order.
John Savard
On Fri, 07 Jun 2024 12:03:03 -0600, John Savard <[email protected]d> wrote:
The first reduced-privilege state would not allow any branch
instructions, particularly conditional branches.
The second, in addition, would not allow any access to memory, only
allowing access to registers.
Maybe I haven't made clear what this is _for_ as I thought it would be obvious.
If no branches... then no need for retpolines and stuff.
If no access to memory... no worries about rowhammer.
Given that, a third mode - not reduced-privilege so much as reduced-efficiency - suggests itself.
Cause some code to be executed... without any speculative execution;
allow branches, but don't execute anything until where the branch goes
is fully resolved.
This deals with Spectre and friends.
So the idea is to give an unprivileged user application, like a web
browser, a capability, without going through the operating system, to
run code that is sandboxed in appropriate ways to prevent it from
causing trouble although it is untrusted.
That browsers have to be able to run untrusted JavaScript (and,
formerly, even Java and Flash, which have now been discarded) to
support the flexibility desired for modern web sites... has been the
basic reason why computers today are insecure. If the only code that
ran on computers was trusted code, then the virus situation would be
like it was back in the days of 8-bit computers; except for
supply-chain attacks, just don't run pirated software, and you're
pretty much safe.
John Savard
It also uses less specials than I expected; e.g., on the EV45 the IMB (instruction-memory barrier) PAL call is implemented by just executing
a big chunk of code such that the previous contents of the I-cache are evicted, while I expected that it would set a bit in a model-specific register.
Lawrence D'Oliveiro <[email protected]d> writes:
On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
“Virtualization” was bandied about in the 1980s more as an idle,
theoretical concept rather than a practical one.
I'm quite sure that IBM would disagree with this statement.
OTOH: A 32-bit value in seconds will overflow in 2038, so isn't really sufficient at this point.
A C compiler doesn't save data in memory that can then be executed. It
writes to a file.
Though, there are some instructions which are currently allowed in user
mode but which it could make sense to trap in some contexts, such as
CPUID, or potentially just parts of CPUID, ...
Say, for example, CPUID has several pieces of information available:
CPU type and features;
Microsecond timer (local);
Clock cycle timer;
Hardware RNG;
...
In various contexts, it may be reasonable to want to trap and emulate
some of these while still allowing others to be unhindered.
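One way to picture "trap some, pass others through" is a per-group trap mask consulted on every CPUID read. The Python below is only a toy model under that assumption: the group names follow the list above, while `CpuidLeaf`, `read_cpuid` and the two callbacks are hypothetical names made up for the sketch, not anything from an actual ISA or kernel.

```python
from enum import Flag, auto


class CpuidLeaf(Flag):
    """Hypothetical groups of CPUID information, following the list above."""
    FEATURES = auto()
    US_TIMER = auto()
    CYCLE_TIMER = auto()
    HW_RNG = auto()


NO_TRAPS = CpuidLeaf(0)  # empty mask: every read goes straight to hardware


def read_cpuid(leaf, trap_mask, hardware, emulate):
    """Dispatch one CPUID read.

    Groups present in `trap_mask` are routed to the `emulate` handler
    (standing in for a trap into the supervisor); all others are served
    by `hardware` unhindered.
    """
    if leaf in trap_mask:
        return emulate(leaf)
    return hardware(leaf)
```

With a mask of `CpuidLeaf.US_TIMER | CpuidLeaf.CYCLE_TIMER`, for instance, a sandboxed task could still query CPU features directly while its view of time is virtualized.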
Though, the time returned by the CPUID microsecond timer is not
currently the same as the one given by "TK_GetTimeUS()", where the
latter effectively gives a 64-bit value (conceptually) representing the number of microseconds since 1/1/1970; though with the kernel currently assuming that its build-time is the starting time for the clock (and
none of the FPGA boards support a hardware clock, and one would need internet access to use NTP, ...).
A 64-bit value in microseconds can express around +/- 300k years, which should be plenty.
A 64-bit value expressed in seconds could express values relative to the current age of the universe, but this is likely unnecessary for most purposes, and ability to express fractions of a second is likely more
useful than the ability to express the age of the universe.
Granted, one could use a 128-bit value, and have both (and in
picoseconds if they wanted). But, this would be overkill.
Or, go extra overkill, and use 256 bits, to express the current age of
the universe in Planck units...
Modern Unix typically provides 64-bit time_t seconds and an (effectively)
30-bit ns field, so you can store them in a 96-bit container, but I don't
think anyone does that?
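The ranges quoted in this part of the thread are easy to sanity-check. The Python sketch below is just back-of-the-envelope arithmetic; the helper name `signed_range_years` and the use of a 365.25-day year are choices made for the sketch, not anything from the posts.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, fine for estimates


def signed_range_years(bits, ticks_per_second):
    """Years representable on each side of zero by a signed counter."""
    return (2 ** (bits - 1)) / ticks_per_second / SECONDS_PER_YEAR


# 64-bit signed microseconds: roughly +/- 292 thousand years.
us64_years = signed_range_years(64, 1_000_000)

# 64-bit signed seconds: far beyond the ~13.8 billion year age of the universe.
s64_years = signed_range_years(64, 1)

# 32-bit signed seconds since 1970-01-01 run out in January 2038.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_s32 = epoch + timedelta(seconds=2**31 - 1)

# A nanosecond field must hold 0..999_999_999, which takes 30 bits,
# so 64-bit seconds plus nanoseconds fit in 94 bits (96 with padding).
NS_FIELD_BITS = (10**9 - 1).bit_length()
```

So the "around +/- 300k years" figure for 64-bit microseconds is slightly generous but in the right range.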
And just as a further nitpick (if the above weren’t enough), what happens
if the “file” your C compiler is writing to is in a RAM disk?
My 66000 needs no retpolines for external calls/returns or for
SVCs and SVRs.
EricP wrote:
Scott Lurndal wrote:
There are already more than five security rings in most
processors.
Intel: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, VMX, Enclave,
SMM
AMD: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, SVM, SMM
ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1
(Kernel), EL0 (user)
VAX had 4 modes, User, Supervisor, Executive, Kernel.
VMS used Super for debugger and the command language DCL,
Exec was mostly for the file system.
Kernel was for the core of the OS.
What they found that not only do they not need 4 levels,
it was a pointless overhead to have to constantly switch between them.
(There is a pretty high penalty to switching modes, copying in args,
validating args, doing something usually simple, then switching back,
when it is all the OS's code anyway.)
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
But for similar reasons ring 1 and 2 are not used in x86 machines,
either. {{Now, if we could just go back to 1982 and not invent IDTs, and
call gates, .....}}
I don't know what privileges Unix on VAX used but it was
probably 2 levels because PDP-11 had only 2 levels.
Alpha had 3 levels, User, Supervisor, and a higher third mode called
PAL for Privileged Architecture Library. It was supposed to be thought
of like microcode, privileged subroutines. Then PAL mode was used to
emulate the 4 levels that VMS expected when they ported it.
PAL was microcode in <fast> ROM in the native ISA.
(I think PAL mode was a way to patent a feature that made the
ISA impossible to copy without their permission,
and therefore someone can't take DEC's executables and run them
on a clone processor, like what happened to IBM with Amdahl.)
Worked real well for them !!
On Sun, 09 Jun 2024 14:13:25 GMT, Scott Lurndal wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
“Virtualization” was bandied about in the 1980s more as an idle,
theoretical concept rather than a practical one.
I'm quite sure that IBM would disagree with this statement.
I’m sure they would.
On Fri, 07 Jun 2024 18:18:33 GMT, [email protected] (Scott Lurndal)
wrote:
There are already more than five security rings in most
processors.
Intel: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, VMX, Enclave, SMM
AMD: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, SVM, SMM
ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1 (Kernel), EL0 (user)
Yes, but these are multiple levels _higher_ than User, and what I was
talking about were levels *lower* than User, so I fail to see how this
indicates my idea isn't new.
Or perhaps your complaint is simply that we have too many levels
already.
But that's somebody else's fault, and doesn't bear on whether
the feature I suggest might be useful.
PAL code is stored in a writable control store that
is a separate address space from main memory
But I came to realize that none of that is actually *required*.
It doesn't *need* a third privilege mode, and actually it looks
more expensive performance wise to have one than not.
It would be simpler and cheaper to just transition directly
to and from Super mode without also going through PAL mode.
And there is NO technical reason to restrict access to HW control
register from Super mode.
Many processors automatically disable interrupts on trap because it
greatly simplifies the race conditions in their prologue and epilogue.
x86 did not disable interrupts on exceptions but x64 allows it as an option.
PAL mode does not require its own on-chip SRAM - it could exist in main
memory addressed through a base physical register or an MMU hack.
And having a dedicated private on-chip SRAM to hold critical OS code
does not mean that it is microcode. I would have this for my design
with an MMU fiddle to hard-wire a VA->PA mapping for some OS code.
After realizing it didn't need to exist, and that PAL mode looks more
expensive than just User/Super modes, I began to wonder why it was there.
Which leads me to here:
(I think PAL mode was a way to patent a feature that made the
ISA impossible to copy without their permission,
EricP wrote:
[snip]
Many processors automatically disable interrupts on trap because it
greatly simplifies the race conditions in their prologue and
epilogue. x86 did not disable interrupts on exceptions but x64
allows it as an option.
I have written a lot of x86 interrupt handlers, these chips did very
much disable all interrupts when transferring control to my handler.
The typical approach was to do the minimum work possible to save
whatever HW buffer/data needed saving, before executing a STI (SeT
Interrupt enable bit?) and then do anything else that had to be done
while still in the primary handler.
IRET restored flags, IP and CS, transferring control back to whatever
was running when the hw interrupt happened.
Terje
On Sun, 9 Jun 2024 22:48:28 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
And just as a further nitpick (if the above weren’t enough), what
happens if the “file” your C compiler is writing to is in a RAM disk?
Well, the output could be stored with no problem, because while it's
on the RAM disk, it can't be executed. It has to be copied from the
RAM disk, into memory that's not pretending to be a disk, by the
loader. So this case doesn't change anything from the case of a real
disk.
John Savard
John Savard <[email protected]d> writes:
So allowing the processor to run code with out-of-order execution
turned off for that code is a way to limit the performance loss to the
untrusted code.
Your trust in "trusted code" is unfounded.
And this would work well on my Concertina II architecture, where VLIW
features, such as the break bit, and extended register banks of 128
registers each, are present. Code can be generated that avoids
register hazards when run in order.
How do "register hazards" come into play?
But I have seen similar trains of thought several times from static
scheduling advocates. They see Spectre as the opportunity to tout
their uncompetitive solutions by advocating solutions (like disabling
speculation) that maximize the performance loss.
- anton
John Savard wrote:
On Sun, 9 Jun 2024 22:48:28 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
And just as a further nitpick (if the above weren’t enough), what
happens if the “file” your C compiler is writing to is in a RAM disk?
Well, the output could be stored with no problem, because while it's
on the RAM disk, it can't be executed. It has to be copied from the
RAM disk, into memory that's not pretending to be a disk, by the
loader. So this case doesn't change anything from the case of a real
disk.
One can create a PTE pointing at that RAM disk page and then allow
someone to execute it directly.
OR
One can copy it somewhere that has execute permission in a single
instruction (MM = memory to memory move)
Neither is any real burden to enabling execute.
John Savard <[email protected]d> writes:
In the case of Spectre, fixing the hardware has a cost in performance.
How do you know?
Papers on so-called "invisible speculation" schemes have reported
slowdowns <10% for the more advanced schemes, with IIRC some even
reporting a speedup.
On Mon, 10 Jun 2024 07:16:48 GMT, [email protected]
(Anton Ertl) wrote:
John Savard <[email protected]d> writes:
In the case of Spectre, fixing the hardware has a cost in performance.
How do you know?
Papers on so-called "invisible speculation" schemes have reported
slowdowns <10% for the more advanced schemes, with IIRC some even
reporting a speedup.
I've heard claims - especially from Mitch Alsup - that, indeed, all
one has to do is avoid certain _mistakes_ when designing a pipeline,
and there's no room for Spectre any more.
I'm no expert on these things at all, so I don't know that this can't
be true. But I also don't know that it _is_ true.
What does Spectre exploit? It exploits the fact that speculative
execution keeps around data that was fetched into cache by the
speculative execution of some code that was never supposed to be
executed. Just in case it might be useful later.
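To make that description concrete, here is a deliberately crude Python toy of the bounds-check-bypass pattern. It is not an exploit and it models no real microarchitecture: `ToyCore`, its `cached_lines` set (standing in for cache state) and `leak_byte` are all hypothetical names invented for this sketch, and the "timing" step of a real attack is replaced by simply looking at the set.

```python
class ToyCore:
    """Toy model: `cached_lines` stands in for real cache state."""

    def __init__(self, memory, array_len):
        self.memory = memory        # public array bytes followed by secret bytes
        self.array_len = array_len  # architectural bound of the public array
        self.cached_lines = set()

    def victim(self, index, speculate=True):
        """Bounds-checked read whose wrong path may still run speculatively."""
        in_bounds = index < self.array_len
        if in_bounds or speculate:
            value = self.memory[index]    # wrong-path read when out of bounds
            self.cached_lines.add(value)  # dependent load pulls in line `value`
        # The wrong-path result is squashed; only the cache footprint survives.
        return self.memory[index] if in_bounds else None


def leak_byte(core, index):
    """Flush, run the victim out of bounds, then see which line got cached."""
    core.cached_lines.clear()
    core.victim(index)
    (line,) = core.cached_lines  # real attacks find this line by timing loads
    return line
```

In this toy, calling `victim` with `speculate=False` leaves no footprint at all, which is the "no speculative execution" mode suggested earlier; the "invisible speculation" schemes Anton mentions aim instead to keep the speculation but not the footprint.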
Obviously, keeping around any data that just happens to be
accidentally in cache, just in case it might be useful later, does
have a positive (but likely very slight) effect on performance. Being
strict about what speculative execution can do, on the other hand, so
nothing is allowed to leak information, will reduce performance... at
least a little bit.
It could well be that the losses aren't enough to be concerned about,
if this is done carefully. That is, not even the 3% quoted as the cost
of one of the earliest fixes. But since I've heard higher figures for
the fixes for later variants, without positive knowledge, I have to be skeptical about claims that all possible variants of this kind of
attack can be prevented at little cost.
And Rowhammer is even worse. It's not at all clear to me what can be
done without adding an expensive layer of monitoring to memory
accesses. However, only DRAM is vulnerable to Rowhammer, and so it may
be possible to turn cache into a bulwark against it somehow.
John Savard
Lawrence D'Oliveiro <[email protected]d> writes:
On Sun, 09 Jun 2024 14:13:25 GMT, Scott Lurndal wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
“Virtualization” was bandied about in the 1980s more as an idle,
theoretical concept rather than a practical one.
I'm quite sure that IBM would disagree with this statement.
I’m sure they would.
[Your] attempt to scramble to avoid being wrong was unsuccessful.
One other thing they did: they had one PAL code coming with the SRM
console for VMS and Digital OSF/1, and another PAL code with the
ARC/AlphaBIOS console for Windows NT and Linux. This allowed them to
charge extra (quite a lot) for hardware capable of running their premium
OSs, while providing almost competitive prices for hardware running PC
OSs.
Unfortunately, the PC-like package was still not price/performance
competitive, and AlphaBIOS (which we had on our EV56 boxes) was a horror
to work with.
Intel's official terminology makes a distinction between interrupts and
exceptions. The former are external/asynchronous, the latter are
internal/synchronous. Exceptions are further sub-divided into faults,
traps and aborts.
On the face of it, your feature is not useful.
And there was the NMI race condition bug ...
Given that ARM is able to charge an architecture licensing fee for the instruction set alone ...
.. while it's on the RAM disk, it can't be executed.
Write can be enabled to memory. Only enabling write and execute together
is potentially subject to restrictions.
One can create a PTE pointing at that RAM disk page and then allow
someone to execute it directly.
On Mon, 10 Jun 2024 07:16:48 GMT, [email protected]
(Anton Ertl) wrote:
John Savard <[email protected]d> writes:
In the case of Spectre, fixing the hardware has a cost in performance.
How do you know?
Papers on so-called "invisible speculation" schemes have reported
slowdowns <10% for the more advanced schemes, with IIRC some even
reporting a speedup.
I've heard claims - especially from Mitch Alsup - that, indeed, all
one has to do is avoid certain _mistakes_ when designing a pipeline,
and there's no room for Spectre any more.
I'm no expert on these things at all, so I don't know that this can't
be true. But I also don't know that it _is_ true.
What does Spectre exploit? It exploits the fact that speculative
execution keeps around data that was fetched into cache by the
speculative execution of some code that was never supposed to be
executed. Just in case it might be useful later.
Obviously, keeping around any data that just happens to be
accidentally in cache, just in case it might be useful later, does
have a positive (but likely very slight) effect on performance.
Terje Mathisen wrote:
EricP wrote:
[snip]
Many processors automatically disable interrupts on trap because it
greatly simplifies the race conditions in their prologue and epilogue.
x86 did not disable interrupts on exceptions but x64 allows it as an
option.
I have written a lot of x86 interrupt handlers, these chips did very
much disable all interrupts when transferring control to my handler.
The typical approach was to do the minimum work possible to save
whatever HW buffer/data needed saving, before executing a STI (SeT
Interrupt enable bit?) and then do anything else that had to be done
while still in the primary handler.
IRET restored flags, IP and CS, transferring control back to whatever
was running when the hw interrupt happened.
Terje
Yes, for x86/x64 external interrupts it raises the IRQ priority to that of the requesting device, masking further interrupts of the same or lower IRQ priority. Or you can explicitly disable all maskable interrupts.
However for exceptions and NMI x86 does not mask interrupts so it is
possible for, say, a page fault or INT instruction to trap to the OS,
saving a frame on the stack, and just then an external interrupt to
arrive, saving another frame.
On the return from the interrupt or exception (we want a common return
code path) we need to know if this is a First Level Exception/Interrupt.
If not, we take the simple path and just REI (Return from Exception or Interrupt). If it is a FLEI then we need to check for deferred work and jump into
the OS. Also, if we are returning to User mode we may need to check
for things like thread APCs/signals that arrived while we were away.
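The common return path described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class `Cpu`, its `enter`/`leave` methods and the `deferred` list are hypothetical names, not taken from any real kernel, and the race conditions discussed in this thread are deliberately not modelled.

```python
# Toy model of a common interrupt/exception return path.
# All names here are hypothetical; real kernels do this in assembly.

class Cpu:
    def __init__(self):
        self.depth = 0       # frames currently saved on the kernel stack
        self.deferred = []   # work posted by handlers, run at first level only
        self.log = []

    def enter(self, what):
        """An exception or interrupt arrives and saves a frame."""
        self.depth += 1
        self.log.append(f"enter {what} (depth {self.depth})")

    def leave(self, returning_to_user=False):
        """Return path shared by interrupts and exceptions."""
        if self.depth > 1:
            # Not first level: take the simple path and just return.
            self.depth -= 1
            self.log.append("simple return")
            return
        # First Level Exception/Interrupt: drain deferred work first.
        while self.deferred:
            self.log.append(f"run {self.deferred.pop(0)}")
        if returning_to_user:
            self.log.append("check APCs/signals")
        self.depth -= 1
        self.log.append("return")


cpu = Cpu()
cpu.enter("page fault")            # exception traps to the OS
cpu.enter("device interrupt")      # external interrupt nests on top
cpu.deferred.append("softirq")     # nested handler posts deferred work
cpu.leave()                        # nested return: simple path
cpu.leave(returning_to_user=True)  # first-level return: full path
```

The nested return does nothing extra, while the first-level return runs the deferred work and the user-mode checks before returning.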
On x86 there is also the difference between stack frame shape
depending on whether the prior mode was User or Super.
On x64 they fixed this so they are the same shape.
Then there is the difference between SYSCALL/SYSRET vs SYSENTER/SYSEXIT,
and that one did not set the system stack pointer on entry,
which leaves a security hole if an interrupt arrives just before
you can patch it.
And there was the NMI race condition bug, details of which I have
forgotten but was again something to do with the system stack not
being set correctly after switching to Super and then an NMI arrives
which does not set the stack because the prior mode was already Super.
It's not that these are not handleable, it's that it takes literally
hundreds of instructions in the x86/x64 prologues and epilogues closing
each of these holes and idiosyncrasies. And that's on top of the already large clocks cost for the IDT and call gates, and REI instructions.
*None* of this should be necessary.
Even the pipeline drain on mode switch should often be avoidable.
I'm quite sure that IBM would disagree with this statement.
I’m sure they would. But they invented virtualization in CP/CMS because their attempt at an “interactive timesharing” system, CMS, was only single-user.
I guess my vintage is showing! When I wrote HW interrupt handlers,
none of this applied so it was a much simpler world.
Initially there was no real priority in use because my handler would
start with IRQ disabled, I would poll/read the single byte serial
port buffer, then clear a hardware interrupt flag and then simply
IRET.
A little later (286?) it became possible to selectively re-enable
only those interrupts that had a higher priority, so I would do that
when my most critical work was done.
Even later the serial port chip was replaced with a far better one
which had 16-byte IO buffers and programmable interrupt levels. AFAIR
I would typically set it to signal when the buffer was half full, but
14 of 16 was also possible?
However for exceptions and NMI x86 does not mask interrupts so it is possible for, say, a page fault or INT instruction to trap to the
OS, saving a frame on the stack, and just then an external
interrupt to arrive, saving another frame.
On the return from the interrupt or exception (we want a common
return code path) we need to know if this is a First Level Exception/Interrupt. If not, we take the simple path and just REI
Return Exception or Interrupt. If it is a FLEI then we need to
check for deferred work and jump into the OS. Also, if we are
returning to User mode we may need to check for things like thread APCs/signals that arrived while we were away.
On x86 there is also the difference between stack frame shape
depending on whether the prior mode was User or Super.
On x64 they fixed this so they are the same shape.
Then there is the difference between SYSCALL/SYSRET vs
SYSENTER/SYSEXIT, and that one did not set the system stack pointer
on entry, which leaves a security hole if an interrupt arrives just
before you can patch it.
And there was the NMI race condition bug, details of which I have
forgotten but was again something to do with the system stack not
being set correctly after switching to Super and then an NMI arrives
which does not set the stack because the prior mode was already
Super.
It's not that these are not handleable, it's that it takes literally
hundreds of instructions in the x86/x64 prologues and epilogues
closing each of these holes and idiosyncrasies. And that's on top
of the already large clocks cost for the IDT and call gates, and
REI instructions.
*None* of this should be necessary.
Even the pipeline drain on mode switch should often be avoidable.
Ouch! Glad I got out of the IRQ handler business before 1990.
Terje
On Mon, 10 Jun 2024 20:41:31 +0300, Michael S wrote:
Intel's official terminology makes a distinction between interrupts and
exceptions. The former are external/asynchronous, the latter are
internal/synchronous. Exceptions are further sub-divided into faults,
traps and aborts.
That all sounds very DEC-like.
In particular, the DEC definition of a “fault” is that the saved PC on the stack still points at the instruction that caused the exception, so a return-from-exception will attempt to re-execute the same instruction.
This is exactly what you want for page faults, for example, but also for long-running interruptible instructions that haven’t finished yet.
On Mon, 10 Jun 2024 14:32:33 -0400, EricP wrote:
And there was the NMI race condition bug ...
Not surprised there was trouble with the concept of a “non-maskable interrupt”. When I first heard of such a thing, I threw up my hands in horror.
On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:
Given that ARM is able to charge an architecture licensing fee for the
instruction set alone ...
I think that applies to newer versions, not the older ones. Given that ARM goes back to the 1980s, any patents from the earliest years would have expired by now.
EricP <[email protected]> writes:
PAL code is stored in a writable control store that
is a separate address space from main memory
Given the way that it (the EV45 PAL code) implements the PAL-call IMB,
i.e., by executing enough code to flush the I-cache, means that the
PAL-code is loaded into the I-cache, so I expect that it resides in
normal RAM. If that was in a separate memory space, there would need
to be an additional bit in each I-cache tag that records this fact.
But I came to realize that none of that is actually *required*.
It doesn't *need* a third privilege mode, and actually it looks
more expensive performance wise to have one than not.
It would be simpler and cheaper to just transition directly
to and from Super mode without also going through PAL mode.
And there is NO technical reason to restrict access to HW control
register from Super mode.
Many processors automatically disable interrupts on trap because it
greatly simplifies the race conditions in their prologue and epilogue.
x86 did not disable interrupts on exceptions but x64 allows it as an option.
PAL mode does not require its own on-chip SRAM - it could exist in main
memory addressed through a base physical register or an MMU hack.
And having a dedicated private on-chip SRAM to hold critical OS code
does not mean that it is microcode. I would have this for my design
with an MMU fiddle to hard-wire a VA->PA mapping for some OS code.
After realizing it didn't need to exist, and that PAL mode looks more
expensive than just User/Super modes, I began to wonder why it was there.
Which leads me to here:
(I think PAL mode was a way to patent a feature that made the
ISA impossible to copy without their permission,
Not really. If there was a patent that is specific to it being a
different address space or a dedicated private on-chip SRAM, that
patent could be easily circumvented by the Amdahl-alike by putting the PAL-code in RAM and using a base register or MMU hack, as you
describe.
Also if there was enough room for more on-chip SRAM on any of the
Alpha chips, the designers would have used that room to put in
features that make the chip faster.
Given that ARM is able to charge an architecture licensing fee for the instruction set alone, I am sure that DEC had enough patents on its instruction set, no need for unnecessary and circumventable
implementation ideas.
One other thing they did: they had one PAL code coming with the SRM
console for VMS and Digital OSF/1, and another PAL code with the ARC/AlphaBIOS console for Windows NT and Linux. This allowed them to
charge extra (quite a lot) for hardware capable of running their
premium OSs, while providing almost competitive prices for hardware
running PC OSs. Unfortunately, the PC-like package was still not price/performance competitive, and AlphaBIOS (which we had on our EV56
boxes) was a horror to work with.
- anton
On Mon, 10 Jun 2024 20:41:31 +0300, Michael S wrote:
Intel's official terminology makes a distinction between interrupts and
exceptions. The former are external/asynchronous, the latter are
internal/synchronous. Exceptions are further sub-divided into faults,
traps and aborts.
That all sounds very DEC-like.
In particular, the DEC definition of a “fault” is that the saved PC on the
stack still points at the instruction that caused the exception, so a return-from-exception will attempt to re-execute the same instruction.
This is exactly what you want for page faults, for example, but also for
long-running interruptible instructions that haven’t finished yet.
Whereas a “trap” left the PC pointing at the following instruction. So a
return from the exception handler will simply resume execution there.
Over the evolution of the VAX architecture, some exceptions which
initially were “traps” became “faults” instead.
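The fault/trap distinction can be illustrated with a toy interpreter. This is a sketch only: the `run` function, the `load` and `brk` opcodes and the `mapped` set are hypothetical and do not correspond to VAX or x86 encodings. A "fault" leaves the saved PC on the faulting instruction so that it is re-executed after the handler fixes things up; a "trap" leaves it on the next instruction.

```python
# Toy interpreter contrasting "fault" and "trap" saved-PC behaviour.
# Opcodes and names are made up for illustration.

def run(program, mapped):
    pc, trace = 0, []
    while pc < len(program):
        op, arg = program[pc]
        if op == "load" and arg not in mapped:
            # Fault: saved PC is the faulting instruction itself.
            trace.append(("fault", pc))
            mapped.add(arg)      # handler maps the page
            continue             # return re-executes the same instruction
        if op == "brk":
            # Trap: saved PC is the following instruction.
            trace.append(("trap", pc))
            pc += 1              # return resumes after the trapping instruction
            continue
        trace.append((op, pc))   # instruction completes normally
        pc += 1
    return trace


program = [("load", "A"), ("brk", None), ("load", "A")]
trace = run(program, mapped=set())
```

The first load appears twice in the trace (once as a fault, once completing), while the breakpoint appears only once.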
I forgot to add that Mc 88120 had these features in 1992.
Stores waited for retirement.
ALL I have DONE is to not have the MB write into the cache until the
causing instruction retires !!
My 66000 is also insensitive to RowHammer and derivatives.....
Not always. If the mistakenly speculated cache-fetch /evicted/ some
other data from the (finite-sized) cache, and the evicted data are
needed later on the /true/ execution path, the mistakenly speculated
fetch has a /negative/ effect on performance. (This kind of "timing
anomaly" is very bothersome in static WCET analysis.)
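A small simulation makes the point concrete. It is a sketch under simplifying assumptions: a hypothetical two-entry LRU cache, with each access tagged as committed or speculative, and only committed-path misses counted. The function name `misses` and the addresses are made up for illustration.

```python
from collections import OrderedDict

def misses(accesses, capacity=2):
    """Count committed-path misses in a tiny LRU cache.

    `accesses` is a list of (addr, committed) pairs. Speculative
    fetches (committed=False) still fill the cache and may evict.
    """
    cache = OrderedDict()
    count = 0
    for addr, committed in accesses:
        if addr in cache:
            cache.move_to_end(addr)          # refresh LRU position
            continue
        if committed:
            count += 1                       # a miss the program really pays for
        if len(cache) >= capacity:
            cache.popitem(last=False)        # evict least recently used
        cache[addr] = True
    return count


# True path only: A and B miss once each, then hit.
clean = misses([("A", True), ("B", True), ("A", True), ("B", True)])

# Same path with one wrong-path fetch of X in the middle.
polluted = misses([("A", True), ("B", True), ("X", False),
                   ("A", True), ("B", True)])

# A wrong-path fetch that happens to prefetch useful data.
helped = misses([("A", True), ("C", False), ("C", True)])
unhelped = misses([("A", True), ("C", True)])
```

In this toy case the wrong-path fetch doubles the committed-path misses, while in the last pair it saves one, so the effect can go either way.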
On Tue, 11 Jun 2024 00:45:28 +0000, [email protected] (MitchAlsup1)
wrote:
I forgot to add that Mc 88120 had these features in 1992.
Stores waited for retirement.
Given that in the case of external RAM, as opposed to registers inside
the processor, there is only one possible value at any location...
memory doesn't have a pile of rename locations to play with... I am so unimaginative that I don't think I could design a CPU in which stores
to RAM didn't wait for the instruction that performed them to retire.
That, though, wouldn't save me from Spectre, since Spectre leaks
information by virtue of fetches of stuff _read_ in earlier speculated
code that didn't really happen being in cache.
John Savard
On Tue, 11 Jun 2024 00:27:02 +0000, [email protected] (MitchAlsup1)
wrote:
ALL I have DONE is to not have the MB write into the cache until the causing instruction retires !!
I suppose that depends on how you define "write".
If by "write" you mean store data in the cache, for eventual writing
out into RAM, well, since RAM doesn't contain "rename locations" to
play with, it seems to me that any CPU designer had better do that.
At least, I'm not imaginative enough to think of doing it any other
way.
However, if by "write" you mean to change the state of the cache in
any way, such as by reading data from memory... now, _then_ you would
indeed have done what is necessary to combat Spectre.
Obviously, though, a "load" instruction will _never_ retire unless it
can read the data from memory it is trying to put in a register.
So apparently WHAT you have REALLY DONE is to modify how memory reads
work...
if the data a load instruction requires is not already in the cache,
then a direct read from memory
is performed which *completely
bypasses* the cache;
this data (and its associated address) are
retained by the CPU to be placed in the cache _if_ the instruction is actually executed and when it retires.
And, in fact, the various cache levels have to work this way too. You
have an L1 cache miss, but an L2 cache hit? Fine, you take your data
directly from L2, and don't promote the data into L2 until instruction retirement.
So now the process of fetching data from memory is _not_ done by
fetching always from L1 and going _through_ L1 to access L2, and
going _through_ L2 to access RAM, which seems to be the usual way
these days.
That certainly can be done. But it isn't quite as simple and obvious
as you seem to claim.
My 66000 is also insensitive to RowHammer and derivatives.....
When I first read that sentence, I was completely incredulous. DRAM is sensitive to RowHammer because it's gone to feature sizes which are
beyond the state-of-the-art to do properly... so corners have been
cut.
How a CPU can be "insensitive" to it was mysterious.
After all, RowHammer is caused by multiple rapid-fire accesses to the
same address, or to related addresses, in memory.
But given that you are now explicitly passing accesses to DRAM around
the caches, instead of having the caches access DRAM as needed,
perhaps that also makes it possible for the CPU to detect suspicious
behavior more easily. (Since _related_ accesses may be used in a
RowHammer attack, simply pruning redundant memory accesses from the
operation stream won't be enough. I could see you doing _that_ as part
of "doing it right".)
If the "row" that was "hammered" just consisted of the 16 consecutive locations that can be accessed speedily after the first one is ready,
then pruning redundant accesses _would_ be enough, since to "hammer" a
row one has to access it hundreds of times, not at most 32 times; but
I'm afraid that isn't the case.
John Savard
On Tue, 11 Jun 2024 08:54:16 +0300, Niklas Holsti <[email protected]d> wrote:
Not always. If the mistakenly speculated cache-fetch /evicted/ some
other data from the (finite-sized) cache, and the evicted data are
needed later on the /true/ execution path, the mistakenly speculated
fetch has a /negative/ effect on performance. (This kind of "timing anomaly" is very bothersome in static WCET analysis.)
Ouch. Another argument for having a victim cache. And a benefit of
doing it in what is apparently Mitch Alsup's way - holding off cache
updates until instruction retirement.
John Savard
According to Lawrence D'Oliveiro <[email protected]d>:
I'm quite sure that IBM would disagree with this statement.
I’m sure they would. But they invented virtualization in CP/CMS because their attempt at an “interactive timesharing” system, CMS, was only single-user.
There's no need to make up silly stories like this ...
Windows NT was a disaster to the entire Unix workstation market. The irony was, NT “Workstation” wasn’t really feature-equivalent to the OSes the Unix workstations were running. But it was enough for the customers, it
seems ...
Lawrence D'Oliveiro wrote:
In particular, the DEC definition of a “fault” is that the saved PC on the stack still points at the instruction that caused the exception, so
a return-from-exception will attempt to re-execute the same
instruction. This is exactly what you want for page faults, for
example, but also for long-running interruptible instructions that
haven’t finished yet.
Whereas a “trap” left the PC pointing at the following instruction. So a return from the exception handler will simply resume execution there.
Both have the property where the PC is pointing at the first instruction
not executed.
Lawrence D'Oliveiro <[email protected]d> writes:
On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:
Given that ARM is able to charge an architecture licensing fee for the
instruction set alone ...
I think that applies to newer versions, not the older ones. Given that
ARM goes back to the 1980s, any patents from the earliest years would
have expired by now.
It has nothing to do with patents.
The architecture license provides far more than the ability to implement
the arm instruction set. BTDT.
1) NMI are incredibly useful in certain cases, particularly for
in-kernel debuggers.
2) NMI is actually maskable on Intel hardware (in the chipset, not the processor)
Lawrence D'Oliveiro <[email protected]d> writes:
On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:
Given that ARM is able to charge an architecture licensing fee for the
instruction set alone ...
I think that applies to newer versions, not the older ones. Given that ARM goes back to the 1980s, any patents from the earliest years would have expired by now.
It has nothing to do with patents.
The architecture license provides far more than the ability
to implement the arm instruction set. BTDT.
The spat also appears to be about ARM wanting a bigger slice of the pie on smartphones: they demand a share of the sales price of the final product instead of the CPU. That actually sounds like something that the
antitrust authorities might be interested in.
There's no need to make up silly stories like this ...
No need to take my word for it. Bitsavers added issues of a magazine
called “Mainframe” a few months back. I took the trouble to read the first one--it’s all about IBM, as though other “mainframe” machines didn’t exist. There’s a description of the background to CP/CMS (later VM/CMS) there.
In any event, I'd find the second article I linked to, the VM history
written by IBMers who were there, more credible than some random third
party magazine.
Windows NT was a disaster to the entire Unix workstation market.
The irony was, NT “Workstation” wasn’t really feature-equivalent to
the OSes the Unix workstations were running. But it was enough for
the customers, it seems ...
On Mon, 10 Jun 2024 14:32:33 -0400, EricP wrote:
And there was the NMI race condition bug ...
Not surprised there was trouble with the concept of a “non-maskable interrupt”. When I first heard of such a thing, I threw up my hands in horror.
Lawrence D'Oliveiro <[email protected]d> writes:
On Mon, 10 Jun 2024 14:32:33 -0400, EricP wrote:
And there was the NMI race condition bug ...
Not surprised there was trouble with the concept of a “non-maskable
interrupt”. When I first heard of such a thing, I threw up my hands in
horror.
1) NMI are incredibly useful in certain cases, particularly for in-kernel debuggers.
2) NMI is actually maskable on Intel hardware (in the chipset, not the processor)
3) ARM refused to support NMI in Aarch64 (partially because they didn't
have a spare exception vector). They've backtracked and hacked in a
solution using the interrupt controller to create a pseudo-unmaskable
interrupt due to customer demand.
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/a-profile-non-maskable-interrupts
Back in the 90's, we had a custom PCI card with a single button that would trigger an NMI for debugging new hardware.
John Savard wrote:
After all, RowHammer is caused by multiple rapid-fire accesses to the
same address, or to related addresses, in memory.
Yes, the write buffer in my DRAM controller is the L3 cache. Modified
data in the L3 migrates towards DRAM as DRAM cycles permit, but there
is no way to cause a line to be continuously written into DRAM.
If a modified line has migrated to DRAM, and it gets modified again
in the L3, that 2nd write will not be performed until a refresh cycle
on that DRAM is performed.
Thus if one tries to RowHammer My 66000 DRAM, DRAM gets refresh cycle
between each write.
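The write-gating rule described above can be sketched as follows. This is a hypothetical model, not the actual My 66000 memory controller: the class `DramWriteGate` and its methods are invented for illustration, refresh is simplified to a single event covering all of DRAM, and reads and neighbouring rows are not modelled.

```python
# Hypothetical model: a line that has already been written back to DRAM
# cannot be written again until a refresh has occurred.

class DramWriteGate:
    def __init__(self):
        self.written_since_refresh = set()  # lines already written to DRAM
        self.pending = {}                   # line -> newest data held in L3
        self.dram_writes = []               # writes that reached DRAM

    def modify(self, line, data):
        """A modified line arrives in (or is updated in) the L3."""
        self.pending[line] = data
        self._drain()

    def refresh(self):
        """DRAM refresh; previously written lines may be written again."""
        self.written_since_refresh.clear()
        self._drain()

    def _drain(self):
        # Migrate pending lines to DRAM unless gated by an earlier write.
        for line in list(self.pending):
            if line not in self.written_since_refresh:
                self.dram_writes.append((line, self.pending.pop(line)))
                self.written_since_refresh.add(line)


gate = DramWriteGate()
for i in range(1000):          # try to hammer line 7 with writes
    gate.modify(7, i)
writes_before_refresh = list(gate.dram_writes)
gate.refresh()
writes_after_refresh = list(gate.dram_writes)
```

A thousand modifications of the same line produce only one DRAM write before the refresh; the newest value is written once the refresh has happened.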
On Tue, 11 Jun 2024 10:03:36 +0200
Terje Mathisen <[email protected]> wrote:
Ouch! Glad I got out of the IRQ handler business before 1990.
*None* of this should be necessary.
Even the pipeline drain on mode switch should often be avoidable.
Terje
I think Eric more than a little exaggerates the level of
complexity of end-of-interrupt processing needed in the common case.
Maybe the code is long, but the absolute majority of it is executed very rarely, if at all.
According to Lawrence D'Oliveiro <[email protected]d>:
There's no need to make up silly stories like this ...
No need to take my word for it. Bitsavers added issues of a magazine
called “Mainframe” a few months back. I took the trouble to read the first
one--it’s all about IBM, as though other “mainframe” machines didn’t exist. There’s a description of the background to CP/CMS (later VM/CMS) there.
I see Mainframe Journal, with the earliest issue being Jul/Aug 1988. Is
that it? I don't see anything in the ToC that looks like a VM overview.
In any event, I'd find the second article I linked to, the VM history
written by IBMers who were there, more credible than some random third
party magazine. CMS really was written at the same time as CP, and
they always intended them to work together as a time-sharing system.
On Wed, 12 Jun 2024 07:47:18 -0000 (UTC), John Levine wrote:
In any event, I'd find the second article I linked to, the VM history
written by IBMers who were there, more credible than some random third
party magazine.
By all means, check the bios of the authors, included as with any
magazine. It was written by IBM pros, for IBM pros.
On Tue, 11 Jun 2024 14:11:55 GMT, Scott Lurndal wrote:
1) NMI are incredibly useful in certain cases, particularly for
in-kernel debuggers.
2) NMI is actually maskable on Intel hardware (in the chipset, not the
processor)
Do you see a contradiction between the two?
I think Eric more than a little exaggerates the level of
complexity of end-of-interrupt processing needed in the common case.
Maybe the code is long, but the absolute majority of it is executed very
rarely, if at all.
Possibly, as I do have a tendency to get somewhat animated about this.
I can't find it just now but a while back I was looking at some
Linux source code for the x86 interrupt return path,
and it went on for page after page after page.
John Savard wrote:
On Tue, 11 Jun 2024 00:27:02 +0000, [email protected] (MitchAlsup1)
wrote:
ALL I have DONE is to not have the MB write into the cache until the
causing instruction retires !!
I suppose that depends on how you define "write".
I mean the memory cell does not get modified.
If by "write" you mean store data in the cache, for eventual writing
out into RAM, well, since RAM doesn't contain "rename locations" to
play with, it seems to me that any CPU designer had better do that.
The cache itself is not modified until the memory reference retires.
But there is a buffer holding the data which can be accessed as if
it were an L0 cache until the data migrates to the real cache at
retirement.
At least, I'm not imaginative enough to think of doing it any other
way.
However, if by "write" you mean to change the state of the cache in
any way, such as by reading data from memory... now, _then_ you would
indeed have done what is necessary to combat Spectre.
The cache is not modified, the data is available through another means.
a means that can be backed up like a mispredicted branch. The buffer
I am talking about is temporally organized not spatially organized.
Obviously, though, a "load" instruction will _never_ retire unless it
can read the data from memory it is trying to put in a register.
The LD instruction can obtain data from either the buffer or from
the data cache itself. The buffer covers the execution window,
allowing the LD to retire (assuming every older instruction also
retires).
So apparently WHAT you have REALLY DONE is to modify how memory reads
work...
I pipelined them through a temporally organized memory execution
window. This also provides for allowing the memory system to run
OoO wrt program order, and detect actual ordering violations, and
rerun the memory references in a proper memory order by rerunning
the references in order.
You get relaxed memory order performance and precise memory order simultaneously.
if the data a load instruction requires is not already in the cache,
then a direct read from memory
The request is forwards towards memory through the cache hierarchy
and data arrives back at requestor (sooner or later).
is performed which *completely
bypasses* the cache;
Yes, critical word first.
this data (and its associated address) are
retained by the CPU to be placed in the cache _if_ the instruction is
actually executed and when it retires.
Yes !! While the data resides in the buffer, the whole line can be
accessed by a number of memory reference instructions.
And, in fact, the various cache levels have to work this way too. You
have an L1 cache miss, but an L2 cache hit? Fine, you take your data
directly from L2, and don't promote the data into L2 until instruction
retirement.
I use an exclusive cache organization. so data arriving at the CPU
goes into buffer, which upon retirement goes into L1, which has the
potential to push a L1->L2 line, and so forth.
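The buffer-until-retirement idea in this exchange can be sketched as a toy model. Everything here is an assumption made for illustration: the class `Core`, its `load`/`retire`/`squash` methods, integer tags that increase in program order, and a single cache level. It is not a description of the actual My 66000 implementation.

```python
# Toy model: speculative loads fill a temporally ordered buffer;
# the cache proper is only updated when the load retires.

class Core:
    def __init__(self, memory):
        self.memory = memory   # addr -> data (stands in for L2/DRAM)
        self.l1 = {}           # committed cache state
        self.buffer = []       # (tag, addr, data) in execution order

    def load(self, tag, addr):
        """Speculatively execute the load with program-order tag `tag`."""
        for _, a, d in reversed(self.buffer):
            if a == addr:
                return d       # buffer behaves like an L0 cache
        if addr in self.l1:
            return self.l1[addr]
        data = self.memory[addr]
        self.buffer.append((tag, addr, data))  # L1 is left untouched
        return data

    def retire(self, tag):
        """The load retires; its buffered data migrates into L1."""
        keep = []
        for t, a, d in self.buffer:
            if t == tag:
                self.l1[a] = d
            else:
                keep.append((t, a, d))
        self.buffer = keep

    def squash(self, from_tag):
        """Misprediction: drop entries from `from_tag` onward."""
        self.buffer = [e for e in self.buffer if e[0] < from_tag]


core = Core({0x10: "a", 0x20: "secret"})
first = core.load(1, 0x10)    # on the true path
leaked = core.load(2, 0x20)   # on a mispredicted path
core.squash(2)                # branch resolves; load 2 never happened
core.retire(1)
```

After the squash, the wrongly speculated address has left no trace in the committed cache state, which is the property that matters for Spectre-style probing.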
So now the process of fetching data from memory is _not_ done by
fetching always from L1 and going _through_ L1 to access L2, and
going _through_ L2 to access RAM, which seems to be the usual way
these days.
It's back to the Athlon/Opteron organizations.
That certainly can be done. But it isn't quite as simple and obvious
as you seem to claim.
If you had worked on them you could recognize the advantages and disadvantages.
My 66000 is also insensitive to RowHammer and derivatives.....
When I first read that sentence, I was completely incredulous. DRAM is
sensitive to RowHammer because it's gone to feature sizes which are
beyond the state-of-the-art to do properly... so corners have been
cut.
How a CPU can be "insensitive" to it was mysterious.
After all, RowHammer is caused by multiple rapid-fire accesses to the
same address, or to related addresses, in memory.
Yes, the write buffer in my DRAM controller is the L3 cache. Modified
data in the L3 migrates towards DRAM as DRAM cycles permit, but there
is no way to cause a line to be continuously written into DRAM.
If a modified line has migrated to DRAM, and it gets modified again
in the L3, that 2nd write will not be performed until a refresh cycle
on that DRAM is performed.
Thus if one tries to RowHammer My 66000 DRAM, DRAM gets refresh cycle
between each write.
In any event, I'd find the second article I linked to, the VM history
written by IBMers who were there, more credible than some random third
party magazine. CMS really was written at the same time as CP, and
they always intended them to work together as a time-sharing system.
when it was decided to add virtual memory to all 370s, it was also
decided to rewrite CP67 for VM370, simplifying and/or dropping lots of features (also renaming Cambridge Monitor System to Conversational
Monitor System and crippling its ability to run on real machine).
https://www.oreilly.com/library/view/understanding-the-linux/0596002130/ch04s08.html
https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/entry_32.S
The problem I have with this approach is that it deals with all the
race conditions (eg a nested interrupt posts a new softirq between
when you checked for pending softirq's and the IRET) by running with interrupts disabled for long instruction sequences. I consider that
to be a poor way to do this as that blocks processing all other
interrupts.
1) NMI are incredibly useful in certain cases, particularly for
in-kernel debuggers.
2) NMI is actually maskable on Intel hardware (in the chipset, not the processor)
3) ARM refused to support NMI in Aarch64 (partially because they didn't
have a spare exception vector). They've backtracked and hacked in a
solution using the interrupt controller to create a pseudo-unmaskable
interrupt due to customer demand.
MitchAlsup1 wrote:
John Savard wrote:
After all, RowHammer is caused by multiple rapid-fire accesses to the
same address, or to related addresses, in memory.
Yes, the write buffer in my DRAM controller is the L3 cache. Modified
data in the L3 migrates towards DRAM as DRAM cycles permit, but there
is no way to cause a line to be continuously written into DRAM.
If a modified line has migrated to DRAM, and it gets modified again
in the L3, that 2nd write will not be performed until a refresh cycle
on that DRAM is performed.
Thus if one tries to RowHammer My 66000 DRAM, DRAM gets refresh cycle
between each write.
What does it do if L3 receives more writes than it has ways in a row,
does it stall evicts from L2?
Let's say L3 is 4-way associative and all four ways in an L3 row have been updated,
then a 5th way in that same row is written from L2.
L3 has no place to hold that 5th way and it can't evict one
of the other 4 ways because that could cause rowhammer.
Seems to me that all it can do is stall the 5th write from L2 until
DRAM refresh rolls around and re-enables one of the pending L3 writes,
which would back up victim evicts from L2.
Or maybe L3 has a small fully assoc emergency overflow buffer,
but still that could fill up too.
Lawrence D'Oliveiro <[email protected]d> writes:
I recall CMS was single-user to start with, and the point of running it
under “CP” aka “VM” was to offer a multi-user service. Did CMS ever become multi-user in its own right?
over years relying more & more on CP kernel services, no multi-user ...
but did get multitasking ...
trivia: my brother was regional Apple rep (largest physical area CONUS)
and when he came into town, I could be invited to business dinners and
argue MAC design (even before MAC announced).
I recall CMS was single-user to start with, and the point of running it
under “CP” aka “VM” was to offer a multi-user service. Did CMS ever become
multi-user in its own right?
Can one NOT infer that; a SW convention to leave at least 1 enable bit
always enabled, gives the system an NMI ??
So what did you think of it? The original hardware architecture was
heavily centred around the 60.15Hz video refresh. Each refresh interval, 21888 bytes were read out of the video buffer (for the 512×342 display),
and 740 bytes were read out of the sound buffer to go to the speaker.
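The figures quoted here can be checked with a little arithmetic. The sketch below assumes one bit per pixel (an assumption, though it is the only depth that matches the 21888-byte figure); the refresh rate and the 740-byte sound buffer are taken from the post as given, and the variable names are illustrative.

```python
# Arithmetic check of the per-refresh figures quoted above.
WIDTH, HEIGHT = 512, 342
REFRESH_HZ = 60.15
BITS_PER_PIXEL = 1                 # assumption: monochrome, 1 bit per pixel

video_bytes = WIDTH * HEIGHT * BITS_PER_PIXEL // 8   # bytes per refresh
sound_bytes = 740                                    # bytes per refresh

video_rate = video_bytes * REFRESH_HZ   # bytes per second for video
sound_rate = sound_bytes * REFRESH_HZ   # bytes per second for sound

print(video_bytes, round(video_rate), round(sound_rate))
```

So the video readout alone accounts for roughly 1.3 MB/s of memory bandwidth, with sound adding about 44.5 kB/s on top.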
On Thu, 13 Jun 2024 00:43:51 +0000, MitchAlsup1 wrote:
Can one NOT infer that; a SW convention to leave at least 1 enable
bit always enabled, gives the system an NMI ??
Every interrupt needs to be maskable at some point, if only to avoid
infinite recursion and resulting stack overflow.
Rowhammer can modify nearby lines, not just the ones that are being
hammered, right? How do you guarantee that all neighbors will also
be refreshed?
Similarly, if the accesses are LOCK XADD operations, and you have multiple
CPUs (or cores not sharing a common last-level cache), then I don't see any
way to prevent those accesses from making it all the way to the RAM chips.
Scott Lurndal wrote:
1) NMI are incredibly useful in certain cases, particularly for
in-kernel debuggers.
2) NMI is actually maskable on Intel hardware (in the chipset, not the
processor)
3) ARM refused to support NMI in Aarch64 (partially because they didn't
have a spare exception vector). They've backtracked and hacked in a
solution using the interrupt controller to create a pseudo-unmaskable
interrupt due to customer demand.
On an architecture where one has multiple simultaneous interrupt tables
(say 1 per Guest OS and 1 per HyperVisor) and each table manages 32K
individual interrupts, each interrupt masked by its corresponding Enable
bit::
Can one NOT infer that; a SW convention to leave at least 1 enable bit
always enabled, gives the system an NMI ??
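A minimal sketch of what such a software convention could look like, assuming a hypothetical table with one enable bit per interrupt and one slot that the convention never clears. The class and method names are invented for illustration and do not describe any real architecture's interface.

```python
class InterruptTable:
    SIZE = 32 * 1024  # 32K interrupts per table, as in the post

    def __init__(self, never_masked=0):
        self.never_masked = never_masked  # hypothetical slot reserved by convention
        self.enable = 1 << never_masked   # bitmask of enable bits

    def set_enable(self, irq, on):
        if not 0 <= irq < self.SIZE:
            raise ValueError("no such interrupt")
        if irq == self.never_masked and not on:
            # Only a software convention: nothing in hardware enforces it.
            raise PermissionError("convention: this enable bit stays set")
        if on:
            self.enable |= 1 << irq
        else:
            self.enable &= ~(1 << irq)

    def deliverable(self, irq):
        return bool((self.enable >> irq) & 1)
```

Because the check lives in software, any code that can write the enable bits directly can bypass it, which is the difference between this and a true hardware NMI.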
On Thu, 13 Jun 2024 01:46:40 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 13 Jun 2024 00:43:51 +0000, MitchAlsup1 wrote:
Can one NOT infer that; a SW convention to leave at least 1 enable
bit always enabled, gives the system an NMI ??
Every interrupt needs to be maskable at some point, if only to avoid
infinite recursion and resulting stack overflow.
Edge-sensitive interrupt is effectively masked for as long as it is
latched.
On 6/8/24 1:37 PM, MitchAlsup1 wrote:
EricP wrote:[snip]
Scott Lurndal wrote:
What they found that not only do they not need 4 levels,
it was a pointless overhead to have to constantly switch between them.
(There is a pretty high penalty to switching modes, copying in args,
validating args, doing something usually simple, then switching back,
when it is all the OS's code anyway.)
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
But for similar reasons ring 1 and 2 are not used in x86 machines,
either. {{Now, if we could just go back to 1982 and not invent
IDTs, and call gates, .....}}
I thought My 66000 had Port Holes that are vaguely similar to
call gates, so rather than "not invent" perhaps invent with better
semantics and a better interface?
(Though 1982 might have been too
early to implement such. Better perceiving when to wait for the
technology or understanding to implement something better is
presumably one of the skills acquired by long experience as well
as the related what can be implemented to provide the most attractive/marketable features without excessively limiting future developments.
Letting a competitor provide a temporarily better
product — or delaying entry into a market expecting a feature —
can sometimes be sensible if one expects to leapfrog with
a better long-term alternative, but "worse is better" has some
truth.)
It seems that in terms of computer architectures, the world is not going
to beat a path to your door even if you invent a better mousetrap.
On Thu, 13 Jun 2024 23:48:14 +0000, MitchAlsup1 wrote:
It seems that in terms of computer architectures, the world is not going
to beat a path to your door even if you invent a better mousetrap.
There is an inherent conflict between wanting an idea to be widely
adopted, and wanting to maximize your profit from it.
It seems that in terms of computer architectures, the world is
not going to beat a path to your door even if you invent a
better mousetrap.
Lawrence D'Oliveiro wrote:
On Wed, 12 Jun 2024 09:38:17 -0400, EricP wrote:
https://www.oreilly.com/library/view/understanding-the-linux/0596002130/ch04s08.html
That book is from 2002.
https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/entry_32.S
That, too, seems a bit old. How about this for a more up-to-date
version:
<https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_32.S>.
Or try the 64-bit version:
<https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S>.
Thanks, I'll have a look at that entry.s. It looks quite different.
The copyright on common.c file I referenced was 2015 so those
files seemed to be relatively up to date and being maintained.
The problem I have with this approach is that it deals with all the
race conditions (eg a nested interrupt posts a new softirq between
when you checked for pending softirq's and the IRET) by running with
interrupts disabled for long instruction sequences. I consider that
to be a poor way to do this as that blocks processing all other
interrupts.
But then again, things are complicated enough as it is.
The cautionary tale here is that the return code path is complicated
exactly because it wasn't sorted out during the ISA and HW design phase.
On Wed, 12 Jun 2024 09:38:17 -0400, EricP wrote:
https://www.oreilly.com/library/view/understanding-the-linux/0596002130/ch04s08.html
That book is from 2002.
https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/entry_32.S
That, too, seems a bit old. How about this for a more up-to-date
version: <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_32.S>.
Or try the 64-bit version: <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S>.
The problem I have with this approach is that it deals with all the
race conditions (eg a nested interrupt posts a new softirq between
when you checked for pending softirq's and the IRET) by running with
interrupts disabled for long instruction sequences. I consider that
to be a poor way to do this as that blocks processing all other
interrupts.
But then again, things are complicated enough as it is.
Or perhaps, the cautionary tale is that a 1970 architecture must adapt
to new paradigms over five decades, and backward compatibility
requirements lead to inevitable complexity.
But the failure of RISC-B to make x86 obsolete shows that even giving it
away for free is not enough. Because not being able to run your old
Windows programs is the real problem.
John Savard <[email protected]d> schrieb:
But _some_ people use Linux, which essentially makes them free to hop
to any ISA for which the Gnu C compiler works.
It's not quite that simple - if you try to build a modern web browser for
POWER on Linux, for example, you're in for quite an adventure.
But _some_ people use Linux, which essentially makes them free to hop
to any ISA for which the Gnu C compiler works.
On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
It's not quite that simple - if you try to build a modern web browser for
POWER on Linux, for example, you're in for quite an adventure.
Endianness assumptions?
I think essentially all of the basic toolchain is
already available, so what’s left would be mostly bugs in the app code
itself.
For which I’m sure they would accept patches.
You mean RISC-V?
I think it is succeeding in its goals. From what I hear, it's
already shipping in the billions of units per year, in a similar
league to ARM.
Lawrence D'Oliveiro <[email protected]d> writes:
On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
It's not quite that simple - if you try to build a modern web browser for
POWER on Linux, for example, you're in for quite an adventure.
Endianness assumptions?
OpenPower is little-endian, so I doubt that this is the reason. From
what I read, Web browsers are a beast to build.
On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
It's not quite that simple - if you try to build a modern web
browser for POWER on Linux, for example, you're in for quite
an adventure.
Endianness assumptions? I think essentially all of the basic
toolchain is already available, so what's left would be mostly bugs
in the app code itself. For which I'm sure they would accept
patches.
There are a _lot_ of libraries and other components that go into a
modern web browser, many of which will never have been built on POWER.
The JITer for the Javascript engine, and the Web Assembly translator
seem to be among them, and they need to make use of the native
instruction set. That's not a bug fix, that's a significant
implementation task.
Lawrence D'Oliveiro <[email protected]d> writes:
For which I’m sure they would accept patches.
Who would write them?
On Sat, 15 Jun 2024 12:16 +0100 (BST), John Dallman wrote:
There are a _lot_ of libraries and other components that go into a
modern web browser, many of which will never have been built on
POWER. The JITer for the Javascript engine, and the Web Assembly
translator seem to be among them, and they need to make use of the
native instruction set. That's not a bug fix, that's a significant implementation task.
But those are not required for correctness, only for efficiency. The
original question, as I understood it, was to get the code running on
the specified architecture, not necessarily to get it running at top
speed.
Do you use video codecs in FF for correctness or only for efficiency?
On Sun, 16 Jun 2024 02:52:02 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:
Do you use video codecs in FF for correctness or only for
efficiency?
I think FFmpeg is one of those basic toolkits that has already been
ported to OpenPOWER.
Is it capable to decode H264?
On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:
Do you use video codecs in FF for correctness or only for
efficiency?
I think FFmpeg is one of those basic toolkits that has already been
ported to OpenPOWER.
On Sun, 16 Jun 2024 02:52:02 -0000 (UTC) Lawrence D'Oliveiro
<[email protected]d> wrote:
On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:
Do you use video codecs in FF for correctness or only for efficiency?
I think FFmpeg is one of those basic toolkits that has already been
ported to OpenPOWER.
Is it capable to decode H264?
Michael S <[email protected]> schrieb:
On Sun, 16 Jun 2024 02:52:02 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:
Do you use video codecs in FF for correctness or only for
efficiency?
I think FFmpeg is one of those basic toolkits that has already been
ported to OpenPOWER.
Is it capable to decode H264?
https://ffmpeg.org/ffmpeg-codecs.html says yes, if you use http://www.openh264.org/ .
On Sun, 16 Jun 2024 10:34:47 +0300, Michael S wrote:
On Sun, 16 Jun 2024 02:52:02 -0000 (UTC) Lawrence D'Oliveiro <[email protected]d> wrote:
On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:
Do you use video codecs in FF for correctness or only for
efficiency?
I think FFmpeg is one of those basic toolkits that has already been
ported to OpenPOWER.
Is it capable to decode H264?
I’m surprised you didn’t know, since you were the one who mentioned
it.
It has options to build against toolkits for every codec and file
format that is still worth using these days, and a few more besides.
On Sun, 16 Jun 2024 08:49:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Sun, 16 Jun 2024 10:34:47 +0300, Michael S wrote:
Is it capable to decode H264?
I’m surprised you didn’t know, since you were the one who mentioned it.
It has options to build against toolkits for every codec and file
format that is still worth using these days, and a few more besides.
All I know about it is that typical FF installation on x86-64 uses plug
in provided by Cisco. I have no idea if the reason for it is technical
or legal.
On Sun, 16 Jun 2024 08:49:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
It has options to build against toolkits for every codec and file
format that is still worth using these days, and a few more besides.
All I know about it is that typical FF installation on x86-64 uses plug
in provided by Cisco. I have no idea if the reason for it is technical
or legal.
In article <v4ifbt$32kuq$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:
On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
It's not quite that simple - if you try to build a modern web browser
for POWER on Linux, for example, you're in for quite an adventure.
Endianness assumptions? I think essentially all of the basic toolchain
is already available, so what's left would be mostly bugs in the app
code itself. For which I'm sure they would accept patches.
There are a _lot_ of libraries and other components that go into a
modern web browser, many of which will never have been built on POWER.
The JITer for the Javascript engine, and the Web Assembly translator
seem to be among them, and they need to make use of the native
instruction set. That's not a bug fix, that's a significant
implementation task.
A modern web browser does a _lot_ more than interpret HTML and display
bitmaps, and most of the code for the extra functionality is in the
browser. They're more like multimedia operating systems than document viewers.
On Sun, 16 Jun 2024 08:00:02 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Sun, 16 Jun 2024 02:52:02 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:
Do you use video codecs in FF for correctness or only for
efficiency?
I think FFmpeg is one of those basic toolkits that has already been
ported to OpenPOWER.
Is it capable to decode H264?
https://ffmpeg.org/ffmpeg-codecs.html says yes, if you use
http://www.openh264.org/ .
Thank you.
I see that ppc64el is not supported, but verified to be working.
Hopefully it means that it doesn't just show something under FHD
resolution, but can work without dropping frames. Which is not so easy
when done purely in software.
h.264 is/was extremely heavily patented; the MPEG LA patent consortium's
list of the patents included in their license is 58 pages[1], with
three columns!
THREAD NECROMANCY
On 6/11/24 5:18 PM, MitchAlsup1 wrote:
[snip]
I doubt that RowHammer still works when refreshes are interspersed
between accesses--RowHammer generally works because the events are
not protected by refreshes--the DRC sees the right ROW open and
simply streams at the open bank.
If one refreshes the two adjacent rows to avoid data disruption,
those refreshes would be adjacent reads to two other rows so it
seems one would have to be a little cautious about excessively
frequent refreshes.
Also note, there are no instructions in My 66000 that force a cache
to DRAM whereas there are instructions that can force a cache line
into L3.
How does a system suspend to DRAM if it cannot force a writeback
of all dirty lines to memory?
I am *guessing* this would not use a
special instruction but rather configuration of power management
that would cause hardware/firmware to clean the cache.
Writing back specific data to persistent memory might also
motivate cache block cleaning operations. Perhaps one could
implement such by copying from a cacheable mapping to a
non-cacheable(I/O?) memory?? (I simply remember that Intel added
instructions to write cache lines to persistent memory.)
L3 is the buffer to DRAM. Nothing gets to DRAM without
going through L3 and nothing comes out of DRAM that is not also
buffered by L3. So, if 96 cores simultaneously read a line residing in
DRAM, DRAM is read once and 95 cores are serviced through L3. So,
you can't RowHammer based on reading DRAM, either.
If 128 cores read distinct cache lines from the same page quickly
enough to hammer the adjacent pages but not quickly enough to get
DRAM page open hits, this would seem to require relatively
frequent refreshes of adjacent DRAM rows.
Since the L3/memory controller could see that the DRAM row was
unusually active, it could increase prefetching while the DRAM
row was open and/or queue the accesses longer so that the
hammering frequency was reduced and page open hits would be more
common.
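The kind of bookkeeping this implies can be sketched as a per-row activation counter in the memory controller. This is only an illustration under assumed parameters: the threshold value, the class name, and the "refresh both neighbors" response are hypothetical, and real activation limits are device-specific.

```python
from collections import Counter


class RowActivityMonitor:
    """Hypothetical per-row activation counter for one DRAM bank."""

    def __init__(self, threshold=4):
        self.threshold = threshold   # assumed value, for illustration only
        self.activations = Counter()

    def activate(self, row):
        """Record a row activation; return neighbor rows to refresh, if any."""
        self.activations[row] += 1
        if self.activations[row] < self.threshold:
            return []
        self.activations[row] = 0
        neighbors = [n for n in (row - 1, row + 1) if n >= 0]
        # Refreshing a neighbor is itself an activation of that neighbor,
        # so it is counted too; this is why refreshes cannot be unlimited.
        for n in neighbors:
            self.activations[n] += 1
        return neighbors

    def refresh_window_elapsed(self):
        """The regular refresh has covered every row: start counting afresh."""
        self.activations.clear()


m = RowActivityMonitor(threshold=3)
print([m.activate(10) for _ in range(3)])
```

The same counter could instead drive the alternatives mentioned above, such as holding queued accesses longer or prefetching more of the open row, rather than issuing neighbor refreshes.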
The simple statement that L3 would avoid RowHammer by providing
the same cache line to all requesters seemed a bit too simple.
Your design may very well handle all the problematic cases,
perhaps even with minimal performance penalties for inadvertent
hammering and logging/notification for questionable activity just
like for error correction (and has been proposed for detected race conditions). I just know that these are hard problems.
On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:
VAX was before common era Hypervisors, do you think VAX could have
supported secure mode and hypervisor with their 4 levels ??
“Virtualization” was bandied about in the 1980s more as an idle, theoretical concept rather than a practical one.
The question was: was the instruction set defined so that code that was
designed to run in a privileged mode could be run unprivileged, so that any
attempt to do privileged things would be trapped and emulated by the
real privileged code? And there was nothing it could do to discover
it wasn’t running in privileged mode?
(Obviously performance was not the issue here, but correctness was.)
For example, the VAX had a MOVPSL instruction that allowed read-only
access to the entire processor status register. Through this,
nonprivileged user-mode code could discover it was running in user mode, which would blow the illusion.
The Motorola 680x0 family was I think properly virtualizable in this
sense. Or maybe the 68020 and 68030 were, but the 68040 wasn't. I think the
Motorola engineers working on the ’040 asked if any customers were
interested in preserving the self-virtualization feature, and nobody
seemed to care.
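The trap-and-emulate requirement, and the way a MOVPSL-style instruction breaks it, can be shown with a toy model. The operation names `PRIV_OP` and `READ_MODE` are hypothetical stand-ins (not real VAX or 680x0 mnemonics): the first traps when executed unprivileged, the second reads the real mode without trapping.

```python
class Trap(Exception):
    """Raised when a privileged operation is attempted in user mode."""


class CPU:
    def __init__(self, real_mode):
        self.real_mode = real_mode  # "kernel" or "user"

    def execute(self, op):
        if op == "PRIV_OP":          # privileged: traps unless in kernel mode
            if self.real_mode != "kernel":
                raise Trap(op)
            return "ok"
        if op == "READ_MODE":        # sensitive but unprivileged, like MOVPSL
            return self.real_mode
        raise ValueError(op)


def run_guest(cpu, ops):
    """Run guest 'kernel' code; the monitor emulates whatever traps."""
    seen = []
    for op in ops:
        try:
            seen.append(cpu.execute(op))
        except Trap:
            seen.append("ok")        # monitor emulates the privileged effect
    return seen


native = run_guest(CPU("kernel"), ["PRIV_OP", "READ_MODE"])
virtualized = run_guest(CPU("user"), ["PRIV_OP", "READ_MODE"])
print(native, virtualized)
```

The privileged operation behaves identically in both runs because the trap gives the monitor a chance to emulate it. The mode read never traps, so the guest sees "user" where it expected "kernel", and the illusion is broken.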