• Privilege Levels Below User

    From John Savard@21:1/5 to All on Fri Jun 7 12:03:03 2024
    This may be a silly idea... but it seems to be the sort of thing that
    current concerns about computer security may be calling for.

    It is typical for computers to have a privileged mode of operation,
    wherein I/O operations and certain special changes to the state of the
    computer are allowed that are barred to normal computational tasks.

    For various reasons, miscreants have not been completely foiled by the existence of this feature.

    Some types of instruction that are required for normal computation are
    still, to a certain extent, potentially harmful.

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return
    - the return instruction being limited, sort of like a supervisor
    call, so it can only return in a proper manner.

    The first reduced-privilege state would not allow any branch
    instructions, particularly conditional branches.

    The second, in addition, would not allow any access to memory, only
    allowing access to registers.

    To use these states to aid in security, more is required.

    For one thing, blocks of memory would need to be able to be marked as
    not only containing code or data, but as containing code that runs at
    one of these reduced privilege levels.

    And then comes the payoff: a block of memory could be marked as
    writeable, but yet containing executable code, for things like
    just-in-time compilation... but as only containing code at one of
    these reduced privilege levels. This would prevent the generation of
    code containing branches or memory accesses, as desired, while still
    allowing the generation of computational sequences.
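
The two reduced levels described above can be modelled as a simple filter over instruction classes. This is only a toy sketch: the mnemonics, the `Level` names, and the `ret_reduced` return instruction are all hypothetical, and real hardware would enforce the rule at decode time from a page attribute rather than by scanning a list.

```python
from enum import Enum


class Level(Enum):
    USER = 0           # ordinary user code: all unprivileged instructions allowed
    NO_BRANCH = 1      # first reduced level: branch instructions fault
    REGISTER_ONLY = 2  # second reduced level: branches and memory accesses fault


# Hypothetical instruction classes for a made-up ISA (assumption, not from the post).
BRANCH_OPS = {"jmp", "beq", "bne", "call"}
MEMORY_OPS = {"load", "store"}


def allowed(op, level):
    """Return True if instruction `op` may execute in a block marked `level`."""
    if level in (Level.NO_BRANCH, Level.REGISTER_ONLY) and op in BRANCH_OPS:
        return False
    if level is Level.REGISTER_ONLY and op in MEMORY_OPS:
        return False
    return True


def first_violation(code, level):
    """Return the index of the first instruction that would fault, or None."""
    for index, op in enumerate(code):
        if not allowed(op, level):
            return index
    return None


# A block a JIT might have written into writable-and-executable memory.
jit_output = ["add", "mul", "load", "xor", "beq", "ret_reduced"]
```

With this model, `first_violation(jit_output, Level.NO_BRANCH)` points at the `beq`, and `Level.REGISTER_ONLY` already stops at the `load`; the same block is fine at `Level.USER`.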

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Savard on Fri Jun 7 18:18:33 2024
    John Savard <[email protected]d> writes:
    This may be a silly idea... but it seems to be the sort of thing that
    current concerns about computer security may be calling for.

    It is typical for computers to have a privileged mode of operation,
    wherein I/O operations and certain special changes to the state of the
    computer are allowed that are barred to normal computational tasks.

    For various reasons, miscreants have not been completely foiled by the
    existence of this feature.

    Some types of instruction that are required for normal computation are
    still, to a certain extent, potentially harmful.

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return
    - the return instruction being limited, sort of like a supervisor
    call, so it can only return in a proper manner.

    There are already more than five security rings in most
    processors.

    Intel: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, VMX, Enclave, SMM
    AMD: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, SVM, SMM
    ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1 (Kernel), EL0 (user)

    <snip description of useless feature>

  • From MitchAlsup1@21:1/5 to John Savard on Fri Jun 7 20:40:34 2024
    John Savard wrote:

    This may be a silly idea... but it seems to be the sort of thing that
    current concerns about computer security may be calling for.

    It is typical for computers to have a privileged mode of operation,
    wherein I/O operations and certain special changes to the state of the computer are allowed that are barred to normal computational tasks.

    For various reasons, miscreants have not been completely foiled by the existence of this feature.

    Most of the miscreations have to do with allowing microarchitectural
    state to become visible through a high precision timing mechanism,
    not with the skirting of privilege.

    Some types of instruction that are required for normal computation are
    still, to a certain extent, potentially harmful.

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return
    - the return instruction being limited, sort of like a supervisor
    call, so it can only return in a proper manner.

    In My 66000, the Monitor, Hypervisor, Supervisor, and guest can
    share the dynamic libraries containing no privileged instructions.
    And since there is only 1 such instruction it is easy to check.

    However, a Pthread can transfer control to another Pthread without
    privilege in a single instruction.

    The first reduced-privilege state would not allow any branch
    instructions, particularly conditional branches.

    Are My 66000 predication shadows considered "branching" since they
    do not alter where the Fetch end of the pipeline is working??

    Are My 66000 Switch instructions considered branches ?? since the
    transfer table is in .text and relative to the current switch
    instruction?

    Are Supervisor Calls "branches" since they go to controlled entry
    points??

    Are Supervisor Returns "branches" since they go to controlled
    return points ??

    The second, in addition, would not allow any access to memory, only
    allowing access to registers.

    To use these states to aid in security, more is required.

    For one thing, blocks of memory would need to be able to be marked as
    not only containing code or data, but as containing code that runs at
    one of these reduced privilege levels.

    How are you going to perform elementary functions {SIN, COS, EXP, LOG}?

    And then comes the payoff: a block of memory could be marked as
    writeable, but yet containing executable code, for things like
    just-in-time compilation...

    A C compiler is an application running in a different process. Why
    is a JIT "not like that" ??

    but as only containing code at one of
    these reduced privilege levels. Thus preventing the generation of code containing branches or memory accesses, as desired, while allowing the generation of computational sequences.

    John Savard

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Jun 7 20:43:23 2024
    Scott Lurndal wrote:

    John Savard <[email protected]d> writes:
    This may be a silly idea... but it seems to be the sort of thing that
    current concerns about computer security may be calling for.

    It is typical for computers to have a privileged mode of operation,
    wherein I/O operations and certain special changes to the state of the
    computer are allowed that are barred to normal computational tasks.

    For various reasons, miscreants have not been completely foiled by the
    existence of this feature.

    Some types of instruction that are required for normal computation are
    still, to a certain extent, potentially harmful.

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return
    - the return instruction being limited, sort of like a supervisor
    call, so it can only return in a proper manner.

    There are already more than five security rings in most
    processors.

    Intel: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, VMX, Enclave, SMM

    I count 5 (unused privilege levels are not real privilege levels)

    AMD: Ring 3, Ring 2 (unused), Ring 1(unused), Ring 0, SVM, SMM

    I count 4

    ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1
    (Kernel), EL0 (user)

    I count 5

    <snip description of useless feature>

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Sat Jun 8 00:06:41 2024
    On Fri, 07 Jun 2024 12:03:03 -0600, John Savard wrote:

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return -
    the return instruction being limited, sort of like a supervisor call, so
    it can only return in a proper manner.

    MULTICS lives!

    That was the next-generation kitchen-sink OS from the latter 1960s that
    was taking so long to develop, Bell Labs pulled out of the project and set about creating their own, much less ambitious OS instead, which they
    initially called “UNICS” (to indicate it was the opposite of “MULTICS”).

    MULTICS required hardware with 8 different privilege levels (rings), from
    0 (most privileged) to 7 (least privileged).

    User code normally ran at ring 4. This left 5, 6 and 7 available for
    ordinary users to impose their own additional isolation on code they
    didn’t quite trust.

    Another option, less of a hierarchy and more of a privilege matrix, would
    be to use capabilities. I think I mentioned CHERI in this newsgroup
    previously.

  • From John Dallman@21:1/5 to D'Oliveiro on Sat Jun 8 10:06:00 2024
    In article <v407ah$29fla$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    On Fri, 07 Jun 2024 12:03:03 -0600, John Savard wrote:
    So I am thinking it might be useful to have, for example, two
    states less privileged than the user state, and some mechanism
    for user programs to call subroutines which are in that state
    until they return - the return instruction being limited, sort
    of like a supervisor call, so it can only return in a proper
    manner.

    As a practical matter, ISA features requiring assembly coding are not accessible to application programmers these days, because they mostly
    don't know assembler and are scared by the idea. They also don't want to
    know about privilege levels. Such features are accessible to compiler
    writers, JIT creators and OS creators, but introducing additional
    complexity into their work is not welcome.

    If this feature existed and worked, it would hardly be used at all.

    User code normally ran at ring 4. This left 5, 6 and 7 available
    for ordinary users to impose their own additional isolation on code
    they didn't quite trust.

    Was this used? People were much more willing to do low-level programming
    in those days, but I bet this went unused.

    John

  • From John Levine@21:1/5 to All on Sat Jun 8 10:17:36 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    That was the next-generation kitchen-sink OS from the latter 1960s that
    was taking so long to develop, Bell Labs pulled out of the project and set
    about creating their own, much less ambitious OS instead, which they
    initially called “UNICS” (to indicate it was the opposite of “MULTICS”).

    Bell Labs did indeed give up on Multics, but Unix was an unofficial
    skunkworks project and the name is a joke, a castrated Multics. This
    is well documented in many Unix history papers.

    On the other hand, if Multics hadn't been so late, and so closely tied
    to expensive hardware that wasn't byte addressable and was already
    running out of address bits, who knows how much of its other features
    might have been more widely adopted. It was pretty cool that you could
    write your code one piece at a time, and if you called a routine that
    didn't exist, it would stop, you could write and compile the routine,
    and then continue the original program.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From EricP@21:1/5 to Scott Lurndal on Sat Jun 8 12:01:56 2024
    Scott Lurndal wrote:
    John Savard <[email protected]d> writes:
    This may be a silly idea... but it seems to be the sort of thing that
    current concerns about computer security may be calling for.

    It is typical for computers to have a privileged mode of operation,
    wherein I/O operations and certain special changes to the state of the
    computer are allowed that are barred to normal computational tasks.

    For various reasons, miscreants have not been completely foiled by the
    existence of this feature.

    Some types of instruction that are required for normal computation are
    still, to a certain extent, potentially harmful.

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return
    - the return instruction being limited, sort of like a supervisor
    call, so it can only return in a proper manner.

    There are already more than five security rings in most
    processors.

    Intel: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, VMX, Enclave, SMM
    AMD: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, SVM, SMM
    ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1 (Kernel), EL0 (user)

    VAX had 4 modes, User, Supervisor, Executive, Kernel.
    VMS used Super for debugger and the command language DCL,
    Exec was mostly for the file system.
    Kernel was for the core of the OS.

    What they found was that not only did they not need 4 levels,
    it was a pointless overhead to have to constantly switch between them.
    (There is a pretty high penalty to switching modes, copying in args,
    validating args, doing something usually simple, then switching back,
    when it is all the OS's code anyway.)

    I don't know what privileges Unix on VAX used but it was
    probably 2 levels because PDP-11 had only 2 levels.

    Alpha had 3 levels, User, Supervisor, and a higher third mode called
    PAL for Privileged Architecture Library. It was supposed to be thought
    of like microcode, privileged subroutines. Then PAL mode was used to
    emulate the 4 levels that VMS expected when they ported it.

    (I think PAL mode was a way to patent a feature that made the
    ISA impossible to copy without their permission,
    and therefore someone can't take DEC's executables and run them
    on a clone processor, like what happened to IBM with Amdahl.)

    WinNT was written to be portable so the lowest common denominator
    is 2 levels, User and Super, and everything worked just fine.

  • From MitchAlsup1@21:1/5 to EricP on Sat Jun 8 17:37:46 2024
    EricP wrote:

    Scott Lurndal wrote:
    John Savard <[email protected]d> writes:
    This may be a silly idea... but it seems to be the sort of thing that
    current concerns about computer security may be calling for.

    It is typical for computers to have a privileged mode of operation,
    wherein I/O operations and certain special changes to the state of the
    computer are allowed that are barred to normal computational tasks.

    For various reasons, miscreants have not been completely foiled by the
    existence of this feature.

    Some types of instruction that are required for normal computation are
    still, to a certain extent, potentially harmful.

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return
    - the return instruction being limited, sort of like a supervisor
    call, so it can only return in a proper manner.

    There are already more than five security rings in most
    processors.

    Intel: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, VMX, Enclave, SMM
    AMD: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, SVM, SMM
    ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1
    (Kernel), EL0 (user)

    VAX had 4 modes, User, Supervisor, Executive, Kernel.
    VMS used Super for debugger and the command language DCL,
    Exec was mostly for the file system.
    Kernel was for the core of the OS.

    What they found was that not only did they not need 4 levels,
    it was a pointless overhead to have to constantly switch between them.
    (There is a pretty high penalty to switching modes, copying in args, validating args, doing something usually simple, then switching back,
    when it is all the OS's code anyway.)

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    But for similar reasons ring 1 and 2 are not used in x86 machines,
    either. {{Now, if we could just go back to 1982 and not invent
    IDTs, and call gates, .....}}

    I don't know what privileges Unix on VAX used but it was
    probably 2 levels because PDP-11 had only 2 levels.

    Alpha had 3 levels, User, Supervisor, and a higher third mode called
    PAL for Privileged Architecture Library. It was supposed to be thought
    of like microcode, privileged subroutines. Then PAL mode was used to
    emulate the 4 levels that VMS expected when they ported it.

    PAL was microcode in <fast> ROM in the native ISA.

    (I think PAL mode was a way to patent a feature that made the
    ISA impossible to copy without their permission,
    and therefore someone can't take DEC's executables and run them
    on a clone processor, like what happened to IBM with Amdahl.)

    Worked real well for them !!

    WinNT was written to be portable so the lowest common denominator
    is 2 levels, User and Super, and everything worked just fine.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun Jun 9 00:21:01 2024
    On Sat, 8 Jun 2024 10:17:36 -0000 (UTC), John Levine wrote:

    On the other hand, if Multics hadn't been so late, and so closely tied
    to expensive hardware that wasn't byte addressable and was already
    running out of address bits, who knows how much of its other features
    might have been more widely adopted.

    Multics did finally reach production, and was still available for purchase
    into the 1980s. There is a brochure from Honeywell (who took over the GE computer business), dated 1982, that touts its features. And considering
    it had been something like 15 years since development commenced at that
    point, it doesn’t look dated at all. What other platforms from that time
    had integrated graphics, text processing and database support?

    I heard somewhere that Honeywell had no idea what to do with this MULTICS thing. They priced it at something obscene (seven figures) to put off
    people buying it, but some customers stayed loyal, regardless.

  • From MitchAlsup1@21:1/5 to BGB on Sun Jun 9 00:29:42 2024
    BGB wrote:

    On 6/8/2024 11:01 AM, EricP wrote:
    Scott Lurndal wrote:
    John Savard <[email protected]d> writes:

    Though, the time returned by the CPUID microsecond timer is not
    currently the same as the one given by "TK_GetTimeUS()", where the
    latter effectively gives a 64-bit value (conceptually) representing the
    number of microseconds since 1/1/1970; though with the kernel currently
    assuming that its build-time is the starting time for the clock (and
    none of the FPGA boards support a hardware clock, and one would need
    internet access to use NTP, ...).

    A 64-bit value in microseconds can express around +/- 300k years, which
    should be plenty.

    What do you do when you need a 200 picosecond timer ?? (5GHz cycle
    counter)

    A 64-bit value expressed in seconds could express values relative to
    the current age of the universe, but this is likely unnecessary for most
    purposes, and ability to express fractions of a second is likely more
    useful than the ability to express the age of the universe.
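
The ranges quoted above are easy to check with a few lines of arithmetic. This is a rough sketch: the helper name `signed_range_years` is made up for illustration, a Julian year of 365.25 days is assumed, and 13.8 billion years is the commonly cited estimate for the age of the universe.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year (assumption)
AGE_OF_UNIVERSE_YEARS = 13.8e9            # commonly cited estimate (assumption)


def signed_range_years(bits, ticks_per_second):
    """Years representable on one side of zero by a signed counter of `bits` bits."""
    max_ticks = 2 ** (bits - 1) - 1
    return max_ticks / ticks_per_second / SECONDS_PER_YEAR


us_years = signed_range_years(64, 1_000_000)    # 64-bit microseconds
s_years = signed_range_years(64, 1)             # 64-bit seconds
ps_128_years = signed_range_years(128, 10**12)  # 128-bit picoseconds

print(f"64-bit microseconds: +/- {us_years:,.0f} years")
print(f"64-bit seconds:      +/- {s_years:.3g} years")
print(f"128-bit picoseconds: +/- {ps_128_years:.3g} years")
```

Under these assumptions the 64-bit microsecond counter covers a little under 300k years either side of the epoch, and the 64-bit seconds counter covers roughly twenty times the age of the universe.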

    Interesting factoid::
    The universe is currently 10^80 Planck times old since Big Bang,
    and universe will die around 10^80 years,
    and there are about 10^80-10^88 particles in the universe.

    Granted, one could use a 128-bit value, and have both (and in
    picoseconds if they wanted). But, this would be overkill.

    Or, go extra overkill, and use 256 bits, to express the current age of
    the universe in Planck units...

    160-bits will be shown to be sufficient to count Planck times.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Jun 9 02:23:35 2024
    On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    “Virtualization” was bandied about in the 1980s more as an idle, theoretical concept rather than a practical one.

    The question was: was the instruction set defined so that code designed
    to run in a privileged mode could be run unprivileged, with any attempt
    to do privileged things being trapped and emulated by the real
    privileged code? And was there nothing it could do to discover that it
    wasn’t running in privileged mode?

    (Obviously performance was not the issue here, but correctness was.)

    For example, the VAX had a MOVPSL instruction that allowed read-only
    access to the entire processor status register. Through this,
    nonprivileged user-mode code could discover it was running in user mode,
    which would blow the illusion.
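
The MOVPSL problem can be shown with a toy trap-and-emulate model. This is a sketch under stated assumptions, not a model of any real VAX hypervisor: the operation names `READ_STATUS` and `PRIV_OP` and the function `run_guest` are invented for illustration. The guest kernel really runs in user mode, and the question is whether reading the status register traps to the monitor or executes directly.

```python
def run_guest(program, status_read_traps):
    """Run a deprivileged guest kernel and return what it observes.

    status_read_traps=False models a MOVPSL-like instruction that executes
    directly in user mode; True models an ISA where the read traps and the
    monitor emulates it.
    """
    real_mode = "user"       # the guest actually runs unprivileged
    virtual_mode = "kernel"  # the mode the monitor pretends the guest is in
    observed = []
    for op in program:
        if op == "READ_STATUS":
            if status_read_traps:
                # Trap: the monitor answers from the virtual machine state.
                observed.append(virtual_mode)
            else:
                # No trap: the real hardware state leaks to the guest.
                observed.append(real_mode)
        elif op == "PRIV_OP":
            # A privileged instruction traps in user mode and is emulated.
            observed.append("emulated")
    return observed


leaky = run_guest(["PRIV_OP", "READ_STATUS"], status_read_traps=False)
clean = run_guest(["PRIV_OP", "READ_STATUS"], status_read_traps=True)
```

In the first run the guest sees `user` and can tell it has been deprivileged; in the second it sees `kernel`, so the illusion holds.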

    The Motorola 680x0 family was I think properly virtualizable in this
    sense. Or maybe the 68020 and 68030 were, but the 68040 wasn’t. I think
    the Motorola engineers working on the ’040 asked if any customers were
    interested in preserving the self-virtualization feature, and nobody
    seemed to care.

  • From Anton Ertl@21:1/5 to [email protected] on Sun Jun 9 12:25:44 2024
    [email protected] (MitchAlsup1) writes:
    EricP wrote:
    Alpha had 3 levels, User, Supervisor, and a higher third mode called
    PAL for Privileged Architecture Library. It was supposed to be thought
    of like microcode, privileged subroutines. Then PAL mode was used to
    emulate the 4 levels that VMS expected when they ported it.

    PAL was microcode in <fast> ROM in the native ISA.

    What is called when you perform a PAL call (at least on EV45, but most
    likely on all Alphas) is Alpha code, and it resides in RAM and is
    loaded there from the boot loader. I know, because I enhanced the PAL
    code supplied with the MILO boot loader for EV45 to activate the full
    16KB of D-cache (rather than just 8KB).

    It also uses less specials than I expected; e.g., on the EV45 the IMB (instruction-memory barrier) PAL call is implemented by just executing
    a big chunk of code such that the previous contents of the I-cache are
    evicted, while I expected that it would set a bit in a model-specific
    register.

    (I think PAL mode was a way to patent a feature that made the
    ISA impossible to copy without their permission,
    and therefore someone can't take DEC's executables and run them
    on a clone processor, like what happened to IBM with Amdahl.)

    Worked real well for them !!

    Definitely. Note that the first Amdahl machine shipped 11 years after
    the first S/360. Alpha was canceled 9 years after the first Alpha was
    shipped.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Schultz@21:1/5 to Lawrence D'Oliveiro on Sun Jun 9 08:34:29 2024
    On 6/8/24 9:23 PM, Lawrence D'Oliveiro wrote:
    The Motorola 680x0 family was I think properly virtualizable in this
    sense. Or maybe the 68020 and 68030 were, but the 68040 wasn’t. I think
    the Motorola engineers working on the ’040 asked if any customers were
    interested in preserving the self-virtualization feature, and nobody
    seemed to care.

    The 68010 made the move from SR instruction privileged.

    CP/M-68K V1.2 added support for the 68010. The exception handler would
    patch the code to change a move from SR into a move from CCR.
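
A patch of that kind amounts to flipping one bit in the opcode word. The sketch below is not the CP/M-68K handler itself, which is not shown in this thread; the function name `patch_opword` is made up, and the encodings (MOVE from SR as 0x40C0 plus a 6-bit effective-address field, MOVE from CCR as 0x42C0 plus the same field) are quoted from memory of the 68000 family documentation and should be checked against the programmer's reference manual.

```python
MOVE_FROM_SR = 0x40C0   # 0100 0000 11 <ea>  (privileged on the 68010 and later)
MOVE_FROM_CCR = 0x42C0  # 0100 0010 11 <ea>  (unprivileged, 68010 and later)
EA_MASK = 0x003F        # low six bits: effective-address mode and register


def patch_opword(opword):
    """Rewrite a MOVE-from-SR opcode word as MOVE-from-CCR, keeping the EA field.

    Any other opcode word is returned unchanged.
    """
    if (opword & 0xFFC0) == MOVE_FROM_SR:
        return MOVE_FROM_CCR | (opword & EA_MASK)
    return opword


# MOVE SR,-(SP) becomes MOVE CCR,-(SP); a NOP (0x4E71) is left alone.
patched = patch_opword(0x40E7)
untouched = patch_opword(0x4E71)
```

A real handler would also have to confirm that the faulting instruction is the one being patched and that the program only wanted the condition codes, since the upper byte of SR is not visible through CCR.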


    --
    http://davesrocketworks.com
    David Schultz

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Sun Jun 9 14:13:25 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    “Virtualization” was bandied about in the 1980s more as an idle,
    theoretical concept rather than a practical one.

    I'm quite sure that IBM would disagree with this statement.

  • From John Savard@21:1/5 to All on Sun Jun 9 09:14:11 2024
    On Fri, 7 Jun 2024 20:40:34 +0000, [email protected] (MitchAlsup1)
    wrote:

    Are Supervisor Calls "branches" since they go to controlled entry
    points??

    Well, they're a kind of subroutine call. But they're really
    instructions that initiate the computer's response to an interrupt,
    which is what makes the entry point controlled and the instruction
    able to increase privilege.

    How are you going to perform elementary functions {SIN, COS, EXP, LOG}?

    Just because the feature exists doesn't mean it needs to be used for everything. Ordinary subroutine calls will still exist, so if these
    routines require scratchpad memory, that will be fine.

    A C compiler is an application running in a different process. Why
    is a JIT "not like that" ??

    A C compiler doesn't save data in memory that can then be executed. It
    writes to a file. The linking loader, instead, is "like" a JIT
    compiler in that respect.

    John Savard

  • From John Savard@21:1/5 to [email protected] on Sun Jun 9 09:26:42 2024
    On Fri, 07 Jun 2024 12:03:03 -0600, John Savard
    <[email protected]d> wrote:

    The first reduced-privilege state would not allow any branch
    instructions, particularly conditional branches.

    The second, in addition, would not allow any access to memory, only
    allowing access to registers.

    Maybe I haven't made clear what this is _for_ as I thought it would be
    obvious.

    If no branches... then no need for retpolines and stuff.

    If no access to memory... no worries about rowhammer.

    Given that, a third mode - not reduced-privilege so much as
    reduced-efficiency - suggests itself.

    Cause some code to be executed... without any speculative execution;
    allow branches, but don't execute anything until where the branch goes
    is fully resolved.

    This deals with Spectre and friends.

    So the idea is to give an unprivileged user application, like a web
    browser, a capability, without going through the operating system, to
    run code that is sandboxed in appropriate ways to prevent it from
    causing trouble although it is untrusted.

    That browsers have to be able to run untrusted JavaScript (and,
    formerly, even Java and Flash, which have now been discarded) to
    support the flexibility desired for modern web sites... has been the
    basic reason why computers today are insecure. If the only code that
    ran on computers was trusted code, then the virus situation would be
    like it was back in the days of 8-bit computers; except for
    supply-chain attacks, just don't run pirated software, and you're
    pretty much safe.

    John Savard

  • From John Savard@21:1/5 to All on Sun Jun 9 09:16:44 2024
    On Fri, 07 Jun 2024 18:18:33 GMT, [email protected] (Scott Lurndal)
    wrote:

    There are already more than five security rings in most
    processors.

    Intel: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, VMX, Enclave, SMM
    AMD: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, SVM, SMM
    ARM64: Realm Monitor, EL3 (Secure monitor), EL2(Hypervisor), EL1 (Kernel), EL0 (user)

    Yes, but these are multiple levels _higher_ than User, and what I was
    talking about were levels *lower* than User, so I fail to see how this indicates my idea isn't new.

    Or perhaps your complaint is simply that we have too many levels
    already. But that's somebody else's fault, and doesn't bear on whether
    the feature I suggest might be useful.

    John Savard

  • From John Savard@21:1/5 to Anton Ertl on Sun Jun 9 11:10:17 2024
    On Sun, 09 Jun 2024 16:52:45 GMT, [email protected]
    (Anton Ertl) wrote:

    The proper answer to hardware bugs is not adding software limitations,
    nor software mitigations (what the hardware makers suggest), but to
    fix the hardware.

    In the case of Spectre, fixing the hardware has a cost in performance.
    So allowing the processor to run code with out-of-order execution
    turned off for that code is a way to limit the performance loss to the untrusted code.

    And this would work well on my Concertina II architecture, where VLIW
    features, such as the break bit, and extended register banks of 128
    registers each, are present. Code can be generated that avoids
    register hazards when run in order.

    John Savard

  • From Anton Ertl@21:1/5 to John Savard on Sun Jun 9 16:52:45 2024
    John Savard <[email protected]d> writes:
    If no branches... then no need for retpolines and stuff.

    If no access to memory... no worries about rowhammer.

    The proper answer to hardware bugs is not adding software limitations,
    nor software mitigations (what the hardware makers suggest), but to
    fix the hardware.

    Given that, a third mode - not reduced-privilege so much as
    reduced-efficiency - suggests itself.

    That would be one fix, but fixes that cost less performance are
    possible.

    Cause some code to be executed... without any speculative execution;
    allow branches, but don't execute anything until where the branch goes
    is fully resolved.

    This deals with Spectre and friends.

    So the idea is to give an unprivileged user application, like a web
    browser, a capability, without going through the operating system, to
    run code that is sandboxed in appropriate ways to prevent it from
    causing trouble although it is untrusted.

    That browsers have to be able to run untrusted JavaScript

    In general JavaScript cannot be executed without branches nor without
    memory accesses. Therefore your modes will not be used for
    JavaScript.

    has been the
    basic reason why computers today are insecure.

    There is certainly something to that, even without hardware bugs.
    JavaScript offers a huge attack surface, and lots of software-only vulnerabilities have been found in JavaScript engines over the
    decades. One way to deal with that problem is to disable JavaScript.

    But JavaScript and hardware bugs are not the only security problems on computers today.

    If the only code that
    ran on computers was trusted code, then the virus situation would be
    like it was back in the days of 8-bit computers; except for
    supply-chain attacks, just don't run pirated software, and you're
    pretty much safe.

    That's naive. All kinds of "trusted" software has vulnerabilities,
    and hardware bugs make things worse.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Savard on Sun Jun 9 18:21:32 2024
    John Savard wrote:

    On Sun, 09 Jun 2024 16:52:45 GMT, [email protected]
    (Anton Ertl) wrote:

    The proper answer to hardware bugs is not adding software limitations,
    nor software mitigations (what the hardware makers suggest), but to
    fix the hardware.

    In the case of Spectre, fixing the hardware has a cost in performance.

    It does not have to have any performance cost.

    So allowing the processor to run code with out-of-order execution
    turned off for that code is a way to limit the performance loss to the untrusted code.

    I, personally, do not trust any code; application of supervision.

    And this would work well on my Concertina II architecture, where VLIW features, such as the break bit, and extended register banks of 128
    registers each, are present. Code can be generated that avoids
    register hazards when run in order.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Savard on Sun Jun 9 18:19:07 2024
    John Savard wrote:

    On Fri, 07 Jun 2024 12:03:03 -0600, John Savard <[email protected]d> wrote:

    The first reduced-privilege state would not allow any branch
    instructions, particularly conditional branches.

    The second, in addition, would not allow any access to memory, only
    allowing access to registers.

    Maybe I haven't made clear what this is _for_ as I thought it would be obvious.

    If no branches... then no need for retpolines and stuff.

    My 66000 needs no retpolines for external calls/returns or for
    SVCs and SVRs.

    If no access to memory... no worries about rowhammer.

    Rowhammer can be eliminated without restricting access to memory.

    Given that, a third mode - not reduced-privilege so much as reduced-efficiency - suggests itself.

    Cause some code to be executed... without any speculative execution;
    allow branches, but don't execute anything until where the branch goes
    is fully resolved.

    This deals with Spectre and friends.

    Spectré exploits the inability to keep microarchitectural state hidden
    from architectural state. Design the pipeline correctly and you won't
    have Spectré, Meltdown, or friends...

    So the idea is to give an unprivileged user application, like a web
    browser, a capability, without going through the operating system, to
    run code that is sandboxed in appropriate ways to prevent it from
    causing trouble although it is untrusted.

    That browsers have to be able to run untrusted JavaScript (and,
    formerly, even Java and Flash, which have now been discarded) to
    support the flexibility desired for modern web sites... has been the
    basic reason why computers today are insecure. If the only code that
    ran on computers was trusted code, then the virus situation would be
    like it was back in the days of 8-bit computers; except for
    supply-chain attacks, just don't run pirated software, and you're
    pretty much safe.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Jun 9 22:38:34 2024
    On Sun, 09 Jun 2024 12:25:44 GMT, Anton Ertl wrote:

    It also uses less specials than I expected; e.g., on the EV45 the IMB (instruction-memory barrier) PAL call is implemented by just executing
    a big chunk of code such that the previous contents of the I-cache are evicted, while I expected that it would set a bit in a model-specific register.

    I find that somehow amusing and horrifying at the same time ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Sun Jun 9 22:41:12 2024
    On Sun, 09 Jun 2024 14:13:25 GMT, Scott Lurndal wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    “Virtualization” was bandied about in the 1980s more as an idle,
    theoretical concept rather than a practical one.

    I'm quite sure that IBM would disagree with this statement.

    I’m sure they would. But they invented virtualization in CP/CMS because
    their attempt at an “interactive timesharing” system, CMS, was only single-user. Rather than make it multiuser, they simply invented CP as a
    big hack to run multiple copies of CMS, so each user felt they had an
    entire machine to themself.

    There were some privilege holes in that, as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sun Jun 9 22:45:43 2024
    On Sun, 9 Jun 2024 12:36:09 -0500, BGB wrote:

    OTOH: A 32-bit value in seconds will overflow in 2038, so isn't really sufficient at this point.

    A signed 32-bit value overflows in 2038, an unsigned value gives you a
    little bit more breathing room.
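
    As a quick check of those two limits, here is a minimal Python sketch
    that computes the last second a 32-bit count of seconds since
    1970-01-01 UTC can hold, signed and unsigned. The helper name
    last_representable is made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_representable(bits: int, signed: bool) -> datetime:
    """Last instant a `bits`-wide count of seconds since 1970 can hold."""
    max_value = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    return UNIX_EPOCH + timedelta(seconds=max_value)

print(last_representable(32, signed=True))   # signed 32-bit limit
print(last_representable(32, signed=False))  # unsigned 32-bit limit
```

    The signed case runs out on 2038-01-19; the unsigned case lasts until
    2106-02-07, at the cost of not being able to represent times before 1970.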

    32-bit builds of the Linux kernel already offer the option for a 64-bit
    time_t. And Debian, for one, is currently in the middle of transitioning
    its 32-bit builds to that.

    How can I tell, on my 64-bit system? Because all the affected packages
    have acquired “t64” suffixes on their names, and this applies to all architectures. I think this is temporary, though: once everything is fully compatible again, the “t64” suffixes will disappear. Presumably in time
    for the next stable release.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Sun Jun 9 22:48:28 2024
    On Sun, 09 Jun 2024 09:14:11 -0600, John Savard wrote:

    A C compiler doesn't save data in memory that can then be executed. It
    writes to a file.

    That’s an implementation issue. Back in the day, there were such things as “load-and-go” compilers. E.g. the Waterloo Fortran that I used in some undergraduate courses. I’m sure people would have done ones for C.

    These days, there are things called “JIT” (“just-in-time”) compilers.

    And just as a further nitpick (if the above weren’t enough), what happens
    if the “file” your C compiler is writing to is in a RAM disk?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to BGB on Mon Jun 10 07:53:08 2024
    BGB wrote:
    Though, there are some instructions which are currently allowed in user
    mode but which it could make sense to trap in some contexts, such as
    CPUID, or potentially just parts of CPUID, ...

    Say, for example, CPUID has several pieces of information available:
      CPU type and features;
      Microsecond timer (local);
      Clock cycle timer;
      Hardware RNG;
      ...

    In various contexts, it may be reasonable to want to trap and emulate
    some of these while still allowing others to be unhindered.

    Yeah.

    Though, the time returned by the CPUID microsecond timer is not
    currently the same as the one given by "TK_GetTimeUS()", where the
    latter effectively gives a 64-bit value (conceptually) representing the number of microseconds since 1/1/1970; though with the kernel currently assuming that its build-time is the starting time for the clock (and
    none of the FPGA boards support a hardware clock, and one would need internet access to use NTP, ...).

    A 64-bit value in microseconds can express around +/- 300k years, which should be plenty.

    Experience has shown that microsecond resolution is NOT good enough,
    i.e. GPS timing receivers can typically give you ~25 ns RMS accuracy for
    less than $100.

    WinNT settled on 64-bit 100 ns ticks from 1601-01-01, which has turned
    out to be pretty good, but (see above) not quite good enough for all uses.
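
    To put the resolution/range trade-off in numbers, here is a rough
    Python sketch. It treats every counter as a signed 64-bit tick count
    and uses a 365.25-day year; both are simplifying assumptions made
    only for comparison, and signed_range_years is a made-up helper name.

```python
SECONDS_PER_YEAR = 365.25 * 86400  # Julian year, used only for a rough scale

def signed_range_years(bits: int, tick_seconds: float) -> float:
    """Approximate +/- span, in years, of a signed `bits`-wide tick counter."""
    return (2 ** (bits - 1)) * tick_seconds / SECONDS_PER_YEAR

for label, tick in (("1 us", 1e-6), ("100 ns", 1e-7), ("1 ns", 1e-9)):
    print(f"{label:>6}: about +/- {signed_range_years(64, tick):,.0f} years")
```

    Each factor of ten in resolution costs a factor of ten in range, so a
    64-bit nanosecond counter still covers a few centuries either way.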

    Modern Unix typically provides 64-bit time_t seconds and a (effectively)
    30-bit ns field, so you can store them in a 96-bit container but I don't
    think anyone does that?

    If you have a lot of such timestamps I would suggest you instead
    truncate the time_t seconds field to just the classic 32 bits and use windowing around the current (full resolution) time.
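
    The windowing idea can be sketched as follows, under the assumption
    that every stored timestamp lies within about 68 years (2**31
    seconds) of the current time. The function name expand_seconds is
    hypothetical.

```python
SPAN = 1 << 32   # number of distinct values in the truncated field
HALF = SPAN >> 1

def expand_seconds(low32: int, now: int) -> int:
    """Rebuild a full seconds value from its low 32 bits.

    Picks the one value congruent to low32 (mod 2**32) that lies in the
    window [now - 2**31, now + 2**31).
    """
    # Signed distance from `now` to the stored value, folded into the window.
    delta = ((low32 - now + HALF) % SPAN) - HALF
    return now + delta

now = 5_000_000_000               # a full-width current time, past 2**32
stamp = now - 1_000               # a timestamp from 1000 seconds earlier
stored = stamp & 0xFFFFFFFF       # what actually gets kept: 32 bits
print(expand_seconds(stored, now) == stamp)
```

    The stored field never overflows in any meaningful sense; it only
    becomes ambiguous for timestamps further than half the window from now.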


    A 64-bit value expressed in seconds could express values relative to the current age of the universe, but this is likely unnecessary for most purposes, and ability to express fractions of a second is likely more
    useful than the ability to express the age of the universe.

    NTP only needs relative timestamps, so Dr Mills settled on 32-bit
    seconds (since 1900!) + 32-bit fractions, so NTP timestamps have roughly
    0.25 ns resolution. The latter corresponds to 5 cm of fiber optic
    transmission delay.
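
    The arithmetic behind those figures, as a small Python sketch. The
    signal speed in fiber is taken as roughly 2e8 m/s (about two thirds
    of c), which is an assumption, and the conversion only handles NTP
    era 0 (no rollover handling). Function names are made up.

```python
NTP_TO_UNIX_OFFSET = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01
FIBER_SPEED_M_PER_S = 2.0e8         # assumed propagation speed in fiber

def ntp_resolution_ns() -> float:
    """Size of one unit of the 32-bit fraction field, in nanoseconds."""
    return 1e9 / 2**32

def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 32.32 NTP timestamp (era 0) to Unix seconds."""
    return (seconds - NTP_TO_UNIX_OFFSET) + fraction / 2**32

resolution_ns = ntp_resolution_ns()
distance_cm = resolution_ns * 1e-9 * FIBER_SPEED_M_PER_S * 100
print(f"resolution: {resolution_ns:.3f} ns, about {distance_cm:.1f} cm of fiber")
```

    This gives a step of about 0.23 ns, or a little under 5 cm of fiber
    at the assumed speed.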


    Granted, one could use a 128-bit value, and have both (and in
    picoseconds if they wanted). But, this would be overkill.

    Or, go extra overkill, and use 256 bits, to express the current age of
    the universe in Planck units...

    :-)

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Jun 10 07:04:10 2024
    Terje Mathisen <[email protected]> writes:
    Modern Unix typically provides 64-bit time_t seconds and a (effectively)
    30-bit ns field, so you can store them in a 96-bit container but I don't
    think anyone does that?

    man time_t tells me:

    |time_t
    | ...
    | Used for time in seconds. According to POSIX, it shall be an
    | integer type.

    |timespec
    | ...
    | struct timespec {
    |     time_t tv_sec;  /* Seconds */
    |     long   tv_nsec; /* Nanoseconds */
    | };
    |
    | Describes times in seconds and nanoseconds.
    |
    | Conforming to: C11 and later; POSIX.1-2001 and later.

    So if you have a 64-bit time_t, the C standard does that, and POSIX
    does it earlier. Typical ABIs pad struct timespec to 128 bits,
    though.
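
    The padding effect can be illustrated with Python's ctypes, which
    lays structures out using the platform's native alignment rules.
    This is only a sketch: Timespec64 is a hypothetical layout with a
    32-bit nanosecond field, whereas the real struct timespec declares
    tv_nsec as long, which is already 64 bits wide on LP64 systems.

```python
import ctypes

class Timespec64(ctypes.Structure):
    """Hypothetical layout: 64-bit seconds plus a 32-bit nanosecond field."""
    _fields_ = [
        ("tv_sec", ctypes.c_int64),
        ("tv_nsec", ctypes.c_int32),  # 0..999_999_999 fits in 30 bits
    ]

# Bits actually needed by the fields versus bits the struct occupies.
payload_bits = sum(ctypes.sizeof(ftype) for _, ftype in Timespec64._fields_) * 8
stored_bits = ctypes.sizeof(Timespec64) * 8
print(f"payload: {payload_bits} bits, stored: {stored_bits} bits")
```

    On ABIs that align 64-bit integers to 8 bytes, the 96-bit payload is
    rounded up to 128 bits, so nothing is saved by the narrower field.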

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Savard on Mon Jun 10 07:16:48 2024
    John Savard <[email protected]d> writes:
    On Sun, 09 Jun 2024 16:52:45 GMT, [email protected]
    (Anton Ertl) wrote:

    The proper answer to hardware bugs is not adding software limitations,
    nor software mitigations (what the hardware makers suggest), but to
    fix the hardware.

    In the case of Spectre, fixing the hardware has a cost in performance.

    How do you know?

    Papers on so-called "invisible speculation" schemes have reported
    slowdowns <10% for the more advanced schemes, with IIRC some even
    reporting a speedup.

    The main thing such solutions cost is area and design time. Ok, one
    can argue that the area and design time could also be spent on making
    faster vulnerable hardware, and then shift the responsibility for
    dealing with the vulnerabilities to software, where those mitigations
    that can be generally applied cost more than a factor of 2.

    So allowing the processor to run code with out-of-order execution
    turned off for that code is a way to limit the performance loss to the
    untrusted code.

    Your trust in "trusted code" is unfounded.

    And this would work well on my Concertina II architecture, where VLIW
    features, such as the break bit, and extended register banks of 128
    registers each, are present. Code can be generated that avoids
    register hazards when run in order.

    How do "register hazards" come into play?

    But I have seen similar trains of thought several times from static
    scheduling advocates. They see Spectre as the opportunity to tout
    their uncompetitive solutions by advocating solutions (like disabling
    speculation) that maximize the performance loss.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to [email protected] on Mon Jun 10 01:26:23 2024
    On Sun, 9 Jun 2024 22:48:28 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    And just as a further nitpick (if the above weren’t enough), what happens
    if the “file” your C compiler is writing to is in a RAM disk?

    Well, the output could be stored with no problem, because while it's
    on the RAM disk, it can't be executed. It has to be copied from the
    RAM disk, into memory that's not pretending to be a disk, by the
    loader. So this case doesn't change anything from the case of a real
    disk.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to [email protected] on Mon Jun 10 07:15:05 2024
    [email protected] (MitchAlsup1) writes:
    My 66000 needs no retpolines for external calls/returns or for
    SVCs and SVRs.

    Retpolines are used for performing indirect branches without
    activating the indirect-branch predictor.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Mon Jun 10 10:57:20 2024
    MitchAlsup1 wrote:
    EricP wrote:

    Scott Lurndal wrote:
    John Savard <[email protected]d> writes:
    This may be a silly idea... but it seems to be the sort of thing that
    current concerns about computer security may be calling for.

    It is typical for computers to have a privileged mode of operation,
    wherein I/O operations and certain special changes to the state of the
    computer are allowed that are barred to normal computational tasks.

    For various reasons, miscreants have not been completely foiled by the
    existence of this feature.

    Some types of instruction that are required for normal computation are
    still, to a certain extent, potentially harmful.

    So I am thinking it might be useful to have, for example, two states
    less privileged than the user state, and some mechanism for user
    programs to call subroutines which are in that state until they return
    - the return instruction being limited, sort of like a supervisor
    call, so it can only return in a proper manner.

    There are already more than five security rings in most
    processors.

    Intel: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, VMX, Enclave, SMM
    AMD: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, SVM, SMM
    ARM64: Realm Monitor, EL3 (Secure monitor), EL2 (Hypervisor), EL1 (Kernel),
    EL0 (user)

    VAX had 4 modes, User, Supervisor, Executive, Kernel.
    VMS used Super for debugger and the command language DCL,
    Exec was mostly for the file system.
    Kernel was for the core of the OS.

    What they found was that not only did they not need 4 levels,
    it was a pointless overhead to have to constantly switch between them.
    (There is a pretty high penalty to switching modes, copying in args,
    validating args, doing something usually simple, then switching back,
    when it is all the OS's code anyway.)

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    According to these DEC'ers, work on a Virtual Machine Monitor (VMM)
    for VAX began in 1981 for a high A1-level secure system.
    It required rewriting some microcode.

    Virtualizing the VAX Architecture, 1991
    https://homes.cs.aau.dk/~kleist/Courses/nds-e05/papers/virtual-vax.pdf

    But for similar reasons ring 1 and 2 are not used in x86 machines,
    either. {{Now, if we could just go back to 1982 and not invent IDTs, and
    call gates, .....}}

    I don't know what privileges Unix on VAX used but it was
    probably 2 levels because PDP-11 had only 2 levels.

    Alpha had 3 levels, User, Supervisor, and a higher third mode called
    PAL for Privileged Architecture Library. It was supposed to be thought
    of like microcode, privileged subroutines. Then PAL mode was used to
    emulate the 4 levels that VMS expected when they ported it.

    PAL was microcode in <fast> ROM in the native ISA.

    As Anton also points out elsewhere, it was normal macro instructions.

    However it does have aspects which are similar to microcode,
    which are that PAL code is stored in a writable control store that
    is a separate address space from main memory, that it has elevated
    privilege while executing allowing access to HW not otherwise allowed,
    and that interrupts are disabled while it executes.

    But I came to realize that none of that is actually *required*.
    It doesn't *need* a third privilege mode, and actually it looks
    more expensive performance wise to have one than not.
    It would be simpler and cheaper to just transition directly
    to and from Super mode without also going through PAL mode.
    And there is NO technical reason to restrict access to HW control
    register from Super mode.

    Many processors automatically disable interrupts on trap because it
    greatly simplifies the race conditions in their prologue and epilogue.
    x86 did not disable interrupts on exceptions but x64 allows it as an option.

    PAL mode does not require its own on-chip SRAM - it could exist in main
    memory addressed through a base physical register or an MMU hack.
    And having a dedicated private on-chip SRAM to hold critical OS code
    does not mean that it is microcode. I would have this for my design
    with an MMU fiddle to hard-wire a VA->PA mapping for some OS code.

    After realizing it didn't need to exist, and that PAL mode looks more
    expensive than just User/Super modes, I began to wonder why it was there.
    Which leads me to here:

    (I think PAL mode was a way to patent a feature that made the
    ISA impossible to copy without their permission,
    and therefore someone can't take DEC's executables and run them
    on a clone processor, like what happened to IBM with Amdahl.)

    Worked real well for them !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Mon Jun 10 15:28:55 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sun, 09 Jun 2024 14:13:25 GMT, Scott Lurndal wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    “Virtualization” was bandied about in the 1980s more as an idle, >>>theoretical concept rather than a practical one.

    I'm quite sure that IBM would disagree with this statement.

    I’m sure they would.

    Your attempt to scramble to avoid being wrong was unsuccessful.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Savard on Mon Jun 10 15:30:01 2024
    John Savard <[email protected]d> writes:
    On Fri, 07 Jun 2024 18:18:33 GMT, [email protected] (Scott Lurndal)
    wrote:

    There are already more than five security rings in most
    processors.

    Intel: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, VMX, Enclave, SMM
    AMD: Ring 3, Ring 2 (unused), Ring 1 (unused), Ring 0, SVM, SMM
    ARM64: Realm Monitor, EL3 (Secure monitor), EL2 (Hypervisor), EL1 (Kernel),
    EL0 (user)

    Yes, but these are multiple levels _higher_ than User, and what I was
    talking about were levels *lower* than User, so I fail to see how this
    indicates my idea isn't new.

    Or perhaps your complaint is simply that we have too many levels
    already.

    That's part of it.

    But that's somebody else's fault, and doesn't bear on whether
    the feature I suggest might be useful.

    On the face of it, your feature is not useful.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Mon Jun 10 15:23:51 2024
    EricP <[email protected]> writes:
    PAL code is stored in a writable control store that
    is a separate address space from main memory

    The way that it (the EV45 PAL code) implements the PAL-call IMB,
    i.e., by executing enough code to flush the I-cache, means that the
    PAL-code is loaded into the I-cache, so I expect that it resides in
    normal RAM. If it were in a separate memory space, there would need
    to be an additional bit in each I-cache tag that records this fact.

    But I came to realize that none of that is actually *required*.
    It doesn't *need* a third privilege mode, and actually it looks
    more expensive performance wise to have one than not.
    It would be simpler and cheaper to just transition directly
    to and from Super mode without also going through PAL mode.
    And there is NO technical reason to restrict access to HW control
    register from Super mode.

    Many processors automatically disable interrupts on trap because it
    greatly simplifies the race conditions in their prologue and epilogue.
    x86 did not disable interrupts on exceptions but x64 allows it as an option.

    PAL mode does not require its own on-chip SRAM - it could exist in main
    memory addressed through a base physical register or an MMU hack.
    And having a dedicated private on-chip SRAM to hold critical OS code
    does not mean that it is microcode. I would have this for my design
    with an MMU fiddle to hard-wire a VA->PA mapping for some OS code.

    After realizing it didn't need to exist, and that PAL mode looks more
    expensive than just User/Super modes, I began to wonder why it was there.
    Which leads me to here:

    (I think PAL mode was a way to patent a feature that made the
    ISA impossible to copy without their permission,

    Not really. If there was a patent that is specific to it being a
    different address space or a dedicated private on-chip SRAM, that
    patent could be easily circumvented by the Amdahl-alike by putting the
    PAL-code in RAM and using a base register or MMU hack, as you
    describe.

    Also if there was enough room for more on-chip SRAM on any of the
    Alpha chips, the designers would have used that room to put in
    features that make the chip faster.

    Given that ARM is able to charge an architecture licensing fee for the instruction set alone, I am sure that DEC had enough patents on its
    instruction set, no need for unnecessary and circumventable
    implementation ideas.

    One other thing they did: they had one PAL code coming with the SRM
    console for VMS and Digital OSF/1, and another PAL code with the
    ARC/AlphaBIOS console for Windows NT and Linux. This allowed them to
    charge extra (quite a lot) for hardware capable of running their
    premium OSs, while providing almost competitive prices for hardware
    running PC OSs. Unfortunately, the PC-like package was still not
    price/performance competitive, and AlphaBIOS (which we had on our EV56
    boxes) was a horror to work with.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to EricP on Mon Jun 10 18:52:06 2024
    EricP wrote:
    [snip]
    Many processors automatically disable interrupts on trap because it
    greatly simplifies the race conditions in their prologue and epilogue.
    x86 did not disable interrupts on exceptions but x64 allows it as an
    option.

    I have written a lot of x86 interrupt handlers, these chips did very
    much disable all interrupts when transferring control to my handler.

    The typical approach was to do the minimum work possible to save
    whatever HW buffer/data needed saving, before executing a STI (SeT
    Interrupt enable bit?) and then do anything else that had to be done
    while still in the primary handler.

    IRET restored flags, IP and CS, transferring control back to whatever
    was running when the hw interrupt happened.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Mon Jun 10 20:41:31 2024
    On Mon, 10 Jun 2024 18:52:06 +0200
    Terje Mathisen <[email protected]> wrote:

    EricP wrote:
    [snip]
    Many processors automatically disable interrupts on trap because it
    greatly simplifies the race conditions in their prologue and
    epilogue. x86 did not disable interrupts on exceptions but x64
    allows it as an option.

    I have written a lot of x86 interrupt handlers, these chips did very
    much disable all interrupts when transferring control to my handler.


    Intel's official terminology makes a distinction between interrupts and
    exceptions. The former are external/asynchronous, the latter are
    internal/synchronous. Exceptions are further sub-divided into faults,
    traps and aborts.
    The manual says that the IF flag is cleared when an interrupt is handled
    through an interrupt gate. It is not cleared when an interrupt is handled
    through a trap gate. At that point the manual does not say that an
    exception handled through an interrupt gate also clears the IF flag, but
    later on (in p. 6.12.1.2 in my copy of the manual) it says that it does.
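
    The gate rule described above can be written down as a toy model in
    Python. It is only a sketch of the IF behaviour: it ignores every
    other flag the processor touches on entry, and the names used here
    (eflags_on_entry and the gate constants) are made up for
    illustration. IF is bit 9 of EFLAGS.

```python
IF_FLAG = 1 << 9            # EFLAGS.IF, the interrupt-enable flag
INTERRUPT_GATE = "interrupt"
TRAP_GATE = "trap"

def eflags_on_entry(eflags: int, gate: str) -> int:
    """EFLAGS as seen by the handler; the copy saved on the stack is unchanged."""
    if gate == INTERRUPT_GATE:
        return eflags & ~IF_FLAG   # further maskable interrupts are held off
    if gate == TRAP_GATE:
        return eflags              # IF is left as it was
    raise ValueError(f"unknown gate type: {gate!r}")

running = 0x202  # IF set, plus the always-one reserved bit 1
print(hex(eflags_on_entry(running, INTERRUPT_GATE)))
print(hex(eflags_on_entry(running, TRAP_GATE)))
```

    In this model what matters is the gate type in the IDT entry, not
    whether the event was an interrupt or an exception.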

    The typical approach was to do the minimum work possible to save
    whatever HW buffer/data needed saving, before executing a STI (SeT
    Interrupt enable bit?) and then do anything else that had to be done
    while still in the primary handler.

    IRET restored flags, IP and CS, transferring control back to whatever
    was running when the hw interrupt happened.

    Terje



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Savard on Mon Jun 10 18:43:09 2024
    John Savard wrote:

    On Sun, 9 Jun 2024 22:48:28 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    And just as a further nitpick (if the above weren’t enough), what
    happens if the “file” your C compiler is writing to is in a RAM disk?

    Well, the output could be stored with no problem, because while it's
    on the RAM disk, it can't be executed. It has to be copied from the
    RAM disk, into memory that's not pretending to be a disk, by the
    loader. So this case doesn't change anything from the case of a real
    disk.

    One can create a PTE pointing at that RAM disk page and then allow
    someone to execute it directly.
    OR
    One can copy it somewhere that has execute permission in a single
    instruction (MM = memory to memory move)

    Neither is any real burden to enabling execute.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jun 10 19:02:58 2024
    Anton Ertl wrote:

    John Savard <[email protected]d> writes:

    So allowing the processor to run code with out-of-order execution
    turned off for that code is a way to limit the performance loss to the
    untrusted code.

    Your trust in "trusted code" is unfounded.

    Indeed.

    And this would work well on my Concertina II architecture, where VLIW
    features, such as the break bit, and extended register banks of 128
    registers each, are present. Code can be generated that avoids
    register hazards when run in order.

    How do "register hazards" come into play?

    Register values must appear to have been read and written as if the
    instruction stream was processed sequentially. This is the von Neumann
    paradigm.

    But I have seen similar trains of thought several times from static
    scheduling advocates. They see Spectre as the opportunity to tout
    their uncompetitive solutions by advocating solutions (like disabling
    speculation) that maximize the performance loss.

    My 66000 has made no such claim..........on static scheduling.
    My 66000 intends to have both In Order implementations and
    Great Big Out of Order implementations.

    But a funny thing happens when the ISA is sufficiently expressive
    such as my universal constants implementation::

    You lose a lot of instructions that are easily scheduled, sometimes
    to the point all you have left is the instructions at the core of the algorithm. I have several subroutines with 30-40 FMAC FU instructions
    in a row without anything else to do. No amount of code scheduling or
    OoOness helps these cases.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Terje Mathisen on Mon Jun 10 14:32:33 2024
    Terje Mathisen wrote:
    EricP wrote:
    [snip]
    Many processors automatically disable interrupts on trap because it
    greatly simplifies the race conditions in their prologue and epilogue.
    x86 did not disable interrupts on exceptions but x64 allows it as an
    option.

    I have written a lot of x86 interrupt handlers, these chips did very
    much disable all interrupts when transferring control to my handler.

    The typical approach was to do the minimum work possible to save
    whatever HW buffer/data needed saving, before executing a STI (SeT
    Interrupt enable bit?) and then do anything else that had to be done
    while still in the primary handler.

    IRET restored flags, IP and CS, transferring control back to whatever
    was running when the hw interrupt happened.

    Terje

    Yes, for x86/x64 external interrupts it raises the IRQ priority to that of
    the requesting device, masking further interrupts of the same or lower IRQ priority. Or you can explicitly disable all maskable interrupts.

    However for exceptions and NMI x86 does not mask interrupts so it is
    possible for, say, a page fault or INT instruction to trap to the OS,
    saving a frame on the stack, and just then an external interrupt to
    arrive, saving another frame.

    On the return from the interrupt or exception (we want a common return
    code path) we need to know if this is a First Level Exception/Interrupt.
    If not, we take the simple path and just REI Return Exception or Interrupt.
    If it is a FLEI then we need to check for deferred work and jump into
    the OS. Also if we are returning to User mode we may need to check
    for things like thread APCs/signals that arrived while we were away.
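
    The decision logic of that common return path can be sketched as a
    toy model in Python. All names here (CpuState, return_from_trap, the
    action strings) are hypothetical, and a real kernel does this in
    assembly with interrupts masked at the right moments; the sketch only
    shows the order of the checks.

```python
from dataclasses import dataclass, field

@dataclass
class CpuState:
    nesting: int = 0   # exception/interrupt frames currently on the stack
    deferred_work: list = field(default_factory=list)
    pending_signals: list = field(default_factory=list)

def return_from_trap(cpu: CpuState, prior_mode_is_user: bool) -> list:
    """Common exit path; returns the actions taken, in order."""
    actions = []
    if cpu.nesting == 1:  # First Level Exception/Interrupt
        if cpu.deferred_work:
            actions.append("run deferred work")
            cpu.deferred_work.clear()
        if prior_mode_is_user and cpu.pending_signals:
            actions.append("deliver signals")
            cpu.pending_signals.clear()
    # Nested frames take the simple path straight to the return instruction.
    cpu.nesting -= 1
    actions.append("REI")
    return actions

cpu = CpuState(nesting=2, deferred_work=["dpc"], pending_signals=["sig"])
print(return_from_trap(cpu, prior_mode_is_user=False))  # nested: simple path
print(return_from_trap(cpu, prior_mode_is_user=True))   # first level: full path
```

    Only the outermost return pays for the extra checks; every nested
    return is just the REI.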

    On x86 there is also the difference between stack frame shape
    depending on whether the prior mode was User or Super.
    On x64 they fixed this so they are the same shape.

    Then there is the difference between SYSCALL/SYSRET vs SYSENTER/SYSEXIT,
    and that one did not set the system stack pointer on entry,
    which leaves a security hole if an interrupt arrives just before
    you can patch it.

    And there was the NMI race condition bug, details of which I have
    forgotten but was again something to do with the system stack not
    being set correctly after switching to Super and then an NMI arrives
    which does not set the stack because the prior mode was already Super.

    It's not that these are not handleable, it's that it takes literally
    hundreds of instructions in the x86/x64 prologues and epilogues closing
    each of these holes and idiosyncrasies. And that's on top of the already
    large clock cost for the IDT and call gates, and REI instructions.

    *None* of this should be necessary.
    Even the pipeline drain on mode switch should often be avoidable.

  • From John Savard@21:1/5 to All on Mon Jun 10 17:11:47 2024
    On Mon, 10 Jun 2024 18:43:09 +0000, [email protected] (MitchAlsup1)
    wrote:

    John Savard wrote:

    On Sun, 9 Jun 2024 22:48:28 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    And just as a further nitpick (if the above weren't enough), what
    happens if the "file" your C compiler is writing to is in a RAM disk?

    Well, the output could be stored with no problem, because while it's
    on the RAM disk, it can't be executed. It has to be copied from the
    RAM disk, into memory that's not pretending to be a disk, by the
    loader. So this case doesn't change anything from the case of a real
    disk.

    One can create a PTE pointing at that RAM disk page and then allow
    someone to execute it directly.
    OR
    One can copy it somewhere that has execute permission in a single
    instruction (MM = memory to memory move)

    Neither is any real burden to enabling execute.

    I'm not claiming that locating code in a RAM disk would _prevent_ a
    program from enabling its execution. Normally, though, that wouldn't
    be done just because it would mess things up for the software that is
    supposed to be in charge of reading and writing from the RAM disk if
    anything else accesses it.

    My point was entirely different. Just as a JIT compiler doesn't run
    into issues because it writes code to memory, but because it writes
    code to memory with the intent of executing it later - and enabling
    both write and execute is restricted in the case of the sort of
    security-focused system we're discussing - an ordinary compiler
    writing to a RAM disk instead of a physical disk runs into no issues.

    Writing code in memory is not an issue. Write can be enabled to
    memory. Only enabling write and execute together is potentially
    subject to restrictions.
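
    As a rough illustration of that rule, here is a toy page-protection
    model. The Page class, the Prot flags and WxViolation are hypothetical
    names invented for this sketch and do not correspond to any real OS
    interface. The only combination it refuses is write and execute
    together.

```python
from enum import Flag


class Prot(Flag):
    READ = 1
    WRITE = 2
    EXEC = 4


class WxViolation(Exception):
    """Raised when write and execute are requested together."""


class Page:
    def __init__(self):
        self.prot = Prot.READ
        self.data = b""

    def protect(self, prot):
        # The single restricted combination: WRITE and EXEC at once.
        if Prot.WRITE in prot and Prot.EXEC in prot:
            raise WxViolation("write and execute may not be enabled together")
        self.prot = prot

    def write(self, data):
        if Prot.WRITE not in self.prot:
            raise PermissionError("page is not writable")
        self.data = data

    def execute(self):
        if Prot.EXEC not in self.prot:
            raise PermissionError("page is not executable")
        return "ran %d bytes" % len(self.data)


# An ordinary compiler (or loader) only ever needs one permission at a time.
page = Page()
page.protect(Prot.READ | Prot.WRITE)
page.write(b"\x90\x90")
page.protect(Prot.READ | Prot.EXEC)
result = page.execute()
```

    Writing the code and later flipping the page to executable both work;
    only asking for both at once raises.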

    So the idea of a RAM disk doesn't change anything.

    John Savard

  • From John Savard@21:1/5 to Anton Ertl on Mon Jun 10 17:22:05 2024
    On Mon, 10 Jun 2024 07:16:48 GMT, [email protected]
    (Anton Ertl) wrote:

    John Savard <[email protected]d> writes:

    In the case of Spectre, fixing the hardware has a cost in performance.

    How do you know?

    Papers on so-called "invisible speculation" schemes have reported
    slowdowns <10% for the more advanced schemes, with IIRC some even
    reporting a speedup.

    I've heard claims - especially from Mitch Alsup - that, indeed, all
    one has to do is avoid certain _mistakes_ when designing a pipeline,
    and there's no room for Spectre any more.

    I'm no expert on these things at all, so I don't know that this can't
    be true. But I also don't know that it _is_ true.

    What does Spectre exploit? It exploits the fact that speculative
    execution keeps around data that was fetched into cache by the
    speculative execution of some code that was never supposed to be
    executed. Just in case it might be useful later.

    Obviously, keeping around any data that just happens to be
    accidentally in cache, just in case it might be useful later, does
    have a positive (but likely very slight) effect on performance. Being
    strict about what speculative execution can do, on the other hand, so
    nothing is allowed to leak information, will reduce performance... at
    least a little bit.

    It could well be that the losses aren't enough to be concerned about,
    if this is done carefully. That is, not even the 3% quoted as the cost
    of one of the earliest fixes. But since I've heard higher figures for
    the fixes for later variants, without positive knowledge, I have to be
    skeptical about claims that all possible variants of this kind of
    attack can be prevented at little cost.

    And Rowhammer is even worse. It's not at all clear to me what can be
    done without adding an expensive layer of monitoring to memory
    accesses. However, only DRAM is vulnerable to Rowhammer, and so it may
    be possible to turn cache into a bulwark against it somehow.

    John Savard

  • From MitchAlsup1@21:1/5 to All on Tue Jun 11 00:45:28 2024
    I forgot to add that Mc 88120 had these features in 1992.

    There was a staging buffer between AGEN and LDalign where up to 48
    memory reference instructions could wait for data to become available,
    for modified results to wait to be written back to DCache, for
    instructions to wait for memory order to resolve, and it could even
    retarget where modified data could go. We called this thing the
    Conditional Cache.

    Stores waited for retirement.
    Misses waited for retirement to modify DCache.
    Memory references could access data in the CC, so it added no cycles
    of latency and acted like memory forwarding (== register forwarding),
    but the DCache was not modified until the causing instruction
    retired. The CC was in effect the memory pipeline !!

    For example one could have modified data waiting for a DCache
    write to become available when a subsequent memory reference
    displaces the first line out towards memory, so there is no
    line in the cache to write to !! so the CC wrote the data out
    towards memory.

    1992 !! And it could run MATRIX 300 at 5.9 IPC.

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 11 00:27:02 2024
    John Savard wrote:

    On Mon, 10 Jun 2024 07:16:48 GMT, [email protected]
    (Anton Ertl) wrote:

    John Savard <[email protected]d> writes:

    In the case of Spectre, fixing the hardware has a cost in performance.

    How do you know?

    Papers on so-called "invisible speculation" schemes have reported
    slowdowns <10% for the more advanced schemes, with IIRC some even
    reporting a speedup.

    I've heard claims - especially from Mitch Alsup - that, indeed, all
    one has to do is avoid certain _mistakes_ when designing a pipeline,
    and there's no room for Spectre any more.

    I'm no expert on these things at all, so I don't know that this can't
    be true. But I also don't know that it _is_ true.

    Timeline:: the microarchitecture of Intel's latest chips derives from
    a line going all the way back to Pentium Pro. Sure they have tweaked
    lots of things and created an explosion of new instructions, but deep
    inside it is still PP.

    What does Spectre exploit? It exploits the fact that speculative
    execution keeps around data that was fetched into cache by the
    speculative execution of some code that was never supposed to be
    executed. Just in case it might be useful later.

    Yes

    Obviously, keeping around any data that just happens to be
    accidentally in cache, just in case it might be useful later, does
    have a positive (but likely very slight) effect on performance. Being
    strict about what speculative execution can do, on the other hand, so
    nothing is allowed to leak information, will reduce performance... at
    least a little bit.

    In the course of accessing data from the cache, one also has to check
    if there is an outstanding request to memory for this same cache line.
    So, when multiple requests all target the same cache line, one only
    fetches it once. This check is performed fully associatively in the
    miss buffer. Since one is already checking the miss buffer, and the
    miss buffer has to have any cache line pass through it during
    instruction execution::

    ALL I have DONE is to not have the MB write into the cache until the
    causing instruction retires !! Should the instruction NOT retire, the
    data in the miss buffer can be delivered back to memory/whence it came
    (depending on coherence protocol) and we remain coherent without::
    a) delaying the core
    b) modifying the cache
    c) exposing microarchitectural details

    The only piece of logic that needs to change is the miss buffer, in that
    miss buffers currently only deliver the "critical word" of the miss and
    then dump the buffer into the cache. All I ask is for the miss buffer to
    deliver data to all outstanding requests while the initiator is waiting
    to retire. {This may need an extra entry (or 2) in the MB to avoid
    losing performance.}
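
    A toy model of the scheme, to make the sequencing concrete. Everything
    here (the Core class, its method names, the dictionary-based cache) is
    invented for illustration and is not taken from any real design. One
    simplifying assumption of this sketch: a line is committed when any
    instruction waiting on it retires, and dropped once every waiter has
    been squashed.

```python
class Core:
    def __init__(self, memory):
        self.memory = memory    # line address -> data
        self.cache = {}         # lines visible after retirement
        self.miss_buffer = {}   # line address -> (data, waiting instruction ids)

    def load(self, instr_id, line):
        if line in self.cache:
            return self.cache[line]
        if line not in self.miss_buffer:
            # One fetch per line, however many requests target it.
            self.miss_buffer[line] = (self.memory[line], set())
        data, waiters = self.miss_buffer[line]
        waiters.add(instr_id)
        return data             # forwarded straight from the miss buffer

    def retire(self, instr_id):
        ready = [l for l, (_, w) in self.miss_buffer.items() if instr_id in w]
        for line in ready:
            data, _ = self.miss_buffer.pop(line)
            self.cache[line] = data   # the cache changes only at retirement

    def squash(self, instr_id):
        for line in list(self.miss_buffer):
            _, waiters = self.miss_buffer[line]
            waiters.discard(instr_id)
            if not waiters:
                del self.miss_buffer[line]   # the line never reaches the cache


core = Core({0x40: "secret", 0x80: "public"})
speculative_value = core.load(1, 0x40)   # wrong-path load still gets its data
core.squash(1)                           # ...but leaves no trace in the cache
core.load(2, 0x80)
core.retire(2)                           # a retired load does fill the cache
```

    The squashed load received its data without delay, yet the cache
    contents afterwards are the same as if it had never executed.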

    Intel and AMD (and everyone else, it appears) have not done a major
    new microarchitecture since Spectre was announced {they may NOT even
    CARE !} Instead of a new microarchitecture, they prefer to add to the
    width and depth of the execution window {not that anyone would
    disagree}.

    Nothing in My 66000 requires any serious modification to the GBOoO
    general architecture of the execution window--just modifications to
    some sequences to prevent microarchitectural leakage.

    It could well be that the losses aren't enough to be concerned about,
    if this is done carefully. That is, not even the 3% quoted as the cost
    of one of the earliest fixes. But since I've heard higher figures for
    the fixes for later variants, without positive knowledge, I have to be
    skeptical about claims that all possible variants of this kind of
    attack can be prevented at little cost.

    And Rowhammer is even worse. It's not at all clear to me what can be
    done without adding an expensive layer of monitoring to memory
    accesses. However, only DRAM is vulnerable to Rowhammer, and so it may
    be possible to turn cache into a bulwark against it somehow.

    My 66000 is also insensitive to RowHammer and derivatives.....

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Tue Jun 11 04:00:57 2024
    On Mon, 10 Jun 2024 15:28:55 GMT, Scott Lurndal wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    On Sun, 09 Jun 2024 14:13:25 GMT, Scott Lurndal wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    “Virtualization” was bandied about in the 1980s more as an idle,
    theoretical concept rather than a practical one.

    I'm quite sure that IBM would disagree with this statement.

    I’m sure they would.

    [Your] attempt to scramble to avoid being wrong was unsuccessful.

    Conway’s Law applies: any piece of software reflects the organizational
    structure that produced it. IBM had all these different, incompatible
    operating systems that didn’t communicate with each other because they
    were produced by different divisions of the company that didn’t
    communicate with each other. So how to tie them together? Create yet
    another system to act as the glue; not to provide any actual
    inter-communication capability, but just so users could at least run
    more than one of them at a time, without needing a lot of extra hardware.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Jun 11 04:07:17 2024
    On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:

    One other thing they did: they had one PAL code coming with the SRM
    console for VMS and Digital OSF/1, and another PAL code with the
    ARC/AlphaBIOS console for Windows NT and Linux. This allowed them to
    charge extra (quite a lot) for hardware capable of running their premium
    OSs, while providing almost competitive prices for hardware running PC
    OSs.

    Let me offer an anecdote related (or not?) to this. I had a client with
    several DEC Alphas, all running DEC Unix (variously branded “OSF/1” and
    “Tru64”). The battery died on one of them, and it came up with a prompt
    asking for a Windows NT boot disk.

    Both the user and I felt disappointed that a piece of such premium DEC hardware, when it lost its mind, would default to asking for Windows NT,
    of all OSes.

    Unfortunately, the PC-like package was still not price/performance
    competitive, and AlphaBIOS (which we had on our EV56 boxes) was a horror
    to work with.

    Windows NT was a disaster for the entire Unix workstation market. The
    irony was, NT “Workstation” wasn’t really feature-equivalent to the OSes
    the Unix workstations were running. But it was enough for the customers,
    it seems ...

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Jun 11 04:10:00 2024
    On Mon, 10 Jun 2024 20:41:31 +0300, Michael S wrote:

    Intel's official terminology makes a distinction between interrupts and
    exceptions. The former are external/asynchronous, the latter are
    internal/synchronous. Exceptions are further sub-divided into faults,
    traps and aborts.

    That all sounds very DEC-like.

    In particular, the DEC definition of a “fault” is that the saved PC on the stack still points at the instruction that caused the exception, so a return-from-exception will attempt to re-execute the same instruction.
    This is exactly what you want for page faults, for example, but also for long-running interruptible instructions that haven’t finished yet.

    Whereas a “trap” left the PC pointing at the following instruction. So a return from the exception handler will simply resume execution there.
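
    Put as a tiny sketch (the function name and the string tags are made
    up for illustration; the instruction length is passed in explicitly
    because it varies per instruction):

```python
def saved_pc(kind, pc, length):
    """PC saved on the stack for an exception raised by the instruction at pc."""
    if kind == "fault":
        return pc            # return-from-exception re-executes the instruction
    if kind == "trap":
        return pc + length   # return-from-exception resumes at the next one
    raise ValueError("unknown exception kind: %r" % (kind,))


fault_pc = saved_pc("fault", 0x1000, 3)
trap_pc = saved_pc("trap", 0x1000, 3)
```

    So a page-fault handler that fixes up the mapping and returns gets the
    faulting instruction retried for free, while a trap handler never sees
    the same instruction again.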

    Over the evolution of the VAX architecture, some exceptions which
    initially were “traps” became “faults” instead.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Tue Jun 11 04:13:49 2024
    On Mon, 10 Jun 2024 15:30:01 GMT, Scott Lurndal wrote:

    On the face of it, your feature is not useful.

    It allows for more fine-grained privilege separation. That is very likely
    to be useful to certain, um, TLA markets, shall we say. Even ordinary
    users now have a need to run potentially hostile code in a sandbox, just
    as part of normal web-browsing. That need might have to develop in more
    complex ways in future.

    But on the other hand, those increased levels of separation are probably
    needed in a less hierarchical, more matrix-connected way. I.e.
    capabilities might be more relevant, rather than privilege rings.

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Tue Jun 11 04:11:23 2024
    On Mon, 10 Jun 2024 14:32:33 -0400, EricP wrote:

    And there was the NMI race condition bug ...

    Not surprised there was trouble with the concept of a “non-maskable interrupt”. When I first heard of such a thing, I threw up my hands in horror.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Jun 11 04:02:33 2024
    On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:

    Given that ARM is able to charge an architecture licensing fee for the instruction set alone ...

    I think that applies to newer versions, not the older ones. Given that ARM
    goes back to the 1980s, any patents from the earliest years would have
    expired by now.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Tue Jun 11 04:14:48 2024
    On Mon, 10 Jun 2024 01:26:23 -0600, John Savard wrote:

    .. while it's on the RAM disk, it can't be executed.

    Why not? A filesystem can still have executable mode bits on its files, regardless of whether the underlying medium is in persistent storage or
    not.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Tue Jun 11 04:18:26 2024
    On Mon, 10 Jun 2024 17:11:47 -0600, John Savard wrote:

    Write can be enabled to memory. Only enabling write and execute together
    is potentially subject to restrictions.

    I was going to say, it might be acceptable in current programming
    environments to keep the two states (writable versus executable) carefully separated, with an explicit transition from one to the other.

    But it turns out this isn’t always enough. I wrote some C code taking advantage of the GCC extension that lets you define nested routines, and
    in that situation it creates “thunks” to allow inner routines to access local variables in outer routines, and that requires an executable stack.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Jun 11 04:15:35 2024
    On Mon, 10 Jun 2024 18:43:09 +0000, MitchAlsup1 wrote:

    One can create a PTE pointing at that RAM disk page and then allow
    someone to execute it directly.

    Funny, that’s how demand-paging works.

  • From Niklas Holsti@21:1/5 to John Savard on Tue Jun 11 08:54:16 2024
    On 2024-06-11 2:22, John Savard wrote:
    On Mon, 10 Jun 2024 07:16:48 GMT, [email protected]
    (Anton Ertl) wrote:

    John Savard <[email protected]d> writes:

    In the case of Spectre, fixing the hardware has a cost in performance.

    How do you know?

    Papers on so-called "invisible speculation" schemes have reported
    slowdowns <10% for the more advanced schemes, with IIRC some even
    reporting a speedup.

    I've heard claims - especially from Mitch Alsup - that, indeed, all
    one has to do is avoid certain _mistakes_ when designing a pipeline,
    and there's no room for Spectre any more.

    I'm no expert on these things at all, so I don't know that this can't
    be true. But I also don't know that it _is_ true.

    What does Spectre exploit? It exploits the fact that speculative
    execution keeps around data that was fetched into cache by the
    speculative execution of some code that was never supposed to be
    executed. Just in case it might be useful later.

    Obviously, keeping around any data that just happens to be
    accidentally in cache, just in case it might be useful later, does
    have a positive (but likely very slight) effect on performance.


    Not always. If the mistakenly speculated cache-fetch /evicted/ some
    other data from the (finite-sized) cache, and the evicted data are
    needed later on the /true/ execution path, the mistakenly speculated
    fetch has a /negative/ effect on performance. (This kind of "timing
    anomaly" is very bothersome in static WCET analysis.)
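
    A minimal simulation of that effect, assuming a tiny fully
    associative LRU cache of two lines (the capacity, the line names and
    the count_misses helper are all invented for this sketch):

```python
from collections import OrderedDict


def count_misses(accesses, capacity=2):
    """Count true-path misses in a small LRU cache.

    accesses is a list of (line, speculative) pairs; speculative fetches
    fill the cache but their own misses are not charged to the true path.
    """
    cache = OrderedDict()
    misses = 0
    for line, speculative in accesses:
        if line in cache:
            cache.move_to_end(line)     # refresh LRU position on a hit
            continue
        if not speculative:
            misses += 1
        cache[line] = True
        if len(cache) > capacity:
            cache.popitem(last=False)   # evict the least recently used line
    return misses


true_path = [("A", False), ("B", False), ("A", False)]
with_wrong_path_fetch = [("A", False), ("B", False), ("X", True), ("A", False)]

baseline = count_misses(true_path)
polluted = count_misses(with_wrong_path_fetch)
```

    Here the wrong-path fetch of X evicts A, so the later true-path access
    to A misses where it would otherwise have hit.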

  • From Terje Mathisen@21:1/5 to EricP on Tue Jun 11 10:03:36 2024
    EricP wrote:
    Terje Mathisen wrote:
    EricP wrote:
    [snip]
    Many processors automatically disable interrupts on trap because it
    greatly simplifies the race conditions in their prologue and epilogue.
    x86 did not disable interrupts on exceptions but x64 allows it as an
    option.

    I have written a lot of x86 interrupt handlers, these chips did very
    much disable all interrupts when transferring control to my handler.

    The typical approach was to do the minimum work possible to save
    whatever HW buffer/data needed saving, before executing a STI (SeT
    Interrupt enable bit?) and then do anything else that had to be done
    while still in the primary handler.

    IRET restored flags, IP and CS, transferring control back to whatever
    was running when the hw interrupt happened.

    Terje

    Yes, for x86/x64 external interrupts it raises the IRQ priority to that of the requesting device, masking further interrupts of the same or lower IRQ priority. Or you can explicitly disable all maskable interrupts.

    I guess my vintage is showing! When I wrote HW interrupt handlers, none
    of this applied so it was a much simpler world.

    Initially there was no real priority in use because my handler would
    start with IRQ disabled, I would poll/read the single byte serial port
    buffer, then clear a hardware interrupt flag and then simply IRET.

    A little later (286?) it became possible to selectively re-enable only
    those interrupts that had a higher priority, so I would do that when my
    most critical work was done.

    Even later the serial port chip was replaced with a far better one which
    had 16-byte IO buffers and programmable interrupt levels. AFAIR I would typically set it to signal when the buffer was half full, but 14 of 16
    was also possible?
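
    On the trigger-level question: assuming the chip in question is the
    16550A UART (the post does not name it), the receive FIFO trigger is
    selected by the top two bits of the FIFO Control Register and can be
    1, 4, 8 or 14 bytes, so both "half full" and "14 of 16" exist. A small
    sketch of building that register value; the helper name is
    hypothetical and only the FIFO-enable and trigger bits are modelled:

```python
# Receive trigger level (bytes) -> FCR bits 7:6, per the 16550A register layout.
RX_TRIGGER_BITS = {1: 0b00, 4: 0b01, 8: 0b10, 14: 0b11}

FIFO_ENABLE = 0x01  # FCR bit 0


def fifo_control(trigger_bytes, enable=True):
    """Return an FCR value for the requested receive trigger level."""
    if trigger_bytes not in RX_TRIGGER_BITS:
        raise ValueError("trigger level must be one of 1, 4, 8, 14")
    value = RX_TRIGGER_BITS[trigger_bytes] << 6
    if enable:
        value |= FIFO_ENABLE
    return value


half_full = fifo_control(8)
nearly_full = fifo_control(14)
```

    A lower trigger leaves more slack before overrun at the price of more
    interrupts; 14 minimizes interrupts but leaves only two bytes of
    headroom.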

    However for exceptions and NMI x86 does not mask interrupts so it is
    possible for, say, a page fault or INT instruction to trap to the OS,
    saving a frame on the stack, and just then an external interrupt to
    arrive, saving another frame.

    On the return from the interrupt or exception (we want a common return
    code path) we need to know if this is a First Level Exception/Interrupt.
    If not, we take the simple path and just REI Return Exception or
    Interrupt. If it is a FLEI then we need to check for deferred work and
    jump into the OS. Also, if we are returning to User mode we may need to
    check for things like thread APCs/signals that arrived while we were away.

    On x86 there is also the difference between stack frame shape
    depending on whether the prior mode was User or Super.
    On x64 they fixed this so they are the same shape.

    Then there is the difference between SYSCALL/SYSRET vs SYSENTER/SYSEXIT,
    and that one did not set the system stack pointer on entry,
    which leaves a security hole if an interrupt arrives just before
    you can patch it.

    And there was the NMI race condition bug, details of which I have
    forgotten but was again something to do with the system stack not
    being set correctly after switching to Super and then an NMI arrives
    which does not set the stack because the prior mode was already Super.

    It's not that these are not handleable, it's that it takes literally
    hundreds of instructions in the x86/x64 prologues and epilogues closing
    each of these holes and idiosyncrasies. And that's on top of the already
    large clock cost for the IDT and call gates, and REI instructions.

    *None* of this should be necessary.
    Even the pipeline drain on mode switch should often be avoidable.


    Ouch! Glad I got out of the IRQ handler business before 1990.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From John Levine@21:1/5 to All on Tue Jun 11 09:37:32 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    I'm quite sure that IBM would disagree with this statement.

    I’m sure they would. But they invented virtualization in CP/CMS because
    their attempt at an “interactive timesharing” system, CMS, was only
    single-user.

    There's no need to make up silly stories like this when the actual
    history is so well documented. CP/CMS was the IBM Cambridge Scientific
    Center's response to the end of CTSS and the loss of the bid to build
    Multics. CP and CMS were developed in tandem and it was always a
    time-sharing system, originally on a modified 360/40, later on a
    360/67.

    For a long time IBM insisted the real time-sharing system was TSS,
    then later TSO on MVS, while CP was just an unsupported lab curiosity.
    Eventually they gave in to the obvious, renamed it to VM, ported it to
    S/370, and made it a real product.

    The Wikipedia article has lots of details:
    https://en.wikipedia.org/wiki/History_of_CP/CMS

    As does this IBM paper on the history of VM https://www.vm.ibm.com/history/50th/vm370ori.pdf
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to Terje Mathisen on Tue Jun 11 13:12:23 2024
    On Tue, 11 Jun 2024 10:03:36 +0200
    Terje Mathisen <[email protected]> wrote:


    I guess my vintage is showing! When I wrote HW interrupt handlers,
    none of this applied so it was a much simpler world.

    Initially there was no real priority in use because my handler would
    start with IRQ disabled, I would poll/read the single byte serial
    port buffer, then clear a hardware interrupt flag and then simply
    IRET.


    I think, even the very first IBM PC had one Intel 8259 PIC. PC/XT had it
    for sure. So, priorities were here. How useful, is another question.

    One of the problems was that right from the beginning IBM engineers
    ignored Intel's recommendation to map external interrupts to
    interrupt vector numbers 32 or higher. They thought that they knew
    better. Of course, they didn't.
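
    To make the clash concrete, here is a small sketch. It assumes, as far
    as I recall, that the original PC BIOS programmed the 8259 with a base
    vector of 8 (so IRQ0..IRQ7 landed on vectors 8..15), while Intel
    reserved vectors 0..31 for CPU exceptions. The helper names are
    invented for this example.

```python
RESERVED_VECTORS = range(0, 32)   # vectors Intel reserved for CPU exceptions


def irq_to_vector(irq, base):
    """Vector delivered by an 8259 programmed with the given base vector."""
    return base + irq


def collisions(base, irqs=range(8)):
    """IRQ lines whose vectors fall inside the reserved exception range."""
    return [irq for irq in irqs if irq_to_vector(irq, base) in RESERVED_VECTORS]


pc_style = collisions(base=8)       # assumed original PC mapping
recommended = collisions(base=32)   # mapping at vector 32 and up
```

    With a base of 8, every hardware IRQ shares a vector with an
    exception number that later CPUs put to use (vector 13, for example,
    is the general protection fault from the 286 on), so the handler has
    to work out which of the two actually happened.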

    A little later (286?) it became possible to selectively re-enable
    only those interrupts that had a higher priority, so I would do that
    when my most critical work was done.


    PC/AT had two 8259 PICs connected as master and slave. So, more
    priority levels at the cost of less simple programming.
    Now, the 80286 CPU had a LOT of interrupt processing features unheard of
    in earlier CPUs, but those were available only in protected mode, so
    that's probably not what you had in mind above.

    Even later the serial port chip was replaced with a far better one
    which had 16-byte IO buffers and programmable interrupt levels. AFAIR
    I would typically set it to signal when the buffer was half full, but
    14 of 16 was also possible?

    However for exceptions and NMI x86 does not mask interrupts so it is
    possible for, say, a page fault or INT instruction to trap to the
    OS, saving a frame on the stack, and just then an external
    interrupt to arrive, saving another frame.

    On the return from the interrupt or exception (we want a common
    return code path) we need to know if this is a First Level
    Exception/Interrupt. If not, we take the simple path and just REI
    Return Exception or Interrupt. If it is a FLEI then we need to
    check for deferred work and jump into the OS. Also, if we are
    returning to User mode we may need to check for things like thread
    APCs/signals that arrived while we were away.

    On x86 there is also the difference between stack frame shape
    depending on whether the prior mode was User or Super.
    On x64 they fixed this so they are the same shape.

    Then there is the difference between SYSCALL/SYSRET vs
    SYSENTER/SYSEXIT, and that one did not set the system stack pointer
    on entry, which leaves a security hole if an interrupt arrives just
    before you can patch it.

    And there was the NMI race condition bug, details of which I have
    forgotten but was again something to do with the system stack not
    being set correctly after switching to Super and then an NMI arrives
    which does not set the stack because the prior mode was already
    Super.

    It's not that these are not handleable, it's that it takes literally
    hundreds of instructions in the x86/x64 prologues and epilogues
    closing each of these holes and idiosyncrasies. And that's on top
    of the already large clock cost for the IDT and call gates, and
    REI instructions.

    *None* of this should be necessary.
    Even the pipeline drain on mode switch should often be avoidable.


    Ouch! Glad I got out of the IRQ handler business before 1990.

    Terje


    I think Eric more than a little exaggerates the level of
    complexity of end-of-interrupt processing needed in the common case.
    Maybe the code is long, but the absolute majority of it is executed
    very rarely, if at all.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Jun 11 14:07:48 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 10 Jun 2024 20:41:31 +0300, Michael S wrote:

    Intel's official terminology makes distinction between interrupts and
    exceptions. The former are external/asynchronous, the latter are
    internal/synchronous. Exceptions are further sub-divided into faults,
    traps and aborts.

    That all sounds very DEC-like.

    In particular, the DEC definition of a “fault” is that the saved PC on the
    stack still points at the instruction that caused the exception, so a
    return-from-exception will attempt to re-execute the same instruction.
    This is exactly what you want for page faults, for example, but also for
    long-running interruptible instructions that haven’t finished yet.

    That distinction predated the VAX, of course. Pretty much every
    hardware architecture at the time supported similar instruction restart
    semantics, particularly when it supported some form of memory management
    trap behavior.

    For example, the Burroughs mainframes distinguished between restartable faults/exceptions and interrupts.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Jun 11 14:11:55 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 10 Jun 2024 14:32:33 -0400, EricP wrote:

    And there was the NMI race condition bug ...

    Not surprised there was trouble with the concept of a “non-maskable
    interrupt”. When I first heard of such a thing, I threw up my hands in
    horror.

    1) NMI are incredibly useful in certain cases, particularly for in-kernel debuggers.
    2) NMI is actually maskable on Intel hardware (in the chipset, not the processor)
    3) ARM refused to support NMI in AArch64 (partially because they didn't
    have a spare exception vector). They've backtracked and hacked in a
    solution using the interrupt controller to create a pseudo-unmaskable
    interrupt due to customer demand.

    https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/a-profile-non-maskable-interrupts

    Back in the 90's, we had a custom PCI card with a single button that would trigger an NMI for debugging new hardware.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Jun 11 14:04:59 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:

    Given that ARM is able to charge an architecture licensing fee for the
    instruction set alone ...

    I think that applies to newer versions, not the older ones. Given that ARM
    goes back to the 1980s, any patents from the earliest years would have
    expired by now.

    It has nothing to do with patents.

    The architecture license provides far more than the ability
    to implement the arm instruction set. BTDT.

  • From EricP@21:1/5 to Anton Ertl on Tue Jun 11 10:49:17 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    PAL code is stored in a writable control store that
    is a separate address space from main memory

    Given the way that it (the EV45 PAL code) implements the PAL-call IMB,
    i.e., by executing enough code to flush the I-cache, means that the
    PAL-code is loaded into the I-cache, so I expect that it resides in
    normal RAM. If that was in a separate memory space, there would need
    to be an additional bit in each I-cache tag that records this fact.

    It is normal SRAM but private to each core.
    So there are no tags, no coherence.

    Access to this SRAM is enabled by PAL mode and the program counter
    register contains an address in that SRAM. I don't see an explicit
    statement to say so but it looks to me that this is just a physical
    address that the cpu decodes to this SRAM when in PAL mode.
    I don't see a specific restriction that this memory be only used for
    read-only code and that, with care, data cannot be stored in it.

    The initial address for entry to PAL code comes from the 26-bit field
    in the CALL_PAL instruction. That code number is validated, then shifted
    left by some number of bits defined by a control register, say 6 bits,
    and that is OR'd with base address register and stuffed into the PC.
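
    A minimal sketch of the entry-address computation just described, in Python. The 6-bit shift comes from the "say 6 bits" above; the validation limit, the alignment check, and every name here are illustrative assumptions, not taken from any Alpha manual.

```python
# Hypothetical model of the CALL_PAL entry-address computation.
# Field width is from the post; everything else is an assumption.

PAL_FUNC_BITS = 26      # width of the function field in CALL_PAL
ENTRY_SHIFT = 6         # "say 6 bits": 64 bytes per entry point
MAX_VALID_FUNC = 0xFF   # assumed validation limit, for illustration only


def pal_entry_pc(func_code: int, pal_base: int, shift: int = ENTRY_SHIFT) -> int:
    """Return the PC loaded on a PAL call, or raise if validation fails."""
    if not 0 <= func_code < (1 << PAL_FUNC_BITS):
        raise ValueError("function code does not fit the 26-bit field")
    if func_code > MAX_VALID_FUNC:
        raise ValueError("function code rejected by validation")

    # OR behaves like addition only if the base has no bits set in the
    # range the shifted code can occupy, so check the alignment first.
    offset_mask = (MAX_VALID_FUNC << shift) | ((1 << shift) - 1)
    if pal_base & offset_mask:
        raise ValueError("PAL base is not aligned for OR-based indexing")

    return pal_base | (func_code << shift)
```

    With a hypothetical base of 0x10000, `pal_entry_pc(0x83, 0x10000)` gives 0x120C0. The OR replaces an adder, which is presumably why the base would need to be aligned.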

    But I came to realize that none of that is actually *required*.
    It doesn't *need* a third privilege mode, and actually it looks
    more expensive performance wise to have one than not.
    It would be simpler and cheaper to just transition directly
    to and from Super mode without also going through PAL mode.
    And there is NO technical reason to restrict access to HW control
    register from Super mode.

    Many processors automatically disable interrupts on trap because it
    greatly simplifies the race conditions in their prologue and epilogue.
    x86 did not disable interrupts on exceptions but x64 allows it as an option.
    PAL mode does not require its own on-chip SRAM - it could exist in main
    memory addressed through a base physical register or an MMU hack.
    And having a dedicated private on-chip SRAM to hold critical OS code
    does not mean that it is microcode. I would have this for my design
    with an MMU fiddle to hard-wire a VA->PA mapping for some OS code.

    After realizing it didn't need to exist, and that PAL mode looks more
    expensive than just User/Super modes, I began to wonder why it was there.
    Which leads me to here:

    (I think PAL mode was a way to patent a feature that made the
    ISA impossible to copy without their permission,

    Not really. If there was a patent that is specific to it being a
    different address space or a dedicated private on-chip SRAM, that
    patent could be easily circumvented by the Amdahl-alike by putting the PAL-code in RAM and using a base register or MMU hack, as you
    describe.

    If a clone used off chip memory then it will have much worse performance
    and not be competitive. But what I believe PAL patented was the particular
    set of functional behaviors: the third mode that enables this memory,
    and disables interrupts, and enables HW register access, etc.

    That would force cloners to rewrite those parts of an OS that depend
    on this, which blocks running DEC's EXE's on your partial clone.

    Also if there was enough room for more on-chip SRAM on any of the
    Alpha chips, the designers would have used that room to put in
    features that make the chip faster.

    Given that ARM is able to charge an architecture licensing fee for the instruction set alone, I am sure that DEC had enough patents on its instruction set, no need for unnecessary and circumventable
    implementation ideas.

    As I understand it, you can't patent an ISA but you can patent a particular implementation of a function or feature. One of the patents that protects
    the ARM32 ISA was on its interrupt mechanism. If you want to clone it then
    you have to duplicate their interrupt mechanism. ARM has sued and won
    on that basis which is why there are no ARM clones.

    ARM vs picoTurbo patent lawsuit, 2001 https://www.electronicsweekly.com/news/archived/resources-archived/arm-and-picoturbo-settle-patent-lawsuit-2001-12/

    One other thing they did: they had one PAL code coming with the SRM
    console for VMS and Digital OSF/1, and another PAL code with the ARC/AlphaBIOS console for Windows NT and Linux. This allowed them to
    charge extra (quite a lot) for hardware capable of running their
    premium OSs, while providing almost competitive prices for hardware
    running PC OSs. Unfortunately, the PC-like package was still not price/performance competitive, and AlphaBIOS (which we had on our EV56
    boxes) was a horror to work with.

    - anton

    But this is exactly what I was thinking. They would not be able to
    charge this way if there are exact clones because I could just run
    the VMS PAL code on my cheap-o clone.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Jun 11 16:55:45 2024
    Lawrence D'Oliveiro wrote:

    On Mon, 10 Jun 2024 20:41:31 +0300, Michael S wrote:

    Intel's official terminology makes distinction between interrupts and
    exceptions. The former are external/asynchronous, the latter are
    internal/synchronous. Exceptions are further sub-divided into faults,
    traps and aborts.

    That all sounds very DEC-like.

    In particular, the DEC definition of a “fault” is that the saved PC on the
    stack still points at the instruction that caused the exception, so a
    return-from-exception will attempt to re-execute the same instruction.
    This is exactly what you want for page faults, for example, but also for
    long-running interruptible instructions that haven’t finished yet.

    Whereas a “trap” left the PC pointing at the following instruction. So a
    return from the exception handler will simply resume execution there.

    Both have the property where the PC is pointing at the first instruction
    not executed.

    Over the evolution of the VAX architecture, some exceptions which
    initially were “traps” became “faults” instead.
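
    The fault/trap distinction quoted above can be illustrated with a toy interpreter. This is only a sketch: the one-unit instruction length, the page-residency model, and all names are assumptions made for illustration, not a description of VAX or x86 hardware.

```python
from enum import Enum


class Kind(Enum):
    FAULT = "fault"  # saved PC points at the instruction that raised it
    TRAP = "trap"    # saved PC points at the following instruction


def saved_pc(kind: Kind, pc: int, length: int = 1) -> int:
    """PC pushed on the stack when the exception is taken."""
    return pc if kind is Kind.FAULT else pc + length


def run(program, resident_pages, kind: Kind):
    """Run a list of 'touch page N' instructions.

    A non-resident page raises an exception of the given kind; the
    handler makes the page resident and returns to the saved PC.
    Returns the indexes of the instructions that actually completed.
    """
    resident = set(resident_pages)
    completed = []
    pc = 0
    while pc < len(program):
        page = program[pc]
        if page not in resident:
            resident.add(page)        # handler services the exception
            pc = saved_pc(kind, pc)   # return-from-exception
            continue
        completed.append(pc)
        pc += 1
    return completed
```

    Running the program `[0, 1, 0]` with only page 0 resident completes all three instructions under fault semantics, but skips instruction 1 under trap semantics, which is why a page-not-present condition has to be a fault.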

  • From John Savard@21:1/5 to All on Tue Jun 11 13:07:56 2024
    On Tue, 11 Jun 2024 00:45:28 +0000, [email protected] (MitchAlsup1)
    wrote:

    I forgot to add that Mc 88120 had these features in 1992.

    Stores waited for retirement.

    Given that in the case of external RAM, as opposed to registers inside
    the processor, there is only one possible value at any location...
    memory doesn't have a pile of rename locations to play with... I am so unimaginative that I don't think I could design a CPU in which stores
    to RAM didn't wait for the instruction that performed them to retire.

    That, though, wouldn't save me from Spectre, since Spectre leaks
    information by virtue of fetches of stuff _read_ in earlier speculated
    code that didn't really happen being in cache.

    John Savard

  • From John Savard@21:1/5 to All on Tue Jun 11 13:27:35 2024
    On Tue, 11 Jun 2024 00:27:02 +0000, [email protected] (MitchAlsup1)
    wrote:

    ALL I have DONE is to not have the MB write into the cache until the
    causing instruction retires !!

    I suppose that depends on how you define "write".

    If by "write" you mean store data in the cache, for eventual writing
    out into RAM, well, since RAM doesn't contain "rename locations" to
    play with, it seems to me that any CPU designer had better do that.

    At least, I'm not imaginative enough to think of doing it any other
    way.

    However, if by "write" you mean to change the state of the cache in
    any way, such as by reading data from memory... now, _then_ you would
    indeed have done what is necessary to combat Spectre.

    Obviously, though, a "load" instruction will _never_ retire unless it
    can read the data from memory it is trying to put in a register.

    So apparently WHAT you have REALLY DONE is to modify how memory reads
    work...

    if the data a load instruction requires is not already in the cache,
    then a direct read from memory is performed which *completely
    bypasses* the cache; this data (and its associated address) are
    retained by the CPU to be placed in the cache _if_ the instruction is
    actually executed and when it retires.

    And, in fact, the various cache levels have to work this way too. You
    have an L1 cache miss, but an L2 cache hit? Fine, you take your data
    directly from L2, and don't promote the data into L2 until instruction retirement.

    So now the process of fetching data from memory is _not_ done by
    fetching always from L1 and going _through_ L1 to access L2, and
    going _through_ L2 to access RAM, which seems to be the usual way
    these days.

    That certainly can be done. But it isn't quite as simple and obvious
    as you seem to claim.

    My 66000 is also insensitive to RowHammer and derivatives.....

    When I first read that sentence, I was completely incredulous. DRAM is sensitive to RowHammer because it's gone to feature sizes which are
    beyond the state-of-the-art to do properly... so corners have been
    cut.

    How a CPU can be "insensitive" to it was mysterious.

    After all, RowHammer is caused by multiple rapid-fire accesses to the
    same address, or to related addresses, in memory.

    But given that you are now explicitly passing accesses to DRAM around
    the caches, instead of having the caches access DRAM as needed,
    perhaps that also makes it possible for the CPU to detect suspicious
    behavior more easily. (Since _related_ accesses may be used in a
    RowHammer attack, simply pruning redundant memory accesses from the
    operation stream won't be enough. I could see you doing _that_ as part
    of "doing it right".)

    If the "row" that was "hammered" just consisted of the 16 consecutive
    locations that can be accessed speedily after the first one is ready,
    then pruning redundant accesses _would_ be enough, since to "hammer" a
    row one has to access it hundreds of times, not at most 32 times; but
    I'm afraid that isn't the case.

    John Savard

  • From John Savard@21:1/5 to [email protected] on Tue Jun 11 13:30:10 2024
    On Tue, 11 Jun 2024 08:54:16 +0300, Niklas Holsti <[email protected]d> wrote:

    Not always. If the mistakenly speculated cache-fetch /evicted/ some
    other data from the (finite-sized) cache, and the evicted data are
    needed later on the /true/ execution path, the mistakenly speculated
    fetch has a /negative/ effect on performance. (This kind of "timing
    anomaly" is very bothersome in static WCET analysis.)

    Ouch. Another argument for having a victim cache. And a benefit of
    doing it in what is apparently Mitch Alsup's way - holding off cache
    updates until instruction retirement.
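
    The timing anomaly described in the quoted paragraph can be reproduced with a very small LRU model. This is a sketch under stated assumptions: a fully associative LRU cache, a trace in which every access flagged as speculative is later squashed, and hypothetical names throughout.

```python
from collections import OrderedDict


def true_path_misses(trace, capacity, update_on_speculation):
    """Count misses suffered by true-path accesses.

    trace: list of (line, speculative) pairs; speculative accesses are
    assumed to be wrong-path and squashed later.
    update_on_speculation: if False, wrong-path fetches never touch the
    cache (updates are held until retirement, which never happens).
    """
    cache = OrderedDict()  # ordered from least to most recently used
    misses = 0
    for line, speculative in trace:
        if speculative and not update_on_speculation:
            continue  # held outside the cache, then discarded
        if line in cache:
            cache.move_to_end(line)
            continue
        if not speculative:
            misses += 1
        cache[line] = True
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used line
    return misses


# True path touches A, B, A; a wrong-path fetch of C sits in between.
TRACE = [("A", False), ("B", False), ("C", True), ("A", False)]
```

    With a two-line cache, `true_path_misses(TRACE, 2, True)` counts three misses because the wrong-path fetch of C evicts A, while `true_path_misses(TRACE, 2, False)` counts only the two compulsory ones.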

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 11 20:50:14 2024
    John Savard wrote:

    On Tue, 11 Jun 2024 00:45:28 +0000, [email protected] (MitchAlsup1)
    wrote:

    I forgot to add that Mc 88120 had these features in 1992.

    Stores waited for retirement.

    Given that in the case of external RAM, as opposed to registers inside
    the processor, there is only one possible value at any location...
    memory doesn't have a pile of rename locations to play with... I am so unimaginative that I don't think I could design a CPU in which stores
    to RAM didn't wait for the instruction that performed them to retire.

    I never said it did. Effectively, there is a buffer that feeds the OoO
    engine results as early as possible (acting like a cache, but a cache
    with the property that it can be discarded, in part or whole, just like
    instructions in the shadow of a branch can be discarded). So, the pipeline
    gets fed by the buffer and the update of the actual cache is delayed until
    the instruction causing the event retires.

    So depending on where you look you can see the front of the pipeline or
    the end of the pipeline--and it works the exact same way as branch
    prediction and with the same property that one can back up to some sane
    point based on external events (coherent messages,...) rather than branch
    misprediction.

    That, though, wouldn't save me from Spectre, since Spectre leaks
    information by virtue of fetches of stuff _read_ in earlier speculated
    code that didn't really happen being in cache.

    Here, Spectré performs back-to-back dependent LDs; by doing these many
    times in a row, the prediction mechanism will lock in on these
    instructions being OK to execute.

    Then, the first LD returns a pointer while the TLB returns "bad access".
    But the loaded pointer goes through AGEN before permissions have been
    fully checked. This second LD must not be allowed to modify the data
    cache, or Spectré can see this microarchitectural state change.

    When the bad LD is discarded from execution so is the damage done by
    the second LD--its data is discarded from the buffer and never makes
    it to the cache. Instructions entering the pipeline afterwards do not
    see the data that should never have been there. Q.E.D. no Spectré
    sensitivity.
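
    A rough behavioural model of the scheme described above, for readers who prefer code. It is a sketch only: the class, its methods, and the use of sequence numbers to stand for program order are assumptions, and real hardware tracks far more state (permissions, partial lines, coherence).

```python
class SpecLoadBuffer:
    """Cache plus a program-ordered buffer for not-yet-retired fills."""

    def __init__(self):
        self.cache = set()   # line addresses committed to the data cache
        self.buffer = {}     # sequence number -> line address (in flight)

    def load(self, seq, line):
        """Execute a load speculatively; report where its data came from."""
        if line in self.cache:
            return "cache"
        if line in self.buffer.values():
            return "buffer"          # served by an older in-flight fill
        self.buffer[seq] = line      # the fill lands in the buffer only
        return "memory"

    def retire(self, seq):
        """At retirement the buffered line finally updates the cache."""
        line = self.buffer.pop(seq, None)
        if line is not None:
            self.cache.add(line)

    def squash_from(self, seq):
        """Discard every in-flight fill at or after seq, as on a
        branch misprediction or a failed permission check."""
        for s in [s for s in self.buffer if s >= seq]:
            del self.buffer[s]
```

    In this model, a load that faults and its dependent load both leave the cache untouched once `squash_from` runs, so a later probe of the second line still reports "memory" and there is no cache-timing difference to observe.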

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 11 21:18:47 2024
    John Savard wrote:

    On Tue, 11 Jun 2024 00:27:02 +0000, [email protected] (MitchAlsup1)
    wrote:

    ALL I have DONE is to not have the MB write into the cache until the causing instruction retires !!

    I suppose that depends on how you define "write".

    I mean the memory cell does not get modified.

    If by "write" you mean store data in the cache, for eventual writing
    out into RAM, well, since RAM doesn't contain "rename locations" to
    play with, it seems to me that any CPU designer had better do that.

    The cache itself is not modified until the memory reference retires.
    But there is a buffer holding the data which can be accessed as if
    it were an L0 cache until the data migrates to the real cache at
    retirement.

    At least, I'm not imaginative enough to think of doing it any other
    way.

    However, if by "write" you mean to change the state of the cache in
    any way, such as by reading data from memory... now, _then_ you would
    indeed have done what is necessary to combat Spectre.

    The cache is not modified; the data is available through another means,
    a means that can be backed up like a mispredicted branch. The buffer
    I am talking about is temporally organized not spatially organized.

    Obviously, though, a "load" instruction will _never_ retire unless it
    can read the data from memory it is trying to put in a register.

    The LD instruction can obtain data from either the buffer or from
    the data cache itself. The buffer covers the execution window,
    allowing the LD to retire (assuming every older instruction also
    retires).

    So apparently WHAT you have REALLY DONE is to modify how memory reads
    work...

    I pipelined them through a temporally organized memory execution
    window. This also provides for allowing the memory system to run
    OoO wrt program order, and detect actual ordering violations, and
    rerun the memory references in a proper memory order by rerunning
    the references in order.

    You get relaxed memory order performance and precise memory order simultaneously.

    if the data a load instruction requires is not already in the cache,
    then a direct read from memory

    The request is forwarded towards memory through the cache hierarchy
    and data arrives back at requestor (sooner or later).

    is performed which *completely
    bypasses* the cache;

    Yes, critical word first.

    this data (and its associated address) are
    retained by the CPU to be placed in the cache _if_ the instruction is actually executed and when it retires.

    Yes !! While the data resides in the buffer, the whole line can be
    accessed by a number of memory reference instructions.

    And, in fact, the various cache levels have to work this way too. You
    have an L1 cache miss, but an L2 cache hit? Fine, you take your data
    directly from L2, and don't promote the data into L2 until instruction retirement.

    I use an exclusive cache organization, so data arriving at the CPU
    goes into buffer, which upon retirement goes into L1, which has the
    potential to push a L1->L2 line, and so forth.
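
    The fill path just described (buffer, then L1 at retirement, with an L1 victim pushed to L2) might be modelled as follows. This is a sketch assuming LRU replacement and fully associative levels; the capacities and names are hypothetical and are not My 66000 parameters.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Two-level exclusive cache: a line lives in L1 or L2, never both."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # least recently used entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def retire_fill(self, line):
        """Install a line in L1 when the load that fetched it retires."""
        self.l2.pop(line, None)           # exclusive: leave L2 if present
        self.l1[line] = True
        self.l1.move_to_end(line)
        if len(self.l1) > self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            self.l2[victim] = True        # L1 victim is pushed to L2
            if len(self.l2) > self.l2_lines:
                self.l2.popitem(last=False)  # falls out towards L3/DRAM

    def where(self, line):
        if line in self.l1:
            return "L1"
        if line in self.l2:
            return "L2"
        return None
```

    Because nothing enters L1 before retirement, a squashed load can never trigger the L1-to-L2 push either.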

    So now the process of fetching data from memory is _not_ done by
    fetching always from L1 and going _through_ L1 to access L2, and
    going _through_ L2 to access RAM, which seems to be the usual way
    these days.

    It's back to the Athlon/Opteron organizations.

    That certainly can be done. But it isn't quite as simple and obvious
    as you seem to claim.

    If you had worked on them you would recognize the advantages and
    disadvantages.

    My 66000 is also insensitive to RowHammer and derivatives.....

    When I first read that sentence, I was completely incredulous. DRAM is sensitive to RowHammer because it's gone to feature sizes which are
    beyond the state-of-the-art to do properly... so corners have been
    cut.

    How a CPU can be "insensitive" to it was mysterious.

    After all, RowHammer is caused by multiple rapid-fire accesses to the
    same address, or to related addresses, in memory.

    Yes, the write buffer in my DRAM controller is the L3 cache. Modified
    data in the L3 migrates towards DRAM as DRAM cycles permit, but there
    is no way to cause a line to be continuously written into DRAM.
    If a modified line has migrated to DRAM, and it gets modified again
    in the L3, that 2nd write will not be performed until a refresh cycle
    on that DRAM is performed.

    Thus if one tries to RowHammer My 66000 DRAM, DRAM gets a refresh cycle
    between each write.
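
    The write-gating rule above can be sketched as a small model. Assumptions are labeled loudly here: the gate is tracked per line for simplicity (a real controller would more likely track it per DRAM row or bank), refresh is modelled as a single global event, and every name is hypothetical.

```python
class RefreshGatedWriter:
    """L3 acting as a write buffer that allows at most one DRAM write
    per line between refreshes (granularity is an assumption)."""

    def __init__(self):
        self.pending = set()                # modified lines held in L3
        self.written_since_refresh = set()  # lines already written this epoch
        self.dram_writes = []               # log of (refresh epoch, line)
        self.epoch = 0

    def modify(self, line):
        """A core modifies a line; the new data sits in L3."""
        self.pending.add(line)

    def drain(self):
        """Write back what the gate allows; return the lines written."""
        ready = sorted(self.pending - self.written_since_refresh)
        for line in ready:
            self.dram_writes.append((self.epoch, line))
            self.written_since_refresh.add(line)
            self.pending.discard(line)
        return ready

    def refresh(self):
        """A refresh cycle reopens the gate for every line."""
        self.epoch += 1
        self.written_since_refresh.clear()
```

    Modifying the same line a thousand times and draining after each modification produces one DRAM write in this model; the next one goes out only after `refresh()` has been called.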

    But given that you are now explicitly passing accesses to DRAM around
    the caches, instead of having the caches access DRAM as needed,
    perhaps that also makes it possible for the CPU to detect suspicious
    behavior more easily. (Since _related_ accesses may be used in a
    RowHammer attack, simply pruning redundant memory accesses from the
    operation stream won't be enough. I could see you doing _that_ as part
    of "doing it right".)

    Banging on related cache lines also results in refresh cycles.

    If the "row" that was "hammered" just consisted of the 16 consecutive locations that can be accessed speedily after the first one is ready,
    then pruning redundant accesses _would_ be enough, since to "hammer" a
    row one has to access it hundreds of times, not at most 32 times; but
    I'm afraid that isn't the case.

    I doubt that RowHammer still works when refreshes are interspersed
    between accesses--RowHammer generally works because the events are
    not protected by refreshes--the DRC sees the right ROW open and
    simply streams at the open bank.

    Also note, there are no instructions in My 66000 that force a cache
    line to DRAM, whereas there are instructions that can force a cache line
    into L3. L3 is the buffer to DRAM. Nothing gets to DRAM without
    going through L3 and nothing comes out of DRAM that is not also
    buffered by L3. So, if 96 cores simultaneously read a line residing in
    DRAM, DRAM is read once and 95 cores are serviced through L3. So,
    you can't RowHammer based on reading DRAM, either.
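
    The read side of that argument is easy to show in miniature. The sketch below assumes an L3 that never evicts during the experiment and uses hypothetical names; it only illustrates the counting, not a real controller.

```python
class L3Front:
    """L3 sitting in front of DRAM: every read is served through it."""

    def __init__(self):
        self.lines = {}      # line address -> data held in L3
        self.dram_reads = 0  # number of times DRAM itself was accessed

    def read(self, core, line):
        """Serve a read from any core; touch DRAM only on an L3 miss."""
        if line not in self.lines:
            self.dram_reads += 1
            self.lines[line] = f"data@{line:#x}"
        return self.lines[line]


def hammer_by_reading(cores, line):
    """Have every core read the same line; return the DRAM read count."""
    l3 = L3Front()
    for core in range(cores):
        l3.read(core, line)
    return l3.dram_reads
```

    `hammer_by_reading(96, 0x4000)` returns 1: the first core's miss reads DRAM and the other 95 are served from L3.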

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 11 21:20:45 2024
    John Savard wrote:

    On Tue, 11 Jun 2024 08:54:16 +0300, Niklas Holsti <[email protected]d> wrote:

    Not always. If the mistakenly speculated cache-fetch /evicted/ some
    other data from the (finite-sized) cache, and the evicted data are
    needed later on the /true/ execution path, the mistakenly speculated
    fetch has a /negative/ effect on performance. (This kind of "timing anomaly" is very bothersome in static WCET analysis.)

    That is why you don't update the cache until the causing instruction
    retires.

    Ouch. Another argument for having a victim cache. And a benefit of
    doing it in what is apparently Mitch Alsup's way - holding off cache
    updates until instruction retirement.

    Change your thought from victim cache to pipeline buffer pre cache.

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed Jun 12 02:47:48 2024
    On Tue, 11 Jun 2024 09:37:32 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    I'm quite sure that IBM would disagree with this statement.

    I’m sure they would. But they invented virtualization in CP/CMS because their attempt at an “interactive timesharing” system, CMS, was only single-user.

    There's no need to make up silly stories like this ...

    No need to take my word for it. Bitsavers added issues of a magazine
    called “Mainframe” a few months back. I took the trouble to read the first one--it’s all about IBM, as though other “mainframe” machines didn’t exist. There’s a description of the background to CP/CMS (later VM/CMS) there.

  • From George Neuner@21:1/5 to [email protected] on Tue Jun 11 22:18:44 2024
    On Tue, 11 Jun 2024 04:07:17 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    Windows NT was a disaster to the entire Unix workstation market. The irony was, NT “Workstation” wasn’t really feature-equivalent to the OSes the Unix workstations were running. But it was enough for the customers, it
    seems ...

    The differences were almost all at user level: the most glaring
    examples being NT's single user shell [even on server editions], lack
    of admin tools in the workstation edition, and lack of development
    tools in all editions.

    Considered as an "operating system" - ie. what could be implemented on
    the platform - NT certainly was (mostly) equivalent to Unix. Note:
    equivalence is not "sameness" - NT was implemented differently, its
    APIs were different, and code that was "equivalent" in function often
    did not look the same (and was not transportable).

    BTDTGTTS

  • From Lawrence D'Oliveiro@21:1/5 to All on Wed Jun 12 02:51:51 2024
    On Tue, 11 Jun 2024 16:55:45 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    In particular, the DEC definition of a “fault” is that the saved PC on the stack still points at the instruction that caused the exception, so
    a return-from-exception will attempt to re-execute the same
    instruction. This is exactly what you want for page faults, for
    example, but also for long-running interruptible instructions that
    haven’t finished yet.

    Whereas a “trap” left the PC pointing at the following instruction. So a return from the exception handler will simply resume execution there.

    Both have the property where the PC is pointing at the first instruction
    not executed.

    Perhaps you meant “completed” rather than “executed”.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Wed Jun 12 02:50:24 2024
    On Tue, 11 Jun 2024 14:04:59 GMT, Scott Lurndal wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:

    Given that ARM is able to charge an architecture licensing fee for the
    instruction set alone ...

    I think that applies to newer versions, not the older ones. Given that
    ARM goes back to the 1980s, any patents from the earliest years would
    have expired by now.

    It has nothing to do with patents.

    The architecture license provides far more than the ability to implement
    the arm instruction set. BTDT.

    IANAL, but there are four kinds of “intellectual property”: copyrights, patents, trademarks and trade secrets.

    If you were incorporating logic components developed by ARM, then
    licensing those might be covered by copyrights and trade secrets.

    But if you’re a company like Apple, which designs and builds its own
    chips, then they need neither of those things.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Wed Jun 12 02:53:41 2024
    On Tue, 11 Jun 2024 14:11:55 GMT, Scott Lurndal wrote:

    1) NMI are incredibly useful in certain cases, particularly for
    in-kernel debuggers.
    2) NMI is actually maskable on Intel hardware (in the chipset, not the processor)

    Do you see a contradiction between the two? In that a “non-maskable” interrupt inevitably has to be “maskable” in certain situations. And how does that affect your in-kernel debugger?

  • From Thomas Koenig@21:1/5 to Scott Lurndal on Wed Jun 12 05:38:01 2024
    Scott Lurndal <[email protected]> schrieb:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 10 Jun 2024 15:23:51 GMT, Anton Ertl wrote:

    Given that ARM is able to charge an architecture licensing fee for the
    instruction set alone ...

    I think that applies to newer versions, not the older ones. Given that ARM goes back to the 1980s, any patents from the earliest years would have expired by now.

    It has nothing to do with patents.

    The architecture license provides far more than the ability
    to implement the arm instruction set. BTDT.

    The current spat between ARM and Qualcomm is quite interesting in
    that respect. It seems that ARM now demands that all PCs using Snapdragon-X-CPUs be destroyed. In return, Qualcomm accuses ARM
    of all sorts of bad things, including threatening to terminate
    Qualcomm's licenses if they insisted on enforcing their contractual
    rights.

    The spat also appears to be about ARM wanting a bigger slice of
    the pie on smartphones: they demand a share of the sales price of
    the final product instead of the CPU. That actually sounds like
    something that the antitrust authorities might be interested in.

    If the cases ever go to trial, at least one ARM license agreement
    will be publicly available.

    And, finally, if people will excuse the pun: This looks like
    StrongARM tactics.

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed Jun 12 06:31:17 2024
    On Wed, 12 Jun 2024 05:38:01 -0000 (UTC), Thomas Koenig wrote:

    The spat also appears to be about ARM wanting a bigger slice of the pie on smartphones: they demand a share of the sales price of the final product instead of the CPU. That actually sounds like something that the
    antitrust authorities might be interested in.

    This kind of greed can only boost the fortunes of alternatives like
    RISC-V.

  • From John Levine@21:1/5 to All on Wed Jun 12 07:47:18 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    There's no need to make up silly stories like this ...

    No need to take my word for it. Bitsavers added issues of a magazine
    called “Mainframe” a few months back. I took the trouble to read the first one--it’s all about IBM, as though other “mainframe” machines didn’t exist. There’s a description of the background to CP/CMS (later VM/CMS) there.

    I see Mainframe Journal, with the earliest issue being Jul/Aug 1988. Is
    that it? I don't see anything in the ToC that looks like a VM overview.

    In any event, I'd find the second article I linked to, the VM history
    written by IBMers who were there, more credible than some random third
    party magazine. CMS really was written at the same time as CP, and
    they always intended them to work together as a time-sharing system.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed Jun 12 07:56:50 2024
    On Wed, 12 Jun 2024 07:47:18 -0000 (UTC), John Levine wrote:

    In any event, I'd find the second article I linked to, the VM history
    written by IBMers who were there, more credible than some random third
    party magazine.

    By all means, check the bios of the authors, included as with any
    magazine. It was written by IBM pros, for IBM pros.

  • From John Dallman@21:1/5 to D'Oliveiro on Wed Jun 12 09:46:00 2024
    In article <v48ihl$sc37$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    Windows NT was a disaster to the entire Unix workstation market.
    The irony was, NT "Workstation" wasn't really feature-equivalent to
    the OSes the Unix workstations were running. But it was enough for
    the customers, it seems ...

    It had important advantages for many customers:

    Lower costs at equivalent or better performance, once the Intel Pentium
    Pro had appeared. The OS cost much less than a commercial Unix, and mass production meant the hardware was much cheaper.

    It ran Microsoft Office. This was really important to corporate managers,
    who wanted their engineers to be able to read and create Office documents,
    and were frustrated by Unix workstations' inability to do so, and their engineers not being worried about the problem. Supplying extra PCs was expensive and took up office space. Software emulation was slow and
    unreliable; add-in cards to provide a PC capability were rare, expensive
    and unreliable.

    John

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Wed Jun 12 07:48:53 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 10 Jun 2024 14:32:33 -0400, EricP wrote:

    And there was the NMI race condition bug ...

    Not surprised there was trouble with the concept of a “non-maskable interrupt”. When I first heard of such a thing, I threw up my hands in horror.

    Yes, NMI has "reentrancy issues".

    8086 has an NMI input pin which, according to the manual, has
    higher priority than the maskable interrupt INTR input pin.
    It will re-trigger on each rising edge so if you don't want it to
    trigger multiple times you have to latch the input yourself.
    Intel manual suggests it might be used for a power fail routine.
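
    The edge-triggering behaviour and the "latch the input yourself" advice can be pictured with a toy model. It is a sketch only: the pin is represented as a list of sampled 0/1 levels, and the external latch is assumed to stay set for the whole sample window (a real design would let the handler clear it).

```python
def nmi_deliveries(samples, latch=False):
    """Return the sample indexes at which an NMI reaches the CPU.

    samples: sequence of 0/1 pin levels, assumed low before the first.
    latch:   if True, model external logic that passes only the first
             rising edge and swallows the rest until it is cleared.
    """
    delivered = []
    previous = 0
    latched = False
    for index, level in enumerate(samples):
        rising_edge = previous == 0 and level == 1
        previous = level
        if not rising_edge:
            continue
        if latch and latched:
            continue  # edge absorbed by the external latch
        delivered.append(index)
        latched = True
    return delivered
```

    For a bouncing input such as `[0, 1, 0, 1, 1, 0, 1]`, the unlatched pin delivers three NMIs (at samples 1, 3 and 6), each of which could re-enter a handler that is still running; with the latch only the first gets through.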

    I seem to recall that on the original PC someone started selling
    add-on FPU boards. Maybe that was Weitek or AMD? Anyway, they hijacked
    the unused NMI input and used it to signal an FPU error.

  • From EricP@21:1/5 to Scott Lurndal on Wed Jun 12 07:57:43 2024
    Scott Lurndal wrote:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 10 Jun 2024 14:32:33 -0400, EricP wrote:

    And there was the NMI race condition bug ...
    Not surprised there was trouble with the concept of a “non-maskable
    interrupt”. When I first heard of such a thing, I threw up my hands in
    horror.

    1) NMI are incredibly useful in certain cases, particularly for in-kernel debuggers.
    2) NMI is actually maskable on Intel hardware (in the chipset, not the processor)
    3) ARM refused to support NMI in Aarch64 (partially because they didn't
    have a spare exception vector). They've backtracked and hacked in a
    solution using the interrupt controller to create a pseudo-unmaskable
    interrupt due to customer demand.

    https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/a-profile-non-maskable-interrupts

    Back in the 90's, we had a custom PCI card with a single button that would trigger an NMI for debugging new hardware.

    As you point out, this "NMI" is maskable, it's just masked in a different control register than usual interrupts.

    The problem with a real NMI is controlling reentrancy.

  • From EricP@21:1/5 to All on Wed Jun 12 08:22:03 2024
    MitchAlsup1 wrote:
    John Savard wrote:


    After all, RowHammer is caused by multiple rapid-fire accesses to the
    same address, or to related addresses, in memory.

    Yes, the write buffer in my DRAM controller is the L3 cache. Modified
    data in the L3 migrates towards DRAM as DRAM cycles permit, but there
    is no way to cause a line to be continuously written into DRAM.
    If a modified line has migrated to DRAM, and it gets modified again
    in the L3, that 2nd write will not be performed until a refresh cycle
    on that DRAM is performed.

    Thus if one tries to RowHammer My 66000 DRAM, DRAM gets a refresh cycle
    between each write.

    What does it do if L3 receives more writes than it has ways in a row,
    does it stall evicts from L2?

    Let's say L3 is 4-way assoc and all four ways in an L3 row have been updated,
    then a 5th way in that same row is written from L2.
    L3 has no place to hold that 5th way and it can't evict one
    of the other 4 ways because that could cause rowhammer.

    Seems to me that all it can do is stall the 5th write from L2 until
    DRAM refresh rolls around and re-enables one of the pending L3 writes,
    which would back up victim evicts from L2.
    Or maybe L3 has a small fully assoc emergency overflow buffer,
    but still that could fill up too.

  • From EricP@21:1/5 to Michael S on Wed Jun 12 09:38:17 2024
    Michael S wrote:
    On Tue, 11 Jun 2024 10:03:36 +0200
    Terje Mathisen <[email protected]> wrote:


    *None* of this should be necessary.
    Even the pipeline drain on mode switch should often be avoidable.

    Ouch! Glad I got out of the IRQ handler business before 1990.

    Terje

    I think Eric more than a little exaggerates the level of complexity
    of end-of-interrupt processing needed in the common case.
    Maybe the code is long, but the absolute majority of it is executed
    very rarely, if at all.

    Possibly, as I do have a tendency to get somewhat animated about this.
    I can't find it just now but a while back I was looking at some
    Linux source code for the x86 interrupt return path,
    and it went on for page after page after page.

    I did find this diagram, which shows a slightly convoluted but still
    understandable return path for Linux, and some x86 assembler for it:

    https://www.oreilly.com/library/view/understanding-the-linux/0596002130/ch04s08.html

    https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/entry_32.S

    You can search down to the label ret_from_intr: for example, which does a
    conditional jb resume_kernel then falls through into resume_userspace,
    which does DISABLE_INTERRUPTS and calls prepare_exit_to_usermode,
    then jumps to restore_all which eventually does INTERRUPT_RETURN.

    prepare_exit_to_usermode is in common.c here and does quite
    a lot of other checks:

    https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/common.c

    The problem I have with this approach is that it deals with all the race conditions (eg a nested interrupt posts a new softirq between when you
    checked for pending softirq's and the IRET) by running with interrupts
    disabled for long instruction sequences. I consider that to be a poor way
    to do this as that blocks processing all other interrupts.

    Ideally the ISA and hardware should be designed so the interrupt return
    path does not have to disable interrupts at all, or at worst only
    for a few instructions.
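
    The race described above can be illustrated with a toy model. This is
    only a sketch under stated assumptions: the function name and its two
    flags are hypothetical, the exit path is reduced to three steps (check,
    window, IRET), and nothing here is taken from the Linux source.

```python
def softirq_missed(nested_irq_in_window, interrupts_disabled):
    """Toy model of the exit path: check pending softirqs, then IRET.

    Returns True if a softirq posted inside the check-to-IRET window is
    left unprocessed when the return completes.
    """
    pending = False

    # Step 1: the exit path samples the pending flag.
    saw_pending = pending

    # Step 2: the window between the check and the IRET.
    if nested_irq_in_window and not interrupts_disabled:
        # The nested handler runs here and posts work the check never saw.
        pending = True
    # With interrupts disabled the request simply waits in the interrupt
    # controller; it is delivered after IRET and posts its work then.

    # Step 3: IRET. Work posted after the check has been missed.
    return pending and not saw_pending


if __name__ == "__main__":
    print("window open, interrupts enabled :", softirq_missed(True, False))
    print("window open, interrupts disabled:", softirq_missed(True, True))
```

    Disabling interrupts closes the window in this model, but only by
    delaying every other interrupt for the length of the exit sequence,
    which is the cost being objected to.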

  • From Scott Lurndal@21:1/5 to John Levine on Wed Jun 12 13:56:35 2024
    John Levine <[email protected]> writes:
    According to Lawrence D'Oliveiro <[email protected]d>:
    There's no need to make up silly stories like this ...

    No need to take my word for it. Bitsavers added issues of a magazine
    called “Mainframe” a few months back. I took the trouble to read the first
    one--it’s all about IBM, as though other “mainframe” machines didn’t
    exist. There’s a description of the background to CP/CMS (later VM/CMS)
    there.

    I see Mainframe Journal, with the earliest issue being Jul/Aug 1988. Is
    that it? I don't see anything in the ToC that looks like a VM overview.

    In any event, I'd find the second article I linked to, the VM history
    written by IBMers who were there, more credible than some random third
    party magazine. CMS really was written at the same time as CP, and
    they always intended them to work together as a time-sharing system.

    Lynn's old posts make this pretty clear. And he was there.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Wed Jun 12 13:57:26 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Wed, 12 Jun 2024 07:47:18 -0000 (UTC), John Levine wrote:

    In any event, I'd find the second article I linked to, the VM history
    written by IBMers who were there, more credible than some random third
    party magazine.

    By all means, check the bios of the authors, included as with any
    magazine. It was written by IBM pros, for IBM pros.

    In other words, you have nothing.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Wed Jun 12 13:53:21 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 11 Jun 2024 14:11:55 GMT, Scott Lurndal wrote:

    1) NMI are incredibly useful in certain cases, particularly for
    in-kernel debuggers.
    2) NMI is actually maskable on Intel hardware (in the chipset, not the
    processor)

    Do you see a contradiction between the two?

    No.

  • From Stefan Monnier@21:1/5 to All on Wed Jun 12 10:52:20 2024
    I think Eric more than a little exaggerates the level of complexity
    of end-of-interrupt processing needed in the common case.
    Maybe the code is long, but the absolute majority of it is executed
    very rarely, if at all.
    Possibly, as I do have a tendency to get somewhat animated about this.
    I can't find it just now but a while back I was looking at some
    Linux source code for the x86 interrupt return path,
    and it went on for page after page after page.

    Beside the code size cost and associated runtime impact, there's also
    the fact that this complexity inevitably comes with an increased risk
    of bugs.

    Nick McLaren could go on and on about this as an infinite source of bugs
    that are so hard to track down that they're basically never even
    diagnosed correctly (let alone fixed).


    Stefan

  • From Terje Mathisen@21:1/5 to All on Wed Jun 12 20:34:13 2024
    MitchAlsup1 wrote:
    John Savard wrote:

    On Tue, 11 Jun 2024 00:27:02 +0000, [email protected] (MitchAlsup1)
    wrote:

    ALL I have DONE is to not have the MB write into the cache until the
    causing instruction retires !!

    I suppose that depends on how you define "write".

    I mean the memory cell does not get modified.

    If by "write" you mean store data in the cache, for eventual writing
    out into RAM, well, since RAM doesn't contain "rename locations" to
    play with, it seems to me that any CPU designer had better do that.

    The cache itself is not modified until the memory reference retires.
    But there is a buffer holding the data which can be accessed as if
    it were an L0 cache until the data migrates to the real cache at
    retirement.

    At least, I'm not imaginative enough to think of doing it any other
    way.

    However, if by "write" you mean to change the state of the cache in
    any way, such as by reading data from memory... now, _then_ you would
    indeed have done what is necessary to combat Spectre.

    The cache is not modified; the data is available through another means,
    a means that can be backed up like a mispredicted branch. The buffer
    I am talking about is temporally organized, not spatially organized.

    Obviously, though, a "load" instruction will _never_ retire unless it
    can read the data from memory it is trying to put in a register.

    The LD instruction can obtain data from either the buffer or from
    the data cache itself. The buffer covers the execution window,
    allowing the LD to retire (assuming every older instruction also
    retires).

    So apparently WHAT you have REALLY DONE is to modify how memory reads
    work...

    I pipelined them through a temporally organized memory execution
    window. This also provides for allowing the memory system to run
    OoO wrt program order, and detect actual ordering violations, and
    rerun the memory references in a proper memory order by rerunning
    the references in order.

    You get relaxed memory order performance and precise memory order simultaneously.

    if the data a load instruction requires is not already in the cache,
    then a direct read from memory

    The request is forwarded towards memory through the cache hierarchy
    and data arrives back at requestor (sooner or later).

                                   is performed which *completely
    bypasses* the cache;

    Yes, critical word first.

                         this data (and its associated address) are
    retained by the CPU to be placed in the cache _if_ the instruction is
    actually executed and when it retires.

    Yes !! While the data resides in the buffer, the whole line can be
    accessed by a number of memory reference instructions.

    And, in fact, the various cache levels have to work this way too. You
    have an L1 cache miss, but an L2 cache hit? Fine, you take your data
    directly from L2, and don't promote the data into L1 until instruction
    retirement.

    I use an exclusive cache organization, so data arriving at the CPU
    goes into buffer, which upon retirement goes into L1, which has the
    potential to push a L1->L2 line, and so forth.

    So now the process of fetching data from memory is _not_ done by
    fetching always from L1 and going _through_ L1 to access L2, and
    going _through_ L2 to access RAM, which seems to be the usual way
    these days.

    It's back to the Athlon/Opteron organizations.

    That certainly can be done. But it isn't quite as simple and obvious
    as you seem to claim.

    If you had worked on them you would recognize the advantages and
    disadvantages.

    My 66000 is also insensitive to RowHammer and derivatives.....

    When I first read that sentence, I was completely incredulous. DRAM is
    sensitive to RowHammer because it's gone to feature sizes which are
    beyond the state-of-the-art to do properly... so corners have been
    cut.

    How a CPU can be "insensitive" to it was mysterious.

    After all, RowHammer is caused by multiple rapid-fire accesses to the
    same address, or to related addresses, in memory.

    Yes, the write buffer in my DRAM controller is the L3 cache. Modified
    data in the L3 migrates towards DRAM as DRAM cycles permit, but there
    is no way to cause a line to be continuously written into DRAM.
    If a modified line has migrated to DRAM, and it gets modified again
    in the L3, that 2nd write will not be performed until a refresh cycle
    on that DRAM is performed.

    Thus if one tries to RowHammer My 66000 DRAM, DRAM gets a refresh cycle
    between each write.

    Rowhammer can modify nearby lines, not just the ones that are being
    hammered, right? How do you guarantee that all neighbors will also be refreshed?

    Similarly, if the accesses are LOCK XADD operations, and you have
    multiple CPUs (or cores not sharing a common last-level cache), then I
    don't see any way to keep those accesses from making it all the way to
    the RAM chips?

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lynn Wheeler@21:1/5 to John Levine on Wed Jun 12 11:54:55 2024
    John Levine <[email protected]> writes:
    In any event, I'd find the second article I linked to, the VM history
    written by IBMers who were there, more credible than some random third
    party magazine. CMS really was written at the same time as CP, and
    they always intended them to work together as a time-sharing system.

    Some of the MIT CTSS/7094 people went to the 5th flr to do Multics;
    others went to the science center on the 4th flr to do virtual machines, internal network, invent GML in 1969, other interactive applications.

    cambridge science center wanted a 360/50 to add virtual memory to
    ... but all the spare 360/50s were going to FAA ATC project ... and they
    had to settle for 360/40. (virtual machine) CP/40 (running on bare
    hardware using hardware virtual memory mods) was developed in parallel
    with CMS (running on bare 360/40). When CP/40 virtual machines were
    operational, they then could run CMS in CP/40 virtual machines.

    Melinda history
    http://www.leeandmelindavarian.com/Melinda#VMHist
    and CP/40 http://www.leeandmelindavarian.com/Melinda/JimMarch/CP40_The_Origin_of_VM370.pdf
    my OCR from Comeau's original paper https://www.garlic.com/~lynn/cp40seas1982.txt

    CP/40 morphs into CP/67 when 360/67 standard with virtual memory becomes
    available. I was responsible for OS/360 running on 360/67 (as 360/65;
    univ shutdown datacenter on weekends and I had datacenter dedicated for
    48hrs straight). CSC came out Jan1968 to install CP/67 (3rd install
    after CSC itself and MIT Lincoln Labs) ... and I mostly played with it
    during my weekend dedicated time. First couple months was rewriting
    pathlengths for running OS/360 in virtual machine. Benchmark was OS/360
    jobstream that ran 322secs on real machine. Started out 858secs in
    virtual machine (CP67 CPU 534secs) ... after few months got CP67 CPU
    down to 113secs. I then rewrite time-sharing system scheduling and
    dispatching, page I/O and page replacement, I/O arm scheduling, etc.
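
    The benchmark figures above can be checked with a little arithmetic.
    A minimal sketch: the variable names are illustrative, and the final
    estimate assumes elapsed time is simply real-machine time plus CP67 CPU
    time, which the post does not state.

```python
# Figures quoted in the post (seconds).
real_secs = 322          # OS/360 jobstream on the real machine
virtual_secs = 858       # same jobstream in a CP/67 virtual machine, initially
cp67_cpu_before = 534    # CP67 CPU time, initially
cp67_cpu_after = 113     # CP67 CPU time after pathlength rewrites

# Extra elapsed time caused by running virtualized, before tuning.
overhead_before = virtual_secs - real_secs

# Fraction of CP67 CPU time removed by the rewrites.
cpu_reduction = 1 - cp67_cpu_after / cp67_cpu_before

# ASSUMPTION: elapsed time ~ real time + CP67 CPU time (not from the post).
estimated_virtual_after = real_secs + cp67_cpu_after

print(f"initial virtualization overhead: {overhead_before} s")
print(f"CP67 CPU time reduced by {cpu_reduction:.1%}")
print(f"estimated elapsed time after tuning: {estimated_virtual_after} s")
```

    The initial overhead (536 s) is within two seconds of the quoted CP67
    CPU time (534 s), which is why the assumption above is at least
    plausible for this workload.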

    I've joked that original CP/67 scheduling delivered to univ (and I
    completely replaced) ... looked a lot like Unix scheduling that I first
    saw 15yrs later. Also 1st install at univ (jan1968) had CP67 source in
    OS/360 datasets ... it wasn't until a few months later that they moved
    source to CMS files. After I graduated and joined science center, one of
    my hobbies was enhanced production operating systems for internal
    datacenters.

    CP-67
    https://en.wikipedia.org/wiki/CP-67
    CP/CMS
    https://en.wikipedia.org/wiki/CP/CMS
    History of CP/CMS
    https://en.wikipedia.org/wiki/History_of_CP/CMS
    Cambridge Scientific Center
    https://en.wikipedia.org/wiki/Cambridge_Scientific_Center

    when it was decided to add virtual memory to all 370s, it was also
    decided to rewrite CP67 for VM370, simplifying and/or dropping lots of
    features (also renaming Cambridge Monitor System to Conversational
    Monitor System and crippling its ability to run on real machine).

    1974, I start migrating lots of original CP67 stuff (lots that I had
    done as undergraduate) to VM370 Release2 base for an enhanced internal
    CSC/VM (including for world-wide online sales&marketing support HONE
    systems). Then in 1975 I upgrade to VM370 Release3 base and add the
    CP67 multiprocessor support (one of the things dropped in CP67->VM370)
    ... originally for US consolidated HONE complex so they could add 2nd
    processor to each of their systems (all the US HONE systems had been consolidated in Palo Alto, trivia: when FACEBOOK 1st moved into silicon
    valley, it was into new bldg built next door to the former US
    consolidated HONE datacenter).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Lawrence D'Oliveiro@21:1/5 to Lynn Wheeler on Wed Jun 12 22:26:48 2024
    On Wed, 12 Jun 2024 11:54:55 -1000, Lynn Wheeler wrote:

    when it was decided to add virtual memory to all 370s, it was also
    decided to rewrite CP67 for VM370, simplifying and/or dropping lots of features (also renaming Cambridge Monitor System to Conversational
    Monitor System and crippling its ability to run on real machine).

    I recall CMS was single-user to start with, and the point of running it
    under “CP” aka “VM” was to offer a multi-user service. Did CMS ever become
    multi-user in its own right?

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Wed Jun 12 22:33:47 2024
    On Wed, 12 Jun 2024 09:38:17 -0400, EricP wrote:

    https://www.oreilly.com/library/view/understanding-the-linux/0596002130/ch04s08.html

    That book is from 2002.

    https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/entry_32.S

    That, too, seems a bit old. How about this for a more up-to-date
    version: <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_32.S>.
    Or try the 64-bit version: <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S>.

    The problem I have with this approach is that it deals with all the
    race conditions (eg a nested interrupt posts a new softirq between
    when you checked for pending softirq's and the IRET) by running with interrupts disabled for long instruction sequences. I consider that
    to be a poor way to do this as that blocks processing all other
    interrupts.

    But then again, things are complicated enough as it is.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jun 13 00:43:51 2024
    Scott Lurndal wrote:


    1) NMI are incredibly useful in certain cases, particularly for
    in-kernel debuggers.
    2) NMI is actually maskable on Intel hardware (in the chipset, not the processor)
    3) ARM refused to support NMI in Aarch64 (partially because they didn't
    have a spare exception vector). They've backtracked and hacked in a
    solution using the interrupt controller to create a pseudo-unmaskable
    interrupt due to customer demand.

    On an architecture where one has multiple simultaneous interrupt tables
    (say 1 per Guest OS and 1 per HyperVisor) and each table manages 32K
    individual interrupts, each interrupt masked by its corresponding Enable
    bit::

    Can one NOT infer that; a SW convention to leave at least 1 enable bit
    always enabled, gives the system an NMI ??

  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 13 00:34:34 2024
    EricP wrote:

    MitchAlsup1 wrote:
    John Savard wrote:


    After all, RowHammer is caused by multiple rapid-fire accesses to the
    same address, or to related addresses, in memory.

    Yes, the write buffer in my DRAM controller is the L3 cache. Modified
    data in the L3 migrates towards DRAM as DRAM cycles permit, but there
    is no way to cause a line to be continuously written into DRAM.
    If a modified line has migrated to DRAM, and it gets modified again
    in the L3, that 2nd write will not be performed until a refresh cycle
    on that DRAM is performed.

    Thus if one tries to RowHammer My 66000 DRAM, DRAM gets a refresh cycle
    between each write.

    What does it do if L3 receives more writes than it has ways in a row,
    does it stall evicts from L2?

    There is a 128 line temporally organized buffer between L3 and DRAM.
    So with your proposed 4-way L3, you have 130 accesses between banging on
    one particular line a second time. This buffer is also SNOOPed acting
    like a victim cache. And likewise a similar buffer on the read side,
    acting like a prefetch buffer. There is a connection between the
    buffers so that when ECC error is corrected, the corrected data is
    migrated back into DRAM.

    Let's say L3 is 4-way assoc and all four ways in an L3 row have been updated,
    then a 5th way in that same row is written from L2.

    L3 dumps a selected line into the DRAM write staging buffer. This
    staging buffer is controlled by high and low water marks, and
    operated to minimize DRAM->CHIP and CHIP->DRAM bus turn arounds.
    You try to insert refresh cycles on the DRAM while its pins are
    doing their electrical turn around thing.

    L3 has no place to hold that 5th way and it can't evict one
    of the other 4 ways because that could cause rowhammer.

    Your typical L3 has a lot more ways than you postulate.

    Seems to me that all it can do is stall the 5th write from L2 until
    DRAM refresh rolls around and re-enables one of the pending L3 writes,
    which would back up victim evicts from L2.
    Or maybe L3 has a small fully assoc emergency overflow buffer,
    but still that could fill up too.

    Buffers my friend, buffers.
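
    The refresh-gated write rule quoted above ("that 2nd write will not be
    performed until a refresh cycle on that DRAM is performed") can be
    sketched as a small model. Everything here is an assumption made for
    illustration, not the My 66000 design: the class and method names are
    hypothetical, tracking is per line address, the staging buffer is
    unbounded, and a single refresh() call stands for a refresh of the
    whole DRAM.

```python
class RefreshGatedWriter:
    """Toy DRAM write stage: at most one write per line between refreshes."""

    def __init__(self):
        self.dram = {}                      # line address -> data
        self.dram_writes = 0                # writes that actually reached DRAM
        self.written_since_refresh = set()  # lines already written this interval
        self.deferred = {}                  # line address -> newest pending data

    def _commit(self, line, data):
        self.dram[line] = data
        self.dram_writes += 1
        self.written_since_refresh.add(line)

    def write_back(self, line, data):
        """Return True if the write reached DRAM, False if it was deferred."""
        if line in self.written_since_refresh:
            # A repeat write is held (and coalesced) in the buffer.
            self.deferred[line] = data
            return False
        self._commit(line, data)
        return True

    def refresh(self):
        """A refresh re-arms every line, then drains the deferred writes."""
        self.written_since_refresh.clear()
        pending, self.deferred = self.deferred, {}
        for line, data in pending.items():
            self._commit(line, data)


if __name__ == "__main__":
    w = RefreshGatedWriter()
    for value in range(1000):       # hammer one line with 1000 write-backs
        w.write_back(0x40, value)
    print("DRAM writes before refresh:", w.dram_writes)
    w.refresh()
    print("DRAM writes after refresh: ", w.dram_writes)
```

    In this model a thousand write-backs to one line cost two DRAM writes.
    It says nothing about reads, or about neighbouring rows, which is the
    question raised elsewhere in the thread.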

  • From Lawrence D'Oliveiro@21:1/5 to Lynn Wheeler on Thu Jun 13 01:45:49 2024
    On Wed, 12 Jun 2024 15:19:14 -1000, Lynn Wheeler wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    I recall CMS was single-user to start with, and the point of running it
    under “CP” aka “VM” was to offer a multi-user service. Did CMS ever
    become multi-user in its own right?

    over years relying more & more on CP kernel services, no multi-user ...
    but did get multitasking ...

    Interesting. This fits in with the idea that the “CP” in “CP/CMS” (later
    “VM/CMS”) was invented purely/primarily in order to turn a single-user OS into a kind-of-multi-user OS.

    trivia: my brother was regional Apple rep (largest physical area CONUS)
    and when he came into town, I could be invited to business dinners and
    argue MAC design (even before MAC announced).

    So what did you think of it? The original hardware architecture was
    heavily centred around the 60.15Hz video refresh. Each refresh interval,
    21888 bytes were read out of the video buffer (for the 512×342 display),
    and 740 bytes were read out of the sound buffer to go to the speaker.
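
    The per-refresh figures above follow from simple arithmetic. A short
    sketch, assuming a 1-bit-per-pixel display (the post does not state the
    pixel depth, but it is what makes 512×342 come out to 21888 bytes); the
    constant names are illustrative.

```python
WIDTH, HEIGHT = 512, 342
BITS_PER_PIXEL = 1            # ASSUMPTION: monochrome, one bit per pixel
REFRESH_HZ = 60.15            # from the post
SOUND_BYTES_PER_FRAME = 740   # from the post

# Bytes fetched from the video buffer on each refresh.
video_bytes_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL // 8

# Sustained read rates implied by the refresh frequency.
video_bytes_per_sec = video_bytes_per_frame * REFRESH_HZ
sound_bytes_per_sec = SOUND_BYTES_PER_FRAME * REFRESH_HZ

print(f"video: {video_bytes_per_frame} bytes/frame, "
      f"{video_bytes_per_sec:,.0f} bytes/s")
print(f"sound: {SOUND_BYTES_PER_FRAME} bytes/frame, "
      f"{sound_bytes_per_sec:,.0f} bytes/s")
```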

    The serial controller chip, a Zilog 8530, was remarkably flexible, too.
    The third-party “MacRecorder” device involved reprogramming that to
    receive digital sound data from the external hardware microphone/dongle
    that plugged into the serial port, back before Macs had “official” sound input.

  • From Lynn Wheeler@21:1/5 to Lawrence D'Oliveiro on Wed Jun 12 15:19:14 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    I recall CMS was single-user to start with, and the point of running it
    under “CP” aka “VM” was to offer a multi-user service. Did CMS ever become
    multi-user in its own right?

    over years relying more & more on CP kernel services, no multi-user
    ... but did get multitasking
    https://www.ibm.com/docs/en/zvm/7.3?topic=cms-application-multitasking
    https://www.ibm.com/docs/en/zvm/7.3?topic=programming-zvm-cms-application-multitasking
    https://www.vm.ibm.com/pubs/redbooks/sg245164.pdf

    original CMS that could run on real hardware supported SIO and channel
    programs for file i/o ... a CP "diagnose" function for CMS file i/o was
    added to CP/67 that ran purely synchronous (didn't return to CMS until
    file I/O was completed) ... in transition to VM370, CMS went purely for
    CP "diagnose" (and SIO capability was eliminated).

    When I joined science center and also saw the virtual memory file
    support by MULTICS ... I figured I could do one for CMS ... that scaled
    up faster than the normal file I/O operation ... and I claimed I learned
    what not to do for a page-mapped filesystem from TSS/360 (part of it was
    that TSS/360 just memory mapped the filesystem and then mostly faulted in
    pages ... while I did a combination of memory mapping and pre-fetching,
    read-ahead and write-behind support).

    One of the IBM Future System issues was specifying a TSS/360-like
    filesystem ... one of the last nails in the FS coffin was study that
    showed if 370/195 applications were ported to FS machine made out of the fastest available hardware, it would have throughput of 370/145 (about
    30 times slowdown ... part of it was serialization of file i/o).

    Some existing FS descriptions talk about how FS lived on with S/38 ...
    for entry-level business operation ... there was sufficient hardware
    performance to provide the necessary throughput for the S/38 market.

    In any case, the FS implosion contributed to memory mapped filesystem implementations acquiring very bad reputation inside IBM. In 1980s, I
    could show that heavily loaded, high-end systems with 3380 (3mbyte/sec
    disks) running my page-mapped CMS filesystem had at least three times
    the sustained throughput of standard CMS filesystems.

    some FS
    http://www.jfsowa.com/computer/memo125.htm
    https://people.computing.clemson.edu/~mark/fs.html

    trivia: my brother was regional Apple rep (largest physical area CONUS)
    and when he came into town, I could be invited to business dinners and
    argue MAC design (even before MAC announced). He also figured out how to
    remotely dial into the S/38 that ran Apple to monitor manufacturing and
    delivery schedules.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Lawrence D'Oliveiro@21:1/5 to All on Thu Jun 13 01:46:40 2024
    On Thu, 13 Jun 2024 00:43:51 +0000, MitchAlsup1 wrote:

    Can one NOT infer that; a SW convention to leave at least 1 enable bit
    always enabled, gives the system an NMI ??

    Every interrupt needs to be maskable at some point, if only to avoid
    infinite recursion and resulting stack overflow.

  • From Lynn Wheeler@21:1/5 to Lawrence D'Oliveiro on Wed Jun 12 16:13:03 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    So what did you think of it? The original hardware architecture was
    heavily centred around the 60.15Hz video refresh. Each refresh interval, 21888 bytes were read out of the video buffer (for the 512×342 display),
    and 740 bytes were read out of the sound buffer to go to the speaker.

    biggest issue was what I characterized as kitchen table "only" with no
    business uses ... desktop publishing was somewhat inbetween (visicalc
    wasn't supposedly part of it)... at a time when large corporations were
    ordering tens of thousands of IBM/PC with 3270 terminal emulation
    ... single desktop footprint doing both mainframe terminal and
    increasing kinds of local processing.

    later IBM co-worker left and did some work for Apple using Cray with 100mbyte/sec high-end graphics ... could be used to simulate various
    processor and graphic performance ... part of the joke that Cray used
    apple to design Cray machines and Apple used Cray machine to design
    Apple machines.

    some history
    https://arstechnica.com/features/2005/12/total-share/
    https://arstechnica.com/features/2005/12/total-share.ars/2
    https://arstechnica.com/features/2005/12/total-share.ars/3
    https://arstechnica.com/features/2005/12/total-share.ars/4
    https://arstechnica.com/features/2005/12/total-share.ars/5
    https://arstechnica.com/features/2005/12/total-share.ars/6
    https://arstechnica.com/features/2005/12/total-share.ars/7
    https://arstechnica.com/features/2005/12/total-share.ars/8
    https://arstechnica.com/features/2005/12/total-share.ars/9
    https://arstechnica.com/features/2005/12/total-share.ars/10

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Thu Jun 13 11:41:25 2024
    On Thu, 13 Jun 2024 01:46:40 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 13 Jun 2024 00:43:51 +0000, MitchAlsup1 wrote:

    Can one NOT infer that; a SW convention to leave at least 1 enable
    bit always enabled, gives the system an NMI ??

    Every interrupt needs to be maskable at some point, if only to avoid
    infinite recursion and resulting stack overflow.

    Edge-sensitive interrupt is effectively masked for as long as it is
    latched.

  • From Stefan Monnier@21:1/5 to All on Thu Jun 13 10:56:16 2024
    Rowhammer can modify nearby lines, not just the ones that are being
    hammered, right? How do you guarantee that all neighbors will also
    be refreshed?

    I don't know the answer to this one.

    Similarly, if the accesses are LOCK XADD operations, and you have
    multiple CPUs (or cores not sharing a common last-level cache), then I
    don't see any way to keep those accesses from making it all the way to
    the RAM chips?

    But I can answer that one: don't!
    I.e. the DRAM is attached to one and only one CPU. Any other CPU that
    wants to access that DRAM has to do it through that DRAM's CPU, which
    will make it pass through its shared "last level" cache.


    Stefan

  • From Scott Lurndal@21:1/5 to [email protected] on Thu Jun 13 15:34:30 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:


    1) NMI are incredibly useful in certain cases, particularly for
    in-kernel debuggers.
    2) NMI is actually maskable on Intel hardware (in the chipset, not the
    processor)
    3) ARM refused to support NMI in Aarch64 (partially because they didn't
    have a spare exception vector). They've backtracked and hacked in a
    solution using the interrupt controller to create a pseudo-unmaskable
    interrupt due to customer demand.

    On an architecture where one has multiple simultaneous interrupt tables
    (say 1 per Guest OS and 1 per HyperVisor) and each table manages 32K
    individual interrupts, each interrupt masked by its corresponding Enable
    bit::

    Can one NOT infer that; a SW convention to leave at least 1 enable bit
    always enabled, gives the system an NMI ??

    CLI masks everything (except NMI) on x86 cores. Likewise PSTATE.I and PSTATE.F on AArch64 cores.

    On arm64, there are only two interrupts presented to the CPU (IRQ, FIQ).

    Interrupt prioritization, assignment to CPU signal, and pending status
    are managed by the interrupt controller (GIC), which has routing tables,
    security assignments, priority and mask bits for each of six classes of
    interrupts (SGI, PPI, ePPI, SPI, eSPI and LPI). SGI has 16, PPI has 16,
    SPI has 950, ePPI and eSPI extend the PPI and SPI ranges, and LPIs
    support 24-bit interrupt numbers.
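
    As a rough illustration of those classes, the sketch below maps an
    interrupt ID (INTID) to its class. The function name is hypothetical,
    and the boundaries are the architectural INTID ranges as I understand
    them from the GICv3 specification; a given implementation may support
    fewer interrupts in each range, so check the numbers against the spec
    before relying on them.

```python
# (first INTID, last INTID, class name); assumed architectural ranges.
GIC_RANGES = [
    (0, 15, "SGI"),         # software-generated interrupts
    (16, 31, "PPI"),        # private peripheral interrupts
    (32, 1019, "SPI"),      # shared peripheral interrupts
    (1020, 1023, "special"),
    (1056, 1119, "ePPI"),   # extended PPI range
    (4096, 5119, "eSPI"),   # extended SPI range
]
LPI_BASE = 8192             # LPIs start here and extend upwards


def gic_intid_class(intid):
    """Return the interrupt class for an INTID, or 'reserved'."""
    if intid < 0:
        raise ValueError("INTID must be non-negative")
    for first, last, name in GIC_RANGES:
        if first <= intid <= last:
            return name
    if intid >= LPI_BASE:
        return "LPI"
    return "reserved"


for n in (3, 27, 40, 1100, 4100, 9000):
    print(n, gic_intid_class(n))
```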

  • From EricP@21:1/5 to Michael S on Thu Jun 13 11:35:55 2024
    Michael S wrote:
    On Thu, 13 Jun 2024 01:46:40 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 13 Jun 2024 00:43:51 +0000, MitchAlsup1 wrote:

    Can one NOT infer that; a SW convention to leave at least 1 enable
    bit always enabled, gives the system an NMI ??
    Every interrupt needs to be maskable at some point, if only to avoid
    infinite recursion and resulting stack overflow.

    Edge-sensitive interrupt is effectively masked for as long as it is
    latched.

    The 8086 reset the internal NMI latch after it pushed the trap frame on
    the stack (flags low, flags high, CS, IP) and jumped to the NMI vector,
    so a subsequent rising edge triggered another NMI.

  • From Scott Lurndal@21:1/5 to Michael S on Thu Jun 13 15:36:16 2024
    Michael S <[email protected]> writes:
    On Thu, 13 Jun 2024 01:46:40 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 13 Jun 2024 00:43:51 +0000, MitchAlsup1 wrote:

    Can one NOT infer that; a SW convention to leave at least 1 enable
    bit always enabled, gives the system an NMI ??

    Every interrupt needs to be maskable at some point, if only to avoid
    infinite recursion and resulting stack overflow.

    Edge-sensitive interrupt is effectively masked for as long as it is
    latched.


    Not necessarily. Subsequent edge assertions while the interrupt is pending will be coalesced such that there will be only one delivery.
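
    A minimal sketch of that coalescing, assuming a single pending latch
    per interrupt source (the `EdgeLatch` class is hypothetical and not
    modelled on any particular controller):

```python
class EdgeLatch:
    """Toy model of one edge-triggered interrupt with a single pending bit."""

    def __init__(self):
        self.pending = False
        self.delivered = 0

    def edge(self):
        # A rising edge sets the latch; if it is already set, nothing changes.
        self.pending = True

    def deliver(self):
        # The CPU takes the interrupt: clear the latch, count one delivery.
        if not self.pending:
            return False
        self.pending = False
        self.delivered += 1
        return True


latch = EdgeLatch()
for _ in range(3):
    latch.edge()          # three edges arrive while the interrupt is pending
latch.deliver()
latch.deliver()           # nothing left to deliver
print(latch.delivered)    # 1

latch.edge()              # an edge after the latch was cleared is a new interrupt
latch.deliver()
print(latch.delivered)    # 2
```

    The source is not masked in this model: a new edge is accepted at any
    time, but edges that arrive while the latch is still set are merged.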

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Thu Jun 13 23:48:14 2024
    Paul A. Clayton wrote:

    On 6/8/24 1:37 PM, MitchAlsup1 wrote:
    EricP wrote:

    Scott Lurndal wrote:
    [snip]
    What they found that not only do they not need 4 levels,
    it was a pointless overhead to have to constantly switch between them.
    (There is a pretty high penalty to switching modes, copying in args,
    validating args, doing something usually simple, then switching back,
    when it is all the OS's code anyway.)

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    But for similar reasons ring 1 and 2 are not used in x86 machines,
    either. {{Now, if we could just go back to 1982 and not invent
    IDTs, and call gates, .....}}

    I thought My 66000 had Port Holes that are vaguely similar to
    call gates, so rather than "not invent" perhaps invent with better
    semantics and a better interface?

    I would place them congruent to Load-From and Store-TO PDP-11/70
    instructions.

    I have since converted to a more Linux friendly MMU structure.
    Port Holes can be easily resurrected.

    (Though 1982 might have been too
    early to implement such. Better perceiving when to wait for the
    technology or understanding to implement something better is
    presumably one of the skills acquired by long experience, as is
    the related skill of judging what can be implemented to provide
    the most attractive/marketable features without excessively
    limiting future developments.


    Letting a competitor provide a temporarily better
    product — or delaying entry into a market expecting a feature —
    can sometimes be sensible if one expects to leapfrog with
    a better long-term alternative, but "worse is better" has some
    truth.)

    It seems that in terms of computer architectures, the world is
    not going to beat a path to your door even if you invent a
    better mousetrap.

  • From Lawrence D'Oliveiro@21:1/5 to All on Fri Jun 14 00:36:20 2024
    On Thu, 13 Jun 2024 23:48:14 +0000, MitchAlsup1 wrote:

    It seems that in terms of computer architectures, the world is not going
    to beat a path to your door even if you invent a better mousetrap.

    There is an inherent conflict between wanting an idea to be widely
    adopted, and wanting to maximize your profit from it.

  • From John Savard@21:1/5 to [email protected] on Fri Jun 14 02:57:46 2024
    On Fri, 14 Jun 2024 00:36:20 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:
    On Thu, 13 Jun 2024 23:48:14 +0000, MitchAlsup1 wrote:

    It seems that in terms of computer architectures, the world is not going
    to beat a path to your door even if you invent a better mousetrap.

    There is an inherent conflict between wanting an idea to be widely
    adopted, and wanting to maximize your profit from it.

    But the failure of RISC-B to make x86 obsolete shows that even giving
    it away for free is not enough. Because not being able to run your old
    Windows programs is the real problem.

    John Savard

  • From John Savard@21:1/5 to All on Fri Jun 14 02:55:49 2024
    On Thu, 13 Jun 2024 23:48:14 +0000, [email protected] (MitchAlsup1)
    wrote:

    It seems that in terms of computer architectures, the world is
    not going to beat a path to your door even if you invent a
    better mousetrap.

    And we all know the main reason for that. If you already have a
    computer that you have bought software for, when you upgrade your
    computer, you want to be able to move all that software over to your
    new computer without additional costs or issues.
    That means the new computer must have the same ISA and run the same
    operating system, or that operating system's fully upwards-compatible successor.
    That means that x86 is king for the foreseeable future.
    But _some_ people use Linux, which essentially makes them free to hop
    to any ISA for which the Gnu C compiler works.
    The other option, of course, is a new niche - and so while the desktop
    is Windows on the x86, smartphones are Android on ARM.

    John Savard

  • From Scott Lurndal@21:1/5 to EricP on Fri Jun 14 16:41:24 2024
    EricP <[email protected]> writes:
    Lawrence D'Oliveiro wrote:
    On Wed, 12 Jun 2024 09:38:17 -0400, EricP wrote:

    https://www.oreilly.com/library/view/understanding-the-linux/0596002130/ch04s08.html

    That book is from 2002.

    https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/entry_32.S

    That, too, seems a bit old. How about this for a more up-to-date
    version:
    <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_32.S>.
    Or try the 64-bit version:
    <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S>.

    Thanks, I'll have a look at that entry.S. It looks quite different.
    The copyright on the common.c file I referenced was 2015 so those
    files seemed to be relatively up to date and being maintained.

    The problem I have with this approach is that it deals with all the
    race conditions (eg a nested interrupt posts a new softirq between
    when you checked for pending softirq's and the IRET) by running with
    interrupts disabled for long instruction sequences. I consider that
    to be a poor way to do this as that blocks processing all other
    interrupts.

    But then again, things are complicated enough as it is.

    The cautionary tale here is that the return code path is complicated
    exactly because it wasn't sorted out during the ISA and HW design phase.


    Or perhaps, the cautionary tale is that a 1970 architecture
    must adapt to new paradigms over five decades, and backward
    compatibility requirements lead to inevitable complexity.

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Fri Jun 14 12:24:42 2024
    Lawrence D'Oliveiro wrote:
    On Wed, 12 Jun 2024 09:38:17 -0400, EricP wrote:

    https://www.oreilly.com/library/view/understanding-the-linux/0596002130/ch04s08.html

    That book is from 2002.

    https://coral.googlesource.com/linux-imx/+/refs/heads/release-chef/arch/x86/entry/entry_32.S

    That, too, seems a bit old. How about this for a more up-to-date
    version: <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_32.S>.
    Or try the 64-bit version: <https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S>.

    Thanks, I'll have a look at that entry.S. It looks quite different.
    The copyright on the common.c file I referenced was 2015 so those
    files seemed to be relatively up to date and being maintained.

    The problem I have with this approach is that it deals with all the
    race conditions (eg a nested interrupt posts a new softirq between
    when you checked for pending softirq's and the IRET) by running with
    interrupts disabled for long instruction sequences. I consider that
    to be a poor way to do this as that blocks processing all other
    interrupts.

    But then again, things are complicated enough as it is.

    The cautionary tale here is that the return code path is complicated
    exactly because it wasn't sorted out during the ISA and HW design phase.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Fri Jun 14 21:56:19 2024
    On Fri, 14 Jun 2024 16:41:24 GMT, Scott Lurndal wrote:

    Or perhaps, the cautionary tale is that a 1970 architecture must adapt
    to new paradigms over five decades, and backward compatibility
    requirements lead to inevitable complexity.

    With Linux, it’s easy enough to compare the corresponding code in the
    source subdirectories specific to the other architectures it supports.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Fri Jun 14 21:59:29 2024
    On Fri, 14 Jun 2024 02:57:46 -0600, John Savard wrote:

    But the failure of RISC-B to make x86 obsolete shows that even giving it
    away for free is not enough. Because not being able to run your old
    Windows programs is the real problem.

    You mean RISC-V?

    I think it is succeeding in its goals. From what I hear, it’s already shipping in the billions of units per year, in a similar league to ARM.

    Compare this to x86, which at its peak was shipping about 360 million
    units per year (a million a day), but is now down to about 280 million,
    and continues to stagnate. Sure, there’s a lot more money to be made from those hundreds of millions of x86 chips than from those billions of RISC-V
    and ARM chips.

    Moral: the desktop is not the centre of the computing universe. It is only
    a small part of it.

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Fri Jun 14 22:14:21 2024
    On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:

    John Savard <[email protected]d> schrieb:

    But _some_ people use Linux, which essentially makes them free to hop
    to any ISA for which the Gnu C compiler works.

    It's not quite that simple - if you try to build a modern web browser for POWER on Linux, for example, you're in for quite an adventure.

    Endianness assumptions? I think essentially all of the basic toolchain is already available, so what’s left would be mostly bugs in the app code itself. For which I’m sure they would accept patches.

  • From Thomas Koenig@21:1/5 to John Savard on Fri Jun 14 22:10:23 2024
    John Savard <[email protected]d> schrieb:

    But _some_ people use Linux, which essentially makes them free to hop
    to any ISA for which the Gnu C compiler works.

    It's not quite that simple - if you try to build a modern web browser
    for POWER on Linux, for example, you're in for quite an adventure.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sat Jun 15 07:43:09 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
    It's not quite that simple - if you try to build a modern web browser for
    POWER on Linux, for example, you're in for quite an adventure.

    Endianness assumptions?

    OpenPower is little-endian, so I doubt that this is the reason. From
    what I read, Web browsers are a beast to build.
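
    For reference, the kind of endianness assumption being asked about can
    be sketched as follows. This is an illustrative example only, not code
    from any browser; `read_u32_le` is a hypothetical helper.

```python
import struct
import sys

value = 0x11223344

little = struct.pack("<I", value)   # explicit little-endian layout
big = struct.pack(">I", value)      # explicit big-endian layout
native = struct.pack("=I", value)   # whatever the host byte order is

print(little.hex())   # 44332211
print(big.hex())      # 11223344
print(sys.byteorder, native.hex())


def read_u32_le(buf):
    """Decode a 32-bit little-endian field regardless of the host byte order."""
    return int.from_bytes(buf[:4], "little")


# Portable: the byte order is part of the format, not of the machine.
print(hex(read_u32_le(little)))     # 0x11223344

# Non-portable: decoding the same bytes with the host's native order
# gives the right answer only on a little-endian host.
(assumed,) = struct.unpack("=I", little)
print(hex(assumed))
```

    Code that takes the second route works on x86 and little-endian
    OpenPower alike, which is consistent with endianness not being the
    main obstacle here.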

    I think essentially all of the basic toolchain is
    already available, so what’s left would be mostly bugs in the app code
    itself.

    There's a huge difference between what application maintainers
    consider bugs in application code and what the C and C++ compiler
    maintainers do.

    For which I’m sure they would accept patches.

    Who would write them? I have posed a challenge to advocates of
    undefined behaviour as the way to efficiency in <[email protected]>:

    |Write a proof-of-concept Forth interpreter in the language you
    |advocate that runs at least one of bubble-sort, matrix-mult or sieve
    |from bench/forth in
    |<http://www.complang.tuwien.ac.at/forth/bench.zip>

    Nobody has risen to the challenge, much less submitted patches to
    convert Gforth to the kind of C code that gcc officially supports.

    Fortunately, the practice is quite a bit better than what the
    advocates threaten, so Gforth builds nicely on RISC-V, and, last I
    tried it, on Power.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Dallman@21:1/5 to D'Oliveiro on Sat Jun 15 08:28:00 2024
    In article <v4ieg1$32kuq$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    You mean RISC-V?

    I think it is succeeding in its goals. From what I hear, it's
    already shipping in the billions of units per year, in a similar
    league to ARM.

    As an open-source project, it doesn't seem to have "goals" in quite the
    same way as a project controlled by a single organisation, or a small
    group of them.

    It's competing effectively with ARM in the embedded world. For mobile,
    desktop and datacentre, things have not progressed that far. When SiFive abandoned development of high-powered general-purpose CPUs, the push into
    those spaces faltered. I'm sorry about that: I was looking forward to
    having another architecture to learn and support.

    John

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jun 15 09:59:15 2024
    Anton Ertl <[email protected]> schrieb:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
    It's not quite that simple - if you try to build a modern web browser for
    POWER on Linux, for example, you're in for quite an adventure.

    Endianness assumptions?

    OpenPower is little-endian, so I doubt that this is the reason. From
    what I read, Web browsers are a beast to build.

    So they are, especially the build times.

    But for an unsupported architecture: If you want to have an
    idea what needed to be done for Chrome at one time, look at
    https://github.com/shawnanastasio/chromium_power . It's a lot of
    configuration stuff, but also some code.

  • From John Dallman@21:1/5 to D'Oliveiro on Sat Jun 15 12:16:00 2024
    In article <v4ifbt$32kuq$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
    It's not quite that simple - if you try to build a modern web
    browser for POWER on Linux, for example, you're in for quite
    an adventure.

    Endianness assumptions? I think essentially all of the basic
    toolchain is already available, so what's left would be mostly bugs
    in the app code itself. For which I'm sure they would accept
    patches.

    There are a _lot_ of libraries and other components that go into a modern
    web browser, many of which will never have been built on POWER. The JITer
    for the Javascript engine, and the Web Assembly translator seem to be
    among them, and they need to make use of the native instruction set.
    That's not a bug fix, that's a significant implementation task.

    A modern web browser does a _lot_ more than interpret HTML and display
    bitmaps, and most of the code for the extra functionality is in the
    browser. They're more like multimedia operating systems than document
    viewers.

    John

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Sat Jun 15 22:33:03 2024
    On Sat, 15 Jun 2024 12:16 +0100 (BST), John Dallman wrote:

    There are a _lot_ of libraries and other components that go into a
    modern web browser, many of which will never have been built on POWER.
    The JITer for the Javascript engine, and the Web Assembly translator
    seem to be among them, and they need to make use of the native
    instruction set. That's not a bug fix, that's a significant
    implementation task.

    But those are not required for correctness, only for efficiency. The
    original question, as I understood it, was to get the code running on the specified architecture, not necessarily to get it running at top speed.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sat Jun 15 22:31:14 2024
    On Sat, 15 Jun 2024 07:43:09 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    For which I’m sure they would accept patches.

    Who would write them?

    Whoever cared.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Jun 16 01:55:26 2024
    On Sat, 15 Jun 2024 22:33:03 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sat, 15 Jun 2024 12:16 +0100 (BST), John Dallman wrote:

    There are a _lot_ of libraries and other components that go into a
    modern web browser, many of which will never have been built on
    POWER. The JITer for the Javascript engine, and the Web Assembly
    translator seem to be among them, and they need to make use of the
    native instruction set. That's not a bug fix, that's a significant implementation task.

    But those are not required for correctness, only for efficiency. The
    original question, as I understood it, was to get the code running on
    the specified architecture, not necessarily to get it running at top
    speed.

    Do you use video codecs in FF for correctness or only for efficiency?
    ;-)

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun Jun 16 02:52:02 2024
    On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:

    Do you use video codecs in FF for correctness or only for efficiency?

    I think FFmpeg is one of those basic toolkits that has already been ported
    to OpenPOWER.

  • From Thomas Koenig@21:1/5 to Michael S on Sun Jun 16 08:00:02 2024
    Michael S <[email protected]> schrieb:
    On Sun, 16 Jun 2024 02:52:02 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:

    Do you use video codecs in FF for correctness or only for
    efficiency?

    I think FFmpeg is one of those basic toolkits that has already been
    ported to OpenPOWER.

    Is it capable of decoding H264?

    https://ffmpeg.org/ffmpeg-codecs.html says yes, if you use http://www.openh264.org/ .

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Jun 16 10:34:47 2024
    On Sun, 16 Jun 2024 02:52:02 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:

    Do you use video codecs in FF for correctness or only for
    efficiency?

    I think FFmpeg is one of those basic toolkits that has already been
    ported to OpenPOWER.

    Is it capable of decoding H264?

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun Jun 16 08:49:42 2024
    On Sun, 16 Jun 2024 10:34:47 +0300, Michael S wrote:

    On Sun, 16 Jun 2024 02:52:02 -0000 (UTC) Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:

    Do you use video codecs in FF for correctness or only for efficiency?

    I think FFmpeg is one of those basic toolkits that has already been
    ported to OpenPOWER.

    Is it capable of decoding H264?

    I’m surprised you didn’t know, since you were the one who mentioned it.

    It has options to build against toolkits for every codec and file format
    that is still worth using these days, and a few more besides.

  • From Michael S@21:1/5 to Thomas Koenig on Sun Jun 16 12:51:42 2024
    On Sun, 16 Jun 2024 08:00:02 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Sun, 16 Jun 2024 02:52:02 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:

    Do you use video codecs in FF for correctness or only for
    efficiency?

    I think FFmpeg is one of those basic toolkits that has already been
    ported to OpenPOWER.

    Is it capable of decoding H264?

    https://ffmpeg.org/ffmpeg-codecs.html says yes, if you use http://www.openh264.org/ .

    Thank you.
    I see that ppc64el is not supported, but verified to be working.
    Hopefully that means it doesn't just show something under FHD
    resolution, but can work without dropping frames. Which is not so easy
    when done purely in software.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Jun 16 12:43:40 2024
    On Sun, 16 Jun 2024 08:49:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 16 Jun 2024 10:34:47 +0300, Michael S wrote:

    On Sun, 16 Jun 2024 02:52:02 -0000 (UTC) Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:

    Do you use video codecs in FF for correctness or only for
    efficiency?

    I think FFmpeg is one of those basic toolkits that has already been
    ported to OpenPOWER.

    Is it capable of decoding H264?

    I’m surprised you didn’t know, since you were the one who mentioned
    it.

    It has options to build against toolkits for every codec and file
    format that is still worth using these days, and a few more besides.

    All I know about it is that a typical FF installation on x86-64 uses a
    plug-in provided by Cisco. I have no idea if the reason for it is
    technical or legal.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun Jun 16 09:56:29 2024
    On Sun, 16 Jun 2024 12:43:40 +0300, Michael S wrote:

    On Sun, 16 Jun 2024 08:49:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 16 Jun 2024 10:34:47 +0300, Michael S wrote:

    Is it capable of decoding H264?

    I’m surprised you didn’t know, since you were the one who mentioned it.

    It has options to build against toolkits for every codec and file
    format that is still worth using these days, and a few more besides.

    All I know about it is that a typical FF installation on x86-64 uses a
    plug-in provided by Cisco. I have no idea if the reason for it is
    technical or legal.

    Or a Windows thing.

  • From Torbjorn Lindgren@21:1/5 to [email protected] on Sun Jun 16 12:47:40 2024
    Michael S <[email protected]> wrote:
    On Sun, 16 Jun 2024 08:49:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:
    It has options to build against toolkits for every codec and file
    format that is still worth using these days, and a few more besides.

    All I know about it is that a typical FF installation on x86-64 uses a
    plug-in provided by Cisco. I have no idea if the reason for it is
    technical or legal.

    I believe the reason is patents... Which is another way of saying it
    is/was for legal reasons.

    h.264 is/was extremely heavily patented, the MPEGLA patent consortium
    list of the patents they include in their license is 58 pages[1],
    with three columns! A lot of that is due to national patent duplication
    but still, Firefox just says "More than 1000 matches" when I search
    for just US patents!

    Now, a lot of the US patents expired during 2023, but did ALL? And what
    about all the patents in other countries? And also remember that this
    decision was made long ago when all these patents were presumed to be
    valid (including by courts) and there were real grumblings about
    various people suing other people under these patents.

    So when Cisco open-sourced their h.264 implementation under a BSD
    2-clause license back in 2014? and let Firefox use it while being
    covered by Cisco's MPEGLA agreement (unknown date)?!

    Well, since no one else stepped up (and there were calls since, well,
    Cisco) it was the *only* way Firefox could safely ship binaries that
    could do h.264 out of the box. So lots of hand wringing and teeth
    gnashing but... no choice really if they wanted to survive.

    IIRC before (and after) that there were various workarounds with
    external modules (say the user providing FFmpeg) or just "use the
    system video player" (possibly playing in a different window!) but
    none of them were good solutions.

    1. https://www.mpegla.com/wp-content/uploads/avc-att1.pdf

  • From Robert Swindells@21:1/5 to John Dallman on Sun Jun 16 16:58:58 2024
    On Sat, 15 Jun 2024 12:16 +0100 (BST), John Dallman wrote:

    In article <v4ifbt$32kuq$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    On Fri, 14 Jun 2024 22:10:23 -0000 (UTC), Thomas Koenig wrote:
    It's not quite that simple - if you try to build a modern web browser
    for POWER on Linux, for example, you're in for quite an adventure.

    Endianness assumptions? I think essentially all of the basic toolchain
    is already available, so what's left would be mostly bugs in the app
    code itself. For which I'm sure they would accept patches.

    There are a _lot_ of libraries and other components that go into a
    modern web browser, many of which will never have been built on POWER.
    The JITer for the Javascript engine, and the Web Assembly translator
    seem to be among them, and they need to make use of the native
    instruction set. That's not a bug fix, that's a significant
    implementation task.

    A modern web browser does a _lot_ more than interpret HTML and display
    bitmaps, and most of the code for the extra functionality is in the
    browser. They're more like multimedia operating systems than document
    viewers.

    Even the code that interprets HTML can be difficult to port to a less
    common architecture, or even keep it working on that architecture.

    In Firefox, a lot of it is written in Rust which is still changing
    rapidly.

  • From Terje Mathisen@21:1/5 to Michael S on Sun Jun 16 21:23:45 2024
    Michael S wrote:
    On Sun, 16 Jun 2024 08:00:02 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Sun, 16 Jun 2024 02:52:02 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 16 Jun 2024 01:55:26 +0300, Michael S wrote:

    Do you use video codecs in FF for correctness or only for
    efficiency?

    I think FFmpeg is one of those basic toolkits that has already been
    ported to OpenPOWER.

    Is it capable of decoding H264?

    https://ffmpeg.org/ffmpeg-codecs.html says yes, if you use
    http://www.openh264.org/ .

    Thank you.
    I see that ppc64el is not supported, but verified to be working.
    Hopefully that means it doesn't just show something under FHD
    resolution, but can work without dropping frames. Which is not so easy
    when done purely in software.

    Thanks, I know!

    On the very first 4-core Intel CPUs their own reference codec only
    managed 30 frames/second if everything was maxed out, i.e. CABAC
    encoding, 60 frames/second, 1080p resolution, 40 Mbit/s bitrate.

    (This was while running all 4 cores at 100%)

    They paid me very well to show them how they could in fact double this,
    but instead of also paying me to actually implement the code for them,
    they licensed a chunk of VLSI to do it in HW. (Which I think was the
    right thing to do.)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lawrence D'Oliveiro@21:1/5 to Torbjorn Lindgren on Mon Jun 17 00:10:38 2024
    On Sun, 16 Jun 2024 12:47:40 -0000 (UTC), Torbjorn Lindgren wrote:

    h.264 is/was extremely heavily patented, the MPEGLA patent consortium
    list of the patents they include in their license is 58 pages[1], with
    three columns!

    But here’s the fun thing: a proprietary OS like Windows can include MPEG-4 H.264 playback for free, but if you want to play older-format DVD-Video (MPEG-2), that’s an extra cost.

    Or it was, unless those patents have all expired by now.

    H.265 is even worse. MPEG-LA has its own patent pool for that, but I think there is an entirely separate group also claiming “intellectual property” rights on aspects of that.

    This is why Google and others are promoting AV1. That’s meant to be
    comparable to H.265 in performance and quality, but completely
    patent-unencumbered.

    Which has not stopped some greedy groups from claiming they’re bound to
    have some patents somewhere that are likely to apply. Details not publicly available, of course.

    Now you know why the FFmpeg project is based in Hungary. Funny thing about
    US patents, is that they only apply in the US.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Tue Oct 22 21:08:24 2024
    On Mon, 21 Oct 2024 0:42:32 +0000, Paul A. Clayton wrote:

    THREAD NECROMANCY

    On 6/11/24 5:18 PM, MitchAlsup1 wrote:
    [snip]
    I doubt that RowHammer still works when refreshes are interspersed
    between accesses--RowHammer generally works because the events are
    not protected by refreshes--the DRC sees the right ROW open and
    simply streams at the open bank.

    If one refreshes the two adjacent rows to avoid data disruption,
    those refreshes would be adjacent reads to two other rows so it
    seems one would have to be a little cautious about excessively
    frequent refreshes.

    Also note, there are no instructions in My 66000 that force a cache
    line to DRAM whereas there are instructions that can force a cache line
    into L3.

    How does a system suspend to DRAM if it cannot force a writeback
    of all dirty lines to memory?

    In GENERAL, you do not want to give this capability to applications
    nor use it willy-nilly.

    I am *guessing* this would not use a
    special instruction but rather configuration of power management
    that would cause hardware/firmware to clean the cache.

    There is a sideband command from any master (anywhere) that causes
    L3 to get dumped to DRAM over the next refresh interval. It is not
    an instruction, and the TLB has to cooperate. A device may initiate
    "suspend to DRAM" as well as a CPU (or any other bus master).

    Writing back specific data to persistent memory might also
    motivate cache block cleaning operations. Perhaps one could
    implement such by copying from a cacheable mapping to a
    non-cacheable(I/O?) memory?? (I simply remember that Intel added
    instructions to write cache lines to persistent memory.)

    L3 is the buffer to DRAM. Nothing gets to DRAM without
    going through L3 and nothing comes out of DRAM that is not also
    buffered by L3. So, if 96 cores simultaneously read a line residing in
    DRAM, DRAM is read once and 95 cores are serviced through L3. So,
    you can't RowHammer based on reading DRAM, either.
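
    The buffering claim above can be illustrated with a toy model. This is
    only a sketch: the class and function names (Dram, L3Buffer,
    simulate_shared_read) are hypothetical, and it ignores capacity,
    eviction, coherence and timing. It shows only that repeated reads of
    one line reach DRAM once.

```python
class Dram:
    """Toy DRAM that only counts how many line reads reach it."""

    def __init__(self):
        self.reads = 0

    def read_line(self, addr):
        self.reads += 1
        return ("line", addr)


class L3Buffer:
    """Toy L3: every DRAM read goes through it and is kept afterwards.

    Assumption: unlimited capacity, no eviction, no coherence traffic.
    """

    def __init__(self, dram):
        self.dram = dram
        self.lines = {}

    def read(self, addr):
        # Only the first requester of a line causes a DRAM read.
        if addr not in self.lines:
            self.lines[addr] = self.dram.read_line(addr)
        return self.lines[addr]


def simulate_shared_read(num_cores, addr=0x1000):
    """Have num_cores cores read the same line; return DRAM read count."""
    dram = Dram()
    l3 = L3Buffer(dram)
    for _ in range(num_cores):
        l3.read(addr)
    return dram.reads


if __name__ == "__main__":
    print(simulate_shared_read(96))
```

    In this model the 96-core case produces a single DRAM read; the other
    95 requests are served from the buffered copy.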

    If 128 cores read distinct cache lines from the same page quickly
    enough to hammer the adjacent pages but not quickly enough to get
    DRAM page open hits, this would seem to require relatively
    frequent refreshes of adjacent DRAM rows.

    DDR5 has a 64 GB/s transfer rate
    128 cache lines (64B) is 8192 bytes
    So this takes about 1/8 of a microsecond, or 128 ns.
    A DDR5 refresh interval is 3.9µs.

    https://www.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf#:~:text=REFRESH%20commands%20are%20issued%20at%20an%20average%20periodic,of%20295ns%20for%20a%2016Gb%20DDR5%20SDRAM%20device.

    So one has refreshes in the described situation.
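
    A quick way to check these figures is to compute them directly. The
    sketch below assumes decimal gigabytes (64e9 bytes/s) and treats the
    quoted rate as sustained throughput, which is optimistic for a real
    channel; the constant and function names are made up for illustration.

```python
LINE_BYTES = 64               # cache line size quoted in the post
NUM_LINES = 128               # one line per core in the scenario
RATE_BYTES_PER_S = 64e9       # assumption: "64 GB/s" means 64e9 bytes/s
REFRESH_INTERVAL_S = 3.9e-6   # DDR5 refresh interval quoted in the post


def burst_time_s(num_lines=NUM_LINES, line_bytes=LINE_BYTES,
                 rate=RATE_BYTES_PER_S):
    """Time to move num_lines cache lines at the given byte rate."""
    return num_lines * line_bytes / rate


if __name__ == "__main__":
    t = burst_time_s()
    print(f"{NUM_LINES * LINE_BYTES} bytes take {t * 1e9:.0f} ns")
    print(f"bursts per refresh interval: {REFRESH_INTERVAL_S / t:.1f}")
```

    With these inputs the transfer comes to roughly 128 ns, so about
    thirty such bursts fit inside one 3.9µs refresh interval.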

    Since the L3/memory controller could see that the DRAM row was
    unusually active, it could increase prefetching while the DRAM
    row was open and/or queue the accesses longer so that the
    hammering frequency was reduced and page open hits would be more
    common.

    A DRAM Row stays active, commands just CAS-out more data. That is,
    there is no ROW Hammering--the word line remains asserted while
    the sense amplifiers remain asserted with captured data--while
    CASs are used to strobe out more data {subject to refresh}.
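
    The RAS/CAS distinction can be sketched by counting row activations
    for a stream of accesses to one bank. This is a toy model under an
    assumed open-page policy: it ignores refresh (which closes the row),
    timing limits and multiple banks, and count_activations is a
    hypothetical helper, not anything from a real controller.

```python
def count_activations(row_accesses):
    """Count row activations (RAS) per row for accesses to one bank.

    Assumption: open-page policy. An access to the already-open row only
    needs a CAS; an access to a different row closes the open row and
    activates the new one.
    """
    open_row = None
    activations = {}
    for row in row_accesses:
        if row != open_row:
            activations[row] = activations.get(row, 0) + 1
            open_row = row
    return activations


if __name__ == "__main__":
    streaming = [7] * 1000     # 1000 reads from the same open row
    hammering = [7, 9] * 500   # 1000 reads alternating between two rows
    print(count_activations(streaming))
    print(count_activations(hammering))
```

    In this model, streaming from one open row activates it once however
    many columns are read, while alternating between two rows in the same
    bank activates each of them on every access. The second pattern is the
    one RowHammer depends on.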

    The simple statement that L3 would avoid RowHammer by providing
    the same cache line to all requesters seemed a bit too simple.

    You need to investigate the difference between RAS and CAS for
    DRAMs.

    Your design may very well handle all the problematic cases,
    perhaps even with minimal performance penalties for inadvertent
    hammering and logging/notification for questionable activity just
    like for error correction (and has been proposed for detected race conditions). I just know that these are hard problems.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Oct 23 22:38:40 2024
    On Sun, 9 Jun 2024 2:23:35 +0000, Lawrence D'Oliveiro wrote:

    On Sat, 8 Jun 2024 17:37:46 +0000, MitchAlsup1 wrote:

    VAX was before common era Hypervisors, do you think VAX could have
    supported secure mode and hypervisor with their 4 levels ??

    “Virtualization” was bandied about in the 1980s more as an idle, theoretical concept rather than a practical one.

    The question was: was the instruction set defined so that code designed
    to run in a privileged mode could be run unprivileged, so that any
    attempt to do privileged things would be trapped and emulated by the
    real privileged code? And there was nothing it could do to discover
    it wasn’t running in privileged mode?

    My 66000 ISA has this property, and it is used when hypervisors host hypervisors.

    On the other hand, there is only 1 privileged instruction which
    provides access to 4 separate control register spaces based on
    current Core-Stack level.

    (Obviously performance was not the issue here, but correctness was.)

    For example, the VAX had a MOVPSL instruction that allowed read-only
    access to the entire processor status register. Through this,
    nonprivileged user-mode code could discover it was running in user mode, which would blow the illusion.
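
    The trap-and-emulate idea, and the way a MOVPSL-style instruction
    defeats it, can be sketched with a toy machine. Everything here is
    hypothetical: the opcodes READ_STATUS and WRITE_CONTROL and the classes
    ToyCpu and ToyHypervisor are invented for illustration and model no
    real ISA.

```python
class PrivilegeTrap(Exception):
    """Raised when unprivileged code attempts a privileged operation."""


class ToyCpu:
    def __init__(self):
        self.mode = "user"  # real hardware mode; the guest always runs here

    def execute(self, op):
        if op == "READ_STATUS":
            # Sensitive but NOT privileged (the MOVPSL-like case):
            # it runs without trapping and exposes the real mode.
            return self.mode
        if op == "WRITE_CONTROL":
            # Privileged: traps unless the CPU is really in kernel mode.
            if self.mode != "kernel":
                raise PrivilegeTrap(op)
            return "done"
        raise ValueError(f"unknown op {op!r}")


class ToyHypervisor:
    def __init__(self, cpu):
        self.cpu = cpu
        self.guest_virtual_mode = "kernel"  # what the guest believes
        self.emulated = []

    def run_guest(self, op):
        try:
            return self.cpu.execute(op)
        except PrivilegeTrap:
            # Trap-and-emulate: perform the operation on the guest's behalf.
            self.emulated.append(op)
            return "done"


if __name__ == "__main__":
    hv = ToyHypervisor(ToyCpu())
    print(hv.run_guest("WRITE_CONTROL"))  # trapped, then emulated
    print(hv.run_guest("READ_STATUS"))    # no trap: real mode leaks out
```

    The privileged write traps and is emulated, so the guest cannot tell
    the difference. The status read never traps, so the guest sees "user"
    even though it believes it is running in kernel mode.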

    While illustrative, we have entered the realm where processor state
    is closer to a cache line in size than a register in size. And the
    processor (core) stack of software layers is closer to 4 cache lines
    in size.

    The Motorola 680x0 family was I think properly virtualizable in this
    sense. Or maybe the 68020 and 68030 were, but the 68040 wasn’t. I think
    the Motorola engineers working on the ’040 asked if any customers were
    interested in preserving the self-virtualization feature, and nobody
    seemed to care.

    During 020 development and testing, there was a mode whereby each
    instruction executed raised every possible exception--this only found
    99% of the virtualization problems.
