• weak consistency and the supercomputer attitude (was: Memory dependency

    From Anton Ertl@21:1/5 to Chris M. Thomasson on Mon Nov 13 07:48:35 2023
    "Chris M. Thomasson" <[email protected]> writes:
    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would ruin
    performance right off the bat... Think about it.

    Proof by claim?

    I think about several similar instances, where people went for
    simple-minded hardware designs and threw the complexity over the wall
    to the software people, and claimed that it was for performance; I
    call that the "supercomputing attitude", and it may work in areas
    where the software crisis has not yet struck[1], but is a bad attitude
    in areas like general-purpose computing where it has struck.

    1) People thought that they could achieve faster hardware by throwing
    the task of scheduling instructions for maximum instruction-level
    parallelism over to the compiler people. Several companies (in
    particular, Intel, HP, and Transmeta) invested a lot of money into
    this dream (and the Mill project relives this dream), but it turned
    out that doing the scheduling in hardware is faster.

    2) A little earlier, the Alpha designers thought that they could gain
    speed by not implementing denormal numbers and by implementing
    imprecise exceptions for FP operations, so that it is not possible to
    implement denormal numbers through a software fixup, either. For
    dealing properly with denormal numbers, you had to insert a trap
    barrier right after each FP instruction, and presumably this cost a
    lot of performance on early Alpha implementations. However, when I
    measured it on the 21264 (released six years after the first Alpha),
    the cost was like that of a nop; I guess that the trap barrier was
    actually a nop on the 21264, because, as an OoO processor, the 21264
    performs precise exceptions without breaking a sweat. And the 21264
    is faster than the models where the trap barrier actually does
    something. Meanwhile, Mitch Alsup also has posted that he can
    implement fast denormal numbers with IIRC 30 extra gates (which is
    probably less than what is needed for implementing the trap barrier).
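
    What is lost without denormal numbers can be seen in a minimal Python
    sketch. It assumes IEEE 754 doubles with gradual underflow (the usual
    case for CPython on mainstream hardware). The helper flush_to_zero is a
    hypothetical software emulation for illustration only; it is not how
    the Alpha behaved internally.

```python
import sys

DBL_MIN = sys.float_info.min  # smallest positive normal double, 2**-1022


def flush_to_zero(value: float) -> float:
    """Hypothetical emulation: replace any subnormal value by a signed zero."""
    if value != 0.0 and abs(value) < DBL_MIN:
        return 0.0 if value > 0 else -0.0
    return value


x = 1.5 * DBL_MIN
y = DBL_MIN

# With gradual underflow the difference is the subnormal 2**-1023, so
# "x != y" still implies "x - y != 0".
gradual = x - y

# With flush-to-zero the same difference collapses to zero, even though
# x and y are distinct.
flushed = flush_to_zero(x - y)
```

    With denormals, code that guards a division with a test like x != y
    stays safe near the underflow threshold; with flush-to-zero it does not.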

    3) The Alpha is a rich source of examples of the supercomputer
    attitude: It started out without instructions for accessing 8-bit and
    16-bit data in memory. Instead, the idea was that for accessing
    memory, you would use instruction sequences, and for accessing I/O
    devices, the device was mapped three times or so: In one address range
    you performed bytewise access, in another address range 16-bit
    accesses, and in the third address range 32-bit and 64-bit accesses;
    I/O driver writers had to write or modify their drivers for this
    model. The rationale for that was that they required ECC for
    permanent storage and that would supposedly require slow RMW accesses
    for writing bytes to write-back caches. Now the 21064 and 21164 had a
    write-through D-cache. That made it easy to add byte and word
    accesses (BWX) in the 21164A (released 1996), but they could have done
    it from the start. The 21164A is in no way slower than the 21164; it
    has the same IPC and a higher clock rate.
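
    The "instruction sequences" for memory accesses amount to a
    read-modify-write of the enclosing aligned word. A rough Python sketch
    of that idea follows, assuming little-endian byte numbering and a
    memory modelled as a list of 64-bit words. The names load_byte and
    store_byte are illustrative and are not Alpha mnemonics.

```python
MASK64 = (1 << 64) - 1


def load_byte(memory: list[int], address: int) -> int:
    """Load the aligned 64-bit word, then extract one byte from it."""
    word = memory[address // 8]
    shift = (address % 8) * 8
    return (word >> shift) & 0xFF


def store_byte(memory: list[int], address: int, value: int) -> None:
    """Emulate a byte store using only aligned 64-bit loads and stores."""
    index = address // 8
    shift = (address % 8) * 8
    word = memory[index]                 # load the enclosing word
    word &= ~(0xFF << shift) & MASK64    # clear the target byte
    word |= (value & 0xFF) << shift      # insert the new byte
    memory[index] = word                 # store the whole word back


memory = [0x1122334455667788, 0]
store_byte(memory, 2, 0xAB)
```

    The load, modify, and store steps are separate, so the sequence is not
    atomic: a concurrent write to a neighbouring byte of the same word
    between the load and the store would be lost.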

    Some people welcome and celebrate the challenges that the
    supercomputer attitude poses for software, and justify it with
    "performance", but as the examples above show, such claims often turn
    out to be false when you actually invest effort into more capable
    hardware.

    Given that multi-processors come out of supercomputing, it's no
    surprise that the supercomputing attitude is particularly strong
    there.

    But if you look at it from an architecture (i.e., hardware/software
    interface) perspective, weak consistency is just bad architecture:
    good architecture says what happens to the architectural state when
    software performs some instruction. From that perspective sequential
    consistency is architecturally best. Weaker consistency models
    describe how the architecture does not provide the sequential
    consistency guarantees that are so easy to describe; the weaker the
    model, the more deviations it has to describe.
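
    One such deviation can be made concrete with the classic
    store-buffering litmus test. The Python sketch below enumerates every
    legal event order for two threads, first under sequential consistency
    and then under a simplified, TSO-like store-buffer model. The model is
    an assumption made for illustration and does not describe any
    particular processor.

```python
from itertools import permutations

# Store-buffering litmus test; both variables start at 0.
#   thread 0: x = 1; r0 = y
#   thread 1: y = 1; r1 = x
PROGRAM = {0: ("x", "y"), 1: ("y", "x")}  # thread -> (stored var, loaded var)


def outcomes(store_buffers: bool) -> set[tuple[int, int]]:
    """Enumerate every reachable (r0, r1) result of the litmus test."""
    events = [(t, kind) for t in PROGRAM for kind in ("store", "drain", "load")]
    results = set()
    for order in permutations(events):
        pos = {event: i for i, event in enumerate(order)}
        legal = True
        for t in PROGRAM:
            store, drain, load = pos[t, "store"], pos[t, "drain"], pos[t, "load"]
            # A store is issued before it drains, and before the later load.
            legal = legal and store < drain and store < load
            # Without store buffers the store reaches memory before the load.
            if not store_buffers:
                legal = legal and drain < load
        if not legal:
            continue
        memory = {"x": 0, "y": 0}
        regs = {}
        for t, kind in order:
            stored, loaded = PROGRAM[t]
            if kind == "drain":
                memory[stored] = 1
            elif kind == "load":
                # Each thread loads the variable it did not store, so its
                # own store buffer never needs to be consulted here.
                regs[t] = memory[loaded]
        results.add((regs[0], regs[1]))
    return results


sc = outcomes(store_buffers=False)
buffered = outcomes(store_buffers=True)
```

    Under sequential consistency at least one thread sees the other's
    store. With store buffers the extra outcome r0 = r1 = 0 appears, and
    that is the kind of deviation a weaker model has to document.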

    [1] The software crisis is that software costs are higher than
    hardware costs, and supercomputing with its gigantic hardware costs
    notices the software crisis much later than general-purpose computing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Nov 13 10:36:52 2023
    1) People thought that they could achieve faster hardware by throwing
    the task of scheduling instructions for maximum instruction-level
    parallelism over to the compiler people. Several companies (in
    particular, Intel, HP, and Transmeta) invested a lot of money into
    this dream (and the Mill project relives this dream), but it turned
    out that doing the scheduling in hardware is faster.

    IIRC the main argument for the Mill wasn't that it was going to be
    faster but that it would give a better performance per watt by avoiding
    the administrative cost of managing those hundreds of reordered
    in-flight instructions, without losing too much peak performance.

    The fact that performance per watt of in-order ARM cores is not really
    lower than that of OOO cores suggests that the Mill wouldn't deliver on
    this "promise" either.
    Still, I really would like to see how it plays out in practice, instead
    of having to guess.


    Stefan

  • From MitchAlsup@21:1/5 to Anton Ertl on Mon Nov 13 19:11:51 2023
    Anton Ertl wrote:

    "Chris M. Thomasson" <[email protected]> writes:
    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would ruin
    performance right off the bat... Think about it.

    Proof by claim?

    I think about several similar instances, where people went for
    simple-minded hardware designs and threw the complexity over the wall
    to the software people, and claimed that it was for performance; I
    call that the "supercomputing attitude", and it may work in areas
    where the software crisis has not yet struck[1], but is a bad attitude
    in areas like general-purpose computing where it has struck.

    1) People thought that they could achieve faster hardware by throwing
    the task of scheduling instructions for maximum instruction-level
    parallelism over to the compiler people. Several companies (in
    particular, Intel, HP, and Transmeta) invested a lot of money into
    this dream (and the Mill project relives this dream), but it turned
    out that doing the scheduling in hardware is faster.
    <
    Not faster, but easier to do with acceptable HW costs. The pipeline
    is 1-3 stages longer, but HW has dynamic information that SW cannot have.
    <
    2) A little earlier, the Alpha designers thought that they could gain
    speed by not implementing denormal numbers and by implementing
    imprecise exceptions for FP operations, so that it is not possible to
    implement denormal numbers through a software fixup, either. For
    <
    So did I in Mc 88100--just as wrong then as it is now.
    <
    dealing properly with denormal numbers, you had to insert a trap
    barrier right after each FP instruction, and presumably this cost a
    lot of performance on early Alpha implementations. However, when I
    measured it on the 21264 (released six years after the first Alpha),
    the cost was like that of a nop; I guess that the trap barrier was
    actually a nop on the 21264, because, as an OoO processor, the 21264
    performs precise exceptions without breaking a sweat. And the 21264
    is faster than the models where the trap barrier actually does
    something. Meanwhile, Mitch Alsup also has posted that he can
    implement fast denormal numbers with IIRC 30 extra gates (which is
    probably less than what is needed for implementing the trap barrier).
    <
    I recall saying it is about 2% of the gate count of an FMAC unit.
    <
    3) The Alpha is a rich source of examples of the supercomputer
    attitude: It started out without instructions for accessing 8-bit and
    16-bit data in memory. Instead, the idea was that for accessing
    memory, you would use instruction sequences, and for accessing I/O
    devices, the device was mapped three times or so: In one address range
    you performed bytewise access, in another address range 16-bit
    accesses, and in the third address range 32-bit and 64-bit accesses;
    I/O driver writers had to write or modify their drivers for this
    model. The rationale for that was that they required ECC for
    permanent storage and that would supposedly require slow RMW accesses
    for writing bytes to write-back caches. Now the 21064 and 21164 had a
    write-through D-cache. That made it easy to add byte and word
    accesses (BWX) in the 21164A (released 1996), but they could have done
    it from the start. The 21164A is in no way slower than the 21164; it
    has the same IPC and a higher clock rate.

    Some people welcome and celebrate the challenges that the
    supercomputer attitude poses for software, and justify it with
    "performance", but as the examples above show, such claims often turn
    out to be false when you actually invest effort into more capable
    hardware.

    Given that multi-processors come out of supercomputing, it's no
    surprise that the supercomputing attitude is particularly strong
    there.

    But if you look at it from an architecture (i.e., hardware/software
    interface) perspective, weak consistency is just bad architecture:
    good architecture says what happens to the architectural state when
    software performs some instruction. From that perspective sequential
    consistency is architecturally best. Weaker consistency models
    describe how the architecture does not provide the sequential
    consistency guarantees that are so easy to describe; the weaker the
    model, the more deviations it has to describe.
    <
    The problem that the weak consistency models enabled comes from the
    fact that the model was universal over all accesses. However, the TLB
    can be used to solve that problem, so that each access has its own
    model and the HW has to perform with that model, often across a
    multiplicity of memory references. For my part, I have 4 memory models,
    and the CPUs switch to the appropriate model upon detection without
    needing instructions. So when the first instruction of an ATOMIC event
    is detected (at decode), all weaker outstanding requests are allowed to
    complete, and all of the ATOMIC requests are performed in a
    sequentially consistent manner; afterwards the memory model is
    weakened again.
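
    The switching behaviour described in this reply can be illustrated
    with a toy Python model. It rests on assumptions that go beyond the
    post: every request is tagged either "weak" or "atomic", weak requests
    may complete in any order among themselves, and the function
    completion_order is a hypothetical name. It is a sketch of the ordering
    rule only, not of the actual My 66000 mechanism.

```python
import random


def completion_order(requests, rng):
    """Return one completion order permitted by the assumed rules.

    requests: list of (name, kind) in program order; kind is "weak" or
    "atomic". rng supplies the arbitrary reordering of weak requests.
    """
    completed = []
    outstanding = []  # weak requests that may still complete in any order
    for name, kind in requests:
        if kind == "atomic":
            # Start (or continuation) of an ATOMIC event: let the weaker
            # outstanding requests complete first, then perform the atomic
            # request in program order.
            rng.shuffle(outstanding)
            completed.extend(outstanding)
            outstanding.clear()
            completed.append(name)
        else:
            # Back in the weak model: the request may be reordered freely.
            outstanding.append(name)
    rng.shuffle(outstanding)
    completed.extend(outstanding)
    return completed


requests = [("w1", "weak"), ("w2", "weak"), ("a1", "atomic"),
            ("a2", "atomic"), ("w3", "weak"), ("w4", "weak")]
order = completion_order(requests, random.Random(0))
```

    Whatever the shuffle does, w1 and w2 complete before a1, a1 completes
    before a2, and w3 and w4 complete after the ATOMIC event.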
    <
    [1] The software crisis is that software costs are higher than
    hardware costs, and supercomputing with its gigantic hardware costs
    notices the software crisis much later than general-purpose computing.

    - anton

  • From Paul A. Clayton@21:1/5 to Anton Ertl on Mon Nov 20 10:50:34 2023
    On 11/13/23 2:48 AM, Anton Ertl wrote:
    [snip]
    I think about several similar instances, where people went for
    simple-minded hardware designs and threw the complexity over the wall
    to the software people, and claimed that it was for performance; I
    call that the "supercomputing attitude", and it may work in areas
    where the software crisis has not yet struck[1], but is a bad attitude
    in areas like general-purpose computing where it has struck.

    This is not just a hardware-software wall problem, though that
    wall and its abuse is usually well-established. As someone with a
    micro-optimization orientation, I know I need more external
    awareness, but as a non-practicing entity what I think or present
    has little effect/danger. Even in my case, there is some danger of
    spreading a falsehood (or dangerously incomplete truths), so
    external correction is valuable (and I value it myself as I
    dislike being wrong; being corrected early hurts, but hurts less
    than being corrected after the inaccuracy has been well-
    established in my own and others' minds).

    System-aware optimization also interacts with interface layering.
    Isolating concerns reduces design complexity and from a given
    complexity allows exploiting "don't care" aspects. The "don't
    care" aspects can be painful when the interface user does care;
    sometimes these can force a violation of the abstraction,
    introducing a dependency on a specific implementation (which can
    then introduce an informal interface [performance compatibility
    is a common informal interface]).

    1) People thought that they could achieve faster hardware by throwing
    the task of scheduling instructions for maximum instruction-level
    parallelism over to the compiler people. Several companies (in
    particular, Intel, HP, and Transmeta) invested a lot of money into
    this dream (and the Mill project relives this dream), but it turned
    out that doing the scheduling in hardware is faster.

    Yet there does not seem to be a strong push to develop a
    dataflow-oriented interface/ISA (one that does not require genius
    programmers or super-genius compilers). I am not certain what such
    an interface would look like, but I suspect something closer to a
    transport-triggered architecture (TTA) would be an early step. A
    TTA-like architecture would compactly encode single use values and
    provide some routing information while supporting (possible)
    multiple use and some sense of use deferment (loads and stores).

    Value prediction (including branch/predicate prediction) also
    seems to be required to be included in design considerations.

    Such an ISA would also probably blur the boundaries between
    threads and naturally support speculative multithreading, which is
    in some sense a distant/variable deferment communication/dataflow.

    [snip]
    Meanwhile, Mitch Alsup also has posted that he can
    implement fast denormal numbers with IIRC 30 extra gates (which is
    probably less than what is needed for implementing the trap barrier).

    I think that cost estimate assumes the inclusion of (single
    rounding) FMADD. Single-rounding FMADD was not common for RISCs
    when the Alpha designers made their choice.

    I am **certainly not** a numerical analyst, but I had the
    impression that flush-to-zero was not horrible to analyze for
    correctness and (for double precision) not commonly a problem. Yet
    I also think that having multiply round based on an "integer"
    power-of-two high result (without carry-in) — where the hardware
    could also be used for integer multiply by reciprocal — might have
    been "better", so my opinion should probably be taken with a mine
    of salt.

    I would not be surprised if special-purpose low-power DSPs not
    only use non-IEEE formats but also use inexact rounding. Even using
    inexact computation might be justified for extreme cases.

    3) The Alpha is a rich source of examples of the supercomputer
    attitude: It started out without instructions for accessing 8-bit and
    16-bit data in memory. Instead, the idea was that for accessing
    memory, you would use instruction sequences, and for accessing I/O
    devices, the device was mapped three times or so: In one address range
    you performed bytewise access, in another address range 16-bit
    accesses, and in the third address range 32-bit and 64-bit accesses;
    I/O driver writers had to write or modify their drivers for this
    model. The rationale for that was that they required ECC for
    permanent storage and that would supposedly require slow RMW accesses
    for writing bytes to write-back caches. Now the 21064 and 21164 had a
    write-through D-cache. That made it easy to add byte and word
    accesses (BWX) in the 21164A (released 1996), but they could have done
    it from the start. The 21164A is in no way slower than the 21164; it
    has the same IPC and a higher clock rate.

    Yet Intel has been using byte parity for L1 Dcaches, so that
    design choice was perhaps not *entirely* irrational. (I disagree
    with that choice, having hindsight, but I can appreciate the
    reasoning.) Parity-only L1 Dcaches are not that bad since the
    SRAM design will likely be more robust to allow faster access (I
    think) and dirty values will tend to be either evicted quickly or
    checked often.

    If smaller writes are rare, hardware RMW in a writeback cache
    would not have been that expensive, but the cost would have no
    value if smaller writes are never necessary.

    (I do wonder if there is an interface that would allow software to
    reduce hardware RMW costs — often a value is read before being
    modified — without introducing more complexity than benefit.
    Exploiting the standard double-wide read used for unaligned
    accesses to access a double-wide aligned memory seems similarly
    desirable. While idiom-detection would allow this to be done in
    hardware without changing the interface, idiom detection is more
    complex than direct encoding and typically relies on software to
    reduce that complexity — e.g., only detecting short contiguous
    idioms.)

    The different memory regions trick is also used for bit-granular
    accesses in some ISAs (e.g., ARM) mainly for I/O device accesses.
    Even without side-effects for accesses, non-atomicity might be a
    concern. (Of course, one could architect that all simple load-op-
    store sequences on that type of memory are atomic, using three
    instruction idiom detection.)
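
    For the ARM case, the separate-region trick is presumably the
    bit-banding feature of Cortex-M3/M4 parts (an assumption, since the
    post does not name it). There, each bit of a 1 MiB bit-band region is
    mapped to its own 32-bit word in an alias region. The Python sketch
    below computes the alias address for the SRAM region; the function
    name bitband_alias is illustrative.

```python
SRAM_BITBAND_BASE = 0x20000000  # start of the 1 MiB SRAM bit-band region
SRAM_ALIAS_BASE = 0x22000000    # start of the 32 MiB alias region


def bitband_alias(byte_address: int, bit: int) -> int:
    """Return the alias word address that maps to one bit of a bit-band byte."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    byte_offset = byte_address - SRAM_BITBAND_BASE
    if not 0 <= byte_offset < 0x100000:
        raise ValueError("address is outside the SRAM bit-band region")
    # Every byte owns 8 alias words (32 bytes); every bit owns one word.
    return SRAM_ALIAS_BASE + byte_offset * 32 + bit * 4


alias = bitband_alias(0x20000000, 7)
```

    A write to the alias word is performed by the hardware as a
    read-modify-write of the underlying bit, which is where the atomicity
    question raised in the post comes in.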

    Some people welcome and celebrate the challenges that the
    supercomputer attitude poses for software, and justify it with
    "performance", but as the examples above show, such claims often turn
    out to be false when you actually invest effort into more capable
    hardware.

    The tricky part seems to be in discerning when (and where) extra
    effort is justified. This also depends on how easily the
    difficulty can be encapsulated. Can a compiler reliably "do the
    right thing" (without having to have been written by a supergenius
    AI)? Can a library reliably provide the necessary extra
    functionality — splitting the difficulty between application
    programmer discipline and difficulty of developing the system
    software — without requiring genius system programmers and highly
    competent application programmers?

    Someone who writes lock-free methods for fun is probably not well-
    positioned to estimate the difficulty/lack-of-fun of such for most
    programmers. Communication between different interest groups seems
    critical, but communication also requires data and not just
    anecdotes or traditional wisdom. (Anecdotes and traditional wisdom
    do have value!)

    [snip]
    But if you look at it from an architecture (i.e., hardware/software
    interface) perspective, weak consistency is just bad architecture:
    good architecture says what happens to the architectural state when
    software performs some instruction. From that perspective sequential
    consistency is architecturally best. Weaker consistency models
    describe how the architecture does not provide the sequential
    consistency guarantees that are so easy to describe; the weaker the
    model, the more deviations it has to describe.

    I am not convinced that sequential consistency is the best
    interface. My66000 does not provide sequential consistency for
    ordinary memory. While Mitch Alsup would have difficulty
    empathizing with most programmers, he has enough experience to
    write specifications for "hostile" engineers so he probably
    understands the tradeoffs on both sides of the interface fairly
    well.

    When an effort is considered hard, like parallel programming, there
    seems to be a spectrum of viewpoints, from the UNIX/"real
    programmers" perspective of limiting the effort to experts to the
    perspective of simplifying the system interface so that almost anyone
    can do almost anything.
    The extreme positions have obvious cultural issues (where
    expertise is either required for worth or expertise is despised as
    arrogance) as well as mechanical issues (expertise is naturally
    limited by finite knowledge — where vast knowledge implies
    communication overhead even within a single supercomputer
    complex).

    [1] The software crisis is that software costs are higher than
    hardware costs, and supercomputing with its gigantic hardware costs
    notices the software crisis much later than general-purpose computing.

    This is one strong reason for complexity to be shifted toward
    hardware, but I think that there is a danger of "toward" becoming
    "into".

    I do not know nearly enough about memory ordering considerations
    in hardware and software to have more than an opinion based on
    which experts I believe (and a tiny amount of rational/data
    consistency inference on my part). From what I have read, TSO
    seems to be the best tradeoff of hardware overhead and software
    difficulty, but I suspect the best set of fine-grained guarantees
    may be somewhat different than provided by simple TSO. I also
    think there may be noticeable advantages for allowing use that is
    outside of recipes/the formal interface.

    Documenting an interface often brings an assumption of continuity,
    so (besides the cost of writing documentation) there is a
    disincentive to expose internals that leak through the abstraction
    layer.

    (That was a very wordy response.)

  • From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Nov 20 18:51:49 2023
    Paul A. Clayton wrote:



    [snip]
    But if you look at it from an architecture (i.e., hardware/software
    interface) perspective, weak consistency is just bad architecture:
    good architecture says what happens to the architectural state when
    software performs some instruction. From that perspective sequential
    consistency is architecturally best. Weaker consistency models
    describe how the architecture does not provide the sequential
    consistency guarantees that are so easy to describe; the weaker the
    model, the more deviations it has to describe.

    I am not convinced that sequential consistency is the best
    interface. My66000 does not provide sequential consistency for
    ordinary memory. While Mitch Alsup would have difficulty
    empathizing with most programmers, he has enough experience to
    write specifications for "hostile" engineers so he probably
    understands the tradeoffs on both sides of the interface fairly
    well.

    All accesses being universally sequentially consistent is way too
    much ordering, however, the ability to detect the start-end of
    ATOMIC events and switching to SC gives the programmer all the
    order he needs without constraining the non-concurrent memory
    at all.

    Over at config-space control registers--these need more than TSO or SC,
    these need strong ordering.

    On the other hand true ROM needs no ordering whatsoever--so why
    impose any ??

    One size does not fit all !!



  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Mon Nov 20 12:29:01 2023
    On 11/20/2023 10:51 AM, MitchAlsup wrote:
    Paul A. Clayton wrote:



    [snip]
    But if you look at it from an architecture (i.e., hardware/software
    interface) perspective, weak consistency is just bad architecture:
    good architecture says what happens to the architectural state when
    software performs some instruction.  From that perspective sequential
    consistency is architecturally best.  Weaker consistency models
    describe how the architecture does not provide the sequential
    consistency guarantees that are so easy to describe; the weaker the
    model, the more deviations it has to describe.

    I am not convinced that sequential consistency is the best
    interface. My66000 does not provide sequential consistency for
    ordinary memory. While Mitch Alsup would have difficulty
    empathizing with most programmers, he has enough experience to
    write specifications for "hostile" engineers so he probably
    understands the tradeoffs on both sides of the interface fairly
    well.

    All accesses being universally sequentially consistent is way too
    much ordering, however, the ability to detect the start-end of
    ATOMIC events and switching to SC gives the programmer all the
    order he needs without constraining the non-concurrent memory
    at all.

    Over at config-space control registers--these need more than TSO or SC,
    these need strong ordering.

    On the other hand true ROM needs no ordering whatsoever--so why
    impose any ??

    One size does not fit all !!



    Fwiw, I remember posting an idea of so-called tagged memory barriers on
    this group some years ago. I need to try to dig it up.
