Forum: >>> Magnum BBS <<<

weak consistency and the supercomputer attitude (was: Memory dependency

From Anton Ertl@21:1/5 to Chris M. Thomasson on Mon Nov 13 07:48:35 2023

"Chris M. Thomasson" <[email protected]> writes:

Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.

Proof by claim?

I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.

1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.

2) A little earlier, the Alpha designers thought that they could gain
speed by not implementing denormal numbers and by implementing
imprecise exceptions for FP operations, so that it is not possible to
implement denormal numbers through a software fixup, either. For
dealing properly with denormal numbers, you had to insert a trap
barrier right after each FP instruction, and presumably this cost a
lot of performance on early Alpha implementations. However, when I
measured it on the 21264 (released six years after the first Alpha),
the cost was like that of a nop; I guess that the trap barrier was
actually a nop on the 21264, because, as an OoO processor, the 21264
performs precise exceptions without breaking a sweat. And the 21264
is faster than the models where the trap barrier actually does
something. Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).

3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.
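
To make the "instruction sequences" concrete: without byte stores, writing one byte means loading the enclosing 64-bit quadword, merging the byte in a register, and storing the whole quadword back. Below is a minimal sketch in Python, assuming little-endian byte lanes and modelling memory as a list of 64-bit integers. The names store_byte_rmw and load_byte are made up for this illustration; this is not Alpha code, only the shape of the load/mask/insert/store sequence.

```python
MASK64 = (1 << 64) - 1

def store_byte_rmw(mem, addr, value):
    """Store one byte via read-modify-write of the enclosing 64-bit quadword.

    mem is a list of 64-bit integers (an assumed toy memory model),
    addr is a byte address, and byte lanes are little-endian.
    """
    index, lane = divmod(addr, 8)
    shift = 8 * lane
    quad = mem[index]                      # load the whole quadword
    quad &= ~(0xFF << shift) & MASK64      # mask out the target byte lane
    quad |= (value & 0xFF) << shift        # insert the new byte
    mem[index] = quad                      # store the whole quadword back

def load_byte(mem, addr):
    """Extract one byte from the enclosing quadword."""
    index, lane = divmod(addr, 8)
    return (mem[index] >> (8 * lane)) & 0xFF

mem = [0x1122334455667788, 0]
store_byte_rmw(mem, 2, 0xAB)
print(hex(mem[0]))
```

The load and the store are separate steps, so in this model a write to a neighbouring byte that lands between them would be overwritten; that is the kind of burden the sequence approach pushes onto software.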

Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.

Given that multi-processors come out of supercomputing, it's no
surprise that the supercomputing attitude is particularly strong
there.

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
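
How short the sequential-consistency description is can be seen on the classic store-buffering litmus test. The sketch below is a toy model in Python, with an instruction encoding and names (THREAD0, THREAD1, sc_outcomes) invented for this illustration. It enumerates every interleaving of two threads that respects program order and collects the register results.

```python
from itertools import combinations

# Store-buffering litmus test; x and y start at 0.
#   thread 0: x = 1; r0 = y        thread 1: y = 1; r1 = x
THREAD0 = [("store", "x", 1), ("load", "y", "r0")]
THREAD1 = [("store", "y", 1), ("load", "x", "r1")]

def sc_outcomes(t0, t1):
    """Return every (r0, r1) reachable by interleaving t0 and t1 in program order."""
    total = len(t0) + len(t1)
    outcomes = set()
    # Choosing which global slots belong to thread 0 fixes one interleaving.
    for slots in combinations(range(total), len(t0)):
        mem = {"x": 0, "y": 0}
        regs = {}
        streams = {True: iter(t0), False: iter(t1)}
        for pos in range(total):
            kind, loc, arg = next(streams[pos in slots])
            if kind == "store":
                mem[loc] = arg          # a store is visible to everyone at once
            else:
                regs[arg] = mem[loc]    # a load sees the latest store in the order
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes

print(sorted(sc_outcomes(THREAD0, THREAD1)))
```

The result (0, 0) never appears: under sequential consistency at least one thread must see the other's store. A weaker model has to spell out that, and when, this outcome becomes possible.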

[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Mon Nov 13 10:36:52 2023

1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.

IIRC the main argument for the Mill wasn't that it was going to be
faster but that it would give a better performance per watt by avoiding
the administrative cost of managing those hundreds of reordered
in-flight instructions, without losing too much peak performance.

The fact that performance per watt of in-order ARM cores is not really
lower than that of OOO cores suggests that the Mill wouldn't deliver on
this "promise" either.
Still, I really would like to see how it plays out in practice, instead
of having to guess.

Stefan

From MitchAlsup@21:1/5 to Anton Ertl on Mon Nov 13 19:11:51 2023

Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.

Proof by claim?

I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.

1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.

<
Not faster, but easier to do with acceptable HW costs. The pipeline
is 1-3 stages longer, but HW has dynamic information that SW cannot have.
<

2) A little earlier, the Alpha designers thought that they could gain
speed by not implementing denormal numbers and by implementing
imprecise exceptions for FP operations, so that it is not possible to
implement denormal numbers through a software fixup, either. For

<
So did I in Mc 88100--just as wrong then as it is now.
<

dealing properly with denormal numbers, you had to insert a trap
barrier right after each FP instruction, and presumably this cost a
lot of performance on early Alpha implementations. However, when I
measured it on the 21264 (released six years after the first Alpha),
the cost was like that of a nop; I guess that the trap barrier was
actually a nop on the 21264, because, as an OoO processor, the 21264
performs precise exceptions without breaking a sweat. And the 21264
is faster than the models where the trap barrier actually does
something. Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).

<
I recall saying it is about 2% of the gate count of an FMAC unit.
<

3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.

Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.

Given that multi-processors come out of supercomputing, it's no
surprise that the supercomputing attitude is particularly strong
there.

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.

<
The problem that the weak consistency models enabled comes from the
fact it was universal over all accesses. However the TLB can be used
to solve that problem so each access has its own model and the HW has
to perform with that model often across a multiplicity of memory
references. For my part I have 4 memory models and the CPUs switch to
the appropriate model upon detection without needing instructions. So
when the first instruction of an ATOMIC event is detected (decode),
all weaker outstanding requests are allowed to complete, and all of
the ATOMIC requests are performed in a sequentially consistent manner,
then afterwards the memory model is weakened again.
<

[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.

- anton

From Paul A. Clayton@21:1/5 to Anton Ertl on Mon Nov 20 10:50:34 2023

On 11/13/23 2:48 AM, Anton Ertl wrote:
[snip]

I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.

This is not just a hardware-software wall problem, though that
wall and its abuse are usually well-established. As someone with a
micro-optimization orientation, I know I need more external
awareness, but as a non-practicing entity what I think or present
has little effect/danger. Even in my case, there is some danger of
spreading a falsehood (or dangerously incomplete truths), so
external correction is valuable (and I value it myself as I
dislike being wrong; being corrected early hurts, but hurts less
than being corrected after the inaccuracy has been well-
established in my own and others' minds).

System-aware optimization also interacts with interface layering.
Isolating concerns reduces design complexity and for a given
complexity allows exploiting "don't care" aspects. The "don't
care" aspects can be painful when the interface user does care;
sometimes these can force a violation of the abstraction,
introducing a dependency on a specific implementation (which can
then introduce an informal interface [performance compatibility
is a common informal interface]).

1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.

Yet there does not seem to be a strong push to develop a dataflow-
oriented interface/ISA (that does not require genius
programmers or super-genius compilers). I am not certain what such
an interface would look like, but I suspect something closer to a
transport-triggered architecture (TTA) would be an early step. A
TTA-like architecture would compactly encode single use values and
provide some routing information while supporting (possible)
multiple use and some sense of use deferment (loads and stores).

Value prediction (including branch/predicate prediction) also
seems to be required to be included in design considerations.

Such an ISA would also probably blur the boundaries between
threads and naturally support speculative multithreading, which is
in some sense a distant/variable deferment communication/dataflow.

[snip]

Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).

I think that cost estimate assumes the inclusion of (single
rounding) FMADD. Single-rounding FMADD was not common for RISCs
when the Alpha designers made their choice.

I am **certainly not** a numerical analyst, but I had the
impression that flush-to-zero was not horrible for analyzing for
correctness and (for double precision) not commonly a problem. Yet
I also think that having multiply round based on an "integer"
power-of-two high result (without carry-in) — where the hardware
could also be used for integer multiply by reciprocal — might have
been "better", so my opinion should probably be taken with a mine
of salt.
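
For concreteness, here is the property that flush-to-zero gives up, as a small Python sketch. It assumes the host Python uses IEEE double precision with gradual underflow (the usual case), and flush_to_zero is a hypothetical software model of the hardware behaviour, not a real mode switch.

```python
import sys

MIN_NORMAL = sys.float_info.min      # smallest positive normal double, 2**-1022

def flush_to_zero(x):
    """Model flush-to-zero: any result below the normal range becomes 0.0."""
    return 0.0 if abs(x) < MIN_NORMAL else x

a = 1.5 * MIN_NORMAL
b = 1.0 * MIN_NORMAL

gradual = a - b                      # exactly 2**-1023, a denormal
flushed = flush_to_zero(a - b)       # what a flush-to-zero unit would deliver

print(a != b, gradual != 0.0, flushed != 0.0)
```

With denormals, a != b guarantees a - b != 0; with flush-to-zero it does not, so a guard like "if a != b: q = 1 / (a - b)" can still divide by zero. For double precision this only bites within a factor of about 2**52 of the smallest normal number, which fits the impression that it is not commonly a problem.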

I would not be surprised if special-purpose low-power DSPs not
only use non-IEEE formats but also use inexact rounding. Even using
inexact computation might be justified for extreme cases.

3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.

Yet Intel has been using byte parity for L1 Dcaches, so that
design choice was perhaps not *entirely* irrational. (I disagree
with that choice, having hindsight, but I can appreciate the
reasoning.) Parity-only L1 Dcaches are not that bad since the
SRAM design will likely be more robust to allow faster access (I
think) and dirty values will tend to be either evicted quickly or
checked often.

If smaller writes are rare, hardware RMW in a writeback cache
would not have been that expensive, but the cost would have no
value if smaller writes are never necessary.

(I do wonder if there is an interface that would allow software to
reduce hardware RMW costs — often a value is read before being
modified — without introducing more complexity than benefit.
Exploiting the standard double-wide read used for unaligned
accesses to access a double-wide aligned memory seems similarly
desirable. While idiom-detection would allow this to be done in
hardware without changing the interface, idiom detection is more
complex than direct encoding and typically relies on software to
reduce that complexity — e.g., only detecting short contiguous
idioms.)

The different memory regions trick is also used for bit-granular
accesses in some ISAs (e.g., ARM) mainly for I/O device accesses.
Even without side-effects for accesses, non-atomicity might be a
concern. (Of course, one could architect that all simple load-op-
store sequences on that type of memory are atomic, using three
instruction idiom detection.)
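
Assuming the ARM case meant here is Cortex-M3/M4 style bit-banding, the region trick is a fixed address mapping: each bit of a 1 MiB bit-band region gets its own word in a 32 MiB alias region. A small Python sketch of that mapping follows; the function name is made up, and the default bases are the SRAM bit-band and alias bases of those cores.

```python
def bitband_alias(byte_addr, bit, region_base=0x20000000, alias_base=0x22000000):
    """Return the alias word address for one bit of a bit-band byte.

    Each byte of the region expands to 32 bytes of alias space
    (8 bits, one 4-byte word per bit).
    """
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    byte_offset = byte_addr - region_base
    if not 0 <= byte_offset < 0x100000:          # 1 MiB bit-band region
        raise ValueError("address outside the bit-band region")
    return alias_base + byte_offset * 32 + bit * 4

# Bit 7 of the last byte in the SRAM bit-band region.
print(hex(bitband_alias(0x200FFFFF, 7)))
```

A store to the alias word sets or clears just that bit, with the read-modify-write performed by the memory system rather than by a load-op-store sequence in software.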

Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.

The tricky part seems to be in discerning when (and where) extra
effort is justified. This also depends on how easily the
difficulty can be encapsulated. Can a compiler reliably "do the
right thing" (without having to have been written by a supergenius
AI)? Can a library reliably provide the necessary extra
functionality — splitting the difficulty between application
programmer discipline and difficulty of developing the system
software — without requiring genius system programmers and highly
competent application programmers?

Someone who writes lock-free methods for fun is probably not well-
positioned to estimate the difficulty/lack-of-fun of such for most
programmers. Communication between different interest groups seems
critical, but communication also requires data and not just
anecdotes or traditional wisdom. (Anecdotes and traditional wisdom
do have value!)

[snip]

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.

I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.

When an effort is considered hard, like parallel programming, there
seems to be a spectrum of viewpoints, from the UNIX/"real
programmers" perspective of limiting the effort to experts to the
perspective of simplifying the system interface so that almost
anyone can do almost anything.
The extreme positions have obvious cultural issues (where
expertise is either required for worth or expertise is despised as
arrogance) as well as mechanical issues (expertise is naturally
limited by finite knowledge — where vast knowledge implies
communication overhead even within a single supercomputer
complex).

[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.

This is one strong reason for complexity to be shifted toward
hardware, but I think that there is a danger of "toward" becoming
"into".

I do not know nearly enough about memory ordering considerations
in hardware and software to have more than an opinion based on
which experts I believe (and a tiny amount of rational/data
consistency inference on my part). From what I have read, TSO
seems to be the best tradeoff of hardware overhead and software
difficulty, but I suspect the best set of fine-grained guarantees
may be somewhat different from what simple TSO provides. I also
think there may be noticeable advantages for allowing use that is
outside of recipes/the formal interface.
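
As an illustration of what TSO relaxes and what a fence restores, below is a toy operational model in Python: each thread has a FIFO store buffer, loads snoop their own buffer first, and a fence waits until the buffer has drained. It is a sketch under those assumptions, not a description of any particular hardware; tso_outcomes, SB and SB_FENCED are names invented for this example.

```python
def tso_outcomes(programs):
    """Return the set of final register tuples reachable under a toy TSO model.

    Each thread owns a FIFO store buffer. A store enters the buffer, a load
    reads the newest matching entry in its own buffer or else memory, and a
    fence may execute only once the thread's buffer has drained.
    """
    results, seen = set(), set()

    def replace(tup, i, value):
        return tup[:i] + (value,) + tup[i + 1:]

    def updated(pairs, key, value):
        d = dict(pairs)
        d[key] = value
        return tuple(sorted(d.items()))

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(prog) for pc, prog in zip(pcs, programs)):
            results.add(tuple(value for _, value in regs))
            return
        for t, prog in enumerate(programs):
            if bufs[t]:  # drain the oldest buffered store to memory
                loc, val = bufs[t][0]
                explore(pcs, replace(bufs, t, bufs[t][1:]),
                        updated(mem, loc, val), regs)
            if pcs[t] == len(prog):
                continue
            kind, loc, arg = prog[pcs[t]]
            next_pcs = replace(pcs, t, pcs[t] + 1)
            if kind == "store":
                explore(next_pcs, replace(bufs, t, bufs[t] + ((loc, arg),)),
                        mem, regs)
            elif kind == "load":
                val = dict(mem)[loc]
                for buffered_loc, buffered_val in bufs[t]:
                    if buffered_loc == loc:
                        val = buffered_val  # later entries are newer
                explore(next_pcs, bufs, mem, updated(regs, arg, val))
            elif kind == "fence" and not bufs[t]:
                explore(next_pcs, bufs, mem, regs)

    locations = sorted({loc for prog in programs for _, loc, _ in prog if loc})
    explore((0,) * len(programs), ((),) * len(programs),
            tuple((loc, 0) for loc in locations), ())
    return results

# Store-buffering test: each thread stores to one location, then loads the other.
SB = [[("store", "x", 1), ("load", "y", "r0")],
      [("store", "y", 1), ("load", "x", "r1")]]
SB_FENCED = [[("store", "x", 1), ("fence", None, None), ("load", "y", "r0")],
             [("store", "y", 1), ("fence", None, None), ("load", "x", "r1")]]

print(sorted(tso_outcomes(SB)))
print(sorted(tso_outcomes(SB_FENCED)))
```

In this model the unfenced test can end with both registers 0, because both stores may still sit in their buffers when the loads execute; with the fences that outcome disappears. Store-to-load reordering is the one relaxation this toy model has, which is part of why TSO is comparatively easy to reason about.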

Documenting an interface often brings an assumption of continuity,
so (besides the cost of writing documentation) there is a
disincentive to expose internals that leak through the abstraction
layer.

(That was a very wordy response.)

From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Nov 20 18:51:49 2023

Paul A. Clayton wrote:

[snip]

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.

I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.

All accesses being universally sequentially consistent is way too
much ordering, however, the ability to detect the start-end of
ATOMIC events and switching to SC gives the programmer all the
order he needs without constraining the non-concurrent memory
at all.

Over at config-space control registers--these need more than TSO or SC,
these need strong ordering.

On the other hand true ROM needs no ordering whatsoever--so why
impose any ??

One size does not fit all !!

From Chris M. Thomasson@21:1/5 to MitchAlsup on Mon Nov 20 12:29:01 2023

On 11/20/2023 10:51 AM, MitchAlsup wrote:

Paul A. Clayton wrote:

[snip]

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.

I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.

All accesses being universally sequentially consistent is way too
much ordering, however, the ability to detect the start-end of
ATOMIC events and switching to SC gives the programmer all the
order he needs without constraining the non-concurrent memory
at all.

Over at config-space control registers--these need more than TSO or SC,
these need strong ordering.

On the other hand true ROM needs no ordering whatsoever--so why
impose any ??

One size does not fit all !!

Fwiw, I remember posting an idea of so-called tagged memory barriers on
this group some years ago. I need to try to dig it up.
