On 11/13/23 2:48 AM, Anton Ertl wrote:
[snip]
I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.
This is not just a hardware-software wall problem, though that
wall and its abuse are usually well established.
As someone with a micro-optimization orientation, I know I need
more external awareness, but as a non-practicing entity what I
think or present has little effect/danger. Even in my case, there
is some danger of spreading falsehoods (or dangerously incomplete
truths), so external correction is valuable (and I value it myself
as I dislike being wrong; being corrected early hurts, but hurts
less than being corrected after the inaccuracy has been well
established in my own and others' minds).
System-aware optimization also interacts with interface layering.
Isolating concerns reduces design complexity and, for a given
complexity, allows exploiting "don't care" aspects. The "don't
care" aspects can be painful when the interface user does care;
sometimes these can force a violation of the abstraction,
introducing a dependency on a specific implementation (which can
then introduce an informal interface [performance compatibility
is a common informal interface]).
1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.
Yet there does not seem to be a strong push to develop a
dataflow-oriented interface/ISA (that is, one that does not
require genius programmers or super-genius compilers). I am not
certain what such an interface would look like, but I suspect
something closer to a transport-triggered architecture (TTA)
would be an early step. A TTA-like architecture would compactly
encode single-use values and provide some routing information
while supporting (possibly) multiple use and some sense of use
deferment (loads and stores).
Value prediction (including branch/predicate prediction) also
seems to need to be part of the design considerations.
Such an ISA would also probably blur the boundaries between
threads and naturally support speculative multithreading, which is
in some sense a distant/variable deferment communication/dataflow.
[snip]
Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).
I think that cost estimate assumes the inclusion of (single
rounding) FMADD. Single-rounding FMADD was not common for RISCs
when the Alpha designers made their choice.
I am *certainly not* a numerical analyst, but I had the
impression that flush-to-zero was not horrible to analyze for
correctness and (for double precision) not commonly a problem. Yet
I also think that having multiply round based on an "integer"
power-of-two high result (without carry-in) — where the hardware
could also be used for integer multiply by reciprocal — might have
been "better", so my opinion should probably be taken with a mine
of salt.
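One concrete property that flush-to-zero gives up, as a minimal
sketch: with gradual underflow, x != y implies x - y != 0, while
with flush-to-zero the difference of two distinct small normal
numbers can become zero. The helper flush_to_zero() below is a
hypothetical software model of an FTZ unit (not any real
hardware's or library's interface), and the sketch assumes the
Python host itself runs with IEEE gradual underflow enabled.

```python
import math
import sys

# Smallest positive normal double (2**-1022).
MIN_NORMAL = sys.float_info.min


def flush_to_zero(x):
    """Hypothetical FTZ model: replace any subnormal by a signed zero."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def sub_gradual(a, b):
    """Subtraction as the host does it (gradual underflow assumed)."""
    return a - b


def sub_ftz(a, b):
    """Subtraction with the result flushed to zero if it is subnormal."""
    return flush_to_zero(a - b)


# Two distinct normal numbers whose exact difference is subnormal.
a = 1.5 * MIN_NORMAL
b = MIN_NORMAL

print(a != b)                    # the operands differ
print(sub_gradual(a, b) != 0.0)  # gradual underflow keeps the difference
print(sub_ftz(a, b) == 0.0)      # FTZ loses it
```

Code that guards a division with "if (x != y)" and then divides
by (x - y) is the usual illustration of why this matters; how
often it bites in double precision is a separate question.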
I would not be surprised if special-purpose low-power DSPs not
only use non-IEEE formats but also use inexact rounding. Even
using inexact computation might be justified for extreme cases.
3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.
Yet Intel has been using byte parity for L1 Dcaches, so that
design choice was perhaps not *entirely* irrational. (I disagree
with that choice, having hindsight, but I can appreciate the
reasoning.) Parity-only L1 Dcaches are not that bad since the
SRAM design will likely be more robust to allow faster access (I
think) and dirty values will tend to be either evicted quickly or
checked often.
If smaller writes are rare, hardware RMW in a writeback cache
would not have been that expensive, but even that cost would buy
nothing if smaller writes were never necessary.
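For readers who have not seen it, this is roughly what a byte
store turns into when only aligned 64-bit accesses exist, whether
the sequence is emitted by the compiler or performed by hardware
in front of an ECC-protected cache. It is a sketch only: memory
is modeled as a Python list of 64-bit words, little-endian byte
numbering is assumed, and the function names are made up for
illustration.

```python
MASK64 = (1 << 64) - 1


def store_byte_rmw(mem, addr, value):
    """Store one byte using only aligned 64-bit loads and stores."""
    word_index = addr >> 3        # which aligned 64-bit word holds the byte
    shift = (addr & 7) * 8        # bit position of the byte (little-endian)
    word = mem[word_index]                   # read the whole word
    word &= ~(0xFF << shift) & MASK64        # clear the old byte
    word |= (value & 0xFF) << shift          # insert the new byte
    mem[word_index] = word                   # write the whole word back


def load_byte(mem, addr):
    """Extract one byte from the aligned 64-bit word that contains it."""
    return (mem[addr >> 3] >> ((addr & 7) * 8)) & 0xFF


mem = [0x1122334455667788, 0]
store_byte_rmw(mem, 1, 0xAB)
print(hex(mem[0]))
print(hex(load_byte(mem, 1)))
```

Note that the read and the write are separate steps, so in a
software sequence another writer updating a neighbouring byte in
between would have its update lost.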
(I do wonder if there is an interface that would allow software to
reduce hardware RMW costs — often a value is read before being
modified — without introducing more complexity than benefit.
Exploiting the standard double-wide read used for unaligned
accesses to access a double-wide aligned memory seems similarly
desirable. While idiom-detection would allow this to be done in
hardware without changing the interface, idiom detection is more
complex than direct encoding and typically relies on software to
reduce that complexity — e.g., only detecting short contiguous
idioms.)
The different memory regions trick is also used for bit-granular
accesses in some ISAs (e.g., ARM) mainly for I/O device accesses.
Even without side-effects for accesses, non-atomicity might be a
concern. (Of course, one could architect that all simple
load-op-store sequences on that type of memory are atomic, using
three-instruction idiom detection.)
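As an illustration of the separate-region trick for bits, here is
the address arithmetic of Cortex-M3/M4-style bit-banding, where
each bit of a 1 MiB region is aliased to its own 32-bit word in a
32 MiB alias region. The constants are the SRAM bit-band bases as
I remember them from the Cortex-M documentation; the function name
is made up for this sketch.

```python
SRAM_BITBAND_BASE = 0x2000_0000   # start of the 1 MiB bit-band region
SRAM_ALIAS_BASE = 0x2200_0000     # start of the 32 MiB alias region
BITBAND_REGION_SIZE = 0x10_0000   # 1 MiB


def bitband_alias(byte_addr, bit,
                  region_base=SRAM_BITBAND_BASE,
                  alias_base=SRAM_ALIAS_BASE):
    """Return the alias word address for one bit of one byte."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    byte_offset = byte_addr - region_base
    if not 0 <= byte_offset < BITBAND_REGION_SIZE:
        raise ValueError("address is outside the bit-band region")
    # Each byte occupies 8 alias words (32 bytes); each bit one word.
    return alias_base + byte_offset * 32 + bit * 4


print(hex(bitband_alias(0x200F_FFFF, 0)))
print(hex(bitband_alias(0x200F_FFFF, 7)))
```

A store to the alias word then sets or clears just that bit, with
the read-modify-write done by the hardware rather than by an
instruction sequence.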
Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.
The tricky part seems to be in discerning when (and where) extra
effort is justified. This also depends on how easily the
difficulty can be encapsulated. Can a compiler reliably "do the
right thing" (without having to have been written by a supergenius
AI)? Can a library reliably provide the necessary extra
functionality — splitting the difficulty between application
programmer discipline and difficulty of developing the system
software — without requiring genius system programmers and highly
competent application programmers?
Someone who writes lock-free methods for fun is probably not well-
positioned to estimate the difficulty/lack-of-fun of such for most
programmers. Communication between different interest groups seems
critical, but communication also requires data and not just
anecdotes or traditional wisdom. (Anecdotes and traditional wisdom
do have value!)
[snip]
But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.
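To make "deviations from sequential consistency" concrete, here is
a small sketch that enumerates the outcomes of the classic
store-buffer litmus test (thread 0: x = 1; r0 = y, thread 1:
y = 1; r1 = x) under two toy models: sequential consistency, and
a simplified TSO-like model in which each thread has a FIFO store
buffer with forwarding to its own loads. Both models and all names
are my own simplifications for illustration, not a description of
My66000 or of any particular hardware.

```python
# Each thread is a tuple of (op, variable, value-or-register).
PROGRAM = (
    (("store", "x", 1), ("load", "y", "r0")),
    (("store", "y", 1), ("load", "x", "r1")),
)


def outcomes(use_store_buffers):
    """Return the set of all reachable (r0, r1) results."""
    results = set()

    def explore(pcs, mem, bufs, regs):
        finished = all(pc == len(PROGRAM[t]) for t, pc in enumerate(pcs))
        if finished and not any(bufs):
            results.add((regs["r0"], regs["r1"]))
            return
        for t in range(len(PROGRAM)):
            # Choice 1: thread t executes its next instruction.
            if pcs[t] < len(PROGRAM[t]):
                op, var, arg = PROGRAM[t][pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op == "store" and use_store_buffers:
                    # The store waits in the thread's own FIFO buffer.
                    new_buf = bufs[t] + ((var, arg),)
                    new_bufs = bufs[:t] + (new_buf,) + bufs[t + 1:]
                    explore(new_pcs, mem, new_bufs, regs)
                elif op == "store":
                    # Sequential consistency: visible to all at once.
                    explore(new_pcs, {**mem, var: arg}, bufs, regs)
                else:
                    # Load: newest own buffered store wins, else memory.
                    value = mem[var]
                    for buffered_var, buffered_val in bufs[t]:
                        if buffered_var == var:
                            value = buffered_val
                    explore(new_pcs, mem, bufs, {**regs, arg: value})
            # Choice 2: the oldest buffered store of thread t drains.
            if bufs[t]:
                var, val = bufs[t][0]
                new_bufs = bufs[:t] + (bufs[t][1:],) + bufs[t + 1:]
                explore(pcs, {**mem, var: val}, new_bufs, regs)

    explore((0, 0), {"x": 0, "y": 0}, ((), ()), {})
    return results


print(sorted(outcomes(use_store_buffers=False)))
print(sorted(outcomes(use_store_buffers=True)))
```

The only difference between the two result sets is the outcome
r0 = r1 = 0, which the store buffers allow and sequential
consistency forbids; that single extra outcome is the kind of
deviation a TSO specification has to describe and software has to
fence against where it matters.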
When an effort such as parallel programming is considered hard,
there seems to be a spectrum of viewpoints, from the UNIX/"real
programmers" perspective of limiting the effort to experts to the
perspective of simplifying the system interface so that almost
anyone can do almost anything.
The extreme positions have obvious cultural issues (where
expertise is either required for worth or expertise is despised as
arrogance) as well as mechanical issues (expertise is naturally
limited by finite knowledge — where vast knowledge implies
communication overhead even within a single supercomputer
complex).
[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.
This is one strong reason for complexity to be shifted toward
hardware, but I think that there is a danger of "toward" becoming
"into".
I do not know nearly enough about memory ordering considerations
in hardware and software to have more than an opinion based on
which experts I believe (and a tiny amount of rational/data
consistency inference on my part). From what I have read, TSO
seems to be the best tradeoff of hardware overhead and software
difficulty, but I suspect the best set of fine-grained guarantees
may be somewhat different from what simple TSO provides. I also
think there may be noticeable advantages to allowing use that is
outside of recipes/the formal interface.
Documenting an interface often brings an assumption of continuity,
so (besides the cost of writing documentation) there is a
disincentive to expose internals that leak through the abstraction
layer.
(That was a very wordy response.)