On 12/6/23 2:54 AM, Anton Ertl wrote:
[email protected] (Scott Lurndal) writes:[snip]
One of my professors back in the late 70's was researching
data flow architectures. Perhaps it's time to reconsider the
unit of compute using single instructions, instead providing a
set of hardware 'functions' than can be used in a data flow environment.
We already have data-flow microarchitectures since the mid-1990s, with
the success of OoO execution. And the "von Neumann" ISAs have proven
to be a good and long-term stable interface between software and these
data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
and their microarchitectures turned out to be uncompetetive.
I suspect that a superior interface could be designed which
exploits diverse locality (i.e., data might naturally be closer to
some computational resources than to others) and communication
(and storage) costs and budgets (budgets being related to urgency
and importance).
They also (as I
understand) lacked value prediction whereas OoO effectively uses
value prediction for branches.
While the current interface does allow for significant improvement
exploiting common behaviors and dynamic microarchitectural
information, it does not seem (to me) an ideal interface.
I doubt an alternative interface will be adopted any time soon.
Even with a better understanding than in the 1970s, the problem is
not simple and early developments would not be competitive even if
all the R&D NRE was ignored.
* Explicit fast/small memories that leaves the problem of locality
to software rather than adding the hardware complexity of cache tags
My explanation for the lack of success of explicit local memory is
that it is too hard for software to manage; dealing with it would
be a cross-cutting concern that would make the software significantly
more complex.
* Software prefetch instructions. Ok, we have a cache instead of
explicit local memories, but can't we let software improve the
cache's performace with prefetch hints? I have read some reports
that the experience is that adding such instructions showed
disappointing speedups, and slowdowns also happened. While many
architectures have this feature, it seems to be used little.
... why there should be a significant predictability for values,
the answer was that the values would come from executing along
a mispredicted path: E.g., when an IF is mispredicted, the values of
the code after the ENDIF might still compute to the same values.
I think it is quite remarkable that a sequence of instructions, with
branches and RAM and registers, is what we already have in S/360 (59
years ago) and (I think) in the IBM 704 (69 years ago); I am not
familiar enough with earlier machines or the analytical engine to
comment on them, but this style might be even older.
In article <[email protected]>, >[email protected] (Anton Ertl) wrote:
* Software prefetch instructions. Ok, we have a cache instead of
explicit local memories, but can't we let software improve the
cache's performace with prefetch hints? I have read some reports
that the experience is that adding such instructions showed
disappointing speedups, and slowdowns also happened. While many
architectures have this feature, it seems to be used little.
I'm reasonably confident that the increases in code size, and the >correspondingly increased I-cache misses, were what wrecked x86-32 >prefetching in my case.
An architecture that had prefetch hints designed in from the beginning
might be more efficient with them, as long as it didn't have IA-64's dumb >implementation. That produced memory misreads as an architectural feature, >while ensuring that they usually didn't happen when running under a
debugger.
... why there should be a significant predictability for values,
the answer was that the values would come from executing along
a mispredicted path: E.g., when an IF is mispredicted, the values of
the code after the ENDIF might still compute to the same values.
Only if your programmers write the same code twice. The point of those
IFs is surely to avoid redundant calculation? Programmers are fairly good
at minimising computation, in terms of expression evaluation. They're
much less good at minimising branching.
There have been a number of attempts to design architectures with more explicit hardware-oriented features, as well as attempts to design architectures with more software-oriented features.
On the hardware-oriented side we have seen:
* Explicit fast/small memories
My explanation for the lack of success of explicit local memory is
that it is too hard for software to manage;
* Software prefetch instructions.
* Weak memory models.
One can argue that this particular architectural misfeature has
succeeded: After all, in shared-memory multiprocessors you can buy
today, sequential consistency is nonexistent, and weaker models such
as TSO or even weaker ones dominate.
OTOH, very few programmers program to these weak models.
Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away;
e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.
They also (as I
understand) lacked value prediction whereas OoO effectively uses
value prediction for branches.
Branch prediction is branch prediction, value prediction predicts the
values of registers or memory locations; I have seen research about
value prediction maybe 20 years ago, but I am not aware that it has
been productized.
IIRC when I asked someone who presented a paper on
value prediction why there should be a significant predictability for
values, the answer was that the values would come from executing along
a mispredicted path: E.g., when an IF is mispredicted, the values of
the code after the ENDIF might still compute to the same values.
However, given the rareness of mispredictions these days, this source
of value prediction probably provides too little benefit to justify
the cost (not to mention Spectre considerations, which came in later).
Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.
Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.
While I agree that CPUs should provide sequential consistency (it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd
be surprised if it "handily outperforms" the status quo: concurrent programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.
Stefan
On 12/20/2023 8:29 AM, MitchAlsup wrote:
Stefan Monnier wrote:
Isn't that as it should be? Hardware provides some simple, possibly >>>> inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers >>>> translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language, >>>> while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software >>>> that handily outperforms using libraries or system calls on hardware >>>> designed for weaker memory models.
While I agree that CPUs should provide sequential consistency (it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd
be surprised if it "handily outperforms" the status quo: concurrent
programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.
As Lamport showed, one only NEEDs sequential consistency when accessing
memory somebody else can access at the same time. While running single
threaded, one needs nothing more than weak or casual consistency.
What about concurrent multi-threaded code than can work without using sequential consistency on existing arch's?
As to costs::
l = p[ i / 123456789 ];
r[7] = s;
The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
the unaliased r[7] ST cannot become visible until the LD has become
visible; yet since they cannot alias, this thread is not sensitive
to the order in which they are performed--and certain compilers with
certain flags will more the ST above the LD--certain HW memory queues
will also allow ST to become visible before the LD.
Does this happen in practice--not very often.
Stefan
Isn't that as it should be? Hardware provides some simple,
possibly inconvenient interface, and software abstracts the
inconvenience away; e.g., RISCs provide only a load/store
interface, and compilers translate the stuff programmers write into
this interface. The difference is that with RISCs the result is a
least as fast as architectures that tried to cater more to the
programming language, while I am convinced that hardware where the designers design for efficient sequential consistency will let
programmers write software that handily outperforms using libraries
or system calls on hardware designed for weaker memory models.
While I agree that CPUs should provide sequential consistency (it
would make life easier for debugging, formalization, and reasoning,
and I can't think of any reason why it should have a significant
cost),
I'd be surprised if it "handily outperforms" the status quo:
concurrent programming is hard even with sequential consistency, so
most normal code would still continue using the exact same libraries.
Stefan
Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.
While I agree that CPUs should provide sequential consistency (it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd
be surprised if it "handily outperforms" the status quo: concurrent >programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.
As to costs::
l = p[ i / 123456789 ];
r[7] = s;
The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
the unaliased r[7] ST cannot become visible until the LD has become
visible; yet since they cannot alias, this thread is not sensitive
to the order in which they are performed
--and certain compilers with
certain flags will more the ST above the LD
--certain HW memory queues
will also allow ST to become visible before the LD.
[email protected] (MitchAlsup) writes:
As to costs::
l = p[ i / 123456789 ];
r[7] = s;
The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
the unaliased r[7] ST cannot become visible until the LD has become >>visible; yet since they cannot alias, this thread is not sensitive
to the order in which they are performed
What do you mean with "visible"?
You present a single-threaded
program here, why talk about memory ordering at all?
And why would I want to perform the store early? The more typical
case is that you have a store architecturally before a load, and you
want to perform the load early.
What current CPUs do is to ask a predictor whether two accesses alias,
and if the predictor says that they don't, it speculates that they can
move past each other. If the speculation turns out to be wrong, the speculative state is dropped, just as in case of a branch
misprediction.
And for multi-threaded accesses to memory, sequential consistency can
be implemented in a simular way: Have a predictor that tells you what
memory is contended, and for non-contended memory, speculatively do
the fast thing, and if another thread turns out to access the same
memory, recover and do it in the architectural order.
--and certain compilers with
certain flags will more the ST above the LD
That is irrelevant for the question of how an architecture should
perform.
It's the other way round: programming languages pass on the
bad programming models that bad (i.e., weakly ordered) architectures
provide. What should they do? Provide sequential consistency by
inserting a memory barrier between any two memory accesses? On weakly ordered hardware where memory barriers are extremely slow?
If, OTOH, architectures implemented efficient sequential consistency, programming languages would provide sequential consistency. And those
that don't would be abandoned by everybody but Chris M. Thomasson.
--certain HW memory queues
will also allow ST to become visible before the LD.
Yes, we know it's possible to design bad hardware. The idea here is
to tell the hardware designers that weak ordering is not good enough.
- anton
Anton Ertl wrote:
You present a single-threaded
program here, why talk about memory ordering at all?
Somebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak >consistencies do not.
Also note:: One can perform a younger LD early in the face of a
ST with unknown address, if the pipeline can reRun the LD again
if the older ST happens to conflict with the value LDed. When one
demands SC, you cannot do this unless you are willing to back all
the way up to the ST with the conflict.
--and certain compilers with
certain flags will more the ST above the LD
That is irrelevant for the question of how an architecture should
perform.
But is entirely relevant to how a �Architecture is allowed to perform.
When the �Architecture is allowed to move LDs around STs when there
is no overlap in the containers accessed, performance is improved.
If, OTOH, architectures implemented efficient sequential consistency,
programming languages would provide sequential consistency. And those
that don't would be abandoned by everybody but Chris M. Thomasson.
As illustrated above SC-only harms performance, not by a lot, but by enough >to convert a winning �Architeccture into a Ho-Hum competitor.
And B: HW designers do not understand memory ordering requirements >sufficiently that they CAN design HW to the specifications you request.
It is not in their DNA....They work with languages that provide the
illusion of everything that can be in parallel already is in parallel
and they did not have to do anything to cause that to be.
[email protected] (MitchAlsup) writes:
Anton Ertl wrote:
You present a single-threaded
program here, why talk about memory ordering at all?
Somebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak >>consistencies do not.
In cases where there is no behavioural difference between sequential consistency and weaker memory orderings, there is no fundamental
reason why there should be a performance difference.
You are apparently thinking about a particular implementation (my
guess is: MY66000). What may hold for this implementation does not necessarily hold in general.
Also note:: One can perform a younger LD early in the face of a
ST with unknown address, if the pipeline can reRun the LD again
if the older ST happens to conflict with the value LDed. When one
demands SC, you cannot do this unless you are willing to back all
the way up to the ST with the conflict.
I would just check whether the result of the load has changed from the speculatively executed version, and if so, restart from load. Why
would you restart from the earlier store?
--and certain compilers with
certain flags will more the ST above the LD
That is irrelevant for the question of how an architecture should
perform.
But is entirely relevant to how a �Architecture is allowed to perform.
What a microarchitecture is allowed to do is entirely limited by the architecture. If the architecture specifies sequential consistency,
the microarchitecture has to implement sequential consistency. If the architecture specifies TSO, the microarchitecture has to implement TSO
or better. What some compiler does is irrelevant.
When the �Architecture is allowed to move LDs around STs when there
is no overlap in the containers accessed, performance is improved.
containers?
If, OTOH, architectures implemented efficient sequential consistency,
programming languages would provide sequential consistency. And those
that don't would be abandoned by everybody but Chris M. Thomasson.
As illustrated above SC-only harms performance, not by a lot, but by enough >>to convert a winning �Architeccture into a Ho-Hum competitor.
I have not found any such illustration in what you wrote, only claims (probably based on a MY66000 microarchitecture).
In particular, given that for single-threaded programs the behaviour
is the same between SC and weak ordering, it is possible to implement
a microarchitecture that provides SC and has the same performance for single-threaded programs as a weakly-ordered competitor.
Now, consider access to shared memory in multi-threaded programs.
With a microarchitecture that is optimized for weak ordering, every
accesses to shared memory will be slow (by having to use slow atomic operations, memory barriers and somesuch). By contrast, with a microarchitecture that is optimized for sequential consistency,
accesses to shared memory will be fast in the uncontended case and
slow only in the contended case (then the reruns are needed). Bottom
line: If the programmer made contention rare, the microarchitecture
designed for sequential consistency wins.
Compare with precise exceptions: The Alpha guys told us that precise exceptions are too slow, and gave us imprecise exceptions with a slow
trap barrier instruction (TRAPB) for limiting the imprecision; and programmers should avoid TRAPB at all costs. Then they implemented
the 21264, and lo and behold, TRAPB was as cheap as a NOP; it probably
was a NOP, and the 21264 implemented precise exceptions, because it
comes for free when you do an OoO microarchitecture with speculation.
And the 21264 was faster than the 21164a with its imprecise
exceptions.
And B: HW designers do not understand memory ordering requirements >>sufficiently that they CAN design HW to the specifications you request.
It is not in their DNA....They work with languages that provide the >>illusion of everything that can be in parallel already is in parallel
and they did not have to do anything to cause that to be.
Whatever DNA they may have, they have managed to implement the
sequential semantics (including precise exceptions) of instruction
sets, and I am sure that they will manage to implement sequential
consistency efficiently if they are tasked to do that. But we have to
tell them that weak ordering (which is easier to implement) is not
good enough.
- anton
I'm reasonably confident that the increases in code size, and the >correspondingly increased I-cache misses, were what wrecked x86-32 >prefetching in my case.
I have read explanations along these lines, or that the prefetch
hints consumed instruction decode/retire/LSU resources that caused
useful instructions to be delayed. I find both of these explanations surprising, because I would expect that, if a program spends
significant time on D-cache misses (so that prefetches could help),
reducing that would far outweigh any I-cache or resource contention
effects.
Do you mean the speculative loads that would return NaTs if the
memory access is invalid? If so, why does that not happen when
running under a debugger.
Maybe I have not made the point clear enough
In article <[email protected]>, >[email protected] (Anton Ertl) wrote:
Do you mean the speculative loads that would return NaTs if the
memory access is invalid? If so, why does that not happen when
running under a debugger.
No. What I found was much, much worse. IA-64 had load-advance and
load-check instructions. You were supposed to issue the load-advance as
soon as you knew the address you were going to need and it happened >asynchronously. The load-check was quick if the value had arrived, but >otherwise stalled the pipeline until it did arrive. If the hardware had >discarded an advance load, as it was permitted to do, this meant a full
wait for memory/cache.
The problem came when you were fetching into floating-point registers,
which were not windowed, although the way they could move around for >modulo-scheduled loops confused people on that front.
The sequence of events that caused problems was:
1. Advance load into floating-point register fx.
2. Call subroutine, which wants to use fx.
3. Subroutine pushes fx to stack, and starts using it.
4. Advance load arrives, overwriting fx.
5. Subroutine may well go wrong, given its register has been corrupted.
6. If subroutine doesn't notice, it finishes its work.
7. Subroutine pops fx from the stack.
8. Subroutine returns.
9. Check load is executed. The ALAT says the load has happened.
The value in fx, nonetheless is what was there /before/ you issued the >advance load.
If you have breakpoints anywhere along this path, they cause pipeline
flushes and the problem does not (usually) reproduce.
IA-64 did not prove that all its ideas were worthless, only that the way >they'd been put together was counterproductive. Nonetheless, anyone else
who tries to launch a EPIC processor is going to face some very hostile >questioning, and it's perfectly understandable that nobody wants to try.
Which compiler was this?
Ouch. This clearly is a compiler bug. It either should abstain from
moving a load across the call, or perform a ld.c before the call.
I have discussed earlier
(<[email protected]>
and others) why EPIC lost against OoO, and it's not just bad
execution by Intel and HP.
On 12/23/2023 6:37 AM, John Dallman wrote:
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
Which compiler was this?
Microsoft. Our primary demand for Itanium, when things were starting up
for it, was Windows, because there was no 64-bit Windows at the time and
some large customers really wanted that. That demand evaporated like dry
ice in Death Valley when Intel announced their x86-64.
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
On 12/23/2023 1:44 PM, MitchAlsup wrote:
Chris M. Thomasson wrote:
On 12/23/2023 6:37 AM, John Dallman wrote:
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
Which compiler was this?
Microsoft. Our primary demand for Itanium, when things were starting up >>>> for it, was Windows, because there was no 64-bit Windows at the time and >>>> some large customers really wanted that. That demand evaporated like dry >>>> ice in Death Valley when Intel announced their x86-64.
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can
provide the primitives such that SW can invent and use whatever primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}
Well the thing about cmp8xchg16 and friends is that all the words need
to be adjacent to one another.
Then there is cmpxchg16 that allows for
true 128 bit CAS on a 64 bit system, again wrt adjacent words...
cmp8xchg16 was always strange to me.
Now, speaking of IBM, iirc, are you familiar with the PLO locking
scheme? I think it was called PLO. Iirc, it was in the same appendix as
the lock-free IBM free pool manipulation?
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
On 12/23/23 4:44 PM, MitchAlsup wrote:
Chris M. Thomasson wrote:[snip]
Do you remember the rather odd instruction for the Itanium
called cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap
2 over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My
point was and still is that one can add 1 to 3 ATOMIC instructions
every product rev, or one can provide the primitives such that SW can
invent and use whatever primitive they can figure out how to program
and how to use. {{There is no 3rd choice}}
Third choice: do both (likely poorly).
ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)
(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice.
I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.
Methods applied for locks might help in some cases for hardware-
monitored reservations, but designing an interface seems difficult
(general enough to be useful yet also high enough performance to
be useful) and dangerous (new techniques and use cases can change
the two difficulty factors).
Non-composing hardware lock elision seems to have the advantage of
including an exclusive name. I imagine that one could use that
name to establish a single order of atomic operations when there
is conflict, but even then maximizing throughput seems difficult.
If one falls back to all users of the lock name wait in line, one
likely loses a lot of potential concurrency while the lock name is
in "queuing mode". I suspect there are some tricks that would
allow some queue skipping at moderate complexity/area/power/etc.
cost (possibly involving versioned memory).
I have not studied software concurrency (just "overhearing" a few
things from time to time), but even a sub-novice knows that
locking is hard (deadlock, livelock, etc.).
On 12/19/23 3:49 AM, Anton Ertl wrote:
"Paul A. Clayton" <[email protected]> writes:[snip]
I suspect that a superior interface could be designed which
exploits diverse locality (i.e., data might naturally be closer to
some computational resources than to others) and communication
(and storage) costs and budgets (budgets being related to urgency
and importance).
There have been a number of attempts to design architectures with more
explicit hardware-oriented features, as well as attempts to design
architectures with more software-oriented features.
Communication of the (absolute and relative) costs and usefulness
in both directions between hardware and software is a significant
factor. Tossing difficult things to the other side of the
interface just because they are difficult is problematic, but the
relative difficulty can justify the placement of work.
I think it is quite remarkable that a sequence of instructions, with
branches and RAM and registers, is what we already have in S/360 (59
years ago) and (I think) in the IBM 704 (69 years ago);
On 12/23/2023 6:45 PM, Paul A. Clayton wrote:
On 12/23/23 4:44 PM, MitchAlsup wrote:
Chris M. Thomasson wrote:[snip]
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap 2
over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My point
was and still is that one can add 1 to 3 ATOMIC instructions every
product rev, or one can provide the primitives such that SW can invent
and use whatever primitive they can figure out how to program and how
to use. {{There is no 3rd choice}}
Third choice: do both (likely poorly).
ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)
(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice. I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation. However, I would not be
surprised if there was an advantage to easier idiom recognition
for atomic operations that would readily benefit from some
optimizations. The code density disadvantage would be difficult
to remove (a sequence of primitives will typically be longer in
bytes than a complex operation), but greater flexibility might
pay for lower code density especially for less frequent
operations.)
Programmers have to be very careful with LL/SC because they can fail for other reasons besides the comparand being different wrt CAS. Live lock
can enter the picture.
(Reserving a single "page" would also seem to make ordering easier
at the cost of more false sharing. Spatially dense atomics are
also less interesting.)
Methods applied for locks might help in some cases for hardware-
monitored reservations, but designing an interface seems difficult
(general enough to be useful yet also high enough performance to
be useful) and dangerous (new techniques and use cases can change
the two difficulty factors).
Non-composing hardware lock elision seems to have the advantage of
including an exclusive name. I imagine that one could use that
name to establish a single order of atomic operations when there
is conflict, but even then maximizing throughput seems difficult.
If one falls back to all users of the lock name wait in line, one
likely loses a lot of potential concurrency while the lock name is
in "queuing mode". I suspect there are some tricks that would
allow some queue skipping at moderate complexity/area/power/etc.
cost (possibly involving versioned memory).
I have not studied software concurrency (just "overhearing" a few
things from time to time), but even a sub-novice knows that
locking is hard (deadlock, livelock, etc.).
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
On 12/23/23 4:44 PM, MitchAlsup wrote:
Chris M. Thomasson wrote:[snip]
Do you remember the rather odd instruction for the Itanium
called cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap
2 over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My
point was and still is that one can add 1 to 3 ATOMIC instructions
every product rev, or one can provide the primitives such that SW can
invent and use whatever primitive they can figure out how to program
and how to use. {{There is no 3rd choice}}
Third choice: do both (likely poorly).
ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)
(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice.
In my experience with large scale systems, it has been shown
that LL/SC cannot scale so long as it must acquire the cache
line to complete the transaction.
The advantage of atomics (particularly load+add) is that the
core doesn't need to acquire exclusive access to the cache line
before completing the operation - the atomic can be easily delegated
to a higher level (e.g. LLC) of cache, passed to the cache
that owns the line, or even to the memory controller
(or a PCI express device).
I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.
ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.
Right, wrt the reservation granule. Iirc, this can be larger than a l2
cache line?
On 12/24/23 1:39 PM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
On 12/23/23 4:44 PM, MitchAlsup wrote:
Chris M. Thomasson wrote:[snip]
Do you remember the rather odd instruction for the Itanium
called cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap
2 over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My
point was and still is that one can add 1 to 3 ATOMIC instructions
every product rev, or one can provide the primitives such that SW can
invent and use whatever primitive they can figure out how to program
and how to use. {{There is no 3rd choice}}
Third choice: do both (likely poorly).
ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)
(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice.
In my experience with large scale systems, it has been shown
that LL/SC cannot scale so long as it must acquire the cache
line to complete the transaction.
I think this is significantly an implementation choice.
IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).
I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got
the impression that this was not implemented to provide best-
reasonable performance; correct operation is obviously more
important, especially for that ISA, but the description of how
conflicts were handled was disappointing to me. (By the way,
z/Architecture also promoted memory-updating compute instructions
to be memory atomic if aligned.)
With a single reservation granule, ordering is much easier.
The advantage of atomics (particularly load+add) is that the
core doesn't need to acquire exclusive access to the cache line
before completing the operation - the atomic can be easily delegated
to a higher level (e.g. LLC) of cache, passed to the cache
that owns the line, or even to the memory controller
(or a PCI express device).
Obviously, idiom recognition could translate ll/add/sc into
an atomic add. (Alternatively, an atomic add could be implemented
using the mechanisms provided for ll/sc.)
While I would prefer a denser idiom and one that constrains the
cost of idiom recognition, even a 12-byte "atomic instruction"
using ordinary ll/sc and a compute instruction might not be so
horrible. (I mentioned previously the possibility of a special
ll that implied a sc after a compute operation, providing both
a more "outspoken" idiom and a shorter expression.)
I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.
ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.
I think Intel and IBM made the mistake of going directly to a
largish scope (L1 cache write set and Intel at least extending the
read set with a conservative filter) rather than providing high
performance for small transactions and perhaps the possibility
to experiment with larger transactions. Both also probably
suffered from slow failure (I seem to recall Intel's TSX had
slow failures even for small transactions).
Cliff Click proposed (in "I Wanna Bit") a L1-capacity read set and
single access write set with the further restriction that all
cache misses failed the atomic event (so retries were expected).
He also had a few blog posts about the problems faced with Azul
Systems' transactional memory (e.g., Java libraries often had
software performance counters that artificially introduced
conflicts).
I think My 66000's ESM is a good first step, but I would be
tempted to provide the ability to play with more extensive read
sets in the first version. That might not be very useful, but I
suspect some clever programmers would find uses.
A conservative filter can be used to extend the read set, so an implementation would not necessarily need to change the cache
tag metadata. (Conservative filters were used only for evictions,
if I understand correctly, for Intel TSX. Per-cache-block marker
bits were used to track both read and write sets.)
In addition to all the other issues with extending the capacity
of atomic memory operations, there is also the question of
version-based vs. read/write set tracking. Versioning could
facilitate some speculative multithreading and more generally
allowing orderable conflicts, but versioning is more complex.
There are also reasons why one would like to leak more of the
implementation information rather than provide a generic
abstraction (not only for what operations are more likely to
be fast/succeed but also what information might be usefully
passed across the interface).
I think having hardware on which to test ideas would be a
significant benefit. I also know that I am biased by seeing
limited transactional memory support as being relatively
cheap and an obvious extension to a more clearly useful
interface like ESM. The main problem with providing such seems
to be the burden of compatibility.
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause
in extending LL/SC active regions to include additional loads and
stores within the single reservation.
the SC to (Store Exclusive) to fail.
Right, wrt the reservation granule. Iirc, this can be larger than a l2
cache line?
Actually -any- store regardless of PA.
The reservation granule can, in the degenerate case (e.g. on an ARM Cortex-M7), be the entire address space. In general, the size is implementation defined (between 4 and 512 words for ARMv8).
From DDI0487_I_a
When a PE writes using any instruction other than a Store-Exclusive instruction:
� If the write is to a PA that is not marked as Exclusive Access by its local
monitor and that local monitor is in the Exclusive Access state, it is
IMPLEMENTATION DEFINED whether the write affects the state of the local
monitor.
� If the write is to a PA that is marked as Exclusive Access by its local
monitor, it is IMPLEMENTATION DEFINED whether the write affects the
state of the local monitor.
� Any transition of the local monitor to the Open Access state not caused
by the architectural execution of an operation shown in Figure B2-4 must
not indefinitely delay forward progress of execution.
� LoadExcl/StoreExcl loops are guaranteed to make forward progress only if,
for any LoadExcl/StoreExcl loop within a single thread of execution,
the software meets all of the following conditions:
1 Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory effects,
preloads, direct or indirect System register writes, address translation instructions, cache or TLB
maintenance instructions, exception generating instructions, exception returns, ISB barriers, or
indirect branches.
An implementation is free to clear the monitor for almost any reason. Thus, applications are recommended to keep the number of instructions between the load and store deminimus (avoiding any other loads or stores).
On 12/25/23 9:44 AM, EricP wrote:
Scott Lurndal wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause
in extending LL/SC active regions to include additional loads
and
stores within the single reservation.
the SC to (Store Exclusive) to fail.
Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most
recent
LL is active at a time.
I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and
retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.
It requires an extra cache line buffer to retain all SCH updates
until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.
Not sure how useful multi-word LL/SCH/SCR is though since all
updates must
be within one cache line. It can do atomic double, quad and
octa-wide CAS,
or atomic compare-exchange plus increment a generation number.
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Scott Lurndal wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause
in extending LL/SC active regions to include additional loads and
stores within the single reservation.
the SC to (Store Exclusive) to fail.
Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent
LL is active at a time.
I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.
It requires an extra cache line buffer to retain all SCH updates until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.
Not sure how useful multi-word LL/SCH/SCR is though since all updates must
be within one cache line. It can do atomic double, quad and octa-wide CAS,
or atomic compare-exchange plus increment a generation number.
On 12/25/23 12:08 PM, MitchAlsup wrote:
Paul A. Clayton wrote:[snip]
On 12/25/23 9:44 AM, EricP wrote:
[snip]I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and
retains
the line lock, and SCR Store Conditional Release which
conditionally
applies all SCH and SCR updates and releases the line lock.
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Yes, but the HW does not know if that memory reference is
participating
or not in the event. That is why participating must be announced.
I was assuming a simple interface where all memory operations
inside the atomic operation participated. ASF allowed not only non-participating memory operations but also releasing of
participants (I am not certain if ESM has the latter).
ESM is
obviously a better and more flexible interface, but for merely
extending the semantics of an existing ISA interpreting LL as
Transaction Begin and SC as Transaction End seems fairly nice.
If I recall correctly, most ISAs with LL/SC give those
instructions very limited addressing modes, so limiting address
generation clutter to two memory accesses seems more desirable
than reusing LL (and one would have to add a "Store Conditional
Hold" instruction since SC ends the atomic operation).
For simple atomics, one does not need escaping memory operations
or release of reservations even though such are useful
features. If limited to one granule, escaping and releasing
operations do not make much sense (unless there is additional
false sharing tracking) — EricP's proposal was for only one
granule/cache block.
On 12/24/23 8:28 PM, MitchAlsup wrote:
Paul A. Clayton wrote:[snip]
On 12/24/23 1:39 PM, Scott Lurndal wrote:
In my experience with large scale systems, it has been shown
that LL/SC cannot scale so long as it must acquire the cache
line to complete the transaction.
I think this is significantly an implementation choice.
There are also architectural choices involved. ESM does not HAVE
TO hold onto cache lines, what it cannot allow is access of write
permission. With this paradigm, one can lose cache lines to
interference without loosing the ability to complete the ATOMIC
event.
ESM holds the read-and-write set in a write buffer, so the cache
itself holds no tracking data. If the data cache holds tracking
data, an eviction from limited associativity could allow a write
to the evicted cache block *if* the block was part of the atomic
operation. (If it was merely resident from previous accesses,
eviction is not important for the atomic operation, though
eviction might slow completion if that data is later part of the
atomic operation.)
Yes, for the read set the data does not need to be retained. A
list of addresses or a conservative filter are sufficient to
guard against writes. (If the atomic operation can establish a
time marker such that any write to evicted blocks happens "after"
the atomic operation commits. For something like ESM, a cache
block that has received its last update could be sent rather than
a NAK if the atomic operation is being delayed by a data from a
cache miss in the read set.
This "optimization" would have the
bad consequence of spreading poisoned data when an uncorrectable
ECC error is encountered for the cache miss — the atomic stores
dependent on that delayed data would have to be poisoned as
uncorrectable — as well as potentially being more complex than
beneficial.)
IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).
One of the reasons I invented the ATOIC control point and automagic
control transfer to the control point on ATOMIC failure. HW has to
leave the IP pointing somewhere, so why not at the failure recovery
point ??
I think the earlier ASF had that feature. The LL/SC interface
requires a separate branch instruction at the end. Yet there
also seem to be cases where the failure type would just
indicate "retry".
I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got
ESM mandates all TLB misses in the setup phase, so that none of
the participating STs miss in the TLB. {{Details matter}}
Yes, details matter! Experience and deep thought (and
communication with other [experienced] deep thinkers) can expose
details and how they interact. An elegant design has synergy,
recurring patterns, and encapsulation that seems natural (and
obvious after the fact).
[snip]
(By the way,
z/Architecture also promoted memory-updating compute instructions
to be memory atomic if aligned.)
I have STM, LDM, and MM as atomic, as long as the data size being
transferred does not cross a page boundary. The bus was designed
for page sized ATOMIC transfers, so why not allow the CPU to make
use of this property.
This is a nice synergy example.
With a single reservation granule, ordering is much easier.
You cannot reduce the exponent of complexity and retain a single
ATOMIC granule !! See my code illustrating moving (extracting
and then inserting) an element of a CDS where nobody can ever see
that element not INSIDE the CDS !! So, you need fewer events
and fewer events means less interference, so a lower exponent of
latency,.....
I was not suggesting that all atomics be limited to a single
granule but that this case could be optimized more easily than
for multiple granules. (The interaction of multi-granule and
single-granule atomics does complicate this, of course.)
Even within a larger atomic, single conflict optimizations are
possible. (I mentioned one possibility for ESM above.) If one
granule is under contention, the ordering principles for single
granule atomics may be applicable.
There are also optimizations for single contender, where handoffs
would be more practical. Even a case where a thread is about to
complete an atomic that would write data that would change the
participants of a conflicting atomic operation might allow some
optimization where the almost complete atomic prefetches the new
participant data to the thread with the other atomic operation and
provides that atomic with an "I'm next" token.
Another possible optimization would be to commit an atomic that
is waiting on a cache miss for one granule in its read set. If the
atomic is guaranteed to commit, it can release all the non-
dependent members to other threads (the dependent members have to
be reserved until the data arrives from the cache miss but their
modification happens "before" the accesses of the released
members).
Such would have the bad aspect that an uncorrectable
ECC error on the cache miss data would force additional memory
locations to be poisoned. (With limited versioned memory, a
rollback of the atomic event might be possible. Given that the
ECC error would be greatly delayed, I suspect undoing all the
dependent effects would be excessively complex and error-prone.)
[snip]
I think Intel and IBM made the mistake of going directly to a
largish scope (L1 cache write set and Intel at least extending the
read set with a conservative filter) rather than providing high
performance for small transactions and perhaps the possibility
to experiment with larger transactions. Both also probably
suffered from slow failure (I seem to recall Intel's TSX had
slow failures even for small transactions).
The mistake is that this is a problem the CPU is in a place to solve.
The CPU is not, the memory controller is. ...
Could you unpack that statement?
As the scope of the atomic increases (number of granules, amount
of computation), it would seem that minimizing communication would
favor doing more of the operation closer to the core. Moving the
compute to the data would also not necessarily move it all the
way to the memory controller. There may even be cases where
something like test-and-test-and-set is useful; a first non-atomic
read checks the data, then if the data needs updating an atomic
operation is performed. While simple cases like test-and-test-and-
set would avoid the first test when computation is moved to data,
one can imagine more complex cases where core-local data interacts
with the more contended data. There may be less communication if
core orchestrates the operation.
[snip]
I think My 66000's ESM is a good first step, but I would be
tempted to provide the ability to play with more extensive read
sets in the first version. That might not be very useful, but I
suspect some clever programmers would find uses.
Any read set larger than the outstanding miss buffer is either
supporting width the ISA cannot use, or providing a limit to
the width of ATOMICIty.
Unpack?
Since MM (memory copy) is already atomic when within a page, could
it not be included within an ESM operation even when more than
eight cache lines are accessed? Such would extend both the read
set and the write set. (I do not know that such would be useful,
but it seems likely not to add much hardware or complexity.)
On 12/25/23 1:38 PM, MitchAlsup wrote:
Paul A. Clayton wrote:[snip]
ASF and ESM still work even when there is no <data> cache.
The buffer holding 8 64-byte entries is a tiny data cache.
A conservative filter could be used instead of a data cache for an
extended read set.
While providing compatibility across implementations is very
valuable, varying capability across implementations does not seem
horrible to me.
ESM sets the minimum and maximum capacity of an atomic operation
for all implementations. This means that a capacity failure in
any implementation is guaranteed to be one in all implementations.
ESM also guarantees that atomic conflicts will be atomic
conflicts. A conservative filter would provide failures that are
not true conflicts. Limited-associativity cache-based tracking
would introduce evictions caused by the atomic memory accesses.
(A cuckoo/elbow cache — skewed associative cache where a victim
from one index is reinserted at one of its alternative indexes —
could greatly increase the effective associativity. Copying cache
blocks may be preferable to failing a transaction. A modest victim
cache could buffer the copy activity and avoid reinsertion based
on new information or longer analysis of information.)
Combining the optimism of no atomic conflicts with the optimism of
fitting into _effective_ capacity does not seem extremely wrong-
headed to me. ESM provides a way to moderate the atomic conflict
optimism with a "retry" that assumes N conflicts. Hardware might
be able to somewhat moderate effective capacity optimism (e.g.,
cuckoo cache, index recoding, borrowing capacity from a neighbor),
but I suspect using a non-optimistic fallback path would be
necessary.
Maybe only fools would push capacity limits, especially since
failing a large transaction is very expensive. (Perhaps the
ability to query estimated effective capacity might be useful
with a transaction. If a transaction could be split into
smaller transactions at significant cost — such that the
larger transaction would be preferred if there is low conflict
and capacity is sufficient — that choice might be made even
partially into the first attempt.)
[snip]
There are also optimizations for single contender, where
handoffs would be more practical. Even a case where a thread is
about to
complete an atomic that would write data that would change the
participants of a conflicting atomic operation might allow some
optimization where the almost complete atomic prefetches the new
participant data to the thread with the other atomic operation and
provides that atomic with an "I'm next" token.
Another possible optimization would be to commit an atomic that
is waiting on a cache miss for one granule in its read set. If the
atomic is guaranteed to commit, it can release all the non-
dependent members to other threads (the dependent members have to
be reserved until the data arrives from the cache miss but their
modification happens "before" the accesses of the released
members).
Be very careful, here.
I am somewhat aware that these are pushing into the hairy
complexity where even formal methods will not keep one from
losing both feet and most of both legs with death from
bleeding out if not from the explosion of the foot cannon
shell itself.
It is fun, however, to think of such possibilities.
Such would have the bad aspect that an uncorrectable
ECC error on the cache miss data would force additional memory
locations to be poisoned. (With limited versioned memory, a
rollback of the atomic event might be possible. Given that the
ECC error would be greatly delayed, I suspect undoing all the
dependent effects would be excessively complex and error-prone.)
Most systems take unrecoverable ECC failures are Machine Check.
You may be able to save the Guest OS, but you are not going to
be able to save the application.
Sometimes the application could be saved. If the data is in the
text segment, recovery may be trivial. If the data can be isolated
to a specific task and that task's starting data is available,
more localized restarting would be possible. Sometimes the
specific task could even be dropped (e.g., a Monte Carlo search
could just drop one random attempt with likely little harm).
Just killing the application may be the best option most of
time, but playing "what if" can be fun.
Uncorrectable tag errors for possibly dirty data seem more
fatal. Poisoning 0.1% of memory that matches the cache block's
index seems extremely likely to kill the system. Isolating
cache allocations with a mechanism like logical partitioning
would allow such a failure to be isolated to a fraction of the
active programs, but even then a tag error seems really bad.
Paul A. Clayton wrote:
There are also architectural choices involved. ESM does not HAVE TO hold
onto cache lines, what it cannot allow is access of write
permission. With this paradigm, one can lose cache lines to interference without loosing the ability to complete the ATOMIC event.
IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).
One of the reasons I invented the ATOIC control point and automagic
control transfer to the control point on ATOMIC failure. HW has to
leave the IP pointing somewhere, so why not at the failure recovery
point ??
I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got
ESM mandates all TLB misses in the setup phase, so that none of the participating STs miss in the TLB. {{Details matter}}
On 12/25/23 9:44 AM, EricP wrote:
Scott Lurndal wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause
in extending LL/SC active regions to include additional loads and
stores within the single reservation.
the SC to (Store Exclusive) to fail.
Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent
LL is active at a time.
I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.
It requires an extra cache line buffer to retain all SCH updates until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.
Not sure how useful multi-word LL/SCH/SCR is though since all updates
must
be within one cache line. It can do atomic double, quad and octa-wide
CAS,
or atomic compare-exchange plus increment a generation number.
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
The LL/SC interface could, in fact, be extended to have the LL be
Transaction Begin and the SC be Transaction End with arbitrary
scope. I get the impression that most ISAs provide a success
result in a GPR, so multiple failure modes could be provided.
Such extensions would only require updating documentation (and compilers/libraries and hardware implementations). Hardware could
be updated before the documentation not even necessarily needing
"chicken bits" to disable the functionality.
A natural progression of scope seems to be single access read-and-
write set, read and write set both within a single granule, write
set in single granule and read set extended to largish number of
granules, write set extended to several granules, finally perhaps
large number of granules for the write set.
(I think largish read sets are easy enough to support that
"several granule" read sets would not be a necessary step.)
Performance optimizations could be interleaved with expanding
scope.
MitchAlsup wrote:
Paul A. Clayton wrote:
There are also architectural choices involved. ESM does not HAVE TO hold
onto cache lines, what it cannot allow is access of write
permission. With this paradigm, one can lose cache lines to interference
without loosing the ability to complete the ATOMIC event.
In my HTM design I do NOT use the coherence protocol Invalidate messages
to detect collisions, as Intel RTM does and AMD ASF would have.
Instead I use a separate set of Atomic Transaction (ATX) messages
which travel on the coherence network but are never sent to cache.
However the Atomic Transaction Manager (ATM) can emit coherence messages
and cause lines to move between caches.
Making the ATX messaging completely separate from Invalidate messages
was crucial to eliminating the "false sharing cancellation" problem,
where a line that was touched for bytes ADJACENT to the ones in the transaction caused an Invalidate message that aborted the transaction.
IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).
One of the reasons I invented the ATOIC control point and automagic
control transfer to the control point on ATOMIC failure. HW has to
leave the IP pointing somewhere, so why not at the failure recovery
point ??
I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got
ESM mandates all TLB misses in the setup phase, so that none of the
participating STs miss in the TLB. {{Details matter}}
This is why I added a PREFETCH_TRANSLATION instruction so one can
pre-load the TLB for data without actually touching the data item and
causing the cache line to move between caches until the transaction
is actually in progress.
EricP wrote:
MitchAlsup wrote:
I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got
ESM mandates all TLB misses in the setup phase, so that none of the
participating STs miss in the TLB. {{Details matter}}
This is why I added a PREFETCH_TRANSLATION instruction so one can
pre-load the TLB for data without actually touching the data item and
causing the cache line to move between caches until the transaction
is actually in progress.
If you use a PRE (prefetch) instruction and specify DRAM as the location
to prefetch, you end up only making sure the TLB is loaded.
PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits determine where the data (line) is prefetched to {L1, L2, L3, DAM}.
I have also run into a couple of lock-like situations where one PRE's
the container with .lock, and later PUSH's the same container (also with .lock) to end the ATOMIC event without ever seeing the contained
value.
MitchAlsup wrote:
EricP wrote:
MitchAlsup wrote:
I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got
ESM mandates all TLB misses in the setup phase, so that none of the
participating STs miss in the TLB. {{Details matter}}
This is why I added a PREFETCH_TRANSLATION instruction so one can
pre-load the TLB for data without actually touching the data item and
causing the cache line to move between caches until the transaction
is actually in progress.
If you use a PRE (prefetch) instruction and specify DRAM as the location
to prefetch, you end up only making sure the TLB is loaded.
PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
determine where the data (line) is prefetched to {L1, L2, L3, DRAM}.
Also the prefetch translation should, if necessary, set the PTE's
Accessed and Dirty bits as setting them requires transferring the
PTE's cache line as exclusive into that D$L1 cache, which also takes time.
Gets as much stuff done as possible that would cause transaction latency before the transaction starts, and minimize the size of any contention window.
I'm not sure how much help the would really be though as many shared
data structures are accessed through pointers, and to get those pointers
you have to read the structure.
For example, in a double linked list insert you could PRE-fetch the translation for the list header, but to get the pointer to the first
object you have to read the header which requires transferring the
header's line this cache, which you don't want to do until the
transaction actually starts because doing so prematurely would just
delay the current updater's transaction.
I have also run into a couple of lock-like situations where one PRE's
the container with .lock, and later PUSH's the same container (also with
.lock) to end the ATOMIC event without ever seeing the contained
value.
You should mention that your PUSH instruction is a cache flush
"used to push data out of the cache back towards memory".
Paul A. Clayton wrote:
On 12/25/23 9:44 AM, EricP wrote:
Scott Lurndal wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause
in extending LL/SC active regions to include additional loads and >>>>>>> stores within the single reservation.
the SC to (Store Exclusive) to fail.
Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent
LL is active at a time.
I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.
It requires an extra cache line buffer to retain all SCH updates until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.
Not sure how useful multi-word LL/SCH/SCR is though since all updates
must
be within one cache line. It can do atomic double, quad and octa-wide
CAS,
or atomic compare-exchange plus increment a generation number.
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.
EricP wrote:
Paul A. Clayton wrote:
On 12/25/23 9:44 AM, EricP wrote:
Scott Lurndal wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause
in extending LL/SC active regions to include additional loads and >>>>>>>> stores within the single reservation.
the SC to (Store Exclusive) to fail.
Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent >>>> LL is active at a time.
I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains >>>> the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.
It requires an extra cache line buffer to retain all SCH updates until >>>> the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.
Not sure how useful multi-word LL/SCH/SCR is though since all updates
must
be within one cache line. It can do atomic double, quad and octa-wide
CAS,
or atomic compare-exchange plus increment a generation number.
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.
On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
The interrupt would cancel the lock,
and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
and so would erroneously update memory,
whereas a SCH would be tossed
because the lock was no longer set.
EricP wrote:
EricP wrote:
Paul A. Clayton wrote:
On 12/25/23 9:44 AM, EricP wrote:
Scott Lurndal wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause
in extending LL/SC active regions to include additional loads and >>>>>>>>> stores within the single reservation.
the SC to (Store Exclusive) to fail.
Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block. >>>>> I can't think of any way to make use of that since only the most
recent
LL is active at a time.
I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains >>>>> the line lock, and SCR Store Conditional Release which conditionally >>>>> applies all SCH and SCR updates and releases the line lock.
It requires an extra cache line buffer to retain all SCH updates until >>>>> the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.
Not sure how useful multi-word LL/SCH/SCR is though since all
updates must
be within one cache line. It can do atomic double, quad and
octa-wide CAS,
or atomic compare-exchange plus increment a generation number.
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.
On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.
If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the store can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the the event fails and none of the stores appear to have been performed.
In any event, if the thread performing the event cannot be completed,
due to transfer of control to more privilege operation, the event fails control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.
The interrupt would cancel the lock,
ESM cancels the event not the lock.
and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
Upon return you are no longer "in" the ATOMIC event (control transfers
to the control point before control transfers to the more privileged operation.
and so would erroneously update memory,
So, there is no control transfer "back" inside the ATOMIC event.
whereas a SCH would be tossed
because the lock was no longer set.
There should be no read or write set upon return.
MitchAlsup wrote:
EricP wrote:
EricP wrote:
Paul A. Clayton wrote:
On 12/25/23 9:44 AM, EricP wrote:
Scott Lurndal wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 12/24/2023 10:39 AM, Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
I also see little difficultyARM has specified that any store by the processor _may_ cause >>>>>>>>> the SC to (Store Exclusive) to fail.
in extending LL/SC active regions to include additional loads and >>>>>>>>>> stores within the single reservation.
Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block. >>>>>> I can't think of any way to make use of that since only the most
recent
LL is active at a time.
I had an idea a while back for an extension to LL/SC which allowed >>>>>> multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains >>>>>> the line lock, and SCR Store Conditional Release which conditionally >>>>>> applies all SCH and SCR updates and releases the line lock.
It requires an extra cache line buffer to retain all SCH updates until >>>>>> the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.
Not sure how useful multi-word LL/SCH/SCR is though since all
updates must
be within one cache line. It can do atomic double, quad and
octa-wide CAS,
or atomic compare-exchange plus increment a generation number.
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.
On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.
If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the store can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the the event fails and none of the stores appear to have been
performed.
In any event, if the thread performing the event cannot be completed,
due to transfer of control to more privilege operation, the event fails
control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.
Yes, my HTM has some similarities.
It has a formal start to the transaction that establishes an abort
transfer RIP address for when an interrupt or exception occurs.
That state change in the RIP indicates the transaction has terminated.
I have two sets of instructions, the usual atomic RMW and LL/SCH/SCR ones, and a completely different set for atomic transactions.
The LL/SCH/SCR is simple and cheap and could be easily retrofitted into
a core with LL/SC. However it has no abort transfer address and so needs
some other means to distinguish an atomic store from a regular one.
Thus SCH as a distinct instruction is required.
The interrupt would cancel the lock,
ESM cancels the event not the lock.
That reference to "lock" came from the LL Load Locked instruction.
My AT instructions refer to "canceling the transaction".
and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
Upon return you are no longer "in" the ATOMIC event (control transfers
to the control point before control transfers to the more privileged
operation.
Right, for multi-line transactions.
The LL/SCH/SCR instructions are not a multi-line transaction.
They are a single cache line atomic multiple update.
and so would erroneously update memory,
So, there is no control transfer "back" inside the ATOMIC event.
Right, in atomic transactions.
I see atomic RMW and LL/SCH/SCR as one class of instruction
as they all deal with updates to a single cache line
and their HW is optimized for just that one operation.
The AT Atomic Transaction instructions are completely separate,
dealing with multiple updates to multiple lines,
and having more flexibility but also more overhead.
whereas a SCH would be tossed
because the lock was no longer set.
There should be no read or write set upon return.
Yes, in a transaction there isn't - the interrupt/exception cancels the transaction tossing all pending guarded memory changes,
updates the status register, and sets the RIP to the abort address.
Though I have been thinking lately that might be overly conservative.
It might not need to cancel a transaction just because there is an interrupt, as long as the interrupt handler doesn't initiate a new transaction or
read or write any memory that is a member of the transaction.
That might cut down on spurious unnecessary transaction aborts.
On 12/21/2023 1:49 PM, Chris M. Thomasson wrote:
On 12/20/2023 8:29 AM, MitchAlsup wrote:
Stefan Monnier wrote:
Isn't that as it should be? Hardware provides some simple, possibly >>>>> inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers >>>>> translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language, >>>>> while I am convinced that hardware where the designers design for >>>>> efficient sequential consistency will let programmers write software >>>>> that handily outperforms using libraries or system calls on hardware >>>>> designed for weaker memory models.
While I agree that CPUs should provide sequential consistency
It's nice that they already do. The fun part is creating algorithms that
try to avoid it. Not only avoid seq_cst, but avoid a acquire and release
membars, and the fun acq_rel acquire release all in one... ;^) Now,
there are certain algorithms that, as-is, cannot avoid it (seq_cst).
Original SMR and, even Dekkers algorithm require seq_cst...
[...](it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd >>>> be surprised if it "handily outperforms" the status quo: concurrent
programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.
Keep in mind that acquire and release are not as strong as a seq_cst
barrier. Wrt the SPARC:
acquire = #LoadStore | #LoadLoad
release = #LoadStore | #StoreStore
Not a #StoreLoad in sight!
Be especially careful on which trigger you pull when both hands hold
guns pointing at your feet.
SunOS had a funny quirk--it used cacheable locks to guard uncacheable
stores.
On Fri, 29 Dec 2023 22:06:46 +0000, MitchAlsup wrote:
SunOS had a funny quirk--it used cacheable locks to guard uncacheable
stores.
That _sounds_ bad, at first. Surely it's a terrible security hole or something.
But why are some memory addresses marked as not cacheable? Usually,
it's because they aren't really RAM, but instead I/O ports or
something like that.
The lock, which indicates which process those addresses currently
belong to, is just data. So it isn't uncacheable for the same
reason as addresses like that.
Of course, while in _some cases_ this might make perfect sense, there
may well be other cases where this really is the horrible mistake it
seems to be on first sight.
John Savard
On 12/28/23 2:10 PM, EricP wrote:
EricP wrote:[snip extended LL/SC design]
Paul A. Clayton wrote:
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.
On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
The interrupt would cancel the lock, and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
and so would erroneously update memory, whereas a SCH would be tossed
because the lock was no longer set.
I had not considered this consequence of an interrupt, but this
issue would be "trivially" handled by distinguishing between
reservation active and reservation valid (i.e., use two bits with
one preserved across interrupts). The hardware would then know
that it is within a ll/sc enclosed code sequence and stores would
merely be buffered as usual until the sc was encountered. Advanced
hardware for such a primitive interface, knowing that stores would
not be preserved could elide them.
For a single cache block reservation, having to run through the
rest of a known-failed atomic attempt may not be problematic. With
broader uses, that would be painful; any failure would have to run
through the rest of the attempt before being able to consider
retrying. If the state on a failure is architected as boundedly
undefined, a simple scan for the sc (no execution to produce
identical final state) would suffice. It might be possible to add
this imprecise state aspect when allowing multiple accesses so
hardware could safely just scan after the second non-ll/sc memory
access.
A microarchitecture could theoretically cache sc locations similar
to a branch target buffer, but memoizing such seems difficult and
a prediction would have to be confirmed (e.g., by scanning for the
next sc while speculating that the predicted location was
correct). That seems needlessly complex.
It is not terribly difficult to imagine ways of addressing this
deficiency. One might be able to get away with reinterpreting a
branch instruction as setting a failure jump point, avoiding
adding instructions. Such a branch would have to use a nonsensical
condition like "if not equal to self" (assuming no one has used
that as a nop before the architecture decided to reserve it).
Short attempts would not include the branch and just run through
to find the sc; simple attempts that are somehow known not to have
only transient failure causes might target the ll instruction
producing a hard retry loop. One could also consider each register
tested for non-equality to self as providing additional control
information.
Given that opcode space is usually not so tight as to prevent a
secondary atomic interface, but one would really want to have the
second be comprehensive to avoid a third, fourth, fifth, etc.
Extending ll/sc semantics — without adding instructions — might
have allowed experience/feedback to accumulate for a second try.
(I do believe that one could substantially extend the semantics
without having to add instructions. When starting from a clean
slate is better idea seems usually to be realized late. "Free"
extensions also discourages trying again rather than trying to
fix an interface that mostly sort-of works.)
Given that none of the RISCs have actually extended ll/sc
semantics, the pressure to provide broader atomics must not have
been particularly great.
ASF did better in checkpointing the start such that a failure
jumped to a known location
that performed a test for
success/failure and branched to handling code.
ASF also preserved
the stack pointer; this does not seem a substantial benefit as
short attempts could probably avoid changing the stack pointer and
the relative overhead of explicitly saving the stack pointer is
not that great if the attempt is large.
Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.
Removing reservations avoids some capacity constraints, though
ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.
On 12/26/23 7:24 PM, MitchAlsup wrote:
[snip]
If you use a PRE (prefetch) instruction and specify DRAM as the
location
to prefetch, you end up only making sure the TLB is loaded.
PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
determine where the data (line) is prefetched to {L1, L2, L3, DAM}.
What if there is a cache after L3?
What if one wants to prefetch
to a cache layer that is directly connected to the memory
controller (no extra coherence traffic, available fairly in terms
of performance to all agents [except possible processor-in-
memory], etc.)?
What if the design has L3 slices which are home
nodes in terms of coherence state but allow allocation to a non-
home node with non-home allocations having lower associativity? (I
do not know if such would be sensible. Potential full use of all
L3 capacity by one core has advantages and home nodes must have
some advantage; the former could be accomplished by using other L3
slices as victim caches, but those then become L4 caches.) There
are presumably many other design points that would make the
4-level designation less than complete.
I am also curious about where a prefetch would be placed with RE
(or WE/RWE) permission for a cache level that could be either
instruction or data. RE _might_ make sense for some corner cases
such as checking if an executable chunk gives a matching CRC code
immediately before executing. Allocation to both caches might make
sense for RE.
WE/WRE might make sense as a hint that the block is going to be
written and shortly executed (in the split cache level case
possibly by different cores if that level had a shared Icache
level).
For WRE, I could see allocating the data to the data cache of the
specified level (possibly with a placement bias to favor cheaper
transfer to the instruction cache,
or perhaps an icache might send
some victims to a typically not shared cache level), but it is
less clear if the PTE should loaded to both data and instruction
TLBs if "refreshing" the TLB entry to allow execution (assuming it
was denied for writable pages) was cheaper.
There are are bypassing, NUCA, replacement metadata, and perhaps
other factors. One might want to prefetch into a cache level with
setting of a "Keep Me" bit to choose another victim the first time
that block is chosen as a victim (or a probabalistic ignoring of
the replacement choice, which might then want the probability).
("Keep Me" could be useful for prefetches that may be excessively
early but are almost always used "soonish".)
Similarly, a cache
might have metadata urging earlier eviction (set 'LRU' for LRU and
tree pLRU, set unaccessed for bit-vector NRU); this might also
involve more than one state as possible choices (e.g., NRU+LRU
pair might set unaccessed and LRU or unaccessed and MRU).
Metadata/placement may also be complicated on a cache hit. If a
prefetch targets L2 but hits L1, should it be treated as a nop?
What if L2 is shared by multiple cores and the block is dirty or
L2 does not include L1s? What if the L1 state was "next victim"?
Early victimizing _might_ make sense, but changing to "last
victim" might also make sense?
One might also want a prefetch that prepares a hardware prefetch
engine. Such a software prefecth might want metadata about the
type of hardware prefetching (unit stride/non-unit stride/address
is pointer/generic hardware prefetch).
Caching of paging structures further increases the possibilities.
What if one wants to prefetch the translation information for the
table level normally one above the last (i.e., either a huge page
PTE or a second level PDE)? (This might apply if software believes
a future access will be in a region. A hash table or random access
array might have access behavior that software knows well ahead
that the data structure will be used soonish but far enough away
that earlier prefetching translation information provides some
value over just doing a demand load.)
What if one wants to prefetch translation information to an L2
structure or a structure associated with a particular level of
cache (e.g., prefetch to L2 memory cache TLB to accelerate
indirection and page-crossing stride prefetches for the hardware
prefetcher at the L2 memory cache)?
Most of the above might be obviously bad ideas for all reasonable implementations, but I suspect that even experts would not know
all reasonable choices for the lifetime of the interface. Having
such configuration bits reference a table is well-known method for
adjusting to future changes (or inadequacy of
One might also have hints that are more directives (i.e., prevent
or disincentivize hardware from ignoring the hint or adjusting its
action) and this opens a cooperation space that might be quite
complex. Software might want to specify what hardware metadata is
given what weight. Software might have an estimated use distance
probability map (possibly with a model for probabilities of use by
all other agents agent before and/or after use), but presenting
hardware with 4KiB of metadata for a 64-byte prefetch seems not
quite right.
I have also run into a couple of lock-like situations where one PRE's
the container with .lock, and later PUSH's the same container
(also with .lock) to end the ATOMIC event without ever seeing the
contained
value.
When would one want to guard/reserve a cache block (.lock) without
accessing such?
Perhaps there is a coarser-grained software lock
that is checked without being monitored/reserved (or the
reservation is dropped quickly) such that the atomic event could
monitor the specific cache block without failing by holding the
more frequently used software lock block as reserved? PUSHing the
data outward could make sense if the data is likely to be used
soon by some other agent, but it is not clear why that should be
'locked' (unless that is necessary to end the event?)
I suspect I am missing something, possibly something obvious to
anyone familiar with shared memory programming.
On 12/25/23 7:12 PM, MitchAlsup wrote:
Paul A. Clayton wrote:[snip]
While providing compatibility across implementations is very
valuable, varying capability across implementations does not seem
horrible to me.
You still have to be able to make Architectural statements about
what does what. And if you want any semblance of compatibility
the lowest end machine has to be able to perform significant
ATOMIC events.
ESM sets the minimum and maximum capacity of an atomic operation
for all implementations. This means that a capacity failure in
any implementation is guaranteed to be one in all implementations.
For compatibility !!
But in all senses:: 8 cache lines is so much more powerful that
anything programmers have access to today, making it bigger or
its size vague is not worthy.
I think part of the question is how acceptable address-based
(variable internal caching conflicts) and capacity-based failures are.
If software does not know that allocations are specifically
aligned, accessing five 32-byte objects might or might not exceed
the 8-entry capacity of ESM.
If alignment-caused "capacity"
failures are acceptable, why are address-based associativity
conflicts so extremely problematic?
(Yes, in general, ensuring
that participating accesses fit within N cache blocks is easier
than ensuring that associativity constraints are handled [other
than by restricting block count to the associativity], but this is
not an absolute distinction.)
ESM also guarantees that atomic conflicts will be atomic
conflicts. A conservative filter would provide failures that are
not true conflicts. Limited-associativity cache-based tracking
would introduce evictions caused by the atomic memory accesses.
ESM still has a few spurious failure modes--especially in the
"methodological" operating mode.
I was not considering exceptions, but I guess there are other
failure modes beyond capacity and contention.
(A cuckoo/elbow cache — skewed associative cache where a victim
from one index is reinserted at one of its alternative indexes —
could greatly increase the effective associativity. Copying cache
blocks may be preferable to failing a transaction. A modest victim
cache could buffer the copy activity and avoid reinsertion based
on new information or longer analysis of information.)
A FA Miss Buffer solves this without complexity.
But full associativity tends to constrain capacity.
This is not a consideration if a particular capacity is sufficient
for all implementations for the lifetime of the interface — which
you judge to be the case — but if one wants to allow use of the
same interface for implementations supporting larger capacity then
not relying of full associativity may be justified.
[snip]
If you look into the "math" you will see that the exponent e in
BigO( m^e ) is what bounds (minimum) system time in an ATOMIC
event. Making bigger ATOMIC events makes e smaller. This rule
is opposite what you propose above. Although there are some
events where you rule may take hold I suspect minimizing the
total number of events needed reduces overhead faster than
making events as small as possible { Test_and_Set() }.
If supporting larger events reduces system time, why forever bound
capacity to 8 64-byte cache blocks?
I can understand your previous statements about wanting to provide
something cheap that is usable for more than all known small
atomic operations. I.e., a universal baseline guarantee makes
sense even to me and having the baseline by 8 (when I think you
mentioned the largest atomics only used 6) gives substantial room
for software growth. However, allowing a relatively large read set
seems to be a plausible extension. (Extensions can be temporal incompatibilities or also 'spatial' incompatibilities.)
I appear to value compatibility much less than you. I think it is
useful to have an interface that can be extended even though such
breaks compatibility.
Even 'spatial' incompatibility seems to have some potential
benefits. A special purpose system might want to use established
interface elements but make interface changes that fit the purpose.
Interface versions and subsets/profiles allow communicating
interface specifics more simply and communicating guarantees from
vendors and demands from users. Managing diversity has challenges
but prohibiting diversity (which includes temporal diversity/
extensions as such introduces multiple interfaces).
Some of this is just naming/branding (and the Ship of Theseus
problem). As a lumper, I am biased toward broad major categories
and consideration of similarities.
Paul A. Clayton wrote:
On 12/26/23 7:24 PM, MitchAlsup wrote:
[snip]
If you use a PRE (prefetch) instruction and specify DRAM as the location >>> to prefetch, you end up only making sure the TLB is loaded.
PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
determine where the data (line) is prefetched to {L1, L2, L3, DAM}.
What if there is a cache after L3?
Consider L3 as LLC.
What if one wants to prefetch
to a cache layer that is directly connected to the memory
controller (no extra coherence traffic, available fairly in terms
of performance to all agents [except possible processor-in-
memory], etc.)?
L3 in My 66000 Architecture is defined to be a read/write buffer
used by Memory controller and DRAM controller. L3 is a cache of
what DRAM either looks like now or should look like eventually.
I specify this organization as a means to eliminate RowHammer.
What if the design has L3 slices which are home
nodes in terms of coherence state but allow allocation to a non-
home node with non-home allocations having lower associativity? (I
This is not an allowable L3 configuration in My 66000 Architecture.
do not know if such would be sensible. Potential full use of all
L3 capacity by one core has advantages and home nodes must have
some advantage; the former could be accomplished by using other L3
slices as victim caches, but those then become L4 caches.) There
are presumably many other design points that would make the
4-level designation less than complete.
2-points
1) there are a limited number of bits (5) and a limited number of
cache horizons.
2) SW has enough trouble prefetching before getting into prefetch
from DRAM to L3 but no higher into the hierarchy, then later
prefetch into L2 from L3 if it is still there,.....
Not enough SW understand how to effectively utilize prefetch before
one gets into prefetch to where--and the converse case post-push
to where.
I am also curious about where a prefetch would be placed with RE
(or WE/RWE) permission for a cache level that could be either
instruction or data. RE _might_ make sense for some corner cases
such as checking if an executable chunk gives a matching CRC code
immediately before executing. Allocation to both caches might make
sense for RE.
RxE is prefetchable directly into L2 UCache (but no higher)
--E is prefetchable directly into L1 ICache
Rx- is prefetchable directly into L1 DCache
WE/WRE might make sense as a hint that the block is going to be
written and shortly executed (in the split cache level case
possibly by different cores if that level had a shared Icache
level).
For WRE, I could see allocating the data to the data cache of the
specified level (possibly with a placement bias to favor cheaper
transfer to the instruction cache,
That placement bias is L2 BTW.
or perhaps an icache might send
some victims to a typically not shared cache level), but it is
less clear if the PTE should loaded to both data and instruction
TLBs if "refreshing" the TLB entry to allow execution (assuming it
was denied for writable pages) was cheaper.
Fast shared L2 TLB solves this problem.
There are are bypassing, NUCA, replacement metadata, and perhaps
other factors. One might want to prefetch into a cache level with
setting of a "Keep Me" bit to choose another victim the first time
that block is chosen as a victim (or a probabalistic ignoring of
the replacement choice, which might then want the probability).
A cache is either an invisible companion that reduces latency of
memory accesses, or it is a programmable store unit which requires
direct access to tag storage. You cannot have it both ways.
("Keep Me" could be useful for prefetches that may be excessively
early but are almost always used "soonish".)
Keep Me requires access to the LRU or NRU control bits in the tag.
Similarly, a cache
might have metadata urging earlier eviction (set 'LRU' for LRU and
tree pLRU, set unaccessed for bit-vector NRU); this might also
involve more than one state as possible choices (e.g., NRU+LRU
pair might set unaccessed and LRU or unaccessed and MRU).
99.44% of programmers could not use these features. Of the 0.56%
who can, 80% of them work in OS Kernels.
Metadata/placement may also be complicated on a cache hit. If a
prefetch targets L2 but hits L1, should it be treated as a nop?
The only metadata I ever allows configuration access to in the cache was
to lock lines out of being available for allocation;
(bad bit in the line or tag). So, a 4-way set associative cache
with a bad bit goes to (2^n-1)-lines of 4-was SA and 1 line of 3-way SA.
The dead line does not participate in allocation and
cannot hit.
What if L2 is shared by multiple cores and the block is dirty or
L2 does not include L1s? What if the L1 state was "next victim"?
Early victimizing _might_ make sense, but changing to "last
victim" might also make sense?
If L2 does not filter SNOOPs the L1s are SNOOPed.
When a line is reallocated
a) modified data migrates outward
b) new data replaces old data
One might also want a prefetch that prepares a hardware prefetch
engine. Such a software prefecth might want metadata about the
type of hardware prefetching (unit stride/non-unit stride/address
is pointer/generic hardware prefetch).
Your advocation for the use of prefetch is admirable. Truly !
The problem with prefetch is that SW cannot use it effectively
and efficiently, and in ways that preserve code over multiple generations--one end might need to prefetch only 100 cycles
in front of use whereas the other end might need to prefetch
several thousand cycles in front of use.
A second problem is that prefetch is NECESSARILY specified as
a HINT because some other piece of code (yours or not) may
touch the memory hierarchy and the prefetch you so carefully
performed goes away before you come for the actual data.
SW must know about latency model it is using prefetch
to ameliorate !! But which one ??
On 12/28/23 2:10 PM, EricP wrote:
EricP wrote:[snip extended LL/SC design]
Paul A. Clayton wrote:
One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.
Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.
On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
The interrupt would cancel the lock, and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
and so would erroneously update memory, whereas a SCH would be tossed
because the lock was no longer set.
I had not considered this consequence of an interrupt, but this
issue would be "trivially" handled by distinguishing between
reservation active and reservation valid (i.e., use two bits with
one preserved across interrupts). The hardware would then know
that it is within a ll/sc enclosed code sequence and stores would
merely be buffered as usual until the sc was encountered. Advanced
hardware for such a primitive interface, knowing that stores would
not be preserved could elide them.
For a single cache block reservation, having to run through the
rest of a known-failed atomic attempt may not be problematic. With
broader uses, that would be painful; any failure would have to run
through the rest of the attempt before being able to consider
retrying.
If the state on a failure is architected as boundedly
undefined, a simple scan for the sc (no execution to produce
identical final state) would suffice. It might be possible to add
this imprecise state aspect when allowing multiple accesses so
hardware could safely just scan after the second non-ll/sc memory
access.
A microarchitecture could theoretically cache sc locations similar to a branch target buffer, but memoizing such seems difficult and a
prediction would have to be confirmed (e.g., by scanning for the next sc while speculating that the predicted location was correct). That seems needlessly complex.
It is not terribly difficult to imagine ways of addressing this
deficiency. One might be able to get away with reinterpreting a
branch instruction as setting a failure jump point, avoiding
adding instructions. Such a branch would have to use a nonsensical
condition like "if not equal to self" (assuming no one has used
that as a nop before the architecture decided to reserve it).
Short attempts would not include the branch and just run through
to find the sc; simple attempts that are somehow known not to have
only transient failure causes might target the ll instruction
producing a hard retry loop. One could also consider each register
tested for non-equality to self as providing additional control
information.
Given that opcode space is usually not so tight as to prevent a
secondary atomic interface, but one would really want to have the second
be comprehensive to avoid a third, fourth, fifth, etc. Extending ll/sc semantics — without adding instructions — might have allowed experience/feedback to accumulate for a second try.
(I do believe that one could substantially extend the semantics
without having to add instructions. When starting from a clean
slate is better idea seems usually to be realized late. "Free"
extensions also discourages trying again rather than trying to
fix an interface that mostly sort-of works.)
Given that none of the RISCs have actually extended ll/sc
semantics, the pressure to provide broader atomics must not have
been particularly great.
ASF did better in checkpointing the start such that a failure
jumped to a known location that performed a test for
success/failure and branched to handling code. ASF also preserved
the stack pointer; this does not seem a substantial benefit as
short attempts could probably avoid changing the stack pointer and
the relative overhead of explicitly saving the stack pointer is
not that great if the attempt is large.
Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.
Removing reservations avoids some capacity constraints, though
ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.
Paul A. Clayton wrote:
On 12/28/23 2:10 PM, EricP wrote:
Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.
RELEASE was after I had left AMD.
ASF (under my tutelage) and ESM were ATOMIC events not restricted Transactional Memory things.
An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.
An instruction like RELEASE violates the definition and, in essence,
takes the model from ATOMIC towards Transactional by one tiny but
important step--a step I was not willing to take--mainly because I could
not come up with a reasonable definition of ATOMIC if it had that semantic.
ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.
Removing reservations avoids some capacity constraints, though
Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.
I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.
ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.
Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.
Yes, it could be "if you build it they will come" or
"if you build it they will stare at it with curiosity
then go back to doing things the way they have always done".
MitchAlsup wrote:
Paul A. Clayton wrote:
On 12/28/23 2:10 PM, EricP wrote:
Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.
RELEASE was after I had left AMD.
ASF (under my tutelage) and ESM were ATOMIC events not restricted
Transactional Memory things.
An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.
I don't follow your distinction between transaction and multi-line atomic. Both must monitor variables in multiple cache lines as read-only and
others as read-write with all writes made globally visible at once.
An instruction like RELEASE violates the definition and, in essence,
takes the model from ATOMIC towards Transactional by one tiny but
important step--a step I was not willing to take--mainly because I could
not come up with a reasonable definition of ATOMIC if it had that semantic.
I don't follow your reasons here either.
I see RELEASE as being used so an atomic sequence can add an object
to its member sets with read-share or write-exclusive protection,
examine it, decide its not relevant, and release it so others can
update it without contention with you.
ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.
Removing reservations avoids some capacity constraints, though
Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.
I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.
ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.
Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.
I think I understand. You see "atomic" in a purist sense, as one thing
done one way, whereas "transaction" is more dynamic and expansive.
This also depends on how RELEASE works - whether it can toss pending
stores out of a transaction or only releases read-only monitors.
I hadn't decided because it didn't seem that much of an issue
and just a question of where some IF statements are located.
For hardware implementation its just a matter of whether one allows applications to reset the "changed" bits on pending byte buffers.
I currently see no technical reason to disallow it.
So I would allow RELEASE to toss pending stores after they were done.
EricP wrote:
Yes, it could be "if you build it they will come" or
The following is the important clause::
"if you build it they will stare at it with curiosity then go back to
doing things the way they have always done".
On 12/31/23 2:33 PM, MitchAlsup wrote:
Paul A. Clayton wrote:[snip]
Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.
RELEASE was after I had left AMD.
ASF (under my tutelage) and ESM were ATOMIC events not restricted
Transactional Memory things.
An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.
I thought you had written that you wanted escaping stores to
facilitate debugging. I guess I misremembered.
An instruction like RELEASE violates the definition and, in
essence, takes the model from ATOMIC towards Transactional by one
tiny but important step--a step I was not willing to take--mainly
because I could not come up with a reasonable definition of ATOMIC
if it had that semantic.
If the data is thread local, as far as other threads are concerned
that data does not necessarily even exist.
Breaking the transactional nature is also not _necessarily_ a bad
choice. I do not know of any, but that is not proof of non-
existence.
ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.
I agree with you that hardware cannot really provide generic
transactional memory (composable, "unbounded" capacity, etc.).
My disagreement comes from perceiving hardware as being better
equipped to monitor conflicts. Larger read sets seem especially
inexpensive for hardware (especially if one can accept the
limitations of a conservative filter). I do see such expansions
of capacity as less beneficial. (I agree with 'get simple atomics
utilized first and well-understood before guaranteeing useless
features that will have to be supported for twenty years'.)
Removing reservations avoids some capacity constraints, though
Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.
I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.
Probably mostly true. More expansive uses are also likely to be
less common (both application count and use per run); that is just
the general nature of things.
Even so, I think providing cheap expansion can facilitate
experimentation. Expanding the read set seems really cheap.
However, I do not know how one can present a feature as
"experimental" such that one is not committing to it
architecturally. _Theoretically_ the hardware transactional memory
uses are suppose to have a fall back path, so they are hint-like,
but people tend not to read the fine print.
ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.
Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.
Violating atomicity may not violate simple orderability. Sometimes
software might be willing to sacrifice true atomicity for lower
chance of "atomic" event failure. This risks the ease of
understanding the interface and one probably does not want to
supply software developers with atomic foot guns.
(I think the use case for RELEASE was linked lists. One might have
to traverse more than a few nodes but one might not care if
another thread inserted/deleted a node well behind or well ahead
of where this thread is. There are probably other possible uses.)
On 12/31/23 2:33 PM, MitchAlsup wrote:
Paul A. Clayton wrote:[snip]
I had not considered this consequence of an interrupt, but this
issue would be "trivially" handled by distinguishing between
reservation active and reservation valid (i.e., use two bits with
one preserved across interrupts). The hardware would then know
that it is within a ll/sc enclosed code sequence and stores would
merely be buffered as usual until the sc was encountered. Advanced
hardware for such a primitive interface, knowing that stores would
not be preserved could elide them.
If taken the way I think Paul intends, that means the read and
write sets are part of thread saved state. It is nuances like
exceptions and interrupts; why My 66000 abandons the event if
it has to perform more privileged control transfer.
No. The fact that the thread is within an atomic event is part of
thread state (I assume ll/sc implementations make atomic even
success part of hardware thread/core state). Knowing that the
thread is in failed atomic event state, stores can be guarded such
that they do not manifest.
The concern expressed was that ordinary store instructions would
leak when resuming from an interrupt.
For a single cache block reservation, having to run through the
rest of a known-failed atomic attempt may not be problematic. With
broader uses, that would be painful; any failure would have to run
through the rest of the attempt before being able to consider
retrying. If the state on a failure is architected as boundedly
undefined, a simple scan for the sc (no execution to produce
identical final state) would suffice. It might be possible to add
this imprecise state aspect when allowing multiple accesses so
hardware could safely just scan after the second non-ll/sc memory
access.
A microarchitecture could theoretically cache sc locations
similar to a branch target buffer, but memoizing such seems
difficult and a prediction would have to be confirmed (e.g., by
scanning for the next sc while speculating that the predicted
location was correct). That seems needlessly complex.
When one considers all of the effects of {memory order, branch
prediction, exceptions, interrupts, machine checks,...} it is
better for the complexity management of a project* to abandon an
ATOMIC event when "control gets hard". I have seen and been in
the debates of "how do we get out of this mess" often enough not
to want to repeat it again.
Do you still have the T-shirt?☺
Similar considerations apply to cache coherence (or really any
complex functionality). Getting it right in a basic design seems
to be hard enough.
I presented workarounds even when i thought them "needlessly
complex".
(*) and the possibly more important:: "ability of SW to reason
about what happened and why";
This is another factor. Getting the hardware right is not enough
if getting the software is likely to get it wrong. Comprehensible
interfaces, error/bug detection, and debugging facilities are
obviously important.
Since I have no experience in hardware or software development or collaboration (but some knowledge), I have only limited ability to
evaluate proposals on such considerations. I am aware of some of
the dangers and some of the mitigations, but I would object to
being made responsible for any significant project. (I do like to
think I could contribute as something like an intern, but the
difference between intern and junior (much less head) computer
architect is perhaps somewhat more than minuscule.☺)
[snip legacy sandbagging atomic development]
ASF did better in checkpointing the start such that a failure
jumped to a known location
yes
that performed a test for
success/failure and branched to handling code.
not necessarily. If you arrived you knew the event had failed.
ASF provided a mechanism whereby you could refine your understanding
of the failure and branching based on that.
I thought the branch instruction was right after the transaction
beginning instruction and was not necessarily skipped but just
never triggered because EAX was set to zero and the flags updated.
ASF also preserved
the stack pointer; this does not seem a substantial benefit as
short attempts could probably avoid changing the stack pointer and
the relative overhead of explicitly saving the stack pointer is
not that great if the attempt is large.
The SP thing had to do with PUSH/POP model in x86 and automagic
control transfer.
Okay. I thought such might have been the case.
Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.
RELEASE was after I had left AMD.
ASF (under my tutelage) and ESM were ATOMIC events not restricted
Transactional Memory things.
An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.
I thought you had written that you wanted escaping stores to
facilitate debugging. I guess I misremembered.
An instruction like RELEASE violates the definition and, in
essence, takes the model from ATOMIC towards Transactional by one
tiny but important step--a step I was not willing to take--mainly
because I could not come up with a reasonable definition of ATOMIC
if it had that semantic.
If the data is thread local, as far as other threads are concerned
that data does not necessarily even exist.
Breaking the transactional nature is also not _necessarily_ a bad
choice. I do not know of any, but that is not proof of non-
existence.
ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.
I agree with you that hardware cannot really provide generic
transactional memory (composable, "unbounded" capacity, etc.).
My disagreement comes from perceiving hardware as being better
equipped to monitor conflicts. Larger read sets seem especially
inexpensive for hardware (especially if one can accept the
limitations of a conservative filter). I do see such expansions
of capacity as less beneficial. (I agree with 'get simple atomics
utilized first and well-understood before guaranteeing useless
features that will have to be supported for twenty years'.)
Removing reservations avoids some capacity constraints, though
Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.
I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.
Probably mostly true. More expansive uses are also likely to be
less common (both application count and use per run); that is just
the general nature of things.
Even so, I think providing cheap expansion can facilitate
experimentation. Expanding the read set seems really cheap.
However, I do not know how one can present a feature as
"experimental" such that one is not committing to it
architecturally. _Theoretically_ the hardware transactional memory
uses are suppose to have a fall back path, so they are hint-like,
but people tend not to read the fine print.
ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.
Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.
Violating atomicity may not violate simple orderability. Sometimes
software might be willing to sacrifice true atomicity for lower
chance of "atomic" event failure. This risks the ease of
understanding the interface and one probably does not want to
supply software developers with atomic foot guns.
(I think the use case for RELEASE was linked lists. One might have
to traverse more than a few nodes but one might not care if
another thread inserted/deleted a node well behind or well ahead
of where this thread is. There are probably other possible uses.)
On 12/31/23 3:40 PM, MitchAlsup wrote:> Paul A. Clayton wrote:
[snip]On 12/25/23 7:12 PM, MitchAlsup wrote:
Expecting ATOMIC behavior across cache line boundaries hasNEVER been
guaranteed. So SW that deals with ATOMIC stuff needs to understand
this {along with false sharing,...} and take appropriate actions
at the language level and malloc()/new level.
This may be a further variance from transactional memory-like
interface. If I understand your statement, software is expected to
take special alignment precautions for objects used within ESM.
(This would not affect any atomic operation involving 4 or fewer
64-byte chunks — 8 aligned chunks handles all possible alignments.
I seem to recall only 6 were used in the _largest_ known case, so
I would guess most atomic operations would not be concerned about
alignment. Even the operations with more access regions are not
likely to encounter an alignment issue. I think generic allocators
tend to assume 16-byte alignment on 64-bit platforms, which should
reduce the likelihood of cache block crossing. Many hot object
members probably tend to be in the first bytes, further reducing the probability. Members used with temporal locality are also more likely
to have spatial locality even at the short temporal locality of an
atomic operation and sub-cache-block spatial locality.)
Of course, high-effort software will naturally make such
optimizations. Until compilers are trained to use ESM for merely lock-guarded operations or code sections are labeled atomic, those
using larger atomic operations will probably be almost exclusively high-effort software.
Paul A. Clayton wrote:If alignment-caused "capacity"
failures are acceptable, why are address-based associativity
conflicts so extremely problematic?
False sharing make thinks look like interference and harms
performance at the system level by performing too many bus
transfers.
Yes. However, aligning every object by 64-bytes (or assuring that
the interesting bits are in a 64-byte aligned chunk) seems
impractical if not nearly as bad as the early Alpha suggestion that
locks not share a page as ll/sc granularity was only
architecturally guaranteed at that coarseness (if I remember
correctly).
[snip]
I was not considering exceptions, but I guess there are other
failure modes beyond capacity and contention.
Consider you are several checkpoints down a path, and have several outstanding cache misses and an interrupt transpires. What are you
(the CPU) allowed to do with the returning cache lines since those
memory reference instructions did not survive to retirement ??
a) If you put them in the cache you have a Spectré side channel.
Which *I* do not consider so horrible.
Since avoiding such
speculation side channels is not very expensive (e.g., providing
buffering of speculative information until it is non-speculative)
and being free of those side channels has value, I do not consider
avoiding the side channels an unworthy effort.
Producing/infecting a game, productivity tool, or other desirable
software with malware seems easier, faster, and more powerful
(though possibly less traceable). With so many websites requiring
Javascript and Javascript sandboxing not being perfect, I doubt
one needs Spectré to get information. Plugging a hardware security
hole (which is very difficult to work around) does mean that those
wishing to sacrifice some software and web browsing options can be
more secure.
b) if you do not, you need to migrate them somewhere Ownership does
not get misunderstood.
Unless the context switch is very brief, an atomic attempt is likely
toast anyway (and retrying is probably not as big a deal if the
thread has already had to wait to resume). I was not suggesting that interrupts not be fatal.
There may be cases where interrupts not failing a transaction is
justified.
If an ESM operation completed its set-up and only used
4 miss buffer entries, keeping them as unsaved/unfreed state
(similar to FP/SIMD register files have sometimes been lazily saved
by software) might allow a quick interrupt handler to run to
completion and resume the earlier thread before the atomic fails.
(I suspect one would want to fail the atomic rather than use NAKs
if there is contention at least after one NAK was sent — unless
there was somehow a way to know that the interrupt handler is just
about done.)
Side question: for your expected implementations, would a fully set-
up transaction send ordinary NAKs for a cache block that was a miss
to main memory? The cache coherence network can be slow but main
memory is still slower.
I wonder if this would be a case where limited multithreading could
be useful. A quick/small transaction could complete _while_ the
interrupt handling is done. (I do realize that sharing resources
enables side channels.)
[snip]
No, I expect that in 2 decades there will be enough experiencethat did
with ESM that capacity extension would be desired. Here code
work on early implementations continues to work, but more codes
can now be performed with the new capacity limits.
I expect that ESM will not be implemented on enough systems to
provide that experience in just 2 decades. ASF — which was
actually presented as a candidate extension by a significant
processor vendor — was never implemented. Intel's TSX has had
correctness issues and security issues and did not seem to work
that well even for ESM-scale transactions. Power and z/Arch
transactional memory implementations are not going to get broad
use. ARM seems to be seeing the issues and waiting before
providing a hardware transactional memory implementation.
The larger potential capacity implementations do have the
disadvantage of not encouraging experimentation with small-scale
atomics, especially with their ability to nest (which is a feature
even I would probably exclude from early implementations).
I do not see any large corporation deciding to back My 66000 the
way Amazon, Google, and Apple have made use of ARM computers
available to millions (especially since My 66000 is a better ISA
and not tied to another corporation). This may be largely my
depression talking. Right now I do not think I want to be alive in
twenty years with the direction my health, employment, and other
personal aspects have been going [getting old etc.] and the
direction of the world and particularly of U.S. politics.
MitchAlsup wrote:
Paul A. Clayton wrote:
One might also want a prefetch that prepares a hardware prefetch
engine. Such a software prefecth might want metadata about the
type of hardware prefetching (unit stride/non-unit stride/address
is pointer/generic hardware prefetch).
Your advocation for the use of prefetch is admirable. Truly !
The problem with prefetch is that SW cannot use it effectively
and efficiently, and in ways that preserve code over multiple
generations--one end might need to prefetch only 100 cycles
in front of use whereas the other end might need to prefetch
several thousand cycles in front of use.
A second problem is that prefetch is NECESSARILY specified as
a HINT because some other piece of code (yours or not) may
touch the memory hierarchy and the prefetch you so carefully
performed goes away before you come for the actual data.
I am one of those programmers who read about PREFETCH being added to x86
and immediately thought "Yes, now I can make my code faster!", then
almost immediately (after getting one of the new CPUs) hit the tiny
little snag of "Oh, it does not seem to work!?!".
I.e. I have written code where actually loading a byte (forcing a fetch,
not just a hint) from every cache line in a 4KB block will speed it up,
but almost never seen any gains from adding a PREFETCH (n) cache lines
in front of the current operating address.
SW must know about latency model it is using prefetch
to ameliorate !! But which one ??
It is already pretty complicated without that added issue.
Terje
On 1/1/2024 11:06 PM, Terje Mathisen wrote:
MitchAlsup wrote:
Paul A. Clayton wrote:
SW must know about latency model it is using prefetch
to ameliorate !! But which one ??
It is already pretty complicated without that added issue.
Terje
Think of iterating a linked list of adjacent cache lines. A prefetch of
the next cache line would be in order, right?
Chris M. Thomasson wrote:
On 12/23/2023 6:37 AM, John Dallman wrote:
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
Which compiler was this?
Microsoft. Our primary demand for Itanium, when things were starting up
for it, was Windows, because there was no 64-bit Windows at the time and >>> some large customers really wanted that. That demand evaporated like dry >>> ice in Death Valley when Intel announced their x86-64.
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can provide the primitives such that SW can invent and use whatever primitive they can figure out how to program and how to use. {{There is no 3rd
choice}}
On 12/23/2023 1:44 PM, MitchAlsup wrote:
Chris M. Thomasson wrote:
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can
provide the primitives such that SW can invent and use whatever primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}
Perhaps. While I certainly agree that you can do pretty much every reasonable "primitive" using ESM, there is a cost to doing that. Let's
say you want an atomic memory increment and update. With ESM, I think
this would be three instructions, whereas in implementations that
support it directly, it is one instruction (although the one instruction would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.
I don't have a feel for which of the existing instructions are the most
used, but a perhaps "best of both worlds" implementation would provide
for the most used as single instructions (only where they would be
better performing than ESM), and provide ESM for whatever complex
mechanism someone might want. Then, in the future, only if some new mechanism proves out and becomes popular, a future version could provide
that mechanism as a more performant single instruction in a totally
upward compatible way.
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can
provide the primitives such that SW can invent and use whatever primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}
Perhaps. While I certainly agree that you can do pretty much every reasonable "primitive" using ESM, there is a cost to doing that. Let's
say you want an atomic memory increment and update. With ESM, I think
this would be three instructions, whereas in implementations that
support it directly, it is one instruction (although the one instruction would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.
I don't have a feel for which of the existing instructions are the most
used, but a perhaps "best of both worlds" implementation would provide
for the most used as single instructions (only where they would be
better performing than ESM), and provide ESM for whatever complex
mechanism someone might want. Then, in the future, only if some new mechanism proves out and becomes popular, a future version could provide
that mechanism as a more performant single instruction in a totally
upward compatible way.
Stephen Fuld wrote:
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over >>> on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one
can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}
Perhaps. While I certainly agree that you can do pretty much every
reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM, I
think this would be three instructions, whereas in implementations
that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by ESM).
So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.
What prevents the core from recognizing::
LDL R7,[some address]
ADD R8,R7,R3
STL R8,[some address] // last use of R8
and converting it into::
ADD_to_Memory R7,[some address],R3
??
I don't have a feel for which of the existing instructions are the
most used, but a perhaps "best of both worlds" implementation would
provide for the most used as single instructions (only where they
would be better performing than ESM), and provide ESM for whatever
complex mechanism someone might want. Then, in the future, only if
some new mechanism proves out and becomes popular, a future version
could provide that mechanism as a more performant single instruction
in a totally upward compatible way.
ASF was developed as a means to avoid having to invent new ATOMIC instructions every processor generation, and was developed with
a keen eye on how microcode (or other sequencer) would manipulate
current hardware (Opteron) to provide the illusion of atomicity.
The goal of ASF was to allow SW to implement all of the (then) known
ATOMIC primitives without needing new instructions.
ESM is ASF expressed in RISC ISA with solutions to several issues
that could not be solved under x86-64 ISA and in particular x86
conditional branches. For example:: the Branch on the condition of interference instruction (BIN) samples the interference monitor
and branches on the current value, whereas x86 would have to access
the interference bit and then branch conditionally on it, causing
already stale information used in branching.
On 1/3/2024 11:05 AM, MitchAlsup wrote:
Stephen Fuld wrote:
Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?
Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over >>>> on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one
can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}
Perhaps. While I certainly agree that you can do pretty much every
reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM, I
think this would be three instructions, whereas in implementations
that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by ESM).
So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.
What prevents the core from recognizing::
LDL R7,[some address]
ADD R8,R7,R3
STL R8,[some address] // last use of R8
and converting it into::
ADD_to_Memory R7,[some address],R3
??
I don't know. If the core can do that then great. However, can the
core distinguish between that sequence that should be atomic, and an
ordinary (i.e. non atomic) memory update?
I suppose the atomic version
means some extra things going on (bus lock, collision detection, etc.)
that the ordinary version doesn't need. That cost, would be avoided in
my proposal. And it still requires three times the memory footprint.
On 1/2/24 3:49 PM, MitchAlsup wrote:
Paul A. Clayton wrote:[snip]
No. The fact that the thread is within an atomic event is part
of thread state (I assume ll/sc implementations make atomic even
success part of hardware thread/core state). Knowing that the
thread is in failed atomic event state, stores can be guarded
such that they do not manifest.
The transfer of control to the ATOMIC control point prior to performing
the exception or interrupt control transfer means ATOMIC adds nothing
to thread state !!! You do not need to remember the participating
cache liens (or their state), nor the fact you are in an event, ...
My comments were in the context of extending ll/sc without adding instructions. While I think one could interpret a branch
instruction immediately after a ll as setting up such a control
point — merely reinterpreting an instruction in that specific
context not "adding" an instruction — the retaining of "in
transaction" state allows just executing until the sc is reached.
The concern expressed was that ordinary store instructions would
leak when resuming from an interrupt.
Can you illustrate this as a scenario ??
ll r4 ← [r2]
ld r3 ← [r2+4]
sub r5 ← r4 - r3
add r6 ← r3 + 12
<interrupt> // transaction_valid state set to "invalid"
// old thread tx_active saved as "active"
[interrupt handling code]
<resume thread>
st [r2+4] ← r6 // core discards this store since "active"
// and also "invalid"
sc [r2] ← r5 // normal behavior for sc when "invalid"
The saving of the extra bit (tx_active) is not an _application_
software compatibility issue, but it does seem to require the
interrupt handler to save this extra state. (Unlike My 66000,
the ll/sc using ISAs do not have hardware context management.)
Executing through from the resume point to the sc is not
efficient given that the transaction is known to fail, but
compatibility is maintained.
The point was to avoid EricP's proposed extra instructions to
indicate that memory operations were part of a transaction.
Stephen Fuld wrote:
On 1/3/2024 11:05 AM, MitchAlsup wrote:
My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or
one can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no
3rd choice}}
Perhaps. While I certainly agree that you can do pretty much every
reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM,
I think this would be three instructions, whereas in implementations
that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by
ESM). So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.
What prevents the core from recognizing::
LDL R7,[some address]
ADD R8,R7,R3
STL R8,[some address] // last use of R8 >>>
and converting it into::
ADD_to_Memory R7,[some address],R3
??
I don't know. If the core can do that then great. However, can the
core distinguish between that sequence that should be atomic, and an
ordinary (i.e. non atomic) memory update?
The memory update is already ATOMIC in that all interested parties see
only the before and only the after bit-pattern even without using a lock
bit on the "bus".
I suppose the atomic version
means some extra things going on (bus lock, collision detection, etc.)
that the ordinary version doesn't need. That cost, would be avoided
in my proposal. And it still requires three times the memory footprint.
On 1/3/2024 12:59 PM, MitchAlsup wrote:
Stephen Fuld wrote:
On 1/3/2024 11:05 AM, MitchAlsup wrote:
My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or
one can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no
3rd choice}}
Perhaps. While I certainly agree that you can do pretty much every >>>>> reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM, >>>>> I think this would be three instructions, whereas in implementations >>>>> that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by
ESM). So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.
What prevents the core from recognizing::
LDL R7,[some address]
ADD R8,R7,R3
STL R8,[some address] // last use of R8 >>>>
and converting it into::
ADD_to_Memory R7,[some address],R3
??
I don't know. If the core can do that then great. However, can the
core distinguish between that sequence that should be atomic, and an
ordinary (i.e. non atomic) memory update?
The memory update is already ATOMIC in that all interested parties see
only the before and only the after bit-pattern even without using a lock
bit on the "bus".
I suppose the atomic version
means some extra things going on (bus lock, collision detection, etc.)
that the ordinary version doesn't need. That cost, would be avoided
in my proposal. And it still requires three times the memory footprint.
OK, I think I misunderstood what you proposed about "coalescing" the instructions within the core. I missed that your code had the "L" on
the load and store instructions. So forget about confusing the sequence
with non-atomic updates.
So yes, the core could internally recognize the sequence and turn it
into what I suppose would take the same amount of time as the single instruction.
But I guess that would require at least as much circuitry
as executing the single atomic instruction. And, to actually gain the benefit, you have to document that the core is doing this for that
particular sequence. So ISTM the only extra cost of the atomic
instruction is the op code (which I know you want to minimize). And
you still have the fact of three instructions in the stream versus 1,
I-cache usage, etc.
Of course, this same argument would apply to other popular "primitives"
such as TS, or whatever. So, under this scenario, it comes down the equivalent core space, the cost of an extra op-code, versus the benefit
of a smaller program. Not too major either way. But still an
alternative to the two you mentioned above.
On 12/22/2023 9:21 AM, Anton Ertl wrote:
SC radically reduces performance on existing archs for sure. Think of iterating a linked list. Wrt RCU, the readers can iterate the list without using any memory barriers (DEC Alpha aside for a moment). This is FARSomebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak
consistencies do not.
Chris M. Thomasson [2023-12-22 11:37:03] wrote:
On 12/22/2023 9:21 AM, Anton Ertl wrote:
SC radically reduces performance on existing archs for sure. Think ofSomebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak
consistencies do not.
iterating a linked list. Wrt RCU, the readers can iterate the list without >> using any memory barriers (DEC Alpha aside for a moment). This is FAR
Anton is talking about SC provided directly by the ISA (e.g. as on the
MIPS architecture), in which case there are no barriers at all.
IOW, the cost is moved to the CPU's OoO engine.
Stefan
Can a C++ program use all std::memory_order_relaxed barriers for its
atomic sync algorithms and still be correct on MIPS?
Since small-width in-order processors are invariable SC, it is always
in the OoO engine.
| Sysop: | Keyop |
|---|---|
| Location: | Huddersfield, West Yorkshire, UK |
| Users: | 715 |
| Nodes: | 16 (2 / 14) |
| Uptime: | 09:29:20 |
| Calls: | 12,100 |
| Files: | 15,003 |
| Messages: | 6,517,968 |