Forum: >>> Magnum BBS <<<

Superior architecture styles? (was: Concertina II Progress)

From Anton Ertl@21:1/5 to Paul A. Clayton on Tue Dec 19 08:49:18 2023

"Paul A. Clayton" <[email protected]> writes:

On 12/6/23 2:54 AM, Anton Ertl wrote:

[email protected] (Scott Lurndal) writes:

[snip]

One of my professors back in the late 70's was researching
data flow architectures. Perhaps it's time to reconsider the
unit of compute using single instructions, instead providing a
set of hardware 'functions' than can be used in a data flow environment.

We already have data-flow microarchitectures since the mid-1990s, with
the success of OoO execution. And the "von Neumann" ISAs have proven
to be a good and long-term stable interface between software and these
data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
and their microarchitectures turned out to be uncompetetive.

I suspect that a superior interface could be designed which
exploits diverse locality (i.e., data might naturally be closer to
some computational resources than to others) and communication
(and storage) costs and budgets (budgets being related to urgency
and importance).

There have been a number of attempts to design architectures with more
explicit hardware-oriented features, as well as attempts to design architectures with more software-oriented features.

On the hardware-oriented side we have seen:

* Explicit fast/small memories that leaves the problem of locality to
software rather than adding the hardware complexity of cache tags
and letting the hardware guess what memory will be accessed in the
future. The only such memories that have succeeded in the long term
are registers; my explanation for the latter is that registers are
useful for scalar variables in programs (and vector/SIMD registers
for temporary storage of pieces of aggregated data).

My explanation for the lack of success of explicit local memory is
that it is too hard for software to manage; dealing with it would be
a cross-cutting concern that would make the software significantly
more complex. Better write software as if there was no cache or
local memory, and let hardware attempt to get good performance out
of it.

* Software prefetch instructions. Ok, we have a cache instead of
explicit local memories, but can't we let software improve the
cache's performace with prefetch hints? I have read some reports
that the experience is that adding such instructions showed
disappointing speedups, and slowdowns also happened. While many
architectures have this feature, it seems to be used little.
Hardware prefetchers seem to do ok, though.

* Distributed memory. This avoids all the problems of cache
consistency, but it requires the programmer to avoid all sharing
(including read-only data and code), and, most importantly, to fit
within its slice of machine. This is acceptable for supercomputing,
but does not fly in general-purpose computing. And even in game
computin, which one might consider to be more supercomputer-like,
the Cell architecture was abandoned in the next generation.

* Weak memory models. Ok, we (hardware designers) have to do cache
consistency, but let's do a shoddy job and shift the problems over
to the software people whom we task with inserting architecturally
meaningless barriers at select places. The great thing about this
is that we can always blame the software people: if the performance
sucks, they have inserted too many barriers; if the program breaks,
they have inserted too few; and if the performance sucks while the
program breaks, well, we can blame them for both, or just shrug and
say that the problem is hard. In any case, the blame is deflected
from us.

One can argue that this particular architectural misfeature has
succeeded: After all, in shared-memory multiprocessors you can buy
today, sequential consistency is nonexistent, and weaker models such
as TSO or even weaker ones dominate.

OTOH, very few programmers program to these weak models. They write
sequential programs, or use libraries/system calls that hide the
horrors of weak consistency behind easier-to-understand but much
slower abstractions.

Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.

They also (as I
understand) lacked value prediction whereas OoO effectively uses
value prediction for branches.

Branch prediction is branch prediction, value prediction predicts the
values of registers or memory locations; I have seen research about
value prediction maybe 20 years ago, but I am not aware that it has
been productized. IIRC when I asked someone who presented a paper on
value prediction why there should be a significant predictability for
values, the answer was that the values would come from executing along
a mispredicted path: E.g., when an IF is mispredicted, the values of
the code after the ENDIF might still compute to the same values.
However, given the rareness of mispredictions these days, this source
of value prediction probably provides too little benefit to justify
the cost (not to mention Spectre considerations, which came in later).

While the current interface does allow for significant improvement
exploiting common behaviors and dynamic microarchitectural
information, it does not seem (to me) an ideal interface.

I think it is quite remarkable that a sequence of instructions, with
branches and RAM and registers, is what we already have in S/360 (59
years ago) and (I think) in the IBM 704 (69 years ago); I am not
familiar enough with earlier machines or the analytical engine to
comment on them, but this style might be even older. And while the
hardware capabilities have exploded since that time, and performance characteristics have changed, this style of interface is still doing
well. Even specific instruction sets are doing well: AMD64 with it's
legacy going back to 8086 (45 years ago) and the Datapoint 2200 (53
years ago) competes well with recent architectures such as ARM A64 and
RISC-V. S390x goes back to S/360 (59 years ago); I don't know how
fast it's current implementation is on programs I care about, but I
expect that in an alternative reality where the PC market was based on
S390x, the offerings of Intel and AMD (or whoever would take their
place) would offer similar performance on software I care about as
their AMD64 offerings offer in this reality.

And of course the recent ARM A64 and RISC-V also follow the style
outlined above. RISC-V is even very similar to MIPS (37 years ago),
despite the R2000 having only ~100,000 transistors, and current CPUs
having billions.

I doubt an alternative interface will be adopted any time soon.
Even with a better understanding than in the 1970s, the problem is
not simple and early developments would not be competitive even if
all the R&D NRE was ignored.

There have been repeated attempts at alternative interfaces, from the
data-flow stuff you mentioned to IA-64, which saw a huge investment
and also had a huge mindshare everywhere. Academia has pursued
alternative programming language concepts (e.g., lazy functional
programming) for many decades, but attempts to cater to alternative
programming languages in architectures have not been successful for
long.

Despite all this work on alternatives ARM A64 and RISC-V are
interfaces that, for the most part work like an IBM 704 or a Commodore
64: RAM actually allows random access (the architecture does not just
not expose caches and hardware prefetchers, but not DRAM bursts, open
DRAM pages and somesuch, either (well, the load/store-pair
instructions of A64 can be attributed to catering to cache line
sizes), and that RAM belongs to the program (virtual memory is usually transparent to user-level programs).

Anyway, if an alternative interface, closer to what you consider ideal
(what would that be?) has not succeeded, it's not for lack of trying.
And it's also not just because of the network effects: There once was
the 36-bit word-addressed de-facto standard (IBM 704, PDP-6,
Burroughs, and others), but that did not prevent the byte-addressed
32-bit S/360 from succeeding.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Tue Dec 19 16:31:00 2023

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

* Explicit fast/small memories that leaves the problem of locality
to software rather than adding the hardware complexity of cache tags

My explanation for the lack of success of explicit local memory is
that it is too hard for software to manage; dealing with it would
be a cross-cutting concern that would make the software significantly
more complex.

Absolutely. An algorithm that takes a different form if its data is more
than a specific size is effectively two algorithms, with a transition
case. It is more complex to write and to test, and is pointless
complexity on an architecture that does not have the feature. If you
leave out the complexity on those architectures, you now have three
algorithms. You have to run all their tests and measure their execution
times on each architecture to prove that you really have the right code
on each machine.

* Software prefetch instructions. Ok, we have a cache instead of
explicit local memories, but can't we let software improve the
cache's performace with prefetch hints? I have read some reports
that the experience is that adding such instructions showed
disappointing speedups, and slowdowns also happened. While many
architectures have this feature, it seems to be used little.

I'm reasonably confident that the increases in code size, and the correspondingly increased I-cache misses, were what wrecked x86-32
prefetching in my case.

An architecture that had prefetch hints designed in from the beginning
might be more efficient with them, as long as it didn't have IA-64's dumb implementation. That produced memory misreads as an architectural feature, while ensuring that they usually didn't happen when running under a
debugger.

... why there should be a significant predictability for values,
the answer was that the values would come from executing along
a mispredicted path: E.g., when an IF is mispredicted, the values of
the code after the ENDIF might still compute to the same values.

Only if your programmers write the same code twice. The point of those
IFs is surely to avoid redundant calculation? Programmers are fairly good
at minimising computation, in terms of expression evaluation. They're
much less good at minimising branching.

I think it is quite remarkable that a sequence of instructions, with
branches and RAM and registers, is what we already have in S/360 (59
years ago) and (I think) in the IBM 704 (69 years ago); I am not
familiar enough with earlier machines or the analytical engine to
comment on them, but this style might be even older.

It has the great advantage of combining comprehensibility and generality.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Dallman on Tue Dec 19 17:08:34 2023

[email protected] (John Dallman) writes:

In article <[email protected]>, >[email protected] (Anton Ertl) wrote:

* Software prefetch instructions. Ok, we have a cache instead of
explicit local memories, but can't we let software improve the
cache's performace with prefetch hints? I have read some reports
that the experience is that adding such instructions showed
disappointing speedups, and slowdowns also happened. While many
architectures have this feature, it seems to be used little.

I'm reasonably confident that the increases in code size, and the >correspondingly increased I-cache misses, were what wrecked x86-32 >prefetching in my case.

I have read explanations along these lines, or that the prefetch hints
consumed instruction decode/retire/LSU resources that caused useful instructions to be delayed. I find both of these explanations
surprising, because I would expect that, if a program spends
significant time on D-cache misses (so that prefetches could help),
reducing that would far outweigh any I-cache or resource contention
effects.

An architecture that had prefetch hints designed in from the beginning
might be more efficient with them, as long as it didn't have IA-64's dumb >implementation. That produced memory misreads as an architectural feature, >while ensuring that they usually didn't happen when running under a
debugger.

Do you mean the speculative loads that would return NaTs if the memory
access is invalid? If so, why does that not happen when running under
a debugger.

... why there should be a significant predictability for values,
the answer was that the values would come from executing along
a mispredicted path: E.g., when an IF is mispredicted, the values of
the code after the ENDIF might still compute to the same values.

Only if your programmers write the same code twice. The point of those
IFs is surely to avoid redundant calculation? Programmers are fairly good
at minimising computation, in terms of expression evaluation. They're
much less good at minimising branching.

Maybe I have not made the point clear enough: Consider:

if (...) {
...; /* then clause */
} else {
...; /* else clause */
}
x = ...;
y = a[x];

Now consider that the then clause is executed speculatively, until
after the assignment to y, and that this speculation is wrong. On misprediction recovery the else clause is executed, and the assignment
to x and y is executed again. If x and y do not depend on values
computed in the if-statement, the values will be the same as in the misspeculated execution, and value prediction would forward these
values from the misspeculated to the architectural execution. I guess
there is some history recording involved about which instructions
produce predictable values and which don't.

Consider if the speculation has run for 200 instructions behind the mispredicted branch. Then value prediction could produce the results
of dozens of instructions right away, promoting many others to
readyness. Of course these instructions still need to be executed, to
verify the predicted values. Apparently these benefits are not worth
the costs, though, even without considering Spectre (if you don't want
Spectre, you cannot do that).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Anton Ertl on Tue Dec 19 18:27:17 2023

Anton Ertl wrote:

There have been a number of attempts to design architectures with more explicit hardware-oriented features, as well as attempts to design architectures with more software-oriented features.

On the hardware-oriented side we have seen:

* Explicit fast/small memories

Cray 2 B registers

My explanation for the lack of success of explicit local memory is
that it is too hard for software to manage;

Yes, Indeed.

* Software prefetch instructions.

How do you determine the latency of the prefetch, on one implementation
100 cycles before might be enough, on another 1,000 might be too little.

* Weak memory models.

Good for HW (performance) a nightmare for SW (ordering)

One can argue that this particular architectural misfeature has
succeeded: After all, in shared-memory multiprocessors you can buy
today, sequential consistency is nonexistent, and weaker models such
as TSO or even weaker ones dominate.

All high performance machines allow the memory controller to order
memory references, requests to different cache lines are performed
out of order with respect to arrival, where accesses to the same
cache line are strictly ordered (causal). Thus, the CPU is not even
in a position to guarantee memory order--and that is where SW resides.

OTOH, very few programmers program to these weak models.

By and large, programmers keep their heads down and try to do their job.
SW is complicated enough without having all programmers understand FP
numerical methods, or memory ordering eccentricities.

Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away;

Languages abstract the SW stuff away from the HW stuff. Languages limit
what programmers can write (but not how badly they write it.)

e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.

They also (as I
understand) lacked value prediction whereas OoO effectively uses
value prediction for branches.

Branch prediction is branch prediction, value prediction predicts the
values of registers or memory locations; I have seen research about
value prediction maybe 20 years ago, but I am not aware that it has
been productized.

Branch prediction only has to guess 1-bit (taken or not)
Value prediction has to predict a whole value often a pointer.

IIRC when I asked someone who presented a paper on
value prediction why there should be a significant predictability for
values, the answer was that the values would come from executing along
a mispredicted path: E.g., when an IF is mispredicted, the values of
the code after the ENDIF might still compute to the same values.

Which illustrates why value prediction has, thus far, failed.

However, given the rareness of mispredictions these days, this source
of value prediction probably provides too little benefit to justify
the cost (not to mention Spectre considerations, which came in later).

Spectré is the contrapositive of value prediction, it is the synthesis
of a value you are not supposed to be able to see.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Wed Dec 20 09:26:09 2023

Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.

While I agree that CPUs should provide sequential consistency (it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd
be surprised if it "handily outperforms" the status quo: concurrent
programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Stefan Monnier on Wed Dec 20 16:29:12 2023

Stefan Monnier wrote:

Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.

While I agree that CPUs should provide sequential consistency (it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd
be surprised if it "handily outperforms" the status quo: concurrent programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.

As Lamport showed, one only NEEDs sequential consistency when accessing
memory somebody else can access at the same time. While running single threaded, one needs nothing more than weak or casual consistency.

As to costs::

l = p[ i / 123456789 ];
r[7] = s;

The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
the unaliased r[7] ST cannot become visible until the LD has become
visible; yet since they cannot alias, this thread is not sensitive
to the order in which they are performed--and certain compilers with
certain flags will more the ST above the LD--certain HW memory queues
will also allow ST to become visible before the LD.

Does this happen in practice--not very often.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Chris M. Thomasson on Thu Dec 21 00:43:44 2023

Chris M. Thomasson wrote:

On 12/20/2023 8:29 AM, MitchAlsup wrote:

Stefan Monnier wrote:

Isn't that as it should be? Hardware provides some simple, possibly >>>> inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers >>>> translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language, >>>> while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software >>>> that handily outperforms using libraries or system calls on hardware >>>> designed for weaker memory models.

While I agree that CPUs should provide sequential consistency (it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd
be surprised if it "handily outperforms" the status quo: concurrent
programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.

As Lamport showed, one only NEEDs sequential consistency when accessing
memory somebody else can access at the same time. While running single
threaded, one needs nothing more than weak or casual consistency.

What about concurrent multi-threaded code than can work without using sequential consistency on existing arch's?

What about engine controllers that don't need anything stronger than very weak memory models because they recompute everything between spark firing intervals and if a sensor gets read wrong, you miss one cylinder firing for one revolution
and then everything goes back to normal.

As to costs::

      l = p[ i / 123456789 ];
      r[7] = s;

The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
the unaliased r[7] ST cannot become visible until the LD has become
visible; yet since they cannot alias, this thread is not sensitive
to the order in which they are performed--and certain compilers with
certain flags will more the ST above the LD--certain HW memory queues
will also allow ST to become visible before the LD.

Does this happen in practice--not very often.

        Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Stefan Monnier on Thu Dec 21 15:08:08 2023

On Wed, 20 Dec 2023 09:26:09 -0500
Stefan Monnier <[email protected]> wrote:

Isn't that as it should be? Hardware provides some simple,
possibly inconvenient interface, and software abstracts the
inconvenience away; e.g., RISCs provide only a load/store
interface, and compilers translate the stuff programmers write into
this interface. The difference is that with RISCs the result is a
least as fast as architectures that tried to cater more to the
programming language, while I am convinced that hardware where the designers design for efficient sequential consistency will let
programmers write software that handily outperforms using libraries
or system calls on hardware designed for weaker memory models.

While I agree that CPUs should provide sequential consistency (it
would make life easier for debugging, formalization, and reasoning,
and I can't think of any reason why it should have a significant
cost),

The same reason TSO-like models were invented and re-invented since
first S/370 multiprocessor: unconstrained ability to feed loads
from local store queue. If anything, I would think that this ability
is even more advantageous with big store queues and deep OoO windows of
today than it was back in early S/370 days.

I'd be surprised if it "handily outperforms" the status quo:
concurrent programming is hard even with sequential consistency, so
most normal code would still continue using the exact same libraries.

Stefan

I agree.
More so, even for "normal abnormal" code, x86 model where TSO is
augmented by implied memory barriers in all synchronization
instructions does not appear inferior to SC.
Now, some people want to write "abnormal abnormal" code, where
threads communicate with regular loads and stores, without any use of
special synchronization instructions, but this people are a very small minority.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Stefan Monnier on Thu Dec 21 17:46:24 2023

Stefan Monnier <[email protected]> writes:

Isn't that as it should be? Hardware provides some simple, possibly
inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers
translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language,
while I am convinced that hardware where the designers design for
efficient sequential consistency will let programmers write software
that handily outperforms using libraries or system calls on hardware
designed for weaker memory models.

While I agree that CPUs should provide sequential consistency (it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd
be surprised if it "handily outperforms" the status quo: concurrent >programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.

Yes, most programmers will still write sequential programs or use
libraries or system calls, but if the number of people who write
lock-free concurrent code increases by a factor of 100, I expect that
there will be more efficient libraries, so even those who use
libraries will benefit.

I actually expect even the existing libraries to perform better on
hardware that efficiently implements sequential consistency, because
the barriers become noops instead of being super-expensive on weakly
ordered hardware).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to MitchAlsup on Thu Dec 21 18:13:26 2023

[email protected] (MitchAlsup) writes:

As to costs::

l = p[ i / 123456789 ];
r[7] = s;

The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
the unaliased r[7] ST cannot become visible until the LD has become
visible; yet since they cannot alias, this thread is not sensitive
to the order in which they are performed

What do you mean with "visible"? You present a single-threaded
program here, why talk about memory ordering at all?

And why would I want to perform the store early? The more typical
case is that you have a store architecturally before a load, and you
want to perform the load early.

What current CPUs do is to ask a predictor whether two accesses alias,
and if the predictor says that they don't, it speculates that they can
move past each other. If the speculation turns out to be wrong, the speculative state is dropped, just as in case of a branch
misprediction.

And for multi-threaded accesses to memory, sequential consistency can
be implemented in a simular way: Have a predictor that tells you what
memory is contended, and for non-contended memory, speculatively do
the fast thing, and if another thread turns out to access the same
memory, recover and do it in the architectural order.

--and certain compilers with
certain flags will more the ST above the LD

That is irrelevant for the question of how an architecture should
perform. It's the other way round: programming languages pass on the
bad programming models that bad (i.e., weakly ordered) architectures
provide. What should they do? Provide sequential consistency by
inserting a memory barrier between any two memory accesses? On weakly
ordered hardware where memory barriers are extremely slow?

If, OTOH, architectures implemented efficient sequential consistency, programming languages would provide sequential consistency. And those
that don't would be abandoned by everybody but Chris M. Thomasson.

--certain HW memory queues
will also allow ST to become visible before the LD.

Yes, we know it's possible to design bad hardware. The idea here is
to tell the hardware designers that weak ordering is not good enough.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 21 19:51:50 2023

Anton Ertl wrote:

[email protected] (MitchAlsup) writes:

As to costs::

l = p[ i / 123456789 ];
r[7] = s;

The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
the unaliased r[7] ST cannot become visible until the LD has become >>visible; yet since they cannot alias, this thread is not sensitive
to the order in which they are performed

What do you mean with "visible"?

It is the same language Lamport used in his seminal paper. It means
when when the address has been broadcast (on the bus) but neither
the read or write has been performed. In CPUs with cache, it means
when the cache has been accessed and hit has not been granted.

You present a single-threaded
program here, why talk about memory ordering at all?

Somebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak
consistencies do not.

And why would I want to perform the store early? The more typical
case is that you have a store architecturally before a load, and you
want to perform the load early.

Is not the general paradigm of OoO processors to do everything as early
as possible ?? And you want both LDs and STs to pass each other freely
AFTER you know the virtual address (after AGEN) when independent, but
remain in processor order when dependent.

Also note:: One can perform a younger LD early in the face of a
ST with unknown address, if the pipeline can reRun the LD again
if the older ST happens to conflict with the value LDed. When one
demands SC, you cannot do this unless you are willing to back all
the way up to the ST with the conflict.

What current CPUs do is to ask a predictor whether two accesses alias,
and if the predictor says that they don't, it speculates that they can
move past each other. If the speculation turns out to be wrong, the speculative state is dropped, just as in case of a branch
misprediction.

Yes, this is backing up to the ST, causal only has to reRun the LD--
which is a lot less expensive (to performance).

And for multi-threaded accesses to memory, sequential consistency can
be implemented in a simular way: Have a predictor that tells you what
memory is contended, and for non-contended memory, speculatively do
the fast thing, and if another thread turns out to access the same
memory, recover and do it in the architectural order.

I agree when doing cross-threaded accesses SC is the best model for
the programmer (least complicated) which is why My 66000 reverts to
SC when doing ATOMIC stuff and reverts back to causal when this is
no longer necessary. In order to be in a position to do this, both
the LD and ST of the ATOMIC event have to be special in some way to
the pipeline. I mark them with a Lock-bit.

Mixing causal and SC at the boundaries of cross-threading accesses
is done without a predictor (although one could be used) by restricting
when cells in a memory dependence matrix is allowed to relax (and allow dependent memory reference to become visible). A predictor makes recovery
of the MDM position significantly harder.

--and certain compilers with
certain flags will more the ST above the LD

That is irrelevant for the question of how an architecture should
perform.

But is entirely relevant to how a µArchitecture is allowed to perform.
When the µArchitecture is allowed to move LDs around STs when there
is no overlap in the containers accessed, performance is improved.
But when this is allowed, one needs to know if one is performing an
ATOMIC event (or not) so as to prevent that.

It's the other way round: programming languages pass on the
bad programming models that bad (i.e., weakly ordered) architectures
provide. What should they do? Provide sequential consistency by
inserting a memory barrier between any two memory accesses? On weakly ordered hardware where memory barriers are extremely slow?

Use processors that fix the problem for you with essentially no effort
on your part.

If, OTOH, architectures implemented efficient sequential consistency, programming languages would provide sequential consistency. And those
that don't would be abandoned by everybody but Chris M. Thomasson.

As illustrated above SC-only harms performance, not by a lot, but by enough
to convert a winning µArchiteccture into a Ho-Hum competitor.

--certain HW memory queues
will also allow ST to become visible before the LD.

Yes, we know it's possible to design bad hardware. The idea here is
to tell the hardware designers that weak ordering is not good enough.

I am not now or for the past 10 years been advocating for weak memory
models when cross-thread accesses are involved. Here, I am advocating
for weakened models that happen to allow for cross-thread accesses to
stabilize to the position that all interested parties agree on what
happened and in what order. This is causal consistency. {{I do not
believe that memory models weaker than causal bring enough performance
to the table to justify the added SW complexity.}}

I am not now or for the past 10 years been advocating for compilers to
make assumptions about memory aliasing allowing them to move LDs and
STs across each other. Here, I am advocating for HW to allows LD<->ST
when accesses can be proven never to conflict outside of cross-thread
memory accesses.

And B: HW designers do not understand memory ordering requirements
sufficiently that they CAN design HW to the specifications you request.
It is not in their DNA....They work with languages that provide the
illusion of everything that can be in parallel already is in parallel
and they did not have to do anything to cause that to be.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to MitchAlsup on Fri Dec 22 17:21:53 2023

[email protected] (MitchAlsup) writes:

Anton Ertl wrote:

You present a single-threaded
program here, why talk about memory ordering at all?

Somebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak >consistencies do not.

In cases where there is no behavioural difference between sequential consistency and weaker memory orderings, there is no fundamental
reason why there should be a performance difference.

You are apparently thinking about a particular implementation (my
guess is: MY66000). What may hold for this implementation does not
necessarily hold in general.

Also note:: One can perform a younger LD early in the face of a
ST with unknown address, if the pipeline can reRun the LD again
if the older ST happens to conflict with the value LDed. When one
demands SC, you cannot do this unless you are willing to back all
the way up to the ST with the conflict.

I would just check whether the result of the load has changed from the speculatively executed version, and if so, restart from load. Why
would you restart from the earlier store?

--and certain compilers with
certain flags will more the ST above the LD

That is irrelevant for the question of how an architecture should
perform.

But is entirely relevant to how a �Architecture is allowed to perform.

What a microarchitecture is allowed to do is entirely limited by the architecture. If the architecture specifies sequential consistency,
the microarchitecture has to implement sequential consistency. If the architecture specifies TSO, the microarchitecture has to implement TSO
or better. What some compiler does is irrelevant.

When the �Architecture is allowed to move LDs around STs when there
is no overlap in the containers accessed, performance is improved.

containers?

If, OTOH, architectures implemented efficient sequential consistency,
programming languages would provide sequential consistency. And those
that don't would be abandoned by everybody but Chris M. Thomasson.

As illustrated above SC-only harms performance, not by a lot, but by enough >to convert a winning �Architeccture into a Ho-Hum competitor.

I have not found any such illustration in what you wrote, only claims
(probably based on a MY66000 microarchitecture).

In particular, given that for single-threaded programs the behaviour
is the same between SC and weak ordering, it is possible to implement
a microarchitecture that provides SC and has the same performance for single-threaded programs as a weakly-ordered competitor.

Now, consider access to shared memory in multi-threaded programs.
With a microarchitecture that is optimized for weak ordering, every
accesses to shared memory will be slow (by having to use slow atomic operations, memory barriers and somesuch). By contrast, with a microarchitecture that is optimized for sequential consistency,
accesses to shared memory will be fast in the uncontended case and
slow only in the contended case (then the reruns are needed). Bottom
line: If the programmer made contention rare, the microarchitecture
designed for sequential consistency wins.

Compare with precise exceptions: The Alpha guys told us that precise
exceptions are too slow, and gave us imprecise exceptions with a slow
trap barrier instruction (TRAPB) for limiting the imprecision; and
programmers should avoid TRAPB at all costs. Then they implemented
the 21264, and lo and behold, TRAPB was as cheap as a NOP; it probably
was a NOP, and the 21264 implemented precise exceptions, because it
comes for free when you do an OoO microarchitecture with speculation.
And the 21264 was faster than the 21164a with its imprecise
exceptions.

And B: HW designers do not understand memory ordering requirements >sufficiently that they CAN design HW to the specifications you request.
It is not in their DNA....They work with languages that provide the
illusion of everything that can be in parallel already is in parallel
and they did not have to do anything to cause that to be.

Whatever DNA they may have, they have managed to implement the
sequential semantics (including precise exceptions) of instruction
sets, and I am sure that they will manage to implement sequential
consistency efficiently if they are tasked to do that. But we have to
tell them that weak ordering (which is easier to implement) is not
good enough.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Anton Ertl on Fri Dec 22 19:46:51 2023

Anton Ertl wrote:

[email protected] (MitchAlsup) writes:

Anton Ertl wrote:

You present a single-threaded
program here, why talk about memory ordering at all?

Somebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak >>consistencies do not.

In cases where there is no behavioural difference between sequential consistency and weaker memory orderings, there is no fundamental
reason why there should be a performance difference.

You are apparently thinking about a particular implementation (my
guess is: MY66000). What may hold for this implementation does not necessarily hold in general.

Granted, I convert everything into the architecture I know best and
in most detail. Wouldn't you ??

Also note:: One can perform a younger LD early in the face of a
ST with unknown address, if the pipeline can reRun the LD again
if the older ST happens to conflict with the value LDed. When one
demands SC, you cannot do this unless you are willing to back all
the way up to the ST with the conflict.

I would just check whether the result of the load has changed from the speculatively executed version, and if so, restart from load. Why
would you restart from the earlier store?

To make everyone agree on the ordering of memory events.

--and certain compilers with
certain flags will more the ST above the LD

That is irrelevant for the question of how an architecture should
perform.

But is entirely relevant to how a �Architecture is allowed to perform.

What a microarchitecture is allowed to do is entirely limited by the architecture. If the architecture specifies sequential consistency,
the microarchitecture has to implement sequential consistency. If the architecture specifies TSO, the microarchitecture has to implement TSO
or better. What some compiler does is irrelevant.

Most µArchitectures operate under the "as if" rule, and if SC is the Architectural rule, µArchitecture can do things in SC order or can do
things n WO and if a SC violation is detected, back up and make it
look like SC was in play.

When the �Architecture is allowed to move LDs around STs when there
is no overlap in the containers accessed, performance is improved.

containers?

Unit of memory access {byte, half (except on x86) word, double, quad
(except on everyone other than x86), bigger than 64-bit things};
aligned or not, line crossing or not, page crossing or not, MTRR
crossing or not.

If, OTOH, architectures implemented efficient sequential consistency,
programming languages would provide sequential consistency. And those
that don't would be abandoned by everybody but Chris M. Thomasson.

As illustrated above SC-only harms performance, not by a lot, but by enough >>to convert a winning �Architeccture into a Ho-Hum competitor.

I have not found any such illustration in what you wrote, only claims (probably based on a MY66000 microarchitecture).

In particular, given that for single-threaded programs the behaviour
is the same between SC and weak ordering, it is possible to implement
a microarchitecture that provides SC and has the same performance for single-threaded programs as a weakly-ordered competitor.

I assert without proof:: no.
I assert that the loss may not be significant in many µArchitectures.
I assert that the loss is significant on a few of the biggest baddest µArchitectures.
Where significance is >5% and <10%.

Now, consider access to shared memory in multi-threaded programs.
With a microarchitecture that is optimized for weak ordering, every
accesses to shared memory will be slow (by having to use slow atomic operations, memory barriers and somesuch). By contrast, with a microarchitecture that is optimized for sequential consistency,
accesses to shared memory will be fast in the uncontended case and
slow only in the contended case (then the reruns are needed). Bottom
line: If the programmer made contention rare, the microarchitecture
designed for sequential consistency wins.

No, but it might beak even.

Compare with precise exceptions: The Alpha guys told us that precise exceptions are too slow, and gave us imprecise exceptions with a slow
trap barrier instruction (TRAPB) for limiting the imprecision; and programmers should avoid TRAPB at all costs. Then they implemented
the 21264, and lo and behold, TRAPB was as cheap as a NOP; it probably
was a NOP, and the 21264 implemented precise exceptions, because it
comes for free when you do an OoO microarchitecture with speculation.
And the 21264 was faster than the 21164a with its imprecise
exceptions.

264 was wider issue and OoO, too.

And B: HW designers do not understand memory ordering requirements >>sufficiently that they CAN design HW to the specifications you request.
It is not in their DNA....They work with languages that provide the >>illusion of everything that can be in parallel already is in parallel
and they did not have to do anything to cause that to be.

Whatever DNA they may have, they have managed to implement the
sequential semantics (including precise exceptions) of instruction
sets, and I am sure that they will manage to implement sequential
consistency efficiently if they are tasked to do that. But we have to
tell them that weak ordering (which is easier to implement) is not
good enough.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Fri Dec 22 21:52:00 2023

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

I'm reasonably confident that the increases in code size, and the >correspondingly increased I-cache misses, were what wrecked x86-32 >prefetching in my case.

I have read explanations along these lines, or that the prefetch
hints consumed instruction decode/retire/LSU resources that caused
useful instructions to be delayed. I find both of these explanations surprising, because I would expect that, if a program spends
significant time on D-cache misses (so that prefetches could help),
reducing that would far outweigh any I-cache or resource contention
effects.

Well, it was the best explanation I could come up with at the time. The
code that I applied prefetching to was certainly generating lots of
misses in both I-cache and D-cache, and had not been designed for
prefetching. I applied it to the data traversal operators in the domain-specific language the code is written in, and some thing got
faster, some slower, but without any very large changes.

Do you mean the speculative loads that would return NaTs if the
memory access is invalid? If so, why does that not happen when
running under a debugger.

No. What I found was much, much worse. IA-64 had load-advance and
load-check instructions. You were supposed to issue the load-advance as
soon as you knew the address you were going to need and it happened asynchronously. The load-check was quick if the value had arrived, but otherwise stalled the pipeline until it did arrive. If the hardware had discarded an advance load, as it was permitted to do, this meant a full
wait for memory/cache.

If you were fetching into integer registers, the windowing of those
registers and the Register Stack Engine allowed for all this to happen smoothly, provided you hadn't gone into so many levels of subroutine
during the fetch that the register you were loading into had been pushed
out to memory. I'm not sure what was meant to happen then, but it wasn't
going to be quick.

The problem came when you were fetching into floating-point registers,
which were not windowed, although the way they could move around for modulo-scheduled loops confused people on that front.

The sequence of events that caused problems was:

1. Advance load into floating-point register fx.
2. Call subroutine, which wants to use fx.
3. Subroutine pushes fx to stack, and starts using it.
4. Advance load arrives, overwriting fx.
5. Subroutine may well go wrong, given its register has been corrupted.
6. If subroutine doesn't notice, it finishes its work.
7. Subroutine pops fx from the stack.
8. Subroutine returns.
9. Check load is executed. The ALAT says the load has happened.

The value in fx, nonetheless is what was there /before/ you issued the
advance load.

If you have breakpoints anywhere along this path, they cause pipeline
flushes and the problem does not (usually) reproduce.

It took me a fair amount of time and brain cells to figure this out. I
ended up forming a hypothesis and searching assembly listings for the
relevant code sequences. Once I knew what was going on, I tried to report
the bug. Intel didn't want to know for a while and then switched to "Oh
yes, that old thing. It doesn't matter! Didn't we tell you about it?"
This was not convincing.

The net effect is that you could not safely have floating-point advance
loads outstanding across subroutine calls. The "fix" was to re-issue all outstanding advance floating-point loads after each subroutine call,
which bulked out the code quite seriously. IA-64 code was already very
bulky, pressurising cache size and memory bandwidth. The "fix" also
didn't prevent the subroutines suffering from random register corruptions.

Intel seemed to have assumed that floating-point work would /only/ happen
in leaf functions. That's a nice ideal, but there's plenty of real-world
code where it is not true. /Maybe/ whole-program optimisation might have reduced its effects. I tried that just once. The link took an hour. This
was not practical when the compiler was so immature that I was reporting
bugs in it every week.

IA-64 did not prove that all its ideas were worthless, only that the way
they'd been put together was counterproductive. Nonetheless, anyone else
who tries to launch a EPIC processor is going to face some very hostile questioning, and it's perfectly understandable that nobody wants to try.

Maybe I have not made the point clear enough

OK, with you now.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Dallman on Sat Dec 23 10:49:11 2023

[email protected] (John Dallman) writes:

In article <[email protected]>, >[email protected] (Anton Ertl) wrote:

Do you mean the speculative loads that would return NaTs if the
memory access is invalid? If so, why does that not happen when
running under a debugger.

No. What I found was much, much worse. IA-64 had load-advance and
load-check instructions. You were supposed to issue the load-advance as
soon as you knew the address you were going to need and it happened >asynchronously. The load-check was quick if the value had arrived, but >otherwise stalled the pipeline until it did arrive. If the hardware had >discarded an advance load, as it was permitted to do, this meant a full
wait for memory/cache.

Ah, yes, this one. In particular, the ld.c would repeat the load if a
store in between stored to that address.

The problem came when you were fetching into floating-point registers,
which were not windowed, although the way they could move around for >modulo-scheduled loops confused people on that front.

The sequence of events that caused problems was:

1. Advance load into floating-point register fx.
2. Call subroutine, which wants to use fx.
3. Subroutine pushes fx to stack, and starts using it.
4. Advance load arrives, overwriting fx.
5. Subroutine may well go wrong, given its register has been corrupted.
6. If subroutine doesn't notice, it finishes its work.
7. Subroutine pops fx from the stack.
8. Subroutine returns.
9. Check load is executed. The ALAT says the load has happened.

The value in fx, nonetheless is what was there /before/ you issued the >advance load.

If you have breakpoints anywhere along this path, they cause pipeline
flushes and the problem does not (usually) reproduce.

Ouch. This clearly is a compiler bug. It eiter should abstain from
moving a load across the call, or perform a ld.c before the call.
Which compiler was this?

IA-64 did not prove that all its ideas were worthless, only that the way >they'd been put together was counterproductive. Nonetheless, anyone else
who tries to launch a EPIC processor is going to face some very hostile >questioning, and it's perfectly understandable that nobody wants to try.

I have not had any such problems when I compiled my code on IA-64
(with gcc), the result was just slow:

Gforth 0.7 results (lower is better):
sieve bubble matrix fib
1.000 1.120 0.710 1.680 0.7.0; Itanium II 900MHz (HP rx2600); gcc-4.1.1
0.245 0.287 0.156 0.376 0.7.0; Pentium 4 Northwood 2.26GHz; gcc-2.95.4 20011002 (Debian prerelease)
0.670 1.070 0.932 0.968 0.7.0; Alpha 21264B 800MHz; gcc-4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Both Intel CPUs are from 2002, the 21264B from 2001 (based on the
21264 first released in 1998); and this is the Itanium running IA-64
code, not IA-32 code.

I have discussed earlier (<[email protected]>
and others) why EPIC lost against OoO, and it's not just bad execution
by Intel and HP.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Sat Dec 23 14:37:00 2023

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Which compiler was this?

Microsoft. Our primary demand for Itanium, when things were starting up
for it, was Windows, because there was no 64-bit Windows at the time and
some large customers really wanted that. That demand evaporated like dry
ice in Death Valley when Intel announced their x86-64.

I started work on Windows using Intel's and Microsoft's compilers in
parallel, expecting to discard one when it became clear which was better.
After a while, we were at a point when we could get through the most
basic level of testing in an optimised build with Microsoft, and couldn't
in a deoptimised build with Intel.

Ouch. This clearly is a compiler bug. It either should abstain from
moving a load across the call, or perform a ld.c before the call.

The problem comes when loads and calls are mixed together very closely,
such that there's obviously no time for a load to complete between the
address being known and the next necessary call. Microsoft's tactic of re-issuing loads got rid of most of the problems, and we'd become far
more interested in AMD64.

I have discussed earlier
(<[email protected]>
and others) why EPIC lost against OoO, and it's not just bad
execution by Intel and HP.

Indeed.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Dec 23 21:44:22 2023

Chris M. Thomasson wrote:

On 12/23/2023 6:37 AM, John Dallman wrote:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

Which compiler was this?

Microsoft. Our primary demand for Itanium, when things were starting up
for it, was Windows, because there was no 64-bit Windows at the time and
some large customers really wanted that. That demand evaporated like dry
ice in Death Valley when Intel announced their x86-64.

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus for
my inventing ASF which later morphed into ESM. My point was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can provide the primitives such that SW can invent and use whatever primitive
they can figure out how to program and how to use. {{There is no 3rd choice}}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Dec 23 23:12:20 2023

Chris M. Thomasson wrote:

On 12/23/2023 1:44 PM, MitchAlsup wrote:

Chris M. Thomasson wrote:

On 12/23/2023 6:37 AM, John Dallman wrote:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

Which compiler was this?

Microsoft. Our primary demand for Itanium, when things were starting up >>>> for it, was Windows, because there was no 64-bit Windows at the time and >>>> some large customers really wanted that. That demand evaporated like dry >>>> ice in Death Valley when Intel announced their x86-64.

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can
provide the primitives such that SW can invent and use whatever primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}

Well the thing about cmp8xchg16 and friends is that all the words need
to be adjacent to one another.

That really limits its utility.

Then there is cmpxchg16 that allows for
true 128 bit CAS on a 64 bit system, again wrt adjacent words...
cmp8xchg16 was always strange to me.

ESM has none of those limitations, and you can perform up to CMP8SWAP8 if
that floats your boat, and/or attach time stamps along with the pointer manipulations,.......

Now, speaking of IBM, iirc, are you familiar with the PLO locking
scheme? I think it was called PLO. Iirc, it was in the same appendix as
the lock-free IBM free pool manipulation?

No

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Chris M. Thomasson on Sun Dec 24 11:35:00 2023

In article <um7gmd$274oh$[email protected]>, [email protected]
(Chris M. Thomasson) wrote:

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

I am fortunate enough not to have had to delve into locking
implementations; I make a practice of using the OS facilities carefully.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Paul A. Clayton on Sun Dec 24 18:39:54 2023

"Paul A. Clayton" <[email protected]> writes:

On 12/23/23 4:44 PM, MitchAlsup wrote:

Chris M. Thomasson wrote:

[snip]

Do you remember the rather odd instruction for the Itanium
called cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap
2 over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My
point was and still is that one can add 1 to 3 ATOMIC instructions
every product rev, or one can provide the primitives such that SW can
invent and use whatever primitive they can figure out how to program
and how to use. {{There is no 3rd choice}}

Third choice: do both (likely poorly).

ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)

(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice.

In my experience with large scale systems, it has been shown
that LL/SC cannot scale so long as it must acquire the cache
line to complete the transaction.

The advantage of atomics (particularly load+add) is that the
core doesn't need to acquire exclusive access to the cache line
before completing the operation - the atomic can be easily delegated
to a higher level (e.g. LLC) of cache, passed to the cache
that owns the line, or even to the memory controller
(or a PCI express device).

I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Methods applied for locks might help in some cases for hardware-
monitored reservations, but designing an interface seems difficult
(general enough to be useful yet also high enough performance to
be useful) and dangerous (new techniques and use cases can change
the two difficulty factors).

Back in the early 1980's, we added hardware support for locks
(and events - e.g. modern condition variables) directly to
the instruction set. An instruction to acquire a lock,
one to release a lock (operating like a mutex), one to
test if the lock is available and acquire if so, but not
block if owned by another thread. Instructions to
wait for an event and signal an event.

The lock was defined by a data structure that included
a 'canonical lock number', and a field containing
the current lock owner task number, and a link to the
first thread waiting for the lock, if any.

The MCP(OS) thread data structure contained a field that
recorded the current canonical lock number (CLN) owned by the
thread and the lock instruction would fail (set condition codes)
if the lock being locked had a CLN less than or equal to
the currently owned CLN. The unlock instruction would fault
if the thread attempted to unlock a lock where the CLN didn't
match the current CLN. These rules completely prevented A-B
deadlocks.

If the lock was already owned the lock instruction would trap
to a microkernel (highest privilege level) which would place
the thread on a waiting list and dispatch the next highest
priority thread.

If there was a waiter recorded in the lock structure, the
unlock instruction would trap to the microkernel which would
adjust the list and, if the newly runnable thread was highest
priority, dispatch it.

Non-composing hardware lock elision seems to have the advantage of
including an exclusive name. I imagine that one could use that
name to establish a single order of atomic operations when there
is conflict, but even then maximizing throughput seems difficult.
If one falls back to all users of the lock name wait in line, one
likely loses a lot of potential concurrency while the lock name is
in "queuing mode". I suspect there are some tricks that would
allow some queue skipping at moderate complexity/area/power/etc.
cost (possibly involving versioned memory).

Sounds much like transactions. Most hardware implementations so
far have suffered from various restrictions in the size of the
transaction and haven't found much traction yet.

I have not studied software concurrency (just "overhearing" a few
things from time to time), but even a sub-novice knows that
locking is hard (deadlock, livelock, etc.).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Paul A. Clayton on Sun Dec 24 18:55:19 2023

"Paul A. Clayton" <[email protected]> writes:

On 12/19/23 3:49 AM, Anton Ertl wrote:

"Paul A. Clayton" <[email protected]> writes:

[snip]

I suspect that a superior interface could be designed which
exploits diverse locality (i.e., data might naturally be closer to
some computational resources than to others) and communication
(and storage) costs and budgets (budgets being related to urgency
and importance).

There have been a number of attempts to design architectures with more
explicit hardware-oriented features, as well as attempts to design
architectures with more software-oriented features.

Communication of the (absolute and relative) costs and usefulness
in both directions between hardware and software is a significant
factor. Tossing difficult things to the other side of the
interface just because they are difficult is problematic, but the
relative difficulty can justify the placement of work.

Although there are many things that hardware can do much
more efficiently, perhaps at a cost. Such as Content
Addressible Memory which is useful in many workloads,
particularly in the networking space.

I'd love to have some form of CAM available directly to a
C++ programmer on standard hardware - I often see cases
where it could be useful for performance, for small
lookup tables. Perhaps a CAM instruction with a set of
CAM banks (one per hardware thread) and support in the kernel
to save/restore them on context switch (lazily, like FPR).

I think it is quite remarkable that a sequence of instructions, with
branches and RAM and registers, is what we already have in S/360 (59
years ago) and (I think) in the IBM 704 (69 years ago);

The Datatron first shipped the same year as the 704. It was
announced in '52. There was a 2-digit order code in a 'command'
(what we'd call an instruction). It included both multiplication
and division (10 digit operands, 20 digit product) as well
as shift (digit, not bit) orders (peephole optimizations
for multiply/division by powers of 10). Load (Clear and Add) and Store orders. Block transfer order copied 20 consecutive main storage
cells to a 'quick access loop'.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Chris M. Thomasson on Sun Dec 24 20:37:51 2023

Chris M. Thomasson wrote:

On 12/23/2023 6:45 PM, Paul A. Clayton wrote:

On 12/23/23 4:44 PM, MitchAlsup wrote:

Chris M. Thomasson wrote:

[snip]

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2
over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My point
was and still is that one can add 1 to 3 ATOMIC instructions every
product rev, or one can provide the primitives such that SW can invent
and use whatever primitive they can figure out how to program and how
to use. {{There is no 3rd choice}}

Third choice: do both (likely poorly).

ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)

(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice. I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation. However, I would not be
surprised if there was an advantage to easier idiom recognition
for atomic operations that would readily benefit from some
optimizations. The code density disadvantage would be difficult
to remove (a sequence of primitives will typically be longer in
bytes than a complex operation), but greater flexibility might
pay for lower code density especially for less frequent
operations.)

Programmers have to be very careful with LL/SC because they can fail for other reasons besides the comparand being different wrt CAS. Live lock
can enter the picture.

When I developed ESM (after ASF) I understood that SW might want to know
not just that interference had transpired, but what kind and how many.
So, I have a variable that can be accessed for a short time after an
ATOMIC event called WHY. Why has 3 kinds of values:: a negative value represents a spurious error {buffer overflow, ...}, zero (0) represents success, and positive number indicates how many other threads are contending with this ATOMIC event. SW can use this to decrease future contention.

I would suggest that spurious failures are vastly preferred compared to spurious ABA successes.

(Reserving a single "page" would also seem to make ordering easier
at the cost of more false sharing. Spatially dense atomics are
also less interesting.)

Methods applied for locks might help in some cases for hardware-
monitored reservations, but designing an interface seems difficult
(general enough to be useful yet also high enough performance to
be useful) and dangerous (new techniques and use cases can change
the two difficulty factors).

Non-composing hardware lock elision seems to have the advantage of
including an exclusive name. I imagine that one could use that
name to establish a single order of atomic operations when there
is conflict, but even then maximizing throughput seems difficult.
If one falls back to all users of the lock name wait in line, one
likely loses a lot of potential concurrency while the lock name is
in "queuing mode". I suspect there are some tricks that would
allow some queue skipping at moderate complexity/area/power/etc.
cost (possibly involving versioned memory).

I have not studied software concurrency (just "overhearing" a few
things from time to time), but even a sub-novice knows that
locking is hard (deadlock, livelock, etc.).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Chris M. Thomasson on Sun Dec 24 21:36:06 2023

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

On 12/23/23 4:44 PM, MitchAlsup wrote:

Chris M. Thomasson wrote:

[snip]

Do you remember the rather odd instruction for the Itanium
called cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap
2 over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My
point was and still is that one can add 1 to 3 ATOMIC instructions
every product rev, or one can provide the primitives such that SW can
invent and use whatever primitive they can figure out how to program
and how to use. {{There is no 3rd choice}}

Third choice: do both (likely poorly).

ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)

(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice.

In my experience with large scale systems, it has been shown
that LL/SC cannot scale so long as it must acquire the cache
line to complete the transaction.

The advantage of atomics (particularly load+add) is that the
core doesn't need to acquire exclusive access to the cache line
before completing the operation - the atomic can be easily delegated
to a higher level (e.g. LLC) of cache, passed to the cache
that owns the line, or even to the memory controller
(or a PCI express device).

I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Right, wrt the reservation granule. Iirc, this can be larger than a l2
cache line?

Actually -any- store regardless of PA.

The reservation granule can, in the degenerate case (e.g. on an ARM
Cortex-M7), be the entire address space. In general, the size is implementation defined (between 4 and 512 words for ARMv8).

From DDI0487_I_a

When a PE writes using any instruction other than a Store-Exclusive instruction:

� If the write is to a PA that is not marked as Exclusive Access by its local
monitor and that local monitor is in the Exclusive Access state, it is
IMPLEMENTATION DEFINED whether the write affects the state of the local
monitor.

� If the write is to a PA that is marked as Exclusive Access by its local
monitor, it is IMPLEMENTATION DEFINED whether the write affects the
state of the local monitor.

� Any transition of the local monitor to the Open Access state not caused
by the architectural execution of an operation shown in Figure B2-4 must
not indefinitely delay forward progress of execution.

� LoadExcl/StoreExcl loops are guaranteed to make forward progress only if,
for any LoadExcl/StoreExcl loop within a single thread of execution,
the software meets all of the following conditions:

1 Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory effects,
preloads, direct or indirect System register writes, address translation instructions, cache or TLB
maintenance instructions, exception generating instructions, exception returns, ISB barriers, or
indirect branches.

An implementation is free to clear the monitor for almost any reason. Thus, applications are recommended to keep the number of instructions between the load and store deminimus (avoiding any other loads or stores).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 01:28:58 2023

Paul A. Clayton wrote:

On 12/24/23 1:39 PM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

On 12/23/23 4:44â€¯PM, MitchAlsup wrote:

Chris M. Thomasson wrote:

[snip]

Do you remember the rather odd instruction for the Itanium
called cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap
2 over on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My
point was and still is that one can add 1 to 3 ATOMIC instructions
every product rev, or one can provide the primitives such that SW can
invent and use whatever primitive they can figure out how to program
and how to use. {{There is no 3rd choice}}

Third choice: do both (likely poorly).

ESM could atomically operate on up to 512 bytes in up to 8 64-
byte-aligned locations. The proposed atomics seem to be limited to
a single cache block; a modestly extended LL/SC that allowed
ordinary loads and stores within the reserved cache block would
seem to provide that functionality. ESM is much more flexible, but
forward progress is more difficult with more reservations. (Yes,
ESM has mechanisms to handle queues and such, but I doubt even
Mitch Alsup has discovered a universal solution.)

(A proposed extension for RISC-V — Zacas — provides double-width
CAS. I do not buy the argument that "the CAS atomic instructions
scale better to highly parallel systems than LR/SC"; I think that
is simply an implementation choice.

In my experience with large scale systems, it has been shown
that LL/SC cannot scale so long as it must acquire the cache
line to complete the transaction.

I think this is significantly an implementation choice.

There are also architectural choices involved. ESM does not HAVE
TO hold onto cache lines, what it cannot allow is access of write
permission. With this paradigm, one can lose cache lines to
interference without loosing the ability to complete the ATOMIC
event.

IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).

One of the reasons I invented the ATOIC control point and automagic
control transfer to the control point on ATOMIC failure. HW has to
leave the IP pointing somewhere, so why not at the failure recovery
point ??

I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got

ESM mandates all TLB misses in the setup phase, so that none of the participating STs miss in the TLB. {{Details matter}}

the impression that this was not implemented to provide best-
reasonable performance; correct operation is obviously more
important, especially for that ISA, but the description of how
conflicts were handled was disappointing to me. (By the way,
z/Architecture also promoted memory-updating compute instructions
to be memory atomic if aligned.)

I have STM, LDM, and MM as atomic, as long as the data size being
transferred does not cross a page boundary. The bus was designed
for page sized ATOMIC transfers, so why not allow the CPU to make
use of this property.

With a single reservation granule, ordering is much easier.

You cannot reduce the exponent of complexity and retain a single
ATOMIC granule !! See my code illustrating moving (extracting
and then inserting) an element of a CDS where nobody can ever
see that element not INSIDE the CDS !! So, you need fewer events
and fewer events means less interference, so a lower exponent of
latency,.....

The advantage of atomics (particularly load+add) is that the
core doesn't need to acquire exclusive access to the cache line
before completing the operation - the atomic can be easily delegated
to a higher level (e.g. LLC) of cache, passed to the cache
that owns the line, or even to the memory controller
(or a PCI express device).

Obviously, idiom recognition could translate ll/add/sc into
an atomic add. (Alternatively, an atomic add could be implemented
using the mechanisms provided for ll/sc.)

Most uses of ATOMIC_op_to_memory() are expected to execute way
out in the memory hierarchy (wherever the request runs into the
data). So, not only does the core have to recognize the paradigm,
but convert it into a get_perforemed_in_memory() operation. In
any event, Op_to_memory() is the only one that never has failure
modes....

While I would prefer a denser idiom and one that constrains the
cost of idiom recognition, even a 12-byte "atomic instruction"
using ordinary ll/sc and a compute instruction might not be so
horrible. (I mentioned previously the possibility of a special
ll that implied a sc after a compute operation, providing both
a more "outspoken" idiom and a shorter expression.)

Not horrible at all as long as the operation takes place "over in memory"
and not in the CPU.

I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

To be fair, my 66000 will end up at the fail point if any exception
transpires while in an ATOMIC event {{including debug trap}}. This
is why ESM requires all writes to pass TLB checks in the setup phase.

<snip>

I think Intel and IBM made the mistake of going directly to a
largish scope (L1 cache write set and Intel at least extending the
read set with a conservative filter) rather than providing high
performance for small transactions and perhaps the possibility
to experiment with larger transactions. Both also probably
suffered from slow failure (I seem to recall Intel's TSX had
slow failures even for small transactions).

The mistake is that this is a problem the CPU is in a place to solve.
The CPU is not, the memory controller is. ...

Cliff Click proposed (in "I Wanna Bit") a L1-capacity read set and
single access write set with the further restriction that all
cache misses failed the atomic event (so retries were expected).
He also had a few blog posts about the problems faced with Azul
Systems' transactional memory (e.g., Java libraries often had
software performance counters that artificially introduced
conflicts).

I think My 66000's ESM is a good first step, but I would be
tempted to provide the ability to play with more extensive read
sets in the first version. That might not be very useful, but I
suspect some clever programmers would find uses.

Any read set larger than the outstanding miss buffer is either
supporting width the ISA cannot use, or providing a limit to
the width of ATOMICIty.

A conservative filter can be used to extend the read set, so an implementation would not necessarily need to change the cache
tag metadata. (Conservative filters were used only for evictions,
if I understand correctly, for Intel TSX. Per-cache-block marker
bits were used to track both read and write sets.)

In addition to all the other issues with extending the capacity
of atomic memory operations, there is also the question of
version-based vs. read/write set tracking. Versioning could
facilitate some speculative multithreading and more generally
allowing orderable conflicts, but versioning is more complex.
There are also reasons why one would like to leak more of the
implementation information rather than provide a generic
abstraction (not only for what operations are more likely to
be fast/succeed but also what information might be usefully
passed across the interface).

I think having hardware on which to test ideas would be a
significant benefit. I also know that I am biased by seeing
limited transactional memory support as being relatively
cheap and an obvious extension to a more clearly useful
interface like ESM. The main problem with providing such seems
to be the burden of compatibility.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Scott Lurndal on Mon Dec 25 09:44:14 2023

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent
LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.

It requires an extra cache line buffer to retain all SCH updates until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

Not sure how useful multi-word LL/SCH/SCR is though since all updates must
be within one cache line. It can do atomic double, quad and octa-wide CAS,
or atomic compare-exchange plus increment a generation number.

Right, wrt the reservation granule. Iirc, this can be larger than a l2
cache line?

Actually -any- store regardless of PA.

The reservation granule can, in the degenerate case (e.g. on an ARM Cortex-M7), be the entire address space. In general, the size is implementation defined (between 4 and 512 words for ARMv8).

From DDI0487_I_a

When a PE writes using any instruction other than a Store-Exclusive instruction:

� If the write is to a PA that is not marked as Exclusive Access by its local
monitor and that local monitor is in the Exclusive Access state, it is
IMPLEMENTATION DEFINED whether the write affects the state of the local
monitor.

� If the write is to a PA that is marked as Exclusive Access by its local
monitor, it is IMPLEMENTATION DEFINED whether the write affects the
state of the local monitor.

� Any transition of the local monitor to the Open Access state not caused
by the architectural execution of an operation shown in Figure B2-4 must
not indefinitely delay forward progress of execution.

� LoadExcl/StoreExcl loops are guaranteed to make forward progress only if,
for any LoadExcl/StoreExcl loop within a single thread of execution,
the software meets all of the following conditions:

1 Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory effects,
preloads, direct or indirect System register writes, address translation instructions, cache or TLB
maintenance instructions, exception generating instructions, exception returns, ISB barriers, or
indirect branches.

An implementation is free to clear the monitor for almost any reason. Thus, applications are recommended to keep the number of instructions between the load and store deminimus (avoiding any other loads or stores).

Pretty much the same rules as Alpha.
These rules appear to allow it to acquire the line with exclusive ownership, pin the cache line between LE and SE, and release the pin if anything other than an SE happens. The pinning-while-exclusively-owned is what guarantees forward progress. It doesn't set an upper limit on the number of
instructions or cpu clocks between LE and SE other than a time clock or
any other interrupt will cancel the lock.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 17:08:06 2023

Paul A. Clayton wrote:

On 12/25/23 9:44 AM, EricP wrote:

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads
and
stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most
recent
LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and
retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.

It requires an extra cache line buffer to retain all SCH updates
until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

Not sure how useful multi-word LL/SCH/SCR is though since all
updates must
be within one cache line. It can do atomic double, quad and
octa-wide CAS,
or atomic compare-exchange plus increment a generation number.

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, but the HW does not know if that memory reference is participating
or not in the event. That is why participating must be announced.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to EricP on Mon Dec 25 17:06:11 2023

EricP wrote:

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent
LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,

My 66000 allows for more than 1 cache line (up to 8) each cache line
is accessed by an inbound memory reference {LD or PREfetch} and a
monitor is setup to watch the cached lines. Each touched line becomes
a participating cache line.

and SCH Store Conditional Hold which conditionally updates and retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.

The monitor remains active until a <single> outbound memory reference
{ST or PUSH} and then allows for all state modifications to all the
monitored lines to become visible as if at the same instant. A locked
ST to a nonparticipatng cache line abandons the event (with failure)

It requires an extra cache line buffer to retain all SCH updates until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

I use the miss buffer for this. {{Sts to lines not participating in the
event are performed normally.}}

Not sure how useful multi-word LL/SCH/SCR is though since all updates must
be within one cache line. It can do atomic double, quad and octa-wide CAS,
or atomic compare-exchange plus increment a generation number.

Not nearly as useful as across cache lines.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 18:41:58 2023

Paul A. Clayton wrote:

On 12/25/23 12:08 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/25/23 9:44 AM, EricP wrote:

[snip]

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and
retains
the line lock, and SCR Store Conditional Release which
conditionally
applies all SCH and SCR updates and releases the line lock.

[snip]

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, but the HW does not know if that memory reference is
participating
or not in the event. That is why participating must be announced.

I was assuming a simple interface where all memory operations
inside the atomic operation participated. ASF allowed not only non-participating memory operations but also releasing of
participants (I am not certain if ESM has the latter).

How are you going to debug an ATOMIC event when you are not
allowed to single step through the event ?? .... Every exception
(of which debug trap is certainly one) sets IP to the control
point and then takes the exception.

ESM is
obviously a better and more flexible interface, but for merely
extending the semantics of an existing ISA interpreting LL as
Transaction Begin and SC as Transaction End seems fairly nice.

If I recall correctly, most ISAs with LL/SC give those
instructions very limited addressing modes, so limiting address

There is no limitation of address mode in My 66000.

generation clutter to two memory accesses seems more desirable
than reusing LL (and one would have to add a "Store Conditional
Hold" instruction since SC ends the atomic operation).

For simple atomics, one does not need escaping memory operations
or release of reservations even though such are useful
features. If limited to one granule, escaping and releasing
operations do not make much sense (unless there is additional
false sharing tracking) — EricP's proposal was for only one
granule/cache block.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 18:38:11 2023

Paul A. Clayton wrote:

On 12/24/23 8:28 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/24/23 1:39 PM, Scott Lurndal wrote:

[snip]

In my experience with large scale systems, it has been shown
that LL/SC cannot scale so long as it must acquire the cache
line to complete the transaction.

I think this is significantly an implementation choice.

There are also architectural choices involved. ESM does not HAVE
TO hold onto cache lines, what it cannot allow is access of write
permission. With this paradigm, one can lose cache lines to
interference without loosing the ability to complete the ATOMIC
event.

ESM holds the read-and-write set in a write buffer, so the cache

Where the buffer is writable but the cache is left in a pristine
state. I plan on using the Miss buffer for this. The size of the
miss buffer is the limiter of the number of participating cache
lines. The miss buffer is also <essentially> fully associative.
The Miss buffer is also SNOOPed to keep track of interference.

itself holds no tracking data. If the data cache holds tracking
data, an eviction from limited associativity could allow a write
to the evicted cache block *if* the block was part of the atomic
operation. (If it was merely resident from previous accesses,
eviction is not important for the atomic operation, though
eviction might slow completion if that data is later part of the
atomic operation.)

ASF and ESM still work even when there is no <data> cache.

Yes, for the read set the data does not need to be retained. A
list of addresses or a conservative filter are sufficient to
guard against writes. (If the atomic operation can establish a
time marker such that any write to evicted blocks happens "after"
the atomic operation commits. For something like ESM, a cache
block that has received its last update could be sent rather than
a NAK if the atomic operation is being delayed by a data from a
cache miss in the read set.

Yes, if we know that the event will succeed the miss buffer can hand
off the partially completed as if "they all become visible instantaneously".

This "optimization" would have the
bad consequence of spreading poisoned data when an uncorrectable
ECC error is encountered for the cache miss — the atomic stores
dependent on that delayed data would have to be poisoned as
uncorrectable — as well as potentially being more complex than
beneficial.)

IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).

One of the reasons I invented the ATOIC control point and automagic
control transfer to the control point on ATOMIC failure. HW has to
leave the IP pointing somewhere, so why not at the failure recovery
point ??

I think the earlier ASF had that feature. The LL/SC interface
requires a separate branch instruction at the end. Yet there
also seem to be cases where the failure type would just
indicate "retry".

The earlier ASF had part of this feature, it did not have the change
in the control point when a BIN (branch on interference) was
performed {mostly because x86 uses Condition Coded branches and
there was no way of adding "interference" to the CC.

I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got

ESM mandates all TLB misses in the setup phase, so that none of
the participating STs miss in the TLB. {{Details matter}}

Yes, details matter! Experience and deep thought (and
communication with other [experienced] deep thinkers) can expose
details and how they interact. An elegant design has synergy,
recurring patterns, and encapsulation that seems natural (and
obvious after the fact).

[snip]

(By the way,
z/Architecture also promoted memory-updating compute instructions
to be memory atomic if aligned.)

I have STM, LDM, and MM as atomic, as long as the data size being
transferred does not cross a page boundary. The bus was designed
for page sized ATOMIC transfers, so why not allow the CPU to make
use of this property.

This is a nice synergy example.

With a single reservation granule, ordering is much easier.

You cannot reduce the exponent of complexity and retain a single
ATOMIC granule !! See my code illustrating moving (extracting
and then inserting) an element of a CDS where nobody can ever see
that element not INSIDE the CDS !! So, you need fewer events
and fewer events means less interference, so a lower exponent of
latency,.....

I was not suggesting that all atomics be limited to a single
granule but that this case could be optimized more easily than
for multiple granules. (The interaction of multi-granule and
single-granule atomics does complicate this, of course.)

Even within a larger atomic, single conflict optimizations are
possible. (I mentioned one possibility for ESM above.) If one
granule is under contention, the ordering principles for single
granule atomics may be applicable.

Single cache line operate under the same guidelines, but can only
utilize a strict subset of the possible orderings.

There are also optimizations for single contender, where handoffs
would be more practical. Even a case where a thread is about to
complete an atomic that would write data that would change the
participants of a conflicting atomic operation might allow some
optimization where the almost complete atomic prefetches the new
participant data to the thread with the other atomic operation and
provides that atomic with an "I'm next" token.

Another possible optimization would be to commit an atomic that
is waiting on a cache miss for one granule in its read set. If the
atomic is guaranteed to commit, it can release all the non-
dependent members to other threads (the dependent members have to
be reserved until the data arrives from the cache miss but their
modification happens "before" the accesses of the released
members).

Be very careful, here.

Such would have the bad aspect that an uncorrectable
ECC error on the cache miss data would force additional memory
locations to be poisoned. (With limited versioned memory, a
rollback of the atomic event might be possible. Given that the
ECC error would be greatly delayed, I suspect undoing all the
dependent effects would be excessively complex and error-prone.)

Most systems take unrecoverable ECC failures are Machine Check.
You may be able to save the Guest OS, but you are not going to
be able to save the application.

[snip]

I think Intel and IBM made the mistake of going directly to a
largish scope (L1 cache write set and Intel at least extending the
read set with a conservative filter) rather than providing high
performance for small transactions and perhaps the possibility
to experiment with larger transactions. Both also probably
suffered from slow failure (I seem to recall Intel's TSX had
slow failures even for small transactions).

The mistake is that this is a problem the CPU is in a place to solve.
The CPU is not, the memory controller is. ...

Could you unpack that statement?

It is my contention after studying this problem since 1977, that
the CPU is not in a position to perform ATOMIC things--the proper
place for these is over in the memory hierarchy (memory controller
or DRAM controller of which there are multiples). The OP_to_memory()
primitives guarantee all sorts of things that no CAS can guarantee.

As the scope of the atomic increases (number of granules, amount
of computation), it would seem that minimizing communication would
favor doing more of the operation closer to the core. Moving the
compute to the data would also not necessarily move it all the
way to the memory controller. There may even be cases where
something like test-and-test-and-set is useful; a first non-atomic
read checks the data, then if the data needs updating an atomic
operation is performed. While simple cases like test-and-test-and-
set would avoid the first test when computation is moved to data,
one can imagine more complex cases where core-local data interacts
with the more contended data. There may be less communication if
core orchestrates the operation.

The fact you illustrated using test_and_test_and_set illustrates
WHY the CPU is not in a position to "do the right thing". Whereas add_to_memory() always does the right thing without looping,...
Notice that most GPUs only have OP_to_memory() atomics.

[snip]

I think My 66000's ESM is a good first step, but I would be
tempted to provide the ability to play with more extensive read
sets in the first version. That might not be very useful, but I
suspect some clever programmers would find uses.

Any read set larger than the outstanding miss buffer is either
supporting width the ISA cannot use, or providing a limit to
the width of ATOMICIty.

Unpack?

Since MM (memory copy) is already atomic when within a page, could
it not be included within an ESM operation even when more than
eight cache lines are accessed? Such would extend both the read
set and the write set. (I do not know that such would be useful,
but it seems likely not to add much hardware or complexity.)

Maybe, on a theoretical basis, they could.
I did not run into any practical ATOMIC codes that needed anything
remotely like multiple MMs as a single ATOMIC event.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 26 00:12:22 2023

Paul A. Clayton wrote:

On 12/25/23 1:38 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

[snip]

ASF and ESM still work even when there is no <data> cache.

The buffer holding 8 64-byte entries is a tiny data cache.

Yes but its name is the Miss Buffer on many machines. Most Miss
Buffers are not accessible like a cache.

A conservative filter could be used instead of a data cache for an
extended read set.

While providing compatibility across implementations is very
valuable, varying capability across implementations does not seem
horrible to me.

You still have to be able to make Architectural statements about
what does what. And if you want any semblance of compatibility
the lowest end machine has to be able to perform significant
ATOMIC events.

ESM sets the minimum and maximum capacity of an atomic operation
for all implementations. This means that a capacity failure in
any implementation is guaranteed to be one in all implementations.

For compatibility !!

But in all senses:: 8 cache lines is so much more powerful that
anything programmers have access to today, making it bigger or
its size vague is not worthy.

ESM also guarantees that atomic conflicts will be atomic
conflicts. A conservative filter would provide failures that are
not true conflicts. Limited-associativity cache-based tracking
would introduce evictions caused by the atomic memory accesses.

ESM still has a few spurious failure modes--especially in the
"methodological" operating mode.

(A cuckoo/elbow cache — skewed associative cache where a victim
from one index is reinserted at one of its alternative indexes —
could greatly increase the effective associativity. Copying cache
blocks may be preferable to failing a transaction. A modest victim
cache could buffer the copy activity and avoid reinsertion based
on new information or longer analysis of information.)

A FA Miss Buffer solves this without complexity.

Combining the optimism of no atomic conflicts with the optimism of
fitting into _effective_ capacity does not seem extremely wrong-
headed to me. ESM provides a way to moderate the atomic conflict
optimism with a "retry" that assumes N conflicts. Hardware might
be able to somewhat moderate effective capacity optimism (e.g.,
cuckoo cache, index recoding, borrowing capacity from a neighbor),
but I suspect using a non-optimistic fallback path would be
necessary.

Maybe only fools would push capacity limits, especially since
failing a large transaction is very expensive. (Perhaps the
ability to query estimated effective capacity might be useful
with a transaction. If a transaction could be split into
smaller transactions at significant cost — such that the
larger transaction would be preferred if there is low conflict
and capacity is sufficient — that choice might be made even
partially into the first attempt.)

If you look into the "math" you will see that the exponent e in
BigO( m^e ) is what bounds (minimum) system time in an ATOMIC
event. Making bigger ATOMIC events makes e smaller. This rule
is opposite what you propose above. Although there are some
events where you rule may take hold I suspect minimizing the
total number of events needed reduces overhead faster than
making events as small as possible { Test_and_Set() }.

[snip]

There are also optimizations for single contender, where
handoffs would be more practical. Even a case where a thread is
about to
complete an atomic that would write data that would change the
participants of a conflicting atomic operation might allow some
optimization where the almost complete atomic prefetches the new
participant data to the thread with the other atomic operation and
provides that atomic with an "I'm next" token.

Another possible optimization would be to commit an atomic that
is waiting on a cache miss for one granule in its read set. If the
atomic is guaranteed to commit, it can release all the non-
dependent members to other threads (the dependent members have to
be reserved until the data arrives from the cache miss but their
modification happens "before" the accesses of the released
members).

Be very careful, here.

I am somewhat aware that these are pushing into the hairy
complexity where even formal methods will not keep one from
losing both feet and most of both legs with death from
bleeding out if not from the explosion of the foot cannon
shell itself.

Be especially careful on which trigger you pull when both hands
hold guns pointing at your feet.

It is fun, however, to think of such possibilities.

Such would have the bad aspect that an uncorrectable
ECC error on the cache miss data would force additional memory
locations to be poisoned. (With limited versioned memory, a
rollback of the atomic event might be possible. Given that the
ECC error would be greatly delayed, I suspect undoing all the
dependent effects would be excessively complex and error-prone.)

Most systems take unrecoverable ECC failures are Machine Check.
You may be able to save the Guest OS, but you are not going to
be able to save the application.

Sometimes the application could be saved. If the data is in the
text segment, recovery may be trivial. If the data can be isolated
to a specific task and that task's starting data is available,
more localized restarting would be possible. Sometimes the
specific task could even be dropped (e.g., a Monte Carlo search
could just drop one random attempt with likely little harm).
Just killing the application may be the best option most of
time, but playing "what if" can be fun.

At great overhead, applications can avail themselves of recovery
{generally by checkpointing a large amount of state.} In general,
it does not happen very often outside of large simulations and
data-base applications.

Uncorrectable tag errors for possibly dirty data seem more
fatal. Poisoning 0.1% of memory that matches the cache block's
index seems extremely likely to kill the system. Isolating
cache allocations with a mechanism like logical partitioning
would allow such a failure to be isolated to a fraction of the
active programs, but even then a tag error seems really bad.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to MitchAlsup on Tue Dec 26 13:50:47 2023

MitchAlsup wrote:

Paul A. Clayton wrote:

There are also architectural choices involved. ESM does not HAVE TO hold
onto cache lines, what it cannot allow is access of write
permission. With this paradigm, one can lose cache lines to interference without loosing the ability to complete the ATOMIC event.

In my HTM design I do NOT use the coherence protocol Invalidate messages
to detect collisions, as Intel RTM does and AMD ASF would have.
Instead I use a separate set of Atomic Transaction (ATX) messages
which travel on the coherence network but are never sent to cache.
However the Atomic Transaction Manager (ATM) can emit coherence messages
and cause lines to move between caches.

Making the ATX messaging completely separate from Invalidate messages
was crucial to eliminating the "false sharing cancellation" problem,
where a line that was touched for bytes ADJACENT to the ones in the
transaction caused an Invalidate message that aborted the transaction.

IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).

One of the reasons I invented the ATOIC control point and automagic
control transfer to the control point on ATOMIC failure. HW has to
leave the IP pointing somewhere, so why not at the failure recovery
point ??

I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got

ESM mandates all TLB misses in the setup phase, so that none of the participating STs miss in the TLB. {{Details matter}}

This is why I added a PREFETCH_TRANSLATION instruction so one can
pre-load the TLB for data without actually touching the data item and
causing the cache line to move between caches until the transaction
is actually in progress.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Paul A. Clayton on Tue Dec 26 13:18:37 2023

Paul A. Clayton wrote:

On 12/25/23 9:44 AM, EricP wrote:

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads and
stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent
LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.

It requires an extra cache line buffer to retain all SCH updates until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

Not sure how useful multi-word LL/SCH/SCR is though since all updates
must
be within one cache line. It can do atomic double, quad and octa-wide
CAS,
or atomic compare-exchange plus increment a generation number.

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

The LL/SC interface could, in fact, be extended to have the LL be
Transaction Begin and the SC be Transaction End with arbitrary
scope. I get the impression that most ISAs provide a success
result in a GPR, so multiple failure modes could be provided.

Such extensions would only require updating documentation (and compilers/libraries and hardware implementations). Hardware could
be updated before the documentation not even necessarily needing
"chicken bits" to disable the functionality.

A natural progression of scope seems to be single access read-and-
write set, read and write set both within a single granule, write
set in single granule and read set extended to largish number of
granules, write set extended to several granules, finally perhaps
large number of granules for the write set.

(I think largish read sets are easy enough to support that
"several granule" read sets would not be a necessary step.)

Performance optimizations could be interleaved with expanding
scope.

Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to EricP on Wed Dec 27 00:24:48 2023

EricP wrote:

MitchAlsup wrote:

Paul A. Clayton wrote:

There are also architectural choices involved. ESM does not HAVE TO hold
onto cache lines, what it cannot allow is access of write
permission. With this paradigm, one can lose cache lines to interference
without loosing the ability to complete the ATOMIC event.

In my HTM design I do NOT use the coherence protocol Invalidate messages
to detect collisions, as Intel RTM does and AMD ASF would have.
Instead I use a separate set of Atomic Transaction (ATX) messages
which travel on the coherence network but are never sent to cache.
However the Atomic Transaction Manager (ATM) can emit coherence messages
and cause lines to move between caches.

Making the ATX messaging completely separate from Invalidate messages
was crucial to eliminating the "false sharing cancellation" problem,
where a line that was touched for bytes ADJACENT to the ones in the transaction caused an Invalidate message that aborted the transaction.

IBM z/Architecture provided "constrained transactions" that
are guaranteed to complete (hardware retry and other mechanisms).

One of the reasons I invented the ATOIC control point and automagic
control transfer to the control point on ATOMIC failure. HW has to
leave the IP pointing somewhere, so why not at the failure recovery
point ??

I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got

ESM mandates all TLB misses in the setup phase, so that none of the
participating STs miss in the TLB. {{Details matter}}

This is why I added a PREFETCH_TRANSLATION instruction so one can
pre-load the TLB for data without actually touching the data item and
causing the cache line to move between caches until the transaction
is actually in progress.

If you use a PRE (prefetch) instruction and specify DRAM as the location
to prefetch, you end up only making sure the TLB is loaded.

PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

I have also run into a couple of lock-like situations where one PRE's
the container with .lock, and later PUSH's the same container (also
with .lock) to end the ATOMIC event without ever seeing the contained
value.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to MitchAlsup on Wed Dec 27 09:51:47 2023

MitchAlsup wrote:

EricP wrote:

MitchAlsup wrote:

I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got

ESM mandates all TLB misses in the setup phase, so that none of the
participating STs miss in the TLB. {{Details matter}}

This is why I added a PREFETCH_TRANSLATION instruction so one can
pre-load the TLB for data without actually touching the data item and
causing the cache line to move between caches until the transaction
is actually in progress.

If you use a PRE (prefetch) instruction and specify DRAM as the location
to prefetch, you end up only making sure the TLB is loaded.

PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

Also the prefetch translation should, if necessary, set the PTE's
Accessed and Dirty bits as setting them requires transferring the
PTE's cache line as exclusive into that D$L1 cache, which also takes time.

Gets as much stuff done as possible that would cause transaction latency
before the transaction starts, and minimize the size of any contention
window.

I'm not sure how much help the would really be though as many shared
data structures are accessed through pointers, and to get those pointers
you have to read the structure.

For example, in a double linked list insert you could PRE-fetch the
translation for the list header, but to get the pointer to the first
object you have to read the header which requires transferring the
header's line this cache, which you don't want to do until the
transaction actually starts because doing so prematurely would just
delay the current updater's transaction.

I have also run into a couple of lock-like situations where one PRE's
the container with .lock, and later PUSH's the same container (also with .lock) to end the ATOMIC event without ever seeing the contained
value.

You should mention that your PUSH instruction is a cache flush
"used to push data out of the cache back towards memory".

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to EricP on Wed Dec 27 17:17:31 2023

EricP wrote:

MitchAlsup wrote:

EricP wrote:

MitchAlsup wrote:

I think the constraints avoided repeated TLB/cache misses and
other issues that make reliably fast completion difficult. I got

ESM mandates all TLB misses in the setup phase, so that none of the
participating STs miss in the TLB. {{Details matter}}

This is why I added a PREFETCH_TRANSLATION instruction so one can
pre-load the TLB for data without actually touching the data item and
causing the cache line to move between caches until the transaction
is actually in progress.

If you use a PRE (prefetch) instruction and specify DRAM as the location
to prefetch, you end up only making sure the TLB is loaded.

PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
determine where the data (line) is prefetched to {L1, L2, L3, DRAM}.

Also the prefetch translation should, if necessary, set the PTE's
Accessed and Dirty bits as setting them requires transferring the
PTE's cache line as exclusive into that D$L1 cache, which also takes time.

If you want this semantic::

PRE RW-L1,[address].lock

instead of::

PRE RW-DRAM,[address].lock

Gets as much stuff done as possible that would cause transaction latency before the transaction starts, and minimize the size of any contention window.

I'm not sure how much help the would really be though as many shared
data structures are accessed through pointers, and to get those pointers
you have to read the structure.

These prefetches should be of the former rather than later.

For example, in a double linked list insert you could PRE-fetch the translation for the list header, but to get the pointer to the first
object you have to read the header which requires transferring the
header's line this cache, which you don't want to do until the
transaction actually starts because doing so prematurely would just
delay the current updater's transaction.

I have also run into a couple of lock-like situations where one PRE's
the container with .lock, and later PUSH's the same container (also with
.lock) to end the ATOMIC event without ever seeing the contained
value.

You should mention that your PUSH instruction is a cache flush
"used to push data out of the cache back towards memory".

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to EricP on Thu Dec 28 14:10:51 2023

EricP wrote:

Paul A. Clayton wrote:

On 12/25/23 9:44 AM, EricP wrote:

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads and >>>>>>> stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent
LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains
the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.

It requires an extra cache line buffer to retain all SCH updates until
the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

Not sure how useful multi-word LL/SCH/SCR is though since all updates
must
be within one cache line. It can do atomic double, quad and octa-wide
CAS,
or atomic compare-exchange plus increment a generation number.

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.

On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
The interrupt would cancel the lock, and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
and so would erroneously update memory, whereas a SCH would be tossed
because the lock was no longer set.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to EricP on Thu Dec 28 22:42:47 2023

EricP wrote:

EricP wrote:

Paul A. Clayton wrote:

On 12/25/23 9:44 AM, EricP wrote:

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads and >>>>>>>> stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block.
I can't think of any way to make use of that since only the most recent >>>> LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains >>>> the line lock, and SCR Store Conditional Release which conditionally
applies all SCH and SCR updates and releases the line lock.

It requires an extra cache line buffer to retain all SCH updates until >>>> the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

Not sure how useful multi-word LL/SCH/SCR is though since all updates
must
be within one cache line. It can do atomic double, quad and octa-wide
CAS,
or atomic compare-exchange plus increment a generation number.

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.

On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.

If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.

If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the store can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the the event fails and none of the stores appear to have been
performed.

In any event, if the thread performing the event cannot be completed,
due to transfer of control to more privilege operation, the event fails
control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.

The interrupt would cancel the lock,

ESM cancels the event not the lock.

and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,

Upon return you are no longer "in" the ATOMIC event (control transfers
to the control point before control transfers to the more privileged
operation.

and so would erroneously update memory,

So, there is no control transfer "back" inside the ATOMIC event.

whereas a SCH would be tossed
because the lock was no longer set.

There should be no read or write set upon return.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to MitchAlsup on Fri Dec 29 12:17:53 2023

MitchAlsup wrote:

EricP wrote:

EricP wrote:

Paul A. Clayton wrote:

On 12/25/23 9:44 AM, EricP wrote:

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads and >>>>>>>>> stores within the single reservation.

ARM has specified that any store by the processor _may_ cause
the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block. >>>>> I can't think of any way to make use of that since only the most
recent
LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed
multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains >>>>> the line lock, and SCR Store Conditional Release which conditionally >>>>> applies all SCH and SCR updates and releases the line lock.

It requires an extra cache line buffer to retain all SCH updates until >>>>> the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

Not sure how useful multi-word LL/SCH/SCR is though since all
updates must
be within one cache line. It can do atomic double, quad and
octa-wide CAS,
or atomic compare-exchange plus increment a generation number.

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.

On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.

If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.

If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the store can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the the event fails and none of the stores appear to have been performed.

In any event, if the thread performing the event cannot be completed,
due to transfer of control to more privilege operation, the event fails control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.

Yes, my HTM has some similarities.
It has a formal start to the transaction that establishes an abort
transfer RIP address for when an interrupt or exception occurs.
That state change in the RIP indicates the transaction has terminated.

I have two sets of instructions, the usual atomic RMW and LL/SCH/SCR ones,
and a completely different set for atomic transactions.

The LL/SCH/SCR is simple and cheap and could be easily retrofitted into
a core with LL/SC. However it has no abort transfer address and so needs
some other means to distinguish an atomic store from a regular one.
Thus SCH as a distinct instruction is required.

The interrupt would cancel the lock,

ESM cancels the event not the lock.

That reference to "lock" came from the LL Load Locked instruction.

My AT instructions refer to "canceling the transaction".

and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,

Upon return you are no longer "in" the ATOMIC event (control transfers
to the control point before control transfers to the more privileged operation.

Right, for multi-line transactions.

The LL/SCH/SCR instructions are not a multi-line transaction.
They are a single cache line atomic multiple update.

and so would erroneously update memory,

So, there is no control transfer "back" inside the ATOMIC event.

Right, in atomic transactions.

I see atomic RMW and LL/SCH/SCR as one class of instruction
as they all deal with updates to a single cache line
and their HW is optimized for just that one operation.

The AT Atomic Transaction instructions are completely separate,
dealing with multiple updates to multiple lines,
and having more flexibility but also more overhead.

whereas a SCH would be tossed
because the lock was no longer set.

There should be no read or write set upon return.

Yes, in a transaction there isn't - the interrupt/exception cancels the transaction tossing all pending guarded memory changes,
updates the status register, and sets the RIP to the abort address.

Though I have been thinking lately that might be overly conservative.
It might not need to cancel a transaction just because there is an interrupt, as long as the interrupt handler doesn't initiate a new transaction or
read or write any memory that is a member of the transaction.
That might cut down on spurious unnecessary transaction aborts.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to EricP on Fri Dec 29 20:45:01 2023

EricP wrote:

MitchAlsup wrote:

EricP wrote:

EricP wrote:

Paul A. Clayton wrote:

On 12/25/23 9:44 AM, EricP wrote:

Scott Lurndal wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 12/24/2023 10:39 AM, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

I also see little difficulty
in extending LL/SC active regions to include additional loads and >>>>>>>>>> stores within the single reservation.

ARM has specified that any store by the processor _may_ cause >>>>>>>>> the SC to (Store Exclusive) to fail.

Same as on Alpha.
Alpha rules only required LL and SC to be to the same 16 byte block. >>>>>> I can't think of any way to make use of that since only the most
recent
LL is active at a time.

I had an idea a while back for an extension to LL/SC which allowed >>>>>> multiple LL's as long as they were for the same cache line,
and SCH Store Conditional Hold which conditionally updates and retains >>>>>> the line lock, and SCR Store Conditional Release which conditionally >>>>>> applies all SCH and SCR updates and releases the line lock.

It requires an extra cache line buffer to retain all SCH updates until >>>>>> the SCR update succeeds then they all update the cache together.
Other than that, it is basically the same as LL/SC.

Not sure how useful multi-word LL/SCH/SCR is though since all
updates must
be within one cache line. It can do atomic double, quad and
octa-wide CAS,
or atomic compare-exchange plus increment a generation number.

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.

On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.

If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.

If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the store can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the the event fails and none of the stores appear to have been
performed.

In any event, if the thread performing the event cannot be completed,
due to transfer of control to more privilege operation, the event fails
control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.

Yes, my HTM has some similarities.

Yes, I see lots of similarities--most of the differences are down in
the minutia.

It has a formal start to the transaction that establishes an abort
transfer RIP address for when an interrupt or exception occurs.
That state change in the RIP indicates the transaction has terminated.

I have two sets of instructions, the usual atomic RMW and LL/SCH/SCR ones, and a completely different set for atomic transactions.

The LL/SCH/SCR is simple and cheap and could be easily retrofitted into
a core with LL/SC. However it has no abort transfer address and so needs
some other means to distinguish an atomic store from a regular one.
Thus SCH as a distinct instruction is required.

I am assuming SCH is SC and Hole while SCR is SC and Release.

The interrupt would cancel the lock,

ESM cancels the event not the lock.

That reference to "lock" came from the LL Load Locked instruction.

My AT instructions refer to "canceling the transaction".

Which I call an event==transaction. In any event (sic); the event does
not span (persist across) control transfers to other privilege levels.
It is this point that requires the establishment of the failure control
point.

and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,

Upon return you are no longer "in" the ATOMIC event (control transfers
to the control point before control transfers to the more privileged
operation.

Right, for multi-line transactions.

The LL/SCH/SCR instructions are not a multi-line transaction.
They are a single cache line atomic multiple update.

and so would erroneously update memory,

So, there is no control transfer "back" inside the ATOMIC event.

Right, in atomic transactions.

I see atomic RMW and LL/SCH/SCR as one class of instruction
as they all deal with updates to a single cache line
and their HW is optimized for just that one operation.

The AT Atomic Transaction instructions are completely separate,
dealing with multiple updates to multiple lines,
and having more flexibility but also more overhead.

whereas a SCH would be tossed
because the lock was no longer set.

There should be no read or write set upon return.

Yes, in a transaction there isn't - the interrupt/exception cancels the transaction tossing all pending guarded memory changes,
updates the status register, and sets the RIP to the abort address.

By the time the exception/interrupt has been taken care of, millions of
cycles can have gone by (millisecond at this point) so the state of the concurrent data structure is stale as seen by the thread receiving
control in the middle of an event--the probability of the pending
ATOMIC event succeeding is so remote that you should not lean in
that direction.

Though I have been thinking lately that might be overly conservative.
It might not need to cancel a transaction just because there is an interrupt, as long as the interrupt handler doesn't initiate a new transaction or
read or write any memory that is a member of the transaction.
That might cut down on spurious unnecessary transaction aborts.

There might be hundreds of CPUs all doing "transactions" a few percent of
their time, to any number of resources.

In My case, failing the event on exception/interrupt is a way of
minimizing the amount of state that is needed to be saved/restored;
and a way of minimizing priority inversion.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Dec 29 22:06:46 2023

Chris M. Thomasson wrote:

On 12/21/2023 1:49 PM, Chris M. Thomasson wrote:

On 12/20/2023 8:29 AM, MitchAlsup wrote:

Stefan Monnier wrote:

Isn't that as it should be? Hardware provides some simple, possibly >>>>> inconvenient interface, and software abstracts the inconvenience
away; e.g., RISCs provide only a load/store interface, and compilers >>>>> translate the stuff programmers write into this interface. The
difference is that with RISCs the result is a least as fast as
architectures that tried to cater more to the programming language, >>>>> while I am convinced that hardware where the designers design for >>>>> efficient sequential consistency will let programmers write software >>>>> that handily outperforms using libraries or system calls on hardware >>>>> designed for weaker memory models.

While I agree that CPUs should provide sequential consistency

It's nice that they already do. The fun part is creating algorithms that
try to avoid it. Not only avoid seq_cst, but avoid a acquire and release
membars, and the fun acq_rel acquire release all in one... ;^) Now,
there are certain algorithms that, as-is, cannot avoid it (seq_cst).
Original SMR and, even Dekkers algorithm require seq_cst...

(it would
make life easier for debugging, formalization, and reasoning, and
I can't think of any reason why it should have a significant cost), I'd >>>> be surprised if it "handily outperforms" the status quo: concurrent
programming is hard even with sequential consistency, so most normal
code would still continue using the exact same libraries.

[...]

Keep in mind that acquire and release are not as strong as a seq_cst
barrier. Wrt the SPARC:

acquire = #LoadStore | #LoadLoad
release = #LoadStore | #StoreStore

Not a #StoreLoad in sight!

SunOS had a funny quirk--it used cacheable locks to guard uncacheable
stores. SuperSPARC had a slow cache-to-cache protocol and nobody saw
the timing hole until the Ross Colorado chips could pass data cache to
cache faster and another CPU could see the lock unlocked before the
uncacheable store arrived at {what we call} North Bridge {today}; and
this timing hole could crash the OS.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Quadibloc@21:1/5 to MitchAlsup on Sat Dec 30 04:08:47 2023

On Tue, 26 Dec 2023 00:12:22 +0000, MitchAlsup wrote:

Be especially careful on which trigger you pull when both hands hold
guns pointing at your feet.

Be especially careful, yes.

But not primarily about _which trigger you pull_, since pulling *either* trigger will have the bad result of shooting oneself in the foot. Instead, first point the guns in some other direction before pulling either
trigger!

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Quadibloc@21:1/5 to MitchAlsup on Sat Dec 30 04:17:26 2023

On Fri, 29 Dec 2023 22:06:46 +0000, MitchAlsup wrote:

SunOS had a funny quirk--it used cacheable locks to guard uncacheable
stores.

That _sounds_ bad, at first. Surely it's a terrible security hole or
something.

But why are some memory addresses marked as not cacheable? Usually,
it's because they aren't really RAM, but instead I/O ports or
something like that.

The lock, which indicates which process those addresses currently
belong to, is just data. So it isn't uncacheable for the same
reason as addresses like that.

Of course, while in _some cases_ this might make perfect sense, there
may well be other cases where this really is the horrible mistake it
seems to be on first sight.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Quadibloc on Sat Dec 30 16:51:56 2023

Quadibloc wrote:

On Fri, 29 Dec 2023 22:06:46 +0000, MitchAlsup wrote:

SunOS had a funny quirk--it used cacheable locks to guard uncacheable
stores.

That _sounds_ bad, at first. Surely it's a terrible security hole or something.

It was more like the differences between MESI and MOESI. MESI has
cache to cache transfers go through memory, MOESI has cache to
cache transfers go cache to cache.

But why are some memory addresses marked as not cacheable? Usually,
it's because they aren't really RAM, but instead I/O ports or
something like that.

In this particular case, the locations were MMI/O.

The lock, which indicates which process those addresses currently
belong to, is just data. So it isn't uncacheable for the same
reason as addresses like that.

Of course, while in _some cases_ this might make perfect sense, there
may well be other cases where this really is the horrible mistake it
seems to be on first sight.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 31 19:33:06 2023

Paul A. Clayton wrote:

On 12/28/23 2:10 PM, EricP wrote:

EricP wrote:

Paul A. Clayton wrote:

[snip extended LL/SC design]

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.

On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
The interrupt would cancel the lock, and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
and so would erroneously update memory, whereas a SCH would be tossed
because the lock was no longer set.

I had not considered this consequence of an interrupt, but this
issue would be "trivially" handled by distinguishing between
reservation active and reservation valid (i.e., use two bits with
one preserved across interrupts). The hardware would then know
that it is within a ll/sc enclosed code sequence and stores would
merely be buffered as usual until the sc was encountered. Advanced
hardware for such a primitive interface, knowing that stores would
not be preserved could elide them.

If taken the way I think Paul intends, that means the read and
write sets are part of thread saved state. It is nuances like
exceptions and interrupts; why My 66000 abandons the event if
it has to perform more privileged control transfer.

For a single cache block reservation, having to run through the
rest of a known-failed atomic attempt may not be problematic. With
broader uses, that would be painful; any failure would have to run
through the rest of the attempt before being able to consider
retrying. If the state on a failure is architected as boundedly
undefined, a simple scan for the sc (no execution to produce
identical final state) would suffice. It might be possible to add
this imprecise state aspect when allowing multiple accesses so
hardware could safely just scan after the second non-ll/sc memory
access.

A microarchitecture could theoretically cache sc locations similar
to a branch target buffer, but memoizing such seems difficult and
a prediction would have to be confirmed (e.g., by scanning for the
next sc while speculating that the predicted location was
correct). That seems needlessly complex.

When one considers all of the effects of {memory order, branch prediction, exceptions, interrupts, machine checks,...} it is better for the complexity management of a project* to abandon an ATOMIC event when "control gets hard".
I have seen and been in the debates of "how do we get out of this mess"
often enough not to want to repeat it again.

(*) and the possibly more important:: "ability of SW to reason about what happened and why";

It is not terribly difficult to imagine ways of addressing this
deficiency. One might be able to get away with reinterpreting a
branch instruction as setting a failure jump point, avoiding
adding instructions. Such a branch would have to use a nonsensical
condition like "if not equal to self" (assuming no one has used
that as a nop before the architecture decided to reserve it).
Short attempts would not include the branch and just run through
to find the sc; simple attempts that are somehow known not to have
only transient failure causes might target the ll instruction
producing a hard retry loop. One could also consider each register
tested for non-equality to self as providing additional control
information.

Given that opcode space is usually not so tight as to prevent a
secondary atomic interface, but one would really want to have the
second be comprehensive to avoid a third, fourth, fifth, etc.
Extending ll/sc semantics — without adding instructions — might
have allowed experience/feedback to accumulate for a second try.
(I do believe that one could substantially extend the semantics
without having to add instructions. When starting from a clean
slate is better idea seems usually to be realized late. "Free"
extensions also discourages trying again rather than trying to
fix an interface that mostly sort-of works.)

Given that none of the RISCs have actually extended ll/sc
semantics, the pressure to provide broader atomics must not have
been particularly great.

Most of those machines have only 1 OS to port (Linux) and most of
the Linux port requires only the ATOMICs that x86 provides.
Most of the Linux environment is not multi-threaded,
Most of the multi-threaded Linux applications requires only the
ATOMICs that x86 provides.

Thus, the chicken and egg problem.

LL/SC is sufficient to replicate the ATOMICs x86 provides.

Remember Sun Microsystems took 5 years of programmer effort in
reducing the lock contention in SOLARIS--an effort that nobody
in the Linux community can afford to perform.

ASF did better in checkpointing the start such that a failure
jumped to a known location

yes

that performed a test for
success/failure and branched to handling code.

not necessarily. If you arrived you knew the event had failed.
ASF provided a mechanism whereby you could refine your understanding
of the failure and branching based on that.

ASF also preserved
the stack pointer; this does not seem a substantial benefit as
short attempts could probably avoid changing the stack pointer and
the relative overhead of explicitly saving the stack pointer is
not that great if the attempt is large.

The SP thing had to do with PUSH/POP model in x86 and automagic
control transfer.

Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.

RELEASE was after I had left AMD.

ASF (under my tutelage) and ESM were ATOMIC events not restricted
Transactional Memory things.

An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.

An instruction like RELEASE violates the definition and, in essence,
takes the model from ATOMIC towards Transactional by one tiny but
important step--a step I was not willing to take--mainly because
I could not come up with a reasonable definition of ATOMIC if it
had that semantic.

ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.

Removing reservations avoids some capacity constraints, though

Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.

I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.

ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.

Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 31 20:23:23 2023

Paul A. Clayton wrote:

On 12/26/23 7:24 PM, MitchAlsup wrote:
[snip]

If you use a PRE (prefetch) instruction and specify DRAM as the
location
to prefetch, you end up only making sure the TLB is loaded.

PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

What if there is a cache after L3?

Consider L3 as LLC.

What if one wants to prefetch
to a cache layer that is directly connected to the memory
controller (no extra coherence traffic, available fairly in terms
of performance to all agents [except possible processor-in-
memory], etc.)?

L3 in My 66000 Architecture is defined to be a read/write buffer
used by Memory controller and DRAM controller. L3 is a cache of
what DRAM either looks like now or should look like eventually.
I specify this organization as a means to eliminate RowHammer.

What if the design has L3 slices which are home
nodes in terms of coherence state but allow allocation to a non-
home node with non-home allocations having lower associativity? (I

This is not an allowable L3 configuration in My 66000 Architecture.

do not know if such would be sensible. Potential full use of all
L3 capacity by one core has advantages and home nodes must have
some advantage; the former could be accomplished by using other L3
slices as victim caches, but those then become L4 caches.) There
are presumably many other design points that would make the
4-level designation less than complete.

2-points
1) there are a limited number of bits (5) and a limited number of
cache horizons.
2) SW has enough trouble prefetching before getting into prefetch
from DRAM to L3 but no higher into the hierarchy, then later
prefetch into L2 from L3 if it is still there,.....
Not enough SW understand how to effectively utilize prefetch before
one gets into prefetch to where--and the converse case post-push
to where.

I am also curious about where a prefetch would be placed with RE
(or WE/RWE) permission for a cache level that could be either
instruction or data. RE _might_ make sense for some corner cases
such as checking if an executable chunk gives a matching CRC code
immediately before executing. Allocation to both caches might make
sense for RE.

RxE is prefetchable directly into L2 UCache (but no higher)
--E is prefetchable directly into L1 ICache
Rx- is prefetchable directly into L1 DCache

WE/WRE might make sense as a hint that the block is going to be
written and shortly executed (in the split cache level case
possibly by different cores if that level had a shared Icache
level).

For WRE, I could see allocating the data to the data cache of the
specified level (possibly with a placement bias to favor cheaper
transfer to the instruction cache,

That placement bias is L2 BTW.

or perhaps an icache might send
some victims to a typically not shared cache level), but it is
less clear if the PTE should loaded to both data and instruction
TLBs if "refreshing" the TLB entry to allow execution (assuming it
was denied for writable pages) was cheaper.

Fast shared L2 TLB solves this problem.

There are are bypassing, NUCA, replacement metadata, and perhaps
other factors. One might want to prefetch into a cache level with
setting of a "Keep Me" bit to choose another victim the first time
that block is chosen as a victim (or a probabalistic ignoring of
the replacement choice, which might then want the probability).

A cache is either an invisible companion that reduces latency of
memory accesses, or it is a programmable store unit which requires
direct access to tag storage. You cannot have it both ways.

("Keep Me" could be useful for prefetches that may be excessively
early but are almost always used "soonish".)

Keep Me requires access to the LRU or NRU control bits in the tag.

Similarly, a cache
might have metadata urging earlier eviction (set 'LRU' for LRU and
tree pLRU, set unaccessed for bit-vector NRU); this might also
involve more than one state as possible choices (e.g., NRU+LRU
pair might set unaccessed and LRU or unaccessed and MRU).

99.44% of programmers could not use these features. Of the 0.56%
who can, 80% of them work in OS Kernels.

Metadata/placement may also be complicated on a cache hit. If a
prefetch targets L2 but hits L1, should it be treated as a nop?

The only metadata I ever allows configuration access to in the
cache was to lock lines out of being available for allocation;
(bad bit in the line or tag). So, a 4-way set associative cache
with a bad bit goes to (2^n-1)-lines of 4-was SA and 1 line of
3-way SA. The dead line does not participate in allocation and
cannot hit.

What if L2 is shared by multiple cores and the block is dirty or
L2 does not include L1s? What if the L1 state was "next victim"?
Early victimizing _might_ make sense, but changing to "last
victim" might also make sense?

If L2 does not filter SNOOPs the L1s are SNOOPed.
When a line is reallocated
a) modified data migrates outward
b) new data replaces old data

One might also want a prefetch that prepares a hardware prefetch
engine. Such a software prefecth might want metadata about the
type of hardware prefetching (unit stride/non-unit stride/address
is pointer/generic hardware prefetch).

Your advocation for the use of prefetch is admirable. Truly !

The problem with prefetch is that SW cannot use it effectively
and efficiently, and in ways that preserve code over multiple
generations--one end might need to prefetch only 100 cycles
in front of use whereas the other end might need to prefetch
several thousand cycles in front of use.

A second problem is that prefetch is NECESSARILY specified as
a HINT because some other piece of code (yours or not) may
touch the memory hierarchy and the prefetch you so carefully
performed goes away before you come for the actual data.

SW must know about latency model it is using prefetch
to ameliorate !! But which one ??

Caching of paging structures further increases the possibilities.
What if one wants to prefetch the translation information for the
table level normally one above the last (i.e., either a huge page
PTE or a second level PDE)? (This might apply if software believes
a future access will be in a region. A hash table or random access
array might have access behavior that software knows well ahead
that the data structure will be used soonish but far enough away
that earlier prefetching translation information provides some
value over just doing a demand load.)

What if one wants to prefetch translation information to an L2
structure or a structure associated with a particular level of
cache (e.g., prefetch to L2 memory cache TLB to accelerate
indirection and page-crossing stride prefetches for the hardware
prefetcher at the L2 memory cache)?

Another what if for you::

What if HW just wanted to keep things simple ?? so they can get the
chip out the door.

HW has granted prefetch to SW in order for a few carefully chosen use cases--not generally for use by everyone on every application.

My 66000 also has the contrapositive of Post PUSH; migrating lines toward
DRAM, disposing of write permission (R--L1) and a few other subtle uses.
But each capability handed to SW comes with the responsibility to use
it effectively and efficiently. I see scant evidence of this.

In that regard:: HW prefetchers have proven easier for all concerned.

HW prefetchers can easily lock onto multiple simultaneous stride-based
prefetch strategies, so should SW use prefetch for these or not ??
HW here knows what machine it is running on and knows how far in
front to prefetch !! Can SW ?? across multiple implementations ??
on Big-Little implementations ??

HW prefetchers cannot easily lock onto multiple simultaneous linked-
list prefetching since this goes all the way back to the TLBs from
using virtual addresses--but SW cannot prefetch these since it needs
the next address to prefetch the next next access.. {chicken and egg}

So, which this in mind--what kind of access pattern can SW use prefetch
for that really improves performance ?? {Indexing is already a solved
HW prefetch problem, pointer chasing will never be SW solvable.}

Most of the above might be obviously bad ideas for all reasonable implementations, but I suspect that even experts would not know
all reasonable choices for the lifetime of the interface. Having
such configuration bits reference a table is well-known method for
adjusting to future changes (or inadequacy of

Architecture is as much about knowing what to leave out as what to put
in !

My 66000 has PRE and PUSH that solve known problems and might (MIGHT)
be able to help more general SW prefetching. My suspicion is that
HW prefetchers will get better than SW use of prefetch instructions.

Do you want "a bunch of CPUID data" so SW can decide on how far in
front of use and attempt at being effective and efficient ?? What
about Triangular calculations where "how far in front" changes
as the calculation proceeds ??

One might also have hints that are more directives (i.e., prevent
or disincentivize hardware from ignoring the hint or adjusting its
action) and this opens a cooperation space that might be quite
complex. Software might want to specify what hardware metadata is
given what weight. Software might have an estimated use distance
probability map (possibly with a model for probabilities of use by
all other agents agent before and/or after use), but presenting
hardware with 4KiB of metadata for a 64-byte prefetch seems not
quite right.

I have also run into a couple of lock-like situations where one PRE's
the container with .lock, and later PUSH's the same container
(also with .lock) to end the ATOMIC event without ever seeing the
contained
value.

When would one want to guard/reserve a cache block (.lock) without
accessing such?

Grabbing a lock (monitoring a line for interference) while doing 1 or
more accesses to MMI/O registers. ESM performs locking not based on
the value of data, but on the access and rights of an address across
the event. The Lock guards something else, and as long as you can
guarantee ATOMIC behavior, you don't actually need the lock's data.

Perhaps there is a coarser-grained software lock
that is checked without being monitored/reserved (or the
reservation is dropped quickly) such that the atomic event could
monitor the specific cache block without failing by holding the
more frequently used software lock block as reserved? PUSHing the
data outward could make sense if the data is likely to be used
soon by some other agent, but it is not clear why that should be
'locked' (unless that is necessary to end the event?)

I suspect I am missing something, possibly something obvious to
anyone familiar with shared memory programming.

Architecture is as much about knowing what to leave out as what to put
in ! You and Quadriblock are misconsidering the former while overdoing
the later.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 31 20:40:35 2023

Paul A. Clayton wrote:

On 12/25/23 7:12 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

[snip]

While providing compatibility across implementations is very
valuable, varying capability across implementations does not seem
horrible to me.

You still have to be able to make Architectural statements about
what does what. And if you want any semblance of compatibility
the lowest end machine has to be able to perform significant
ATOMIC events.

ESM sets the minimum and maximum capacity of an atomic operation
for all implementations. This means that a capacity failure in
any implementation is guaranteed to be one in all implementations.

For compatibility !!

But in all senses:: 8 cache lines is so much more powerful that
anything programmers have access to today, making it bigger or
its size vague is not worthy.

I think part of the question is how acceptable address-based
(variable internal caching conflicts) and capacity-based failures are.

If software does not know that allocations are specifically
aligned, accessing five 32-byte objects might or might not exceed
the 8-entry capacity of ESM.

Expecting ATOMIC behavior across cache line boundaries has NEVER been guaranteed. So SW that deals with ATOMIC stuff needs to understand
this {along with false sharing,...} and take appropriate actions
at the language level and malloc()/new level.

If alignment-caused "capacity"
failures are acceptable, why are address-based associativity
conflicts so extremely problematic?

False sharing make thinks look like interference and harms
performance at the system level by performing too many bus
transfers.

(Yes, in general, ensuring
that participating accesses fit within N cache blocks is easier
than ensuring that associativity constraints are handled [other
than by restricting block count to the associativity], but this is
not an absolute distinction.)

ESM also guarantees that atomic conflicts will be atomic
conflicts. A conservative filter would provide failures that are
not true conflicts. Limited-associativity cache-based tracking
would introduce evictions caused by the atomic memory accesses.

ESM still has a few spurious failure modes--especially in the
"methodological" operating mode.

I was not considering exceptions, but I guess there are other
failure modes beyond capacity and contention.

Consider you are several checkpoints down a path, and have several
outstanding cache misses and an interrupt transpires. What are you
(the CPU) allowed to do with the returning cache lines since those
memory reference instructions did not survive to retirement ??
a) If you put them in the cache you have a Spectré side channel.
b) if you do not, you need to migrate them somewhere Ownership does
not get misunderstood.

(A cuckoo/elbow cache — skewed associative cache where a victim
from one index is reinserted at one of its alternative indexes —
could greatly increase the effective associativity. Copying cache
blocks may be preferable to failing a transaction. A modest victim
cache could buffer the copy activity and avoid reinsertion based
on new information or longer analysis of information.)

A FA Miss Buffer solves this without complexity.

But full associativity tends to constrain capacity.

That is why registers stop at 32 and miss buffers stop at 8.

This is not a consideration if a particular capacity is sufficient
for all implementations for the lifetime of the interface — which
you judge to be the case — but if one wants to allow use of the
same interface for implementations supporting larger capacity then
not relying of full associativity may be justified.

No, I expect that in 2 decades there will be enough experience with
ESM that capacity extension would be desired. Here code that did
work on early implementations continues to work, but more codes
can now be performed with the new capacity limits.

[snip]

If you look into the "math" you will see that the exponent e in
BigO( m^e ) is what bounds (minimum) system time in an ATOMIC
event. Making bigger ATOMIC events makes e smaller. This rule
is opposite what you propose above. Although there are some
events where you rule may take hold I suspect minimizing the
total number of events needed reduces overhead faster than
making events as small as possible { Test_and_Set() }.

If supporting larger events reduces system time, why forever bound
capacity to 8 64-byte cache blocks?

BECAUSE I ALREADY HAVE SOMETHING IN THE CORE THAT IS PERFORMING
THE THINIGS I NEED TO PERFORM, so the number 8 was essentially FREE.

SECONDLY, 8 IS SO FAR BIGGER THAN ANYONE HAS EXPERIENCE WITH, THAT 8
IS EFFECTIVELY infinity.

What to leave out, what to put in.

I can understand your previous statements about wanting to provide
something cheap that is usable for more than all known small
atomic operations. I.e., a universal baseline guarantee makes
sense even to me and having the baseline by 8 (when I think you
mentioned the largest atomics only used 6) gives substantial room
for software growth. However, allowing a relatively large read set
seems to be a plausible extension. (Extensions can be temporal incompatibilities or also 'spatial' incompatibilities.)

The bigger you make something architectural, the more you burden the
lowest members of you implementable architectures.

And I really do want forward and backward compatibility.

I appear to value compatibility much less than you. I think it is
useful to have an interface that can be extended even though such
breaks compatibility.

You can always make it bigger without compatibility problems.

Even 'spatial' incompatibility seems to have some potential
benefits. A special purpose system might want to use established
interface elements but make interface changes that fit the purpose.

Interface versions and subsets/profiles allow communicating
interface specifics more simply and communicating guarantees from
vendors and demands from users. Managing diversity has challenges
but prohibiting diversity (which includes temporal diversity/
extensions as such introduces multiple interfaces).

Some of this is just naming/branding (and the Ship of Theseus
problem). As a lumper, I am biased toward broad major categories
and consideration of similarities.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to MitchAlsup on Tue Jan 2 08:06:25 2024

MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/26/23 7:24â€¯PM, MitchAlsup wrote:
[snip]

If you use a PRE (prefetch) instruction and specify DRAM as the location >>> to prefetch, you end up only making sure the TLB is loaded.

PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

What if there is a cache after L3?

Consider L3 as LLC.

                                   What if one wants to prefetch
to a cache layer that is directly connected to the memory
controller (no extra coherence traffic, available fairly in terms
of performance to all agents [except possible processor-in-
memory], etc.)?

L3 in My 66000 Architecture is defined to be a read/write buffer
used by Memory controller and DRAM controller. L3 is a cache of
what DRAM either looks like now or should look like eventually.
I specify this organization as a means to eliminate RowHammer.

                What if the design has L3 slices which are home
nodes in terms of coherence state but allow allocation to a non-
home node with non-home allocations having lower associativity? (I

This is not an allowable L3 configuration in My 66000 Architecture.

do not know if such would be sensible. Potential full use of all
L3 capacity by one core has advantages and home nodes must have
some advantage; the former could be accomplished by using other L3
slices as victim caches, but those then become L4 caches.) There
are presumably many other design points that would make the
4-level designation less than complete.

2-points
1) there are a limited number of bits (5) and a limited number of
cache horizons.
2) SW has enough trouble prefetching before getting into prefetch
from DRAM to L3 but no higher into the hierarchy, then later
prefetch into L2 from L3 if it is still there,.....
Not enough SW understand how to effectively utilize prefetch before
one gets into prefetch to where--and the converse case post-push
to where.

I am also curious about where a prefetch would be placed with RE
(or WE/RWE) permission for a cache level that could be either
instruction or data. RE _might_ make sense for some corner cases
such as checking if an executable chunk gives a matching CRC code
immediately before executing. Allocation to both caches might make
sense for RE.

RxE is prefetchable directly into L2 UCache (but no higher)
--E is prefetchable directly into L1 ICache
Rx- is prefetchable directly into L1 DCache

WE/WRE might make sense as a hint that the block is going to be
written and shortly executed (in the split cache level case
possibly by different cores if that level had a shared Icache
level).

For WRE, I could see allocating the data to the data cache of the
specified level (possibly with a placement bias to favor cheaper
transfer to the instruction cache,

That placement bias is L2 BTW.

                                   or perhaps an icache might send
some victims to a typically not shared cache level), but it is
less clear if the PTE should loaded to both data and instruction
TLBs if "refreshing" the TLB entry to allow execution (assuming it
was denied for writable pages) was cheaper.

Fast shared L2 TLB solves this problem.

There are are bypassing, NUCA, replacement metadata, and perhaps
other factors. One might want to prefetch into a cache level with
setting of a "Keep Me" bit to choose another victim the first time
that block is chosen as a victim (or a probabalistic ignoring of
the replacement choice, which might then want the probability).

A cache is either an invisible companion that reduces latency of
memory accesses, or it is a programmable store unit which requires
direct access to tag storage. You cannot have it both ways.

("Keep Me" could be useful for prefetches that may be excessively
early but are almost always used "soonish".)

Keep Me requires access to the LRU or NRU control bits in the tag.

                                             Similarly, a cache
might have metadata urging earlier eviction (set 'LRU' for LRU and
tree pLRU, set unaccessed for bit-vector NRU); this might also
involve more than one state as possible choices (e.g., NRU+LRU
pair might set unaccessed and LRU or unaccessed and MRU).

99.44% of programmers could not use these features. Of the 0.56%
who can, 80% of them work in OS Kernels.

Metadata/placement may also be complicated on a cache hit. If a
prefetch targets L2 but hits L1, should it be treated as a nop?

The only metadata I ever allows configuration access to in the cache was
to lock lines out of being available for allocation;
(bad bit in the line or tag). So, a 4-way set associative cache
with a bad bit goes to (2^n-1)-lines of 4-was SA and 1 line of 3-way SA.
The dead line does not participate in allocation and
cannot hit.

What if L2 is shared by multiple cores and the block is dirty or
L2 does not include L1s? What if the L1 state was "next victim"?
Early victimizing _might_ make sense, but changing to "last
victim" might also make sense?

If L2 does not filter SNOOPs the L1s are SNOOPed.
When a line is reallocated
a) modified data migrates outward
b) new data replaces old data

One might also want a prefetch that prepares a hardware prefetch
engine. Such a software prefecth might want metadata about the
type of hardware prefetching (unit stride/non-unit stride/address
is pointer/generic hardware prefetch).

Your advocation for the use of prefetch is admirable. Truly !

The problem with prefetch is that SW cannot use it effectively
and efficiently, and in ways that preserve code over multiple generations--one end might need to prefetch only 100 cycles
in front of use whereas the other end might need to prefetch
several thousand cycles in front of use.

A second problem is that prefetch is NECESSARILY specified as
a HINT because some other piece of code (yours or not) may
touch the memory hierarchy and the prefetch you so carefully
performed goes away before you come for the actual data.

I am one of those programmers who read about PREFETCH being added to x86
and immediately thought "Yes, now I can make my code faster!", then
almost immediately (after getting one of the new CPUs) hit the tiny
little snag of "Oh, it does not seem to work!?!".

I.e. I have written code where actually loading a byte (forcing a fetch,
not just a hint) from every cache line in a 4KB block will speed it up,
but almost never seen any gains from adding a PREFETCH (n) cache lines
in front of the current operating address.

SW must know about latency model it is using prefetch
to ameliorate !! But which one ??

It is already pretty complicated without that added issue.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Paul A. Clayton on Tue Jan 2 12:26:32 2024

Paul A. Clayton wrote:

On 12/28/23 2:10 PM, EricP wrote:

EricP wrote:

Paul A. Clayton wrote:

[snip extended LL/SC design]

One could just use regular loads and stores between the LL and SC.
The hardware knows that it is within an atomic operation.

Yes, it doesn't actually need a SCH instruction.
Just document that multiple LL are allowed to the same cache line,
a normal ST to the same line is held aside if the lock is active,
and SC ends the transaction and updates all changed line bytes.
If all LL/ST/SC are not to the same line it cancels the lock.

On reflection, eliminating the SCH instruction would leave this
vulnerable to an interrupt occurring in the multiple store sequence.
The interrupt would cancel the lock, and on return a regular ST would
not be distinguishable as intended to be part of an atomic sequence,
and so would erroneously update memory, whereas a SCH would be tossed
because the lock was no longer set.

I had not considered this consequence of an interrupt, but this
issue would be "trivially" handled by distinguishing between
reservation active and reservation valid (i.e., use two bits with
one preserved across interrupts). The hardware would then know
that it is within a ll/sc enclosed code sequence and stores would
merely be buffered as usual until the sc was encountered. Advanced
hardware for such a primitive interface, knowing that stores would
not be preserved could elide them.

I still don't think that would quite work.
The lock is canceled if (a) there is an interrupt or
(b) if you read or write an address that is not in the same line

Presumably the interrupt handler would do a ST which should work
even if it happens to be to the same address as the lock.
Then the handler returns to the prior sequence and a ST to the
locked address must now be skipped until we see an SC,
which is also skipped, or a LD or ST to a different line.

To make that work those two bits would have to be part of the interrupt save/restore context. An SCH instruction is easier and unambiguous.

For a single cache block reservation, having to run through the
rest of a known-failed atomic attempt may not be problematic. With
broader uses, that would be painful; any failure would have to run
through the rest of the attempt before being able to consider
retrying.

The instructions inside the LL/SCH/SCR sequence are either reg-reg or
D$L1 cache hits to the locked line. The cost of continuing the occasional failed lock sequence after an interrupt would be minimal.

(Yes you could put a fib() call inside your lock sequence
but its only your application that gets shot in the foot.)

If the state on a failure is architected as boundedly
undefined, a simple scan for the sc (no execution to produce
identical final state) would suffice. It might be possible to add
this imprecise state aspect when allowing multiple accesses so
hardware could safely just scan after the second non-ll/sc memory
access.

Didn't this start out as a cheap and easy retrofit?
I'm sure I heard that someplace.

A microarchitecture could theoretically cache sc locations similar to a branch target buffer, but memoizing such seems difficult and a
prediction would have to be confirmed (e.g., by scanning for the next sc while speculating that the predicted location was correct). That seems needlessly complex.

It is not terribly difficult to imagine ways of addressing this
deficiency. One might be able to get away with reinterpreting a
branch instruction as setting a failure jump point, avoiding
adding instructions. Such a branch would have to use a nonsensical
condition like "if not equal to self" (assuming no one has used
that as a nop before the architecture decided to reserve it).
Short attempts would not include the branch and just run through
to find the sc; simple attempts that are somehow known not to have
only transient failure causes might target the ll instruction
producing a hard retry loop. One could also consider each register
tested for non-equality to self as providing additional control
information.

Given that opcode space is usually not so tight as to prevent a
secondary atomic interface, but one would really want to have the second
be comprehensive to avoid a third, fourth, fifth, etc. Extending ll/sc semantics — without adding instructions — might have allowed experience/feedback to accumulate for a second try.
(I do believe that one could substantially extend the semantics
without having to add instructions. When starting from a clean
slate is better idea seems usually to be realized late. "Free"
extensions also discourages trying again rather than trying to
fix an interface that mostly sort-of works.)

Given that none of the RISCs have actually extended ll/sc
semantics, the pressure to provide broader atomics must not have
been particularly great.

Yes, it could be "if you build it they will come" or
"if you build it they will stare at it with curiosity
then go back to doing things the way they have always done".

ASF did better in checkpointing the start such that a failure
jumped to a known location that performed a test for
success/failure and branched to handling code. ASF also preserved
the stack pointer; this does not seem a substantial benefit as
short attempts could probably avoid changing the stack pointer and
the relative overhead of explicitly saving the stack pointer is
not that great if the attempt is large.

On abort both RTM and proposed ASF had "exception semantics" that
restored the registers to their state at the start of the transaction
as well as toss all memory stores.

Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.

I also have Release on a guard range in my ATX proposal.
Also one can upgrade or downgrade read and write guard ranges.

I want a transaction to be able to safely pick something up,
look at it, decide its not relevant, put it down and continue.

Removing reservations avoids some capacity constraints, though
ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.

I intend Release for avoiding unnecessary collisions and aborts.
There is no danger in it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to MitchAlsup on Tue Jan 2 13:12:53 2024

MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/28/23 2:10 PM, EricP wrote:

Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.

RELEASE was after I had left AMD.

ASF (under my tutelage) and ESM were ATOMIC events not restricted Transactional Memory things.
An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.

I don't follow your distinction between transaction and multi-line atomic.
Both must monitor variables in multiple cache lines as read-only and
others as read-write with all writes made globally visible at once.

An instruction like RELEASE violates the definition and, in essence,
takes the model from ATOMIC towards Transactional by one tiny but
important step--a step I was not willing to take--mainly because I could
not come up with a reasonable definition of ATOMIC if it had that semantic.

I don't follow your reasons here either.
I see RELEASE as being used so an atomic sequence can add an object
to its member sets with read-share or write-exclusive protection,
examine it, decide its not relevant, and release it so others can
update it without contention with you.

ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.

Removing reservations avoids some capacity constraints, though

Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.

I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.

ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.

Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.

I think I understand. You see "atomic" in a purist sense, as one thing
done one way, whereas "transaction" is more dynamic and expansive.

This also depends on how RELEASE works - whether it can toss pending
stores out of a transaction or only releases read-only monitors.
I hadn't decided because it didn't seem that much of an issue
and just a question of where some IF statements are located.

For hardware implementation its just a matter of whether one allows applications to reset the "changed" bits on pending byte buffers.
I currently see no technical reason to disallow it.
So I would allow RELEASE to toss pending stores after they were done.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to EricP on Tue Jan 2 19:23:32 2024

EricP wrote:

Yes, it could be "if you build it they will come" or

The following is the important clause::

"if you build it they will stare at it with curiosity
then go back to doing things the way they have always done".

This pretty much sums up the advancement of inter-thread synchronization
over the last 5 decades.

First there was T&S, then later T&T&S, then CAS, then DCAS, then LL/SC,
perhaps others along the way::

And still synchronization is hard to program, hard to reason, and hard
to get right.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to EricP on Tue Jan 2 19:42:22 2024

EricP wrote:

MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/28/23 2:10 PM, EricP wrote:

Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.

RELEASE was after I had left AMD.

ASF (under my tutelage) and ESM were ATOMIC events not restricted
Transactional Memory things.
An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.

I don't follow your distinction between transaction and multi-line atomic. Both must monitor variables in multiple cache lines as read-only and
others as read-write with all writes made globally visible at once.

Transactional memory can* support nested transactions.
ESM cannot.

TM can survive an interrupt (possibly an exception).
ESM cannot.

(*) in the TM literature I have read.

An instruction like RELEASE violates the definition and, in essence,
takes the model from ATOMIC towards Transactional by one tiny but
important step--a step I was not willing to take--mainly because I could
not come up with a reasonable definition of ATOMIC if it had that semantic.

I don't follow your reasons here either.
I see RELEASE as being used so an atomic sequence can add an object
to its member sets with read-share or write-exclusive protection,
examine it, decide its not relevant, and release it so others can
update it without contention with you.

Let's say you have ESM with RELEASE, and you are processing an ATOMIC
event where you write to one or more participating lines, then RELEASE
one of them, write a few more and complete the event.

This allows interested 3rd parties to see the RELEASED cache line before
they see the rest of the modified cache lines. ATOMIC is defined where
all interested 3rd parties either see all of memory in the before state
or all of the modifications of the after state, and no intermediate
state.

You can assert that because the programmer used RELEASE, that the code
is tolerant of some of the data being seen before the rest of the data
is seen. But this violates the definition of ATOMIC. And I never found
a definition of ATOMIC that would allow a RELEASE. {{I do know how to
do it in HW--I just cant make any definition of ATOMIC work in that µArchitecture. It is indeed a Definitional problem.}}

Perhaps you could come up with one.

ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.

Removing reservations avoids some capacity constraints, though

Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.

I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.

ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.

Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.

I think I understand. You see "atomic" in a purist sense, as one thing
done one way, whereas "transaction" is more dynamic and expansive.

This also depends on how RELEASE works - whether it can toss pending
stores out of a transaction or only releases read-only monitors.
I hadn't decided because it didn't seem that much of an issue
and just a question of where some IF statements are located.

You were using RELEASE so that you could add more participating cache
lines (after RELEASE); so if you have the ability to add more after
RELEASE you had to PUSH the modified data somewhere it will be visible
to make room for the new participants.

For hardware implementation its just a matter of whether one allows applications to reset the "changed" bits on pending byte buffers.
I currently see no technical reason to disallow it.
So I would allow RELEASE to toss pending stores after they were done.

Yes, it is entirely buffer management--that supports the architecture's definition of what ATOMIC means.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Quadibloc@21:1/5 to MitchAlsup on Tue Jan 2 19:49:00 2024

On Tue, 02 Jan 2024 19:23:32 +0000, MitchAlsup wrote:

EricP wrote:

Yes, it could be "if you build it they will come" or

The following is the important clause::

"if you build it they will stare at it with curiosity then go back to
doing things the way they have always done".

Regrettably, I fear this is the likeliest fate of the Mill. But given
that they're going to actually build one, and thus be in a position to *demonstrate* its superior performance, perhaps not.

With your MY 66000, surely it's not so strange as to scare people away.

But, yes, interest in anything new and different is very limited. People
want a processor that will run their Windows programs - x86-64 - or their Android apps - ARM.

RISC-V, even if deeply flawed, has enough mindshare that it could someday
get somewhere.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 20:58:41 2024

Paul A. Clayton wrote:

On 12/31/23 2:33 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

[snip]

Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.

RELEASE was after I had left AMD.

ASF (under my tutelage) and ESM were ATOMIC events not restricted
Transactional Memory things.

An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.

I thought you had written that you wanted escaping stores to
facilitate debugging. I guess I misremembered.

Non-participating STs are performed in <virtual> program order and
can escape prior to the event completing. Participating STs do not.

An instruction like RELEASE violates the definition and, in
essence, takes the model from ATOMIC towards Transactional by one
tiny but important step--a step I was not willing to take--mainly
because I could not come up with a reasonable definition of ATOMIC
if it had that semantic.

If the data is thread local, as far as other threads are concerned
that data does not necessarily even exist.

Unless a pointer has to TLS has been passed through global memory;
and then all bets are off.

Breaking the transactional nature is also not _necessarily_ a bad
choice. I do not know of any, but that is not proof of non-
existence.

ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.

I agree with you that hardware cannot really provide generic
transactional memory (composable, "unbounded" capacity, etc.).

Nesting is the nastiest one.

My disagreement comes from perceiving hardware as being better
equipped to monitor conflicts. Larger read sets seem especially
inexpensive for hardware (especially if one can accept the
limitations of a conservative filter). I do see such expansions
of capacity as less beneficial. (I agree with 'get simple atomics
utilized first and well-understood before guaranteeing useless
features that will have to be supported for twenty years'.)

Removing reservations avoids some capacity constraints, though

Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.

I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.

Probably mostly true. More expansive uses are also likely to be
less common (both application count and use per run); that is just
the general nature of things.

Even so, I think providing cheap expansion can facilitate
experimentation. Expanding the read set seems really cheap.
However, I do not know how one can present a feature as
"experimental" such that one is not committing to it
architecturally. _Theoretically_ the hardware transactional memory
uses are suppose to have a fall back path, so they are hint-like,
but people tend not to read the fine print.

Probably related to how fast TM has taken off and become the standard bearer--Oh Wait.....

ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.

Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.

Violating atomicity may not violate simple orderability. Sometimes
software might be willing to sacrifice true atomicity for lower
chance of "atomic" event failure. This risks the ease of
understanding the interface and one probably does not want to
supply software developers with atomic foot guns.

This ties in with similar discussion with EricP. It may be true
that programmers can reasons about releasing lines early that does
not violate some definition of ATOMIC. I spent enough time (SAF and
ESM) thinking about these things to come to the conclusion one needs
Leslie Lamport level acumen to get ATOMIC features correct the first
time.

(I think the use case for RELEASE was linked lists. One might have
to traverse more than a few nodes but one might not care if
another thread inserted/deleted a node well behind or well ahead
of where this thread is. There are probably other possible uses.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 20:49:46 2024

Paul A. Clayton wrote:

On 12/31/23 2:33 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

[snip]

I had not considered this consequence of an interrupt, but this
issue would be "trivially" handled by distinguishing between
reservation active and reservation valid (i.e., use two bits with
one preserved across interrupts). The hardware would then know
that it is within a ll/sc enclosed code sequence and stores would
merely be buffered as usual until the sc was encountered. Advanced
hardware for such a primitive interface, knowing that stores would
not be preserved could elide them.

If taken the way I think Paul intends, that means the read and
write sets are part of thread saved state. It is nuances like
exceptions and interrupts; why My 66000 abandons the event if
it has to perform more privileged control transfer.

No. The fact that the thread is within an atomic event is part of
thread state (I assume ll/sc implementations make atomic even
success part of hardware thread/core state). Knowing that the
thread is in failed atomic event state, stores can be guarded such
that they do not manifest.

The transfer of control to the ATOMIC control point prior to performing
the exception or interrupt control transfer means ATOMIC adds nothing
to thread state !!! You do not need to remember the participating
cache liens (or their state), nor the fact you are in an event, ...

The concern expressed was that ordinary store instructions would
leak when resuming from an interrupt.

Can you illustrate this as a scenario ??

For a single cache block reservation, having to run through the
rest of a known-failed atomic attempt may not be problematic. With
broader uses, that would be painful; any failure would have to run
through the rest of the attempt before being able to consider
retrying. If the state on a failure is architected as boundedly
undefined, a simple scan for the sc (no execution to produce
identical final state) would suffice. It might be possible to add
this imprecise state aspect when allowing multiple accesses so
hardware could safely just scan after the second non-ll/sc memory
access.

A microarchitecture could theoretically cache sc locations
similar to a branch target buffer, but memoizing such seems
difficult and a prediction would have to be confirmed (e.g., by
scanning for the next sc while speculating that the predicted
location was correct). That seems needlessly complex.

When one considers all of the effects of {memory order, branch
prediction, exceptions, interrupts, machine checks,...} it is
better for the complexity management of a project* to abandon an
ATOMIC event when "control gets hard". I have seen and been in
the debates of "how do we get out of this mess" often enough not
to want to repeat it again.

Do you still have the T-shirt?☺

Only the scars...

Similar considerations apply to cache coherence (or really any
complex functionality). Getting it right in a basic design seems
to be hard enough.

Sound definitions are the first requirement.
a) What does it mean to be cache coherent ??
b) are we interested in 3rd parties or just 2nd parties ??
c) How is DRAM involved ??
d) How is the interconnect involved ??

I presented workarounds even when i thought them "needlessly
complex".

(*) and the possibly more important:: "ability of SW to reason
about what happened and why";

This is another factor. Getting the hardware right is not enough
if getting the software is likely to get it wrong. Comprehensible
interfaces, error/bug detection, and debugging facilities are
obviously important.

Since I have no experience in hardware or software development or collaboration (but some knowledge), I have only limited ability to
evaluate proposals on such considerations. I am aware of some of
the dangers and some of the mitigations, but I would object to
being made responsible for any significant project. (I do like to
think I could contribute as something like an intern, but the
difference between intern and junior (much less head) computer
architect is perhaps somewhat more than minuscule.☺)

I, on the other hand, think you are coming alone nicely.

[snip legacy sandbagging atomic development]

ASF did better in checkpointing the start such that a failure
jumped to a known location

yes

that performed a test for
success/failure and branched to handling code.

not necessarily. If you arrived you knew the event had failed.
ASF provided a mechanism whereby you could refine your understanding
of the failure and branching based on that.

I thought the branch instruction was right after the transaction
beginning instruction and was not necessarily skipped but just
never triggered because EAX was set to zero and the flags updated.

You are too steeped in x86. pick you head out of the hole in the ground
and look at other ways of doing similar stuff.

ASF also preserved
the stack pointer; this does not seem a substantial benefit as
short attempts could probably avoid changing the stack pointer and
the relative overhead of explicitly saving the stack pointer is
not that great if the attempt is large.

The SP thing had to do with PUSH/POP model in x86 and automagic
control transfer.

Okay. I thought such might have been the case.

Mitch wants escaping stores to preserve logged information (with
the obvious potential benefit for debugging and performance
tuning, possibly even dynamic performance tuning i.e. by the
software and not the developer). AMD's ASF defined a RELEASE
instruction which removed a cache block from the tracking set; I
do not remember seeing this in the documentation I have of ESM.

RELEASE was after I had left AMD.

ASF (under my tutelage) and ESM were ATOMIC events not restricted
Transactional Memory things.

An ATOMIC event has the property that if the event succeeds all
of the writes become visible to the system instantly, otherwise
no interested 3rd party sees any of the memory locations as having
been modified.

I thought you had written that you wanted escaping stores to
facilitate debugging. I guess I misremembered.

An instruction like RELEASE violates the definition and, in
essence, takes the model from ATOMIC towards Transactional by one
tiny but important step--a step I was not willing to take--mainly
because I could not come up with a reasonable definition of ATOMIC
if it had that semantic.

If the data is thread local, as far as other threads are concerned
that data does not necessarily even exist.

Breaking the transactional nature is also not _necessarily_ a bad
choice. I do not know of any, but that is not proof of non-
existence.

ESM can be used to help SW implement Transactional Memory, but it
is not Transactional Memory.

I agree with you that hardware cannot really provide generic
transactional memory (composable, "unbounded" capacity, etc.).

My disagreement comes from perceiving hardware as being better
equipped to monitor conflicts. Larger read sets seem especially
inexpensive for hardware (especially if one can accept the
limitations of a conservative filter). I do see such expansions
of capacity as less beneficial. (I agree with 'get simple atomics
utilized first and well-understood before guaranteeing useless
features that will have to be supported for twenty years'.)

Removing reservations avoids some capacity constraints, though

Current ATOMIC things allow for 1 or 2 memory containers. ASF and
ESM allow for 8 cache lines used any way the programmer desires.
EricP's version goes for 16 {*p,length} read/write sets.

I submit that 8 (0r 16) is so much larger than ATOMIC event have
today that it will take decades for SW to figure out how to use
that new capability efficiently and effectively.

Probably mostly true. More expansive uses are also likely to be
less common (both application count and use per run); that is just
the general nature of things.

Even so, I think providing cheap expansion can facilitate
experimentation. Expanding the read set seems really cheap.
However, I do not know how one can present a feature as
"experimental" such that one is not committing to it
architecturally. _Theoretically_ the hardware transactional memory
uses are suppose to have a fall back path, so they are hint-like,
but people tend not to read the fine print.

ASF defined this as a "hint" so an implementation could still fail
an atomic event that would have succeeded if the tracking
adjustment had been done, whether from avoiding a contention or
exceeding a capacity limit. Removing reservations is also
dangerous, especially at cache block tracking granularity, as
software must ensure that other memory accesses either do not
access that cache block or have the same lifetime.

Removing reservations violates the "all or nothing" semantic of
anything that wants to smell ATOMIC.

Violating atomicity may not violate simple orderability. Sometimes
software might be willing to sacrifice true atomicity for lower
chance of "atomic" event failure. This risks the ease of
understanding the interface and one probably does not want to
supply software developers with atomic foot guns.

(I think the use case for RELEASE was linked lists. One might have
to traverse more than a few nodes but one might not care if
another thread inserted/deleted a node well behind or well ahead
of where this thread is. There are probably other possible uses.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 21:22:14 2024

Paul A. Clayton wrote:

On 12/31/23 3:40 PM, MitchAlsup wrote:> Paul A. Clayton wrote:

On 12/25/23 7:12 PM, MitchAlsup wrote:

[snip]

Expecting ATOMIC behavior across cache line boundaries has

NEVER been

guaranteed. So SW that deals with ATOMIC stuff needs to understand
this {along with false sharing,...} and take appropriate actions
at the language level and malloc()/new level.

This may be a further variance from transactional memory-like
interface. If I understand your statement, software is expected to
take special alignment precautions for objects used within ESM.

It is not ESM::software must already take these special considerations
on ANYTHING they want to actually be ATOMIC.
a) avoid false sharing--to prevent losing your cache line lock
b) non-spanning data--so HW can perform the access atomically

(This would not affect any atomic operation involving 4 or fewer
64-byte chunks — 8 aligned chunks handles all possible alignments.
I seem to recall only 6 were used in the _largest_ known case, so
I would guess most atomic operations would not be concerned about
alignment. Even the operations with more access regions are not
likely to encounter an alignment issue. I think generic allocators
tend to assume 16-byte alignment on 64-bit platforms, which should
reduce the likelihood of cache block crossing. Many hot object
members probably tend to be in the first bytes, further reducing the probability. Members used with temporal locality are also more likely
to have spatial locality even at the short temporal locality of an
atomic operation and sub-cache-block spatial locality.)

Of course, high-effort software will naturally make such
optimizations. Until compilers are trained to use ESM for merely lock-guarded operations or code sections are labeled atomic, those
using larger atomic operations will probably be almost exclusively high-effort software.

Paul A. Clayton wrote:

If alignment-caused "capacity"
failures are acceptable, why are address-based associativity
conflicts so extremely problematic?

False sharing make thinks look like interference and harms
performance at the system level by performing too many bus
transfers.

Yes. However, aligning every object by 64-bytes (or assuring that
the interesting bits are in a 64-byte aligned chunk) seems
impractical if not nearly as bad as the early Alpha suggestion that
locks not share a page as ll/sc granularity was only
architecturally guaranteed at that coarseness (if I remember
correctly).

[snip]

I was not considering exceptions, but I guess there are other
failure modes beyond capacity and contention.

Consider you are several checkpoints down a path, and have several outstanding cache misses and an interrupt transpires. What are you
(the CPU) allowed to do with the returning cache lines since those
memory reference instructions did not survive to retirement ??
a) If you put them in the cache you have a Spectré side channel.

Which *I* do not consider so horrible.

You would if you were Bank of America.
You would if you were the CIA.
..

Since avoiding such
speculation side channels is not very expensive (e.g., providing
buffering of speculative information until it is non-speculative)
and being free of those side channels has value, I do not consider
avoiding the side channels an unworthy effort.

You would if you were the CIA.

Producing/infecting a game, productivity tool, or other desirable
software with malware seems easier, faster, and more powerful
(though possibly less traceable). With so many websites requiring
Javascript and Javascript sandboxing not being perfect, I doubt
one needs Spectré to get information. Plugging a hardware security
hole (which is very difficult to work around) does mean that those
wishing to sacrifice some software and web browsing options can be
more secure.

I have JS turned off on at least 300 web sights. This gets rid of
90% of the adds AddBlocker does not already get rid of and also
prevents popups.

b) if you do not, you need to migrate them somewhere Ownership does
not get misunderstood.

Unless the context switch is very brief, an atomic attempt is likely
toast anyway (and retrying is probably not as big a deal if the
thread has already had to wait to resume). I was not suggesting that interrupts not be fatal.

Correct, which is why I fail the event in the control transfer and
forget all the dynamic state. The probability of the concurrent data
structure being in the exact state you left it in when control returns
is negligible.

There may be cases where interrupts not failing a transaction is
justified.

But then you have so much more state you have to save and restore.
Remember one of the goals of My 66000 is low context switch time.

If an ESM operation completed its set-up and only used
4 miss buffer entries, keeping them as unsaved/unfreed state
(similar to FP/SIMD register files have sometimes been lazily saved
by software) might allow a quick interrupt handler to run to
completion and resume the earlier thread before the atomic fails.

State save is 5 cache lines (1 header and 4 register file),
You are adding up to 8 more cache lines and other state (another
cache line) which is doubling the context switch overhead.

Given the "it is better to" consider the CDS as stale upon return,
saving the ATOMIC intermediate state is unwarranted. Given you
don't want to return to the middle of an event, control transfer
to the ATOMIC control point IS warranted.

(I suspect one would want to fail the atomic rather than use NAKs
if there is contention at least after one NAK was sent — unless
there was somehow a way to know that the interrupt handler is just
about done.)

NAKs are not done willy nilly. NAKs are only used when the core has
all the state* needed to complete the event already present, and uses
NAKs to delay contenders until it can complete the event.

(*) all the cache lines are present and writable in the Miss buffer.
Core cannot be waiting for a line or write permission and respond with
NAK. {This ended up being a live-lock issue along the way.}

Side question: for your expected implementations, would a fully set-
up transaction send ordinary NAKs for a cache block that was a miss
to main memory? The cache coherence network can be slow but main
memory is still slower.

See above.

I wonder if this would be a case where limited multithreading could
be useful. A quick/small transaction could complete _while_ the
interrupt handling is done. (I do realize that sharing resources
enables side channels.)

[snip]

No, I expect that in 2 decades there will be enough experience
with ESM that capacity extension would be desired. Here code

that did

work on early implementations continues to work, but more codes
can now be performed with the new capacity limits.

I expect that ESM will not be implemented on enough systems to
provide that experience in just 2 decades. ASF — which was
actually presented as a candidate extension by a significant
processor vendor — was never implemented. Intel's TSX has had
correctness issues and security issues and did not seem to work
that well even for ESM-scale transactions. Power and z/Arch
transactional memory implementations are not going to get broad
use. ARM seems to be seeing the issues and waiting before
providing a hardware transactional memory implementation.

So, how is SW going to use 10,000 cores ??

The larger potential capacity implementations do have the
disadvantage of not encouraging experimentation with small-scale
atomics, especially with their ability to nest (which is a feature
even I would probably exclude from early implementations).

I do not see any large corporation deciding to back My 66000 the
way Amazon, Google, and Apple have made use of ARM computers
available to millions (especially since My 66000 is a better ISA
and not tied to another corporation). This may be largely my
depression talking. Right now I do not think I want to be alive in
twenty years with the direction my health, employment, and other
personal aspects have been going [getting old etc.] and the
direction of the world and particularly of U.S. politics.

As far as RISC-V is concerned, I can't see any *.gov body accepting
it for anything--too much Chinese money involved.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to [email protected] on Tue Jan 2 18:05:51 2024

On Tue, 2 Jan 2024 08:06:25 +0100, Terje Mathisen
<[email protected]> wrote:

MitchAlsup wrote:

Paul A. Clayton wrote:

One might also want a prefetch that prepares a hardware prefetch
engine. Such a software prefecth might want metadata about the
type of hardware prefetching (unit stride/non-unit stride/address
is pointer/generic hardware prefetch).

Your advocation for the use of prefetch is admirable. Truly !

The problem with prefetch is that SW cannot use it effectively
and efficiently, and in ways that preserve code over multiple
generations--one end might need to prefetch only 100 cycles
in front of use whereas the other end might need to prefetch
several thousand cycles in front of use.

A second problem is that prefetch is NECESSARILY specified as
a HINT because some other piece of code (yours or not) may
touch the memory hierarchy and the prefetch you so carefully
performed goes away before you come for the actual data.

I am one of those programmers who read about PREFETCH being added to x86
and immediately thought "Yes, now I can make my code faster!", then
almost immediately (after getting one of the new CPUs) hit the tiny
little snag of "Oh, it does not seem to work!?!".

I.e. I have written code where actually loading a byte (forcing a fetch,
not just a hint) from every cache line in a 4KB block will speed it up,
but almost never seen any gains from adding a PREFETCH (n) cache lines
in front of the current operating address.

SW must know about latency model it is using prefetch
to ameliorate !! But which one ??

It is already pretty complicated without that added issue.

Terje

I have had good results prefetching data for the next iteration of a
loop - prefetch at the top of the loop - and letting the load(s)
percolate *while* processing the current iteration.

I haven't had much success with small loops, e.g., prefetching data
for N+3,N+4 while working on N ... but it has benefited larger loops
that do significant processing.

Obviously, MMV.
George

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Chris M. Thomasson on Tue Jan 2 23:43:53 2024

"Chris M. Thomasson" <[email protected]> writes:

On 1/1/2024 11:06 PM, Terje Mathisen wrote:

MitchAlsup wrote:

Paul A. Clayton wrote:

SW must know about latency model it is using prefetch
to ameliorate !! But which one ??

It is already pretty complicated without that added issue.

Terje

Think of iterating a linked list of adjacent cache lines. A prefetch of
the next cache line would be in order, right?

The last several generations of processors have had very good
prefetch circuitry built into the cache subsystem. Attempting
to out-think them by explicitly using prefetch instructions
in cases where the processor could figure out the access pattern
itself in the prefetcher seems counter productive.

Which is why naive attempts to add explicit prefetching generally don't show useful performance improvements.

It is in the cases where the processor prefetcher cannot determine an access pattern that explicit prefetch instructions start to become useful.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to MitchAlsup on Wed Jan 3 09:42:33 2024

On 12/23/2023 1:44 PM, MitchAlsup wrote:

Chris M. Thomasson wrote:

On 12/23/2023 6:37 AM, John Dallman wrote:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

Which compiler was this?

Microsoft. Our primary demand for Itanium, when things were starting up
for it, was Windows, because there was no 64-bit Windows at the time and >>> some large customers really wanted that. That demand evaporated like dry >>> ice in Death Valley when Intel announced their x86-64.

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can provide the primitives such that SW can invent and use whatever primitive they can figure out how to program and how to use. {{There is no 3rd
choice}}

Perhaps. While I certainly agree that you can do pretty much every
reasonable "primitive" using ESM, there is a cost to doing that. Let's
say you want an atomic memory increment and update. With ESM, I think
this would be three instructions, whereas in implementations that
support it directly, it is one instruction (although the one instruction
would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.

I don't have a feel for which of the existing instructions are the most
used, but a perhaps "best of both worlds" implementation would provide
for the most used as single instructions (only where they would be
better performing than ESM), and provide ESM for whatever complex
mechanism someone might want. Then, in the future, only if some new
mechanism proves out and becomes popular, a future version could provide
that mechanism as a more performant single instruction in a totally
upward compatible way.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Stephen Fuld on Wed Jan 3 18:39:14 2024

Stephen Fuld wrote:

On 12/23/2023 1:44 PM, MitchAlsup wrote:

Chris M. Thomasson wrote:

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can
provide the primitives such that SW can invent and use whatever primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}

Perhaps. While I certainly agree that you can do pretty much every reasonable "primitive" using ESM, there is a cost to doing that. Let's
say you want an atomic memory increment and update. With ESM, I think
this would be three instructions, whereas in implementations that
support it directly, it is one instruction (although the one instruction would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.

I don't have a feel for which of the existing instructions are the most
used, but a perhaps "best of both worlds" implementation would provide
for the most used as single instructions (only where they would be
better performing than ESM), and provide ESM for whatever complex
mechanism someone might want. Then, in the future, only if some new mechanism proves out and becomes popular, a future version could provide
that mechanism as a more performant single instruction in a totally
upward compatible way.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Stephen Fuld on Wed Jan 3 19:05:50 2024

Stephen Fuld wrote:

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
on the x86 side of things (DCAS in IBM parlance), which is the impetus
for my inventing ASF which later morphed into ESM. My point was and
still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one can
provide the primitives such that SW can invent and use whatever primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}

Perhaps. While I certainly agree that you can do pretty much every reasonable "primitive" using ESM, there is a cost to doing that. Let's
say you want an atomic memory increment and update. With ESM, I think
this would be three instructions, whereas in implementations that
support it directly, it is one instruction (although the one instruction would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.

What prevents the core from recognizing::

LDL R7,[some address]
ADD R8,R7,R3
STL R8,[some address] // last use of R8

and converting it into::

ADD_to_Memory R7,[some address],R3

??

I don't have a feel for which of the existing instructions are the most
used, but a perhaps "best of both worlds" implementation would provide
for the most used as single instructions (only where they would be
better performing than ESM), and provide ESM for whatever complex
mechanism someone might want. Then, in the future, only if some new mechanism proves out and becomes popular, a future version could provide
that mechanism as a more performant single instruction in a totally
upward compatible way.

ASF was developed as a means to avoid having to invent new ATOMIC
instructions every processor generation, and was developed with
a keen eye on how microcode (or other sequencer) would manipulate
current hardware (Opteron) to provide the illusion of atomicity.
The goal of ASF was to allow SW to implement all of the (then)
known ATOMIC primitives without needing new instructions.

ESM is ASF expressed in RISC ISA with solutions to several issues
that could not be solved under x86-64 ISA and in particular x86
conditional branches. For example:: the Branch on the condition of
interference instruction (BIN) samples the interference monitor
and branches on the current value, whereas x86 would have to access
the interference bit and then branch conditionally on it, causing
already stale information used in branching.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to MitchAlsup on Wed Jan 3 11:18:44 2024

On 1/3/2024 11:05 AM, MitchAlsup wrote:

Stephen Fuld wrote:

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over >>> on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one
can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}

Perhaps. While I certainly agree that you can do pretty much every
reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM, I
think this would be three instructions, whereas in implementations
that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by ESM).
So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.

What prevents the core from recognizing::

      LDL     R7,[some address]
      ADD     R8,R7,R3
      STL     R8,[some address]       // last use of R8

and converting it into::

      ADD_to_Memory R7,[some address],R3

??

I don't know. If the core can do that then great. However, can the
core distinguish between that sequence that should be atomic, and an
ordinary (i.e. non atomic) memory update? I suppose the atomic version
means some extra things going on (bus lock, collision detection, etc.)
that the ordinary version doesn't need. That cost, would be avoided in
my proposal. And it still requires three times the memory footprint.

I don't have a feel for which of the existing instructions are the
most used, but a perhaps "best of both worlds" implementation would
provide for the most used as single instructions (only where they
would be better performing than ESM), and provide ESM for whatever
complex mechanism someone might want. Then, in the future, only if
some new mechanism proves out and becomes popular, a future version
could provide that mechanism as a more performant single instruction
in a totally upward compatible way.

ASF was developed as a means to avoid having to invent new ATOMIC instructions every processor generation, and was developed with
a keen eye on how microcode (or other sequencer) would manipulate
current hardware (Opteron) to provide the illusion of atomicity.
The goal of ASF was to allow SW to implement all of the (then) known
ATOMIC primitives without needing new instructions.

ESM is ASF expressed in RISC ISA with solutions to several issues
that could not be solved under x86-64 ISA and in particular x86
conditional branches. For example:: the Branch on the condition of interference instruction (BIN) samples the interference monitor
and branches on the current value, whereas x86 would have to access
the interference bit and then branch conditionally on it, causing
already stale information used in branching.

Thanks for the history. Don't get me wrong. I like ESM, and think it
is an elegant generalized solution for whatever new needs arise. I just
think that generality comes at a cost, which might be avoided most of
the time.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Stephen Fuld on Wed Jan 3 20:59:39 2024

Stephen Fuld wrote:

On 1/3/2024 11:05 AM, MitchAlsup wrote:

Stephen Fuld wrote:

Do you remember the rather odd instruction for the Itanium called
cmp8xchg16?

Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over >>>> on the x86 side of things (DCAS in IBM parlance), which is the
impetus for my inventing ASF which later morphed into ESM. My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or one
can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no 3rd
choice}}

Perhaps. While I certainly agree that you can do pretty much every
reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM, I
think this would be three instructions, whereas in implementations
that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by ESM).
So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.

What prevents the core from recognizing::

      LDL     R7,[some address]
      ADD     R8,R7,R3
      STL     R8,[some address]       // last use of R8

and converting it into::

      ADD_to_Memory R7,[some address],R3

??

I don't know. If the core can do that then great. However, can the
core distinguish between that sequence that should be atomic, and an
ordinary (i.e. non atomic) memory update?

The memory update is already ATOMIC in that all interested parties see
only the before and only the after bit-pattern even without using a lock
bit on the "bus".

I suppose the atomic version
means some extra things going on (bus lock, collision detection, etc.)
that the ordinary version doesn't need. That cost, would be avoided in
my proposal. And it still requires three times the memory footprint.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Jan 7 23:31:00 2024

Paul A. Clayton wrote:

On 1/2/24 3:49 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

[snip]

No. The fact that the thread is within an atomic event is part
of thread state (I assume ll/sc implementations make atomic even
success part of hardware thread/core state). Knowing that the
thread is in failed atomic event state, stores can be guarded
such that they do not manifest.

The transfer of control to the ATOMIC control point prior to performing
the exception or interrupt control transfer means ATOMIC adds nothing
to thread state !!! You do not need to remember the participating
cache liens (or their state), nor the fact you are in an event, ...

My comments were in the context of extending ll/sc without adding instructions. While I think one could interpret a branch
instruction immediately after a ll as setting up such a control
point — merely reinterpreting an instruction in that specific
context not "adding" an instruction — the retaining of "in
transaction" state allows just executing until the sc is reached.

One could use a typical branch to reassign ATOMIC control point !!
But I used a special conditional branch (BIN) which samples the
interference monitor and branches if interference occurs. This inst
also reassigns the ATOMIC control point to be the target of the
branch.

I wanted control flow between the start of the event and the
conclusion of the event to be natural to SW (which should
preclude using a typical BC as the special one).

The concern expressed was that ordinary store instructions would
leak when resuming from an interrupt.

Can you illustrate this as a scenario ??

ll r4 ← [r2]
ld r3 ← [r2+4]
sub r5 ← r4 - r3
add r6 ← r3 + 12
<interrupt> // transaction_valid state set to "invalid"
// old thread tx_active saved as "active"
[interrupt handling code]
<resume thread>
st [r2+4] ← r6 // core discards this store since "active"
// and also "invalid"
sc [r2] ← r5 // normal behavior for sc when "invalid"

ll r4 ← [r2] // this is the ATOMIC control point
ld r3 ← [r2+4]
sub r5 ← r4 - r3
add r6 ← r3 + 12
<interrupt> // event has failed
// control reverts to ATOMIC control point
// no special state
[interrupt handling code]
<resume thread>
ll r4 ← [r2] // this is the ATOMIC control point again
ld r3 ← [r2+4]
sub r5 ← r4 - r3
add r6 ← r3 + 12
...

The saving of the extra bit (tx_active) is not an _application_
software compatibility issue, but it does seem to require the
interrupt handler to save this extra state. (Unlike My 66000,
the ll/sc using ISAs do not have hardware context management.)

It is not the context management it is that there is no state to
save or restore that is perinate to ATOMIC events. The event fails
before interrupt control transfer transpires.

Executing through from the resume point to the sc is not
efficient given that the transaction is known to fail, but
compatibility is maintained.

Which is why I don't do it.

The point was to avoid EricP's proposed extra instructions to
indicate that memory operations were part of a transaction.

EricP is on different glide path to optimizing synchronization.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to MitchAlsup on Mon Jan 8 10:44:37 2024

On 1/3/2024 12:59 PM, MitchAlsup wrote:

Stephen Fuld wrote:

On 1/3/2024 11:05 AM, MitchAlsup wrote:

My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or
one can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no
3rd choice}}

Perhaps. While I certainly agree that you can do pretty much every
reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM,
I think this would be three instructions, whereas in implementations
that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by
ESM). So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.

What prevents the core from recognizing::

       LDL     R7,[some address]
       ADD     R8,R7,R3
       STL     R8,[some address]       // last use of R8 >>>
and converting it into::

       ADD_to_Memory R7,[some address],R3

??

I don't know. If the core can do that then great. However, can the
core distinguish between that sequence that should be atomic, and an
ordinary (i.e. non atomic) memory update?

The memory update is already ATOMIC in that all interested parties see
only the before and only the after bit-pattern even without using a lock
bit on the "bus".

                                          I suppose the atomic version
means some extra things going on (bus lock, collision detection, etc.)
that the ordinary version doesn't need. That cost, would be avoided
in my proposal. And it still requires three times the memory footprint.

OK, I think I misunderstood what you proposed about "coalescing" the instructions within the core. I missed that your code had the "L" on
the load and store instructions. So forget about confusing the sequence
with non-atomic updates.

So yes, the core could internally recognize the sequence and turn it
into what I suppose would take the same amount of time as the single instruction. But I guess that would require at least as much circuitry
as executing the single atomic instruction. And, to actually gain the
benefit, you have to document that the core is doing this for that
particular sequence. So ISTM the only extra cost of the atomic
instruction is the op code (which I know you want to minimize). And
you still have the fact of three instructions in the stream versus 1,
I-cache usage, etc.

Of course, this same argument would apply to other popular "primitives"
such as TS, or whatever. So, under this scenario, it comes down the
equivalent core space, the cost of an extra op-code, versus the benefit
of a smaller program. Not too major either way. But still an
alternative to the two you mentioned above.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Stephen Fuld on Mon Jan 8 20:55:11 2024

Stephen Fuld wrote:

On 1/3/2024 12:59 PM, MitchAlsup wrote:

Stephen Fuld wrote:

On 1/3/2024 11:05 AM, MitchAlsup wrote:

My point
was and still is
that one can add 1 to 3 ATOMIC instructions every product rev, or
one can
provide the primitives such that SW can invent and use whatever
primitive
they can figure out how to program and how to use. {{There is no
3rd choice}}

Perhaps. While I certainly agree that you can do pretty much every >>>>> reasonable "primitive" using ESM, there is a cost to doing that.
Let's say you want an atomic memory increment and update. With ESM, >>>>> I think this would be three instructions, whereas in implementations >>>>> that support it directly, it is one instruction (although the one
instruction would take longer than any one of the three used by
ESM). So the single instruction implementation might be faster, and
certainly takes less instruction space than the ESM implementation.

What prevents the core from recognizing::

       LDL     R7,[some address]
       ADD     R8,R7,R3
       STL     R8,[some address]       // last use of R8 >>>>
and converting it into::

       ADD_to_Memory R7,[some address],R3

??

I don't know. If the core can do that then great. However, can the
core distinguish between that sequence that should be atomic, and an
ordinary (i.e. non atomic) memory update?

The memory update is already ATOMIC in that all interested parties see
only the before and only the after bit-pattern even without using a lock
bit on the "bus".

                                          I suppose the atomic version
means some extra things going on (bus lock, collision detection, etc.)
that the ordinary version doesn't need. That cost, would be avoided
in my proposal. And it still requires three times the memory footprint.

OK, I think I misunderstood what you proposed about "coalescing" the instructions within the core. I missed that your code had the "L" on
the load and store instructions. So forget about confusing the sequence
with non-atomic updates.

Accepted.......

So yes, the core could internally recognize the sequence and turn it
into what I suppose would take the same amount of time as the single instruction.

The advantage is that the bundled version ALLWAYS succeeds even in the
face of withering interference, whereas the core executed version does
not.

But I guess that would require at least as much circuitry
as executing the single atomic instruction. And, to actually gain the benefit, you have to document that the core is doing this for that
particular sequence. So ISTM the only extra cost of the atomic
instruction is the op code (which I know you want to minimize). And
you still have the fact of three instructions in the stream versus 1,
I-cache usage, etc.

Granted.

Of course, this same argument would apply to other popular "primitives"
such as TS, or whatever. So, under this scenario, it comes down the equivalent core space, the cost of an extra op-code, versus the benefit
of a smaller program. Not too major either way. But still an
alternative to the two you mentioned above.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Wed Jan 10 18:42:43 2024

Chris M. Thomasson [2023-12-22 11:37:03] wrote:

On 12/22/2023 9:21 AM, Anton Ertl wrote:

Somebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak
consistencies do not.

SC radically reduces performance on existing archs for sure. Think of iterating a linked list. Wrt RCU, the readers can iterate the list without using any memory barriers (DEC Alpha aside for a moment). This is FAR

Anton is talking about SC provided directly by the ISA (e.g. as on the
MIPS architecture), in which case there are no barriers at all.
IOW, the cost is moved to the CPU's OoO engine.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Stefan Monnier on Thu Jan 11 00:24:34 2024

Stefan Monnier wrote:

Chris M. Thomasson [2023-12-22 11:37:03] wrote:

On 12/22/2023 9:21 AM, Anton Ertl wrote:

Somebody ask for a case where SC reduced performance. TSO also has
reduced performance in this case, whereas Cache, causal, and weak
consistencies do not.

SC radically reduces performance on existing archs for sure. Think of
iterating a linked list. Wrt RCU, the readers can iterate the list without >> using any memory barriers (DEC Alpha aside for a moment). This is FAR

Anton is talking about SC provided directly by the ISA (e.g. as on the
MIPS architecture), in which case there are no barriers at all.

I too am similarly talking about an architecture where memory is weaker
than SC (or TSO) and you still need no MBARs. The trick is the ability
to recognize when stronger orders are required and automagically apply
the stronger model for a short but needed duration then automagically
revert to the weaker model when it is not.

That is placing the memory model as bits in a PSL is ill-advised.

IOW, the cost is moved to the CPU's OoO engine.

Since small-width in-order processors are invariable SC, it is always
in the OoO engine.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Chris M. Thomasson on Thu Jan 11 06:57:08 2024

"Chris M. Thomasson" <[email protected]> writes:

Can a C++ program use all std::memory_order_relaxed barriers for its
atomic sync algorithms and still be correct on MIPS?

In a sequentially consistent architecture, memory barriers are nops,
so you can insert or delete them wihout affecting correctness (well,
such an architecture actually won't have instructions for them, unless
there is a variant of the architecture with weaker consistency).

I don't know if MIPS guarantees sequential consistency, though.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to MitchAlsup on Thu Jan 11 07:48:15 2024

[email protected] (MitchAlsup) writes:

Since small-width in-order processors are invariable SC, it is always
in the OoO engine.

"Invariable"? Alpha has had memory barrier instructions from the
start, while its implementations were still small-width in-order
processors. The advocacy paper for weak consistency
[adve&gharachorloo95] was written by DEC people three years before DEC
released the OoO 21264, and the arguments why they think that weak
consistency is necessary for performance have nothing to do with OoO,
but mainly with caches, but also with accessing several main memory
channels in parallel.

@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Guest
  Wed Jul 29 14:26:54 2026
  from Balkans via Telnet
- Rixter
  Wed Jul 29 14:18:17 2026
  from Madison, Nc via Telnet
- Rixter
  Wed Jul 29 02:00:40 2026
  from Madison, Nc via Telnet
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	70:06:44
Calls:	12,448
Calls today:	3
Files:	15,194
Messages:	6,537,602

Superior architecture styles? (was: Concertina II Progress)

Who's Online

Recent Visitors

System Info