• Superior architecture styles? (was: Concertina II Progress)

    From Anton Ertl@21:1/5 to Paul A. Clayton on Tue Dec 19 08:49:18 2023
    "Paul A. Clayton" <[email protected]> writes:
    On 12/6/23 2:54 AM, Anton Ertl wrote:
    [email protected] (Scott Lurndal) writes:
    [snip]
    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    We already have data-flow microarchitectures since the mid-1990s, with
    the success of OoO execution. And the "von Neumann" ISAs have proven
    to be a good and long-term stable interface between software and these
    data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
    and their microarchitectures turned out to be uncompetetive.

    I suspect that a superior interface could be designed which
    exploits diverse locality (i.e., data might naturally be closer to
    some computational resources than to others) and communication
    (and storage) costs and budgets (budgets being related to urgency
    and importance).

    There have been a number of attempts to design architectures with more
    explicit hardware-oriented features, as well as attempts to design architectures with more software-oriented features.

    On the hardware-oriented side we have seen:

    * Explicit fast/small memories that leaves the problem of locality to
    software rather than adding the hardware complexity of cache tags
    and letting the hardware guess what memory will be accessed in the
    future. The only such memories that have succeeded in the long term
    are registers; my explanation for the latter is that registers are
    useful for scalar variables in programs (and vector/SIMD registers
    for temporary storage of pieces of aggregated data).

    My explanation for the lack of success of explicit local memory is
    that it is too hard for software to manage; dealing with it would be
    a cross-cutting concern that would make the software significantly
    more complex. Better write software as if there was no cache or
    local memory, and let hardware attempt to get good performance out
    of it.

    * Software prefetch instructions. Ok, we have a cache instead of
    explicit local memories, but can't we let software improve the
    cache's performace with prefetch hints? I have read some reports
    that the experience is that adding such instructions showed
    disappointing speedups, and slowdowns also happened. While many
    architectures have this feature, it seems to be used little.
    Hardware prefetchers seem to do ok, though.

    * Distributed memory. This avoids all the problems of cache
    consistency, but it requires the programmer to avoid all sharing
    (including read-only data and code), and, most importantly, to fit
    within its slice of machine. This is acceptable for supercomputing,
    but does not fly in general-purpose computing. And even in game
    computin, which one might consider to be more supercomputer-like,
    the Cell architecture was abandoned in the next generation.

    * Weak memory models. Ok, we (hardware designers) have to do cache
    consistency, but let's do a shoddy job and shift the problems over
    to the software people whom we task with inserting architecturally
    meaningless barriers at select places. The great thing about this
    is that we can always blame the software people: if the performance
    sucks, they have inserted too many barriers; if the program breaks,
    they have inserted too few; and if the performance sucks while the
    program breaks, well, we can blame them for both, or just shrug and
    say that the problem is hard. In any case, the blame is deflected
    from us.

    One can argue that this particular architectural misfeature has
    succeeded: After all, in shared-memory multiprocessors you can buy
    today, sequential consistency is nonexistent, and weaker models such
    as TSO or even weaker ones dominate.

    OTOH, very few programmers program to these weak models. They write
    sequential programs, or use libraries/system calls that hide the
    horrors of weak consistency behind easier-to-understand but much
    slower abstractions.

    Isn't that as it should be? Hardware provides some simple, possibly
    inconvenient interface, and software abstracts the inconvenience
    away; e.g., RISCs provide only a load/store interface, and compilers
    translate the stuff programmers write into this interface. The
    difference is that with RISCs the result is a least as fast as
    architectures that tried to cater more to the programming language,
    while I am convinced that hardware where the designers design for
    efficient sequential consistency will let programmers write software
    that handily outperforms using libraries or system calls on hardware
    designed for weaker memory models.

    They also (as I
    understand) lacked value prediction whereas OoO effectively uses
    value prediction for branches.

    Branch prediction is branch prediction, value prediction predicts the
    values of registers or memory locations; I have seen research about
    value prediction maybe 20 years ago, but I am not aware that it has
    been productized. IIRC when I asked someone who presented a paper on
    value prediction why there should be a significant predictability for
    values, the answer was that the values would come from executing along
    a mispredicted path: E.g., when an IF is mispredicted, the values of
    the code after the ENDIF might still compute to the same values.
    However, given the rareness of mispredictions these days, this source
    of value prediction probably provides too little benefit to justify
    the cost (not to mention Spectre considerations, which came in later).

    While the current interface does allow for significant improvement
    exploiting common behaviors and dynamic microarchitectural
    information, it does not seem (to me) an ideal interface.

    I think it is quite remarkable that a sequence of instructions, with
    branches and RAM and registers, is what we already have in S/360 (59
    years ago) and (I think) in the IBM 704 (69 years ago); I am not
    familiar enough with earlier machines or the analytical engine to
    comment on them, but this style might be even older. And while the
    hardware capabilities have exploded since that time, and performance characteristics have changed, this style of interface is still doing
    well. Even specific instruction sets are doing well: AMD64 with it's
    legacy going back to 8086 (45 years ago) and the Datapoint 2200 (53
    years ago) competes well with recent architectures such as ARM A64 and
    RISC-V. S390x goes back to S/360 (59 years ago); I don't know how
    fast it's current implementation is on programs I care about, but I
    expect that in an alternative reality where the PC market was based on
    S390x, the offerings of Intel and AMD (or whoever would take their
    place) would offer similar performance on software I care about as
    their AMD64 offerings offer in this reality.

    And of course the recent ARM A64 and RISC-V also follow the style
    outlined above. RISC-V is even very similar to MIPS (37 years ago),
    despite the R2000 having only ~100,000 transistors, and current CPUs
    having billions.

    I doubt an alternative interface will be adopted any time soon.
    Even with a better understanding than in the 1970s, the problem is
    not simple and early developments would not be competitive even if
    all the R&D NRE was ignored.

    There have been repeated attempts at alternative interfaces, from the
    data-flow stuff you mentioned to IA-64, which saw a huge investment
    and also had a huge mindshare everywhere. Academia has pursued
    alternative programming language concepts (e.g., lazy functional
    programming) for many decades, but attempts to cater to alternative
    programming languages in architectures have not been successful for
    long.

    Despite all this work on alternatives ARM A64 and RISC-V are
    interfaces that, for the most part work like an IBM 704 or a Commodore
    64: RAM actually allows random access (the architecture does not just
    not expose caches and hardware prefetchers, but not DRAM bursts, open
    DRAM pages and somesuch, either (well, the load/store-pair
    instructions of A64 can be attributed to catering to cache line
    sizes), and that RAM belongs to the program (virtual memory is usually transparent to user-level programs).

    Anyway, if an alternative interface, closer to what you consider ideal
    (what would that be?) has not succeeded, it's not for lack of trying.
    And it's also not just because of the network effects: There once was
    the 36-bit word-addressed de-facto standard (IBM 704, PDP-6,
    Burroughs, and others), but that did not prevent the byte-addressed
    32-bit S/360 from succeeding.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Tue Dec 19 16:31:00 2023
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    * Explicit fast/small memories that leaves the problem of locality
    to software rather than adding the hardware complexity of cache tags

    My explanation for the lack of success of explicit local memory is
    that it is too hard for software to manage; dealing with it would
    be a cross-cutting concern that would make the software significantly
    more complex.

    Absolutely. An algorithm that takes a different form if its data is more
    than a specific size is effectively two algorithms, with a transition
    case. It is more complex to write and to test, and is pointless
    complexity on an architecture that does not have the feature. If you
    leave out the complexity on those architectures, you now have three
    algorithms. You have to run all their tests and measure their execution
    times on each architecture to prove that you really have the right code
    on each machine.

    * Software prefetch instructions. Ok, we have a cache instead of
    explicit local memories, but can't we let software improve the
    cache's performace with prefetch hints? I have read some reports
    that the experience is that adding such instructions showed
    disappointing speedups, and slowdowns also happened. While many
    architectures have this feature, it seems to be used little.

    I'm reasonably confident that the increases in code size, and the correspondingly increased I-cache misses, were what wrecked x86-32
    prefetching in my case.

    An architecture that had prefetch hints designed in from the beginning
    might be more efficient with them, as long as it didn't have IA-64's dumb implementation. That produced memory misreads as an architectural feature, while ensuring that they usually didn't happen when running under a
    debugger.

    ... why there should be a significant predictability for values,
    the answer was that the values would come from executing along
    a mispredicted path: E.g., when an IF is mispredicted, the values of
    the code after the ENDIF might still compute to the same values.

    Only if your programmers write the same code twice. The point of those
    IFs is surely to avoid redundant calculation? Programmers are fairly good
    at minimising computation, in terms of expression evaluation. They're
    much less good at minimising branching.

    I think it is quite remarkable that a sequence of instructions, with
    branches and RAM and registers, is what we already have in S/360 (59
    years ago) and (I think) in the IBM 704 (69 years ago); I am not
    familiar enough with earlier machines or the analytical engine to
    comment on them, but this style might be even older.

    It has the great advantage of combining comprehensibility and generality.


    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Tue Dec 19 17:08:34 2023
    [email protected] (John Dallman) writes:
    In article <[email protected]>, >[email protected] (Anton Ertl) wrote:
    * Software prefetch instructions. Ok, we have a cache instead of
    explicit local memories, but can't we let software improve the
    cache's performace with prefetch hints? I have read some reports
    that the experience is that adding such instructions showed
    disappointing speedups, and slowdowns also happened. While many
    architectures have this feature, it seems to be used little.

    I'm reasonably confident that the increases in code size, and the >correspondingly increased I-cache misses, were what wrecked x86-32 >prefetching in my case.

    I have read explanations along these lines, or that the prefetch hints
    consumed instruction decode/retire/LSU resources that caused useful instructions to be delayed. I find both of these explanations
    surprising, because I would expect that, if a program spends
    significant time on D-cache misses (so that prefetches could help),
    reducing that would far outweigh any I-cache or resource contention
    effects.

    An architecture that had prefetch hints designed in from the beginning
    might be more efficient with them, as long as it didn't have IA-64's dumb >implementation. That produced memory misreads as an architectural feature, >while ensuring that they usually didn't happen when running under a
    debugger.

    Do you mean the speculative loads that would return NaTs if the memory
    access is invalid? If so, why does that not happen when running under
    a debugger.

    ... why there should be a significant predictability for values,
    the answer was that the values would come from executing along
    a mispredicted path: E.g., when an IF is mispredicted, the values of
    the code after the ENDIF might still compute to the same values.

    Only if your programmers write the same code twice. The point of those
    IFs is surely to avoid redundant calculation? Programmers are fairly good
    at minimising computation, in terms of expression evaluation. They're
    much less good at minimising branching.

    Maybe I have not made the point clear enough: Consider:

    if (...) {
    ...; /* then clause */
    } else {
    ...; /* else clause */
    }
    x = ...;
    y = a[x];

    Now consider that the then clause is executed speculatively, until
    after the assignment to y, and that this speculation is wrong. On misprediction recovery the else clause is executed, and the assignment
    to x and y is executed again. If x and y do not depend on values
    computed in the if-statement, the values will be the same as in the misspeculated execution, and value prediction would forward these
    values from the misspeculated to the architectural execution. I guess
    there is some history recording involved about which instructions
    produce predictable values and which don't.

    Consider if the speculation has run for 200 instructions behind the mispredicted branch. Then value prediction could produce the results
    of dozens of instructions right away, promoting many others to
    readyness. Of course these instructions still need to be executed, to
    verify the predicted values. Apparently these benefits are not worth
    the costs, though, even without considering Spectre (if you don't want
    Spectre, you cannot do that).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Dec 19 18:27:17 2023
    Anton Ertl wrote:


    There have been a number of attempts to design architectures with more explicit hardware-oriented features, as well as attempts to design architectures with more software-oriented features.

    On the hardware-oriented side we have seen:

    * Explicit fast/small memories

    Cray 2 B registers

    My explanation for the lack of success of explicit local memory is
    that it is too hard for software to manage;

    Yes, Indeed.

    * Software prefetch instructions.

    How do you determine the latency of the prefetch, on one implementation
    100 cycles before might be enough, on another 1,000 might be too little.

    * Weak memory models.

    Good for HW (performance) a nightmare for SW (ordering)

    One can argue that this particular architectural misfeature has
    succeeded: After all, in shared-memory multiprocessors you can buy
    today, sequential consistency is nonexistent, and weaker models such
    as TSO or even weaker ones dominate.

    All high performance machines allow the memory controller to order
    memory references, requests to different cache lines are performed
    out of order with respect to arrival, where accesses to the same
    cache line are strictly ordered (causal). Thus, the CPU is not even
    in a position to guarantee memory order--and that is where SW resides.

    OTOH, very few programmers program to these weak models.

    By and large, programmers keep their heads down and try to do their job.
    SW is complicated enough without having all programmers understand FP
    numerical methods, or memory ordering eccentricities.

    Isn't that as it should be? Hardware provides some simple, possibly
    inconvenient interface, and software abstracts the inconvenience
    away;

    Languages abstract the SW stuff away from the HW stuff. Languages limit
    what programmers can write (but not how badly they write it.)

    e.g., RISCs provide only a load/store interface, and compilers
    translate the stuff programmers write into this interface. The
    difference is that with RISCs the result is a least as fast as
    architectures that tried to cater more to the programming language,
    while I am convinced that hardware where the designers design for
    efficient sequential consistency will let programmers write software
    that handily outperforms using libraries or system calls on hardware
    designed for weaker memory models.

    They also (as I
    understand) lacked value prediction whereas OoO effectively uses
    value prediction for branches.

    Branch prediction is branch prediction, value prediction predicts the
    values of registers or memory locations; I have seen research about
    value prediction maybe 20 years ago, but I am not aware that it has
    been productized.

    Branch prediction only has to guess 1-bit (taken or not)
    Value prediction has to predict a whole value often a pointer.

    IIRC when I asked someone who presented a paper on
    value prediction why there should be a significant predictability for
    values, the answer was that the values would come from executing along
    a mispredicted path: E.g., when an IF is mispredicted, the values of
    the code after the ENDIF might still compute to the same values.

    Which illustrates why value prediction has, thus far, failed.

    However, given the rareness of mispredictions these days, this source
    of value prediction probably provides too little benefit to justify
    the cost (not to mention Spectre considerations, which came in later).

    Spectré is the contrapositive of value prediction, it is the synthesis
    of a value you are not supposed to be able to see.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Dec 20 09:26:09 2023
    Isn't that as it should be? Hardware provides some simple, possibly
    inconvenient interface, and software abstracts the inconvenience
    away; e.g., RISCs provide only a load/store interface, and compilers
    translate the stuff programmers write into this interface. The
    difference is that with RISCs the result is a least as fast as
    architectures that tried to cater more to the programming language,
    while I am convinced that hardware where the designers design for
    efficient sequential consistency will let programmers write software
    that handily outperforms using libraries or system calls on hardware
    designed for weaker memory models.

    While I agree that CPUs should provide sequential consistency (it would
    make life easier for debugging, formalization, and reasoning, and
    I can't think of any reason why it should have a significant cost), I'd
    be surprised if it "handily outperforms" the status quo: concurrent
    programming is hard even with sequential consistency, so most normal
    code would still continue using the exact same libraries.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Dec 20 16:29:12 2023
    Stefan Monnier wrote:

    Isn't that as it should be? Hardware provides some simple, possibly
    inconvenient interface, and software abstracts the inconvenience
    away; e.g., RISCs provide only a load/store interface, and compilers
    translate the stuff programmers write into this interface. The
    difference is that with RISCs the result is a least as fast as
    architectures that tried to cater more to the programming language,
    while I am convinced that hardware where the designers design for
    efficient sequential consistency will let programmers write software
    that handily outperforms using libraries or system calls on hardware
    designed for weaker memory models.

    While I agree that CPUs should provide sequential consistency (it would
    make life easier for debugging, formalization, and reasoning, and
    I can't think of any reason why it should have a significant cost), I'd
    be surprised if it "handily outperforms" the status quo: concurrent programming is hard even with sequential consistency, so most normal
    code would still continue using the exact same libraries.

    As Lamport showed, one only NEEDs sequential consistency when accessing
    memory somebody else can access at the same time. While running single threaded, one needs nothing more than weak or casual consistency.

    As to costs::

    l = p[ i / 123456789 ];
    r[7] = s;

    The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
    the unaliased r[7] ST cannot become visible until the LD has become
    visible; yet since they cannot alias, this thread is not sensitive
    to the order in which they are performed--and certain compilers with
    certain flags will more the ST above the LD--certain HW memory queues
    will also allow ST to become visible before the LD.

    Does this happen in practice--not very often.

    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Thu Dec 21 00:43:44 2023
    Chris M. Thomasson wrote:

    On 12/20/2023 8:29 AM, MitchAlsup wrote:
    Stefan Monnier wrote:

      Isn't that as it should be?  Hardware provides some simple, possibly >>>>   inconvenient interface, and software abstracts the inconvenience
      away; e.g., RISCs provide only a load/store interface, and compilers >>>>   translate the stuff programmers write into this interface.  The
      difference is that with RISCs the result is a least as fast as
      architectures that tried to cater more to the programming language, >>>>   while I am convinced that hardware where the designers design for
      efficient sequential consistency will let programmers write software >>>>   that handily outperforms using libraries or system calls on hardware >>>>   designed for weaker memory models.

    While I agree that CPUs should provide sequential consistency (it would
    make life easier for debugging, formalization, and reasoning, and
    I can't think of any reason why it should have a significant cost), I'd
    be surprised if it "handily outperforms" the status quo: concurrent
    programming is hard even with sequential consistency, so most normal
    code would still continue using the exact same libraries.

    As Lamport showed, one only NEEDs sequential consistency when accessing
    memory somebody else can access at the same time. While running single
    threaded, one needs nothing more than weak or casual consistency.

    What about concurrent multi-threaded code than can work without using sequential consistency on existing arch's?

    What about engine controllers that don't need anything stronger than very weak memory models because they recompute everything between spark firing intervals and if a sensor gets read wrong, you miss one cylinder firing for one revolution
    and then everything goes back to normal.


    As to costs::

          l = p[ i / 123456789 ];
          r[7] = s;

    The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
    the unaliased r[7] ST cannot become visible until the LD has become
    visible; yet since they cannot alias, this thread is not sensitive
    to the order in which they are performed--and certain compilers with
    certain flags will more the ST above the LD--certain HW memory queues
    will also allow ST to become visible before the LD.

    Does this happen in practice--not very often.

            Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stefan Monnier on Thu Dec 21 15:08:08 2023
    On Wed, 20 Dec 2023 09:26:09 -0500
    Stefan Monnier <[email protected]> wrote:

    Isn't that as it should be? Hardware provides some simple,
    possibly inconvenient interface, and software abstracts the
    inconvenience away; e.g., RISCs provide only a load/store
    interface, and compilers translate the stuff programmers write into
    this interface. The difference is that with RISCs the result is a
    least as fast as architectures that tried to cater more to the
    programming language, while I am convinced that hardware where the designers design for efficient sequential consistency will let
    programmers write software that handily outperforms using libraries
    or system calls on hardware designed for weaker memory models.

    While I agree that CPUs should provide sequential consistency (it
    would make life easier for debugging, formalization, and reasoning,
    and I can't think of any reason why it should have a significant
    cost),

    The same reason TSO-like models were invented and re-invented since
    first S/370 multiprocessor: unconstrained ability to feed loads
    from local store queue. If anything, I would think that this ability
    is even more advantageous with big store queues and deep OoO windows of
    today than it was back in early S/370 days.

    I'd be surprised if it "handily outperforms" the status quo:
    concurrent programming is hard even with sequential consistency, so
    most normal code would still continue using the exact same libraries.


    Stefan

    I agree.
    More so, even for "normal abnormal" code, x86 model where TSO is
    augmented by implied memory barriers in all synchronization
    instructions does not appear inferior to SC.
    Now, some people want to write "abnormal abnormal" code, where
    threads communicate with regular loads and stores, without any use of
    special synchronization instructions, but this people are a very small minority.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Dec 21 17:46:24 2023
    Stefan Monnier <[email protected]> writes:
    Isn't that as it should be? Hardware provides some simple, possibly
    inconvenient interface, and software abstracts the inconvenience
    away; e.g., RISCs provide only a load/store interface, and compilers
    translate the stuff programmers write into this interface. The
    difference is that with RISCs the result is a least as fast as
    architectures that tried to cater more to the programming language,
    while I am convinced that hardware where the designers design for
    efficient sequential consistency will let programmers write software
    that handily outperforms using libraries or system calls on hardware
    designed for weaker memory models.

    While I agree that CPUs should provide sequential consistency (it would
    make life easier for debugging, formalization, and reasoning, and
    I can't think of any reason why it should have a significant cost), I'd
    be surprised if it "handily outperforms" the status quo: concurrent >programming is hard even with sequential consistency, so most normal
    code would still continue using the exact same libraries.

    Yes, most programmers will still write sequential programs or use
    libraries or system calls, but if the number of people who write
    lock-free concurrent code increases by a factor of 100, I expect that
    there will be more efficient libraries, so even those who use
    libraries will benefit.

    I actually expect even the existing libraries to perform better on
    hardware that efficiently implements sequential consistency, because
    the barriers become noops instead of being super-expensive on weakly
    ordered hardware).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Thu Dec 21 18:13:26 2023
    [email protected] (MitchAlsup) writes:
    As to costs::

    l = p[ i / 123456789 ];
    r[7] = s;

    The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
    the unaliased r[7] ST cannot become visible until the LD has become
    visible; yet since they cannot alias, this thread is not sensitive
    to the order in which they are performed

    What do you mean with "visible"? You present a single-threaded
    program here, why talk about memory ordering at all?

    And why would I want to perform the store early? The more typical
    case is that you have a store architecturally before a load, and you
    want to perform the load early.

    What current CPUs do is to ask a predictor whether two accesses alias,
    and if the predictor says that they don't, it speculates that they can
    move past each other. If the speculation turns out to be wrong, the speculative state is dropped, just as in case of a branch
    misprediction.

    And for multi-threaded accesses to memory, sequential consistency can
    be implemented in a simular way: Have a predictor that tells you what
    memory is contended, and for non-contended memory, speculatively do
    the fast thing, and if another thread turns out to access the same
    memory, recover and do it in the architectural order.

    --and certain compilers with
    certain flags will more the ST above the LD

    That is irrelevant for the question of how an architecture should
    perform. It's the other way round: programming languages pass on the
    bad programming models that bad (i.e., weakly ordered) architectures
    provide. What should they do? Provide sequential consistency by
    inserting a memory barrier between any two memory accesses? On weakly
    ordered hardware where memory barriers are extremely slow?

    If, OTOH, architectures implemented efficient sequential consistency, programming languages would provide sequential consistency. And those
    that don't would be abandoned by everybody but Chris M. Thomasson.

    --certain HW memory queues
    will also allow ST to become visible before the LD.

    Yes, we know it's possible to design bad hardware. The idea here is
    to tell the hardware designers that weak ordering is not good enough.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 21 19:51:50 2023
    Anton Ertl wrote:

    [email protected] (MitchAlsup) writes:
    As to costs::

    l = p[ i / 123456789 ];
    r[7] = s;

    The LD from p is delayed 20+odd cycles by the IDIV. Under SC or TSO
    the unaliased r[7] ST cannot become visible until the LD has become >>visible; yet since they cannot alias, this thread is not sensitive
    to the order in which they are performed

    What do you mean with "visible"?

    It is the same language Lamport used in his seminal paper. It means
    when when the address has been broadcast (on the bus) but neither
    the read or write has been performed. In CPUs with cache, it means
    when the cache has been accessed and hit has not been granted.

    You present a single-threaded
    program here, why talk about memory ordering at all?

    Somebody ask for a case where SC reduced performance. TSO also has
    reduced performance in this case, whereas Cache, causal, and weak
    consistencies do not.

    And why would I want to perform the store early? The more typical
    case is that you have a store architecturally before a load, and you
    want to perform the load early.

    Is not the general paradigm of OoO processors to do everything as early
    as possible ?? And you want both LDs and STs to pass each other freely
    AFTER you know the virtual address (after AGEN) when independent, but
    remain in processor order when dependent.

    Also note:: One can perform a younger LD early in the face of a
    ST with unknown address, if the pipeline can reRun the LD again
    if the older ST happens to conflict with the value LDed. When one
    demands SC, you cannot do this unless you are willing to back all
    the way up to the ST with the conflict.

    What current CPUs do is to ask a predictor whether two accesses alias,
    and if the predictor says that they don't, it speculates that they can
    move past each other. If the speculation turns out to be wrong, the speculative state is dropped, just as in case of a branch
    misprediction.

    Yes, this is backing up to the ST, causal only has to reRun the LD--
    which is a lot less expensive (to performance).

    And for multi-threaded accesses to memory, sequential consistency can
    be implemented in a simular way: Have a predictor that tells you what
    memory is contended, and for non-contended memory, speculatively do
    the fast thing, and if another thread turns out to access the same
    memory, recover and do it in the architectural order.

    I agree when doing cross-threaded accesses SC is the best model for
    the programmer (least complicated) which is why My 66000 reverts to
    SC when doing ATOMIC stuff and reverts back to causal when this is
    no longer necessary. In order to be in a position to do this, both
    the LD and ST of the ATOMIC event have to be special in some way to
    the pipeline. I mark them with a Lock-bit.

    Mixing causal and SC at the boundaries of cross-threading accesses
    is done without a predictor (although one could be used) by restricting
    when cells in a memory dependence matrix is allowed to relax (and allow dependent memory reference to become visible). A predictor makes recovery
    of the MDM position significantly harder.

    --and certain compilers with
    certain flags will more the ST above the LD

    That is irrelevant for the question of how an architecture should
    perform.

    But is entirely relevant to how a µArchitecture is allowed to perform.
    When the µArchitecture is allowed to move LDs around STs when there
    is no overlap in the containers accessed, performance is improved.
    But when this is allowed, one needs to know if one is performing an
    ATOMIC event (or not) so as to prevent that.

    It's the other way round: programming languages pass on the
    bad programming models that bad (i.e., weakly ordered) architectures
    provide. What should they do? Provide sequential consistency by
    inserting a memory barrier between any two memory accesses? On weakly ordered hardware where memory barriers are extremely slow?

    Use processors that fix the problem for you with essentially no effort
    on your part.

    If, OTOH, architectures implemented efficient sequential consistency, programming languages would provide sequential consistency. And those
    that don't would be abandoned by everybody but Chris M. Thomasson.

    As illustrated above SC-only harms performance, not by a lot, but by enough
    to convert a winning µArchiteccture into a Ho-Hum competitor.

    --certain HW memory queues
    will also allow ST to become visible before the LD.

    Yes, we know it's possible to design bad hardware. The idea here is
    to tell the hardware designers that weak ordering is not good enough.

    I am not now or for the past 10 years been advocating for weak memory
    models when cross-thread accesses are involved. Here, I am advocating
    for weakened models that happen to allow for cross-thread accesses to
    stabilize to the position that all interested parties agree on what
    happened and in what order. This is causal consistency. {{I do not
    believe that memory models weaker than causal bring enough performance
    to the table to justify the added SW complexity.}}

    I am not now or for the past 10 years been advocating for compilers to
    make assumptions about memory aliasing allowing them to move LDs and
    STs across each other. Here, I am advocating for HW to allows LD<->ST
    when accesses can be proven never to conflict outside of cross-thread
    memory accesses.

    And B: HW designers do not understand memory ordering requirements
    sufficiently that they CAN design HW to the specifications you request.
    It is not in their DNA....They work with languages that provide the
    illusion of everything that can be in parallel already is in parallel
    and they did not have to do anything to cause that to be.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Fri Dec 22 17:21:53 2023
    [email protected] (MitchAlsup) writes:
    Anton Ertl wrote:
    You present a single-threaded
    program here, why talk about memory ordering at all?

    Somebody ask for a case where SC reduced performance. TSO also has
    reduced performance in this case, whereas Cache, causal, and weak >consistencies do not.

    In cases where there is no behavioural difference between sequential consistency and weaker memory orderings, there is no fundamental
    reason why there should be a performance difference.

    You are apparently thinking about a particular implementation (my
    guess is: MY66000). What may hold for this implementation does not
    necessarily hold in general.

    Also note:: One can perform a younger LD early in the face of a
    ST with unknown address, if the pipeline can reRun the LD again
    if the older ST happens to conflict with the value LDed. When one
    demands SC, you cannot do this unless you are willing to back all
    the way up to the ST with the conflict.

    I would just check whether the result of the load has changed from the speculatively executed version, and if so, restart from load. Why
    would you restart from the earlier store?

    --and certain compilers with
    certain flags will more the ST above the LD

    That is irrelevant for the question of how an architecture should
    perform.

    But is entirely relevant to how a �Architecture is allowed to perform.

    What a microarchitecture is allowed to do is entirely limited by the architecture. If the architecture specifies sequential consistency,
    the microarchitecture has to implement sequential consistency. If the architecture specifies TSO, the microarchitecture has to implement TSO
    or better. What some compiler does is irrelevant.

    When the �Architecture is allowed to move LDs around STs when there
    is no overlap in the containers accessed, performance is improved.

    containers?

    If, OTOH, architectures implemented efficient sequential consistency,
    programming languages would provide sequential consistency. And those
    that don't would be abandoned by everybody but Chris M. Thomasson.

    As illustrated above SC-only harms performance, not by a lot, but by enough >to convert a winning �Architeccture into a Ho-Hum competitor.

    I have not found any such illustration in what you wrote, only claims
    (probably based on a MY66000 microarchitecture).

    In particular, given that for single-threaded programs the behaviour
    is the same between SC and weak ordering, it is possible to implement
    a microarchitecture that provides SC and has the same performance for single-threaded programs as a weakly-ordered competitor.

    Now, consider access to shared memory in multi-threaded programs.
    With a microarchitecture that is optimized for weak ordering, every
    accesses to shared memory will be slow (by having to use slow atomic operations, memory barriers and somesuch). By contrast, with a microarchitecture that is optimized for sequential consistency,
    accesses to shared memory will be fast in the uncontended case and
    slow only in the contended case (then the reruns are needed). Bottom
    line: If the programmer made contention rare, the microarchitecture
    designed for sequential consistency wins.

    Compare with precise exceptions: The Alpha guys told us that precise
    exceptions are too slow, and gave us imprecise exceptions with a slow
    trap barrier instruction (TRAPB) for limiting the imprecision; and
    programmers should avoid TRAPB at all costs. Then they implemented
    the 21264, and lo and behold, TRAPB was as cheap as a NOP; it probably
    was a NOP, and the 21264 implemented precise exceptions, because it
    comes for free when you do an OoO microarchitecture with speculation.
    And the 21264 was faster than the 21164a with its imprecise
    exceptions.

    And B: HW designers do not understand memory ordering requirements >sufficiently that they CAN design HW to the specifications you request.
    It is not in their DNA....They work with languages that provide the
    illusion of everything that can be in parallel already is in parallel
    and they did not have to do anything to cause that to be.

    Whatever DNA they may have, they have managed to implement the
    sequential semantics (including precise exceptions) of instruction
    sets, and I am sure that they will manage to implement sequential
    consistency efficiently if they are tasked to do that. But we have to
    tell them that weak ordering (which is easier to implement) is not
    good enough.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Dec 22 19:46:51 2023
    Anton Ertl wrote:

    [email protected] (MitchAlsup) writes:
    Anton Ertl wrote:
    You present a single-threaded
    program here, why talk about memory ordering at all?

    Somebody ask for a case where SC reduced performance. TSO also has
    reduced performance in this case, whereas Cache, causal, and weak >>consistencies do not.

    In cases where there is no behavioural difference between sequential consistency and weaker memory orderings, there is no fundamental
    reason why there should be a performance difference.

    You are apparently thinking about a particular implementation (my
    guess is: MY66000). What may hold for this implementation does not necessarily hold in general.

    Granted, I convert everything into the architecture I know best and
    in most detail. Wouldn't you ??

    Also note:: One can perform a younger LD early in the face of a
    ST with unknown address, if the pipeline can reRun the LD again
    if the older ST happens to conflict with the value LDed. When one
    demands SC, you cannot do this unless you are willing to back all
    the way up to the ST with the conflict.

    I would just check whether the result of the load has changed from the speculatively executed version, and if so, restart from load. Why
    would you restart from the earlier store?

    To make everyone agree on the ordering of memory events.

    --and certain compilers with
    certain flags will more the ST above the LD

    That is irrelevant for the question of how an architecture should
    perform.

    But is entirely relevant to how a �Architecture is allowed to perform.

    What a microarchitecture is allowed to do is entirely limited by the architecture. If the architecture specifies sequential consistency,
    the microarchitecture has to implement sequential consistency. If the architecture specifies TSO, the microarchitecture has to implement TSO
    or better. What some compiler does is irrelevant.

    Most µArchitectures operate under the "as if" rule, and if SC is the Architectural rule, µArchitecture can do things in SC order or can do
    things n WO and if a SC violation is detected, back up and make it
    look like SC was in play.

    When the �Architecture is allowed to move LDs around STs when there
    is no overlap in the containers accessed, performance is improved.

    containers?

    Unit of memory access {byte, half (except on x86) word, double, quad
    (except on everyone other than x86), bigger than 64-bit things};
    aligned or not, line crossing or not, page crossing or not, MTRR
    crossing or not.

    If, OTOH, architectures implemented efficient sequential consistency,
    programming languages would provide sequential consistency. And those
    that don't would be abandoned by everybody but Chris M. Thomasson.

    As illustrated above SC-only harms performance, not by a lot, but by enough >>to convert a winning �Architeccture into a Ho-Hum competitor.

    I have not found any such illustration in what you wrote, only claims (probably based on a MY66000 microarchitecture).

    In particular, given that for single-threaded programs the behaviour
    is the same between SC and weak ordering, it is possible to implement
    a microarchitecture that provides SC and has the same performance for single-threaded programs as a weakly-ordered competitor.

    I assert without proof:: no.
    I assert that the loss may not be significant in many µArchitectures.
    I assert that the loss is significant on a few of the biggest baddest µArchitectures.
    Where significance is >5% and <10%.

    Now, consider access to shared memory in multi-threaded programs.
    With a microarchitecture that is optimized for weak ordering, every
    accesses to shared memory will be slow (by having to use slow atomic operations, memory barriers and somesuch). By contrast, with a microarchitecture that is optimized for sequential consistency,
    accesses to shared memory will be fast in the uncontended case and
    slow only in the contended case (then the reruns are needed). Bottom
    line: If the programmer made contention rare, the microarchitecture
    designed for sequential consistency wins.

    No, but it might beak even.

    Compare with precise exceptions: The Alpha guys told us that precise exceptions are too slow, and gave us imprecise exceptions with a slow
    trap barrier instruction (TRAPB) for limiting the imprecision; and programmers should avoid TRAPB at all costs. Then they implemented
    the 21264, and lo and behold, TRAPB was as cheap as a NOP; it probably
    was a NOP, and the 21264 implemented precise exceptions, because it
    comes for free when you do an OoO microarchitecture with speculation.
    And the 21264 was faster than the 21164a with its imprecise
    exceptions.

    264 was wider issue and OoO, too.

    And B: HW designers do not understand memory ordering requirements >>sufficiently that they CAN design HW to the specifications you request.
    It is not in their DNA....They work with languages that provide the >>illusion of everything that can be in parallel already is in parallel
    and they did not have to do anything to cause that to be.

    Whatever DNA they may have, they have managed to implement the
    sequential semantics (including precise exceptions) of instruction
    sets, and I am sure that they will manage to implement sequential
    consistency efficiently if they are tasked to do that. But we have to
    tell them that weak ordering (which is easier to implement) is not
    good enough.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Fri Dec 22 21:52:00 2023
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    I'm reasonably confident that the increases in code size, and the >correspondingly increased I-cache misses, were what wrecked x86-32 >prefetching in my case.

    I have read explanations along these lines, or that the prefetch
    hints consumed instruction decode/retire/LSU resources that caused
    useful instructions to be delayed. I find both of these explanations surprising, because I would expect that, if a program spends
    significant time on D-cache misses (so that prefetches could help),
    reducing that would far outweigh any I-cache or resource contention
    effects.

    Well, it was the best explanation I could come up with at the time. The
    code that I applied prefetching to was certainly generating lots of
    misses in both I-cache and D-cache, and had not been designed for
    prefetching. I applied it to the data traversal operators in the domain-specific language the code is written in, and some thing got
    faster, some slower, but without any very large changes.


    Do you mean the speculative loads that would return NaTs if the
    memory access is invalid? If so, why does that not happen when
    running under a debugger.

    No. What I found was much, much worse. IA-64 had load-advance and
    load-check instructions. You were supposed to issue the load-advance as
    soon as you knew the address you were going to need and it happened asynchronously. The load-check was quick if the value had arrived, but otherwise stalled the pipeline until it did arrive. If the hardware had discarded an advance load, as it was permitted to do, this meant a full
    wait for memory/cache.

    If you were fetching into integer registers, the windowing of those
    registers and the Register Stack Engine allowed for all this to happen smoothly, provided you hadn't gone into so many levels of subroutine
    during the fetch that the register you were loading into had been pushed
    out to memory. I'm not sure what was meant to happen then, but it wasn't
    going to be quick.

    The problem came when you were fetching into floating-point registers,
    which were not windowed, although the way they could move around for modulo-scheduled loops confused people on that front.

    The sequence of events that caused problems was:

    1. Advance load into floating-point register fx.
    2. Call subroutine, which wants to use fx.
    3. Subroutine pushes fx to stack, and starts using it.
    4. Advance load arrives, overwriting fx.
    5. Subroutine may well go wrong, given its register has been corrupted.
    6. If subroutine doesn't notice, it finishes its work.
    7. Subroutine pops fx from the stack.
    8. Subroutine returns.
    9. Check load is executed. The ALAT says the load has happened.

    The value in fx, nonetheless is what was there /before/ you issued the
    advance load.

    If you have breakpoints anywhere along this path, they cause pipeline
    flushes and the problem does not (usually) reproduce.

    It took me a fair amount of time and brain cells to figure this out. I
    ended up forming a hypothesis and searching assembly listings for the
    relevant code sequences. Once I knew what was going on, I tried to report
    the bug. Intel didn't want to know for a while and then switched to "Oh
    yes, that old thing. It doesn't matter! Didn't we tell you about it?"
    This was not convincing.

    The net effect is that you could not safely have floating-point advance
    loads outstanding across subroutine calls. The "fix" was to re-issue all outstanding advance floating-point loads after each subroutine call,
    which bulked out the code quite seriously. IA-64 code was already very
    bulky, pressurising cache size and memory bandwidth. The "fix" also
    didn't prevent the subroutines suffering from random register corruptions.


    Intel seemed to have assumed that floating-point work would /only/ happen
    in leaf functions. That's a nice ideal, but there's plenty of real-world
    code where it is not true. /Maybe/ whole-program optimisation might have reduced its effects. I tried that just once. The link took an hour. This
    was not practical when the compiler was so immature that I was reporting
    bugs in it every week.

    IA-64 did not prove that all its ideas were worthless, only that the way
    they'd been put together was counterproductive. Nonetheless, anyone else
    who tries to launch a EPIC processor is going to face some very hostile questioning, and it's perfectly understandable that nobody wants to try.


    Maybe I have not made the point clear enough

    OK, with you now.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Sat Dec 23 10:49:11 2023
    [email protected] (John Dallman) writes:
    In article <[email protected]>, >[email protected] (Anton Ertl) wrote:
    Do you mean the speculative loads that would return NaTs if the
    memory access is invalid? If so, why does that not happen when
    running under a debugger.

    No. What I found was much, much worse. IA-64 had load-advance and
    load-check instructions. You were supposed to issue the load-advance as
    soon as you knew the address you were going to need and it happened >asynchronously. The load-check was quick if the value had arrived, but >otherwise stalled the pipeline until it did arrive. If the hardware had >discarded an advance load, as it was permitted to do, this meant a full
    wait for memory/cache.

    Ah, yes, this one. In particular, the ld.c would repeat the load if a
    store in between stored to that address.

    The problem came when you were fetching into floating-point registers,
    which were not windowed, although the way they could move around for >modulo-scheduled loops confused people on that front.

    The sequence of events that caused problems was:

    1. Advance load into floating-point register fx.
    2. Call subroutine, which wants to use fx.
    3. Subroutine pushes fx to stack, and starts using it.
    4. Advance load arrives, overwriting fx.
    5. Subroutine may well go wrong, given its register has been corrupted.
    6. If subroutine doesn't notice, it finishes its work.
    7. Subroutine pops fx from the stack.
    8. Subroutine returns.
    9. Check load is executed. The ALAT says the load has happened.

    The value in fx, nonetheless is what was there /before/ you issued the >advance load.

    If you have breakpoints anywhere along this path, they cause pipeline
    flushes and the problem does not (usually) reproduce.

    Ouch. This clearly is a compiler bug. It eiter should abstain from
    moving a load across the call, or perform a ld.c before the call.
    Which compiler was this?

    IA-64 did not prove that all its ideas were worthless, only that the way >they'd been put together was counterproductive. Nonetheless, anyone else
    who tries to launch a EPIC processor is going to face some very hostile >questioning, and it's perfectly understandable that nobody wants to try.

    I have not had any such problems when I compiled my code on IA-64
    (with gcc), the result was just slow:

    Gforth 0.7 results (lower is better):
    sieve bubble matrix fib
    1.000 1.120 0.710 1.680 0.7.0; Itanium II 900MHz (HP rx2600); gcc-4.1.1
    0.245 0.287 0.156 0.376 0.7.0; Pentium 4 Northwood 2.26GHz; gcc-2.95.4 20011002 (Debian prerelease)
    0.670 1.070 0.932 0.968 0.7.0; Alpha 21264B 800MHz; gcc-4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

    Both Intel CPUs are from 2002, the 21264B from 2001 (based on the
    21264 first released in 1998); and this is the Itanium running IA-64
    code, not IA-32 code.

    I have discussed earlier (<[email protected]>
    and others) why EPIC lost against OoO, and it's not just bad execution
    by Intel and HP.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sat Dec 23 14:37:00 2023
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Which compiler was this?

    Microsoft. Our primary demand for Itanium, when things were starting up
    for it, was Windows, because there was no 64-bit Windows at the time and
    some large customers really wanted that. That demand evaporated like dry
    ice in Death Valley when Intel announced their x86-64.

    I started work on Windows using Intel's and Microsoft's compilers in
    parallel, expecting to discard one when it became clear which was better.
    After a while, we were at a point when we could get through the most
    basic level of testing in an optimised build with Microsoft, and couldn't
    in a deoptimised build with Intel.

    Ouch. This clearly is a compiler bug. It either should abstain from
    moving a load across the call, or perform a ld.c before the call.

    The problem comes when loads and calls are mixed together very closely,
    such that there's obviously no time for a load to complete between the
    address being known and the next necessary call. Microsoft's tactic of re-issuing loads got rid of most of the problems, and we'd become far
    more interested in AMD64.

    I have discussed earlier
    (<[email protected]>
    and others) why EPIC lost against OoO, and it's not just bad
    execution by Intel and HP.

    Indeed.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Dec 23 21:44:22 2023
    Chris M. Thomasson wrote:

    On 12/23/2023 6:37 AM, John Dallman wrote:
    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:

    Which compiler was this?

    Microsoft. Our primary demand for Itanium, when things were starting up
    for it, was Windows, because there was no 64-bit Windows at the time and
    some large customers really wanted that. That demand evaporated like dry
    ice in Death Valley when Intel announced their x86-64.

    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
    on the x86 side of things (DCAS in IBM parlance), which is the impetus for
    my inventing ASF which later morphed into ESM. My point was and still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or one can provide the primitives such that SW can invent and use whatever primitive
    they can figure out how to program and how to use. {{There is no 3rd choice}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Dec 23 23:12:20 2023
    Chris M. Thomasson wrote:

    On 12/23/2023 1:44 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 12/23/2023 6:37 AM, John Dallman wrote:
    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:

    Which compiler was this?

    Microsoft. Our primary demand for Itanium, when things were starting up >>>> for it, was Windows, because there was no 64-bit Windows at the time and >>>> some large customers really wanted that. That demand evaporated like dry >>>> ice in Death Valley when Intel announced their x86-64.

    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
    on the x86 side of things (DCAS in IBM parlance), which is the impetus
    for my inventing ASF which later morphed into ESM. My point was and
    still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or one can
    provide the primitives such that SW can invent and use whatever primitive
    they can figure out how to program and how to use. {{There is no 3rd
    choice}}

    Well the thing about cmp8xchg16 and friends is that all the words need
    to be adjacent to one another.

    That really limits its utility.

    Then there is cmpxchg16 that allows for
    true 128 bit CAS on a 64 bit system, again wrt adjacent words...
    cmp8xchg16 was always strange to me.

    ESM has none of those limitations, and you can perform up to CMP8SWAP8 if
    that floats your boat, and/or attach time stamps along with the pointer manipulations,.......

    Now, speaking of IBM, iirc, are you familiar with the PLO locking
    scheme? I think it was called PLO. Iirc, it was in the same appendix as
    the lock-free IBM free pool manipulation?

    No

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Chris M. Thomasson on Sun Dec 24 11:35:00 2023
    In article <um7gmd$274oh$[email protected]>, [email protected]
    (Chris M. Thomasson) wrote:

    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    I am fortunate enough not to have had to delve into locking
    implementations; I make a practice of using the OS facilities carefully.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Sun Dec 24 18:39:54 2023
    "Paul A. Clayton" <[email protected]> writes:
    On 12/23/23 4:44 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:
    [snip]
    Do you remember the rather odd instruction for the Itanium
    called cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap
    2 over on the x86 side of things (DCAS in IBM parlance), which is the
    impetus for my inventing ASF which later morphed into ESM. My
    point was and still is that one can add 1 to 3 ATOMIC instructions
    every product rev, or one can provide the primitives such that SW can
    invent and use whatever primitive they can figure out how to program
    and how to use. {{There is no 3rd choice}}

    Third choice: do both (likely poorly).

    ESM could atomically operate on up to 512 bytes in up to 8 64-
    byte-aligned locations. The proposed atomics seem to be limited to
    a single cache block; a modestly extended LL/SC that allowed
    ordinary loads and stores within the reserved cache block would
    seem to provide that functionality. ESM is much more flexible, but
    forward progress is more difficult with more reservations. (Yes,
    ESM has mechanisms to handle queues and such, but I doubt even
    Mitch Alsup has discovered a universal solution.)

    (A proposed extension for RISC-V — Zacas — provides double-width
    CAS. I do not buy the argument that "the CAS atomic instructions
    scale better to highly parallel systems than LR/SC"; I think that
    is simply an implementation choice.

    In my experience with large scale systems, it has been shown
    that LL/SC cannot scale so long as it must acquire the cache
    line to complete the transaction.

    The advantage of atomics (particularly load+add) is that the
    core doesn't need to acquire exclusive access to the cache line
    before completing the operation - the atomic can be easily delegated
    to a higher level (e.g. LLC) of cache, passed to the cache
    that owns the line, or even to the memory controller
    (or a PCI express device).


    I also see little difficulty
    in extending LL/SC active regions to include additional loads and
    stores within the single reservation.

    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.



    Methods applied for locks might help in some cases for hardware-
    monitored reservations, but designing an interface seems difficult
    (general enough to be useful yet also high enough performance to
    be useful) and dangerous (new techniques and use cases can change
    the two difficulty factors).

    Back in the early 1980's, we added hardware support for locks
    (and events - e.g. modern condition variables) directly to
    the instruction set. An instruction to acquire a lock,
    one to release a lock (operating like a mutex), one to
    test if the lock is available and acquire if so, but not
    block if owned by another thread. Instructions to
    wait for an event and signal an event.

    The lock was defined by a data structure that included
    a 'canonical lock number', and a field containing
    the current lock owner task number, and a link to the
    first thread waiting for the lock, if any.

    The MCP(OS) thread data structure contained a field that
    recorded the current canonical lock number (CLN) owned by the
    thread and the lock instruction would fail (set condition codes)
    if the lock being locked had a CLN less than or equal to
    the currently owned CLN. The unlock instruction would fault
    if the thread attempted to unlock a lock where the CLN didn't
    match the current CLN. These rules completely prevented A-B
    deadlocks.

    If the lock was already owned the lock instruction would trap
    to a microkernel (highest privilege level) which would place
    the thread on a waiting list and dispatch the next highest
    priority thread.

    If there was a waiter recorded in the lock structure, the
    unlock instruction would trap to the microkernel which would
    adjust the list and, if the newly runnable thread was highest
    priority, dispatch it.


    Non-composing hardware lock elision seems to have the advantage of
    including an exclusive name. I imagine that one could use that
    name to establish a single order of atomic operations when there
    is conflict, but even then maximizing throughput seems difficult.
    If one falls back to all users of the lock name wait in line, one
    likely loses a lot of potential concurrency while the lock name is
    in "queuing mode". I suspect there are some tricks that would
    allow some queue skipping at moderate complexity/area/power/etc.
    cost (possibly involving versioned memory).

    Sounds much like transactions. Most hardware implementations so
    far have suffered from various restrictions in the size of the
    transaction and haven't found much traction yet.


    I have not studied software concurrency (just "overhearing" a few
    things from time to time), but even a sub-novice knows that
    locking is hard (deadlock, livelock, etc.).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Sun Dec 24 18:55:19 2023
    "Paul A. Clayton" <[email protected]> writes:
    On 12/19/23 3:49 AM, Anton Ertl wrote:
    "Paul A. Clayton" <[email protected]> writes:
    [snip]
    I suspect that a superior interface could be designed which
    exploits diverse locality (i.e., data might naturally be closer to
    some computational resources than to others) and communication
    (and storage) costs and budgets (budgets being related to urgency
    and importance).

    There have been a number of attempts to design architectures with more
    explicit hardware-oriented features, as well as attempts to design
    architectures with more software-oriented features.

    Communication of the (absolute and relative) costs and usefulness
    in both directions between hardware and software is a significant
    factor. Tossing difficult things to the other side of the
    interface just because they are difficult is problematic, but the
    relative difficulty can justify the placement of work.

    Although there are many things that hardware can do much
    more efficiently, perhaps at a cost. Such as Content
    Addressible Memory which is useful in many workloads,
    particularly in the networking space.

    I'd love to have some form of CAM available directly to a
    C++ programmer on standard hardware - I often see cases
    where it could be useful for performance, for small
    lookup tables. Perhaps a CAM instruction with a set of
    CAM banks (one per hardware thread) and support in the kernel
    to save/restore them on context switch (lazily, like FPR).


    I think it is quite remarkable that a sequence of instructions, with
    branches and RAM and registers, is what we already have in S/360 (59
    years ago) and (I think) in the IBM 704 (69 years ago);

    The Datatron first shipped the same year as the 704. It was
    announced in '52. There was a 2-digit order code in a 'command'
    (what we'd call an instruction). It included both multiplication
    and division (10 digit operands, 20 digit product) as well
    as shift (digit, not bit) orders (peephole optimizations
    for multiply/division by powers of 10). Load (Clear and Add) and Store orders. Block transfer order copied 20 consecutive main storage
    cells to a 'quick access loop'.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sun Dec 24 20:37:51 2023
    Chris M. Thomasson wrote:

    On 12/23/2023 6:45 PM, Paul A. Clayton wrote:
    On 12/23/23 4:44 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:
    [snip]
    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2
    over on the x86 side of things (DCAS in IBM parlance), which is the
    impetus for my inventing ASF which later morphed into ESM. My point
    was and still is that one can add 1 to 3 ATOMIC instructions every
    product rev, or one can provide the primitives such that SW can invent
    and use whatever primitive they can figure out how to program and how
    to use. {{There is no 3rd choice}}

    Third choice: do both (likely poorly).

    ESM could atomically operate on up to 512 bytes in up to 8 64-
    byte-aligned locations. The proposed atomics seem to be limited to
    a single cache block; a modestly extended LL/SC that allowed
    ordinary loads and stores within the reserved cache block would
    seem to provide that functionality. ESM is much more flexible, but
    forward progress is more difficult with more reservations. (Yes,
    ESM has mechanisms to handle queues and such, but I doubt even
    Mitch Alsup has discovered a universal solution.)

    (A proposed extension for RISC-V — Zacas — provides double-width
    CAS. I do not buy the argument that "the CAS atomic instructions
    scale better to highly parallel systems than LR/SC"; I think that
    is simply an implementation choice. I also see little difficulty
    in extending LL/SC active regions to include additional loads and
    stores within the single reservation. However, I would not be
    surprised if there was an advantage to easier idiom recognition
    for atomic operations that would readily benefit from some
    optimizations. The code density disadvantage would be difficult
    to remove (a sequence of primitives will typically be longer in
    bytes than a complex operation), but greater flexibility might
    pay for lower code density especially for less frequent
    operations.)

    Programmers have to be very careful with LL/SC because they can fail for other reasons besides the comparand being different wrt CAS. Live lock
    can enter the picture.

    When I developed ESM (after ASF) I understood that SW might want to know
    not just that interference had transpired, but what kind and how many.
    So, I have a variable that can be accessed for a short time after an
    ATOMIC event called WHY. Why has 3 kinds of values:: a negative value represents a spurious error {buffer overflow, ...}, zero (0) represents success, and positive number indicates how many other threads are contending with this ATOMIC event. SW can use this to decrease future contention.

    I would suggest that spurious failures are vastly preferred compared to spurious ABA successes.


    (Reserving a single "page" would also seem to make ordering easier
    at the cost of more false sharing. Spatially dense atomics are
    also less interesting.)

    Methods applied for locks might help in some cases for hardware-
    monitored reservations, but designing an interface seems difficult
    (general enough to be useful yet also high enough performance to
    be useful) and dangerous (new techniques and use cases can change
    the two difficulty factors).

    Non-composing hardware lock elision seems to have the advantage of
    including an exclusive name. I imagine that one could use that
    name to establish a single order of atomic operations when there
    is conflict, but even then maximizing throughput seems difficult.
    If one falls back to all users of the lock name wait in line, one
    likely loses a lot of potential concurrency while the lock name is
    in "queuing mode". I suspect there are some tricks that would
    allow some queue skipping at moderate complexity/area/power/etc.
    cost (possibly involving versioned memory).

    I have not studied software concurrency (just "overhearing" a few
    things from time to time), but even a sub-novice knows that
    locking is hard (deadlock, livelock, etc.).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Sun Dec 24 21:36:06 2023
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:
    On 12/23/23 4:44 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:
    [snip]
    Do you remember the rather odd instruction for the Itanium
    called cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap
    2 over on the x86 side of things (DCAS in IBM parlance), which is the
    impetus for my inventing ASF which later morphed into ESM. My
    point was and still is that one can add 1 to 3 ATOMIC instructions
    every product rev, or one can provide the primitives such that SW can
    invent and use whatever primitive they can figure out how to program
    and how to use. {{There is no 3rd choice}}

    Third choice: do both (likely poorly).

    ESM could atomically operate on up to 512 bytes in up to 8 64-
    byte-aligned locations. The proposed atomics seem to be limited to
    a single cache block; a modestly extended LL/SC that allowed
    ordinary loads and stores within the reserved cache block would
    seem to provide that functionality. ESM is much more flexible, but
    forward progress is more difficult with more reservations. (Yes,
    ESM has mechanisms to handle queues and such, but I doubt even
    Mitch Alsup has discovered a universal solution.)

    (A proposed extension for RISC-V — Zacas — provides double-width
    CAS. I do not buy the argument that "the CAS atomic instructions
    scale better to highly parallel systems than LR/SC"; I think that
    is simply an implementation choice.

    In my experience with large scale systems, it has been shown
    that LL/SC cannot scale so long as it must acquire the cache
    line to complete the transaction.

    The advantage of atomics (particularly load+add) is that the
    core doesn't need to acquire exclusive access to the cache line
    before completing the operation - the atomic can be easily delegated
    to a higher level (e.g. LLC) of cache, passed to the cache
    that owns the line, or even to the memory controller
    (or a PCI express device).


    I also see little difficulty
    in extending LL/SC active regions to include additional loads and
    stores within the single reservation.

    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Right, wrt the reservation granule. Iirc, this can be larger than a l2
    cache line?

    Actually -any- store regardless of PA.

    The reservation granule can, in the degenerate case (e.g. on an ARM
    Cortex-M7), be the entire address space. In general, the size is implementation defined (between 4 and 512 words for ARMv8).

    From DDI0487_I_a

    When a PE writes using any instruction other than a Store-Exclusive instruction:

    � If the write is to a PA that is not marked as Exclusive Access by its local
    monitor and that local monitor is in the Exclusive Access state, it is
    IMPLEMENTATION DEFINED whether the write affects the state of the local
    monitor.

    � If the write is to a PA that is marked as Exclusive Access by its local
    monitor, it is IMPLEMENTATION DEFINED whether the write affects the
    state of the local monitor.

    � Any transition of the local monitor to the Open Access state not caused
    by the architectural execution of an operation shown in Figure B2-4 must
    not indefinitely delay forward progress of execution.

    � LoadExcl/StoreExcl loops are guaranteed to make forward progress only if,
    for any LoadExcl/StoreExcl loop within a single thread of execution,
    the software meets all of the following conditions:

    1 Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory effects,
    preloads, direct or indirect System register writes, address translation instructions, cache or TLB
    maintenance instructions, exception generating instructions, exception returns, ISB barriers, or
    indirect branches.

    An implementation is free to clear the monitor for almost any reason. Thus, applications are recommended to keep the number of instructions between the load and store deminimus (avoiding any other loads or stores).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 01:28:58 2023
    Paul A. Clayton wrote:

    On 12/24/23 1:39 PM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:
    On 12/23/23 4:44 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:
    [snip]
    Do you remember the rather odd instruction for the Itanium
    called cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap
    2 over on the x86 side of things (DCAS in IBM parlance), which is the
    impetus for my inventing ASF which later morphed into ESM. My
    point was and still is that one can add 1 to 3 ATOMIC instructions
    every product rev, or one can provide the primitives such that SW can
    invent and use whatever primitive they can figure out how to program
    and how to use. {{There is no 3rd choice}}

    Third choice: do both (likely poorly).

    ESM could atomically operate on up to 512 bytes in up to 8 64-
    byte-aligned locations. The proposed atomics seem to be limited to
    a single cache block; a modestly extended LL/SC that allowed
    ordinary loads and stores within the reserved cache block would
    seem to provide that functionality. ESM is much more flexible, but
    forward progress is more difficult with more reservations. (Yes,
    ESM has mechanisms to handle queues and such, but I doubt even
    Mitch Alsup has discovered a universal solution.)

    (A proposed extension for RISC-V — Zacas — provides double-width
    CAS. I do not buy the argument that "the CAS atomic instructions
    scale better to highly parallel systems than LR/SC"; I think that
    is simply an implementation choice.

    In my experience with large scale systems, it has been shown
    that LL/SC cannot scale so long as it must acquire the cache
    line to complete the transaction.

    I think this is significantly an implementation choice.

    There are also architectural choices involved. ESM does not HAVE
    TO hold onto cache lines, what it cannot allow is access of write
    permission. With this paradigm, one can lose cache lines to
    interference without loosing the ability to complete the ATOMIC
    event.

    IBM z/Architecture provided "constrained transactions" that
    are guaranteed to complete (hardware retry and other mechanisms).

    One of the reasons I invented the ATOIC control point and automagic
    control transfer to the control point on ATOMIC failure. HW has to
    leave the IP pointing somewhere, so why not at the failure recovery
    point ??

    I think the constraints avoided repeated TLB/cache misses and
    other issues that make reliably fast completion difficult. I got

    ESM mandates all TLB misses in the setup phase, so that none of the participating STs miss in the TLB. {{Details matter}}

    the impression that this was not implemented to provide best-
    reasonable performance; correct operation is obviously more
    important, especially for that ISA, but the description of how
    conflicts were handled was disappointing to me. (By the way,
    z/Architecture also promoted memory-updating compute instructions
    to be memory atomic if aligned.)

    I have STM, LDM, and MM as atomic, as long as the data size being
    transferred does not cross a page boundary. The bus was designed
    for page sized ATOMIC transfers, so why not allow the CPU to make
    use of this property.

    With a single reservation granule, ordering is much easier.

    You cannot reduce the exponent of complexity and retain a single
    ATOMIC granule !! See my code illustrating moving (extracting
    and then inserting) an element of a CDS where nobody can ever
    see that element not INSIDE the CDS !! So, you need fewer events
    and fewer events means less interference, so a lower exponent of
    latency,.....

    The advantage of atomics (particularly load+add) is that the
    core doesn't need to acquire exclusive access to the cache line
    before completing the operation - the atomic can be easily delegated
    to a higher level (e.g. LLC) of cache, passed to the cache
    that owns the line, or even to the memory controller
    (or a PCI express device).

    Obviously, idiom recognition could translate ll/add/sc into
    an atomic add. (Alternatively, an atomic add could be implemented
    using the mechanisms provided for ll/sc.)

    Most uses of ATOMIC_op_to_memory() are expected to execute way
    out in the memory hierarchy (wherever the request runs into the
    data). So, not only does the core have to recognize the paradigm,
    but convert it into a get_perforemed_in_memory() operation. In
    any event, Op_to_memory() is the only one that never has failure
    modes....

    While I would prefer a denser idiom and one that constrains the
    cost of idiom recognition, even a 12-byte "atomic instruction"
    using ordinary ll/sc and a compute instruction might not be so
    horrible. (I mentioned previously the possibility of a special
    ll that implied a sc after a compute operation, providing both
    a more "outspoken" idiom and a shorter expression.)

    Not horrible at all as long as the operation takes place "over in memory"
    and not in the CPU.



    I also see little difficulty
    in extending LL/SC active regions to include additional loads and
    stores within the single reservation.

    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    To be fair, my 66000 will end up at the fail point if any exception
    transpires while in an ATOMIC event {{including debug trap}}. This
    is why ESM requires all writes to pass TLB checks in the setup phase.

    <snip>

    I think Intel and IBM made the mistake of going directly to a
    largish scope (L1 cache write set and Intel at least extending the
    read set with a conservative filter) rather than providing high
    performance for small transactions and perhaps the possibility
    to experiment with larger transactions. Both also probably
    suffered from slow failure (I seem to recall Intel's TSX had
    slow failures even for small transactions).

    The mistake is that this is a problem the CPU is in a place to solve.
    The CPU is not, the memory controller is. ...

    Cliff Click proposed (in "I Wanna Bit") a L1-capacity read set and
    single access write set with the further restriction that all
    cache misses failed the atomic event (so retries were expected).
    He also had a few blog posts about the problems faced with Azul
    Systems' transactional memory (e.g., Java libraries often had
    software performance counters that artificially introduced
    conflicts).

    I think My 66000's ESM is a good first step, but I would be
    tempted to provide the ability to play with more extensive read
    sets in the first version. That might not be very useful, but I
    suspect some clever programmers would find uses.

    Any read set larger than the outstanding miss buffer is either
    supporting width the ISA cannot use, or providing a limit to
    the width of ATOMICIty.

    A conservative filter can be used to extend the read set, so an implementation would not necessarily need to change the cache
    tag metadata. (Conservative filters were used only for evictions,
    if I understand correctly, for Intel TSX. Per-cache-block marker
    bits were used to track both read and write sets.)

    In addition to all the other issues with extending the capacity
    of atomic memory operations, there is also the question of
    version-based vs. read/write set tracking. Versioning could
    facilitate some speculative multithreading and more generally
    allowing orderable conflicts, but versioning is more complex.
    There are also reasons why one would like to leak more of the
    implementation information rather than provide a generic
    abstraction (not only for what operations are more likely to
    be fast/succeed but also what information might be usefully
    passed across the interface).

    I think having hardware on which to test ideas would be a
    significant benefit. I also know that I am biased by seeing
    limited transactional memory support as being relatively
    cheap and an obvious extension to a more clearly useful
    interface like ESM. The main problem with providing such seems
    to be the burden of compatibility.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Mon Dec 25 09:44:14 2023
    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads and
    stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block.
    I can't think of any way to make use of that since only the most recent
    LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and retains
    the line lock, and SCR Store Conditional Release which conditionally
    applies all SCH and SCR updates and releases the line lock.

    It requires an extra cache line buffer to retain all SCH updates until
    the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    Not sure how useful multi-word LL/SCH/SCR is though since all updates must
    be within one cache line. It can do atomic double, quad and octa-wide CAS,
    or atomic compare-exchange plus increment a generation number.

    Right, wrt the reservation granule. Iirc, this can be larger than a l2
    cache line?

    Actually -any- store regardless of PA.

    The reservation granule can, in the degenerate case (e.g. on an ARM Cortex-M7), be the entire address space. In general, the size is implementation defined (between 4 and 512 words for ARMv8).

    From DDI0487_I_a

    When a PE writes using any instruction other than a Store-Exclusive instruction:

    � If the write is to a PA that is not marked as Exclusive Access by its local
    monitor and that local monitor is in the Exclusive Access state, it is
    IMPLEMENTATION DEFINED whether the write affects the state of the local
    monitor.

    � If the write is to a PA that is marked as Exclusive Access by its local
    monitor, it is IMPLEMENTATION DEFINED whether the write affects the
    state of the local monitor.

    � Any transition of the local monitor to the Open Access state not caused
    by the architectural execution of an operation shown in Figure B2-4 must
    not indefinitely delay forward progress of execution.

    � LoadExcl/StoreExcl loops are guaranteed to make forward progress only if,
    for any LoadExcl/StoreExcl loop within a single thread of execution,
    the software meets all of the following conditions:

    1 Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory effects,
    preloads, direct or indirect System register writes, address translation instructions, cache or TLB
    maintenance instructions, exception generating instructions, exception returns, ISB barriers, or
    indirect branches.

    An implementation is free to clear the monitor for almost any reason. Thus, applications are recommended to keep the number of instructions between the load and store deminimus (avoiding any other loads or stores).

    Pretty much the same rules as Alpha.
    These rules appear to allow it to acquire the line with exclusive ownership, pin the cache line between LE and SE, and release the pin if anything other than an SE happens. The pinning-while-exclusively-owned is what guarantees forward progress. It doesn't set an upper limit on the number of
    instructions or cpu clocks between LE and SE other than a time clock or
    any other interrupt will cancel the lock.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 17:08:06 2023
    Paul A. Clayton wrote:

    On 12/25/23 9:44 AM, EricP wrote:
    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads
    and
    stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block.
    I can't think of any way to make use of that since only the most
    recent
    LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and
    retains
    the line lock, and SCR Store Conditional Release which conditionally
    applies all SCH and SCR updates and releases the line lock.

    It requires an extra cache line buffer to retain all SCH updates
    until
    the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    Not sure how useful multi-word LL/SCH/SCR is though since all
    updates must
    be within one cache line. It can do atomic double, quad and
    octa-wide CAS,
    or atomic compare-exchange plus increment a generation number.

    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, but the HW does not know if that memory reference is participating
    or not in the event. That is why participating must be announced.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Dec 25 17:06:11 2023
    EricP wrote:

    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads and
    stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block.
    I can't think of any way to make use of that since only the most recent
    LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,

    My 66000 allows for more than 1 cache line (up to 8) each cache line
    is accessed by an inbound memory reference {LD or PREfetch} and a
    monitor is setup to watch the cached lines. Each touched line becomes
    a participating cache line.

    and SCH Store Conditional Hold which conditionally updates and retains
    the line lock, and SCR Store Conditional Release which conditionally
    applies all SCH and SCR updates and releases the line lock.

    The monitor remains active until a <single> outbound memory reference
    {ST or PUSH} and then allows for all state modifications to all the
    monitored lines to become visible as if at the same instant. A locked
    ST to a nonparticipatng cache line abandons the event (with failure)

    It requires an extra cache line buffer to retain all SCH updates until
    the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    I use the miss buffer for this. {{Sts to lines not participating in the
    event are performed normally.}}

    Not sure how useful multi-word LL/SCH/SCR is though since all updates must
    be within one cache line. It can do atomic double, quad and octa-wide CAS,
    or atomic compare-exchange plus increment a generation number.

    Not nearly as useful as across cache lines.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 18:41:58 2023
    Paul A. Clayton wrote:

    On 12/25/23 12:08 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    On 12/25/23 9:44 AM, EricP wrote:
    [snip]
    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and
    retains
    the line lock, and SCR Store Conditional Release which
    conditionally
    applies all SCH and SCR updates and releases the line lock.
    [snip]
    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, but the HW does not know if that memory reference is
    participating
    or not in the event. That is why participating must be announced.

    I was assuming a simple interface where all memory operations
    inside the atomic operation participated. ASF allowed not only non-participating memory operations but also releasing of
    participants (I am not certain if ESM has the latter).

    How are you going to debug an ATOMIC event when you are not
    allowed to single step through the event ?? .... Every exception
    (of which debug trap is certainly one) sets IP to the control
    point and then takes the exception.

    ESM is
    obviously a better and more flexible interface, but for merely
    extending the semantics of an existing ISA interpreting LL as
    Transaction Begin and SC as Transaction End seems fairly nice.

    If I recall correctly, most ISAs with LL/SC give those
    instructions very limited addressing modes, so limiting address

    There is no limitation of address mode in My 66000.

    generation clutter to two memory accesses seems more desirable
    than reusing LL (and one would have to add a "Store Conditional
    Hold" instruction since SC ends the atomic operation).

    For simple atomics, one does not need escaping memory operations
    or release of reservations even though such are useful
    features. If limited to one granule, escaping and releasing
    operations do not make much sense (unless there is additional
    false sharing tracking) — EricP's proposal was for only one
    granule/cache block.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 25 18:38:11 2023
    Paul A. Clayton wrote:

    On 12/24/23 8:28 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 12/24/23 1:39 PM, Scott Lurndal wrote:
    [snip]
    In my experience with large scale systems, it has been shown
    that LL/SC cannot scale so long as it must acquire the cache
    line to complete the transaction.

    I think this is significantly an implementation choice.

    There are also architectural choices involved. ESM does not HAVE
    TO hold onto cache lines, what it cannot allow is access of write
    permission. With this paradigm, one can lose cache lines to
    interference without loosing the ability to complete the ATOMIC
    event.

    ESM holds the read-and-write set in a write buffer, so the cache

    Where the buffer is writable but the cache is left in a pristine
    state. I plan on using the Miss buffer for this. The size of the
    miss buffer is the limiter of the number of participating cache
    lines. The miss buffer is also <essentially> fully associative.
    The Miss buffer is also SNOOPed to keep track of interference.

    itself holds no tracking data. If the data cache holds tracking
    data, an eviction from limited associativity could allow a write
    to the evicted cache block *if* the block was part of the atomic
    operation. (If it was merely resident from previous accesses,
    eviction is not important for the atomic operation, though
    eviction might slow completion if that data is later part of the
    atomic operation.)

    ASF and ESM still work even when there is no <data> cache.

    Yes, for the read set the data does not need to be retained. A
    list of addresses or a conservative filter are sufficient to
    guard against writes. (If the atomic operation can establish a
    time marker such that any write to evicted blocks happens "after"
    the atomic operation commits. For something like ESM, a cache
    block that has received its last update could be sent rather than
    a NAK if the atomic operation is being delayed by a data from a
    cache miss in the read set.

    Yes, if we know that the event will succeed the miss buffer can hand
    off the partially completed as if "they all become visible instantaneously".

    This "optimization" would have the
    bad consequence of spreading poisoned data when an uncorrectable
    ECC error is encountered for the cache miss — the atomic stores
    dependent on that delayed data would have to be poisoned as
    uncorrectable — as well as potentially being more complex than
    beneficial.)

    IBM z/Architecture provided "constrained transactions" that
    are guaranteed to complete (hardware retry and other mechanisms).

    One of the reasons I invented the ATOIC control point and automagic
    control transfer to the control point on ATOMIC failure. HW has to
    leave the IP pointing somewhere, so why not at the failure recovery
    point ??

    I think the earlier ASF had that feature. The LL/SC interface
    requires a separate branch instruction at the end. Yet there
    also seem to be cases where the failure type would just
    indicate "retry".

    The earlier ASF had part of this feature, it did not have the change
    in the control point when a BIN (branch on interference) was
    performed {mostly because x86 uses Condition Coded branches and
    there was no way of adding "interference" to the CC.

    I think the constraints avoided repeated TLB/cache misses and
    other issues that make reliably fast completion difficult. I got

    ESM mandates all TLB misses in the setup phase, so that none of
    the participating STs miss in the TLB. {{Details matter}}

    Yes, details matter! Experience and deep thought (and
    communication with other [experienced] deep thinkers) can expose
    details and how they interact. An elegant design has synergy,
    recurring patterns, and encapsulation that seems natural (and
    obvious after the fact).

    [snip]
    (By the way,
    z/Architecture also promoted memory-updating compute instructions
    to be memory atomic if aligned.)

    I have STM, LDM, and MM as atomic, as long as the data size being
    transferred does not cross a page boundary. The bus was designed
    for page sized ATOMIC transfers, so why not allow the CPU to make
    use of this property.

    This is a nice synergy example.

    With a single reservation granule, ordering is much easier.

    You cannot reduce the exponent of complexity and retain a single
    ATOMIC granule !! See my code illustrating moving (extracting
    and then inserting) an element of a CDS where nobody can ever see
    that element not INSIDE the CDS !! So, you need fewer events
    and fewer events means less interference, so a lower exponent of
    latency,.....

    I was not suggesting that all atomics be limited to a single
    granule but that this case could be optimized more easily than
    for multiple granules. (The interaction of multi-granule and
    single-granule atomics does complicate this, of course.)

    Even within a larger atomic, single conflict optimizations are
    possible. (I mentioned one possibility for ESM above.) If one
    granule is under contention, the ordering principles for single
    granule atomics may be applicable.

    Single cache line operate under the same guidelines, but can only
    utilize a strict subset of the possible orderings.

    There are also optimizations for single contender, where handoffs
    would be more practical. Even a case where a thread is about to
    complete an atomic that would write data that would change the
    participants of a conflicting atomic operation might allow some
    optimization where the almost complete atomic prefetches the new
    participant data to the thread with the other atomic operation and
    provides that atomic with an "I'm next" token.

    Another possible optimization would be to commit an atomic that
    is waiting on a cache miss for one granule in its read set. If the
    atomic is guaranteed to commit, it can release all the non-
    dependent members to other threads (the dependent members have to
    be reserved until the data arrives from the cache miss but their
    modification happens "before" the accesses of the released
    members).

    Be very careful, here.

    Such would have the bad aspect that an uncorrectable
    ECC error on the cache miss data would force additional memory
    locations to be poisoned. (With limited versioned memory, a
    rollback of the atomic event might be possible. Given that the
    ECC error would be greatly delayed, I suspect undoing all the
    dependent effects would be excessively complex and error-prone.)

    Most systems take unrecoverable ECC failures are Machine Check.
    You may be able to save the Guest OS, but you are not going to
    be able to save the application.

    [snip]
    I think Intel and IBM made the mistake of going directly to a
    largish scope (L1 cache write set and Intel at least extending the
    read set with a conservative filter) rather than providing high
    performance for small transactions and perhaps the possibility
    to experiment with larger transactions. Both also probably
    suffered from slow failure (I seem to recall Intel's TSX had
    slow failures even for small transactions).

    The mistake is that this is a problem the CPU is in a place to solve.
    The CPU is not, the memory controller is. ...

    Could you unpack that statement?

    It is my contention after studying this problem since 1977, that
    the CPU is not in a position to perform ATOMIC things--the proper
    place for these is over in the memory hierarchy (memory controller
    or DRAM controller of which there are multiples). The OP_to_memory()
    primitives guarantee all sorts of things that no CAS can guarantee.

    As the scope of the atomic increases (number of granules, amount
    of computation), it would seem that minimizing communication would
    favor doing more of the operation closer to the core. Moving the
    compute to the data would also not necessarily move it all the
    way to the memory controller. There may even be cases where
    something like test-and-test-and-set is useful; a first non-atomic
    read checks the data, then if the data needs updating an atomic
    operation is performed. While simple cases like test-and-test-and-
    set would avoid the first test when computation is moved to data,
    one can imagine more complex cases where core-local data interacts
    with the more contended data. There may be less communication if
    core orchestrates the operation.

    The fact you illustrated using test_and_test_and_set illustrates
    WHY the CPU is not in a position to "do the right thing". Whereas add_to_memory() always does the right thing without looping,...
    Notice that most GPUs only have OP_to_memory() atomics.

    [snip]
    I think My 66000's ESM is a good first step, but I would be
    tempted to provide the ability to play with more extensive read
    sets in the first version. That might not be very useful, but I
    suspect some clever programmers would find uses.

    Any read set larger than the outstanding miss buffer is either
    supporting width the ISA cannot use, or providing a limit to
    the width of ATOMICIty.

    Unpack?

    Since MM (memory copy) is already atomic when within a page, could
    it not be included within an ESM operation even when more than
    eight cache lines are accessed? Such would extend both the read
    set and the write set. (I do not know that such would be useful,
    but it seems likely not to add much hardware or complexity.)

    Maybe, on a theoretical basis, they could.
    I did not run into any practical ATOMIC codes that needed anything
    remotely like multiple MMs as a single ATOMIC event.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 26 00:12:22 2023
    Paul A. Clayton wrote:

    On 12/25/23 1:38 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]

    ASF and ESM still work even when there is no <data> cache.

    The buffer holding 8 64-byte entries is a tiny data cache.

    Yes but its name is the Miss Buffer on many machines. Most Miss
    Buffers are not accessible like a cache.

    A conservative filter could be used instead of a data cache for an
    extended read set.

    While providing compatibility across implementations is very
    valuable, varying capability across implementations does not seem
    horrible to me.

    You still have to be able to make Architectural statements about
    what does what. And if you want any semblance of compatibility
    the lowest end machine has to be able to perform significant
    ATOMIC events.

    ESM sets the minimum and maximum capacity of an atomic operation
    for all implementations. This means that a capacity failure in
    any implementation is guaranteed to be one in all implementations.

    For compatibility !!

    But in all senses:: 8 cache lines is so much more powerful that
    anything programmers have access to today, making it bigger or
    its size vague is not worthy.

    ESM also guarantees that atomic conflicts will be atomic
    conflicts. A conservative filter would provide failures that are
    not true conflicts. Limited-associativity cache-based tracking
    would introduce evictions caused by the atomic memory accesses.

    ESM still has a few spurious failure modes--especially in the
    "methodological" operating mode.

    (A cuckoo/elbow cache — skewed associative cache where a victim
    from one index is reinserted at one of its alternative indexes —
    could greatly increase the effective associativity. Copying cache
    blocks may be preferable to failing a transaction. A modest victim
    cache could buffer the copy activity and avoid reinsertion based
    on new information or longer analysis of information.)

    A FA Miss Buffer solves this without complexity.

    Combining the optimism of no atomic conflicts with the optimism of
    fitting into _effective_ capacity does not seem extremely wrong-
    headed to me. ESM provides a way to moderate the atomic conflict
    optimism with a "retry" that assumes N conflicts. Hardware might
    be able to somewhat moderate effective capacity optimism (e.g.,
    cuckoo cache, index recoding, borrowing capacity from a neighbor),
    but I suspect using a non-optimistic fallback path would be
    necessary.

    Maybe only fools would push capacity limits, especially since
    failing a large transaction is very expensive. (Perhaps the
    ability to query estimated effective capacity might be useful
    with a transaction. If a transaction could be split into
    smaller transactions at significant cost — such that the
    larger transaction would be preferred if there is low conflict
    and capacity is sufficient — that choice might be made even
    partially into the first attempt.)

    If you look into the "math" you will see that the exponent e in
    BigO( m^e ) is what bounds (minimum) system time in an ATOMIC
    event. Making bigger ATOMIC events makes e smaller. This rule
    is opposite what you propose above. Although there are some
    events where you rule may take hold I suspect minimizing the
    total number of events needed reduces overhead faster than
    making events as small as possible { Test_and_Set() }.

    [snip]
    There are also optimizations for single contender, where
    handoffs would be more practical. Even a case where a thread is
    about to
    complete an atomic that would write data that would change the
    participants of a conflicting atomic operation might allow some
    optimization where the almost complete atomic prefetches the new
    participant data to the thread with the other atomic operation and
    provides that atomic with an "I'm next" token.

    Another possible optimization would be to commit an atomic that
    is waiting on a cache miss for one granule in its read set. If the
    atomic is guaranteed to commit, it can release all the non-
    dependent members to other threads (the dependent members have to
    be reserved until the data arrives from the cache miss but their
    modification happens "before" the accesses of the released
    members).

    Be very careful, here.

    I am somewhat aware that these are pushing into the hairy
    complexity where even formal methods will not keep one from
    losing both feet and most of both legs with death from
    bleeding out if not from the explosion of the foot cannon
    shell itself.

    Be especially careful on which trigger you pull when both hands
    hold guns pointing at your feet.

    It is fun, however, to think of such possibilities.

              Such would have the bad aspect that an uncorrectable
    ECC error on the cache miss data would force additional memory
    locations to be poisoned. (With limited versioned memory, a
    rollback of the atomic event might be possible. Given that the
    ECC error would be greatly delayed, I suspect undoing all the
    dependent effects would be excessively complex and error-prone.)

    Most systems take unrecoverable ECC failures are Machine Check.
    You may be able to save the Guest OS, but you are not going to
    be able to save the application.

    Sometimes the application could be saved. If the data is in the
    text segment, recovery may be trivial. If the data can be isolated
    to a specific task and that task's starting data is available,
    more localized restarting would be possible. Sometimes the
    specific task could even be dropped (e.g., a Monte Carlo search
    could just drop one random attempt with likely little harm).
    Just killing the application may be the best option most of
    time, but playing "what if" can be fun.

    At great overhead, applications can avail themselves of recovery
    {generally by checkpointing a large amount of state.} In general,
    it does not happen very often outside of large simulations and
    data-base applications.

    Uncorrectable tag errors for possibly dirty data seem more
    fatal. Poisoning 0.1% of memory that matches the cache block's
    index seems extremely likely to kill the system. Isolating
    cache allocations with a mechanism like logical partitioning
    would allow such a failure to be isolated to a fraction of the
    active programs, but even then a tag error seems really bad.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Tue Dec 26 13:50:47 2023
    MitchAlsup wrote:
    Paul A. Clayton wrote:

    There are also architectural choices involved. ESM does not HAVE TO hold
    onto cache lines, what it cannot allow is access of write
    permission. With this paradigm, one can lose cache lines to interference without loosing the ability to complete the ATOMIC event.

    In my HTM design I do NOT use the coherence protocol Invalidate messages
    to detect collisions, as Intel RTM does and AMD ASF would have.
    Instead I use a separate set of Atomic Transaction (ATX) messages
    which travel on the coherence network but are never sent to cache.
    However the Atomic Transaction Manager (ATM) can emit coherence messages
    and cause lines to move between caches.

    Making the ATX messaging completely separate from Invalidate messages
    was crucial to eliminating the "false sharing cancellation" problem,
    where a line that was touched for bytes ADJACENT to the ones in the
    transaction caused an Invalidate message that aborted the transaction.

    IBM z/Architecture provided "constrained transactions" that
    are guaranteed to complete (hardware retry and other mechanisms).

    One of the reasons I invented the ATOIC control point and automagic
    control transfer to the control point on ATOMIC failure. HW has to
    leave the IP pointing somewhere, so why not at the failure recovery
    point ??

    I think the constraints avoided repeated TLB/cache misses and
    other issues that make reliably fast completion difficult. I got

    ESM mandates all TLB misses in the setup phase, so that none of the participating STs miss in the TLB. {{Details matter}}

    This is why I added a PREFETCH_TRANSLATION instruction so one can
    pre-load the TLB for data without actually touching the data item and
    causing the cache line to move between caches until the transaction
    is actually in progress.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Paul A. Clayton on Tue Dec 26 13:18:37 2023
    Paul A. Clayton wrote:
    On 12/25/23 9:44 AM, EricP wrote:
    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads and
    stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block.
    I can't think of any way to make use of that since only the most recent
    LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and retains
    the line lock, and SCR Store Conditional Release which conditionally
    applies all SCH and SCR updates and releases the line lock.

    It requires an extra cache line buffer to retain all SCH updates until
    the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    Not sure how useful multi-word LL/SCH/SCR is though since all updates
    must
    be within one cache line. It can do atomic double, quad and octa-wide
    CAS,
    or atomic compare-exchange plus increment a generation number.

    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    The LL/SC interface could, in fact, be extended to have the LL be
    Transaction Begin and the SC be Transaction End with arbitrary
    scope. I get the impression that most ISAs provide a success
    result in a GPR, so multiple failure modes could be provided.

    Such extensions would only require updating documentation (and compilers/libraries and hardware implementations). Hardware could
    be updated before the documentation not even necessarily needing
    "chicken bits" to disable the functionality.

    A natural progression of scope seems to be single access read-and-
    write set, read and write set both within a single granule, write
    set in single granule and read set extended to largish number of
    granules, write set extended to several granules, finally perhaps
    large number of granules for the write set.

    (I think largish read sets are easy enough to support that
    "several granule" read sets would not be a necessary step.)

    Performance optimizations could be interleaved with expanding
    scope.

    Yes, it doesn't actually need a SCH instruction.
    Just document that multiple LL are allowed to the same cache line,
    a normal ST to the same line is held aside if the lock is active,
    and SC ends the transaction and updates all changed line bytes.
    If all LL/ST/SC are not to the same line it cancels the lock.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Dec 27 00:24:48 2023
    EricP wrote:

    MitchAlsup wrote:
    Paul A. Clayton wrote:

    There are also architectural choices involved. ESM does not HAVE TO hold
    onto cache lines, what it cannot allow is access of write
    permission. With this paradigm, one can lose cache lines to interference
    without loosing the ability to complete the ATOMIC event.

    In my HTM design I do NOT use the coherence protocol Invalidate messages
    to detect collisions, as Intel RTM does and AMD ASF would have.
    Instead I use a separate set of Atomic Transaction (ATX) messages
    which travel on the coherence network but are never sent to cache.
    However the Atomic Transaction Manager (ATM) can emit coherence messages
    and cause lines to move between caches.

    Making the ATX messaging completely separate from Invalidate messages
    was crucial to eliminating the "false sharing cancellation" problem,
    where a line that was touched for bytes ADJACENT to the ones in the transaction caused an Invalidate message that aborted the transaction.

    IBM z/Architecture provided "constrained transactions" that
    are guaranteed to complete (hardware retry and other mechanisms).

    One of the reasons I invented the ATOIC control point and automagic
    control transfer to the control point on ATOMIC failure. HW has to
    leave the IP pointing somewhere, so why not at the failure recovery
    point ??

    I think the constraints avoided repeated TLB/cache misses and
    other issues that make reliably fast completion difficult. I got

    ESM mandates all TLB misses in the setup phase, so that none of the
    participating STs miss in the TLB. {{Details matter}}

    This is why I added a PREFETCH_TRANSLATION instruction so one can
    pre-load the TLB for data without actually touching the data item and
    causing the cache line to move between caches until the transaction
    is actually in progress.

    If you use a PRE (prefetch) instruction and specify DRAM as the location
    to prefetch, you end up only making sure the TLB is loaded.

    PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
    determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

    I have also run into a couple of lock-like situations where one PRE's
    the container with .lock, and later PUSH's the same container (also
    with .lock) to end the ATOMIC event without ever seeing the contained
    value.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Wed Dec 27 09:51:47 2023
    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    I think the constraints avoided repeated TLB/cache misses and
    other issues that make reliably fast completion difficult. I got

    ESM mandates all TLB misses in the setup phase, so that none of the
    participating STs miss in the TLB. {{Details matter}}

    This is why I added a PREFETCH_TRANSLATION instruction so one can
    pre-load the TLB for data without actually touching the data item and
    causing the cache line to move between caches until the transaction
    is actually in progress.

    If you use a PRE (prefetch) instruction and specify DRAM as the location
    to prefetch, you end up only making sure the TLB is loaded.

    PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

    Also the prefetch translation should, if necessary, set the PTE's
    Accessed and Dirty bits as setting them requires transferring the
    PTE's cache line as exclusive into that D$L1 cache, which also takes time.

    Gets as much stuff done as possible that would cause transaction latency
    before the transaction starts, and minimize the size of any contention
    window.

    I'm not sure how much help the would really be though as many shared
    data structures are accessed through pointers, and to get those pointers
    you have to read the structure.

    For example, in a double linked list insert you could PRE-fetch the
    translation for the list header, but to get the pointer to the first
    object you have to read the header which requires transferring the
    header's line this cache, which you don't want to do until the
    transaction actually starts because doing so prematurely would just
    delay the current updater's transaction.

    I have also run into a couple of lock-like situations where one PRE's
    the container with .lock, and later PUSH's the same container (also with .lock) to end the ATOMIC event without ever seeing the contained
    value.

    You should mention that your PUSH instruction is a cache flush
    "used to push data out of the cache back towards memory".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Dec 27 17:17:31 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    I think the constraints avoided repeated TLB/cache misses and
    other issues that make reliably fast completion difficult. I got

    ESM mandates all TLB misses in the setup phase, so that none of the
    participating STs miss in the TLB. {{Details matter}}

    This is why I added a PREFETCH_TRANSLATION instruction so one can
    pre-load the TLB for data without actually touching the data item and
    causing the cache line to move between caches until the transaction
    is actually in progress.

    If you use a PRE (prefetch) instruction and specify DRAM as the location
    to prefetch, you end up only making sure the TLB is loaded.

    PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
    determine where the data (line) is prefetched to {L1, L2, L3, DRAM}.

    Also the prefetch translation should, if necessary, set the PTE's
    Accessed and Dirty bits as setting them requires transferring the
    PTE's cache line as exclusive into that D$L1 cache, which also takes time.

    If you want this semantic::

    PRE RW-L1,[address].lock

    instead of::

    PRE RW-DRAM,[address].lock

    Gets as much stuff done as possible that would cause transaction latency before the transaction starts, and minimize the size of any contention window.

    I'm not sure how much help the would really be though as many shared
    data structures are accessed through pointers, and to get those pointers
    you have to read the structure.

    These prefetches should be of the former rather than later.

    For example, in a double linked list insert you could PRE-fetch the translation for the list header, but to get the pointer to the first
    object you have to read the header which requires transferring the
    header's line this cache, which you don't want to do until the
    transaction actually starts because doing so prematurely would just
    delay the current updater's transaction.

    I have also run into a couple of lock-like situations where one PRE's
    the container with .lock, and later PUSH's the same container (also with
    .lock) to end the ATOMIC event without ever seeing the contained
    value.

    You should mention that your PUSH instruction is a cache flush
    "used to push data out of the cache back towards memory".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Thu Dec 28 14:10:51 2023
    EricP wrote:
    Paul A. Clayton wrote:
    On 12/25/23 9:44 AM, EricP wrote:
    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads and >>>>>>> stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block.
    I can't think of any way to make use of that since only the most recent
    LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and retains
    the line lock, and SCR Store Conditional Release which conditionally
    applies all SCH and SCR updates and releases the line lock.

    It requires an extra cache line buffer to retain all SCH updates until
    the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    Not sure how useful multi-word LL/SCH/SCR is though since all updates
    must
    be within one cache line. It can do atomic double, quad and octa-wide
    CAS,
    or atomic compare-exchange plus increment a generation number.

    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, it doesn't actually need a SCH instruction.
    Just document that multiple LL are allowed to the same cache line,
    a normal ST to the same line is held aside if the lock is active,
    and SC ends the transaction and updates all changed line bytes.
    If all LL/ST/SC are not to the same line it cancels the lock.

    On reflection, eliminating the SCH instruction would leave this
    vulnerable to an interrupt occurring in the multiple store sequence.
    The interrupt would cancel the lock, and on return a regular ST would
    not be distinguishable as intended to be part of an atomic sequence,
    and so would erroneously update memory, whereas a SCH would be tossed
    because the lock was no longer set.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Thu Dec 28 22:42:47 2023
    EricP wrote:

    EricP wrote:
    Paul A. Clayton wrote:
    On 12/25/23 9:44 AM, EricP wrote:
    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads and >>>>>>>> stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block.
    I can't think of any way to make use of that since only the most recent >>>> LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and retains >>>> the line lock, and SCR Store Conditional Release which conditionally
    applies all SCH and SCR updates and releases the line lock.

    It requires an extra cache line buffer to retain all SCH updates until >>>> the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    Not sure how useful multi-word LL/SCH/SCR is though since all updates
    must
    be within one cache line. It can do atomic double, quad and octa-wide
    CAS,
    or atomic compare-exchange plus increment a generation number.

    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, it doesn't actually need a SCH instruction.
    Just document that multiple LL are allowed to the same cache line,
    a normal ST to the same line is held aside if the lock is active,
    and SC ends the transaction and updates all changed line bytes.
    If all LL/ST/SC are not to the same line it cancels the lock.

    On reflection, eliminating the SCH instruction would leave this
    vulnerable to an interrupt occurring in the multiple store sequence.

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appears to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the store can be committed they will be and the
    event succeeds, and if they cannot (or cannot be determined that they
    can) the the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to transfer of control to more privilege operation, the event fails
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    The interrupt would cancel the lock,

    ESM cancels the event not the lock.

    and on return a regular ST would
    not be distinguishable as intended to be part of an atomic sequence,

    Upon return you are no longer "in" the ATOMIC event (control transfers
    to the control point before control transfers to the more privileged
    operation.

    and so would erroneously update memory,

    So, there is no control transfer "back" inside the ATOMIC event.

    whereas a SCH would be tossed
    because the lock was no longer set.

    There should be no read or write set upon return.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Fri Dec 29 12:17:53 2023
    MitchAlsup wrote:
    EricP wrote:

    EricP wrote:
    Paul A. Clayton wrote:
    On 12/25/23 9:44 AM, EricP wrote:
    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads and >>>>>>>>> stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause
    the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block. >>>>> I can't think of any way to make use of that since only the most
    recent
    LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed
    multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and retains >>>>> the line lock, and SCR Store Conditional Release which conditionally >>>>> applies all SCH and SCR updates and releases the line lock.

    It requires an extra cache line buffer to retain all SCH updates until >>>>> the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    Not sure how useful multi-word LL/SCH/SCR is though since all
    updates must
    be within one cache line. It can do atomic double, quad and
    octa-wide CAS,
    or atomic compare-exchange plus increment a generation number.

    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, it doesn't actually need a SCH instruction.
    Just document that multiple LL are allowed to the same cache line,
    a normal ST to the same line is held aside if the lock is active,
    and SC ends the transaction and updates all changed line bytes.
    If all LL/ST/SC are not to the same line it cancels the lock.

    On reflection, eliminating the SCH instruction would leave this
    vulnerable to an interrupt occurring in the multiple store sequence.

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appears to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the store can be committed they will be and the
    event succeeds, and if they cannot (or cannot be determined that they
    can) the the event fails and none of the stores appear to have been performed.

    In any event, if the thread performing the event cannot be completed,
    due to transfer of control to more privilege operation, the event fails control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.
    It has a formal start to the transaction that establishes an abort
    transfer RIP address for when an interrupt or exception occurs.
    That state change in the RIP indicates the transaction has terminated.

    I have two sets of instructions, the usual atomic RMW and LL/SCH/SCR ones,
    and a completely different set for atomic transactions.

    The LL/SCH/SCR is simple and cheap and could be easily retrofitted into
    a core with LL/SC. However it has no abort transfer address and so needs
    some other means to distinguish an atomic store from a regular one.
    Thus SCH as a distinct instruction is required.

    The interrupt would cancel the lock,

    ESM cancels the event not the lock.

    That reference to "lock" came from the LL Load Locked instruction.

    My AT instructions refer to "canceling the transaction".

    and on return a regular ST would
    not be distinguishable as intended to be part of an atomic sequence,

    Upon return you are no longer "in" the ATOMIC event (control transfers
    to the control point before control transfers to the more privileged operation.

    Right, for multi-line transactions.

    The LL/SCH/SCR instructions are not a multi-line transaction.
    They are a single cache line atomic multiple update.

    and so would erroneously update memory,

    So, there is no control transfer "back" inside the ATOMIC event.

    Right, in atomic transactions.

    I see atomic RMW and LL/SCH/SCR as one class of instruction
    as they all deal with updates to a single cache line
    and their HW is optimized for just that one operation.

    The AT Atomic Transaction instructions are completely separate,
    dealing with multiple updates to multiple lines,
    and having more flexibility but also more overhead.

    whereas a SCH would be tossed
    because the lock was no longer set.

    There should be no read or write set upon return.

    Yes, in a transaction there isn't - the interrupt/exception cancels the transaction tossing all pending guarded memory changes,
    updates the status register, and sets the RIP to the abort address.

    Though I have been thinking lately that might be overly conservative.
    It might not need to cancel a transaction just because there is an interrupt, as long as the interrupt handler doesn't initiate a new transaction or
    read or write any memory that is a member of the transaction.
    That might cut down on spurious unnecessary transaction aborts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Fri Dec 29 20:45:01 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    EricP wrote:
    Paul A. Clayton wrote:
    On 12/25/23 9:44 AM, EricP wrote:
    Scott Lurndal wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 12/24/2023 10:39 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <[email protected]> writes:

    I also see little difficulty
    in extending LL/SC active regions to include additional loads and >>>>>>>>>> stores within the single reservation.
    ARM has specified that any store by the processor _may_ cause >>>>>>>>> the SC to (Store Exclusive) to fail.

    Same as on Alpha.
    Alpha rules only required LL and SC to be to the same 16 byte block. >>>>>> I can't think of any way to make use of that since only the most
    recent
    LL is active at a time.

    I had an idea a while back for an extension to LL/SC which allowed >>>>>> multiple LL's as long as they were for the same cache line,
    and SCH Store Conditional Hold which conditionally updates and retains >>>>>> the line lock, and SCR Store Conditional Release which conditionally >>>>>> applies all SCH and SCR updates and releases the line lock.

    It requires an extra cache line buffer to retain all SCH updates until >>>>>> the SCR update succeeds then they all update the cache together.
    Other than that, it is basically the same as LL/SC.

    Not sure how useful multi-word LL/SCH/SCR is though since all
    updates must
    be within one cache line. It can do atomic double, quad and
    octa-wide CAS,
    or atomic compare-exchange plus increment a generation number.

    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, it doesn't actually need a SCH instruction.
    Just document that multiple LL are allowed to the same cache line,
    a normal ST to the same line is held aside if the lock is active,
    and SC ends the transaction and updates all changed line bytes.
    If all LL/ST/SC are not to the same line it cancels the lock.

    On reflection, eliminating the SCH instruction would leave this
    vulnerable to an interrupt occurring in the multiple store sequence.

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appears to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the store can be committed they will be and the
    event succeeds, and if they cannot (or cannot be determined that they
    can) the the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to transfer of control to more privilege operation, the event fails
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    It has a formal start to the transaction that establishes an abort
    transfer RIP address for when an interrupt or exception occurs.
    That state change in the RIP indicates the transaction has terminated.

    I have two sets of instructions, the usual atomic RMW and LL/SCH/SCR ones, and a completely different set for atomic transactions.

    The LL/SCH/SCR is simple and cheap and could be easily retrofitted into
    a core with LL/SC. However it has no abort transfer address and so needs
    some other means to distinguish an atomic store from a regular one.
    Thus SCH as a distinct instruction is required.

    I am assuming SCH is SC and Hole while SCR is SC and Release.

    The interrupt would cancel the lock,

    ESM cancels the event not the lock.

    That reference to "lock" came from the LL Load Locked instruction.

    My AT instructions refer to "canceling the transaction".

    Which I call an event==transaction. In any event (sic); the event does
    not span (persist across) control transfers to other privilege levels.
    It is this point that requires the establishment of the failure control
    point.

    and on return a regular ST would
    not be distinguishable as intended to be part of an atomic sequence,

    Upon return you are no longer "in" the ATOMIC event (control transfers
    to the control point before control transfers to the more privileged
    operation.

    Right, for multi-line transactions.

    The LL/SCH/SCR instructions are not a multi-line transaction.
    They are a single cache line atomic multiple update.

    and so would erroneously update memory,

    So, there is no control transfer "back" inside the ATOMIC event.

    Right, in atomic transactions.

    I see atomic RMW and LL/SCH/SCR as one class of instruction
    as they all deal with updates to a single cache line
    and their HW is optimized for just that one operation.

    The AT Atomic Transaction instructions are completely separate,
    dealing with multiple updates to multiple lines,
    and having more flexibility but also more overhead.

    whereas a SCH would be tossed
    because the lock was no longer set.

    There should be no read or write set upon return.

    Yes, in a transaction there isn't - the interrupt/exception cancels the transaction tossing all pending guarded memory changes,
    updates the status register, and sets the RIP to the abort address.

    By the time the exception/interrupt has been taken care of, millions of
    cycles can have gone by (millisecond at this point) so the state of the concurrent data structure is stale as seen by the thread receiving
    control in the middle of an event--the probability of the pending
    ATOMIC event succeeding is so remote that you should not lean in
    that direction.

    Though I have been thinking lately that might be overly conservative.
    It might not need to cancel a transaction just because there is an interrupt, as long as the interrupt handler doesn't initiate a new transaction or
    read or write any memory that is a member of the transaction.
    That might cut down on spurious unnecessary transaction aborts.

    There might be hundreds of CPUs all doing "transactions" a few percent of
    their time, to any number of resources.

    In My case, failing the event on exception/interrupt is a way of
    minimizing the amount of state that is needed to be saved/restored;
    and a way of minimizing priority inversion.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Dec 29 22:06:46 2023
    Chris M. Thomasson wrote:

    On 12/21/2023 1:49 PM, Chris M. Thomasson wrote:
    On 12/20/2023 8:29 AM, MitchAlsup wrote:
    Stefan Monnier wrote:

      Isn't that as it should be?  Hardware provides some simple, possibly >>>>>   inconvenient interface, and software abstracts the inconvenience
      away; e.g., RISCs provide only a load/store interface, and compilers >>>>>   translate the stuff programmers write into this interface.  The
      difference is that with RISCs the result is a least as fast as
      architectures that tried to cater more to the programming language, >>>>>   while I am convinced that hardware where the designers design for >>>>>   efficient sequential consistency will let programmers write software >>>>>   that handily outperforms using libraries or system calls on hardware >>>>>   designed for weaker memory models.

    While I agree that CPUs should provide sequential consistency

    It's nice that they already do. The fun part is creating algorithms that
    try to avoid it. Not only avoid seq_cst, but avoid a acquire and release
    membars, and the fun acq_rel acquire release all in one... ;^) Now,
    there are certain algorithms that, as-is, cannot avoid it (seq_cst).
    Original SMR and, even Dekkers algorithm require seq_cst...


    (it would
    make life easier for debugging, formalization, and reasoning, and
    I can't think of any reason why it should have a significant cost), I'd >>>> be surprised if it "handily outperforms" the status quo: concurrent
    programming is hard even with sequential consistency, so most normal
    code would still continue using the exact same libraries.
    [...]

    Keep in mind that acquire and release are not as strong as a seq_cst
    barrier. Wrt the SPARC:

    acquire = #LoadStore | #LoadLoad
    release = #LoadStore | #StoreStore

    Not a #StoreLoad in sight!

    SunOS had a funny quirk--it used cacheable locks to guard uncacheable
    stores. SuperSPARC had a slow cache-to-cache protocol and nobody saw
    the timing hole until the Ross Colorado chips could pass data cache to
    cache faster and another CPU could see the lock unlocked before the
    uncacheable store arrived at {what we call} North Bridge {today}; and
    this timing hole could crash the OS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Dec 30 04:08:47 2023
    On Tue, 26 Dec 2023 00:12:22 +0000, MitchAlsup wrote:

    Be especially careful on which trigger you pull when both hands hold
    guns pointing at your feet.

    Be especially careful, yes.

    But not primarily about _which trigger you pull_, since pulling *either* trigger will have the bad result of shooting oneself in the foot. Instead, first point the guns in some other direction before pulling either
    trigger!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Dec 30 04:17:26 2023
    On Fri, 29 Dec 2023 22:06:46 +0000, MitchAlsup wrote:

    SunOS had a funny quirk--it used cacheable locks to guard uncacheable
    stores.

    That _sounds_ bad, at first. Surely it's a terrible security hole or
    something.

    But why are some memory addresses marked as not cacheable? Usually,
    it's because they aren't really RAM, but instead I/O ports or
    something like that.

    The lock, which indicates which process those addresses currently
    belong to, is just data. So it isn't uncacheable for the same
    reason as addresses like that.

    Of course, while in _some cases_ this might make perfect sense, there
    may well be other cases where this really is the horrible mistake it
    seems to be on first sight.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sat Dec 30 16:51:56 2023
    Quadibloc wrote:

    On Fri, 29 Dec 2023 22:06:46 +0000, MitchAlsup wrote:

    SunOS had a funny quirk--it used cacheable locks to guard uncacheable
    stores.

    That _sounds_ bad, at first. Surely it's a terrible security hole or something.

    It was more like the differences between MESI and MOESI. MESI has
    cache to cache transfers go through memory, MOESI has cache to
    cache transfers go cache to cache.

    But why are some memory addresses marked as not cacheable? Usually,
    it's because they aren't really RAM, but instead I/O ports or
    something like that.

    In this particular case, the locations were MMI/O.

    The lock, which indicates which process those addresses currently
    belong to, is just data. So it isn't uncacheable for the same
    reason as addresses like that.

    Of course, while in _some cases_ this might make perfect sense, there
    may well be other cases where this really is the horrible mistake it
    seems to be on first sight.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 31 19:33:06 2023
    Paul A. Clayton wrote:

    On 12/28/23 2:10 PM, EricP wrote:
    EricP wrote:
    Paul A. Clayton wrote:
    [snip extended LL/SC design]
    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, it doesn't actually need a SCH instruction.
    Just document that multiple LL are allowed to the same cache line,
    a normal ST to the same line is held aside if the lock is active,
    and SC ends the transaction and updates all changed line bytes.
    If all LL/ST/SC are not to the same line it cancels the lock.

    On reflection, eliminating the SCH instruction would leave this
    vulnerable to an interrupt occurring in the multiple store sequence.
    The interrupt would cancel the lock, and on return a regular ST would
    not be distinguishable as intended to be part of an atomic sequence,
    and so would erroneously update memory, whereas a SCH would be tossed
    because the lock was no longer set.

    I had not considered this consequence of an interrupt, but this
    issue would be "trivially" handled by distinguishing between
    reservation active and reservation valid (i.e., use two bits with
    one preserved across interrupts). The hardware would then know
    that it is within a ll/sc enclosed code sequence and stores would
    merely be buffered as usual until the sc was encountered. Advanced
    hardware for such a primitive interface, knowing that stores would
    not be preserved could elide them.

    If taken the way I think Paul intends, that means the read and
    write sets are part of thread saved state. It is nuances like
    exceptions and interrupts; why My 66000 abandons the event if
    it has to perform more privileged control transfer.

    For a single cache block reservation, having to run through the
    rest of a known-failed atomic attempt may not be problematic. With
    broader uses, that would be painful; any failure would have to run
    through the rest of the attempt before being able to consider
    retrying. If the state on a failure is architected as boundedly
    undefined, a simple scan for the sc (no execution to produce
    identical final state) would suffice. It might be possible to add
    this imprecise state aspect when allowing multiple accesses so
    hardware could safely just scan after the second non-ll/sc memory
    access.

    A microarchitecture could theoretically cache sc locations similar
    to a branch target buffer, but memoizing such seems difficult and
    a prediction would have to be confirmed (e.g., by scanning for the
    next sc while speculating that the predicted location was
    correct). That seems needlessly complex.

    When one considers all of the effects of {memory order, branch prediction, exceptions, interrupts, machine checks,...} it is better for the complexity management of a project* to abandon an ATOMIC event when "control gets hard".
    I have seen and been in the debates of "how do we get out of this mess"
    often enough not to want to repeat it again.

    (*) and the possibly more important:: "ability of SW to reason about what happened and why";

    It is not terribly difficult to imagine ways of addressing this
    deficiency. One might be able to get away with reinterpreting a
    branch instruction as setting a failure jump point, avoiding
    adding instructions. Such a branch would have to use a nonsensical
    condition like "if not equal to self" (assuming no one has used
    that as a nop before the architecture decided to reserve it).
    Short attempts would not include the branch and just run through
    to find the sc; simple attempts that are somehow known not to have
    only transient failure causes might target the ll instruction
    producing a hard retry loop. One could also consider each register
    tested for non-equality to self as providing additional control
    information.

    Given that opcode space is usually not so tight as to prevent a
    secondary atomic interface, but one would really want to have the
    second be comprehensive to avoid a third, fourth, fifth, etc.
    Extending ll/sc semantics — without adding instructions — might
    have allowed experience/feedback to accumulate for a second try.
    (I do believe that one could substantially extend the semantics
    without having to add instructions. When starting from a clean
    slate is better idea seems usually to be realized late. "Free"
    extensions also discourages trying again rather than trying to
    fix an interface that mostly sort-of works.)

    Given that none of the RISCs have actually extended ll/sc
    semantics, the pressure to provide broader atomics must not have
    been particularly great.

    Most of those machines have only 1 OS to port (Linux) and most of
    the Linux port requires only the ATOMICs that x86 provides.
    Most of the Linux environment is not multi-threaded,
    Most of the multi-threaded Linux applications requires only the
    ATOMICs that x86 provides.

    Thus, the chicken and egg problem.

    LL/SC is sufficient to replicate the ATOMICs x86 provides.

    Remember Sun Microsystems took 5 years of programmer effort in
    reducing the lock contention in SOLARIS--an effort that nobody
    in the Linux community can afford to perform.

    ASF did better in checkpointing the start such that a failure
    jumped to a known location

    yes

    that performed a test for
    success/failure and branched to handling code.

    not necessarily. If you arrived you knew the event had failed.
    ASF provided a mechanism whereby you could refine your understanding
    of the failure and branching based on that.

    ASF also preserved
    the stack pointer; this does not seem a substantial benefit as
    short attempts could probably avoid changing the stack pointer and
    the relative overhead of explicitly saving the stack pointer is
    not that great if the attempt is large.

    The SP thing had to do with PUSH/POP model in x86 and automagic
    control transfer.

    Mitch wants escaping stores to preserve logged information (with
    the obvious potential benefit for debugging and performance
    tuning, possibly even dynamic performance tuning i.e. by the
    software and not the developer). AMD's ASF defined a RELEASE
    instruction which removed a cache block from the tracking set; I
    do not remember seeing this in the documentation I have of ESM.

    RELEASE was after I had left AMD.

    ASF (under my tutelage) and ESM were ATOMIC events not restricted
    Transactional Memory things.

    An ATOMIC event has the property that if the event succeeds all
    of the writes become visible to the system instantly, otherwise
    no interested 3rd party sees any of the memory locations as having
    been modified.

    An instruction like RELEASE violates the definition and, in essence,
    takes the model from ATOMIC towards Transactional by one tiny but
    important step--a step I was not willing to take--mainly because
    I could not come up with a reasonable definition of ATOMIC if it
    had that semantic.

    ESM can be used to help SW implement Transactional Memory, but it
    is not Transactional Memory.

    Removing reservations avoids some capacity constraints, though

    Current ATOMIC things allow for 1 or 2 memory containers. ASF and
    ESM allow for 8 cache lines used any way the programmer desires.
    EricP's version goes for 16 {*p,length} read/write sets.

    I submit that 8 (0r 16) is so much larger than ATOMIC event have
    today that it will take decades for SW to figure out how to use
    that new capability efficiently and effectively.

    ASF defined this as a "hint" so an implementation could still fail
    an atomic event that would have succeeded if the tracking
    adjustment had been done, whether from avoiding a contention or
    exceeding a capacity limit. Removing reservations is also
    dangerous, especially at cache block tracking granularity, as
    software must ensure that other memory accesses either do not
    access that cache block or have the same lifetime.

    Removing reservations violates the "all or nothing" semantic of
    anything that wants to smell ATOMIC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 31 20:23:23 2023
    Paul A. Clayton wrote:

    On 12/26/23 7:24 PM, MitchAlsup wrote:
    [snip]
    If you use a PRE (prefetch) instruction and specify DRAM as the
    location
    to prefetch, you end up only making sure the TLB is loaded.

    PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
    determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

    What if there is a cache after L3?

    Consider L3 as LLC.

    What if one wants to prefetch
    to a cache layer that is directly connected to the memory
    controller (no extra coherence traffic, available fairly in terms
    of performance to all agents [except possible processor-in-
    memory], etc.)?

    L3 in My 66000 Architecture is defined to be a read/write buffer
    used by Memory controller and DRAM controller. L3 is a cache of
    what DRAM either looks like now or should look like eventually.
    I specify this organization as a means to eliminate RowHammer.

    What if the design has L3 slices which are home
    nodes in terms of coherence state but allow allocation to a non-
    home node with non-home allocations having lower associativity? (I

    This is not an allowable L3 configuration in My 66000 Architecture.

    do not know if such would be sensible. Potential full use of all
    L3 capacity by one core has advantages and home nodes must have
    some advantage; the former could be accomplished by using other L3
    slices as victim caches, but those then become L4 caches.) There
    are presumably many other design points that would make the
    4-level designation less than complete.

    2-points
    1) there are a limited number of bits (5) and a limited number of
    cache horizons.
    2) SW has enough trouble prefetching before getting into prefetch
    from DRAM to L3 but no higher into the hierarchy, then later
    prefetch into L2 from L3 if it is still there,.....
    Not enough SW understand how to effectively utilize prefetch before
    one gets into prefetch to where--and the converse case post-push
    to where.

    I am also curious about where a prefetch would be placed with RE
    (or WE/RWE) permission for a cache level that could be either
    instruction or data. RE _might_ make sense for some corner cases
    such as checking if an executable chunk gives a matching CRC code
    immediately before executing. Allocation to both caches might make
    sense for RE.

    RxE is prefetchable directly into L2 UCache (but no higher)
    --E is prefetchable directly into L1 ICache
    Rx- is prefetchable directly into L1 DCache

    WE/WRE might make sense as a hint that the block is going to be
    written and shortly executed (in the split cache level case
    possibly by different cores if that level had a shared Icache
    level).

    For WRE, I could see allocating the data to the data cache of the
    specified level (possibly with a placement bias to favor cheaper
    transfer to the instruction cache,

    That placement bias is L2 BTW.

    or perhaps an icache might send
    some victims to a typically not shared cache level), but it is
    less clear if the PTE should loaded to both data and instruction
    TLBs if "refreshing" the TLB entry to allow execution (assuming it
    was denied for writable pages) was cheaper.

    Fast shared L2 TLB solves this problem.

    There are are bypassing, NUCA, replacement metadata, and perhaps
    other factors. One might want to prefetch into a cache level with
    setting of a "Keep Me" bit to choose another victim the first time
    that block is chosen as a victim (or a probabalistic ignoring of
    the replacement choice, which might then want the probability).

    A cache is either an invisible companion that reduces latency of
    memory accesses, or it is a programmable store unit which requires
    direct access to tag storage. You cannot have it both ways.

    ("Keep Me" could be useful for prefetches that may be excessively
    early but are almost always used "soonish".)

    Keep Me requires access to the LRU or NRU control bits in the tag.

    Similarly, a cache
    might have metadata urging earlier eviction (set 'LRU' for LRU and
    tree pLRU, set unaccessed for bit-vector NRU); this might also
    involve more than one state as possible choices (e.g., NRU+LRU
    pair might set unaccessed and LRU or unaccessed and MRU).

    99.44% of programmers could not use these features. Of the 0.56%
    who can, 80% of them work in OS Kernels.

    Metadata/placement may also be complicated on a cache hit. If a
    prefetch targets L2 but hits L1, should it be treated as a nop?

    The only metadata I ever allows configuration access to in the
    cache was to lock lines out of being available for allocation;
    (bad bit in the line or tag). So, a 4-way set associative cache
    with a bad bit goes to (2^n-1)-lines of 4-was SA and 1 line of
    3-way SA. The dead line does not participate in allocation and
    cannot hit.

    What if L2 is shared by multiple cores and the block is dirty or
    L2 does not include L1s? What if the L1 state was "next victim"?
    Early victimizing _might_ make sense, but changing to "last
    victim" might also make sense?

    If L2 does not filter SNOOPs the L1s are SNOOPed.
    When a line is reallocated
    a) modified data migrates outward
    b) new data replaces old data

    One might also want a prefetch that prepares a hardware prefetch
    engine. Such a software prefecth might want metadata about the
    type of hardware prefetching (unit stride/non-unit stride/address
    is pointer/generic hardware prefetch).

    Your advocation for the use of prefetch is admirable. Truly !

    The problem with prefetch is that SW cannot use it effectively
    and efficiently, and in ways that preserve code over multiple
    generations--one end might need to prefetch only 100 cycles
    in front of use whereas the other end might need to prefetch
    several thousand cycles in front of use.

    A second problem is that prefetch is NECESSARILY specified as
    a HINT because some other piece of code (yours or not) may
    touch the memory hierarchy and the prefetch you so carefully
    performed goes away before you come for the actual data.

    SW must know about latency model it is using prefetch
    to ameliorate !! But which one ??

    Caching of paging structures further increases the possibilities.
    What if one wants to prefetch the translation information for the
    table level normally one above the last (i.e., either a huge page
    PTE or a second level PDE)? (This might apply if software believes
    a future access will be in a region. A hash table or random access
    array might have access behavior that software knows well ahead
    that the data structure will be used soonish but far enough away
    that earlier prefetching translation information provides some
    value over just doing a demand load.)

    What if one wants to prefetch translation information to an L2
    structure or a structure associated with a particular level of
    cache (e.g., prefetch to L2 memory cache TLB to accelerate
    indirection and page-crossing stride prefetches for the hardware
    prefetcher at the L2 memory cache)?

    Another what if for you::

    What if HW just wanted to keep things simple ?? so they can get the
    chip out the door.

    HW has granted prefetch to SW in order for a few carefully chosen use cases--not generally for use by everyone on every application.

    My 66000 also has the contrapositive of Post PUSH; migrating lines toward
    DRAM, disposing of write permission (R--L1) and a few other subtle uses.
    But each capability handed to SW comes with the responsibility to use
    it effectively and efficiently. I see scant evidence of this.

    In that regard:: HW prefetchers have proven easier for all concerned.

    HW prefetchers can easily lock onto multiple simultaneous stride-based
    prefetch strategies, so should SW use prefetch for these or not ??
    HW here knows what machine it is running on and knows how far in
    front to prefetch !! Can SW ?? across multiple implementations ??
    on Big-Little implementations ??

    HW prefetchers cannot easily lock onto multiple simultaneous linked-
    list prefetching since this goes all the way back to the TLBs from
    using virtual addresses--but SW cannot prefetch these since it needs
    the next address to prefetch the next next access.. {chicken and egg}

    So, which this in mind--what kind of access pattern can SW use prefetch
    for that really improves performance ?? {Indexing is already a solved
    HW prefetch problem, pointer chasing will never be SW solvable.}

    Most of the above might be obviously bad ideas for all reasonable implementations, but I suspect that even experts would not know
    all reasonable choices for the lifetime of the interface. Having
    such configuration bits reference a table is well-known method for
    adjusting to future changes (or inadequacy of

    Architecture is as much about knowing what to leave out as what to put
    in !

    My 66000 has PRE and PUSH that solve known problems and might (MIGHT)
    be able to help more general SW prefetching. My suspicion is that
    HW prefetchers will get better than SW use of prefetch instructions.

    Do you want "a bunch of CPUID data" so SW can decide on how far in
    front of use and attempt at being effective and efficient ?? What
    about Triangular calculations where "how far in front" changes
    as the calculation proceeds ??

    One might also have hints that are more directives (i.e., prevent
    or disincentivize hardware from ignoring the hint or adjusting its
    action) and this opens a cooperation space that might be quite
    complex. Software might want to specify what hardware metadata is
    given what weight. Software might have an estimated use distance
    probability map (possibly with a model for probabilities of use by
    all other agents agent before and/or after use), but presenting
    hardware with 4KiB of metadata for a 64-byte prefetch seems not
    quite right.

    I have also run into a couple of lock-like situations where one PRE's
    the container with .lock, and later PUSH's the same container
    (also with .lock) to end the ATOMIC event without ever seeing the
    contained
    value.

    When would one want to guard/reserve a cache block (.lock) without
    accessing such?

    Grabbing a lock (monitoring a line for interference) while doing 1 or
    more accesses to MMI/O registers. ESM performs locking not based on
    the value of data, but on the access and rights of an address across
    the event. The Lock guards something else, and as long as you can
    guarantee ATOMIC behavior, you don't actually need the lock's data.

    Perhaps there is a coarser-grained software lock
    that is checked without being monitored/reserved (or the
    reservation is dropped quickly) such that the atomic event could
    monitor the specific cache block without failing by holding the
    more frequently used software lock block as reserved? PUSHing the
    data outward could make sense if the data is likely to be used
    soon by some other agent, but it is not clear why that should be
    'locked' (unless that is necessary to end the event?)

    I suspect I am missing something, possibly something obvious to
    anyone familiar with shared memory programming.

    Architecture is as much about knowing what to leave out as what to put
    in ! You and Quadriblock are misconsidering the former while overdoing
    the later.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 31 20:40:35 2023
    Paul A. Clayton wrote:

    On 12/25/23 7:12 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]
    While providing compatibility across implementations is very
    valuable, varying capability across implementations does not seem
    horrible to me.

    You still have to be able to make Architectural statements about
    what does what. And if you want any semblance of compatibility
    the lowest end machine has to be able to perform significant
    ATOMIC events.

    ESM sets the minimum and maximum capacity of an atomic operation
    for all implementations. This means that a capacity failure in
    any implementation is guaranteed to be one in all implementations.

    For compatibility !!

    But in all senses:: 8 cache lines is so much more powerful that
    anything programmers have access to today, making it bigger or
    its size vague is not worthy.

    I think part of the question is how acceptable address-based
    (variable internal caching conflicts) and capacity-based failures are.

    If software does not know that allocations are specifically
    aligned, accessing five 32-byte objects might or might not exceed
    the 8-entry capacity of ESM.

    Expecting ATOMIC behavior across cache line boundaries has NEVER been guaranteed. So SW that deals with ATOMIC stuff needs to understand
    this {along with false sharing,...} and take appropriate actions
    at the language level and malloc()/new level.

    If alignment-caused "capacity"
    failures are acceptable, why are address-based associativity
    conflicts so extremely problematic?

    False sharing make thinks look like interference and harms
    performance at the system level by performing too many bus
    transfers.

    (Yes, in general, ensuring
    that participating accesses fit within N cache blocks is easier
    than ensuring that associativity constraints are handled [other
    than by restricting block count to the associativity], but this is
    not an absolute distinction.)

    ESM also guarantees that atomic conflicts will be atomic
    conflicts. A conservative filter would provide failures that are
    not true conflicts. Limited-associativity cache-based tracking
    would introduce evictions caused by the atomic memory accesses.

    ESM still has a few spurious failure modes--especially in the
    "methodological" operating mode.

    I was not considering exceptions, but I guess there are other
    failure modes beyond capacity and contention.

    Consider you are several checkpoints down a path, and have several
    outstanding cache misses and an interrupt transpires. What are you
    (the CPU) allowed to do with the returning cache lines since those
    memory reference instructions did not survive to retirement ??
    a) If you put them in the cache you have a Spectré side channel.
    b) if you do not, you need to migrate them somewhere Ownership does
    not get misunderstood.

    (A cuckoo/elbow cache — skewed associative cache where a victim
    from one index is reinserted at one of its alternative indexes —
    could greatly increase the effective associativity. Copying cache
    blocks may be preferable to failing a transaction. A modest victim
    cache could buffer the copy activity and avoid reinsertion based
    on new information or longer analysis of information.)

    A FA Miss Buffer solves this without complexity.

    But full associativity tends to constrain capacity.

    That is why registers stop at 32 and miss buffers stop at 8.

    This is not a consideration if a particular capacity is sufficient
    for all implementations for the lifetime of the interface — which
    you judge to be the case — but if one wants to allow use of the
    same interface for implementations supporting larger capacity then
    not relying of full associativity may be justified.

    No, I expect that in 2 decades there will be enough experience with
    ESM that capacity extension would be desired. Here code that did
    work on early implementations continues to work, but more codes
    can now be performed with the new capacity limits.

    [snip]
    If you look into the "math" you will see that the exponent e in
    BigO( m^e ) is what bounds (minimum) system time in an ATOMIC
    event. Making bigger ATOMIC events makes e smaller. This rule
    is opposite what you propose above. Although there are some
    events where you rule may take hold I suspect minimizing the
    total number of events needed reduces overhead faster than
    making events as small as possible { Test_and_Set() }.

    If supporting larger events reduces system time, why forever bound
    capacity to 8 64-byte cache blocks?

    BECAUSE I ALREADY HAVE SOMETHING IN THE CORE THAT IS PERFORMING
    THE THINIGS I NEED TO PERFORM, so the number 8 was essentially FREE.

    SECONDLY, 8 IS SO FAR BIGGER THAN ANYONE HAS EXPERIENCE WITH, THAT 8
    IS EFFECTIVELY infinity.

    What to leave out, what to put in.

    I can understand your previous statements about wanting to provide
    something cheap that is usable for more than all known small
    atomic operations. I.e., a universal baseline guarantee makes
    sense even to me and having the baseline by 8 (when I think you
    mentioned the largest atomics only used 6) gives substantial room
    for software growth. However, allowing a relatively large read set
    seems to be a plausible extension. (Extensions can be temporal incompatibilities or also 'spatial' incompatibilities.)

    The bigger you make something architectural, the more you burden the
    lowest members of you implementable architectures.

    And I really do want forward and backward compatibility.

    I appear to value compatibility much less than you. I think it is
    useful to have an interface that can be extended even though such
    breaks compatibility.

    You can always make it bigger without compatibility problems.

    Even 'spatial' incompatibility seems to have some potential
    benefits. A special purpose system might want to use established
    interface elements but make interface changes that fit the purpose.

    Interface versions and subsets/profiles allow communicating
    interface specifics more simply and communicating guarantees from
    vendors and demands from users. Managing diversity has challenges
    but prohibiting diversity (which includes temporal diversity/
    extensions as such introduces multiple interfaces).

    Some of this is just naming/branding (and the Ship of Theseus
    problem). As a lumper, I am biased toward broad major categories
    and consideration of similarities.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Tue Jan 2 08:06:25 2024
    MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 12/26/23 7:24 PM, MitchAlsup wrote:
    [snip]
    If you use a PRE (prefetch) instruction and specify DRAM as the location >>> to prefetch, you end up only making sure the TLB is loaded.

    PRE has a 5-bit field; 3-bits define RWE permissions desired, 2-bits
    determine where the data (line) is prefetched to {L1, L2, L3, DAM}.

    What if there is a cache after L3?

    Consider L3 as LLC.

                                       What if one wants to prefetch
    to a cache layer that is directly connected to the memory
    controller (no extra coherence traffic, available fairly in terms
    of performance to all agents [except possible processor-in-
    memory], etc.)?

    L3 in My 66000 Architecture is defined to be a read/write buffer
    used by Memory controller and DRAM controller. L3 is a cache of
    what DRAM either looks like now or should look like eventually.
    I specify this organization as a means to eliminate RowHammer.

                    What if the design has L3 slices which are home
    nodes in terms of coherence state but allow allocation to a non-
    home node with non-home allocations having lower associativity? (I

    This is not an allowable L3 configuration in My 66000 Architecture.

    do not know if such would be sensible. Potential full use of all
    L3 capacity by one core has advantages and home nodes must have
    some advantage; the former could be accomplished by using other L3
    slices as victim caches, but those then become L4 caches.) There
    are presumably many other design points that would make the
    4-level designation less than complete.

    2-points
    1) there are a limited number of bits (5) and a limited number of
    cache horizons.
    2) SW has enough trouble prefetching before getting into prefetch
    from DRAM to L3 but no higher into the hierarchy, then later
    prefetch into L2 from L3 if it is still there,.....
    Not enough SW understand how to effectively utilize prefetch before
    one gets into prefetch to where--and the converse case post-push
    to where.

    I am also curious about where a prefetch would be placed with RE
    (or WE/RWE) permission for a cache level that could be either
    instruction or data. RE _might_ make sense for some corner cases
    such as checking if an executable chunk gives a matching CRC code
    immediately before executing. Allocation to both caches might make
    sense for RE.

    RxE is prefetchable directly into L2 UCache (but no higher)
    --E is prefetchable directly into L1 ICache
    Rx- is prefetchable directly into L1 DCache

    WE/WRE might make sense as a hint that the block is going to be
    written and shortly executed (in the split cache level case
    possibly by different cores if that level had a shared Icache
    level).

    For WRE, I could see allocating the data to the data cache of the
    specified level (possibly with a placement bias to favor cheaper
    transfer to the instruction cache,

    That placement bias is L2 BTW.

                                       or perhaps an icache might send
    some victims to a typically not shared cache level), but it is
    less clear if the PTE should loaded to both data and instruction
    TLBs if "refreshing" the TLB entry to allow execution (assuming it
    was denied for writable pages) was cheaper.

    Fast shared L2 TLB solves this problem.

    There are are bypassing, NUCA, replacement metadata, and perhaps
    other factors. One might want to prefetch into a cache level with
    setting of a "Keep Me" bit to choose another victim the first time
    that block is chosen as a victim (or a probabalistic ignoring of
    the replacement choice, which might then want the probability).

    A cache is either an invisible companion that reduces latency of
    memory accesses, or it is a programmable store unit which requires
    direct access to tag storage. You cannot have it both ways.

    ("Keep Me" could be useful for prefetches that may be excessively
    early but are almost always used "soonish".)

    Keep Me requires access to the LRU or NRU control bits in the tag.

                                                 Similarly, a cache
    might have metadata urging earlier eviction (set 'LRU' for LRU and
    tree pLRU, set unaccessed for bit-vector NRU); this might also
    involve more than one state as possible choices (e.g., NRU+LRU
    pair might set unaccessed and LRU or unaccessed and MRU).

    99.44% of programmers could not use these features. Of the 0.56%
    who can, 80% of them work in OS Kernels.

    Metadata/placement may also be complicated on a cache hit. If a
    prefetch targets L2 but hits L1, should it be treated as a nop?

    The only metadata I ever allows configuration access to in the cache was
    to lock lines out of being available for allocation;
    (bad bit in the line or tag). So, a 4-way set associative cache
    with a bad bit goes to (2^n-1)-lines of 4-was SA and 1 line of 3-way SA.
    The dead line does not participate in allocation and
    cannot hit.

    What if L2 is shared by multiple cores and the block is dirty or
    L2 does not include L1s? What if the L1 state was "next victim"?
    Early victimizing _might_ make sense, but changing to "last
    victim" might also make sense?

    If L2 does not filter SNOOPs the L1s are SNOOPed.
    When a line is reallocated
    a) modified data migrates outward
    b) new data replaces old data

    One might also want a prefetch that prepares a hardware prefetch
    engine. Such a software prefecth might want metadata about the
    type of hardware prefetching (unit stride/non-unit stride/address
    is pointer/generic hardware prefetch).

    Your advocation for the use of prefetch is admirable. Truly !

    The problem with prefetch is that SW cannot use it effectively
    and efficiently, and in ways that preserve code over multiple generations--one end might need to prefetch only 100 cycles
    in front of use whereas the other end might need to prefetch
    several thousand cycles in front of use.

    A second problem is that prefetch is NECESSARILY specified as
    a HINT because some other piece of code (yours or not) may
    touch the memory hierarchy and the prefetch you so carefully
    performed goes away before you come for the actual data.

    I am one of those programmers who read about PREFETCH being added to x86
    and immediately thought "Yes, now I can make my code faster!", then
    almost immediately (after getting one of the new CPUs) hit the tiny
    little snag of "Oh, it does not seem to work!?!".

    I.e. I have written code where actually loading a byte (forcing a fetch,
    not just a hint) from every cache line in a 4KB block will speed it up,
    but almost never seen any gains from adding a PREFETCH (n) cache lines
    in front of the current operating address.


    SW must know about latency model it is using prefetch
    to ameliorate !! But which one ??

    It is already pretty complicated without that added issue.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Paul A. Clayton on Tue Jan 2 12:26:32 2024
    Paul A. Clayton wrote:
    On 12/28/23 2:10 PM, EricP wrote:
    EricP wrote:
    Paul A. Clayton wrote:
    [snip extended LL/SC design]
    One could just use regular loads and stores between the LL and SC.
    The hardware knows that it is within an atomic operation.

    Yes, it doesn't actually need a SCH instruction.
    Just document that multiple LL are allowed to the same cache line,
    a normal ST to the same line is held aside if the lock is active,
    and SC ends the transaction and updates all changed line bytes.
    If all LL/ST/SC are not to the same line it cancels the lock.

    On reflection, eliminating the SCH instruction would leave this
    vulnerable to an interrupt occurring in the multiple store sequence.
    The interrupt would cancel the lock, and on return a regular ST would
    not be distinguishable as intended to be part of an atomic sequence,
    and so would erroneously update memory, whereas a SCH would be tossed
    because the lock was no longer set.

    I had not considered this consequence of an interrupt, but this
    issue would be "trivially" handled by distinguishing between
    reservation active and reservation valid (i.e., use two bits with
    one preserved across interrupts). The hardware would then know
    that it is within a ll/sc enclosed code sequence and stores would
    merely be buffered as usual until the sc was encountered. Advanced
    hardware for such a primitive interface, knowing that stores would
    not be preserved could elide them.

    I still don't think that would quite work.
    The lock is canceled if (a) there is an interrupt or
    (b) if you read or write an address that is not in the same line

    Presumably the interrupt handler would do a ST which should work
    even if it happens to be to the same address as the lock.
    Then the handler returns to the prior sequence and a ST to the
    locked address must now be skipped until we see an SC,
    which is also skipped, or a LD or ST to a different line.

    To make that work those two bits would have to be part of the interrupt save/restore context. An SCH instruction is easier and unambiguous.

    For a single cache block reservation, having to run through the
    rest of a known-failed atomic attempt may not be problematic. With
    broader uses, that would be painful; any failure would have to run
    through the rest of the attempt before being able to consider
    retrying.

    The instructions inside the LL/SCH/SCR sequence are either reg-reg or
    D$L1 cache hits to the locked line. The cost of continuing the occasional failed lock sequence after an interrupt would be minimal.

    (Yes you could put a fib() call inside your lock sequence
    but its only your application that gets shot in the foot.)

    If the state on a failure is architected as boundedly
    undefined, a simple scan for the sc (no execution to produce
    identical final state) would suffice. It might be possible to add
    this imprecise state aspect when allowing multiple accesses so
    hardware could safely just scan after the second non-ll/sc memory
    access.

    Didn't this start out as a cheap and easy retrofit?
    I'm sure I heard that someplace.

    A microarchitecture could theoretically cache sc locations similar to a branch target buffer, but memoizing such seems difficult and a
    prediction would have to be confirmed (e.g., by scanning for the next sc while speculating that the predicted location was correct). That seems needlessly complex.

    It is not terribly difficult to imagine ways of addressing this
    deficiency. One might be able to get away with reinterpreting a
    branch instruction as setting a failure jump point, avoiding
    adding instructions. Such a branch would have to use a nonsensical
    condition like "if not equal to self" (assuming no one has used
    that as a nop before the architecture decided to reserve it).
    Short attempts would not include the branch and just run through
    to find the sc; simple attempts that are somehow known not to have
    only transient failure causes might target the ll instruction
    producing a hard retry loop. One could also consider each register
    tested for non-equality to self as providing additional control
    information.

    Given that opcode space is usually not so tight as to prevent a
    secondary atomic interface, but one would really want to have the second
    be comprehensive to avoid a third, fourth, fifth, etc. Extending ll/sc semantics — without adding instructions — might have allowed experience/feedback to accumulate for a second try.
    (I do believe that one could substantially extend the semantics
    without having to add instructions. When starting from a clean
    slate is better idea seems usually to be realized late. "Free"
    extensions also discourages trying again rather than trying to
    fix an interface that mostly sort-of works.)

    Given that none of the RISCs have actually extended ll/sc
    semantics, the pressure to provide broader atomics must not have
    been particularly great.

    Yes, it could be "if you build it they will come" or
    "if you build it they will stare at it with curiosity
    then go back to doing things the way they have always done".

    ASF did better in checkpointing the start such that a failure
    jumped to a known location that performed a test for
    success/failure and branched to handling code. ASF also preserved
    the stack pointer; this does not seem a substantial benefit as
    short attempts could probably avoid changing the stack pointer and
    the relative overhead of explicitly saving the stack pointer is
    not that great if the attempt is large.

    On abort both RTM and proposed ASF had "exception semantics" that
    restored the registers to their state at the start of the transaction
    as well as toss all memory stores.

    Mitch wants escaping stores to preserve logged information (with
    the obvious potential benefit for debugging and performance
    tuning, possibly even dynamic performance tuning i.e. by the
    software and not the developer). AMD's ASF defined a RELEASE
    instruction which removed a cache block from the tracking set; I
    do not remember seeing this in the documentation I have of ESM.

    I also have Release on a guard range in my ATX proposal.
    Also one can upgrade or downgrade read and write guard ranges.

    I want a transaction to be able to safely pick something up,
    look at it, decide its not relevant, put it down and continue.

    Removing reservations avoids some capacity constraints, though
    ASF defined this as a "hint" so an implementation could still fail
    an atomic event that would have succeeded if the tracking
    adjustment had been done, whether from avoiding a contention or
    exceeding a capacity limit. Removing reservations is also
    dangerous, especially at cache block tracking granularity, as
    software must ensure that other memory accesses either do not
    access that cache block or have the same lifetime.

    I intend Release for avoiding unnecessary collisions and aborts.
    There is no danger in it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Tue Jan 2 13:12:53 2024
    MitchAlsup wrote:
    Paul A. Clayton wrote:
    On 12/28/23 2:10 PM, EricP wrote:

    Mitch wants escaping stores to preserve logged information (with
    the obvious potential benefit for debugging and performance
    tuning, possibly even dynamic performance tuning i.e. by the
    software and not the developer). AMD's ASF defined a RELEASE
    instruction which removed a cache block from the tracking set; I
    do not remember seeing this in the documentation I have of ESM.

    RELEASE was after I had left AMD.

    ASF (under my tutelage) and ESM were ATOMIC events not restricted Transactional Memory things.
    An ATOMIC event has the property that if the event succeeds all
    of the writes become visible to the system instantly, otherwise
    no interested 3rd party sees any of the memory locations as having
    been modified.

    I don't follow your distinction between transaction and multi-line atomic.
    Both must monitor variables in multiple cache lines as read-only and
    others as read-write with all writes made globally visible at once.

    An instruction like RELEASE violates the definition and, in essence,
    takes the model from ATOMIC towards Transactional by one tiny but
    important step--a step I was not willing to take--mainly because I could
    not come up with a reasonable definition of ATOMIC if it had that semantic.

    I don't follow your reasons here either.
    I see RELEASE as being used so an atomic sequence can add an object
    to its member sets with read-share or write-exclusive protection,
    examine it, decide its not relevant, and release it so others can
    update it without contention with you.

    ESM can be used to help SW implement Transactional Memory, but it
    is not Transactional Memory.

    Removing reservations avoids some capacity constraints, though

    Current ATOMIC things allow for 1 or 2 memory containers. ASF and
    ESM allow for 8 cache lines used any way the programmer desires.
    EricP's version goes for 16 {*p,length} read/write sets.

    I submit that 8 (0r 16) is so much larger than ATOMIC event have
    today that it will take decades for SW to figure out how to use
    that new capability efficiently and effectively.

    ASF defined this as a "hint" so an implementation could still fail
    an atomic event that would have succeeded if the tracking
    adjustment had been done, whether from avoiding a contention or
    exceeding a capacity limit. Removing reservations is also
    dangerous, especially at cache block tracking granularity, as
    software must ensure that other memory accesses either do not
    access that cache block or have the same lifetime.

    Removing reservations violates the "all or nothing" semantic of
    anything that wants to smell ATOMIC.

    I think I understand. You see "atomic" in a purist sense, as one thing
    done one way, whereas "transaction" is more dynamic and expansive.

    This also depends on how RELEASE works - whether it can toss pending
    stores out of a transaction or only releases read-only monitors.
    I hadn't decided because it didn't seem that much of an issue
    and just a question of where some IF statements are located.

    For hardware implementation its just a matter of whether one allows applications to reset the "changed" bits on pending byte buffers.
    I currently see no technical reason to disallow it.
    So I would allow RELEASE to toss pending stores after they were done.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Tue Jan 2 19:23:32 2024
    EricP wrote:

    Yes, it could be "if you build it they will come" or

    The following is the important clause::

    "if you build it they will stare at it with curiosity
    then go back to doing things the way they have always done".


    This pretty much sums up the advancement of inter-thread synchronization
    over the last 5 decades.

    First there was T&S, then later T&T&S, then CAS, then DCAS, then LL/SC,
    perhaps others along the way::

    And still synchronization is hard to program, hard to reason, and hard
    to get right.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Tue Jan 2 19:42:22 2024
    EricP wrote:

    MitchAlsup wrote:
    Paul A. Clayton wrote:
    On 12/28/23 2:10 PM, EricP wrote:

    Mitch wants escaping stores to preserve logged information (with
    the obvious potential benefit for debugging and performance
    tuning, possibly even dynamic performance tuning i.e. by the
    software and not the developer). AMD's ASF defined a RELEASE
    instruction which removed a cache block from the tracking set; I
    do not remember seeing this in the documentation I have of ESM.

    RELEASE was after I had left AMD.

    ASF (under my tutelage) and ESM were ATOMIC events not restricted
    Transactional Memory things.
    An ATOMIC event has the property that if the event succeeds all
    of the writes become visible to the system instantly, otherwise
    no interested 3rd party sees any of the memory locations as having
    been modified.

    I don't follow your distinction between transaction and multi-line atomic. Both must monitor variables in multiple cache lines as read-only and
    others as read-write with all writes made globally visible at once.

    Transactional memory can* support nested transactions.
    ESM cannot.

    TM can survive an interrupt (possibly an exception).
    ESM cannot.

    (*) in the TM literature I have read.

    An instruction like RELEASE violates the definition and, in essence,
    takes the model from ATOMIC towards Transactional by one tiny but
    important step--a step I was not willing to take--mainly because I could
    not come up with a reasonable definition of ATOMIC if it had that semantic.

    I don't follow your reasons here either.
    I see RELEASE as being used so an atomic sequence can add an object
    to its member sets with read-share or write-exclusive protection,
    examine it, decide its not relevant, and release it so others can
    update it without contention with you.

    Let's say you have ESM with RELEASE, and you are processing an ATOMIC
    event where you write to one or more participating lines, then RELEASE
    one of them, write a few more and complete the event.

    This allows interested 3rd parties to see the RELEASED cache line before
    they see the rest of the modified cache lines. ATOMIC is defined where
    all interested 3rd parties either see all of memory in the before state
    or all of the modifications of the after state, and no intermediate
    state.

    You can assert that because the programmer used RELEASE, that the code
    is tolerant of some of the data being seen before the rest of the data
    is seen. But this violates the definition of ATOMIC. And I never found
    a definition of ATOMIC that would allow a RELEASE. {{I do know how to
    do it in HW--I just cant make any definition of ATOMIC work in that µArchitecture. It is indeed a Definitional problem.}}

    Perhaps you could come up with one.

    ESM can be used to help SW implement Transactional Memory, but it
    is not Transactional Memory.

    Removing reservations avoids some capacity constraints, though

    Current ATOMIC things allow for 1 or 2 memory containers. ASF and
    ESM allow for 8 cache lines used any way the programmer desires.
    EricP's version goes for 16 {*p,length} read/write sets.

    I submit that 8 (0r 16) is so much larger than ATOMIC event have
    today that it will take decades for SW to figure out how to use
    that new capability efficiently and effectively.

    ASF defined this as a "hint" so an implementation could still fail
    an atomic event that would have succeeded if the tracking
    adjustment had been done, whether from avoiding a contention or
    exceeding a capacity limit. Removing reservations is also
    dangerous, especially at cache block tracking granularity, as
    software must ensure that other memory accesses either do not
    access that cache block or have the same lifetime.

    Removing reservations violates the "all or nothing" semantic of
    anything that wants to smell ATOMIC.

    I think I understand. You see "atomic" in a purist sense, as one thing
    done one way, whereas "transaction" is more dynamic and expansive.

    This also depends on how RELEASE works - whether it can toss pending
    stores out of a transaction or only releases read-only monitors.
    I hadn't decided because it didn't seem that much of an issue
    and just a question of where some IF statements are located.

    You were using RELEASE so that you could add more participating cache
    lines (after RELEASE); so if you have the ability to add more after
    RELEASE you had to PUSH the modified data somewhere it will be visible
    to make room for the new participants.

    For hardware implementation its just a matter of whether one allows applications to reset the "changed" bits on pending byte buffers.
    I currently see no technical reason to disallow it.
    So I would allow RELEASE to toss pending stores after they were done.

    Yes, it is entirely buffer management--that supports the architecture's definition of what ATOMIC means.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Tue Jan 2 19:49:00 2024
    On Tue, 02 Jan 2024 19:23:32 +0000, MitchAlsup wrote:
    EricP wrote:

    Yes, it could be "if you build it they will come" or

    The following is the important clause::

    "if you build it they will stare at it with curiosity then go back to
    doing things the way they have always done".

    Regrettably, I fear this is the likeliest fate of the Mill. But given
    that they're going to actually build one, and thus be in a position to *demonstrate* its superior performance, perhaps not.

    With your MY 66000, surely it's not so strange as to scare people away.

    But, yes, interest in anything new and different is very limited. People
    want a processor that will run their Windows programs - x86-64 - or their Android apps - ARM.

    RISC-V, even if deeply flawed, has enough mindshare that it could someday
    get somewhere.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 20:58:41 2024
    Paul A. Clayton wrote:

    On 12/31/23 2:33 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]
    Mitch wants escaping stores to preserve logged information (with
    the obvious potential benefit for debugging and performance
    tuning, possibly even dynamic performance tuning i.e. by the
    software and not the developer). AMD's ASF defined a RELEASE
    instruction which removed a cache block from the tracking set; I
    do not remember seeing this in the documentation I have of ESM.

    RELEASE was after I had left AMD.

    ASF (under my tutelage) and ESM were ATOMIC events not restricted
    Transactional Memory things.

    An ATOMIC event has the property that if the event succeeds all
    of the writes become visible to the system instantly, otherwise
    no interested 3rd party sees any of the memory locations as having
    been modified.

    I thought you had written that you wanted escaping stores to
    facilitate debugging. I guess I misremembered.

    Non-participating STs are performed in <virtual> program order and
    can escape prior to the event completing. Participating STs do not.

    An instruction like RELEASE violates the definition and, in
    essence, takes the model from ATOMIC towards Transactional by one
    tiny but important step--a step I was not willing to take--mainly
    because I could not come up with a reasonable definition of ATOMIC
    if it had that semantic.

    If the data is thread local, as far as other threads are concerned
    that data does not necessarily even exist.

    Unless a pointer has to TLS has been passed through global memory;
    and then all bets are off.

    Breaking the transactional nature is also not _necessarily_ a bad
    choice. I do not know of any, but that is not proof of non-
    existence.

    ESM can be used to help SW implement Transactional Memory, but it
    is not Transactional Memory.

    I agree with you that hardware cannot really provide generic
    transactional memory (composable, "unbounded" capacity, etc.).

    Nesting is the nastiest one.

    My disagreement comes from perceiving hardware as being better
    equipped to monitor conflicts. Larger read sets seem especially
    inexpensive for hardware (especially if one can accept the
    limitations of a conservative filter). I do see such expansions
    of capacity as less beneficial. (I agree with 'get simple atomics
    utilized first and well-understood before guaranteeing useless
    features that will have to be supported for twenty years'.)

    Removing reservations avoids some capacity constraints, though

    Current ATOMIC things allow for 1 or 2 memory containers. ASF and
    ESM allow for 8 cache lines used any way the programmer desires.
    EricP's version goes for 16 {*p,length} read/write sets.

    I submit that 8 (0r 16) is so much larger than ATOMIC event have
    today that it will take decades for SW to figure out how to use
    that new capability efficiently and effectively.

    Probably mostly true. More expansive uses are also likely to be
    less common (both application count and use per run); that is just
    the general nature of things.

    Even so, I think providing cheap expansion can facilitate
    experimentation. Expanding the read set seems really cheap.
    However, I do not know how one can present a feature as
    "experimental" such that one is not committing to it
    architecturally. _Theoretically_ the hardware transactional memory
    uses are suppose to have a fall back path, so they are hint-like,
    but people tend not to read the fine print.

    Probably related to how fast TM has taken off and become the standard bearer--Oh Wait.....

    ASF defined this as a "hint" so an implementation could still fail
    an atomic event that would have succeeded if the tracking
    adjustment had been done, whether from avoiding a contention or
    exceeding a capacity limit. Removing reservations is also
    dangerous, especially at cache block tracking granularity, as
    software must ensure that other memory accesses either do not
    access that cache block or have the same lifetime.

    Removing reservations violates the "all or nothing" semantic of
    anything that wants to smell ATOMIC.

    Violating atomicity may not violate simple orderability. Sometimes
    software might be willing to sacrifice true atomicity for lower
    chance of "atomic" event failure. This risks the ease of
    understanding the interface and one probably does not want to
    supply software developers with atomic foot guns.

    This ties in with similar discussion with EricP. It may be true
    that programmers can reasons about releasing lines early that does
    not violate some definition of ATOMIC. I spent enough time (SAF and
    ESM) thinking about these things to come to the conclusion one needs
    Leslie Lamport level acumen to get ATOMIC features correct the first
    time.

    (I think the use case for RELEASE was linked lists. One might have
    to traverse more than a few nodes but one might not care if
    another thread inserted/deleted a node well behind or well ahead
    of where this thread is. There are probably other possible uses.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 20:49:46 2024
    Paul A. Clayton wrote:

    On 12/31/23 2:33 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]
    I had not considered this consequence of an interrupt, but this
    issue would be "trivially" handled by distinguishing between
    reservation active and reservation valid (i.e., use two bits with
    one preserved across interrupts). The hardware would then know
    that it is within a ll/sc enclosed code sequence and stores would
    merely be buffered as usual until the sc was encountered. Advanced
    hardware for such a primitive interface, knowing that stores would
    not be preserved could elide them.

    If taken the way I think Paul intends, that means the read and
    write sets are part of thread saved state. It is nuances like
    exceptions and interrupts; why My 66000 abandons the event if
    it has to perform more privileged control transfer.

    No. The fact that the thread is within an atomic event is part of
    thread state (I assume ll/sc implementations make atomic even
    success part of hardware thread/core state). Knowing that the
    thread is in failed atomic event state, stores can be guarded such
    that they do not manifest.

    The transfer of control to the ATOMIC control point prior to performing
    the exception or interrupt control transfer means ATOMIC adds nothing
    to thread state !!! You do not need to remember the participating
    cache liens (or their state), nor the fact you are in an event, ...

    The concern expressed was that ordinary store instructions would
    leak when resuming from an interrupt.

    Can you illustrate this as a scenario ??


    For a single cache block reservation, having to run through the
    rest of a known-failed atomic attempt may not be problematic. With
    broader uses, that would be painful; any failure would have to run
    through the rest of the attempt before being able to consider
    retrying. If the state on a failure is architected as boundedly
    undefined, a simple scan for the sc (no execution to produce
    identical final state) would suffice. It might be possible to add
    this imprecise state aspect when allowing multiple accesses so
    hardware could safely just scan after the second non-ll/sc memory
    access.

    A microarchitecture could theoretically cache sc locations
    similar to a branch target buffer, but memoizing such seems
    difficult and a prediction would have to be confirmed (e.g., by
    scanning for the next sc while speculating that the predicted
    location was correct). That seems needlessly complex.

    When one considers all of the effects of {memory order, branch
    prediction, exceptions, interrupts, machine checks,...} it is
    better for the complexity management of a project* to abandon an
    ATOMIC event when "control gets hard". I have seen and been in
    the debates of "how do we get out of this mess" often enough not
    to want to repeat it again.

    Do you still have the T-shirt?☺

    Only the scars...

    Similar considerations apply to cache coherence (or really any
    complex functionality). Getting it right in a basic design seems
    to be hard enough.

    Sound definitions are the first requirement.
    a) What does it mean to be cache coherent ??
    b) are we interested in 3rd parties or just 2nd parties ??
    c) How is DRAM involved ??
    d) How is the interconnect involved ??

    I presented workarounds even when i thought them "needlessly
    complex".

    (*) and the possibly more important:: "ability of SW to reason
    about what happened and why";

    This is another factor. Getting the hardware right is not enough
    if getting the software is likely to get it wrong. Comprehensible
    interfaces, error/bug detection, and debugging facilities are
    obviously important.

    Since I have no experience in hardware or software development or collaboration (but some knowledge), I have only limited ability to
    evaluate proposals on such considerations. I am aware of some of
    the dangers and some of the mitigations, but I would object to
    being made responsible for any significant project. (I do like to
    think I could contribute as something like an intern, but the
    difference between intern and junior (much less head) computer
    architect is perhaps somewhat more than minuscule.☺)

    I, on the other hand, think you are coming alone nicely.

    [snip legacy sandbagging atomic development]
    ASF did better in checkpointing the start such that a failure
    jumped to a known location

    yes

                               that performed a test for
    success/failure and branched to handling code.

    not necessarily. If you arrived you knew the event had failed.
    ASF provided a mechanism whereby you could refine your understanding
    of the failure and branching based on that.

    I thought the branch instruction was right after the transaction
    beginning instruction and was not necessarily skipped but just
    never triggered because EAX was set to zero and the flags updated.

    You are too steeped in x86. pick you head out of the hole in the ground
    and look at other ways of doing similar stuff.

                                                   ASF also preserved
    the stack pointer; this does not seem a substantial benefit as
    short attempts could probably avoid changing the stack pointer and
    the relative overhead of explicitly saving the stack pointer is
    not that great if the attempt is large.

    The SP thing had to do with PUSH/POP model in x86 and automagic
    control transfer.

    Okay. I thought such might have been the case.

    Mitch wants escaping stores to preserve logged information (with
    the obvious potential benefit for debugging and performance
    tuning, possibly even dynamic performance tuning i.e. by the
    software and not the developer). AMD's ASF defined a RELEASE
    instruction which removed a cache block from the tracking set; I
    do not remember seeing this in the documentation I have of ESM.

    RELEASE was after I had left AMD.

    ASF (under my tutelage) and ESM were ATOMIC events not restricted
    Transactional Memory things.

    An ATOMIC event has the property that if the event succeeds all
    of the writes become visible to the system instantly, otherwise
    no interested 3rd party sees any of the memory locations as having
    been modified.

    I thought you had written that you wanted escaping stores to
    facilitate debugging. I guess I misremembered.

    An instruction like RELEASE violates the definition and, in
    essence, takes the model from ATOMIC towards Transactional by one
    tiny but important step--a step I was not willing to take--mainly
    because I could not come up with a reasonable definition of ATOMIC
    if it had that semantic.

    If the data is thread local, as far as other threads are concerned
    that data does not necessarily even exist.

    Breaking the transactional nature is also not _necessarily_ a bad
    choice. I do not know of any, but that is not proof of non-
    existence.

    ESM can be used to help SW implement Transactional Memory, but it
    is not Transactional Memory.

    I agree with you that hardware cannot really provide generic
    transactional memory (composable, "unbounded" capacity, etc.).

    My disagreement comes from perceiving hardware as being better
    equipped to monitor conflicts. Larger read sets seem especially
    inexpensive for hardware (especially if one can accept the
    limitations of a conservative filter). I do see such expansions
    of capacity as less beneficial. (I agree with 'get simple atomics
    utilized first and well-understood before guaranteeing useless
    features that will have to be supported for twenty years'.)

    Removing reservations avoids some capacity constraints, though

    Current ATOMIC things allow for 1 or 2 memory containers. ASF and
    ESM allow for 8 cache lines used any way the programmer desires.
    EricP's version goes for 16 {*p,length} read/write sets.

    I submit that 8 (0r 16) is so much larger than ATOMIC event have
    today that it will take decades for SW to figure out how to use
    that new capability efficiently and effectively.

    Probably mostly true. More expansive uses are also likely to be
    less common (both application count and use per run); that is just
    the general nature of things.

    Even so, I think providing cheap expansion can facilitate
    experimentation. Expanding the read set seems really cheap.
    However, I do not know how one can present a feature as
    "experimental" such that one is not committing to it
    architecturally. _Theoretically_ the hardware transactional memory
    uses are suppose to have a fall back path, so they are hint-like,
    but people tend not to read the fine print.

    ASF defined this as a "hint" so an implementation could still fail
    an atomic event that would have succeeded if the tracking
    adjustment had been done, whether from avoiding a contention or
    exceeding a capacity limit. Removing reservations is also
    dangerous, especially at cache block tracking granularity, as
    software must ensure that other memory accesses either do not
    access that cache block or have the same lifetime.

    Removing reservations violates the "all or nothing" semantic of
    anything that wants to smell ATOMIC.

    Violating atomicity may not violate simple orderability. Sometimes
    software might be willing to sacrifice true atomicity for lower
    chance of "atomic" event failure. This risks the ease of
    understanding the interface and one probably does not want to
    supply software developers with atomic foot guns.

    (I think the use case for RELEASE was linked lists. One might have
    to traverse more than a few nodes but one might not care if
    another thread inserted/deleted a node well behind or well ahead
    of where this thread is. There are probably other possible uses.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 21:22:14 2024
    Paul A. Clayton wrote:

    On 12/31/23 3:40 PM, MitchAlsup wrote:> Paul A. Clayton wrote:

    On 12/25/23 7:12 PM, MitchAlsup wrote:
    [snip]
    Expecting ATOMIC behavior across cache line boundaries has
    NEVER been
    guaranteed. So SW that deals with ATOMIC stuff needs to understand
    this {along with false sharing,...} and take appropriate actions
    at the language level and malloc()/new level.

    This may be a further variance from transactional memory-like
    interface. If I understand your statement, software is expected to
    take special alignment precautions for objects used within ESM.

    It is not ESM::software must already take these special considerations
    on ANYTHING they want to actually be ATOMIC.
    a) avoid false sharing--to prevent losing your cache line lock
    b) non-spanning data--so HW can perform the access atomically

    (This would not affect any atomic operation involving 4 or fewer
    64-byte chunks — 8 aligned chunks handles all possible alignments.
    I seem to recall only 6 were used in the _largest_ known case, so
    I would guess most atomic operations would not be concerned about
    alignment. Even the operations with more access regions are not
    likely to encounter an alignment issue. I think generic allocators
    tend to assume 16-byte alignment on 64-bit platforms, which should
    reduce the likelihood of cache block crossing. Many hot object
    members probably tend to be in the first bytes, further reducing the probability. Members used with temporal locality are also more likely
    to have spatial locality even at the short temporal locality of an
    atomic operation and sub-cache-block spatial locality.)

    Of course, high-effort software will naturally make such
    optimizations. Until compilers are trained to use ESM for merely lock-guarded operations or code sections are labeled atomic, those
    using larger atomic operations will probably be almost exclusively high-effort software.

    Paul A. Clayton wrote:
    If alignment-caused "capacity"
    failures are acceptable, why are address-based associativity
    conflicts so extremely problematic?

    False sharing make thinks look like interference and harms
    performance at the system level by performing too many bus
    transfers.

    Yes. However, aligning every object by 64-bytes (or assuring that
    the interesting bits are in a 64-byte aligned chunk) seems
    impractical if not nearly as bad as the early Alpha suggestion that
    locks not share a page as ll/sc granularity was only
    architecturally guaranteed at that coarseness (if I remember
    correctly).

    [snip]
    I was not considering exceptions, but I guess there are other
    failure modes beyond capacity and contention.

    Consider you are several checkpoints down a path, and have several outstanding cache misses and an interrupt transpires. What are you
    (the CPU) allowed to do with the returning cache lines since those
    memory reference instructions did not survive to retirement ??
    a) If you put them in the cache you have a Spectré side channel.

    Which *I* do not consider so horrible.

    You would if you were Bank of America.
    You would if you were the CIA.
    ..

    Since avoiding such
    speculation side channels is not very expensive (e.g., providing
    buffering of speculative information until it is non-speculative)
    and being free of those side channels has value, I do not consider
    avoiding the side channels an unworthy effort.

    You would if you were the CIA.

    Producing/infecting a game, productivity tool, or other desirable
    software with malware seems easier, faster, and more powerful
    (though possibly less traceable). With so many websites requiring
    Javascript and Javascript sandboxing not being perfect, I doubt
    one needs Spectré to get information. Plugging a hardware security
    hole (which is very difficult to work around) does mean that those
    wishing to sacrifice some software and web browsing options can be
    more secure.

    I have JS turned off on at least 300 web sights. This gets rid of
    90% of the adds AddBlocker does not already get rid of and also
    prevents popups.

    b) if you do not, you need to migrate them somewhere Ownership does
    not get misunderstood.

    Unless the context switch is very brief, an atomic attempt is likely
    toast anyway (and retrying is probably not as big a deal if the
    thread has already had to wait to resume). I was not suggesting that interrupts not be fatal.

    Correct, which is why I fail the event in the control transfer and
    forget all the dynamic state. The probability of the concurrent data
    structure being in the exact state you left it in when control returns
    is negligible.

    There may be cases where interrupts not failing a transaction is
    justified.

    But then you have so much more state you have to save and restore.
    Remember one of the goals of My 66000 is low context switch time.

    If an ESM operation completed its set-up and only used
    4 miss buffer entries, keeping them as unsaved/unfreed state
    (similar to FP/SIMD register files have sometimes been lazily saved
    by software) might allow a quick interrupt handler to run to
    completion and resume the earlier thread before the atomic fails.

    State save is 5 cache lines (1 header and 4 register file),
    You are adding up to 8 more cache lines and other state (another
    cache line) which is doubling the context switch overhead.

    Given the "it is better to" consider the CDS as stale upon return,
    saving the ATOMIC intermediate state is unwarranted. Given you
    don't want to return to the middle of an event, control transfer
    to the ATOMIC control point IS warranted.

    (I suspect one would want to fail the atomic rather than use NAKs
    if there is contention at least after one NAK was sent — unless
    there was somehow a way to know that the interrupt handler is just
    about done.)

    NAKs are not done willy nilly. NAKs are only used when the core has
    all the state* needed to complete the event already present, and uses
    NAKs to delay contenders until it can complete the event.

    (*) all the cache lines are present and writable in the Miss buffer.
    Core cannot be waiting for a line or write permission and respond with
    NAK. {This ended up being a live-lock issue along the way.}

    Side question: for your expected implementations, would a fully set-
    up transaction send ordinary NAKs for a cache block that was a miss
    to main memory? The cache coherence network can be slow but main
    memory is still slower.

    See above.

    I wonder if this would be a case where limited multithreading could
    be useful. A quick/small transaction could complete _while_ the
    interrupt handling is done. (I do realize that sharing resources
    enables side channels.)

    [snip]
    No, I expect that in 2 decades there will be enough experience
    with ESM that capacity extension would be desired. Here code
    that did
    work on early implementations continues to work, but more codes
    can now be performed with the new capacity limits.

    I expect that ESM will not be implemented on enough systems to
    provide that experience in just 2 decades. ASF — which was
    actually presented as a candidate extension by a significant
    processor vendor — was never implemented. Intel's TSX has had
    correctness issues and security issues and did not seem to work
    that well even for ESM-scale transactions. Power and z/Arch
    transactional memory implementations are not going to get broad
    use. ARM seems to be seeing the issues and waiting before
    providing a hardware transactional memory implementation.

    So, how is SW going to use 10,000 cores ??

    The larger potential capacity implementations do have the
    disadvantage of not encouraging experimentation with small-scale
    atomics, especially with their ability to nest (which is a feature
    even I would probably exclude from early implementations).

    I do not see any large corporation deciding to back My 66000 the
    way Amazon, Google, and Apple have made use of ARM computers
    available to millions (especially since My 66000 is a better ISA
    and not tied to another corporation). This may be largely my
    depression talking. Right now I do not think I want to be alive in
    twenty years with the direction my health, employment, and other
    personal aspects have been going [getting old etc.] and the
    direction of the world and particularly of U.S. politics.

    As far as RISC-V is concerned, I can't see any *.gov body accepting
    it for anything--too much Chinese money involved.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to [email protected] on Tue Jan 2 18:05:51 2024
    On Tue, 2 Jan 2024 08:06:25 +0100, Terje Mathisen
    <[email protected]> wrote:

    MitchAlsup wrote:
    Paul A. Clayton wrote:

    One might also want a prefetch that prepares a hardware prefetch
    engine. Such a software prefecth might want metadata about the
    type of hardware prefetching (unit stride/non-unit stride/address
    is pointer/generic hardware prefetch).

    Your advocation for the use of prefetch is admirable. Truly !

    The problem with prefetch is that SW cannot use it effectively
    and efficiently, and in ways that preserve code over multiple
    generations--one end might need to prefetch only 100 cycles
    in front of use whereas the other end might need to prefetch
    several thousand cycles in front of use.

    A second problem is that prefetch is NECESSARILY specified as
    a HINT because some other piece of code (yours or not) may
    touch the memory hierarchy and the prefetch you so carefully
    performed goes away before you come for the actual data.

    I am one of those programmers who read about PREFETCH being added to x86
    and immediately thought "Yes, now I can make my code faster!", then
    almost immediately (after getting one of the new CPUs) hit the tiny
    little snag of "Oh, it does not seem to work!?!".

    I.e. I have written code where actually loading a byte (forcing a fetch,
    not just a hint) from every cache line in a 4KB block will speed it up,
    but almost never seen any gains from adding a PREFETCH (n) cache lines
    in front of the current operating address.


    SW must know about latency model it is using prefetch
    to ameliorate !! But which one ??

    It is already pretty complicated without that added issue.

    Terje

    I have had good results prefetching data for the next iteration of a
    loop - prefetch at the top of the loop - and letting the load(s)
    percolate *while* processing the current iteration.

    I haven't had much success with small loops, e.g., prefetching data
    for N+3,N+4 while working on N ... but it has benefited larger loops
    that do significant processing.

    Obviously, MMV.
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Tue Jan 2 23:43:53 2024
    "Chris M. Thomasson" <[email protected]> writes:
    On 1/1/2024 11:06 PM, Terje Mathisen wrote:
    MitchAlsup wrote:
    Paul A. Clayton wrote:


    SW must know about latency model it is using prefetch
    to ameliorate !! But which one ??

    It is already pretty complicated without that added issue.

    Terje


    Think of iterating a linked list of adjacent cache lines. A prefetch of
    the next cache line would be in order, right?

    The last several generations of processors have had very good
    prefetch circuitry built into the cache subsystem. Attempting
    to out-think them by explicitly using prefetch instructions
    in cases where the processor could figure out the access pattern
    itself in the prefetcher seems counter productive.

    Which is why naive attempts to add explicit prefetching generally don't show useful performance improvements.

    It is in the cases where the processor prefetcher cannot determine an access pattern that explicit prefetch instructions start to become useful.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Wed Jan 3 09:42:33 2024
    On 12/23/2023 1:44 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 12/23/2023 6:37 AM, John Dallman wrote:
    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:

    Which compiler was this?

    Microsoft. Our primary demand for Itanium, when things were starting up
    for it, was Windows, because there was no 64-bit Windows at the time and >>> some large customers really wanted that. That demand evaporated like dry >>> ice in Death Valley when Intel announced their x86-64.

    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
    on the x86 side of things (DCAS in IBM parlance), which is the impetus
    for my inventing ASF which later morphed into ESM. My point was and
    still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or one can provide the primitives such that SW can invent and use whatever primitive they can figure out how to program and how to use. {{There is no 3rd
    choice}}

    Perhaps. While I certainly agree that you can do pretty much every
    reasonable "primitive" using ESM, there is a cost to doing that. Let's
    say you want an atomic memory increment and update. With ESM, I think
    this would be three instructions, whereas in implementations that
    support it directly, it is one instruction (although the one instruction
    would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.

    I don't have a feel for which of the existing instructions are the most
    used, but a perhaps "best of both worlds" implementation would provide
    for the most used as single instructions (only where they would be
    better performing than ESM), and provide ESM for whatever complex
    mechanism someone might want. Then, in the future, only if some new
    mechanism proves out and becomes popular, a future version could provide
    that mechanism as a more performant single instruction in a totally
    upward compatible way.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Wed Jan 3 18:39:14 2024
    Stephen Fuld wrote:

    On 12/23/2023 1:44 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:


    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
    on the x86 side of things (DCAS in IBM parlance), which is the impetus
    for my inventing ASF which later morphed into ESM. My point was and
    still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or one can
    provide the primitives such that SW can invent and use whatever primitive
    they can figure out how to program and how to use. {{There is no 3rd
    choice}}

    Perhaps. While I certainly agree that you can do pretty much every reasonable "primitive" using ESM, there is a cost to doing that. Let's
    say you want an atomic memory increment and update. With ESM, I think
    this would be three instructions, whereas in implementations that
    support it directly, it is one instruction (although the one instruction would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.

    I don't have a feel for which of the existing instructions are the most
    used, but a perhaps "best of both worlds" implementation would provide
    for the most used as single instructions (only where they would be
    better performing than ESM), and provide ESM for whatever complex
    mechanism someone might want. Then, in the future, only if some new mechanism proves out and becomes popular, a future version could provide
    that mechanism as a more performant single instruction in a totally
    upward compatible way.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Wed Jan 3 19:05:50 2024
    Stephen Fuld wrote:


    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over
    on the x86 side of things (DCAS in IBM parlance), which is the impetus
    for my inventing ASF which later morphed into ESM. My point was and
    still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or one can
    provide the primitives such that SW can invent and use whatever primitive
    they can figure out how to program and how to use. {{There is no 3rd
    choice}}

    Perhaps. While I certainly agree that you can do pretty much every reasonable "primitive" using ESM, there is a cost to doing that. Let's
    say you want an atomic memory increment and update. With ESM, I think
    this would be three instructions, whereas in implementations that
    support it directly, it is one instruction (although the one instruction would take longer than any one of the three used by ESM). So the single instruction implementation might be faster, and certainly takes less instruction space than the ESM implementation.

    What prevents the core from recognizing::

    LDL R7,[some address]
    ADD R8,R7,R3
    STL R8,[some address] // last use of R8

    and converting it into::

    ADD_to_Memory R7,[some address],R3

    ??

    I don't have a feel for which of the existing instructions are the most
    used, but a perhaps "best of both worlds" implementation would provide
    for the most used as single instructions (only where they would be
    better performing than ESM), and provide ESM for whatever complex
    mechanism someone might want. Then, in the future, only if some new mechanism proves out and becomes popular, a future version could provide
    that mechanism as a more performant single instruction in a totally
    upward compatible way.

    ASF was developed as a means to avoid having to invent new ATOMIC
    instructions every processor generation, and was developed with
    a keen eye on how microcode (or other sequencer) would manipulate
    current hardware (Opteron) to provide the illusion of atomicity.
    The goal of ASF was to allow SW to implement all of the (then)
    known ATOMIC primitives without needing new instructions.

    ESM is ASF expressed in RISC ISA with solutions to several issues
    that could not be solved under x86-64 ISA and in particular x86
    conditional branches. For example:: the Branch on the condition of
    interference instruction (BIN) samples the interference monitor
    and branches on the current value, whereas x86 would have to access
    the interference bit and then branch conditionally on it, causing
    already stale information used in branching.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Wed Jan 3 11:18:44 2024
    On 1/3/2024 11:05 AM, MitchAlsup wrote:
    Stephen Fuld wrote:


    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over >>> on the x86 side of things (DCAS in IBM parlance), which is the
    impetus for my inventing ASF which later morphed into ESM. My point
    was and still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or one
    can
    provide the primitives such that SW can invent and use whatever
    primitive
    they can figure out how to program and how to use. {{There is no 3rd
    choice}}

    Perhaps.  While I certainly agree that you can do pretty much every
    reasonable "primitive" using ESM, there is a cost to doing that.
    Let's say you want an atomic memory increment and update.  With ESM, I
    think this would be three instructions, whereas in implementations
    that support it directly, it is one instruction (although the one
    instruction would take longer than any one of the three used by ESM).
    So the single instruction implementation might be faster, and
    certainly takes less instruction space than the ESM implementation.

    What prevents the core from recognizing::

          LDL     R7,[some address]
          ADD     R8,R7,R3
          STL     R8,[some address]       // last use of R8

    and converting it into::

          ADD_to_Memory R7,[some address],R3

    ??

    I don't know. If the core can do that then great. However, can the
    core distinguish between that sequence that should be atomic, and an
    ordinary (i.e. non atomic) memory update? I suppose the atomic version
    means some extra things going on (bus lock, collision detection, etc.)
    that the ordinary version doesn't need. That cost, would be avoided in
    my proposal. And it still requires three times the memory footprint.


    I don't have a feel for which of the existing instructions are the
    most used, but a perhaps "best of both worlds" implementation would
    provide for the most used as single instructions (only where they
    would be better performing than ESM), and provide ESM for whatever
    complex mechanism someone might want.  Then, in the future, only if
    some new mechanism proves out and becomes popular, a future version
    could provide that mechanism as a more performant single instruction
    in a totally upward compatible way.

    ASF was developed as a means to avoid having to invent new ATOMIC instructions every processor generation, and was developed with
    a keen eye on how microcode (or other sequencer) would manipulate
    current hardware (Opteron) to provide the illusion of atomicity.
    The goal of ASF was to allow SW to implement all of the (then) known
    ATOMIC primitives without needing new instructions.

    ESM is ASF expressed in RISC ISA with solutions to several issues
    that could not be solved under x86-64 ISA and in particular x86
    conditional branches. For example:: the Branch on the condition of interference instruction (BIN) samples the interference monitor
    and branches on the current value, whereas x86 would have to access
    the interference bit and then branch conditionally on it, causing
    already stale information used in branching.

    Thanks for the history. Don't get me wrong. I like ESM, and think it
    is an elegant generalized solution for whatever new needs arise. I just
    think that generality comes at a cost, which might be avoided most of
    the time.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Wed Jan 3 20:59:39 2024
    Stephen Fuld wrote:

    On 1/3/2024 11:05 AM, MitchAlsup wrote:
    Stephen Fuld wrote:


    Do you remember the rather odd instruction for the Itanium called
    cmp8xchg16?

    Yes, at that point in time, MS was also asking for Compare 2 Swap 2 over >>>> on the x86 side of things (DCAS in IBM parlance), which is the
    impetus for my inventing ASF which later morphed into ESM. My point
    was and still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or one
    can
    provide the primitives such that SW can invent and use whatever
    primitive
    they can figure out how to program and how to use. {{There is no 3rd
    choice}}

    Perhaps.  While I certainly agree that you can do pretty much every
    reasonable "primitive" using ESM, there is a cost to doing that.
    Let's say you want an atomic memory increment and update.  With ESM, I
    think this would be three instructions, whereas in implementations
    that support it directly, it is one instruction (although the one
    instruction would take longer than any one of the three used by ESM).
    So the single instruction implementation might be faster, and
    certainly takes less instruction space than the ESM implementation.

    What prevents the core from recognizing::

          LDL     R7,[some address]
          ADD     R8,R7,R3
          STL     R8,[some address]       // last use of R8

    and converting it into::

          ADD_to_Memory R7,[some address],R3

    ??

    I don't know. If the core can do that then great. However, can the
    core distinguish between that sequence that should be atomic, and an
    ordinary (i.e. non atomic) memory update?

    The memory update is already ATOMIC in that all interested parties see
    only the before and only the after bit-pattern even without using a lock
    bit on the "bus".

    I suppose the atomic version
    means some extra things going on (bus lock, collision detection, etc.)
    that the ordinary version doesn't need. That cost, would be avoided in
    my proposal. And it still requires three times the memory footprint.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Jan 7 23:31:00 2024
    Paul A. Clayton wrote:

    On 1/2/24 3:49 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]
    No. The fact that the thread is within an atomic event is part
    of thread state (I assume ll/sc implementations make atomic even
    success part of hardware thread/core state). Knowing that the
    thread is in failed atomic event state, stores can be guarded
    such that they do not manifest.

    The transfer of control to the ATOMIC control point prior to performing
    the exception or interrupt control transfer means ATOMIC adds nothing
    to thread state !!! You do not need to remember the participating
    cache liens (or their state), nor the fact you are in an event, ...

    My comments were in the context of extending ll/sc without adding instructions. While I think one could interpret a branch
    instruction immediately after a ll as setting up such a control
    point — merely reinterpreting an instruction in that specific
    context not "adding" an instruction — the retaining of "in
    transaction" state allows just executing until the sc is reached.

    One could use a typical branch to reassign ATOMIC control point !!
    But I used a special conditional branch (BIN) which samples the
    interference monitor and branches if interference occurs. This inst
    also reassigns the ATOMIC control point to be the target of the
    branch.

    I wanted control flow between the start of the event and the
    conclusion of the event to be natural to SW (which should
    preclude using a typical BC as the special one).

    The concern expressed was that ordinary store instructions would
    leak when resuming from an interrupt.

    Can you illustrate this as a scenario ??

    ll r4 ← [r2]
    ld r3 ← [r2+4]
    sub r5 ← r4 - r3
    add r6 ← r3 + 12
    <interrupt> // transaction_valid state set to "invalid"
    // old thread tx_active saved as "active"
    [interrupt handling code]
    <resume thread>
    st [r2+4] ← r6 // core discards this store since "active"
    // and also "invalid"
    sc [r2] ← r5 // normal behavior for sc when "invalid"

    ll r4 ← [r2] // this is the ATOMIC control point
    ld r3 ← [r2+4]
    sub r5 ← r4 - r3
    add r6 ← r3 + 12
    <interrupt> // event has failed
    // control reverts to ATOMIC control point
    // no special state
    [interrupt handling code]
    <resume thread>
    ll r4 ← [r2] // this is the ATOMIC control point again
    ld r3 ← [r2+4]
    sub r5 ← r4 - r3
    add r6 ← r3 + 12
    ...

    The saving of the extra bit (tx_active) is not an _application_
    software compatibility issue, but it does seem to require the
    interrupt handler to save this extra state. (Unlike My 66000,
    the ll/sc using ISAs do not have hardware context management.)

    It is not the context management it is that there is no state to
    save or restore that is perinate to ATOMIC events. The event fails
    before interrupt control transfer transpires.

    Executing through from the resume point to the sc is not
    efficient given that the transaction is known to fail, but
    compatibility is maintained.

    Which is why I don't do it.

    The point was to avoid EricP's proposed extra instructions to
    indicate that memory operations were part of a transaction.

    EricP is on different glide path to optimizing synchronization.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Mon Jan 8 10:44:37 2024
    On 1/3/2024 12:59 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 1/3/2024 11:05 AM, MitchAlsup wrote:

    My point
    was and still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or
    one can
    provide the primitives such that SW can invent and use whatever
    primitive
    they can figure out how to program and how to use. {{There is no
    3rd choice}}

    Perhaps.  While I certainly agree that you can do pretty much every
    reasonable "primitive" using ESM, there is a cost to doing that.
    Let's say you want an atomic memory increment and update.  With ESM,
    I think this would be three instructions, whereas in implementations
    that support it directly, it is one instruction (although the one
    instruction would take longer than any one of the three used by
    ESM). So the single instruction implementation might be faster, and
    certainly takes less instruction space than the ESM implementation.

    What prevents the core from recognizing::

           LDL     R7,[some address]
           ADD     R8,R7,R3
           STL     R8,[some address]       // last use of R8 >>>
    and converting it into::

           ADD_to_Memory R7,[some address],R3

    ??

    I don't know.  If the core can do that then great.  However, can the
    core distinguish between that sequence that should be atomic, and an
    ordinary (i.e. non atomic) memory update?

    The memory update is already ATOMIC in that all interested parties see
    only the before and only the after bit-pattern even without using a lock
    bit on the "bus".

                                              I suppose the atomic version
    means some extra things going on (bus lock, collision detection, etc.)
    that the ordinary version doesn't need.  That cost, would be avoided
    in my proposal.  And it still requires three times the memory footprint.

    OK, I think I misunderstood what you proposed about "coalescing" the instructions within the core. I missed that your code had the "L" on
    the load and store instructions. So forget about confusing the sequence
    with non-atomic updates.

    So yes, the core could internally recognize the sequence and turn it
    into what I suppose would take the same amount of time as the single instruction. But I guess that would require at least as much circuitry
    as executing the single atomic instruction. And, to actually gain the
    benefit, you have to document that the core is doing this for that
    particular sequence. So ISTM the only extra cost of the atomic
    instruction is the op code (which I know you want to minimize). And
    you still have the fact of three instructions in the stream versus 1,
    I-cache usage, etc.

    Of course, this same argument would apply to other popular "primitives"
    such as TS, or whatever. So, under this scenario, it comes down the
    equivalent core space, the cost of an extra op-code, versus the benefit
    of a smaller program. Not too major either way. But still an
    alternative to the two you mentioned above.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Mon Jan 8 20:55:11 2024
    Stephen Fuld wrote:

    On 1/3/2024 12:59 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 1/3/2024 11:05 AM, MitchAlsup wrote:

    My point
    was and still is
    that one can add 1 to 3 ATOMIC instructions every product rev, or
    one can
    provide the primitives such that SW can invent and use whatever
    primitive
    they can figure out how to program and how to use. {{There is no
    3rd choice}}

    Perhaps.  While I certainly agree that you can do pretty much every >>>>> reasonable "primitive" using ESM, there is a cost to doing that.
    Let's say you want an atomic memory increment and update.  With ESM, >>>>> I think this would be three instructions, whereas in implementations >>>>> that support it directly, it is one instruction (although the one
    instruction would take longer than any one of the three used by
    ESM). So the single instruction implementation might be faster, and
    certainly takes less instruction space than the ESM implementation.

    What prevents the core from recognizing::

           LDL     R7,[some address]
           ADD     R8,R7,R3
           STL     R8,[some address]       // last use of R8 >>>>
    and converting it into::

           ADD_to_Memory R7,[some address],R3

    ??

    I don't know.  If the core can do that then great.  However, can the
    core distinguish between that sequence that should be atomic, and an
    ordinary (i.e. non atomic) memory update?

    The memory update is already ATOMIC in that all interested parties see
    only the before and only the after bit-pattern even without using a lock
    bit on the "bus".

                                              I suppose the atomic version
    means some extra things going on (bus lock, collision detection, etc.)
    that the ordinary version doesn't need.  That cost, would be avoided
    in my proposal.  And it still requires three times the memory footprint.

    OK, I think I misunderstood what you proposed about "coalescing" the instructions within the core. I missed that your code had the "L" on
    the load and store instructions. So forget about confusing the sequence
    with non-atomic updates.

    Accepted.......

    So yes, the core could internally recognize the sequence and turn it
    into what I suppose would take the same amount of time as the single instruction.

    The advantage is that the bundled version ALLWAYS succeeds even in the
    face of withering interference, whereas the core executed version does
    not.

    But I guess that would require at least as much circuitry
    as executing the single atomic instruction. And, to actually gain the benefit, you have to document that the core is doing this for that
    particular sequence. So ISTM the only extra cost of the atomic
    instruction is the op code (which I know you want to minimize). And
    you still have the fact of three instructions in the stream versus 1,
    I-cache usage, etc.

    Granted.

    Of course, this same argument would apply to other popular "primitives"
    such as TS, or whatever. So, under this scenario, it comes down the equivalent core space, the cost of an extra op-code, versus the benefit
    of a smaller program. Not too major either way. But still an
    alternative to the two you mentioned above.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 18:42:43 2024
    Chris M. Thomasson [2023-12-22 11:37:03] wrote:
    On 12/22/2023 9:21 AM, Anton Ertl wrote:
    Somebody ask for a case where SC reduced performance. TSO also has
    reduced performance in this case, whereas Cache, causal, and weak
    consistencies do not.
    SC radically reduces performance on existing archs for sure. Think of iterating a linked list. Wrt RCU, the readers can iterate the list without using any memory barriers (DEC Alpha aside for a moment). This is FAR

    Anton is talking about SC provided directly by the ISA (e.g. as on the
    MIPS architecture), in which case there are no barriers at all.
    IOW, the cost is moved to the CPU's OoO engine.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Thu Jan 11 00:24:34 2024
    Stefan Monnier wrote:

    Chris M. Thomasson [2023-12-22 11:37:03] wrote:
    On 12/22/2023 9:21 AM, Anton Ertl wrote:
    Somebody ask for a case where SC reduced performance. TSO also has
    reduced performance in this case, whereas Cache, causal, and weak
    consistencies do not.
    SC radically reduces performance on existing archs for sure. Think of
    iterating a linked list. Wrt RCU, the readers can iterate the list without >> using any memory barriers (DEC Alpha aside for a moment). This is FAR

    Anton is talking about SC provided directly by the ISA (e.g. as on the
    MIPS architecture), in which case there are no barriers at all.

    I too am similarly talking about an architecture where memory is weaker
    than SC (or TSO) and you still need no MBARs. The trick is the ability
    to recognize when stronger orders are required and automagically apply
    the stronger model for a short but needed duration then automagically
    revert to the weaker model when it is not.

    That is placing the memory model as bits in a PSL is ill-advised.

    IOW, the cost is moved to the CPU's OoO engine.

    Since small-width in-order processors are invariable SC, it is always
    in the OoO engine.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Thu Jan 11 06:57:08 2024
    "Chris M. Thomasson" <[email protected]> writes:
    Can a C++ program use all std::memory_order_relaxed barriers for its
    atomic sync algorithms and still be correct on MIPS?

    In a sequentially consistent architecture, memory barriers are nops,
    so you can insert or delete them wihout affecting correctness (well,
    such an architecture actually won't have instructions for them, unless
    there is a variant of the architecture with weaker consistency).

    I don't know if MIPS guarantees sequential consistency, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Thu Jan 11 07:48:15 2024
    [email protected] (MitchAlsup) writes:
    Since small-width in-order processors are invariable SC, it is always
    in the OoO engine.

    "Invariable"? Alpha has had memory barrier instructions from the
    start, while its implementations were still small-width in-order
    processors. The advocacy paper for weak consistency
    [adve&gharachorloo95] was written by DEC people three years before DEC
    released the OoO 21264, and the arguments why they think that weak
    consistency is necessary for performance have nothing to do with OoO,
    but mainly with caches, but also with accessing several main memory
    channels in parallel.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)