• Re: Efficiency of in-order vs. OoO

    From Anton Ertl@21:1/5 to Paul A. Clayton on Wed Jan 24 07:47:31 2024
    "Paul A. Clayton" <[email protected]> writes:
    In AnandTech's Exynos9820 comparison, one knows the process used
    is the same, but one does not know how optimized the designs were
    at the HDL level nor at the netlist ("compiled") level. It is also
    possible to optimize the same HDL for different power-performance-
    area targets.

    Supposedly the A55 is there for efficiency, so if different
    optimizations were applied, one would assume that the A55 was
    optimized for perf/W.

    I would not be surprised if ARM did not invest the same design
    effort per unit performance (e.g.) in A55 as in A75.

    The A55 served as little core for the A75, A76, A77, and A78, and it
    served as only core for a significant number of SoCs. Why should ARM
    make it performance-inefficient and power-inefficient like you
    suggest? It certainly is more performance-efficient than the A53:

    LaTeX benchmark, numbers are times in seconds:

    - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
    - Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105
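
    Since the two boards run at slightly different clocks, a rough per-clock
    comparison can be derived from the two lines above. This is only a
    sketch: it assumes run time scales with core clock and ignores the
    differences in distribution, TeX Live version, and memory system.

```python
# Times and clocks copied from the two benchmark lines above.
a53 = {"mhz": 1896, "seconds": 2.488}  # Odroid N2, Cortex-A53
a55 = {"mhz": 1805, "seconds": 2.105}  # Rock 5B, Cortex-A55


def cycles(run):
    """Approximate core cycles for one run: clock (Hz) * wall time (s)."""
    return run["mhz"] * 1e6 * run["seconds"]


speedup = a53["seconds"] / a55["seconds"]  # wall-clock speedup of A55 over A53
per_clock = cycles(a53) / cycles(a55)      # speedup normalized to equal clock

print(f"wall-clock speedup A55 over A53: {speedup:.2f}x")
print(f"per-clock speedup A55 over A53:  {per_clock:.2f}x")
```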

    Likewise, I could imagine Samsung putting less effort into
    optimizing A55.

    What effort do you expect on Samsung's side? They only had a few
    months from the time when ARM gives them the IP until they tape out
    the Exynos 9820, so I expect they used the IP as-is, only deciding on
    some process parameters.

    Performance optimization likely makes less sense
    for background tasks (the likely targeted use for A55 in this
    case) and the benefit of core-level power optimization is likely
    less significant than the I/O power for many of the targeted
    tasks. Even optimizing for low energy cost for bursty workloads
    (useless energy for sleep/wakeup, e.g.) would probably not help
    much because of system power consumption.

    So you speculate (without evidence) that ARM/Samsung optimized the A55
    for low performance and low power-efficiency? Not very plausible.

    In any case, the A53/A55/A5xx line of ARM are the only cases where
    core designers have not switched from in-order to OoO designs (unlike
    Intel with their E-Cores and Xeon Phi where they switched from
    in-order to OoO in the face of power efficiency being of supreme
    importance for the E-Core line). And now you write that ARM did not
    design it for power efficiency. If you are right, that supports the
    position that in-order is uncompetitive not just wrt performance, but
    also perf/W as soon as there are relatively low performance
    requirements.

    The memory system, on-chip network, and such would also affect the
    energy efficiency. Exynos9820's memory system might _reasonably_
    be optimized for high power/high performance use; that would tend
    to hurt the efficiency of wimpy cores.

    What scenario do you imagine where one would want these in-order
    cores? ARM's niche for them is the little cores in a big.LITTLE
    design; that is necessarily coupled with a memory system with a high
    bandwidth. There are also SoCs with only A55 cores (no BIG ones) like
    the RK3566, but they are only bought because of the price, not because
    of their power-efficiency.

    I think system power is also less likely to scale well downward
    with performance. E.g., the same capacity L2 suited to one A75
    core might properly service more than two A55 cores. If the design
    had more A55 cores per L2 than A75 cores per L2, the A55 cores
    could be at a power disadvantage in single threaded use just from
    the L2 cache.

    In the Exynos 9820 the A55s have no private L2 cache, and they access
    1MB (shared with all other cores) out of the 4MB L3 cache (3MB of
    which are exclusive to the two M4 cores). This is Samsung's work, and
    it is not plausible that they optimized this setup for power
    inefficiency.

    One might be able to adjust for system power scaling factors by
    using all cores of a type for a run (e.g., SPECrate), but I
    suspect that would be tricky given fixed aspects of the hardware.

    <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Estimated_575px.png>
    <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
    from the article

    <https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/4>
    In the Exynos 9820, we see at different points of the DVFS curve:

           A55         |         A75
         in-order      |         OoO
    perf   mW   pf/mW  |  perf    mW   pf/mW
     1.0   22   0.046  |   3.7    88   0.042   highest efficiency point for each core
     1.4   33   0.042  |   3.7    88   0.042   same pf/mW at highest common efficiency
     2.7   90   0.030  |   3.7    88   0.042   same mW at lowest common mW
     5.1  400   0.013  |   5.1   124   0.041   same perf at highest common performance
     5.1  400   0.013  |  10.5   400   0.027   same mW at highest common mW
     5.1  400   0.013  |  17.2  1270   0.013   highest performance point for each core

    "perf" is SPEC2006 Int+FP Geomean. "pf/mW" (shown as "Perf/W" in the
    second graph) is SPEC Int+FP Geomean/mW (you can confirm this by
    computing corresponding numbers from the first graph).
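
    The check suggested above can be sketched as follows. The (perf, mW)
    pairs are the rounded values from the table, so the recomputed ratios
    can differ from the pf/mW column in the last digit; the tolerance used
    here is an assumption, not something taken from the article.

```python
# (perf, mW, pf/mW as listed) copied from the table above.
a55_rows = [(1.0, 22, 0.046), (1.4, 33, 0.042), (2.7, 90, 0.030), (5.1, 400, 0.013)]
a75_rows = [(3.7, 88, 0.042), (5.1, 124, 0.041), (10.5, 400, 0.027), (17.2, 1270, 0.013)]

TOLERANCE = 0.0015  # assumed allowance for rounding of the graph readings


def recompute(rows):
    """Return (listed, recomputed) pf/mW pairs for each table row."""
    return [(listed, perf / mw) for perf, mw, listed in rows]


for name, rows in (("A55", a55_rows), ("A75", a75_rows)):
    for listed, computed in recompute(rows):
        ok = abs(listed - computed) <= TOLERANCE
        print(f"{name}: listed {listed:.3f}, recomputed {computed:.4f}, consistent: {ok}")
```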

    The SPEC2006 workload probably also biases in favor of larger
    cores, especially the FP portion.

    So you admit that in-order is not efficient for SPEC2006? Given that
    SPEC CPU benchmarks are commonly accepted as representative for tasks
    where CPU is relevant, what does that tell us about in-order cores?

    I suspect A55 uses 64-bit width
    SIMD execution (which makes sense for the targeted use), which
    would substantially reduce SPECFP performance and possibly degrade
    SPECINT performance.

    The power consumption should be correspondingly lower. If there was a power-efficiency advantage to in-order, you should still see it.

    Even the gcc component of SPECINT might be more compute dense than
    the targeted workloads for A55 (which might often be more
    performance constrained by I/O) and gcc is probably less "compute
    dense" than other SPECINT components.

    So the supposed power efficiency of in-order is only relevant for
    workloads that don't compute, but only wait for I/O all the time?
    Even if it was, the solution to me seems to get rid of most of this
    waiting by moving it off the CPU to some dedicated circuit. And AFAIK
    in all I/O that moves a lot of data (e.g., block devices, network
    devices), that has happened.

    Obviously an extremely biased workload like the data analysis
    workloads targeted by Intel's research chip would probably show
    A55 in a better light (though A55 would likely be very inefficient
    compared to the research design, I think it used 4-way threaded
    in-order cores with limited cache and narrow memory channels [to
    avoid 64-byte accesses to access 64-bits or less of data]), but
    that would not be "fair".

    I have no idea what Intel research chip you have in mind, but
    certainly, for special workloads specialized designs have been used successfully. In some cases, these workloads generate so much revenue
    that they result in their own specialized processors, as happened with
    GPUs which then have also been used for HPC and AI as GPGPUs (or is
    there a new word for that?).

    But programs for GPGPUs do not run efficiently on CPUs and vice versa,
    and the question at hand is if in-order is power-efficient for CPUs.
    And above a certain (relatively low) performance level, the answer
    seems to be: No.

    Core efficiency cannot be isolated from the system, especially if
    measured by system resource use (I *suspect* AnandTech measured
    system power and subtracted idle system power).

    You can read about how it was measured in the article: https://www.anandtech.com/print/14072/the-samsung-galaxy-s10plus-review

    Fair comparison is difficult, especially when the design targets
    are different.

    I think that the comparison is as fair as we can get. Of course if
    for some reason you don't want to be convinced, there are always some
    straws that you can grasp in the hope that they will save the belief
    system you favour. But if you look at it objectively, all evidence
    there is (from Transmeta through Intel's E-cores and the lack of
    in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
    data) supports the position that in-order is not more power-efficient
    than OoO above a certain performance level, while the opposite
    position cannot point to evidence, but only to some corners where we
    don't have evidence, and where in-order fans hope that these corners
    will favour in-order.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Wed Jan 24 18:38:53 2024
    On Wed, 24 Jan 2024 07:47:31 +0000, Anton Ertl wrote:

    I think that the comparison is as fair as we can get. Of course if
    for some reason you don't want to be convinced, there are always some
    straws that you can grasp in the hope that they will save the belief
    system you favour. But if you look at it objectively, all evidence
    there is (from Transmeta through Intel's E-cores and the lack of
    in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
    data) supports the position that in-order is not more power-efficient
    than OoO above a certain performance level, while the opposite
    position cannot point to evidence, but only to some corners where we
    don't have evidence, and where in-order fans hope that these corners
    will favour in-order.

    Above a certain performance level, _all_ cores are out-of-order.

    If in-order is more power-efficient than out-of-order at *low*
    performance levels, then the basic notion that implementing
    out-of-order requires some extra transistors, and transistors
    take power, is confirmed. That basic notion is what leads
    people to hope that, if in-order could be extended to higher
    performance levels, then it would provide power savings there
    too.

    Let us then imagine what a high-performance in-order CPU would
    look like. Its goal would be to achieve what OoO achieves to
    improve performance without being OoO.

    Thus, such a CPU would have a giant architectural register
    file - to match the large hardware register files, including
    rename registers, of OoO systems.

    So we're talking AMD 29000 or Itanium. AMD sold off the 29000,
    and it's still being used for compatibility reasons in some
    aviation hardware.

    The sample size is small, and so it's not that unreasonable to
    argue that although the Itanium failed to meet expectations, this
    class of architectures may still deserve some more investigation
    and study. Yes, there's no high-performance OoO-beating in-order
    chip you can buy off the shelf today, but maybe it's still worth
    trying to design one.

    What arguments are there against that? I can see a few:

    - It's been tried many times, and failed each time. (This
    doesn't _seem_ to be the case, but the few times it was
    tried may have been enough to prove the point.)

    - The benefits of in-order at high performance are known
    to be negligible. (That is, the gate cost of OoO at
    high performance scales well, and becomes a decreasing
    fraction of transistor count in higher-performance designs.
    Mitch tells us that the GBOoO direction of progress is
    *not sustainable*, so that doesn't seem to be the case.)

    - The drawbacks of in-order outweigh their benefits.
    (If you have larger register files, you have bigger
    instructions, so you fetch more code out of DRAM.
    Is that really enough to make the difference?)

    However, in framing this counterargument in favor of
    in-order, the *fatal* drawback of in-order for high
    performance has dawned on me. (Although Ivan Godard
    in his Mill design is, in fact, making an effort to
    address just this particular drawback!)

    As Mitch notes, to further increase performance, OoO
    has become GBOoO: ever larger hardware register files
    and so on.

    This means that, even if an in-order design which had
    large architectural register files, an exposed pipeline,
    and so on, matched _current_ OoO CPUs in performance
    for less power...

    the performance of OoO CPUs doesn't stand still...

    and so the _next generation_ of the in-order design
    would have to have *larger* register files (and, no
    doubt, all sorts of other things)...

    which means it wouldn't be upwards-compatible with
    software for the last generation.

    That's why in-order RISC ended up being succeeded by
    OoO implementations of the same ISA! Going from 32
    registers to 128 registers to stay in-order... isn't
    just something you can *do only once*, and solve the
    problem forever!

    It is by looking at the real problem that the false hope
    of high-performance in-order can finally be dashed. Maybe
    it isn't technically impossible. But for the mass market
    that wants to coalesce around a popular and stable
    platform, it may not be able to meet *their* requirements,
    even if such architectures could still find a niche
    (like supercomputers that are only programmed by the
    users themselves in FORTRAN).

    John Savard

  • From MitchAlsup1@21:1/5 to Quadibloc on Wed Jan 24 20:03:15 2024
    Quadibloc wrote:

    On Wed, 24 Jan 2024 07:47:31 +0000, Anton Ertl wrote:

    I think that the comparison is as fair as we can get. Of course if
    for some reason you don't want to be convinced, there are always some
    straws that you can grasp in the hope that they will save the belief
    system you favour. But if you look at it objectively, all evidence
    there is (from Transmeta through Intel's E-cores and the lack of
    in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
    data) supports the position that in-order is not more power-efficient
    than OoO above a certain performance level, while the opposite
    position cannot point to evidence, but only to some corners where we
    don't have evidence, and where in-order fans hope that these corners
    will favour in-order.

    Above a certain performance level, _all_ cores are out-of-order.

    Above about 1.0 I/c everything goes OoO.

    Remember the 1st generation RISCs performed at about 0.7 I/c
    and the 2-wide In-Order machines close to 1.0 I/c;
    Data from simulations showed 4-wide IO machines near 1.13 I/c

    If in-order is more power-efficient than out-of-order at *low*
    performance levels, then the basic notion that implementing
    out-of-order requires some extra transistors, and transistors
    take power, is confirmed. That basic notion is what leads
    people to hope that, if in-order could be extended to higher
    performance levels, then it would provide power savings there
    too.

    Let us then imagine what a high-performance in-order CPU would
    look like. Its goal would be to achieve what OoO achieves to
    improve performance without being OoO.

    Do not forget Vector Machines as IO at high perf. By repeating
    the same calculation (or memory reference) 64 times and chaining
    (i.e., forwarding) they could achieve several (~3 I/c) long term.

    Thus, such a CPU would have a giant architectural register
    file - to match the large hardware register files, including
    rename registers, of OoO systems.

    CRAY 1 had 4096 Bytes of Vector Registers (and only 8 registers).

    So we're talking AMD 29000 or Itanium. AMD sold off the 29000,
    and it's still being used for compatibility reasons in some
    aviation hardware.

    The sample size is small, and so it's not that unreasonable to
    argue that although the Itanium failed to meet expectations, this
    class of architectures may still deserve some more investigation
    and study. Yes, there's no high-performance OoO-beating in-order
    chip you can buy off the shelf today, but maybe it's still worth
    trying to design one.

    What arguments are there against that? I can see a few:

    - It's been tried many times, and failed each time. (This
    doesn't _seem_ to be the case, but the few times it was
    tried may have been enough to prove the point.)

    You can add DataFlow to this list. Tried several times, the most
    successful (I think) was Monsoon.

    - The benefits of in-order at high performance are known
    to be negligible. (That is, the gate cost of OoO at
    high performance scales well, and becomes a decreasing
    fraction of transistor count in higher-performance designs.
    Mitch tells us that the GBOoO direction of progress is
    *not sustainable*, so that doesn't seem to be the case.)

    Non-switching transistors cost area but not <much> power.

    - The drawbacks of in-order outweigh their benefits.
    (If you have larger register files, you have bigger
    instructions, so you fetch more code out of DRAM.
    Is that really enough to make the difference?)

    There is a difference between the architectural register file
    (32 entry) and the implementation register file (128 rename
    pool). The above confuses the two.

    However, in framing this counterargument in favor of
    in-order, the *fatal* drawback of in-order for high
    performance has dawned on me. (Although Ivan Godard
    in his Mill design is, in fact, making an effort to
    address just this particular drawback!)

    As Mitch notes, to further increase performance, OoO
    has become GBOoO: ever larger hardware register files
    and so on.

    Only to the point where the register file can still cycle
    in 1 clock. This puts the limit somewhere between 128 and
    256 total registers.

    This means that, even if an in-order design which had
    large architectural register files, an exposed pipeline,
    and so on, matched _current_ OoO CPUs in performance
    for less power...

    the performance of OoO CPUs doesn't stand still...

    and so the _next generation_ of the in-order design
    would have to have *larger* register files (and, no
    doubt, all sorts of other things)...

    which means it wouldn't be upwards-compatible with
    software for the last generation.

    That's why in-order RISC ended up being succeeded by
    OoO implementations of the same ISA! Going from 32
    registers to 128 registers to stay in-order... isn't
    just something you can *do only once*, and solve the
    problem forever!

    And Itanic's downfall.

    It is by looking at the real problem that the false hope
    of high-performance in-order can finally be dashed. Maybe
    it isn't technically impossible. But for the mass market
    that wants to coalesce around a popular and stable
    platform, it may not be able to meet *their* requirements,
    even if such architectures could still find a niche
    (like supercomputers that are only programmed by the
    users themselves in FORTRAN).

    Vector machines fell out of fashion when the length of the
    vector register could no longer absorb the latency to memory.
    {{Although NEC persisted for longer}}

    Given certain kinds of HW (CAMs) one can build sort algorithms
    in linear time for sorts of less than 128-entries. Then resort
    to merges for longer lists.
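
    A software sketch of that block-then-merge structure, under loud
    assumptions: the CAM step is modelled as a rank sort, where each
    element's final position is the count of elements that precede it. In
    hardware all comparisons for one element would happen in parallel, so a
    block takes a linear number of steps; in software this loop is
    quadratic. The names rank_sort and block_merge_sort are hypothetical,
    and this is not a description of any particular machine.

```python
import heapq

BLOCK = 128  # block size taken from the "less than 128 entries" remark


def rank_sort(block):
    """Place each element at its rank; ties keep their original order."""
    out = [None] * len(block)
    for i, x in enumerate(block):
        # A CAM would evaluate all of these comparisons in one parallel step.
        rank = sum(1 for j, y in enumerate(block) if y < x or (y == x and j < i))
        out[rank] = x
    return out


def block_merge_sort(values):
    """Sort fixed-size blocks by rank, then merge the sorted blocks."""
    blocks = [rank_sort(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]
    return list(heapq.merge(*blocks))


data = [(i * 7919) % 1000 for i in range(500)]
print(block_merge_sort(data) == sorted(data))
```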

    John Savard

  • From Thomas Koenig@21:1/5 to Quadibloc on Wed Jan 24 20:33:20 2024
    Quadibloc <[email protected]d> schrieb:

    Above a certain performance level, _all_ cores are out-of-order.

    That is true for general-purpose CPUs, but not for GPUs - these
    are in-order. I think AMD and NVIDIA differ in their handling
    of register hazards - AMD handles them, NVIDIA depends on the
    compiler (well, whatever you want to call the piece of software
    that translates the intermediate PTX into whatever the graphics
    card itself understands) to do this.

  • From Anton Ertl@21:1/5 to Quadibloc on Wed Jan 24 21:54:12 2024
    Quadibloc <[email protected]d> writes:
    On Wed, 24 Jan 2024 07:47:31 +0000, Anton Ertl wrote:

    I think that the comparison is as fair as we can get. Of course if
    for some reason you don't want to be convinced, there are always some
    straws that you can grasp in the hope that they will save the belief
    system you favour. But if you look at it objectively, all evidence
    there is (from Transmeta through Intel's E-cores and the lack of
    in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
    data) supports the position that in-order is not more power-efficient
    than OoO above a certain performance level, while the opposite
    position cannot point to evidence, but only to some corners where we
    don't have evidence, and where in-order fans hope that these corners
    will favour in-order.

    Above a certain performance level, _all_ cores are out-of-order.

    True, but not only that: In the Exynos 9820, at nearly all of the
    performance range of the A55 (roughly as soon as you clock it
    >=500MHz), the A75 offers more performance at better efficiency; the
    A55 can run at 1800MHz on the Exynos 9820, but one better shouldn't,
    and certainly not above 1000MHz where not just the efficiency, but
    also the power consumption overtakes that of the A75 at its
    lowest-power point despite delivering less performance.
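
    Using the (perf, mW) points from the Exynos 9820 table earlier in the
    thread, this can be checked mechanically. The sketch below only looks
    at the handful of sampled points, not the full DVFS curves, and the
    helper name dominated is hypothetical.

```python
# (perf, mW) operating points copied from the Exynos 9820 table in this thread.
a55_points = [(1.0, 22), (1.4, 33), (2.7, 90), (5.1, 400)]
a75_points = [(3.7, 88), (5.1, 124), (10.5, 400), (17.2, 1270)]


def dominated(point, others):
    """True if another point gives at least the perf for at most the power,
    and is strictly better in at least one of the two."""
    perf, mw = point
    return any(p >= perf and w <= mw and (p > perf or w < mw) for p, w in others)


for point in a55_points:
    print(point, "dominated by an A75 point:", dominated(point, a75_points))
```

    Only the two lowest-power A55 points escape, because the A75 has no
    sampled point below 88mW.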

    If in-order is more power-efficient than out-of-order at *low*
    performance levels, then the basic notion that implementing
    out-of-order requires some extra transistors, and transistors
    take power, is confirmed.

    Certainly the A75 takes more area and more transistors than the A55;
    concerning area, looking at <https://images.anandtech.com/doci/14069/ChipRebel9820.png> there
    seems to be roughly a factor of 3-4 between them, just as for
    performance. But the A75 has been designed as a big core in ARM's
    big.LITTLE system, so it should not be surprising that it is bigger.

    Transistors (and connections) take power when they switch, and
    (non-switching) leakage was also a big topic maybe 15 years ago;
    CPU manufacturers have dealt with leakage by powering down inactive
    units, but powering them up again takes some time.

    That basic notion is what leads
    people to hope that, if in-order could be extended to higher
    performance levels, then it would provide power savings there
    too.

    Higher performance levels without more transistors? How?

    Let us then imagine what a high-performance in-order CPU would
    look like. Its goal would be to achieve what OoO achieves to
    improve performance without being OoO.

    We don't have to imagine. We know what the various IA-64
    implementations look like. And they were not power-efficient, on the
    contrary, IIRC Merced in particular was exceptionally power-hungry for
    its time at IIRC 130W. The 1.66GHz Madison is rated by Intel at 122W <https://ark.intel.com/content/www/us/en/ark/products/27995/intel-itanium-processor-1-66-ghz-9m-cache-667-mhz-fsb.html>,
    and the Itanium 9560 (Poulson) with 8 cores @ 2.53GHz has a TDP of
    170W.

    Thus, such a CPU would have a giant architectural register
    file - to match the large hardware register files, including
    rename registers, of OoO systems.

    If a big register file is all that is needed, IA-64 would have
    performed well (not just on software-pipelinable HPC code). But
    compare a 2002-vintage 900MHz Itanium 2 (130W TDP <https://www.hardware-aktuell.com/lexikon/Intel_Itanium_2>, 180nm)
    with its 128 integer registers with a 2000-vintage 800MHz K7
    (Thunderbird, also 180nm, 42.6W maximum power dissipation <https://www.cpu-world.com/CPUs/K7/AMD-Athlon%20800%20-%20A0800APT3B.html>) with IIRC 72 physical registers, on our LaTeX benchmark (lower is
    better):

    - HP workstation 900MHz Itanium II, Debian Linux 3.528
    - Athlon (Thunderbird) 800, Abit KT7, PC100-333, RedHat 5.1 2.49

    So here in-order provided lower performance at thrice the power
    consumption, two years later.
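
    The ratios behind that statement, as a sketch. It uses the rated
    figures quoted above (TDP and maximum power dissipation) as stand-ins
    for power drawn during the benchmark; that is an assumption, since a
    single-threaded LaTeX run probably stays below either rating, so the
    energy ratio is only indicative.

```python
# Times from the LaTeX benchmark lines above; power figures are the rated values quoted above.
itanium2 = {"seconds": 3.528, "watts": 130.0}  # 900MHz Itanium 2, TDP
athlon = {"seconds": 2.49, "watts": 42.6}      # 800MHz Thunderbird, max power dissipation

time_ratio = itanium2["seconds"] / athlon["seconds"]
power_ratio = itanium2["watts"] / athlon["watts"]

# Energy per run if each chip drew its rated power throughout (assumption).
energy_ratio = (itanium2["seconds"] * itanium2["watts"]) / (
    athlon["seconds"] * athlon["watts"]
)

print(f"run time ratio (Itanium 2 / Athlon):    {time_ratio:.2f}")
print(f"rated power ratio (Itanium 2 / Athlon): {power_ratio:.2f}")
print(f"rated energy per run ratio:             {energy_ratio:.2f}")
```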

    Anyway, a major advantage of OoO is that its scheduler can make use of
    the dynamic branch predictor and its superior accuracy. (Joshua
    Landau pointed out a way that allows static schedulers to make use of
    this accuracy, but it's doubtful that this can be achieved without a
    code explosion).

    Concerning the kind of regular code where IA-64 performed well, the
    rest of the world added SIMD registers which can be used to perform
    well on those applications; and even in that world (Xeon Phi), Intel
    first tried to go for in-order, but replaced it with OoO in the next generation, and eventually just added AVX512 to its mainstream
    performance cores, and made them the replacement for the Phi-Xeons.

    The sample size is small, and so it's not that unreasonable to
    argue that although the Itanium failed to meet expectations, this
    class of architectures may still deserve some more investigation
    and study.

    But both IA-64 and Transmeta burned through serious amounts of money
    pursuing the dream of superior in-order performance and (later for
    Transmeta, after superior performance evaporated) efficiency. If you
    cannot identify what you plan to do better and why that solves the
    problems that IA-64 had, you will likely find out that in-order cannot
    compete in performance and is not so great on efficiency, either.

    - It's been tried many times, and failed each time. (This
    doesn't _seem_ to be the case, but the few times it was
    tried may have been enough to prove the point.)

    In addition to IA-64 and the Transmeta chips, we can also point to the
    big in-order cores of the times before OoO took over, e.g., the 4-wide
    in-order 21164; it was succeeded and eclipsed by the 4-wide OoO 21264.
    Sun tried to stick to in-order (or failed to produce a competitive OoO
    CPU) for a long time, e.g., the 4-wide UltraSPARC III/IV/IV+, which
    was succeeded by the OoO SPARC64 VI (from Fujitsu). IBM switched from
    OoO in Power5 to in-order in Power6, and then back to OoO in Power7,
    but I know too little about these CPUs. Anyway, high-performance
    in-order has been tried by more than just Intel (IA-64) and Transmeta,
    and each time OoO has won. For efficiency, one can point to Intel's in-order
    Bonnell being succeeded by Intel's OoO Silvermont.

    - The benefits of in-order at high performance are known
    to be negligible. (That is, the gate cost of OoO at
    high performance scales well, and becomes a decreasing
    fraction of transistor count in higher-performance designs.
    Mitch tells us that the GBOoO direction of progress is
    *not sustainable*, so that doesn't seem to be the case.)

    For a while OoO seemed to be limited to 3-4 wide (Intel 1995-2010
    three-wide, 2011-2014 4-wide, AMD from at least 1999 (maybe earlier)
    until (I think) 2016), but in recent years we see significant width
    growth; e.g. the Cortex-X4 is 10-wide and even Intel's E-Core
    Gracemont is 5-wide.

    - The drawbacks of in-order outweigh their benefits.
    (If you have larger register files, you have bigger
    instructions, so you fetch more code out of DRAM.
    Is that really enough to make the difference?)

    I don't think so.

    However, in framing this counterargument in favor of
    in-order, the *fatal* drawback of in-order for high
    performance has dawned on me. (Although Ivan Godard
    in his Mill design is, in fact, making an effort to
    address just this particular drawback!)

    As Mitch notes, to further increase performance, OoO
    has become GBOoO: ever larger hardware register files
    and so on.

    This means that, even if an in-order design which had
    large architectural register files, an exposed pipeline,
    and so on, matched _current_ OoO CPUs in performance
    for less power...

    the performance of OoO CPUs doesn't stand still...

    and so the _next generation_ of the in-order design
    would have to have *larger* register files (and, no
    doubt, all sorts of other things)...

    Yes, one of the benefits of OoO is that existing code just runs fine
    *and fast* on next year's CPU, but in-order also tends to lose on code
    that is compiled specifically for the model it runs on.

    With the branch prediction disadvantage, you cannot make good use of
    more than 128 registers for speculation even if you have more.

    even if such architectures could still find a niche
    (like supercomputers that are only programmed by the
    users themselves in FORTRAN).

    Well, that part of the market has been taken away from in-order CPUs
    by OoO CPUs with SIMD, and by GPGPUs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Stefan Monnier@21:1/5 to All on Wed Jan 24 19:20:13 2024
    So here in-order provided lower performance at thrice the power
    consumption, two years later.

    What is clear is that currently, no one knows how to make in-order CPUs
    as fast as OoO for "general purpose" computing (i.e. not things you can
    run on things like GPGPUs or TPUs).

    But indeed, the more interesting aspect is that even in terms of
    efficiency, in-order seems to be a losing proposition.
    I'd be interested to hear opinions about why that is the case.

    I can think of two factors, tho there are probably more:
    - in-order CPUs spend more time waiting (which is the cause for their
    lower performance), and they still burn Joules while they wait,
    which throws away the Joules they presumably saved by staying clear of
    the OoO "baggage".
    - OoO execution is naturally more asynchronous, making it possible to
    make decisions about what to do when in a more local way, thus wasting
    less energy on costly whole-chip synchronization.


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jan 25 01:21:00 2024
    Stefan Monnier wrote:

    So here in-order provided lower performance at thrice the power
    consumption, two years later.

    What is clear is that currently, no one knows how to make in-order CPUs
    as fast as OoO for "general purpose" computing (i.e. not things you can
    run on things like GPGPUs or TPUs).

    I think this calls for a "point of order"::

    An in order pipeline must be kept short (5 cycles on the thin end,
    8 cycles on the fat end) whereas GBOoO machines start with 12 cycles
    on the thin end and 30 cycles on the fat end.

    You can make a GBOoO machine clock faster than the IO machine simply
    from less work in each pipe stage--and this makes up for the depth
    of the pipeline.

    Furthermore: IO machines are always latency bound, while GBOoO machines
    are schedule bound, capable of absorbing L1 cache misses, long cycle
    count instructions, ... that significantly harm IO machines.

    But indeed, the more interesting aspect is that even in terms of
    efficiency, in-order seems to be a losing proposition.
    I'd be interested to hear opinions about why that is the case.

    I can think of two factors, tho there are probably more:
    - in-order CPUs spend more time waiting (which is the cause for their
    lower performance), and they still burn Joules while they wait,

    A properly clock-gated IO design should not be wasting clocking (and
    flip-flop) power while waiting. In 2005 I designed an IO x86 that
    went clock = 0Hz while waiting on L1 miss. The whole pipeline stopped eliminating 2 from the text vector exponent.

    which throws away the Joules they presumably saved by staying clear of
    the OoO "baggage".

    The OoO Baggage that is not changing its assertions burns little power.
    Clocking an IO pipeline while stalled burns significant power. {Hint:
    it takes more power to clock the pipeline than to perform integer calculations.}
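
    The clock-gating argument can be put into a back-of-the-envelope model. This is only a sketch: the function name and the unitless per-cycle energies are hypothetical, leakage is ignored, and the numbers are not measurements of any real design.

```python
def pipeline_energy(active_cycles, stall_cycles,
                    clock_energy, compute_energy, clock_gated):
    """Toy dynamic-energy model of a pipeline (hypothetical units).

    clock_energy:   energy per cycle spent clocking the flip-flops
    compute_energy: energy per cycle spent on the actual calculation
    clock_gated:    if True, the clock is stopped while stalled
    """
    # While doing useful work, both clocking and computation cost energy.
    energy = active_cycles * (clock_energy + compute_energy)
    # While stalled, an ungated pipeline still pays for clocking.
    if not clock_gated:
        energy += stall_cycles * clock_energy
    return energy


# Assumed workload: 100 busy cycles, 300 cycles waiting on cache misses,
# with clocking costing twice as much as the integer work itself.
ungated = pipeline_energy(100, 300, clock_energy=2, compute_energy=1,
                          clock_gated=False)
gated = pipeline_energy(100, 300, clock_energy=2, compute_energy=1,
                        clock_gated=True)
print(ungated, gated)
```

    Under these assumed numbers the ungated pipeline spends two thirds of its dynamic energy while stalled, which is the effect the hint points at.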

    - OoO execution is naturally more asynchronous, making it possible to
    make decisions about what to do when in a more local way, thus wasting
    less energy on costly whole-chip synchronization.


    Stefan

  • From EricP@21:1/5 to Anton Ertl on Wed Jan 24 23:28:49 2024
    Anton Ertl wrote:

    Anyway, a major advantage of OoO is that its scheduler can make use of
    the dynamic branch predictor and its superior accuracy. (Joshua
    Landau pointed out a way that allows static schedulers to make use of
    this accuracy, but it's doubtful that this can be achieved without a
    code explosion).

    And OoO can queue multiple overlapping cache misses.
    This later allows multiple instructions to complete at once,
    which allows multiple instructions to retire at once,
    which allows it to fill in pipeline bubbles and catch up.

    InO simply can't do that.

  • From Anton Ertl@21:1/5 to EricP on Thu Jan 25 06:46:31 2024
    EricP <[email protected]> writes:
    And OoO can queue multiple overlapping cache misses.
    This later allows multiple instructions to complete at once,
    which allows multiple instructions to retire at once,
    which allows it to fill in pipeline bubbles and catch up.

    InO simply can't do that.

    If it is designed accordingly (and I am sure that all IA-64
    implementations are), it can: It starts a load, starts the next load
    etc. The in-order property only comes into play when it wants to use
    the result of one of these loads.

    E.g., looking at <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

    Specifically, the A510 can overlap two cache misses with the following
    between them:

    * 12 total instructions, up from 8 on the A53

    * 6 FP instructions, up from 4 on the A53. This includes 128-bit
    vector instructions on the A510 but not on the A53. A53 finds
    vector operations scary and will stall immediately on encountering
    one

    * 3 branches, unchanged from A53

    * 5 loads. The A53 would stall on any memory access past a cache miss.

    And that's for a LITTLE core.
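
    A toy model (not taken from the article) of why overlapping misses matters even with in-order issue. It is a sketch under assumptions: a single-issue core, a hypothetical fixed miss latency of 100 cycles, and made-up register names. With stall_on_use=False the whole pipeline waits for every miss; with stall_on_use=True only the first user of the loaded value waits.

```python
def run_in_order(program, miss_latency=100, stall_on_use=True):
    """Cycle count for a single-issue in-order toy core.

    program: list of (dest, sources, is_missing_load) tuples.
    Each instruction issues in one cycle; a missing load delivers its
    result miss_latency cycles after it issues.
    """
    ready_at = {}  # register -> cycle at which its value is available
    cycle = 0
    for dest, sources, is_miss in program:
        # In-order issue: wait until every source register is available.
        for src in sources:
            cycle = max(cycle, ready_at.get(src, 0))
        issue = cycle
        cycle += 1
        if is_miss:
            done = issue + miss_latency
            ready_at[dest] = done
            if not stall_on_use:
                cycle = done  # the whole pipeline waits for the miss
        else:
            ready_at[dest] = cycle
    # Include results still in flight when the program ends.
    return max([cycle] + list(ready_at.values()))


# Two independent missing loads followed by an add that uses both.
program = [("r1", [], True), ("r2", [], True), ("r3", ["r1", "r2"], False)]
print(run_in_order(program, stall_on_use=False))  # misses are serialized
print(run_in_order(program, stall_on_use=True))   # misses overlap
```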

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jan 25 07:49:15 2024
    Stefan Monnier <[email protected]> writes:
    But indeed, the more interesting aspect is that even in terms of
    efficiency, in-order seems to be a losing proposition.
    I'd be interested to hear opinions about why that is the case.

    I can think of two factors, tho there are probably more:
    - in-order CPUs spend more time waiting (which is the cause for their
    lower performance), and they still burn Joules while they wait,
    which throws away the Joules they presumably saved by staying clear of
    the OoO "baggage".
    - OoO execution is naturally more asynchronous, making it possible to
    make decisions about what to do when in a more local way, thus wasting
    less energy on costly whole-chip synchronization.

    These are the two explanations I have come up with, too.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to [email protected] on Thu Jan 25 07:05:13 2024
    [email protected] (MitchAlsup1) writes:
    Furthermore: IO machines are always latency bound, while GBOoO machines
    are schedule bound, capable of absorbing L1 cache misses, long cycle
    count instructions, ... that significantly harm IO machines.

    What does "schedule bound" mean?

    I have seen enough cases where a chain of dependent instructions
    (whether it is a chain of multiplications, a chain of L1-hitting
    loads, or even a chain of integer adds mixed with occasional
    L1-hitting loads) determines the performance of an OoO machine, in
    particular a wide OoO machine.

    If branch mispredictions are low enough, what limits the performance
    of an OoO machine is

    * either its resources (functional units, rename width, or somesuch),
    and I call that "resource bound",

    * or a dependence chain is so long (and the rest of the instructions
    consume so few resources) that eventually the reorder buffers are
    filled with the rest of the instructions or the schedulers are
    filled with instructions from the dependence chain. Then the
    machine has to wait for an instruction from the dependence chain to
    retire (for unclogging the ROB) or to produce a result (for freeing
    a scheduler slot). I call that latency-bound or dependence-bound.
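
    The two cases can be sketched as a first-order estimate. This is an assumption-laden simplification: it ignores branch mispredictions, cache misses, and the finite ROB and scheduler sizes discussed above, and the function and parameter names are made up for illustration.

```python
import math


def estimate_cycles(n_instructions, issue_width, chain_length, chain_latency):
    """First-order cycle estimate for an idealized OoO core.

    resource bound:   all instructions must pass through issue_width slots
    dependence bound: the longest dependence chain must execute serially
    """
    resource_cycles = math.ceil(n_instructions / issue_width)
    dependence_cycles = chain_length * chain_latency
    if resource_cycles >= dependence_cycles:
        return resource_cycles, "resource"
    return dependence_cycles, "dependence"


# 1000 instructions containing a chain of 100 dependent 3-cycle operations.
print(estimate_cycles(1000, issue_width=2, chain_length=100, chain_latency=3))
print(estimate_cycles(1000, issue_width=4, chain_length=100, chain_latency=3))
```

    In this sketch the same code is resource-bound on the 2-wide machine and dependence-bound on the 4-wide one.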

    The wider the OoO engine, the fewer programs will be resource-bound on
    that machine. Hardware designers use deep ROBs and deep schedulers on
    wide OoO engines to reduce the number or impact of dependence-bound
    cases, and indeed, with a bigger scheduling window, one may be able to
    see more parallelism than with a smaller window.

    And at some point there will be a branch misprediction, which acts as
    an in-order constraint for the dependence-bound case. In the
    resource-bound case, if the machine starts resolving the branch
    misprediction before retiring the branch, there are still instructions
    waiting for their functional unit, so the misprediction penalty will
    be lower than otherwise.

    As for in-order machines, for data-parallel stuff like, say, matrix multiplication, they can also be resource bound, and indeed, these are
    the kinds of codes where IA-64 performed particularly well.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Quadibloc@21:1/5 to All on Thu Jan 25 08:34:58 2024
    On Thu, 25 Jan 2024 01:21:00 +0000, MitchAlsup1 wrote:

    You can make a GBOoO machine clock faster than the IO machine simply
    from less work in each pipe stage--and this makes up for the depth
    of the pipeline.

    That's true, but to a naive reader that would seem utterly meaningless:
    more clocks, but each instruction takes exactly the same number of
    gate delays to do. So what?

    Of course, though, that's _not_ the whole truth.

    What a faster clock speed means for a pipelined computer, especially
    one with out-of-order execution, is that it can do more instructions
    in parallel, each one at a different stage of completion. So it's
    just like adding more cores.. except it's even better, because
    everything is more tightly coupled.

    This, of course, is obvious stuff that you know perfectly well, but
    some readers of your post could have missed it.

    John Savard

  • From Quadibloc@21:1/5 to All on Thu Jan 25 08:43:05 2024
    On Wed, 24 Jan 2024 20:03:15 +0000, MitchAlsup1 wrote:

    Vector machines fell out of fashion when the length of the
    vector register could no longer absorb the latency to memory.
    {{Although NEC persisted for longer}}

    Hmm.

    If the latency to memory is bigger, then having more vector
    registers lets you access stuff for a bigger percentage of
    the time that is faster than memory.

    Just like cache, or regular register files, therefore, one
    would expect the utility of vector registers to increase,
    not decrease, when memory becomes slower by comparison.

    So I'm missing something here.

    One possibility is that vector registers are usually used to
    facilitate operations between vectors in memory - vectors
    that are several times longer than the length of a vector
    register. So the speed of memory controls the speed of the
    overall calculation - in part. The vector registers multiply
    it by a factor of how much work gets done on values once
    they're read in - but perhaps if memory gets slow enough,
    there's not much benefit over less elaborate local storage.

    John Savard

  • From Quadibloc@21:1/5 to All on Thu Jan 25 08:49:14 2024
    On Thu, 25 Jan 2024 01:21:00 +0000, MitchAlsup1 wrote:

    Furthermore: IO machines are always latency bound, while GBOoO machines
    are schedule bound, capable of absorbing L1 cache misses, long cycle
    count instructions, ... that significantly harm IO machines.

    Ah. This is useful information. It's the L1 cache misses, not L2 or L3
    cache misses, that OoO is absorbing, and increasing performance thereby.

    That makes sense: OoO has a limited capacity to look ahead and move instructions around, so the short delays caused by a miss in the highest
    level of cache to the next highest are the ones it's best able to deal
    with.

    John Savard

  • From EricP@21:1/5 to Anton Ertl on Thu Jan 25 08:45:49 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    And OoO can queue multiple overlapping cache misses.
    This later allows multiple instructions to complete at once,
    which allows multiple instructions to retire at once,
    which allows it to fill in pipeline bubbles and catch up.

    InO simply can't do that.

    If it is designed accordingly (and I am sure that all IA-64
    implementations are), it can: It starts a load, starts the next load
    etc. The in-order property only comes into play when it wants to use
    the result of one of these loads.

    E.g., looking at <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

    Specifically, the A510 can overlap two cache misses with the following between them:

    * 12 total instructions, up from 8 on the A53

    * 6 FP instructions, up from 4 on the A53. This includes 128-bit
    vector instructions on the A510 but not on the A53. A53 finds
    vector operations scary and will stall immediately on encountering
    one

    * 3 branches, unchanged from A53

    * 5 loads. The A53 would stall on any memory access past a cache miss.

    And that's for a LITTLE core.

    - anton

    That 510 backend is not in-order, it's light weight OoO.
    That 3-way superscalar CDC6600 style backend allows a younger instruction
    to proceed to its next processing stage even though an older instruction
    is still executing. That's fine, and it might be possible even to forward pending function unit results to other function unit inputs,
    and as long as writeback happens in-order interrupts will be precise.
    But that is a form of bypassing.

    That uArch is distinct from a dual or triple InO pipeline because
    in those if one pipeline stage stalls, they all stall.

  • From EricP@21:1/5 to Stefan Monnier on Thu Jan 25 09:26:35 2024
    Stefan Monnier wrote:
    So here in-order provided lower performance at thrice the power
    consumption, two years later.

    What is clear is that currently, no one knows how to make in-order CPUs
    as fast as OoO for "general purpose" computing (i.e. not things you can
    run on things like GPGPUs or TPUs).

    But indeed, the more interesting aspect is that even in terms of
    efficiency, in-order seems to be a losing proposition.
    I'd be interested to hear opinions about why that is the case.

    I can think of two factors, tho there are probably more:
    - in-order CPUs spend more time waiting (which is the cause for their
    lower performance), and they still burn Joules while they wait,
    which throws away the Joules they presumably saved by staying clear of
    the OoO "baggage".
    - OoO execution is naturally more asynchronous, making it possible to
    make decisions about what to do when in a more local way, thus wasting
    less energy on costly whole-chip synchronization.


    Stefan

    In-order serializes when operations start, OoO synchronizes after they
    finish. The latter creates more potential opportunities for asynchronous
    concurrency, and this potential propagates through the whole system design.

  • From Anton Ertl@21:1/5 to EricP on Thu Jan 25 15:22:30 2024
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    And OoO can queue multiple overlapping cache misses.
    This later allows multiple instructions to complete at once,
    which allows multiple instructions to retire at once,
    which allows it to fill in pipeline bubbles and catch up.

    InO simply can't do that.

    If it is designed accordingly (and I am sure that all IA-64
    implementations are), it can: It starts a load, starts the next load
    etc. The in-order property only comes into play when it wants to use
    the result of one of these loads.

    E.g., looking at
    <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

    Specifically, the A510 can overlap two cache misses with the following
    between them:

    * 12 total instructions, up from 8 on the A53

    * 6 FP instructions, up from 4 on the A53. This includes 128-bit
    vector instructions on the A510 but not on the A53. A53 finds
    vector operations scary and will stall immediately on encountering
    one

    * 3 branches, unchanged from A53

    * 5 loads. The A53 would stall on any memory access past a cache miss.

    And that's for a LITTLE core.

    - anton

    That 510 backend is not in-order, it's light weight OoO.
    That 3-way superscalar CDC6600 style backend allows a younger instruction
    to proceed to its next processing stage even though an older instruction
    is still executing. That's fine, and it might be possible even to forward
    pending function unit results to other function unit inputs,
    and as long as writeback happens in-order interrupts will be precise.
    But that is a form of bypassing.

    Not OoO in my book. By your definition anything is OoO that allows
    some execution overlap of an architecturally earlier instruction with
    an architecturally later instruction. With your definition, all
    pipelined CPUs are OoO, including the MIPS R2000 with its delayed
    branch, delayed load, and especially the multiply/divide unit.

    Also, the 21064 which even allowed to issue two instructions at the
    same time, as well as having instructions with more than one cycle of load-to-use latency; e.g., there could be an FP multiplication
    followed by a load followed by an add, and the add would actually
    finish using its ALU before the FP multiplication or the load
    finishes.

    As described above, the A53 would be OoO by your definition, too.

    Last, but not least, all IA-64 implementations would be OoO by your
    definition.

    A definition that classifies everything as OoO and nothing as in-order
    is neither helpful nor is it the commonly understood meaning of
    "in-order" and OoO. I think the commonly understood meaning is that
    all instructions start their execution in-order (i.e., none goes to a functional unit earlier than an architecturally earlier instruction).
    Execution can overlap.
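
    This definition can be written down as a check on an execution trace. A minimal sketch: the function names are made up, and the latencies in the example (6 cycles for the FP multiplication, 3 for the load, 1 for the add) are assumed values for illustration, not figures for any particular CPU.

```python
def issues_in_order(issue_cycles):
    """True if no instruction goes to a functional unit earlier than an
    architecturally earlier one. issue_cycles[i] is the issue cycle of the
    i-th instruction in program order; equal cycles (co-issue) are allowed.
    """
    return all(a <= b for a, b in zip(issue_cycles, issue_cycles[1:]))


def completes_out_of_order(issue_cycles, latencies):
    """True if some instruction finishes before an architecturally earlier one."""
    done = [issue + lat for issue, lat in zip(issue_cycles, latencies)]
    return any(a > b for a, b in zip(done, done[1:]))


# FP multiplication, load, add, issued in consecutive cycles.
issue = [0, 1, 2]
latency = [6, 3, 1]
print(issues_in_order(issue))                  # in-order by this definition
print(completes_out_of_order(issue, latency))  # yet the add finishes first
```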

    Concerning precise interrupts, that is certainly a problem for CPUs
    without reorder buffers; the Alpha architects even put imprecise FP
    interrupts and the trapb instruction (IIRC) in the architecture
    because of that.

    That uArch is distinct from a dual or triple InO pipeline because
    in those if one pipeline stage stalls, they all stall.

    That's a somewhat different definition. AFAIK the R2000 stalls the
    whole (integer) pipeline on a cache miss despite allowing overlap
    between instruction executions.

    AFAIK microarchitects got rid of this limitation as soon as there were
    enough transistors available. The problem with this limitation is
    that it makes it pointless to schedule a load further ahead to reduce
    the impact of a cache-miss latency, or to use a prefetch instruction,
    because either one would stop the whole machine during the cache miss.
    A prefetch could actually be counterproductive, but it would
    definitely never help.

    So this definition may describe some historical designs, but it's not
    the difference between in-order and OoO as commonly understood.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jan 25 16:45:19 2024
    Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    Furthermore: IO machines are always latency bound, while GBOoO machines
    are schedule bound, capable of absorbing L1 cache misses, long cycle
    count instructions, ... that significantly harm IO machines.

    What does "schedule bound" mean?

    I have seen enough cases where a chain of dependent instructions
    (whether it is a chain of multiplications, a chain of L1-hitting
    loads, or even a chain of integer adds mixed with occasional
    L1-hitting loads) determines the performance of an OoO machine, in
    particular a wide OoO machine.

    Those ARE schedule bound--the speed of the scheduler(s) launching
    data-dependent instructions is what limits performance.

    If branch mispredictions are low enough, what limits the performance
    of an OoO machine is

    * either its resources (functional units, rename width, or somesuch),
    and I call that "resource bound",

    Those are simply the targets of the scheduler(s). Getting the instruction launched is somewhat harder.

    The FUs are easy to pipeline; data-dependent operations cannot use the
    available BW of the FUs when schedule bound.

    * or a dependence chain is so long (and the rest of the instructions
    consume so few resources) that eventually the reorder buffers are
    filled with the rest of the instructions or the schedulers are
    filled with instructions from the dependence chain. Then the
    machine has to wait for an instruction from the dependence chain to
    retire (for unclogging the ROB) or to produce a result (for freeing
    a scheduler slot). I call that latency-bound or dependence-bound.

    This is the other end of the schedule pipeline.

    The wider the OoO engine, the fewer programs will be resource-bound on
    that machine.

    And the more will be schedule bound.

    Hardware designers use deep ROBs and deep schedulers on
    wide OoO engines to reduce the number or impact of dependence-bound
    cases, and indeed, with a bigger scheduling window, one may be able to
    see more parallelism than with a smaller window.

    Yes, Indeed.

    And at some point there will be a branch misprediction, which acts as
    an in-order constraint for the dependence-bound case. In the
    resource-bound case, if the machine starts resolving the branch
    misprediction before retiring the branch, there are still instructions waiting for their functional unit, so the misprediction penalty will
    be lower than otherwise.

    As for in-order machines, for data-parallel stuff like, say, matrix multiplication, they can also be resource bound, and indeed, these are
    the kinds of codes where IA-64 performed particularly well.

    - anton

  • From MitchAlsup1@21:1/5 to Quadibloc on Thu Jan 25 16:47:49 2024
    Quadibloc wrote:

    On Wed, 24 Jan 2024 20:03:15 +0000, MitchAlsup1 wrote:

    Vector machines fell out of fashion when the length of the
    vector register could no longer absorb the latency to memory.
    {{Although NEC persisted for longer}}

    Hmm.

    If the latency to memory is bigger, then having more vector
    registers lets you access stuff for a bigger percentage of
    the time that is faster than memory.

    And lose code compatibility with your predecessors.

    Just like cache, or regular register files, therefore, one
    would expect the utility of vector registers to increase,
    not decrease, when memory becomes slower by comparison.

    So I'm missing something here.

    Amdahl's law still applies.

    One possibility is that vector registers are usually used to
    facilitate operations between vectors in memory - vectors
    that are several times longer than the length of a vector
    register. So the speed of memory controls the speed of the
    overall calculation - in part. The vector registers multiply
    it by a factor of how much work gets done on values once
    they're read in - but perhaps if memory gets slow enough,
    there's not much benefit over less elaborate local storage.

    Speed of memory ~== bisection bandwidth.

    John Savard

  • From MitchAlsup1@21:1/5 to EricP on Thu Jan 25 16:49:32 2024
    EricP wrote:

    Stefan Monnier wrote:
    So here in-order provided lower performance at thrice the power
    consumption, two years later.

    What is clear is that currently, no one knows how to make in-order CPUs
    as fast as OoO for "general purpose" computing (i.e. not things you can
    run on things like GPGPUs or TPUs).

    But indeed, the more interesting aspect is that even in terms of
    efficiency, in-order seems to be a losing proposition.
    I'd be interested to hear opinions about why that is the case.

    I can think of two factors, tho there are probably more:
    - in-order CPUs spend more time waiting (which is the cause for their
    lower performance), and they still burn Joules while they wait,
    which throws away the Joules they presumably saved by staying clear of
    the OoO "baggage".
    - OoO execution is naturally more asynchronous, making it possible to
    make decisions about what to do when in a more local way, thus wasting
    less energy on costly whole-chip synchronization.


    Stefan

    In-order serializes when operations start,

    And remain serialized while traversing the pipeline.

    OoO synchronizes after they finish.
    The latter creates more potential opportunities for asynchronous
    concurrency, and this potential propagates through the whole system design.

  • From EricP@21:1/5 to Anton Ertl on Fri Jan 26 10:06:35 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    And OoO can queue multiple overlapping cache misses.
    This later allows multiple instructions to complete at once,
    which allows multiple instructions to retire at once,
    which allows it to fill in pipeline bubbles and catch up.

    InO simply can't do that.
    If it is designed accordingly (and I am sure that all IA-64
    implementations are), it can: It starts a load, starts the next load
    etc. The in-order property only comes into play when it wants to use
    the result of one of these loads.

    E.g., looking at
    <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

    Specifically, the A510 can overlap two cache misses with the following
    between them:

    * 12 total instructions, up from 8 on the A53

    * 6 FP instructions, up from 4 on the A53. This includes 128-bit
    vector instructions on the A510 but not on the A53. A53 finds
    vector operations scary and will stall immediately on encountering
    one

    * 3 branches, unchanged from A53

    * 5 loads. The A53 would stall on any memory access past a cache miss.

    And that's for a LITTLE core.

    - anton
    That 510 backend is not in-order, it's light weight OoO.
    That 3-way superscalar CDC6600 style backend allows a younger instruction
    to proceed to its next processing stage even though an older instruction
    is still executing. That's fine, and it might be possible even to forward
    pending function unit results to other function unit inputs,
    and as long as writeback happens in-order interrupts will be precise.
    But that is a form of bypassing.

    Not OoO in my book. By your definition anything is OoO that allows
    some execution overlap of an architecturally earlier instruction with
    an architecturally later instruction. With your definition, all
    pipelined CPUs are OoO, including the MIPS R2000 with its delayed
    branch, delayed load, and especially the multiply/divide unit.

    No, not overlap, bypassing. Multiple parallel pipelines is still in-order. Instructions enter each pipeline in program order, each maintains fifo
    order internally, and results exit from each in program order.

    I got the impression from the description of the 510 that it allowed
    a limited form of bypass where it says under Execution Engine
    "Instructions can co-issue if they're independent, have their inputs
    ready..". It depends on exactly what that box labeled "Issue" does.

    Also the figure in section 2.1 Pipeline Overview gave me the impression
    that bypassing might be allowed.

    Anyway, as long as the register file is updated in-order then the only one
    that matters is the load store queue. While the LSQ allows 2 outstanding
    cache misses, as long as it finishes each load/store in order then none
    of this is visible.

    Also, the 21064 which even allowed to issue two instructions at the
    same time, as well as having instructions with more than one cycle of load-to-use latency; e.g., there could be an FP multiplication
    followed by a load followed by an add, and the add would actually
    finish using its ALU before the FP multiplication or the load
    finishes.

    As described above, the A53 would be OoO by your definition, too.

    21164 was two parallel integer pipelines. I don't know about A53.

    Last, but not least, all IA-64 implementations would be OoO by your definition.

    A definition that classifies everything as OoO and nothing as in-order
    is neither helpful nor is it the commonly understood meaning of
    "in-order" and OoO. I think the commonly understood meaning is that
    all instructions start their execution in-order (i.e., none goes to a functional unit earlier than an architecturally earlier instruction). Execution can overlap.

    And it sounded like the 510 might do exactly that: send instructions
    to a function unit OoO.

    Concerning precise interrupts, that is certainly a problem for CPUs
    without reorder buffers; the Alpha architects even put imprecise FP interrupts and the trapb instruction (IIRC) in the architecture
    because of that.

    And in that regard the Alpha made itself the poster boy for what not to do.

    That uArch is distinct from a dual or triple InO pipeline because
    in those if one pipeline stage stalls, they all stall.

    That's a somewhat different definition. AFAIK the R2000 stalls the
    whole (integer) pipeline on a cache miss despite allowing overlap
    between instruction executions.

    I thought the R2000 only has one pipeline.
    Anyway, I was thinking the 21064 had two integer pipelines but
    it only has one integer/memory and one float.
    The 21164 has two int/mem, one float add, one float multiply.

    AFAIK microarchitects got rid of this limitation as soon as there were
    enough transistors available. The problem with this limitation is
    that it makes it pointless to schedule a load further ahead to reduce
    the impact of a cache-miss latency, or to use a prefetch instruction,
    because either one would stop the whole machine during the cache miss.
    A prefetch could actually be counterproductive, but it would
    definitely never help.

    So this definition may describe some historical designs, but it's not
    the difference between in-order and OoO as commonly understood.

    - anton

    Yes, my brain fart. Forget I said that.
    I wasn't thinking it was the difference between IO and OoO,
    I was thinking there is no point in keeping two parallel integer pipelines running when one stalls for a long duration operation because the running
    one will have to stall at the pipe end to synchronize writeback anyway.
    But that ignores the possibility that the pipelines are asymmetric.

  • From MitchAlsup1@21:1/5 to EricP on Fri Jan 26 21:46:42 2024
    EricP wrote:

    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    And OoO can queue multiple overlapping cache misses.
    This later allows multiple instructions to complete at once,
    which allows multiple instructions to retire at once,
    which allows it to fill in pipeline bubbles and catch up.

    InO simply can't do that.
    If it is designed accordingly (and I am sure that all IA-64
    implementations are), it can: It starts a load, starts the next load
    etc. The in-order property only comes into play when it wants to use
    the result of one of these loads.

    E.g., looking at
    <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

    Specifically, the A510 can overlap two cache misses with the following
    between them:

    * 12 total instructions, up from 8 on the A53

    * 6 FP instructions, up from 4 on the A53. This includes 128-bit
    vector instructions on the A510 but not on the A53. A53 finds
    vector operations scary and will stall immediately on encountering
    one

    * 3 branches, unchanged from A53

    * 5 loads. The A53 would stall on any memory access past a cache miss.

    And that's for a LITTLE core.

    - anton
    That 510 backend is not in-order, it's light weight OoO.
    That 3-way superscalar CDC6600 style backend allows a younger instruction
    to proceed to its next processing stage even though an older instruction
    is still executing. That's fine, and it might be possible even to forward
    pending function unit results to other function unit inputs,
    and as long as writeback happens in-order interrupts will be precise.
    But that is a form of bypassing.

    Not OoO in my book. By your definition anything is OoO that allows
    some execution overlap of an architecturally earlier instruction with
    an architecturally later instruction. With your definition, all
    pipelined CPUs are OoO, including the MIPS R2000 with its delayed
    branch, delayed load, and especially the multiply/divide unit.

    No, not overlap, bypassing. Multiple parallel pipelines is still in-order.

    Note:: Mc88100 had multiple parallel pipelines and was not In-Order !!
    An older LD stall would allow younger instructions to complete!

  • From Anton Ertl@21:1/5 to EricP on Sat Jan 27 15:15:43 2024
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    [...]

    Anyway, as long as the register file is updated in-order

    This discussion resulted in the unearthing of memories of the things I
    read about the microarchitectures of the advanced in-order machines of
    last century (and I think that in-order machines in this century tend
    to work in the same way). My memory may be unreliable here, but
    anyway:

    The way things worked was that in those machines, instructions were
    issued in-order but the results could be written back out-of-order.
    However, each register has a bit that tells whether the register is
    up-to-date, or will be updated in the future by a currently in-flight
    instruction. This was often called scoreboarding (although Mitch
    Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
    the CDC 6600 scoreboard was a more sophisticated mechanism; given the
    I in MIPS, one could also call it an interlock). So each instruction
    checks whether all its source and destination registers are
    up-to-date, and if not, it waits until they are (forwarding changes
    the notion of "up-to-date" a bit, but I'll skip this here).
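
    A minimal sketch of the ready-bit scheme described above, as a toy
    Python model. The Instr fields, the latencies, and the limit of one
    issue per cycle are assumptions made for illustration; this is not a
    description of any particular core.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: int       # destination register number
    srcs: tuple     # source register numbers
    latency: int    # cycles from issue to writeback (assumed fixed)


def simulate(program, num_regs=8):
    """In-order issue, out-of-order writeback, at most one issue per cycle."""
    ready = [True] * num_regs   # per-register "up-to-date" bit
    in_flight = []              # (writeback_cycle, instr) pairs
    issue_cycle, wb_cycle = {}, {}
    cycle, pc = 0, 0
    while pc < len(program) or in_flight:
        # Writeback: a finished result marks its register up-to-date again.
        for _, ins in [e for e in in_flight if e[0] == cycle]:
            ready[ins.dest] = True
            wb_cycle[ins.name] = cycle
        in_flight = [e for e in in_flight if e[0] != cycle]
        # Issue strictly in program order; wait while any source or the
        # destination register is still pending.
        if pc < len(program):
            ins = program[pc]
            if all(ready[r] for r in ins.srcs) and ready[ins.dest]:
                ready[ins.dest] = False
                in_flight.append((cycle + ins.latency, ins))
                issue_cycle[ins.name] = cycle
                pc += 1
        cycle += 1
    return issue_cycle, wb_cycle


# Hypothetical example: a slow FP multiply, a load, an independent add,
# and finally a consumer of the multiply result.
PROGRAM = [
    Instr("fmul", dest=1, srcs=(2, 3), latency=5),
    Instr("load", dest=4, srcs=(5,), latency=3),
    Instr("add", dest=6, srcs=(5, 7), latency=1),
    Instr("use", dest=0, srcs=(1,), latency=1),
]
```

    In this model the add writes back before the load and the multiply
    even though it was issued after both, and only the consumer of the
    multiply result waits.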

    With out-of-order completion, how is architectural execution and, in
    particular, precise exceptions, ensured? For ordinary execution, it
    does not matter for an instruction whether an unrelated register is
    not up-to-date, and if the register is mentioned in the instruction,
    the instruction and all that follow wait until the register is
    up-to-date.

    I don't remember how loads and stores were handled, but again, as long
    as they were to non-overlapping addresses (and for weak memory
    ordering in multiprocessors) one can do quite a bit in parallel
    without destroying architectural order.

    I also don't remember how flags registers were handled on
    architectures that have them, but it needs something cleverer than the
    "up-to-date" scheme described above, or there would be lots of stalls
    due to write-after-write dependences. I am sure the microarchitects
    found something appropriate.

    For precise exceptions, I remember discussions about the importance of
    knowing early in the instruction that an exception happens; i.e., so
    early that the writebacks of architecturally later instructions can be
    cancelled. For loads, the exception is known early, when the TLB
    lookup has happened; I expect that the whole machine is stalled on a
    TLB miss (or, with a software-managed TLB, the exception happens right
    there). Alpha has imprecise FP exceptions because the architecture
    wanted to allow implementing denormals through trapping, but it takes
    several cycles to know whether an FP result is normal or not.

    [Cortex-A510]
    then the only one
    that matters is the load store queue. While the LSQ allows 2 outstanding
    cache misses, as long as it finishes each load/store in order then none
    of this is visible.

    I expect that the A510 uses the mechanism described above, which means
    that loads can finish out of order, but none of this is visible
    nonetheless.

    Also, the 21064 which even allowed to issue two instructions at the
    same time, as well as having instructions with more than one cycle of
    load-to-use latency; e.g., there could be an FP multiplication
    followed by a load followed by an add, and the add would actually
    finish using its ALU before the FP multiplication or the load
    finishes.

    As described above, the A53 would be OoO by your definition, too.

    21164 was two parallel integer pipelines. I don't know about A53.

    The Cortex-A53 has two ALU ports <https://chipsandcheese.com/2023/05/28/arms-cortex-a53-tiny-but-important/>.

    It's interesting to compare the A53 (2012) to the 21164 (1995). Both
    have roughly similar execution resources (2 integer (one of which can
    be a branch), 2 FP, 1LSU (not sure about that for the 21164)), but the
    21164 has a four-wide decoder, while the A53 only has a two-wide
    decoder. I guess the cost of decoding all of A64, A32, and especially
    T32 caused them to limit the decoding capabilities.

    For the A510 ARM expanded that to a three-wide decode, but the A510 is
    an A64-only core. ARM also provided a third ALU and an additional
    load unit to the A510. Given that an ALU was not that expensive even
    in the 21164 timeframe, my guess is that the 21164 architects provided
    only two because of register port or forwarding path limitations,
    something that the ARM designers apparently have a solution for (more
    metal layers?).

    That's a somewhat different definition. AFAIK the R2000 stalls the
    whole (integer) pipeline on a cache miss despite allowing overlap
    between instruction executions.

    I thought the R2000 only has one pipeline.

    My memories from last century tell me that there was some concept
    like "squashing pipeline bubbles" being discussed at the time, i.e.,
    that instructions in earlier stages could advance until the first of
    them reaches the stalled instruction. Conversely, instructions in
    later stages could continue, filling the stages they left with bubbles
    (I don't remember this being discussed). But of course none of that
    is used in the R2000. The R2000 has a multiply/divide unit that takes
    many cycles, and actually with interlocks. I don't know if that
    continues working while a cache miss is served; the R2010 FPU
    certainly continues working while a cache miss is served.

    And then we got the 88100 with three pipelines, and then the 21064
    with dual-issue and three pipelines.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From EricP@21:1/5 to All on Sun Jan 28 13:48:24 2024
    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    And OoO can queue multiple overlapping cache misses.
    This later allows multiple instructions to complete at once,
    which allows multiple instructions to retire at once,
    which allows it to fill in pipeline bubbles and catch up.

    InO simply can't do that.
    If it is designed accordingly (and I am sure that all IA-64
    implementations are), it can: It starts a load, starts the next load
    etc. The in-order property only comes into play when it wants to use
    the result of one of these loads.

    E.g., looking at
    <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>,
    the A510 has a 5-entry load buffer. The text says:

    Specifically, the A510 can overlap two cache misses with the following
    between them:

    * 12 total instructions, up from 8 on the A53
    * 6 FP instructions, up from 4 on the A53. This includes 128-bit
    vector instructions on the A510 but not on the A53. A53 finds
    vector operations scary and will stall immediately on encountering
    one

    * 3 branches, unchanged from A53

    * 5 loads. The A53 would stall on any memory access past a cache
    miss.

    And that's for a LITTLE core.

    - anton
    That 510 backend is not in-order, it's light weight OoO.
    That 3-way superscalar CDC6600 style backend allows a younger instruction
    to proceed to its next processing stage even though an older instruction
    is still executing. That's fine, and it might be possible even to forward
    pending function unit results to other function unit inputs,
    and as long as writeback happens in-order interrupts will be precise.
    But that is a form of bypassing.

    Not OoO in my book. By your definition anything is OoO that allows
    some execution overlap of an architecturally earlier instruction with
    an architecturally later instruction. With your definition, all
    pipelined CPUs are OoO, including the MIPS R2000 with its delayed
    branch, delayed load, and especially the multiply/divide unit.

    No, not overlap, bypassing. Multiple parallel pipelines is still
    in-order.

    Note:: Mc88100 had multiple parallel pipelines and was not In-Order !!
    An older LD stall would allow younger instructions to complete!

    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    Also each pipeline can be a source for forwarding so it can wind up
    with many forwarding buses which have to be checked for each source
    operand on each issue lane.

  • From EricP@21:1/5 to Anton Ertl on Sun Jan 28 20:57:27 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    [...]

    Anyway, as long as the register file is updated in-order

    This discussion resulted in the unearthing of memories of the things I
    read about the microarchitectures of the advanced in-order machines of
    last century (and I think that in-order machines in this century tend
    to work in the same way). My memory may be unreliable here, but
    anyway:

    The way things worked was that in those machines, instructions were
    issued in-order but the results could be written back out-of-order.
    However, each register has a bit that tells whether the register is
    up-to-date, or will be updated in the future by a currently in-flight
    instruction. This was often called scoreboarding (although Mitch
    Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
    the CDC 6600 scoreboard was a more sophisticated mechanism; given the
    I in MIPS, one could also call it an interlock). So each instruction
    checks whether all its source and destination registers are
    up-to-date, and if not, it waits until they are (forwarding changes
    the notion of "up-to-date" a bit, but I'll skip this here).

    With out-of-order completion, how is architectural execution and, in
    particular, precise exceptions, ensured? For ordinary execution, it
    does not matter for an instruction whether an unrelated register is
    not up-to-date, and if the register is mentioned in the instruction,
    the instruction and all that follow wait until the register is
    up-to-date.

    I don't remember how loads and stores were handled, but again, as long
    as they were to non-overlapping addresses (and for weak memory
    ordering in multiprocessors) one can do quite a bit in parallel
    without destroying architectural order.

    I also don't remember how flags registers were handled on
    architectures that have them, but it needs something cleverer than the
    "up-to-date" scheme described above, or there would be lots of stalls
    due to write-after-write dependences. I am sure the microarchitects
    found something appropriate.

    For precise exceptions, I remember discussions about the importance of
    knowing early in the instruction that an exception happens; i.e., so
    early that the writebacks of architecturally later instructions can be
    cancelled. For loads, the exception is known early, when the TLB
    lookup has happened; I expect that the whole machine is stalled on a
    TLB miss (or, with a software-managed TLB, the exception happens right
    there). Alpha has imprecise FP exceptions because the architecture
    wanted to allow implementing denormals through trapping, but it takes
    several cycles to know whether an FP result is normal or not.

    That scoreboard allows OoO execution and completion,
    and avoids RAW, WAW, and WAR hazards,
    but it doesn't write back results in program order.
    Exceptions can be made precise by (a) always writing results in-order,
    and (b) only recognizing exceptions at Writeback.

    To write the results back in order one could attach a sequence counter
    to each uOp - a counter with enough bits so that each possible in-flight
    uOp in any stage has a unique number plus 1 bit for a wrap flag.
    The uOps can then flow down separate parallel pipelines.

    Writeback also has a sequence counter so it knows which uOp is
    next to write its register. I would want two register write ports
    so it at least has a chance of catching up after a bubble.
    WB checks the exits of all the pipelines for the next two sequence numbers,
    removes those uOps from their pipelines and writes the results.

    Each pipeline takes care of its own stalls internally and compacts out
    NULL uOps from stages if possible. So the only time a pipeline has to
    completely stall is when all stages are full and the end result is not
    the oldest uOp and so cannot be written back.
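
    The in-order writeback described above can be sketched roughly as
    follows. This is a toy model under stated assumptions: each pipeline
    exit is a queue of sequence numbers, every result is already waiting
    at its exit (timing is ignored), and the names writeback_in_order,
    ports and seq_mod are made up for the example.

```python
from collections import deque


def writeback_in_order(pipes, next_seq=0, ports=2, seq_mod=16):
    """Retire uOps in sequence order from several pipeline exits.

    pipes: one list per pipeline, holding sequence numbers in the order
    that pipeline delivers them (oldest first within a pipeline).
    Returns a list with the sequence numbers written back in each cycle.
    """
    pipes = [deque(p) for p in pipes]
    cycles = []
    while any(pipes):
        retired = []
        for _ in range(ports):              # one result per write port
            for p in pipes:
                if p and p[0] == next_seq:  # only the pipe exits are checked
                    retired.append(p.popleft())
                    next_seq = (next_seq + 1) % seq_mod  # counter wraps
                    break
        if not retired:
            raise ValueError("next sequence number is not at any pipe exit")
        cycles.append(retired)
    return cycles
```

    With two ports, writeback_in_order([[14, 1], [15, 0, 2]], next_seq=14)
    retires two uOps per cycle across the counter wrap until only one is
    left.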

    The uOp sequence numbers also allow branch mispredict to purge just those
    in-flight uOps that are younger than the branch. The Branch Control Unit
    detects a mispredicted conditional branch and transmits its own sequence
    number on the Cancel Bus which goes to all stages of all pipelines.
    Each stage compares its own sequence number to the cancel number and
    if higher (younger) then it nullifies that entry.
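
    A sketch of the younger-than comparison on the Cancel Bus, assuming
    at most MAX_IN_FLIGHT uOps are in flight and the counter carries one
    extra wrap bit, so it counts modulo 2*MAX_IN_FLIGHT. The constants
    and function names are illustrative only.

```python
MAX_IN_FLIGHT = 8                # assumed limit on in-flight uOps
SEQ_MOD = 2 * MAX_IN_FLIGHT      # sequence counter with one extra wrap bit


def is_younger(seq, ref):
    """True if uOp `seq` was issued after uOp `ref`, allowing for wrap.

    Valid as long as all in-flight sequence numbers lie within a window
    of MAX_IN_FLIGHT consecutive values.
    """
    diff = (seq - ref) % SEQ_MOD
    return 0 < diff < MAX_IN_FLIGHT


def cancel(stages, branch_seq):
    """Nullify every stage entry younger than the mispredicted branch."""
    return [
        None if s is not None and is_younger(s, branch_seq) else s
        for s in stages
    ]
```

    For example, with the branch at sequence number 14, entries 15, 0 and
    1 are nullified (they wrapped but are younger), while 13 and the
    branch itself survive.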


    [Cortex-A510]
    then the only one
    that matters is the load store queue. While the LSQ allows 2 outstanding
    cache misses, as long as it finishes each load/store in order then none
    of this is visible.

    I expect that the A510 uses the mechanism described above, which means
    that loads can finish out of order, but none of this is visible
    nonetheless.

    Also, the 21064 which even allowed to issue two instructions at the
    same time, as well as having instructions with more than one cycle of
    load-to-use latency; e.g., there could be an FP multiplication
    followed by a load followed by an add, and the add would actually
    finish using its ALU before the FP multiplication or the load
    finishes.

    As described above, the A53 would be OoO by your definition, too.
    21164 was two parallel integer pipelines. I don't know about A53.

    The Cortex-A53 has two ALU ports <https://chipsandcheese.com/2023/05/28/arms-cortex-a53-tiny-but-important/>.

    It's interesting to compare the A53 (2012) to the 21164 (1995). Both
    have roughly similar execution resources (2 integer (one of which can
    be a branch), 2 FP, 1LSU (not sure about that for the 21164)), but the
    21164 has a four-wide decoder, while the A53 only has a two-wide
    decoder. I guess the cost of decoding all of A64, A32, and especially
    T32 caused them to limit the decoding capabilities.

    For the A510 ARM expanded that to a three-wide decode, but the A510 is
    an A64-only core. ARM also provided a third ALU and an additional
    load unit to the A510. Given that an ALU was not that expensive even
    in the 21164 timeframe, my guess is that the 21164 architects provided
    only two because of register port or forwarding path limitations,
    something that the ARM designers apparently have a solution for (more
    metal layers?).

    That's a somewhat different definition. AFAIK the R2000 stalls the
    whole (integer) pipeline on a cache miss despite allowing overlap
    between instruction executions.
    I thought the R2000 only has one pipeline.

    My memories from last century tells me that there was some concept
    like "squashing pipeline bubbles" being discussed at the time, i.e.,
    that instructions in earlier stages could advance until the first of
    them reaches the stalled instruction. Conversely, instructions in
    later stages could continue, filling the stages they left with bubbles
    (I don't remember this being discussed). But of course none of that
    is used in the R2000. The R2000 has a multiply/divide unit that takes
    many cycles, and actually with interlocks. I don't know if that
    continues working while a cache miss is served; the R2010 FPU
    certainly continues working while a cache miss is served.

    And then we got the 88100 with three pipelines, and then the 21064
    with dual-issue and three pipelines.

    - anton

    A simple way to squash bubbles (NULL uOps) out of pipeline stages is:

    generate stage N stall signal and inhibit clocking its input buffer if
    - the stage N buffer valid flag is set
    - and stage N would generate a valid output (eg resources are available)
    - and stage N+1 is generating a stall

    The stage N stall signal propagates back to stage N-1.
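
    The stall rule above can be sketched as a one-cycle step function.
    This is a simplified model: None stands for a NULL uOp (bubble), and
    the "would generate a valid output" condition is assumed to hold for
    every valid entry. The function and parameter names are hypothetical.

```python
def step(stages, new_input, exit_stalled):
    """Advance a pipeline by one clock.

    stages[0] is the first stage and stages[-1] the last.
    Returns (next_stages, input_accepted, output).
    """
    n = len(stages)
    stall = [False] * (n + 1)
    stall[n] = exit_stalled
    # The stall ripples backwards serially: stage i stalls only if it
    # holds a valid entry and stage i+1 is stalling.
    for i in range(n - 1, -1, -1):
        stall[i] = stages[i] is not None and stall[i + 1]
    output = None if exit_stalled else stages[-1]
    nxt = list(stages)
    for i in range(n):
        if not stall[i]:
            # An unstalled stage clocks in its predecessor's old contents,
            # which overwrites any bubble it was holding.
            nxt[i] = stages[i - 1] if i > 0 else new_input
    return nxt, not stall[0], output
```

    Starting from ["A", None, "B", "C"] with the exit stalled, one step
    gives ["D", "A", "B", "C"]: the bubble is gone and the new uOp "D"
    was accepted. A second step with the exit still stalled changes
    nothing and rejects the input.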

    Unfortunately in this simple design the stall signal serially propagates
    backwards through all the stages. Also the pipeline can stretch a long
    way across the chip which means long wires.
    This total stall signal delay is added to the worst case stage calculation
    delay and cuts into the max frequency.

    Other designs called elastic buffers are possible where the stall is only
    between adjacent stages and they can squash bubbles but those require more
    than double the cost of a stage buffer. One can also alternate the simple
    stage design with elastic buffer to limit the stall propagation delay.

  • From Anton Ertl@21:1/5 to EricP on Mon Jan 29 18:08:42 2024
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    [...]

    Anyway, as long as the register file is updated in-order

    This discussion resulted in the unearthing of memories of the things I
    read about the microarchitectures of the advanced in-order machines of
    last century (and I think that in-order machines in this century tend
    to work in the same way). My memory may be unreliable here, but
    anyway:

    The way things worked was that in those machines, instructions were
    issued in-order but the results could be written back out-of-order.
    However, each register has a bit that tells whether the register is
    up-to-date, or will be updated in the future by a currently in-flight
    instruction. This was often called scoreboarding (although Mitch
    Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
    the CDC 6600 scoreboard was a more sophisticated mechanism; given the
    I in MIPS, one could also call it an interlock). So each instruction
    checks whether all its source and destination registers are
    up-to-date, and if not, it waits until they are (forwarding changes
    the notion of "up-to-date" a bit, but I'll skip this here).

    With out-of-order completion, how is architectural execution and, in
    particular, precise exceptions, ensured? For ordinary execution, it
    does not matter for an instruction whether an unrelated register is
    not up-to-date, and if the register is mentioned in the instruction,
    the instruction and all that follow wait until the register is
    up-to-date.

    I don't remember how loads and stores were handled, but again, as long
    as they were to non-overlapping addresses (and for weak memory
    ordering in multiprocessors) one can do quite a bit in parallel
    without destroying architectural order.

    I also don't remember how flags registers were handled on
    architectures that have them, but it needs something cleverer than the
    "up-to-date" scheme described above, or there would be lots of stalls
    due to write-after-write dependences. I am sure the microarchitects
    found something appropriate.

    For precise exceptions, I remember discussions about the importance of
    knowing early in the instruction that an exception happens; i.e., so
    early that the writebacks of architecturally later instructions can be
    cancelled. For loads, the exception is known early, when the TLB
    lookup has happened; I expect that the whole machine is stalled on a
    TLB miss (or, with a software-managed TLB, the exception happens right
    there). Alpha has imprecise FP exceptions because the architecture
    wanted to allow implementing denormals through trapping, but it takes
    several cycles to know whether an FP result is normal or not.

    That scoreboard allows OoO execution and completion,
    and avoids RAW, WAW, and WAR hazards,
    but it doesn't write back results in program order.
    Exceptions can be made precise by (a) always writing results in-order,
    and (b) only recognizing exceptions at Writeback.

    AFAIK the ROB for in-order completion only came with the modern wave
    of microarchitectures with OoO execution (the 360/91 has no reorder
    buffer).

    The approach I described above is different: Results are written
    out-of-order; precise exceptions are recognized in the first cycle of
    the instruction, before any architecturally later instruction writes
    back; the writebacks of these architecturally later instructions are
    then suppressed. The question is how later writebacks are suppressed
    without suppressing the writebacks of architecturally earlier
    instructions. I can think of some mechanism, but I don't know if that
    was used.

    The same mechanism is needed for dealing with branches without delay
    slot: Either they are predicted to go in some direction (as in, e.g.,
    the 21064), or fallthrough is preferred (as in the 486), which is
    equivalent to predicting not-taken. In either case, when the
    prediction is wrong, the instructions along the predicted path must
    not write back, and in this case the recovery must be fast (whereas
    exceptions are so rare that a few cycles more would be acceptable).

    To write the results back in order one could attach a sequence counter
    to each uOp - a counter with enough bits so that each possible in-flight
    uOp in any stage has a unique number plus 1 bit for a wrap flag.

    A sequence counter is also the first solution for suppressing the
    writebacks of architecturally later instructions in the OoO writeback
    setup.

    Writeback also has a sequence counter so it knows which uOp is
    next to write its register. I would want two register write ports
    so it at least has a chance of catching up after a bubble.

    The 88100 has only one writeback port and writes results back
    out-of-order. My first refereed paper
    <https://www.complang.tuwien.ac.at/papers/ertl%26krall91.ps.gz> was
    about scheduling for the 88100, and utilizing writeback slots better
    than the usual schedulers was the major benefit of our scheduler.

    Concerning cache misses, the in-order scheme described above also can
    live with setting aside loads for cache misses, and processing
    architecturally later loads in the meantime, and I am sure that a
    number of in-order microarchitectures have done that; recently the
    A510. Of course, stores to overlapping addresses have to be processed
    in-order wrt the loads and each other.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to EricP on Mon Jan 29 21:04:54 2024
    EricP wrote:

    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    [...]

    Anyway, as long as the register file is updated in-order

    <snip>

    For precise exceptions, I remember discussions about the importance of
    knowing early in the instruction that an exception happens; i.e., so
    early that the writebacks of architecturally later instructions can be
    cancelled. For loads, the exception is known early, when the TLB
    lookup has happened; I expect that the whole machine is stalled on a
    TLB miss (or, with a software-managed TLB, the exception happens right
    there). Alpha has imprecise FP exceptions because the architecture
    wanted to allow implementing denormals through trapping, but it takes
    several cycles to know whether an FP result is normal or not.

    That scoreboard allows OoO execution and completion,
    and avoids RAW, WAW, and WAR hazards,

    Register hazards are obeyed, memory hazards are not necessarily obeyed.

    but it doesn't write back results in program order.

    Exceptions can be made precise by (a) always writing results in-order,

    OR by allowing younger writes only after knowing older results will not
    raise exceptions.

    and (b) only recognizing exceptions at Writeback.

    A bit restrictive, but it does work.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Feb 25 21:58:12 2024
    Paul A. Clayton wrote:

    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    (From Computer Architecture: A Quantitative Approach, 3rd Ed.,
    Appendix H, "One approach to this problem, used in the MIPS R3010,
    is to identify instructions that may cause an exception early in
    the instruction cycle. For example, an addition can overflow only
    if one of the operands has an exponent of Emax, and so on. This
    early check is conservative: It might flag an operation that
    doesn’t actually cause an exception. However, if such false
    positives are rare, then this technique will have excellent
    performance. When an instruction is tagged as being possibly
    exceptional, special code in a trap handler can compute it without
    destroying any state. Remember that all these problems occur only
    when trap handlers are enabled.")
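
    A sketch of such a conservative early check for addition overflow,
    written for IEEE 754 binary64 rather than for the R3010's actual
    logic (which is not given here). The function names are made up; the
    check only inspects the exponent fields, so it can flag additions
    that do not overflow but never misses one that does.

```python
import struct
import sys

EMAX_BIASED = 2046   # biased exponent field of Emax in IEEE 754 binary64


def biased_exponent(x):
    """Return the 11-bit biased exponent field of a Python float."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits >> 52) & 0x7FF


def may_overflow_on_add(a, b):
    """Conservative filter: True if a + b could overflow.

    Two operands whose exponents are both below Emax each have magnitude
    below 2**Emax, so their sum stays below 2**(Emax + 1) and cannot
    overflow. Infinities and NaNs (exponent field 2047) are flagged too.
    """
    return biased_exponent(a) >= EMAX_BIASED or biased_exponent(b) >= EMAX_BIASED


BIG = sys.float_info.max
```

    may_overflow_on_add(BIG, BIG) flags a real overflow, while
    may_overflow_on_add(BIG, -BIG) is a false positive: the sum is zero,
    but the exponent check alone cannot tell.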

    Not writing results in order would require suppressing earlier
    writes to the same register (a singular writeback stage design
    would also have this). With simple in-order issue, this would
    (I think) only occur when the result was never used (e.g., a
    slow operation started before a conditional branch that
    determines its use, or in a "free" delay slot, or if two results
    are produced and one is unused such as unused flag settings).
    Out-of-order writeback also presents register write port hazards;
    more write ports might be needed than available.

    It _might_ be practical to allow store instructions that use a
    delayed result to issue before the result is available — similar

    ST instructions are special in that one can compute the address
    as soon as operand dependencies resolve, and then only access the
    value to be aligned and stored after the ST instruction is retired.
    This way, ST.data is never latent. HP has a patent on this circa
    1986±.

    to the classic store-address-generation/store-data split for
    out-of-order execution. A store buffer entry could be marked as
    not having valid data (similar to ready bits for registers) and
    the slow operation could "forward" to the store buffer.

    My pipelines don't even bother to fetch the data to be stored until
    the ST instruction retires.

    Multiply-
    add instructions can also conceivably exploit delayed availability
    of the addend. There might also be some cases where necessary
    latency is data dependent and knowing that the computation can
    be done faster the operations might be "issued" early as if it
    had the normal/worse-case latency — that communication complexity
    seems unlikely to be worthwhile but it is conceivably possible.

    Since low-end out-of-order is not extraordinarily complex or
    resource-intensive, heroic efforts to provide slightly less
    constrained but still in-order execution seem rather questionable.

    The only counter point is that every time one allows the front of the
    pipeline and the end of the pipeline to determine advance differently,
    you add 1 to the exponent of test vector complexity. If you allow the
    center of the pipeline to crush out bubbles, you have now added 2 to
    the test vector complexity of the pipeline.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Feb 25 22:22:04 2024
    Paul A. Clayton wrote:

    On 1/24/24 2:47 AM, Anton Ertl wrote:
    "Paul A. Clayton" <[email protected]> writes:

    When I looked at the pipeline design presented in the Arm Cortex-
    A55 Software Optimization Guide, I was surprised by the design.
    Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
    (ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
    DIV/SQRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
    stage before an ALU stage (clearly for AArch32).

    Almost like an Mc88100 which had 5 pipelines.



    The separation of MAC and DIV is mildly questionable — from my
    very amateur perspective — not supporting dual issue of a MAC-DIV
    pair seems very unlikely to hurt performance but the cost may be
    trivial.

    Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;

    The Chips and Cheese article also indicated that branches are only
    resolved at writeback, two cycles later than if branch direction
    was resolved in the first execution stage. The difference between
    a six stage misprediction penalty and an eight stage one is not
    huge, but it seems to indicate a difference in focus. With

    In an 8 stage pipeline, the 2 cycles of added delay should hurt by ~5%-7%

    condition code based branches and in-order execution, I would have
    been tempted to try resolving such branches by the end of the
    issue stage. (MIPS R2000 resolved register-compare branches at the
    end of decode, so resolving branches based on a condition code,
    if the data is available, in the cycle after decode does not seem
    incredibly difficult. It may be that condition codes are generally
    not set early enough to justify such effort, but it seems
    obviously "possible".)

    I would have *guessed* that an AGLU (a functional unit providing
    address generation and "simple" ALU functions, like AMD's Bobcat?)
    would be more area and power efficient than having separate
    pipelines, at least for store address generation.

    Be careful with assumptions like that. Silicon area with no moving
    signals is remarkably power efficient.

    I may be misinterpreting/misunderstanding the information. While I
    believe I am not entirely incompetent in general
    microarchitectural design, it is difficult to believe that any
    professional (much less a team of professionals) would do worse
    than I would. Other tradeoffs (like design reuse) may also justify
    design choices that seem worse.
    <snip>
    They had to choose the L1 size. Cortex-A55 supports L1 sizes of 16
    KiB, 32 KiB, and 64 KiB. With a fixed three-cycle latency (and
    other pipeline stages fixed in their work), the size of the L1
    caches will affect not only cycle time. If the pipeline diagram is
    interpreted extremely literally, address generation takes one
    cycle, data cache output takes one cycle, and align and extend
    takes one cycle. If cache access itself takes one cycle and if
    that latency increases by sqrt(2) with each capacity doubling,

    A capacity doubling adds SQRT(2) to wire latency and 1 gate to
    gate-delay latency. Depending on which is more critical, you may choose
    to go one way or the other.

    then implementations with the largest *either* data or instruction
    cache would have twice as much time in a cycle as implementations
    with both L1s being 16 KiB *if* the pipeline was designed for the
    smallest cache.
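The arithmetic behind "twice as much time" can be checked with a minimal sketch. It assumes, as the paragraph above does, that access time scales by sqrt(2) per capacity doubling and that 16 KiB is the baseline; the function name is hypothetical.

```python
import math


def relative_access_time(size_kib, base_kib=16):
    """Access time relative to the base size, assuming sqrt(2) per doubling."""
    doublings = math.log2(size_kib / base_kib)
    return math.sqrt(2) ** doublings


for size in (16, 32, 64):
    print(f"{size} KiB: {relative_access_time(size):.2f}x")
# 16 KiB: 1.00x
# 32 KiB: 1.41x
# 64 KiB: 2.00x
```

Two doublings (16 KiB to 64 KiB) give sqrt(2) squared, i.e. a factor of two, which is where the "twice as much time in a cycle" comes from if the cache access stage sets the cycle time.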

    (I would **GUESS** that ARM designed the pipeline for 32 KiB
    caches and smaller caches mostly mean unused time within the cache
    access cycle and larger caches mostly mean unused time within all
    the other stages. The time to complete a certain amount of logical
    operation can be adjusted, e.g., using a faster adder, but not
    shifting the clock boundaries constrains such changes as not all
    chunks of logic can be made faster — intentional clock skew might
    allow borrowing time — and synthesized designs might not get all
    the possible changes.)

    According to the AnandTech article, Samsung chose not to implement
    an L2 for the A55 cores. Since accessing the L3 means crossing a
    clocking domain, this would seem to have a significant impact on
    performance for workloads like SPEC and, I suspect, a noticeable
    impact on energy-efficiency. If this choice also led to using 64
    KiB L1 caches **and if** ARM optimized the pipeline for 32 KiB
    caches, this might also have noticeably impacted performance and
    energy-efficiency.

    Crossing a clock domain boundary is 2.5 clocks of latency.

    (For SPEC, I would guess that even the 256 KiB maximum
    configuration L2 size for A55 would have a significant performance
    impact. SPEC2006 used by AnandTech might be friendlier to modest
    L2 size than SPEC2017. If the software is "tuned" for workstation
    hardware of five years before the SPEC benchmark, 2019 smart
    phones might not be that far from 2001 workstations in terms of L2
    sizes.)

    -----------------------------

    If my above guess that a 64 KiB L1 was used and that this impacts
    frequency, voltage and frequency scaling may have been affected.
    (I seem to recall reading that caches have poorer voltage-
    frequency scaling; that *might* incline a larger L1 cache to
    further hurt energy efficiency if a single voltage is used for the
    whole core.)

    SRAMs do not operate (well) below a certain voltage. At voltage
    the sense amplifier will have a gain > 50× while below that voltage
    the SA gain may decrease to 10× and there is a range of voltages
    where the change in gain vs. change in voltage is quadratic.
    ------------------

    With respect to sticking with in-order, there also seems to be a
    tendency to go "all in" when switching to out-of-order, i.e., the
    initial out-of-order design seems to be relatively "beefy" in its
    out-of-order resources. This may result from having delayed the
    transition well beyond where performance or efficiency estimates
    would have justified the change or perhaps from crossover being a
    large enough region by the time a change is fully justified the
    out-of-order design would be relatively beefy.

    OoO gains in both ILP and in frequency, giving a quadratic uplift.

    Perhaps mildly out-of-order designs (say a little more than the
    PowerPC 750) are not actually useful (other than as a starting
    point for understanding out-of-order design). I do not understand
    why such an intermediate design (between in-order and 30+
    scheduling window out-of-order) is not useful. It may be that

    It is useful, just not all that much.

    going from say 10 to 30 scheduler entries gives so much benefit
    for relatively little extra cost (and no design is so precisely
    area constrained — even doubling core size would not mean pushing
    L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
    emotional investment in intermediate and mixed designs.

    10 does not accommodate much ILP beyond that of a 10 deep pipeline.
    30 accommodates L1 cache misses and typical FP latencies.
    90 accommodates "almost everything else"
    250 accommodates multiple L1 misses with L2 hits and "everything else".

    And now you write that ARM did not design it for power efficiency.
    If you are right, that supports the position that in-order is
    uncompetitive not just wrt performance, but also perf/W as soon as
    there are relatively low performance requirements.

    If ARM designed A55 for power efficiency (at that performance
    level) over all other concerns, the L1 caches would be fixed size.
    Users of ARM designs are obviously willing to sacrifice some power
    efficiency for the benefit from flexible L1 size. With different
    functionality differing in timing and energy costs with different
    processes, energy-efficiency at all costs would seem to lead to
    different designs for different processes. Presumably this is not
    cost effective.

    The memory system, on-chip network, and such would also affect the
    energy efficiency. Exynos9820's memory system might _reasonably_
    be optimized for high power/high performance use; that would tend
    to hurt the efficiency of wimpy cores.

    What scenario do you imagine where one would want these in-order
    cores? ARM's niche for them is the little cores in a big.LITTLE
    design; that is necessarily coupled with a memory system with a
    high bandwidth. There are also SoCs with only A55 cores (no BIG
    ones) like the RK3566, but they are only bought because of the
    price, not because of their power-efficiency.

    For something like a smart phone, one or two small cores might be
    useful for background activity, tasks whose latency (within a
    broad range) is not related to system responsiveness for the user.

    For a server expected to run embarrassingly parallel workloads, if

    Servers are not expected to run embarrassingly parallel applications,
    they are expected to run an embarrassingly large number of essentially
    serial applications.

    a wimpy core provides sufficient responsiveness, I would expect
    most of the cores (possibly even all of the cores) to be wimpy.
    There might not be many workloads with such characteristics;

    Talk to Google about that....

    although fundamental network latency has not improved that much
    over the last decade, bandwidth has increased and server-side
    processing complexity has increased. Even splitting a request
    across multiple threads can make wimpy cores less useful than one
    might expect because work will not be perfectly distributed and
    tail latency increases.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Paul A. Clayton on Mon Feb 26 10:48:39 2024
    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    Interrupts have to be restartable so in-order retire, where everything
    older than the interrupt RIP is executed and retired and everything
    after that RIP is not, is simplest and cheapest to implement.
    Yes you could make it more complicated, but why?

    As Mitch has pointed out many times, uOps with exceptions might look
    ahead to see if all older uOps that might throw exceptions have executed
    and did not indicate an exception. However I believe that exceptions
    are exceptional (unusual) and find the extra logic needed to do this to be
    not justified for the benefits of early prefetching of an exception handler.

    My only exception handler that is triggered with any regularity is
    page fault (assuming a hardware table walker so no TLB miss exceptions),
    and it typically invokes a handler with many thousands of instructions
    so prefetching that code a few clocks earlier won't make any difference.

    (From Computer Architecture: A Quantitative Approach, 3rd Ed.,
    Appendix H, "One approach to this problem, used in the MIPS R3010,
    is to identify instructions that may cause an exception early in
    the instruction cycle. For example, an addition can overflow only
    if one of the operands has an exponent of Emax, and so on. This
    early check is conservative: It might flag an operation that
    doesn’t actually cause an exception. However, if such false
    positives are rare, then this technique will have excellent
    performance. When an instruction is tagged as being possibly
    exceptional, special code in a trap handler can compute it without
    destroying any state. Remember that all these problems occur only
    when trap handlers are enabled.")
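The conservative early check described in the quoted passage can be sketched as follows. This is only an illustration of the idea for IEEE 754 double precision, not the R3010's actual logic; the function names are hypothetical.

```python
import struct

# Largest biased exponent of a finite IEEE 754 double (Emax = 1023, bias 1023).
EMAX_BIASED = 2046


def biased_exponent(x: float) -> int:
    """Extract the 11-bit biased exponent field of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF


def add_may_overflow(a: float, b: float) -> bool:
    """Conservative filter: flag the add if either operand has exponent Emax.

    A False result guarantees no overflow; a True result may be a false
    positive that a slower path has to sort out.
    """
    return biased_exponent(a) == EMAX_BIASED or biased_exponent(b) == EMAX_BIASED


print(add_may_overflow(1.0, 2.0))      # False: cannot overflow
print(add_may_overflow(1.7e308, 1.0))  # True: flagged, though this add does not overflow
print(add_may_overflow(1e308, 1e308))  # True: flagged, and this add does overflow
```

The check only looks at exponent fields, so it can run early in the pipeline, well before the sum itself is known.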

    Ok but their problem was they used the exception mechanism for Usuals,
    TLB misses and in this case floating point fix-ups. And a consequence of
    the exception mechanism is a pipeline drain, which doesn't matter if it
    only happens rarely but does if it happens often.

    This was in the early RISC days when they used traps for all kinds of
    normal management, misaligned memory accesses or Sparc register windows.
    And they all suffered performance problems.

    So rather than fix the actual problems by adding in a HW table walker
    and HW float fix-ups, it sounds like they added a complicated mechanism
    to sort-of-almost-but-not-quite-multi-threaded to execute the trap handler
    and avoid the pipeline drain. I had the same idea for Alpha's software
    TLB miss handler, which sapped up to 25% of performance, but decided that
    software managed TLB's are a dead end and a HW table walker was best.

    Moral of the story: don't use the exception mechanism for usuals
    and then complain about the performance.

  • From MitchAlsup1@21:1/5 to EricP on Mon Feb 26 19:49:06 2024
    EricP wrote:

    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    Interrupts have to be restartable so in-order retire, where everything
    older than the interrupt RIP is executed and retired and everything
    after that RIP is not, is simplest and cheapest to implement.
    Yes you could make it more complicated, but why?

    As Mitch has pointed out many times, uOps with exceptions might look
    ahead to see if all older uOps that might throw exceptions have executed
    and did not indicate an exception. However I believe that exceptions
    are exceptional (unusual) and find the extra logic needed to do this to be
    not justified for the benefits of early prefetching of an exception handler.

    My only exception handler that is triggered with any regularity is
    page fault (assuming a hardware table walker so no TLB miss exceptions),
    and it typically invokes a handler with many thousands of instructions
    so prefetching that code a few clocks earlier won't make any difference.

    (From Computer Architecture: A Quantitative Approach, 3rd Ed.,
    Appendix H, "One approach to this problem, used in the MIPS R3010,
    is to identify instructions that may cause an exception early in
    the instruction cycle. For example, an addition can overflow only
    if one of the operands has an exponent of Emax, and so on. This
    early check is conservative: It might flag an operation that
    doesn’t actually cause an exception. However, if such false
    positives are rare, then this technique will have excellent
    performance. When an instruction is tagged as being possibly
    exceptional, special code in a trap handler can compute it without
    destroying any state. Remember that all these problems occur only
    when trap handlers are enabled.")

    Ok but their problem was they used the exception mechanism for Usuals,
    TLB misses and in this case floating point fix-ups. And a consequence of
    the exception mechanism is a pipeline drain, which doesn't matter if it
    only happens rarely but does if it happens often.

    Note: the requirement of pipeline drain cost MIPS R2000 5 instructions
    and a modern x86 up to 200 instructions. The bigger the execution window
    the less you want to take exceptions.

    This was in the early RISC days when they used traps for all kinds of
    normal management, misaligned memory accesses or Sparc register windows.
    And they all suffered performance problems.

    Maybe--relative to the performance they had--but compared to the x86s
    and Mc68Ks of the competition, the RISCs outclassed them.

    So rather than fix the actual problems by adding in a HW table walker
    and HW float fix-ups, it sounds like they added a complicated mechanism
    to sort-of-almost-but-not-quite-multi-threaded to execute the trap handler
    and avoid the pipeline drain. I had the same idea for Alpha's software
    TLB miss handler, which sapped up to 25% of performance, but decided that
    software managed TLB's are a dead end and a HW table walker was best.

    SW TLB miss handlers were dead the instant one wanted 2-level translation:
    GuestOS and HyperVisor, which is the norm today (outside of µControllers).

    Moral of the story: don't use the exception mechanism for usuals
    and then complain about the performance.

  • From MitchAlsup1@21:1/5 to EricP on Sat Mar 2 20:55:18 2024
    EricP wrote:

    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    Interrupts have to be restartable so in-order retire,

    Either restartable or completable.

    where everything
    older than the interrupt RIP is executed and retired and everything
    after that RIP is not, is simplest and cheapest to implement.

    It is not the cheapest to implement, for that you need to review Mc 88100.

    Yes you could make it more complicated, but why?

    You CAN make it less complicated in HW and toss the burden off to SW.
    {similar to VLIW, RISC vs CISC, ...}

    As Mitch has pointed out many times, uOps with exceptions might look
    ahead to see if all older uOps that might throw exceptions have executed
    and did not indicate an exception.

    You look backwards (not ahead) to see if older instructions can no longer
    raise exceptions.

    However I believe that exceptions are exceptional (unusual) and find
    the extra logic needed to do this to be
    not justified for the benefits of early prefetching of an exception handler.

    My only exception handler that is triggered with any regularity is
    page fault (assuming a hardware table walker so no TLB miss exceptions),
    and it typically invokes a handler with many thousands of instructions
    so prefetching that code a few clocks earlier won't make any difference.

    If you use it often enough it will still be in your cache when you next
    need it. {I don't remember exactly who told me this, but it was one of
    the original MIPS (the company not Stanford) guys}; so you don't need to
    prefetch it.

    (From Computer Architecture: A Quantitative Approach, 3rd Ed.,
    Appendix H, "One approach to this problem, used in the MIPS R3010,
    is to identify instructions that may cause an exception early in
    the instruction cycle. For example, an addition can overflow only
    if one of the operands has an exponent of Emax, and so on. This
    early check is conservative: It might flag an operation that
    doesn’t actually cause an exception. However, if such false
    positives are rare, then this technique will have excellent
    performance. When an instruction is tagged as being possibly
    exceptional, special code in a trap handler can compute it without
    destroying any state. Remember that all these problems occur only
    when trap handlers are enabled.")

    Ok but their problem was they used the exception mechanism for Usuals,
    TLB misses and in this case floating point fix-ups. And a consequence of
    the exception mechanism is a pipeline drain, which doesn't matter if it
    only happens rarely but does if it happens often.

    It also matters less when the pipeline depth is small (less than 10).

    This was in the early RISC days when they used traps for all kinds of
    normal management, misaligned memory accesses or Sparc register windows.
    And they all suffered performance problems.

    Exceptions meant we did not have to build a bunch of mechanisms that
    SW could do similarly well. The smarter of us learned our lessons and
    modern Si means we have the logic to do it right without scrimping
    {FPGA still has not reached this as BGB and another NG member subscribe.}

    So rather than fix the actual problems by adding in a HW table walker
    and HW float fix-ups, it sounds like they added a complicated mechanism
    to sort-of-almost-but-not-quite-multi-threaded to execute the trap handler
    and avoid the pipeline drain.

    The MIPS guys (above) would state that that mechanism already had to exist
    and they could leverage it for TLB refills as they could page faults or FP
    fixups. Only page faults should remain in a modern (non-FPGA) implementation.

    I had the same idea for Alpha's software
    TLB miss handler, which sapped up to 25% of performance, but decided that
    software managed TLB's are a dead end and a HW table walker was best.

    Moral of the story: don't use the exception mechanism for usuals
    and then complain about the performance.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 2 21:04:35 2024
    Paul A. Clayton wrote:

    On 1/25/24 10:22 AM, Anton Ertl wrote:
    [snip]
    I think the commonly understood meaning is that
    all instructions start their execution in-order (i.e., none goes to a
    functional unit earlier than an architecturally earlier instruction).
    Execution can overlap.

    What about a skewed pipeline? A simple skewed pipeline that
    statically assigned operations to a pipeline-stage/execution unit
    has been called in-order (in what I have read). A "second-chance"
    pipeline (where many operations can dynamically choose the
    pipeline stage based on operand availability) involves dynamic
    scheduling (so would seem to fall in to out-of-order), but
    counterflow pipelines ("Counterflow Pipeline Processor
    Architecture", Robert F. Sproull et al., 1994) — which are more
    extreme in some ways than pipelines that have two stages in which
    operations can start — are stated to have "No overtaking.
    Instructions must stay in program order in the instruction
    pipeline.", which sounds "in-order" (and the paper was written by
    people working at Sun Microsystems).

    (I thought counterflow pipelines were weird. Simplifying
    communication makes sense, but ...)

    I get the impression that early PowerPC out-of-order execution
    implementations were really very similar to using the forwarding
    network for out-of-order completion (with in-order writeback). If
    I recall correctly, renaming was done by appending a version to
    the architectural register name and operands would be captured as
    soon as they were available rather than passing along the pipeline
    with forwarding until the writeback stage.

    This sounds more like Mc 88110 rather than PPC 620.

    PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
    die area. Other things may have been jettisoned at this shrink of design
    point. The 620 was originally targeted to be equal to Mc 88120 which was
    a 6-wide GBOoO machine full Tomasulo with precise exceptions and 4 external
    busses named {Data Out, Data In, Address Out, Address In}

    Address Out was used for cache misses to bring data to the CPU
    Data Out was used for cache victims to send data to DRAM
    Data In was used by arriving DRAM data
    Address In was used for arriving Snoops

    Smart externals could use Data In to send the CPU data before it knew it
    needed it. That data could be code or data.

  • From Michael S@21:1/5 to [email protected] on Sun Mar 3 01:04:44 2024
    On Sat, 2 Mar 2024 21:04:35 +0000
    [email protected] (MitchAlsup1) wrote:

    Paul A. Clayton wrote:

    On 1/25/24 10:22 AM, Anton Ertl wrote:
    [snip]
    I think the commonly understood meaning is that
    all instructions start their execution in-order (i.e., none goes
    to a functional unit earlier than an architecturally earlier
    instruction). Execution can overlap.

    What about a skewed pipeline? A simple skewed pipeline that
    statically assigned operations to a pipeline-stage/execution unit
    has been called in-order (in what I have read). A "second-chance"
    pipeline (where many operations can dynamically choose the
    pipeline stage based on operand availability) involves dynamic
    scheduling (so would seem to fall in to out-of-order), but
    counterflow pipelines ("Counterflow Pipeline Processor
    Architecture", Robert F. Sproull et al., 1994), which are more
    extreme in some ways than pipelines that have two stages in which
    operations can start, are stated to have "No overtaking.
    Instructions must stay in program order in the instruction
    pipeline.", which sounds "in-order" (and the paper was written by
    people working at Sun Microsystems).

    (I thought counterflow pipelines were weird. Simplifying
    communication makes sense, but ...)

    I get the impression that early PowerPC out-of-order execution
    implementations were really very similar to using the forwarding
    network for out-of-order completion (with in-order writeback). If
    I recall correctly, renaming was done by appending a version to
    the architectural register name and operands would be captured as
    soon as they were available rather than passing along the pipeline
    with forwarding until the writeback stage.

    This sounds more like Mc 88110 rather than PPC 620.


    Paul A. Clayton probably has in mind 603 and 7xx series rather than
    (more ambitious) 604 and its ill-fated never shipped followup 620.

    PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
    die area. Other things may have been jettisoned at this shrink of
    design point. The 620 was originally targeted to be equal to Mc 88120
    which was a 6-wide GBOoO machine full Tomasulo with precise
    exceptions and 4 external busses named {Data Out, Data In, Address
    Out, Address In}

    Address Out was used for cache misses to bring data to the CPU
    Data Out was used for cache victims to send data to DRAM
    Data In was used by arriving DRAM data
    Address In was used for arriving Snoops

    Smart externals could use Data In to send the CPU data before it knew
    it needed it. That data could be code or data.

  • From Terje Mathisen@21:1/5 to All on Mon Mar 4 21:54:12 2024
    MitchAlsup1 wrote:
    EricP wrote:
    My only exception handler that is triggered with any regularity is
    page fault (assuming a hardware table walker so no TLB miss exceptions),
    and it typically invokes a handler with many thousands of instructions
    so prefetching that code a few clocks earlier won't make any difference.

    If you use it often enough it will still be in your cache when you next
    need it. {I don't remember exactly who told me this, but it was one of
    the original MIPS (the company not Stanford) guys}; so you don't need to
    prefetch it.


    That has been my rule-of-thumb for lookup tables replacing logic: If the
    table is small enough and used often enough that it could make a
    significant difference to the total runtime, then it will also stay in
    cache nearly all the time.

    If it does get evicted between uses most of the time, then it simply
    wasn't that important.
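A small example of the kind of lookup table this rule of thumb is about: a 256-entry byte popcount table standing in for bit-twiddling logic. This is a minimal sketch with hypothetical names; the table is 256 bytes, i.e. four cache lines if one assumes 64-byte lines, so in a hot loop it would be expected to stay resident.

```python
# 256-entry table: POPCOUNT8[i] is the number of set bits in byte value i.
POPCOUNT8 = bytes(bin(i).count("1") for i in range(256))


def popcount_logic(x: int) -> int:
    """Pure-logic version: clear the lowest set bit until none remain."""
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count


def popcount_table(x: int) -> int:
    """Table version: one lookup per byte of the (non-negative) input."""
    count = 0
    while x:
        count += POPCOUNT8[x & 0xFF]
        x >>= 8
    return count


print(popcount_table(0xF0F0), popcount_logic(0xF0F0))  # 8 8
```

Python will not show the cache behaviour itself; the point is only the shape of the trade: a small, frequently used table replacing a data-dependent loop.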

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 9 04:01:49 2024
    Paul A. Clayton wrote:

    On 2/26/24 10:48 AM, EricP wrote:
    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    Interrupts have to be restartable so in-order retire, where everything
    older than the interrupt RIP is executed and retired and everything
    after that RIP is not, is simplest and cheapest to implement.
    Yes you could make it more complicated, but why?

    The above described method still provides precise exceptions. The
    absence of an earlier exception is required to allow such
    out-of-order retirement.

    This also means that handling of an asynchronous event might have
    to be delayed (if one did not want to have two threads active)
    until all instructions before the latest-in-program-order retired
    instruction have retired.

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC
    fail.

    I suspect general out-of-order retirement would not be worthwhile
    with precise exceptions; it just sounds complex. My comment was
    mainly to point out that such was possible not that it was wise.

    We basically all converged on this about 1990.

    [snip]
    Moral of the story: don't use the exception mechanism for usuals
    and then complain about the performance.



  • From Scott Lurndal@21:1/5 to [email protected] on Sat Mar 9 15:03:05 2024
    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    On 2/26/24 10:48 AM, EricP wrote:
    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    Interrupts have to be restartable so in-order retire, where everything
    older than the interrupt RIP is executed and retired and everything
    after that RIP is not, is simplest and cheapest to implement.
    Yes you could make it more complicated, but why?

    The above described method still provides precise exceptions. The
    absence of an earlier exception is required to allow such
    out-of-order retirement.

    This also means that handling of an asynchronous event might have
    to be delayed (if one did not want to have two threads active)
    until all instructions before the latest-in-program-order retired
    instruction have retired.

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC
    fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    The LLC (or memory controller) can optionally support an interrupt
    to management software to indicate that an uncorrected fault occurred;
    that would, of course, be asynchronous and occur long after the
    store had retired.
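The poisoning scheme described above can be sketched as a toy model. Everything here (class name, the `ecc_ok` flag, the interrupt list) is a hypothetical illustration, not any real memory controller's interface.

```python
class PoisonFault(Exception):
    """Raised when a load consumes data that was marked poisoned."""


class ToyMemory:
    def __init__(self):
        self.data = {}
        self.poisoned = set()
        # Addresses reported asynchronously to management software (optional).
        self.pending_interrupts = []

    def posted_store(self, addr, value, ecc_ok):
        """The store has already retired; a bad ECC check can only poison the line."""
        if ecc_ok:
            self.data[addr] = value
            self.poisoned.discard(addr)
        else:
            self.poisoned.add(addr)
            self.pending_interrupts.append(addr)

    def load(self, addr):
        """A later consumer of poisoned data is the one that sees the fault."""
        if addr in self.poisoned:
            raise PoisonFault(f"poisoned data at {addr:#x}")
        return self.data[addr]


mem = ToyMemory()
mem.posted_store(0x40, 7, ecc_ok=True)
mem.posted_store(0x80, 9, ecc_ok=False)
print(mem.load(0x40))  # 7
```

A subsequent `mem.load(0x80)` raises `PoisonFault`, which models the fault being reported at the point of use rather than at the store.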

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Mar 9 18:45:48 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    On 2/26/24 10:48 AM, EricP wrote:
    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    Interrupts have to be restartable so in-order retire, where everything
    older than the interrupt RIP is executed and retired and everything
    after that RIP is not, is simplest and cheapest to implement.
    Yes you could make it more complicated, but why?

    The above described method still provides precise exceptions. The
    absence of an earlier exception is required to allow such out-of-
    order retirement.

    This also means that handling of an asynchronous event might have
    to be delayed (if one did not want to have two threads active)
    until all instructions before the latest-in-program-order retired
    instruction have retired.

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC
    fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    Storing the bad <arriving> ECC should take care of that.

    The LLC (or memory controller) can optionally support an interrupt
    to management software to indicate that an uncorrected fault
    occurred; that would, of course, be asynchronous and occur long
    after the store had retired.

    I was going to check ECC on arrival at LLC and request retransmission
    on failure. The CPU sender cannot free the "miss buffer" until it gets
    a release (arrived OK) from the LLC. The LLC then later writes the data
    into DRAM through the DRAM controller.
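
    The hand-off described above can be sketched as a small model. This is
    only an illustration under stated assumptions: the class names, the
    method names, and the XOR check code (a stand-in for the real link ECC)
    are all hypothetical. It shows the sender holding its miss buffer entry
    until the LLC reports a clean arrival, and retransmitting otherwise.

```python
def check_code(data: bytes) -> int:
    """Stand-in for the link ECC (assumption): XOR of all data bytes."""
    code = 0
    for byte in data:
        code ^= byte
    return code


class CpuSender:
    """Hypothetical CPU side: keeps each line until the LLC releases it."""

    def __init__(self):
        self.miss_buffer = {}  # addr -> data still owned by the sender

    def write_back(self, addr, data):
        self.miss_buffer[addr] = data
        return addr, data, check_code(data)

    def on_reply(self, addr, arrived_ok):
        if arrived_ok:
            del self.miss_buffer[addr]  # release: entry may be freed
            return None
        data = self.miss_buffer[addr]   # failure: send the same line again
        return addr, data, check_code(data)


class Llc:
    """Hypothetical LLC side: checks the code on arrival."""

    def __init__(self):
        self.lines = {}

    def receive(self, addr, data, code):
        if check_code(data) != code:
            return False                # request retransmission
        self.lines[addr] = data         # later written to DRAM
        return True
```

    In this model a corrupted transfer never reaches `Llc.lines`, so nothing
    bad is written to DRAM on the store path; the cost is that the miss
    buffer entry stays occupied for a full round trip.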

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Mar 9 18:48:26 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:


    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC
    fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    The LLC (or memory controller) can optionally support an interrupt
    to management software to indicate that an uncorrected fault
    occurred; that would, of course, be asynchronous and occur long
    after the store had retired.

    The Interrupt Tables are manipulated by LLC (set, clear) and this is
    transmitted to CPU[*] by the cache coherence protocol (Invalidate Addr).

  • From EricP@21:1/5 to All on Sun Mar 10 11:53:37 2024
    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    Storing the bad <arriving> ECC should take care of that.

    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.
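
    The two store cases above can be sketched as follows. This is a minimal
    model, not a real SECDED implementation: a `poisoned` flag stands in for
    a deliberately invalid ECC on each 8-byte doubleword, and the names
    `Doubleword`, `store`, `load` and `PoisonError` are hypothetical.

```python
from dataclasses import dataclass

DW = 8  # data bytes covered by one 72-bit SECDED codeword


class PoisonError(Exception):
    """Raised when a load touches a doubleword marked as a double error."""


@dataclass
class Doubleword:
    data: bytearray
    poisoned: bool = False  # stands in for an intentionally invalid ECC


def touched(line, offset, size):
    """Yield each doubleword overlapped by [offset, offset + size)."""
    end = offset + size
    for index, dw in enumerate(line):
        lo = max(offset, index * DW)
        hi = min(end, (index + 1) * DW)
        if lo < hi:
            yield dw, index, lo - index * DW, hi - index * DW, lo - offset


def store(line, offset, payload):
    for dw, _, a, b, src in touched(line, offset, len(payload)):
        dw.data[a:b] = payload[src:src + (b - a)]
        if b - a == DW:
            dw.poisoned = False  # full overwrite: fresh valid ECC
        # partial overwrite: the poison marker is kept, because the
        # store cannot tell whether it replaced the bad bytes


def load(line, offset, size):
    out = bytearray()
    for dw, index, a, b, _ in touched(line, offset, size):
        if dw.poisoned:
            raise PoisonError(f"doubleword {index} is poisoned")
        out += dw.data[a:b]
    return bytes(out)
```

    A misaligned 8-byte store is treated here as two partial overwrites, so
    it clears neither doubleword's marker.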

  • From EricP@21:1/5 to Paul A. Clayton on Sun Mar 10 12:39:15 2024
    Paul A. Clayton wrote:
    On 2/26/24 10:48 AM, EricP wrote:
    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    Multiple parallel pipelines is fine but it has to sequence the pipe
    exits
    so the results retire in order for precise exceptions and interrupts.

    In-order retire is not strictly required for precise exceptions
    and certainly is not needed for interrupts. If the exception's
    presence is determined before writeback of results from later
    instructions, these writebacks can be prevented. One could
    alternatively use a conservative filter of exception conditions
    to stall writeback of later results (and stall those pipelines)
    until it is known whether the exception occurs.

    Interrupts have to be restartable so in-order retire, where everything
    older than the interrupt RIP is executed and retired and everything
    after that RIP is not, is simplest and cheapest to implement.
    Yes you could make it more complicated, but why?

    The above described method still provides precise exceptions. The
    absence of an earlier exception is required to allow such out-of-
    order retirement.

    Yes, early OoO retire with precise exception is possible.
    The criteria would seem to be that:
    - all older instructions that might generate an exception must have
    executed without detecting an exception
    - plus all older loads and stores translated their virtual addresses
    (loads don't need to have completed execution, and stores will not have)
    - plus all older conditional branches have executed without mispredicting.
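
    The criteria above can be read as a predicate over the in-flight window.
    A minimal sketch, assuming a hypothetical window held in program order
    (oldest first); the `Inflight` fields and both function names are
    illustrative and not taken from any real core:

```python
from dataclasses import dataclass


@dataclass
class Inflight:
    can_fault: bool = False       # might raise an exception when executed
    executed: bool = False
    faulted: bool = False
    is_mem: bool = False          # load or store
    translated: bool = False      # virtual address translated without fault
    is_cond_branch: bool = False
    mispredicted: bool = False


def older_is_safe(op):
    """True if this older instruction can no longer force a rollback."""
    if op.can_fault and (not op.executed or op.faulted):
        return False
    if op.is_mem and not op.translated:
        return False
    if op.is_cond_branch and (not op.executed or op.mispredicted):
        return False
    return True


def may_retire_early(window, i):
    """True if window[i] may retire ahead of older, unfinished entries."""
    me = window[i]
    return (me.executed and not me.faulted
            and all(older_is_safe(op) for op in window[:i]))
```

    Note that an older load only needs `translated` set, not `executed`,
    which matches the second criterion. The `all(...)` over every older
    entry is the part that has to become a parallel circuit in hardware.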

    My concern is that the circuit for doing this could be pretty complicated.
    Many of the pieces that have to be checked are scattered around the core.
    Also many of the states are in circular buffers so determining "older"
    starts getting slightly hairy (the Load Store Queue has a similar problem
    for disambiguation determining if all older loads and stores have
    "resolved").
    And all this has to run in parallel so it takes less than 1 clock.
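
    To illustrate why "older" is awkward in a circular buffer, here is a
    small sketch. It assumes a hypothetical queue where `oldest` is the slot
    of the oldest valid entry and allocation walks forward with wrap-around;
    the function names are made up for this example.

```python
def age(index, oldest, size):
    """Distance of `index` from the oldest entry, walking forward with wrap."""
    return (index - oldest) % size


def is_older(a, b, oldest, size):
    """True if slot `a` holds an older instruction than slot `b`."""
    return age(a, oldest, size) < age(b, oldest, size)


def all_older_resolved(resolved, index, oldest):
    """True if every entry older than `index` has its resolved flag set."""
    size = len(resolved)
    return all(resolved[(oldest + k) % size]
               for k in range(age(index, oldest, size)))
```

    With `oldest = 6` in an 8-entry queue, slot 7 is older than slot 0 even
    though 7 > 0, so a plain index compare gives the wrong answer once the
    buffer has wrapped. Software can loop; hardware has to do the
    equivalent for all entries at once.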

    The motivation for early OoO retire is usually early recycling of some
    resources, usually physical registers. However note that you can't early
    recycle some resources like entries in circular buffers, such as the
    Instruction Queue, ROB/Future-File, LSQ, Branch Queue.

    So the question I have is: is it really worth it?

    This also means that handling of an asynchronous event might have
    to be delayed (if one did not want to have two threads active)
    until all instructions before the latest-in-program-order retired
    instruction have retired.

    I define exceptions as part of the ISA, internal, and synchronous with
    their triggering instruction. Doing so allows the exception mechanism
    to focus on doing just its one thing.

    An asynchronous event would therefore not be an "exception".

    I define interrupts as asynchronous restartable traps,
    with model dependent delivery and control.

    I define errors as a whole different category from exceptions and
    interrupts, and explicitly model dependent, and each error has its
    own characteristics.

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    Errors are totally model and situation dependent. A bus parity error
    reading a cache line from DRAM might mean logging the error and repeating
    the last flit transfer, while a bus parity error reading a device control
    register is device dependent whether it can be repeated as some devices
    change state on register read (e.g. a UART's received byte FIFO).

    I suspect general out-of-order retirement would not be worthwhile
    with precise exceptions; it just sounds complex. My comment was
    mainly to point out that such was possible not that it was wise.

    I agree it sounds complicated. The motivation for doing so which
    I have seen is usually to recycle some resources earlier.
    But you also have to consider all the resources required to
    manage freeing up those resources earlier.

  • From EricP@21:1/5 to EricP on Sun Mar 10 13:26:01 2024
    EricP wrote:
    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    Storing the bad <arriving> ECC should take care of that.

    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.

    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.

  • From Scott Lurndal@21:1/5 to EricP on Sun Mar 10 18:34:12 2024
    EricP <[email protected]> writes:
    EricP wrote:

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    Storing the bad <arriving> ECC should take care of that.

    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.

    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.

    If it was properly poisoned, the access by the DMA engine will
    cause a RAS error to be signalled and the DMA aborted.

  • From MitchAlsup1@21:1/5 to EricP on Sun Mar 10 19:31:10 2024
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    Storing the bad <arriving> ECC should take care of that.

    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    For my scenario to transpire:: the cache line written back would have
    had to be read in the L1/L2-cache with correct ECC (which accompanies
    the line to DRAM controller) and the whole line would be written into
    DRAM with the original ECC.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    Even uncacheable DRAM is accessed line-at-a-time.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    It knows which DoubleWords contain bad ECC ...

    When the modified line is written back to DRAM it retains the
    double error ECC.

    Straight from the CPU cache.

  • From MitchAlsup1@21:1/5 to EricP on Sun Mar 10 19:39:21 2024
    EricP wrote:

    Paul A. Clayton wrote:
    On 2/26/24 10:48 AM, EricP wrote:
    Paul A. Clayton wrote:
    On 1/28/24 1:48 PM, EricP wrote:
    [snip]
    The above described method still provides precise exceptions. The
    absence of an earlier exception is required to allow such out-of-
    order retirement.

    Yes, early OoO retire with precise exception is possible.
    The criteria would seem to be that:
    - all older instructions that might generate an exception must have
    executed without detecting an exception
    - plus all older loads and stores translated their virtual addresses
    (loads don't need to have completed execution, and stores will not have)
    - plus all older conditional branches have executed without mispredicting.

    You missed
    - all inbound cache lines need to have arrived without ECC errors.

    My concern is that the circuit for doing this could be pretty complicated.

    Essentially equal in complexity to an IO retirement µArchitecture.

    Many of the pieces that have to be checked are scattered around the core.
    Also many of the states are in circular buffers so determining "older"
    starts getting slightly hairy (the Load Store Queue has a similar problem
    for disambiguation determining if all older loads and stores have
    "resolved"). And all this has to run in parallel so it takes less than
    1 clock.

    The motivation for early OoO retire is usually early recycling of some
    resources, usually physical registers. However note that you can't early
    recycle some resources like entries in circular buffers, such as the
    Instruction Queue, ROB/Future-File, LSQ, Branch Queue.

    So the question I have is: is it really worth it?

    History says no.

    This also means that handling of an asynchronous event might have
    to be delayed (if one did not want to have two threads active)
    until all instructions before the latest-in-program-order retired
    instruction have retired.

    Which is exactly the IO retire criterion. Why go OoO retire when you have
    to be able to IO retire under certain circumstances ?!?

    <snip>

    I define errors as a whole different category from exceptions and
    interrupts, and explicitly model dependent, and each error has its
    own characteristics.

    Errors are different--machine checks, for example. These things SHOULD
    not happen and you really do want to know if they do.

  • From EricP@21:1/5 to Scott Lurndal on Sun Mar 10 15:44:38 2024
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    EricP wrote:

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.
    Storing the bad <arriving> ECC should take care of that.
    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.
    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.

    If it was properly poisoned, the access by the DMA engine will
    cause a RAS error to be signalled and the DMA aborted.

    And the OS does what with the page and its data?
    This could happen long after the owner process terminated,
    maybe part of a lazy file cache write back.

    The only option for the OS might be to log the error and just reset
    the ECC to valid for the current data so the IO can complete.

    There is little point in decommissioning the physical page frame for
    just one incident as most dram errors are random single event upsets
    which can affect multiple bits in adjacent rows or columns.
    So there could be multiple errors in a frame due to a single event.

    To decommission a bad frame you'd want to see multiple such events,
    indicating perhaps a bad row or column.
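
    The "multiple events before decommissioning" policy can be sketched as
    a per-frame counter. This is only an illustration: the class name, the
    method name, and the threshold of 3 are assumptions, not values from
    any particular OS.

```python
from collections import Counter


class FrameErrorLog:
    """Hypothetical per-frame record of uncorrectable-error events."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed value, tune per platform
        self.events = Counter()     # frame number -> event count
        self.retired = set()        # frames taken out of service

    def record(self, frame):
        """Log one event; return True once the frame is decommissioned."""
        self.events[frame] += 1
        if self.events[frame] >= self.threshold:
            self.retired.add(frame)
        return frame in self.retired
```

    A single upset that flips several bits in one frame should be logged as
    one event here, otherwise one particle strike could retire the frame on
    its own.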

  • From MitchAlsup1@21:1/5 to EricP on Sun Mar 10 19:41:15 2024
    EricP wrote:

    EricP wrote:
    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    Storing the bad <arriving> ECC should take care of that.

    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.

    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.

    The line was displaced from an L1/L2 cache and its DRAM landing spot is
    not in DRAM ?? but over on some disk/SSD ?!?

    How (the frick) did it get into L1/L2 if it was not in DRAM ?? and thus
    not on disk (as its only access point). ????

  • From EricP@21:1/5 to All on Sun Mar 10 17:52:29 2024
    MitchAlsup1 wrote:
    EricP wrote:

    EricP wrote:
    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    For memory reads, the late failure generated by an uncorrectable
    ECC error would probably have to be handled differently or there
    would probably be little opportunity to exploit out-of-order
    retirement. It might not be entirely unreasonable to treat such as
    a fatal thread error that is asynchronous.

    What about for memory stores where the ECC check on the delivered
    data fails ?? This seems to be just as fatal as a LD with an ECC
    fail.

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.

    Storing the bad <arriving> ECC should take care of that.

    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know
    which
    of the 8 bytes was invalid so can't tell if the bad data was
    overwritten.
    If it keeps the old ECC as an error indicator, that code might
    actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.

    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.

    The line was displaced from an L1/L2 cache and its DRAM landing spot is
    not in DRAM ?? but over on some disk/SSD ?!?
    How (the frick) did it get into L1/L2 if it was not in DRAM ?? and thus
    not on disk (as its only access point). ????

    I'm just pointing out that the erroneous value with its
    poisoned ECC that is written from LLC back to DRAM can eventually
    lose its ECC error tag when it is out swapped.

    What we are left with is probably an error report buried in a
    log someplace and a seemingly valid value on disk.

  • From Scott Lurndal@21:1/5 to EricP on Sun Mar 10 22:14:50 2024
    EricP <[email protected]> writes:
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    EricP wrote:

    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.
    Storing the bad <arriving> ECC should take care of that.
    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.
    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.

    If it was properly poisoned, the access by the DMA engine will
    cause a RAS error to be signalled and the DMA aborted.

    And the OS does what with the page and its data?
    This could happen long after the owner process terminated,
    maybe part of a lazy file cache write back.

    The only option for the OS might be to log the error and just reset
    the ECC to valid for the current data so the IO can complete.

    No, the I/O must be aborted. RAS 101 - do not propagate
    poisoned data.

  • From EricP@21:1/5 to Scott Lurndal on Mon Mar 11 14:19:19 2024
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    EricP wrote:
    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.
    Storing the bad <arriving> ECC should take care of that.
    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.
    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.
    If it was properly poisoned, the access by the DMA engine will
    cause a RAS error to be signalled and the DMA aborted.
    And the OS does what with the page and its data?
    This could happen long after the owner process terminated,
    maybe part of a lazy file cache write back.

    The only option for the OS might be to log the error and just reset
    the ECC to valid for the current data so the IO can complete.

    No, the I/O must be aborted. RAS 101 - do not propagate
    poisoned data.

    Perhaps but tossing a whole block from an IO expands the size of
    the problem by a factor of 1000's.

    If that was one byte wrong in a text file then I think most people
    would want it written, as opposed to tossing out their work.

    If that was one byte wrong in a file system meta data block then
    there is no good answer. Many of the meta data blocks are in linked lists
    or B+ trees so not writing the block could corrupt a whole file system,
    and writing the block could also cause corruption but hopefully less likely.

    So you are damned if you do fix the ECC and write the block,
    and damned if you don't. But do seems less damning.

  • From Scott Lurndal@21:1/5 to EricP on Mon Mar 11 18:50:03 2024
    EricP <[email protected]> writes:
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    EricP wrote:
    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.
    Storing the bad <arriving> ECC should take care of that.
    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.

    When the modified line is written back to DRAM it retains the
    double error ECC.
    And if the page is out swapped and recycled we lose track of
    the error indicator on that 8-byte value.
    If it was properly poisoned, the access by the DMA engine will
    cause a RAS error to be signalled and the DMA aborted.
    And the OS does what with the page and its data?
    This could happen long after the owner process terminated,
    maybe part of a lazy file cache write back.

    The only option for the OS might be to log the error and just reset
    the ECC to valid for the current data so the IO can complete.

    No, the I/O must be aborted. RAS 101 - do not propagate
    poisoned data.

    Perhaps but tossing a whole block from an IO expands the size of
    the problem by a factor of 1000's.

    Not having the data (or at least the data in the I/O block being
    written (512/4k) given non-sequential underlying disk sector allocations)
    is _far far_ better than having corrupt data. The former can be
    repaired. The latter may not even be detected.


    If that was one byte wrong in a text file then I think most people
    would want it written, as opposed to tossing out their work.

    I really doubt that any programmer would prefer bad data to no data.

  • From EricP@21:1/5 to All on Wed Mar 13 10:24:25 2024
    MitchAlsup1 wrote:
    EricP wrote:

    My concern is that the circuit for doing this could be pretty
    complicated.

    Essentially equal in complexity to an IO retirement µArchitecture.

    For my uArch Retire should be quite straight forward to implement.

    Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
    checks if the Done flag is set. If it is and the entry's Exception flag
    is clear:

    - if instruction was not a taken branch Retire adds the instruction
    length to the committed RIP register.
    - else if it is a taken branch Retire pops the new committed RIP from
    the tail of the branch queue in the Branch Control Unit.
    - it clears the Architecture Reg flag on the old dest physical register
    (which also frees it) and sets it on the new dest physical register
    - updates the Committed-RAT with the new dest register for the Arch register
    - increments IQ tail pointer, freeing the entry.

    If the entry's Exception flag is set then it is also straight forward,
    with a flush of all in-flight instructions, bulk copy the Committed-RAT
    into the Future-RAT to restore renaming, and set a jump address in Fetch.
    (Any in-flight cache miss operations are allowed to complete.)
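    A minimal sketch of the retire sequence described above, written in Python as a behavioural model rather than RTL. The names `IQEntry`, `RetireModel`, `retire_one`, and `handler_addr` are hypothetical, and the left end of each deque stands in for the "tail (oldest)" entry; none of this is taken from the actual design.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class IQEntry:
    length: int                 # instruction length in bytes
    arch_reg: Optional[int]     # destination architectural register, if any
    new_preg: Optional[int]     # physical register allocated at rename
    done: bool = False
    exception: bool = False
    taken_branch: bool = False


class RetireModel:
    """Behavioural model of in-order retire (all names are illustrative)."""

    def __init__(self, num_arch, num_phys):
        self.committed_rip = 0
        self.committed_rat = list(range(num_arch))    # arch reg -> phys reg
        self.future_rat = list(self.committed_rat)
        self.is_arch = [p < num_arch for p in range(num_phys)]
        self.iq = deque()            # left end = oldest entry
        self.branch_queue = deque()  # left end = oldest resolved target
        self.fetch_redirect = None

    def retire_one(self, handler_addr=0):
        if not self.iq or not self.iq[0].done:
            return "stall"
        entry = self.iq[0]
        if entry.exception:
            # Flush everything in flight and restore renaming state.
            self.iq.clear()
            self.branch_queue.clear()
            self.future_rat = list(self.committed_rat)
            self.fetch_redirect = handler_addr
            return "exception"
        if entry.taken_branch:
            self.committed_rip = self.branch_queue.popleft()
        else:
            self.committed_rip += entry.length
        if entry.arch_reg is not None:
            old_preg = self.committed_rat[entry.arch_reg]
            self.is_arch[old_preg] = False        # also frees the old register
            self.is_arch[entry.new_preg] = True
            self.committed_rat[entry.arch_reg] = entry.new_preg
        self.iq.popleft()                         # frees the IQ entry
        return "retired"
```

    Retiring several instructions per clock would correspond to calling `retire_one` repeatedly in one cycle until it returns something other than "retired". In hardware that is where the extra read and write ports come from.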

    This is also relatively straight forward to do multiple retires per clock,
    each mostly costs an extra read port on IQ and extra write ports on the Committed-RAT and the Physical Register Status register.

    Many of the pieces that have to be checked are scattered around the core.
    Also many of the states are in circular buffers so determining "older" starts
    getting slightly hairy (the Load Store Queue has a similar problem for
    disambiguation determining if all older loads and stores have
    "resolved").
    And all this has to run in parallel so it takes less than 1 clock.
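    To illustrate why "older" gets awkward in a circular buffer: slot numbers alone do not give age once the buffer has wrapped, so the comparison has to be made relative to the oldest slot. A small Python sketch, assuming a buffer of `size` slots whose oldest entry sits at index `tail` (the function names are hypothetical):

```python
def is_older(a, b, tail, size):
    """True if slot a holds an older entry than slot b.

    Age is measured as distance from the oldest slot (tail), modulo the
    buffer size, so the comparison stays correct after wrap-around.
    """
    return (a - tail) % size < (b - tail) % size


def all_older_resolved(idx, tail, size, resolved):
    """True if every entry older than slot idx has its resolved flag set."""
    age = (idx - tail) % size
    return all(resolved[(tail + k) % size] for k in range(age))
```

    In hardware the loop in `all_older_resolved` becomes a parallel masked AND across all slots, which is part of why it is hard to fit in one clock.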

    Adding the structures to support OoO Retire would greatly complicate this.

  • From MitchAlsup1@21:1/5 to EricP on Wed Mar 13 15:34:47 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    My concern is that the circuit for doing this could be pretty
    complicated.

    Essentially equal in complexity to an IO retirement µArchitecture.

    For my uArch Retire should be quite straight forward to implement.

    Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
    checks if the Done flag is set. If it is and the entry's Exception flag
    is clear:

    - if instruction was not a taken branch Retire adds the instruction
    length to the committed RIP register.
    - else if it is a taken branch Retire pops the new committed RIP from
    the tail of the branch queue in the Branch Control Unit.
    - it clears the Architecture Reg flag on the old dest physical register
    (which also frees it) and sets it on the new dest physical register
    - updates the Committed-RAT with the new dest register for the Arch register
    - increments IQ tail pointer, freeing the entry.

    All of these would have been completed when the instruction comes out
    of its function unit, and then retire multiplexes this data onto the
    current retired instruction state. {2-gates not 13-gates}

    If the entry's Exception flag is set then it is also straight forward,
    with a flush of all in-flight instructions, bulk copy the Committed-RAT
    into the Future-RAT to restore renaming, and set a jump address in Fetch.
    (Any in-flight cache miss operations are allowed to complete.)

    This is also relatively straight forward to do multiple retires per clock,
    each mostly costs an extra read port on IQ and extra write ports on the
    Committed-RAT and the Physical Register Status register.

    Many of the pieces that have to be checked are scattered around the core.
    Also many of the states are in circular buffers so determining "older" starts
    getting slightly hairy (the Load Store Queue has a similar problem for
    disambiguation determining if all older loads and stores have
    "resolved").
    And all this has to run in parallel so it takes less than 1 clock.

    Adding the structures to support OoO Retire would greatly complicate this.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 13 15:31:50 2024
    Scott Lurndal wrote:

    EricP <[email protected]> writes:

    No, the I/O must be aborted. RAS 101 - do not propagate
    poisoned data.

    Perhaps but tossing a whole block from an IO expands the size of
    the problem by a factor of 1000's.

    Not Having the data (or at least the data in the I/O block being
    written (512/4k) given non-sequential underlying disk sector allocations)
    is _far far_ better than having corrupt data. The former can be
    repaired. The latter may not even be detected.


    If that was one byte wrong in a text file then I think most people
    would want it written, as opposed to tossing out their work.

    I really doubt that any programmer would prefer bad data to no data.

    Any application dealing with money will prefer knowing the data is bad
    to not knowing if the data is bad.

    On the other hand, engine controllers deal with bad data all the time,
    and correct any current data problem on the next engine revolution.

  • From MitchAlsup1@21:1/5 to EricP on Wed Mar 13 15:36:50 2024
    EricP wrote:

    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    EricP wrote:
    As most stores are posted, the data stored needs to be 'poisoned'
    so that any subsequent use of the data (e.g. a load) will report
    a fault.
    Storing the bad <arriving> ECC should take care of that.
    I don't think that will always work. Assuming we are using a
    72-bit SECDED ECC and a cache line is read with a double error,
    then if the ST overwrites an 8 byte aligned value it will generate
    a new valid ECC and correct the error.

    However if the ST is less than 8 bytes or misaligned, it won't know which
    of the 8 bytes was invalid so can't tell if the bad data was overwritten.
    If it keeps the old ECC as an error indicator, that code might actually be
    correct for the new data. If it generates a new valid ECC then it loses
    track of the fact that the data MAY be invalid.

    In this second case of partial overwrite I think it has to generate a
    new invalid ECC for the new 8 byte data indicating a double error.
    When the modified line is written back to DRAM it retains the
    double error ECC.
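
    The full-versus-partial overwrite rule above can be sketched as follows. This is an abstraction in Python, not a SECDED implementation: the boolean `poisoned` stands in for "the stored check bits decode as a double error", and `Word`, `store`, and `load` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    data: bytearray          # one 8-byte ECC-protected word
    poisoned: bool = False   # models check bits that decode as a double error


def store(word, offset, payload):
    """Write payload into the word, following the overwrite rule above."""
    if offset < 0 or offset + len(payload) > 8:
        raise ValueError("store must stay within one 8-byte word")
    word.data[offset:offset + len(payload)] = payload
    if offset == 0 and len(payload) == 8:
        # Full aligned overwrite: all old data is gone, so a fresh valid
        # ECC can be generated.
        word.poisoned = False
    # Partial overwrite: we cannot tell whether the bad bits were replaced,
    # so the poison indication has to be kept (regenerated as invalid).


def load(word):
    """Return the data, or report a fault if the word is poisoned."""
    if word.poisoned:
        raise RuntimeError("uncorrectable ECC error (poisoned data)")
    return bytes(word.data)
```

    For example, a 2-byte store into a poisoned word leaves `load` raising, while a subsequent 8-byte aligned store clears the poison.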
    And if the page is swapped out and recycled we lose track of
    the error indicator on that 8-byte value.
    If it was properly poisoned, the access by the DMA engine will
    cause a RAS error to be signalled and the DMA aborted.
    And the OS does what with the page and its data?
    This could happen long after the owner process terminated,
    maybe part of a lazy file cache write back.

    The only option for the OS might be to log the error and just reset
    the ECC to valid for the current data so the IO can complete.

    No, the I/O must be aborted. RAS 101 - do not propagate
    poisoned data.

    Consider a page being written out and the last cache line in the page
    has a bad ECC. What command does one send the disk to indicate "forget
    all that data I just sent you" ??

    Perhaps but tossing a whole block from an IO expands the size of
    the problem by a factor of 1000's.

    If that was one byte wrong in a text file then I think most people
    would want it written, as opposed to tossing out their work.

    If that was one byte wrong in a file system meta data block then
    there is no good answer. Many of the meta data blocks are in linked lists
    or B+ trees so not writing the block could corrupt a whole file system,
    and writing the block could also cause corruption but hopefully less likely.

    So you are damned if you do fix the ECC and write the block,
    and damned if you don't. But do seems less damning.

  • From EricP@21:1/5 to All on Wed Mar 13 12:31:14 2024
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    My concern is that the circuit for doing this could be pretty
    complicated.

    Essentially equal in complexity to an IO retirement µArchitecture.

    For my uArch Retire should be quite straight forward to implement.

    Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
    checks if the Done flag is set. If it is and the entry's Exception flag
    is clear:

    - if instruction was not a taken branch Retire adds the instruction
    length to the committed RIP register.
    - else if it is a taken branch Retire pops the new committed RIP from
    the tail of the branch queue in the Branch Control Unit.
    - it clears the Architecture Reg flag on the old dest physical register
    (which also frees it) and sets it on the new dest physical register
    - updates the Committed-RAT with the new dest register for the Arch
    register
    - increments IQ tail pointer, freeing the entry.

    All of these would have been completed when the instruction comes out of
    its function unit, and then retire multiplexes this data onto the
    current retired instruction state. {2-gates not 13-gates}

    IIRC the Alpha 21064 was 16 gates per stage so if my Retire unit
    could hit 13 gates I'd be extremely chuffed (delighted).
    I would likely be targeting 20 gates per stage anyway.

  • From Scott Lurndal@21:1/5 to [email protected] on Wed Mar 13 16:16:39 2024
    [email protected] (MitchAlsup1) writes:
    EricP wrote:

    No, the I/O must be aborted. RAS 101 - do not propagate
    poisoned data.

    Consider a page being written out and the last cache line in the page
    has a bad ECC. What command does one send the disk to indicate "forget
    all that data I just sent you" ??

    You're not sending the data asynchronously. The DMA engine on
    the disk controller is requesting the data from DRAM. The
    response to the DMA READ indicates an error to the disk
    controller and it aborts the write (and if it is buffered in the
    RAM disk controller cache, the entire I/O can be aborted
    with no change to the sector(s) on disk).
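
    A rough Python sketch of the pull model described above: the controller reads each line from memory, and a poisoned line makes the read fail so the staged write is dropped before anything reaches the platter. `PoisonError`, `dma_read`, and `write_block` are hypothetical names, and dictionaries stand in for DRAM and the disk.

```python
class PoisonError(Exception):
    """Raised when a DMA read hits a line marked as poisoned."""


def dma_read(memory, poisoned, line):
    """Return one cache line from memory, or signal a RAS error."""
    if line in poisoned:
        raise PoisonError(line)
    return memory[line]


def write_block(disk, sector, memory, poisoned, lines):
    """Stage a block in the controller buffer, then commit it to disk.

    Returns True if the sector was written, False if the I/O was aborted.
    """
    staged = []
    try:
        for line in lines:
            staged.append(dma_read(memory, poisoned, line))
    except PoisonError:
        return False          # abort: the sector on disk is left untouched
    disk[sector] = b"".join(staged)
    return True
```

    The point of the sketch is only the ordering: nothing is committed until every line of the block has been read without error.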

  • From MitchAlsup1@21:1/5 to EricP on Wed Mar 13 19:14:51 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    My concern is that the circuit for doing this could be pretty
    complicated.

    Essentially equal in complexity to an IO retirement µArchitecture.

    For my uArch Retire should be quite straight forward to implement.

    Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
    checks if the Done flag is set. If it is and the entry's Exception flag
    is clear:

    - if instruction was not a taken branch Retire adds the instruction
    length to the committed RIP register.
    - else if it is a taken branch Retire pops the new committed RIP from
    the tail of the branch queue in the Branch Control Unit.
    - it clears the Architecture Reg flag on the old dest physical register
    (which also frees it) and sets it on the new dest physical register
    - updates the Committed-RAT with the new dest register for the Arch
    register
    - increments IQ tail pointer, freeing the entry.

    All of these would have been completed when the instruction comes out of
    its function unit, and then retire multiplexes this data onto the
    current retired instruction state. {2-gates not 13-gates}

    IIRC the Alpha 21064 was 16

    carefully tuned

    gates per stage so if my Retire unit
    could hit 13 gates I'd be extremely chuffed (delighted).
    I would likely be targeting 20 gates per stage anyway.

    For example, Athlon was a 16-gate machine and Opteron was a 17-gate
    machine. The 64-bit* integer adder was 11-gates of delay which had
    been carefully tuned so it was at least as fast as 8-random gates
    of FO4.

    (*) and the 56-bit fraction FADD adder was also 11-gates.

    As to gates of delay per stage::

    At 20-gates you can run 6-wide forwarding anything goes anywhere and hit
    each cache port twice per cycle (generally 1 RD 1 WT). This µArchitecture shortens the number of retire stages. One can also use register file ports twice per cycle so a 6-port RF can do 6 RDs and 6 WTs per cycle.

    At 16-gates 3-4-wide machines can perform everything goes everywhere forwarding
    but cannot run an SRAM twice per cycle {either RD-RD or RD-WT}. It is right
    on the edge of doable to use your register ports twice per cycle {I would
    recommend not trying}. 30 years ago, with circuit designers tuning gates, you
    could; now, with gates-only-from-library, you cannot.

    At 12-gates per stage you cannot perform anything goes anywhere forwarding
    {for example an ADD-Byte (x86) could not be forwarded to a 32-bit or 64-bit integer ADD. Part of the problem is x86 defines byte addition as insert.}

    At 8-gates per stage, the integer adder and accessing SRAM both take an
    entire cycle, so a LD cannot be shorter than 3-cycles and set associative caches are often 4-cycles. {So DM caches may actually outperform SA cache} Decode is at least 2 cycles even on a 1-wide machine. Decode is at least 3-cycles on a GBOoO machine. Forwarding is approximately ½ cycle.

    -----------------------------

    Having done designs in each of these arenas:: I lean towards 16-gates on
    narrow machines and 20-gates on GBOoO machines.

  • From EricP@21:1/5 to Paul A. Clayton on Fri Mar 22 10:23:09 2024
    Paul A. Clayton wrote:
    On 1/22/24 9:44 AM, Paul A. Clayton wrote:
    [snip]
    Obviously an extremely biased workload like the data analysis
    workloads targeted by Intel's research chip would probably show
    A55 in a better light (though A55 would likely be very inefficient
    compared to the research design, I think it used 4-way threaded
    in-order cores with limited cache and narrow memory channels [to avoid
    64-byte accesses to access 64-bits or less of data]), but
    that would not be "fair".

    I (finally) found a reference to the Intel research chip. https://ieeexplore.ieee.org/document/10188866
    "The Intel Programmable and Integrated Unified Memory Architecture
    Graph Analytics Processor" (Sriram Aananthakrishnan et al., 2023)
    A PDF of the paper appears to be available at https://heirman.net/papers/aananthakrishnan2023piuma.pdf

    Interesting. Thanks.
    I haven't finished reading it but one thing I notice is that since
    normally all of the chased pointers are virtual addresses, while they
    mention "Address translation tables (ATT)", I didn't see how they
    actually DO the virtual address translation during these offloaded chases.

    Also interesting are some of the authors' other recent publications. E.g.:

    https://scholar.google.com/citations?hl=en&user=bUTgzBUAAAAJ&view_op=list_works&sortby=pubdate

    https://scholar.google.com/citations?hl=en&user=ySqvmSQAAAAJ&view_op=list_works&sortby=pubdate

    This is a different approach to OoO uArch.
    Existing OoO designs work on the basis that most things are serial and
    predictable. This approach is optimized for sparse: short sequential code
    segments intermixed with sparse conditional code segments, chasing sparse data.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Mar 24 19:00:22 2024
    Paul A. Clayton wrote:

    On 2/25/24 5:22 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    [snip]
    When I looked at the pipeline design presented in the Arm Cortex-
    A55 Software Optimization Guide, I was surprised by the design.
    Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
    (ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
    DIV/SWRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
    stage before an ALU stage (clearly for AArch32).

    Almost like an Mc88100 which had 5 pipelines.

    I think I have an incorrect conception of data communication
    (forwarding and register-to-functional-unit). I also seem to be
    conflating somewhat issue port and functional unit. Forwarding
    from nine locations to nine locations and the remaining eight
    locations to eight locations (counting functional unit as a single
    target location even though a functional unit may have three
    functionally different input operands).

    Much newer µArchitectural literature does not draw a firm box
    properly around real function units.

    For example, Mc 88120 has 6 function units buffered by 6 reservation
    stations. Each function unit had an Integer Adder including things
    like the branch resolution unit, FADD, and FMUL. When I drew those
    boxes, I would show post-forwarding operands arriving at the FU
    and then after arriving either being diverted to the INT unit or
    being diverted to the "other" function unit. This way you could
    count operand and result busses and end points for fan-in::fan-out
    reasons.

    This style seems to have fallen from favor; possibly because we made
    the transition from value-containing reservation stations to value-
    free reservation stations--alleviating register file porting problems.

    I am used to functionality being merged; e.g., the multiplier also
    having a general ALU. Merged functional units would still need to
    route the operands to the appropriate functionality, but selecting
    the operation path for two operands *seems* simpler than selecting
    distinct operands and separate functional unit independently. This
    might also be a nomenclature issue.

    The above remains my style in µArchitecture literature, but when
    describing block diagram and circuit design levels, only the interior
    of the function unit is illustrated.

    If one can only begin two operations in a cycle, the generality of
    having nine potential paths seems wasteful to me. Having separate
    paths for FP/Neon and GPR-using operations makes sense because of
    the different register sets (as well as latency/efficiency-
    optimized functional units vs. SIMD-optimized functional units;
    sharing execution hardware is tempting but there are tradeoffs).

    In general, operand timing is tight and you better not screw it up;
    while result delivery timing only has to deal with fan-out and data
    arrival issues.

    My style was conceived back in the days where wires were fast and
    metal was precious (3 layers). Now that we have 12-15 layers it
    matters less, I suppose.

    With nine potential issue ports, it seems strange to me that width
    is strictly capped at two.

    Likely to be a register porting or a register port analysis limitation. Value-free reservation stations exacerbate this.

    Even though AArch64 does not have My
    66000's Virtual Vector Method to exploit normally underutilized,
    there would be cases where an extra instruction or two could
    execute in parallel without increasing resources significantly. As
    an outsider, I can only assume that any benefit did not justify
    the costs in hardware and design effort. (With in-order execution,
    even a nearly free [hardware] increasing of width may not result
    in improved performance or efficiency.)

    VVM works best with value-containing reservation stations.

    The separation of MAC and DIV is mildly questionable — from my
    very amateur perspective — not supporting dual issue of a MAC-DIV
    pair seems very unlikely to hurt performance but the cost may be
    trivial.

    Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;

    I also assume the other operations are usually available for
    parallel execution (though this depends somewhat on compiler
    optimization for the microarchitecture), so execution of a
    multiply and a divide in parallel is probably uncommon.

    In general, any 2 calculations that are not data-dependent, can
    be launched into execution without temporal binds.

    The FP/Neon section has these operations merged into a functional
    unit; I guess — I am not motivated to look this up — that this is
    because FP divide/sqrt use the multiplier while integer divide
    does not.

    The Chips and Cheese article also indicated that branches are only
    resolved at writeback, two cycles later than if branch direction
    was resolved in the first execution stage. The difference between
    a six stage misprediction penalty and an eight stage one is not
    huge, but it seems to indicate a difference in focus. With

    In an 8 stage pipeline, the 2 cycles of added delay should hurt by
    ~5%-7%
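
    A back-of-envelope way to see where a figure of that size can come from: the added cycles per instruction are roughly (fraction of instructions that are branches) x (misprediction rate) x (extra penalty cycles). The Python below is only an illustration; the base CPI, branch fraction, and misprediction rate are assumed values chosen for the example, not measurements of the A55.

```python
def slowdown(base_cpi, branch_fraction, mispredict_rate, extra_cycles):
    """Fractional slowdown from lengthening the misprediction penalty.

    All inputs are assumptions supplied by the caller; the model ignores
    overlap with other stalls.
    """
    added_cpi = branch_fraction * mispredict_rate * extra_cycles
    return added_cpi / base_cpi


# Illustrative inputs only: CPI 0.8, 20% branches, 10% mispredicted,
# 2 extra cycles of resolution delay.
estimate = slowdown(0.8, 0.2, 0.1, 2)
```

    With those assumed inputs the estimate lands at about 5%, which is the low end of the range quoted above; a lower base CPI or a worse predictor pushes it higher.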

    5% performance loss sounds expensive for a something that *seems*
    not terribly expensive to fix.

    [snip]
    I would have *guessed* that an AGLU (a functional unit providing
    address generation and "simple" ALU functions, like AMD's Bobcat?)
    would be more area and power efficient than having separate
    pipelines, at least for store address generation.

    Be careful with assumptions like that. Silicon area with no moving
    signals is remarkably power efficient.

    There is also the extra forwarding for separate functional units
    (and perhaps some extra costs from increased distance), but I
    admit that such factors really expose my complete lack of hardware experience. (I am aware of clock gating as a power saving
    technique and that "doing nothing" is cheap, but I have no
    intuition of the weights of the tradeoffs.)

    Mc 88120 had forwarding into the reservation stations and forwarding
    between reservation station output and function unit input. That is
    a lot of forwarding.

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

    [snip interesting stuff]
    Perhaps mildly out-of-order designs (say a little more than the
    PowerPC 750) are not actually useful (other than as a starting
    point for understanding out-of-order design). I do not understand
    why such an intermediate design (between in-order and 30+
    scheduling window out-of-order) is not useful. It may be that

    It is useful, just not all that much.

    going from say 10 to 30 scheduler entries gives so much benefit
    for relatively little extra cost (and no design is so precisely
    area constrained — even doubling core size would not mean pushing
    L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
    emotional investment in intermediate and mixed designs.

    10 does not accommodate much ILP beyond that of a 10 deep pipeline.
    30 accommodates L1 cache misses and typical FP latencies.
    90 accommodates "almost everything else"
    250 accommodates multiple L1 misses with L2 hits and "everything
    else".

    Presumably the benefit depends on issue width and load-to-use
    latency (pipeline depth, cache capacity, etc.). [For a cheap
    "general purpose" processor, not covering FP latencies well may
    not be very important.] Better hiding L1 _hit_ latency would seem
    to provide a significant fraction of the frequency and ILP benefit
    of out-of-order for a smallish core. (Some branch resolution
    latency can also be hidden; an in-order core can delay resolution
    until writeback of control-dependent instructions, but OoO's extra
    buffering facilitates deeper speculation.)

    If one has a scheduling window of 90 operations, having only three
    issue ports seems imbalanced to me.

    I agree:: for Mc 88120 we had 96 instructions (max) in flight for
    a 6-wide {issue, launch, execute, result, and retire}, we also
    had 16-cycle execution window, so to stream DGEMM (from Matrix300)
    we had to execute a LD {which would miss ½ the time} and then have
    4 cycles for FMUL and 3 cycles for FADD allowing ST to capture the
    FADD result and ship it off to cache. Going backwards; 16-(1+3+4)
    meant the LD->L1$->miss->memory->LDalign had only 8 cycles.

    The modern version with FMAC would allow 11-cycles LD-Miss-Align.
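
    The cycle budget above is simple subtraction, sketched here in Python. The 16-cycle window, the 4-cycle FMUL, the 3-cycle FADD, and the remaining 1 cycle come from the post; the 4-cycle FMAC latency is an assumption inferred from the quoted 11-cycle result, and `cycles_left_for_miss` is a hypothetical name.

```python
def cycles_left_for_miss(window_cycles, fixed_latencies):
    """Cycles available for LD -> L1 miss -> memory -> align.

    window_cycles is the execution window depth; fixed_latencies are the
    cycles consumed by the rest of the dependent chain.
    """
    remaining = window_cycles - sum(fixed_latencies)
    if remaining < 0:
        raise ValueError("dependent chain does not fit in the window")
    return remaining


# Separate FMUL (4) and FADD (3), plus the 1 remaining cycle from the post.
separate = cycles_left_for_miss(16, [1, 3, 4])
# Fused FMAC, assumed here to take 4 cycles, plus the same 1 cycle.
fused = cycles_left_for_miss(16, [1, 4])
```

    Under these numbers `separate` is 8 and `fused` is 11, matching the two figures given in the post.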

    Out-of-order execution would also seem to facilitate opportunistic
    use of existing functionality. Even just buffering decoded
    instructions would seem to allow a 16-byte (aligned) instruction
    fetch with two instruction decoders to issue more than two
    instructions on some cycles without increasing register port
    count, forwarding paths, etc. OoO would further increase the
    frequency of being able to do more work with given hardware
    resources.

    My 66150 does 16B fetch and parses 2 instructions per cycle,
    even though it is only 1-wide. By fetching wide, and scanning
    ahead, one can identify branches and fetch their targets prior
    to executing the branch, eliminating the need for the delay-slot
    and reducing branch taken overhead down to about 0.13 cycles
    even without branch prediction !!

    But anything wider than 1-instruction will need a branch predictor
    of some sort.

    Perhaps there may even be a case for a 1+ wide OoO core, i.e., an
    OoO core which sometimes issues more than one instruction in a
    cycle.

    For something like a smart phone, one or two small cores might be
    useful for background activity, tasks whose latency (within a
    broad range) is not related to system responsiveness for the user.

    For a server expected to run embarrassingly parallel workloads, if

    Servers are not expected to run embarrassingly parallel applications,
    they are expected to run an embarrassingly large number of essentially
    serial applications.

    Shared caching of instructions still seems beneficial in "server
    workloads" compared to fully general multiprogram workloads. A
    database server might even have more sharing, potentially having a
    single process (so page table sharing would be more beneficial),
    but that seems a less common use.

    a wimpy core provides sufficient responsiveness, I would expect
    most of the cores (possibly even all of the cores) to be wimpy.
    There might not be many workloads with such characteristics;

    Talk to Google about that....

    Urs Hölzle of Google put out a paper "Brawny cores still beat
    wimpy cores, most of the time"(2010). While some of the points —
    such as tail latency effects and software development costs —
    made in the paper are (in my opinion) quite significant, I thought
    the argument significantly flawed. (I even wrote a blog post about
    this paper: https://dandelion-watcher.blogspot.com/2012/01/weak-case-against-wimpy-cores.html)

    The microservice programming model (motivated, from what I
    understand, by problem-size and performance scaling and service
    reliability with moderately reliable hardware without requiring
    much programming effort to support scaling) may also have
    significant implications on microarchitecture.

    The design space is also very large. One can have heterogeneity of
    wimpy and brawny cores at the rack level, wimpy-only chips within
    a heterogeneous package, heterogeneity within a chip, temporal
    heterogeneity (SMT and dynamic partitioning of core resources),
    etc. Core strength can vary widely and performance balance can be
    diverse (e.g., a core with a quarter of the performance of a
    brawny core on general tasks might have — with coprocessors,
    tightly coupled accelerators, or general microarchitecture —
    approximately equal performance for some tasks).

    With a "proper interface" one should be able to off-load any
    crypto processing to a place that is both time-constant and
    where sensitive data never passes into the cache hierarchy of
    an untrusted core.

    The performance of weaker cores can also be increased by
    increasing communication performance within local groups of such
    cores. Exploiting this would likely require significant
    programming effort, but some of the effort might be automated
    (even before AI replaces programmers). This assumes that there is
    significant communication that is less temporally local than
    within a core (out-of-order execution changes the temporal
    proximity of value communication; a result consumer might be
    nearby in program order but substantially more distant in
    execution order) and that intermediate resource allocation to
    intermediate latency/bandwidth communication can be beneficial.

    (I also think that there is an opportunity for optimization in the
    on-chip network. Optimizing the on-chip network for any-to-any
    communication seems less appropriate for many workloads not only
    because of the often limited scale of communication but also
    because the communication is, I suspect, often specialized.

    And often necessarily serialized.

    Getting a network design that is very good for some uses and
    adequate for others seems challenging even with software cooperation.

    See:: https://www.tachyum.com/media/pdf/tachyum_20isc20.pdf

    Rings seem really nice for pipeline-style parallelism and some
    other uses, crossbars seem nice for small node groups with heavy communication, grids seem to fit large node counts with nearest
    neighbor communication (physical modeling?), etc. Channel width,
    flit size, channel count also involve tradeoffs. Some
    communication does not require sending an entire cache block of
    data, but a smaller flit will have more overhead.)

    We are arriving at the scale where we want to ship a cache line of data
    in a single clock in order to have sufficient coherent BW for 128+ cores.

  • From Scott Lurndal@21:1/5 to [email protected] on Sun Mar 24 20:39:18 2024
    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

    I suspect Paul is referring to what ARMv8 calls "System Registers";
    despite the name, most are stored in flops, and in the case of
    the ID registers, wires (perhaps anded with local e-fuses).

    Accesses to some of them are self-synchronizing[*];
    the rest must be followed by an appropriate barrier
    instruction for the effects to be architecturally visible.

    [*] E.g. ICC_IAR1_EL1 (An interrupt acknowledge register).

  • From Anton Ertl@21:1/5 to Paul A. Clayton on Mon Mar 25 08:41:06 2024
    "Paul A. Clayton" <[email protected]> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)
    ...
    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly
    inconceivable for something like the MIPS R2000.

    Certainly. The A55 is similar to the 21164 (1994), which is much
    bigger than the R2000. For competition to the R2000, better look at
    the ARM1/ARM2, or, for something more contemporary, maybe the
    Cortex-M1.

    An argument might be made that some designs would have no use for
    most of such extra state. Performance monitoring is useful for
    software development (and theoretically for OS decisions for
    scheduling, core migration, and other functions), but seems likely
    to be highly underutilized for typical use. A55 is presumably
    large enough that a synthesis-time removal of much of this
    functionality would have a tiny effect on total area.

    ARM also has the Cortex-A35 (with a 25% smaller core than the A53 and
    80-100% of its performance according to ARM). I am unaware of it
    being used in smartphones, though.
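
    Taking ARM's figures at face value, the implied performance per unit of
    core area can be worked out as below. This is only arithmetic on the two
    numbers quoted above; the function name is made up for the example.

```python
def perf_per_area(rel_perf: float, rel_area: float) -> float:
    """Relative performance divided by relative core area (A53 = 1.0)."""
    return rel_perf / rel_area


A35_REL_AREA = 0.75   # "25% smaller core than the A53"
low = perf_per_area(0.80, A35_REL_AREA)    # 80% of A53 performance
high = perf_per_area(1.00, A35_REL_AREA)   # 100% of A53 performance
print(f"A35 perf/area relative to A53: {low:.2f}x to {high:.2f}x")
```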

    Even for a
    microcontroller the area cost might not be problematic.

    ARM-A is not for microcontrollers. ARM has ARM-M for that, e.g., the
    Cortex-M0 if you want it to be really small.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From EricP@21:1/5 to Paul A. Clayton on Mon Mar 25 12:36:27 2024
    Paul A. Clayton wrote:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly
    inconceivable for something like the MIPS R2000.

    Many of these registers are configuration control that
    get set once at boot and never change.
    Others are dynamic but not time critical, like debug registers.

    Only a small number would be diddled on a regular basis,
    like interrupt control.

    They don't all need the same access speed -
    depending on usage some (most?) can be on "slow" buses
    that maybe take multiple clocks to read or write.

  • From EricP@21:1/5 to EricP on Mon Mar 25 13:03:59 2024
    EricP wrote:
    Paul A. Clayton wrote:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly
    inconceivable for something like the MIPS R2000.

    Many of these registers are configuration control that
    get set once at boot and never change.
    Others are dynamic but not time critical, like debug registers.

    Only a small number would be diddled on a regular basis,
    like interrupt control.

    They don't all need the same access speed -
    depending on usage some (most?) can be on "slow" buses
    that maybe take multiple clocks to read or write.

    Also accessing many control registers must not occur out of order
    and must be guarded either implicitly or explicitly by instructions
    or uOps before and after to drain the pipeline.

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Mon Mar 25 17:04:44 2024
    "Paul A. Clayton" <[email protected]> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

    I suspect Paul is referring to what ARMv8 calls "System Registers";

    Yes. (There were also some debug registers, performance monitoring
    registers, trace registers, etc.)

    despite the name, most are stored in flops, and in the case of
    the ID registers, wires (perhaps anded with local e-fuses).

    Yes, many of the bits would be implemented as ROM/PROM and many
    would presumably be scattered about because they control/interact
    with specific functionality. They are similar to I/O device
    registers. (I/O devices have also become more complex.)

    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly
    inconceivable for something like the MIPS R2000.

    Yes, there are over 1000 system registers. Most of them are
    only present and implemented if associated feature(s) are supported
    by the implementation.

    The MIPS 2000 was designed three decades ago and implemented in
    a 2 micrometer node. Whose law states that logic will expand to
    fill the area available :-)?


    An argument might be made that some designs would have no use for
    most of such extra state. Performance monitoring is useful for
    software development (and theoretically for OS decisions for
    scheduling, core migration, and other functions), but seems likely
    to be highly underutilized for typical use.

    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Mar 25 17:38:58 2024
    Scott Lurndal wrote:

    "Paul A. Clayton" <[email protected]> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:

    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    My 66000 Architecture defines 8 performance counters at each layer of
    the design:: cores get 8 counters, L1s get 8 counters, L3s get 8
    counters, Interconnect gets 8 counters, Memory Controller gets 8 counters,
    PCIe root gets 8 counters--and every instance multiplies the counters.
    All counters are available via MMI/O space, and can be copied out or
    reinitialized in a single LDM, STM, or MM instruction. Any thread with
    a TLB mapping can read or write based on permission bits.
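
    To make the "every instance multiplies the counters" point concrete, here
    is a small sketch. The 8 counters per instance come from the paragraph
    above; the instance counts describe a hypothetical system invented for
    the example.

```python
COUNTERS_PER_INSTANCE = 8   # from the description above

# Hypothetical system: these instance counts are made up for illustration.
instances = {
    "core": 16,
    "L1": 16,
    "L3": 4,
    "interconnect": 1,
    "memory_controller": 2,
    "pcie_root": 1,
}


def total_counters(instance_counts: dict, per_instance: int = COUNTERS_PER_INSTANCE) -> int:
    """Every instance of every layer carries its own set of counters."""
    return per_instance * sum(instance_counts.values())


print(total_counters(instances), "counters in this hypothetical system")
```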

  • From Anton Ertl@21:1/5 to Scott Lurndal on Mon Mar 25 18:35:35 2024
    [email protected] (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to [email protected] on Mon Mar 25 18:23:50 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    "Paul A. Clayton" <[email protected]> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:

    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    My 66000 Architecture defines 8 performance counters at each layer of
    the design:: cores get 8 counters, L1s get 8 counters, L3s get 8
    counters, Interconnect gets 8 counters, Memory Controller gets 8 counters,
    PCIe root gets 8 counters--and every instance multiplies the counters.

    It's not really the number of counters that is important, rather
    it is what the counters count (i.e. which events can be counted).
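
    A minimal sketch of why the countable events matter more than the number
    of counters: a derived metric is only available when both of its
    underlying events can be counted. The event names and sample values here
    are hypothetical, not taken from any real PMU.

```python
from typing import Optional


def ratio(events: dict, num: str, den: str) -> Optional[float]:
    """Return events[num] / events[den], or None if either event is missing."""
    if num not in events or den not in events or events[den] == 0:
        return None
    return events[num] / events[den]


# Hypothetical readout: plenty of counts, but no L1D miss event on offer.
sample = {"cycles": 2_000_000, "instructions": 3_000_000, "l1d_access": 800_000}

ipc = ratio(sample, "instructions", "cycles")          # computable
miss_rate = ratio(sample, "l1d_miss", "l1d_access")    # not computable
print("IPC:", ipc)
print("L1D miss rate:", miss_rate)
```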

  • From John Dallman@21:1/5 to Anton Ertl on Mon Mar 25 20:22:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:
    [email protected] (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating Aarch64 cores. I'd expect
    most of the latter to want those features so that they can understand the
    performance of their silicon better.

    John

  • From Scott Lurndal@21:1/5 to John Dallman on Mon Mar 25 20:46:39 2024
    [email protected] (John Dallman) writes:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:
    [email protected] (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating Aarch64 cores. I'd expect
    most of the latter to want those features so that they can understand the
    performance of their silicon better.

    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    Look at vtune, for example.

  • From Scott Lurndal@21:1/5 to Terje Mathisen on Mon Mar 25 20:48:08 2024
    Terje Mathisen <[email protected]> writes:
    Anton Ertl wrote:
    [email protected] (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Scan chains. The modern interface to scan chains (which we used on the
    mainframes in the late 70's/early 80's) is JTAG.
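
    For readers who have not met scan chains, here is a toy model of the
    bit-serial read-out: the flops are linked into one long shift register
    and the state comes out one bit per scan clock. The class and method
    names are invented for this sketch; real scan and JTAG hardware involves
    much more (TAP state machine, multiple chains, and so on).

```python
class ScanChain:
    """Toy scan chain: flops linked into a single shift register."""

    def __init__(self, bits):
        self.flops = list(bits)

    def shift(self, scan_in=0):
        """One scan clock: the last flop drives scan-out, the rest move down."""
        scan_out = self.flops[-1]
        self.flops = [scan_in] + self.flops[:-1]
        return scan_out

    def dump(self):
        """Shift out the whole state, feeding each bit back in to restore it."""
        out = []
        for _ in range(len(self.flops)):
            bit = self.flops[-1]
            out.append(self.shift(scan_in=bit))
        return out


chain = ScanChain([1, 0, 1, 1])
state = chain.dump()   # bits appear last flop first
print(state, chain.flops)
```

    Reading N flops this way costs N scan clocks, which is why it suits
    debugging dumps rather than low-overhead monitoring.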

  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Mar 25 21:42:18 2024
    Anton Ertl wrote:
    [email protected] (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to John Dallman on Tue Mar 26 09:22:31 2024
    [email protected] (John Dallman) writes:
    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating Aarch64 cores. I'd expect
    most of the latter to want those features so that they can understand the
    performance of their silicon better.

    That might explain why for the AmLogic S922X in the Odroid N2/N2+
    there is a Linux 4.9 kernel that supports performance monitoring
    counters (AmLogic put that in for their own uses), but the mainline
    Linux kernel does not support perf on the S922X (perf was not in the
    requirements of whoever integrated the S922X stuff into the mainline).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Terje Mathisen@21:1/5 to Scott Lurndal on Tue Mar 26 10:47:07 2024
    Scott Lurndal wrote:
    Terje Mathisen <[email protected]> writes:
    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Scan chains. The modern interface to scan chains (which we used on the
    mainframes in the late 70's/early 80's) is JTAG.


    Thanks!

    JTAG was indeed the term I was looking for (and not remembering). Maybe
    I'm getting old?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 26 09:27:54 2024
    [email protected] (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a
    long-running program.

    Look at vtune, for example.

    And?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Mar 26 14:15:41 2024
    [email protected] (Anton Ertl) writes:
    [email protected] (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a
    long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance, they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Oct 1 18:45:11 2024
    On Tue, 26 Mar 2024 14:15:41 +0000, Scott Lurndal wrote:

    [email protected] (Anton Ertl) writes:
    [email protected] (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a
    long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance, they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    It is sequence compliance. At this point in the game all the FUs
    are known to produce correct results. But we live in a world
    where::
    a) The test case takes the correct number of cycles
    b) leaves all the right bit patterns in registers and memory
    c) took the right directions at all the branches
    d) and went through an invalid sequence to get there.

    HW verification is mostly about proving the sequencers are correct.
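
    The a) to d) situation above can be sketched as a checker that looks at
    the path taken as well as the end result. The state names and the table
    of legal transitions are hypothetical, invented for this example, and are
    far simpler than any real sequencer.

```python
# Hypothetical legal transitions of a toy sequencer; not from a real design.
LEGAL = {
    "IDLE": {"FETCH"},
    "FETCH": {"DECODE"},
    "DECODE": {"EXECUTE"},
    "EXECUTE": {"RETIRE"},
    "RETIRE": {"IDLE", "FETCH"},
}


def first_illegal_step(trace):
    """Index of the first illegal transition in trace, or None if all legal."""
    for i in range(len(trace) - 1):
        if trace[i + 1] not in LEGAL.get(trace[i], set()):
            return i
    return None


def check(trace, final_state, expected_state, cycles, expected_cycles):
    """End-state checks alone can pass while the sequence check fails."""
    return {
        "results_ok": final_state == expected_state,
        "cycles_ok": cycles == expected_cycles,
        "sequence_ok": first_illegal_step(trace) is None,
    }


good = ["IDLE", "FETCH", "DECODE", "EXECUTE", "RETIRE"]
bad = ["IDLE", "FETCH", "EXECUTE", "DECODE", "RETIRE"]   # same length, wrong order

print(check(good, {"r1": 5}, {"r1": 5}, 4, 4))
print(check(bad, {"r1": 5}, {"r1": 5}, 4, 4))
```

    The second trace has the right cycle count and the right register
    contents, yet fails the sequence check, which is the case a test that
    only compares final state would miss.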
