Forum: >>> Magnum BBS <<<

Re: Efficiency of in-order vs. OoO

From Anton Ertl@21:1/5 to Paul A. Clayton on Wed Jan 24 07:47:31 2024

"Paul A. Clayton" <[email protected]> writes:

In AnandTech's Exynos9820 comparison, one knows the process used
is the same, but one does not know how optimized the designs were
at the HDL level nor at the netlist ("compiled") level. It is also
possible to optimize the same HDL for different power-performance-
area targets.

Supposedly the A55 is there for efficiency, so if different
optimizations were applied, one would assume that the A55 was
optimized for perf/W.

I would not be surprised if ARM did not invest the same design
effort per unit performance (e.g.) in A55 as in A75.

The A55 served as little core for the A75, A76, A77, and A78, and it
served as only core for a significant number of SoCs. Why should ARM
make it performance-inefficient and power-inefficient like you
suggest? It certainly is more performance-efficient than the A53:

LaTeX benchmark, numbers are times in seconds:

- Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
- Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105
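
As a rough cross-check of the two results above, one can normalise the run times by clock rate. This is only a sketch: it assumes the listed clock rates were sustained for the whole run, and it ignores the differences in OS, TeX distribution, and memory system between the two boards.

```python
# Run times (seconds) and clock rates (MHz) copied from the list above.
results = {
    "A53 (Odroid N2)": (2.488, 1896),
    "A55 (Rock 5B)": (2.105, 1805),
}

def gigacycles(seconds, mhz):
    """Clock cycles consumed by the run, in units of 1e9 cycles."""
    return seconds * mhz * 1e6 / 1e9

cycles = {name: gigacycles(t, f) for name, (t, f) in results.items()}
ratio = cycles["A53 (Odroid N2)"] / cycles["A55 (Rock 5B)"]

for name, c in cycles.items():
    print(f"{name}: {c:.2f} Gcycles")
print(f"A53/A55 cycle ratio: {ratio:.2f}")
```

Under those assumptions the A53 needs roughly 24% more cycles than the A55 for the same job, so the A55's advantage is not just a matter of clock rate.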

Likewise, I could imagine Samsung putting less effort into
optimizing A55.

What effort do you expect on Samsung's side? They only had a few
months from the time when ARM gives them the IP until they tape out
the Exynos 9820, so I expect they used the IP as-is, only deciding on
some process parameters.

Performance optimization likely makes less sense
for background tasks (the likely targeted use for A55 in this
case) and the benefit of core-level power optimization is likely
less significant than the I/O power for many of the targeted
tasks. Even optimizing for low energy cost for bursty workloads
(useless energy for sleep/wakeup, e.g.) would probably not help
much because of system power consumption.

So you speculate (without evidence) that ARM/Samsung optimized the A55
for low performance and low power-efficiency? Not very plausible.

In any case, the A53/A55/A5xx line of ARM are the only cases where
core designers have not switched from in-order to OoO designs (unlike
Intel with their E-Cores and Xeon Phi where they switched from
in-order to OoO in the face of power efficiency being of supreme
importance for the E-Core line). And now you write that ARM did not
design it for power efficiency. If you are right, that supports the
position that in-order is uncompetitive not just wrt performance, but
also perf/W as soon as there are relatively low performance
requirements.

The memory system, on-chip network, and such would also affect the
energy efficiency. Exynos9820's memory system might _reasonably_
be optimized for high power/high performance use; that would tend
to hurt the efficiency of wimpy cores.

What scenario do you imagine where one would want these in-order
cores? ARM's niche for them is the little cores in a big.LITTLE
design; that is necessarily coupled with a memory system with a high
bandwidth. There are also SoCs with only A55 cores (no BIG ones) like
the RK3566, but they are only bought because of the price, not because
of their power-efficiency.

I think system power is also less likely to scale well downward
with performance. E.g., the same capacity L2 suited to one A75
core might properly service more than two A55 cores. If the design
had more A55 cores per L2 than A75 cores per L2, the A55 cores
could be at a power disadvantage in single threaded use just from
the L2 cache.

In the Exynos 9820 the A55s have no private L2 cache, and they access
1MB (shared with all other cores) out of the 4MB L3 cache (3MB of
which are exclusive to the two M4 cores). This is Samsung's work, and
it is not plausible that they optimized this setup for power
inefficiency.

One might be able to adjust for system power scaling factors by
using all cores of a type for a run (e.g., SPECrate), but I
suspect that would be tricky given fixed aspects of the hardware.

<https://images.anandtech.com/doci/14072/Exynos9820-Perf-Estimated_575px.png>
<https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
from the article

<https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/4>
In the Exynos 9820, we see at different points of the DVFS curve:

     A55 (in-order)    |       A75 (OoO)
 perf    mW    pf/mW   |  perf    mW    pf/mW
  1.0    22    0.046   |   3.7    88    0.042   highest efficiency point for each core
  1.4    33    0.042   |   3.7    88    0.042   same pf/mW at highest common efficiency
  2.7    90    0.030   |   3.7    88    0.042   same mW at lowest common mW
  5.1   400    0.013   |   5.1   124    0.041   same perf at highest common performance
  5.1   400    0.013   |  10.5   400    0.027   same mW at highest common mW
  5.1   400    0.013   |  17.2  1270    0.013   highest performance point for each core

"perf" is SPEC2006 Int+FP Geomean. "pf/mW" (shown as "Perf/W" in the
second graph) is SPEC Int+FP Geomean/mW (you can confirm this by
computing corresponding numbers from the first graph).
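
The consistency check suggested above can be sketched in a few lines. The values are copied from the table; since they were read off graphs, small deviations between perf/mW and the listed pf/mW are expected, and the tolerance one accepts is a judgment call.

```python
# (perf, mW, pf/mW as listed) for each row of the table above.
a55 = [(1.0, 22, 0.046), (1.4, 33, 0.042), (2.7, 90, 0.030), (5.1, 400, 0.013)]
a75 = [(3.7, 88, 0.042), (5.1, 124, 0.041), (10.5, 400, 0.027), (17.2, 1270, 0.013)]

def max_deviation(points):
    """Largest gap between perf/mW computed from the first two columns
    and the pf/mW value listed in the third column."""
    return max(abs(perf / mw - listed) for perf, mw, listed in points)

for name, points in (("A55", a55), ("A75", a75)):
    print(f"{name}: max deviation {max_deviation(points):.4f} perf/mW")
```

For both cores the computed and listed values agree to within 0.001 perf/mW, which is about what one would expect from reading numbers off a plot.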

The SPEC2006 workload probably also biases in favor of larger
cores, especially the FP portion.

So you admit that in-order is not efficient for SPEC2006? Given that
SPEC CPU benchmarks are commonly accepted as representative for tasks
where CPU is relevant, what does that tell us about in-order cores?

I suspect A55 uses 64-bit width
SIMD execution (which makes sense for the targeted use), which
would substantially reduce SPECFP performance and possibly degrade
SPECINT performance.

The power consumption should be correspondingly lower. If there was a power-efficiency advantage to in-order, you should still see it.

Even the gcc component of SPECINT might be more compute dense than
the targeted workloads for A55 (which might often be more
performance constrained by I/O) and gcc is probably less "compute
dense" than other SPECINT components.

So the supposed power efficiency of in-order is only relevant for
workloads that don't compute, but only wait for I/O all the time?
Even if it was, the solution seems to me to be to get rid of most of this
waiting by moving it off the CPU to some dedicated circuit. And AFAIK
in all I/O that moves a lot of data (e.g., block devices, network
devices), that has happened.

Obviously an extremely biased workload like the data analysis
workloads targeted by Intel's research chip would probably show
A55 in a better light (though A55 would likely be very inefficient
compared to the research design, I think it used 4-way threaded
in-order cores with limited cache and narrow memory channels [to
avoid 64-byte accesses to access 64-bits or less of data]), but
that would not be "fair".

I have no idea what Intel research chip you have in mind, but
certainly, for special workloads specialized designs have been used successfully. In some cases, these workloads generate so much revenue
that they result in their own specialized processors, as happened with
GPUs which then have also been used for HPC and AI as GPGPUs (or is
there a new word for that?).

But programs for GPGPUs do not run efficiently on CPUs and vice versa,
and the question at hand is if in-order is power-efficient for CPUs.
And above a certain (relatively low) performance level, the answer
seems to be: No.

Core efficiency cannot be isolated from the system, especially if
measured by system resource use (I *suspect* AnandTech measured
system power and subtracted idle system power).

You can read about how it was measured in the article: https://www.anandtech.com/print/14072/the-samsung-galaxy-s10plus-review

Fair comparison is difficult, especially when the design targets
are different.

I think that the comparison is as fair as we can get. Of course if
for some reason you don't want to be convinced, there are always some
straws that you can grasp in the hope that they will save the belief
system you favour. But if you look at it objectively, all evidence
there is (from Transmeta through Intel's E-cores and the lack of
in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
data) supports the position that in-order is not more power-efficient
than OoO above a certain performance level, while the opposite
position cannot point to evidence, but only to some corners where we
don't have evidence, and where in-order fans hope that these corners
will favour in-order.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Quadibloc@21:1/5 to Anton Ertl on Wed Jan 24 18:38:53 2024

On Wed, 24 Jan 2024 07:47:31 +0000, Anton Ertl wrote:

I think that the comparison is as fair as we can get. Of course if
for some reason you don't want to be convinced, there are always some
straws that you can grasp in the hope that they will save the belief
system you favour. But if you look at it objectively, all evidence
there is (from Transmeta through Intel's E-cores and the lack of
in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
data) supports the position that in-order is not more power-efficient
than OoO above a certain performance level, while the opposite
position cannot point to evidence, but only to some corners where we
don't have evidence, and where in-order fans hope that these corners
will favour in-order.

Above a certain performance level, _all_ cores are out-of-order.

If in-order is more power-efficient than out-of-order at *low*
performance levels, then the basic notion that implementing
out-of-order requires some extra transistors, and transistors
take power, is confirmed. That basic notion is what leads
people to hope that, if in-order could be extended to higher
performance levels, then it would provide power savings there
too.

Let us then imagine what a high-performance in-order CPU would
look like. Its goal would be to achieve what OoO achieves to
improve performance without being OoO.

Thus, such a CPU would have a giant architectural register
file - to match the large hardware register files, including
rename registers, of OoO systems.

So we're talking AMD 29000 or Itanium. AMD sold off the 29000,
and it's still being used for compatibility reasons in some
aviation hardware.

The sample size is small, and so it's not that unreasonable to
argue that although the Itanium failed to meet expectations, this
class of architectures may still deserve some more investigation
and study. Yes, there's no high-performance OoO-beating in-order
chip you can buy off the shelf today, but maybe it's still worth
trying to design one.

What arguments are there against that? I can see a few:

- It's been tried many times, and failed each time. (This
doesn't _seem_ to be the case, but the few times it was
tried may have been enough to prove the point.)

- The benefits of in-order at high performance are known
to be negligible. (That is, the gate cost of OoO at
high performance scales well, and becomes a decreasing
fraction of transistor count in higher-performance designs.
Mitch tells us that the GBOoO direction of progress is
*not sustainable*, so that doesn't seem to be the case.)

- The drawbacks of in-order outweigh their benefits.
(If you have larger register files, you have bigger
instructions, so you fetch more code out of DRAM.
Is that really enough to make the difference?)

However, in framing this counterargument in favor of
in-order, the *fatal* drawback of in-order for high
performance has dawned on me. (Although Ivan Godard
in his Mill design is, in fact, making an effort to
address just this particular drawback!)

As Mitch notes, to further increase performance, OoO
has become GBOoO: ever larger hardware register files
and so on.

This means that, even if an in-order design which had
large architectural register files, an exposed pipeline,
and so on, matched _current_ OoO CPUs in performance
for less power...

the performance of OoO CPUs doesn't stand still...

and so the _next generation_ of the in-order design
would have to have *larger* register files (and, no
doubt, all sorts of other things)...

which means it wouldn't be upwards-compatible with
software for the last generation.

That's why in-order RISC ended up being succeeded by
OoO implementations of the same ISA! Going from 32
registers to 128 registers to stay in-order... isn't
just something you can *do only once*, and solve the
problem forever!

It is by looking at the real problem that the false hope
of high-performance in-order can finally be dashed. Maybe
it isn't technically impossible. But for the mass market
that wants to coalesce around a popular and stable
platform, it may not be able to meet *their* requirements,
even if such architectures could still find a niche
(like supercomputers that are only programmed by the
users themselves in FORTRAN).

John Savard

From MitchAlsup1@21:1/5 to Quadibloc on Wed Jan 24 20:03:15 2024

Quadibloc wrote:

On Wed, 24 Jan 2024 07:47:31 +0000, Anton Ertl wrote:

I think that the comparison is as fair as we can get. Of course if
for some reason you don't want to be convinced, there are always some
straws that you can grasp in the hope that they will save the belief
system you favour. But if you look at it objectively, all evidence
there is (from Transmeta through Intel's E-cores and the lack of
in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
data) supports the position that in-order is not more power-efficient
than OoO above a certain performance level, while the opposite
position cannot point to evidence, but only to some corners where we
don't have evidence, and where in-order fans hope that these corners
will favour in-order.

Above a certain performance level, _all_ cores are out-of-order.

Above about 1.0 I/c everything goes OoO.

Remember the 1st generation RISCs performed at about 0.7 I/c
and the 2-wide In-Order machines close to 1.0 I/c;
Data from simulations showed 4-wide IO machines near 1.13 I/c

If in-order is more power-efficient than out-of-order at *low*
performance levels, then the basic notion that implementing
out-of-order requires some extra transistors, and transistors
take power, is confirmed. That basic notion is what leads
people to hope that, if in-order could be extended to higher
performance levels, then it would provide power savings there
too.

Let us then imagine what a high-performance in-order CPU would
look like. Its goal would be to achieve what OoO achieves to
improve performance without being OoO.

Do not forget Vector Machines as IO at high perf. By repeating
the same calculation (or memory reference) 64 times and chaining
(i.e., forwarding) they could achieve several (~3 I/c) long term.

Thus, such a CPU would have a giant architectural register
file - to match the large hardware register files, including
rename registers, of OoO systems.

CRAY 1 had 4096 Bytes of Vector Registers (and only 8 registers).

So we're talking AMD 29000 or Itanium. AMD sold off the 29000,
and it's still being used for compatibility reasons in some
aviation hardware.

The sample size is small, and so it's not that unreasonable to
argue that although the Itanium failed to meet expectations, this
class of architectures may still deserve some more investigation
and study. Yes, there's no high-performance OoO-beating in-order
chip you can buy off the shelf today, but maybe it's still worth
trying to design one.

What arguments are there against that? I can see a few:

- It's been tried many times, and failed each time. (This
doesn't _seem_ to be the case, but the few times it was
tried may have been enough to prove the point.)

You can add DataFlow to this list. Tried several times, the most
successful (I think) was Monsoon.

- The benefits of in-order at high performance are known
to be negligible. (That is, the gate cost of OoO at
high performance scales well, and becomes a decreasing
fraction of transistor count in higher-performance designs.
Mitch tells us that the GBOoO direction of progress is
*not sustainable*, so that doesn't seem to be the case.)

Non-switching transistors cost area but not <much> power.

- The drawbacks of in-order outweigh their benefits.
(If you have larger register files, you have bigger
instructions, so you fetch more code out of DRAM.
Is that really enough to make the difference?)

There is a difference between the architectural register file
(32 entry) and the implementation register file (128 rename
pool). The above confuses the two.

However, in framing this counterargument in favor of
in-order, the *fatal* drawback of in-order for high
performance has dawned on me. (Although Ivan Godard
in his Mill design is, in fact, making an effort to
address just this particular drawback!)

As Mitch notes, to further increase performance, OoO
has become GBOoO: ever larger hardware register files
and so on.

Only to the point where the register file can still cycle
in 1 clock. This puts the limit somewhere between 128 and
256 total registers.

This means that, even if an in-order design which had
large architectural register files, an exposed pipeline,
and so on, matched _current_ OoO CPUs in performance
for less power...

the performance of OoO CPUs doesn't stand still...

and so the _next generation_ of the in-order design
would have to have *larger* register files (and, no
doubt, all sorts of other things)...

which means it wouldn't be upwards-compatible with
software for the last generation.

That's why in-order RISC ended up being succeeded by
OoO implementations of the same ISA! Going from 32
registers to 128 registers to stay in-order... isn't
just something you can *do only once*, and solve the
problem forever!

And Itanic's downfall.

It is by looking at the real problem that the false hope
of high-performance in-order can finally be dashed. Maybe
it isn't technically impossible. But for the mass market
that wants to coalesce around a popular and stable
platform, it may not be able to meet *their* requirements,
even if such architectures could still find a niche
(like supercomputers that are only programmed by the
users themselves in FORTRAN).

Vector machines fell out of fashion when the length of the
vector register could no longer absorb the latency to memory.
{{Although NEC persisted for longer}}

Given certain kinds of HW (CAMs) one can build sort algorithms
in linear time for sorts of less than 128-entries. Then resort
to merges for longer lists.
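
One way to picture this is a rank sort: a key's final position is the number of keys that must precede it, and hardware that compares one key against all stored keys at once gets that count in a single step, so n keys take n steps. The sketch below is a software model under that assumption, not a description of any particular CAM design; the names `rank_sort`, `cam_sort`, and `BLOCK` are made up for illustration, and in software the comparisons run sequentially, so a block costs O(n^2) here.

```python
from heapq import merge

BLOCK = 127  # the post says "less than 128 entries"; the exact limit is an assumption

def rank_sort(block):
    """Software model of a CAM-style rank sort.

    Each key's output slot is the number of keys that must precede it.
    Ties are broken by input position so equal keys get distinct slots.
    """
    out = [None] * len(block)
    for i, key in enumerate(block):
        rank = sum(1 for j, other in enumerate(block)
                   if other < key or (other == key and j < i))
        out[rank] = key
    return out

def cam_sort(keys):
    """Rank-sort fixed-size blocks, then merge the sorted runs."""
    runs = [rank_sort(keys[i:i + BLOCK]) for i in range(0, len(keys), BLOCK)]
    return list(merge(*runs))

print(cam_sort([5, 3, 9, 3, 1]))
```

Lists longer than one block fall back to merging the sorted runs, as the post describes.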

John Savard

From Thomas Koenig@21:1/5 to Quadibloc on Wed Jan 24 20:33:20 2024

Quadibloc <[email protected]d> schrieb:

Above a certain performance level, _all_ cores are out-of-order.

That is true for general-purpose CPUs, but not for GPUs - these
are in-order. I think AMD and NVIDIA differ in their handling
of register hazards - AMD handles them, NVIDIA depends on the
compiler (well, whatever you want to call the piece of software
that translates the intermediate PTX into whatever the graphics
card itself understands) to do this.

From Anton Ertl@21:1/5 to Quadibloc on Wed Jan 24 21:54:12 2024

Quadibloc <[email protected]d> writes:

On Wed, 24 Jan 2024 07:47:31 +0000, Anton Ertl wrote:

I think that the comparison is as fair as we can get. Of course if
for some reason you don't want to be convinced, there are always some
straws that you can grasp in the hope that they will save the belief
system you favour. But if you look at it objectively, all evidence
there is (from Transmeta through Intel's E-cores and the lack of
in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
data) supports the position that in-order is not more power-efficient
than OoO above a certain performance level, while the opposite
position cannot point to evidence, but only to some corners where we
don't have evidence, and where in-order fans hope that these corners
will favour in-order.

Above a certain performance level, _all_ cores are out-of-order.

True, but not only that: In the Exynos 9820, at nearly all of the
performance range of the A55 (roughly as soon as you clock it
>=500MHz), the A75 offers more performance at better efficiency; the
A55 can run at 1800MHz on the Exynos 9820, but one better shouldn't,
and certainly not above 1000MHz, where it is not just less efficient
but also draws more power than the A75 at its lowest-power point,
despite delivering less performance.

If in-order is more power-efficient than out-of-order at *low*
performance levels, then the basic notion that implementing
out-of-order requires some extra transistors, and transistors
take power, is confirmed.

Certainly the A75 takes more area and more transistors than the A55;
concerning area, looking at
<https://images.anandtech.com/doci/14069/ChipRebel9820.png> there
seems to be roughly a factor of 3-4 between them, just as for
performance. But the A75 has been designed as a big core in ARM's
big.LITTLE system, so it should not be surprising that it is bigger.

Transistors (and connections) take power when they switch, and
(non-switching) leakage has also been a big topic maybe 15 years ago;
CPU manufacturers have dealt with leakage by powering down inactive
units, but powering them up again takes some time.

That basic notion is what leads
people to hope that, if in-order could be extended to higher
performance levels, then it would provide power savings there
too.

Higher performance levels without more transistors? How?

Let us then imagine what a high-performance in-order CPU would
look like. Its goal would be to achieve what OoO achieves to
improve performance without being OoO.

We don't have to imagine. We know what the various IA-64
implementations look like. And they were not power-efficient; on the
contrary, IIRC Merced in particular was exceptionally power-hungry for
its time at IIRC 130W. The 1.66GHz Madison is rated by Intel at 122W
<https://ark.intel.com/content/www/us/en/ark/products/27995/intel-itanium-processor-1-66-ghz-9m-cache-667-mhz-fsb.html>,
and the Itanium 9560 (Poulson) with 8 cores @ 2.53GHz has a TDP of
170W.

Thus, such a CPU would have a giant architectural register
file - to match the large hardware register files, including
rename registers, of OoO systems.

If a big register file is all that is needed, IA-64 would have
performed well (not just on software-pipelinable HPC code). But
compare a 2002-vintage 900MHz Itanium 2 (130W TDP
<https://www.hardware-aktuell.com/lexikon/Intel_Itanium_2>, 180nm)
and its 128 integer registers to a 2000-vintage 800MHz K7
(Thunderbird, also 180nm, 42.6W maximum power dissipation
<https://www.cpu-world.com/CPUs/K7/AMD-Athlon%20800%20-%20A0800APT3B.html>)
with IIRC 72 physical registers, on our LaTeX benchmark (lower is
better):

- HP workstation 900MHz Itanium II, Debian Linux 3.528
- Athlon (Thunderbird) 800, Abit KT7, PC100-333, RedHat 5.1 2.49

So here in-order provided lower performance at thrice the power
consumption, two years later.
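
To put numbers on that, here is a small sketch using only the figures quoted above. It treats the rated power (TDP, or maximum dissipation) as if it were drawn for the whole run, which is an assumption; actual draw during a LaTeX run would be lower on both machines, so the energy figures are upper bounds.

```python
# Figures quoted above: LaTeX run time in seconds and rated power in watts.
itanium2 = {"seconds": 3.528, "watts": 130.0}  # 900MHz Itanium 2, TDP
athlon = {"seconds": 2.49, "watts": 42.6}      # 800MHz Thunderbird, max dissipation

def energy_bound_joules(cpu):
    """Upper bound on energy per run if the CPU drew its rated power throughout."""
    return cpu["seconds"] * cpu["watts"]

power_ratio = itanium2["watts"] / athlon["watts"]
time_ratio = itanium2["seconds"] / athlon["seconds"]
energy_ratio = energy_bound_joules(itanium2) / energy_bound_joules(athlon)

print(f"power x{power_ratio:.2f}, time x{time_ratio:.2f}, "
      f"energy bound x{energy_ratio:.2f}")
```

With these inputs the rated power differs by a factor of about 3 and the run time by about 1.4, so the bound on energy per run differs by more than a factor of 4.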

Anyway, a major advantage of OoO is that its scheduler can make use of
the dynamic branch predictor and its superior accuracy. (Joshua
Landau pointed out a way that allows static schedulers to make use of
this accuracy, but it's doubtful that this can be achieved without a
code explosion).

Concerning the kind of regular code where IA-64 performed well, the
rest of the world added SIMD registers which can be used to perform
well on those applications; and even in that world (Xeon Phi), Intel
first tried to go for in-order, but replaced it with OoO in the next generation, and eventually just added AVX512 to its mainstream
performance cores, and made them the replacement for the Phi-Xeons.

The sample size is small, and so it's not that unreasonable to
argue that although the Itanium failed to meet expectations, this
class of architectures may still deserve some more investigation
and study.

But both IA-64 and Transmeta burned through serious amounts of money
pursuing the dream of superior in-order performance and (later for
Transmeta, after superior performance evaporated) efficiency. If you
cannot identify what you plan to do better and why that solves the
problems that IA-64 had, you will likely find out that in-order cannot
compete in performance and is not so great on efficiency, either.

- It's been tried many times, and failed each time. (This
doesn't _seem_ to be the case, but the few times it was
tried may have been enough to prove the point.)

In addition to IA-64 and the Transmeta chips, we can also point to the
big in-order cores of the times before OoO took over, e.g., the 4-wide
in-order 21164; it was succeeded and eclipsed by the 4-wide OoO 21264.
Sun tried to stick to in-order (or failed to produce a competitive OoO
CPU) for a long time, e.g., the 4-wide UltraSPARC III/IV/IV+, which
was succeeded by the OoO SPARC64 VI (from Fujitsu). IBM switched from
OoO in Power5 to in-order in Power6, and then back to OoO in Power7,
but I know too little about these CPUs. Anyway, high-performance
in-order has been tried not only by Intel (IA-64) and Transmeta, but
also by others, and OoO has won. For efficiency, one can point to Intel's in-order
Bonnell being succeeded by Intel's OoO Silvermont.

- The benefits of in-order at high performance are known
to be negligible. (That is, the gate cost of OoO at
high performance scales well, and becomes a decreasing
fraction of transistor count in higher-performance designs.
Mitch tells us that the GBOoO direction of progress is
*not sustainable*, so that doesn't seem to be the case.)

For a while OoO seemed to be limited to 3-4 wide (Intel 1995-2010
three-wide, 2011-2014 4-wide, AMD from at least 1999 (maybe earlier)
until (I think) 2016), but in recent years we see significant width
growth; e.g. the Cortex-X4 is 10-wide and even Intel's E-Core
Gracemont is 5-wide.

- The drawbacks of in-order outweigh their benefits.
(If you have larger register files, you have bigger
instructions, so you fetch more code out of DRAM.
Is that really enough to make the difference?)

I don't think so.

However, in framing this counterargument in favor of
in-order, the *fatal* drawback of in-order for high
performance has dawned on me. (Although Ivan Godard
in his Mill design is, in fact, making an effort to
address just this particular drawback!)

As Mitch notes, to further increase performance, OoO
has become GBOoO: ever larger hardware register files
and so on.

This means that, even if an in-order design which had
large architectural register files, an exposed pipeline,
and so on, matched _current_ OoO CPUs in performance
for less power...

the performance of OoO CPUs doesn't stand still...

and so the _next generation_ of the in-order design
would have to have *larger* register files (and, no
doubt, all sorts of other things)...

Yes, one of the benefits of OoO is that existing code just runs fine
*and fast* on next year's CPU, but in-order also tends to lose on code
that is compiled specifically for the model it runs on.

With the branch prediction disadvantage, you cannot make good use of
more than 128 registers for speculation even if you have more.

even if such architectures could still find a niche
(like supercomputers that are only programmed by the
users themselves in FORTRAN).

Well, that part of the market has been taken away from in-order CPUs
by OoO CPUs with SIMD, and by GPGPUs.

- anton

From Stefan Monnier@21:1/5 to All on Wed Jan 24 19:20:13 2024

So here in-order provided lower performance at thrice the power
consumption, two years later.

What is clear is that currently, no one knows how to make in-order CPUs
as fast as OoO for "general purpose" computing (i.e. not things you can
run on things like GPGPUs or TPUs).

But indeed, the more interesting aspect is that even in terms of
efficiency, in-order seems to be a losing proposition.
I'd be interested to hear opinions about why that is the case.

I can think of two factors, tho there are probably more:
- in-order CPUs spend more time waiting (which is the cause for their
lower performance), and they still burn Joules while they wait,
which throws away the Joules they presumably saved by staying clear of
the OoO "baggage".
- OoO execution is naturally more asynchronous, making it possible to
make decisions about what to do when in a more local way, thus wasting
less energy on costly whole-chip synchronization.

Stefan

From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jan 25 01:21:00 2024

Stefan Monnier wrote:

So here in-order provided lower performance at thrice the power
consumption, two years later.

What is clear is that currently, no one knows how to make in-order CPUs
as fast as OoO for "general purpose" computing (i.e. not things you can
run on things like GPGPUs or TPUs).

I think this calls for a "point of order"::

An in order pipeline must be kept short (5 cycles on the thin end,
8 cycles on the fat end) whereas GBOoO machines start with 12 cycles
on the thin end and 30 cycles on the fat end.

You can make a GBOoO machine clock faster than the IO machine simply
from less work in each pipe stage--and this makes up for the depth
of the pipeline.

Furthermore: IO machines are always latency bound, while GBOoO machines
are schedule bound, capable of absorbing L1 cache misses, long cycle
count instructions, ... that significantly harm IO machines.

But indeed, the more interesting aspect is that even in terms of
efficiency, in-order seems to be a losing proposition.
I'd be interested to hear opinions about why that is the case.

I can think of two factors, tho there are probably more:
- in-order CPUs spend more time waiting (which is the cause for their
lower performance), and they still burn Joules while they wait,

A properly clock-gated IO design should not be wasting clocking (and
flip-flop) power while waiting. In 2005 I designed an IO x86 that
went clock = 0Hz while waiting on L1 miss. The whole pipeline stopped eliminating 2 from the text vector exponent.

which throws away the Joules they presumably saved by staying clear of
the OoO "baggage".

The OoO Baggage that is not changing its assertions burns little power.
Clocking an IO pipeline while stalled burns significant power. {Hint:
it takes more power to clock the pipeline than to perform integer calculations.}

- OoO execution is naturally more asynchronous, making it possible to
make decisions about what to do when in a more local way, thus wasting
less energy on costly whole-chip synchronization.

Stefan

From EricP@21:1/5 to Anton Ertl on Wed Jan 24 23:28:49 2024

Anton Ertl wrote:

Anyway, a major advantage of OoO is that its scheduler can make use of
the dynamic branch predictor and its superior accuracy. (Joshua
Landau pointed out a way that allows static schedulers to make use of
this accuracy, but it's doubtful that this can be achieved without a
code explosion).

And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.

InO simply can't do that.

From Anton Ertl@21:1/5 to EricP on Thu Jan 25 06:46:31 2024

EricP <[email protected]> writes:

And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.

InO simply can't do that.

If it is designed accordingly (and I am sure that all IA-64
implementations are), it can: It starts a load, starts the next load
etc. The in-order property only comes into play when it wants to use
the result of one of these loads.

E.g., looking at <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

Specifically, the A510 can overlap two cache misses with the following
between them:

* 12 total instructions, up from 8 on the A53

* 6 FP instructions, up from 4 on the A53. This includes 128-bit
vector instructions on the A510 but not on the A53. A53 finds
vector operations scary and will stall immediately on encountering
one

* 3 branches, unchanged from A53

* 5 loads. The A53 would stall on any memory access past a cache miss.

And that's for a LITTLE core.
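
To make the stall-on-use point concrete, here is a minimal sketch in Python. It is a toy single-issue model, not a description of the A510 or of any IA-64 implementation: the 100-cycle miss latency, the three-instruction program, and the names `run` and `MISS_LATENCY` are all assumptions made up for illustration.

```python
MISS_LATENCY = 100  # assumed cache-miss latency in cycles (illustrative only)

def run(program, stall_on_use):
    """Return the cycle at which the last result is available.

    program: list of (dest, srcs, misses) tuples, issued strictly in order.
    stall_on_use=False models a pipeline that halts on every cache miss;
    stall_on_use=True lets issue continue until a missing value is needed.
    """
    ready = {}    # register -> cycle at which its value becomes available
    cycle = 0     # earliest cycle at which the next instruction may issue
    finish = 0
    for dest, srcs, misses in program:
        latency = MISS_LATENCY if misses else 1
        # In-order issue: wait until every source operand is available.
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])
        done = issue + latency
        ready[dest] = done
        finish = max(finish, done)
        if misses and not stall_on_use:
            cycle = done          # blocking design: whole pipeline waits
        else:
            cycle = issue + 1     # next instruction issues one cycle later
    return finish

program = [
    ("r1", [], True),             # load that misses
    ("r2", [], True),             # independent load that also misses
    ("r3", ["r1", "r2"], False),  # add that consumes both loads
]

blocking_cycles = run(program, stall_on_use=False)
overlapped_cycles = run(program, stall_on_use=True)
print(blocking_cycles, overlapped_cycles)
```

With these assumed numbers the blocking pipeline needs 201 cycles and the stall-on-use pipeline 102, because the second miss overlaps the first. Both variants issue strictly in program order.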

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jan 25 07:49:15 2024

Stefan Monnier <[email protected]> writes:

But indeed, the more interesting aspect is that even in terms of
efficiency, in-order seems to be a losing proposition.
I'd be interested to hear opinions about why that is the case.

I can think of two factors, tho there are probably more:
- in-order CPUs spend more time waiting (which is the cause for their
lower performance), and they still burn Joules while they wait,
which throws away the Joules they presumably saved by staying clear of
the OoO "baggage".
- OoO execution is naturally more asynchronous, making it possible to
make decisions about what to do when in a more local way, thus wasting
less energy on costly whole-chip synchronization.

These are the two explanations I have come up with, too.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to [email protected] on Thu Jan 25 07:05:13 2024

[email protected] (MitchAlsup1) writes:

Furthermore: IO machines are always latency bound, while GBOoO machines
are schedule bound, capable of absorbing L1 cache misses, long cycle
count instructions, ... that significantly harm IO machines.

What does "schedule bound" mean?

I have seen enough cases where a chain of dependent instructions
(whether it is a chain of multiplications, a chain of L1-hitting
loads, or even a chain of integer adds mixed with occasional
L1-hitting loads) determines the performance of an OoO machine, in
particular a wide OoO machine.

If branch mispredictions are low enough, what limits the performance
of an OoO machine is

* either its resources (functional units, rename width, or somesuch),
and I call that "resource bound",

* or a dependence chain is so long (and the rest of the instructions
consume so few resources) that eventually the reorder buffers are
filled with the rest of the instructions or the schedulers are
filled with instructions from the dependence chain. Then the
machine has to wait for an instruction from the dependence chain to
retire (for unclogging the ROB) or to produce a result (for freeing
a scheduler slot). I call that latency-bound or dependence-bound.

The wider the OoO engine, the fewer programs will be resource-bound on
that machine. Hardware designers use deep ROBs and deep schedulers on
wide OoO engines to reduce the number or impact of dependence-bound
cases, and indeed, with a bigger scheduling window, one may be able to
see more parallelism than with a smaller window.
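
The two cases can be put into a back-of-the-envelope model. The sketch below is in Python and rests on strong assumptions: perfect branch prediction, a scheduling window large enough never to clog, and a single dependence chain. The function name `bound` and all numbers are hypothetical, chosen only to show how widening the machine moves a program from one case to the other.

```python
import math

def bound(n_instructions, width, chain_length, chain_latency):
    """Return (cycles, label) for an idealized lower bound on execution time.

    resource bound:   all instructions divided by the sustained width
    dependence bound: the longest dependence chain, executed serially
    """
    resource_cycles = math.ceil(n_instructions / width)
    dependence_cycles = chain_length * chain_latency
    if resource_cycles >= dependence_cycles:
        return resource_cycles, "resource-bound"
    return dependence_cycles, "dependence-bound"

# Hypothetical program: 1000 instructions containing a chain of
# 100 dependent multiplies with an assumed latency of 3 cycles each.
narrow = bound(1000, width=2, chain_length=100, chain_latency=3)
wide = bound(1000, width=8, chain_length=100, chain_latency=3)
print(narrow, wide)
```

Under these assumptions the 2-wide machine is resource-bound at 500 cycles, while the 8-wide machine is dependence-bound at 300 cycles: the extra width is left idle once the chain dominates.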

And at some point there will be a branch misprediction, which acts as
an in-order constraint for the dependence-bound case. In the
resource-bound case, if the machine starts resolving the branch
misprediction before retiring the branch, there are still instructions
waiting for their functional unit, so the misprediction penalty will
be lower than otherwise.

As for in-order machines, for data-parallel stuff like, say, matrix multiplication, they can also be resource bound, and indeed, these are
the kinds of codes where IA-64 performed particularly well.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Quadibloc@21:1/5 to All on Thu Jan 25 08:34:58 2024

On Thu, 25 Jan 2024 01:21:00 +0000, MitchAlsup1 wrote:

You can make a GBOoO machine clock faster than the IO machine simply
from less work in each pipe stage--and this makes up for the depth
of the pipeline.

That's true, but to a naive reader that would seem utterly meaningless:
more clocks, but each instruction takes exactly the same number of
gate delays to do. So what?

Of course, though, that's _not_ the whole truth.

What a faster clock speed means for a pipelined computer, especially
one with out-of-order execution, is that it can do more instructions
in parallel, each one at a different stage of completion. So it's
just like adding more cores.. except it's even better, because
everything is more tightly coupled.

This, of course, is obvious stuff that you know perfectly well, but
some readers of your post could have missed it.

John Savard

From Quadibloc@21:1/5 to All on Thu Jan 25 08:43:05 2024

On Wed, 24 Jan 2024 20:03:15 +0000, MitchAlsup1 wrote:

Vector machines fell out of fashion when the length of the
vector register could no longer absorb the latency to memory.
{{Although NEC persisted for longer}}

Hmm.

If the latency to memory is bigger, then having more vector
registers lets you access stuff for a bigger percentage of
the time that is faster than memory.

Just like cache, or regular register files, therefore, one
would expect the utility of vector registers to increase,
not decrease, when memory becomes slower by comparison.

So I'm missing something here.

One possibility is that vector registers are usually used to
facilitate operations between vectors in memory - vectors
that are several times longer than the length of a vector
register. So the speed of memory controls the speed of the
overall calculation - in part. The vector registers multiply
it by a factor of how much work gets done on values once
they're read in - but perhaps if memory gets slow enough,
there's not much benefit over less elaborate local storage.

John Savard

From Quadibloc@21:1/5 to All on Thu Jan 25 08:49:14 2024

On Thu, 25 Jan 2024 01:21:00 +0000, MitchAlsup1 wrote:

Furthermore: IO machines are always latency bound, while GBOoO machines
are schedule bound, capable of absorbing L1 cache misses, long cycle
count instructions, ... that significantly harm IO machines.

Ah. This is useful information. It's the L1 cache misses, not L2 or L3
cache misses, that OoO is absorbing, and increasing performance thereby.

That makes sense: OoO has a limited capacity to look ahead and move instructions around, so the short delays caused by a miss in the highest
level of cache to the next highest are the ones it's best able to deal
with.

John Savard

From EricP@21:1/5 to Anton Ertl on Thu Jan 25 08:45:49 2024

Anton Ertl wrote:

EricP <[email protected]> writes:

And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.

InO simply can't do that.

If it is designed accordingly (and I am sure that all IA-64
implementations are), it can: It starts a load, starts the next load
etc. The in-order property only comes into play when it wants to use
the result of one of these loads.

E.g., looking at <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

Specifically, the A510 can overlap two cache misses with the following between them:

* 12 total instructions, up from 8 on the A53

* 6 FP instructions, up from 4 on the A53. This includes 128-bit
vector instructions on the A510 but not on the A53. A53 finds
vector operations scary and will stall immediately on encountering
one

* 3 branches, unchanged from A53

* 5 loads. The A53 would stall on any memory access past a cache miss.

And that's for a LITTLE core.

- anton

That 510 backend is not in-order, it's light weight OoO.
That 3-way superscalar CDC6600 style backend allows a younger instruction
to proceed to its next processing stage even though an older instruction
is still executing. That's fine, and it might be possible even to forward pending function unit results to other function unit inputs,
and as long as writeback happens in-order interrupts will be precise.
But that is a form of bypassing.

That uArch is distinct from a dual or triple InO pipeline because
in those if one pipeline stage stalls, they all stall.

From EricP@21:1/5 to Stefan Monnier on Thu Jan 25 09:26:35 2024

Stefan Monnier wrote:

So here in-order provided lower performance at thrice the power
consumption, two years later.

What is clear is that currently, no one knows how to make in-order CPUs
as fast as OoO for "general purpose" computing (i.e. not things you can
run on things like GPGPUs or TPUs).

But indeed, the more interesting aspect is that even in terms of
efficiency, in-order seems to be a losing proposition.
I'd be interested to hear opinions about why that is the case.

I can think of two factors, tho there are probably more:
- in-order CPUs spend more time waiting (which is the cause for their
lower performance), and they still burn Joules while they wait,
which throws away the Joules they presumably saved by staying clear of
the OoO "baggage".
- OoO execution is naturally more asynchronous, making it possible to
make decisions about what to do when in a more local way, thus wasting
less energy on costly whole-chip synchronization.

Stefan

In-order serializes when operations start, OoO synchronizes after they finish. The latter creates more potential opportunities for asynchronous concurrency, and this potential propagates through the whole system design.

From Anton Ertl@21:1/5 to EricP on Thu Jan 25 15:22:30 2024

EricP <[email protected]> writes:

Anton Ertl wrote:

EricP <[email protected]> writes:

And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.

InO simply can't do that.

If it is designed accordingly (and I am sure that all IA-64
implementations are), it can: It starts a load, starts the next load
etc. The in-order property only comes into play when it wants to use
the result of one of these loads.

E.g., looking at
<https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

Specifically, the A510 can overlap two cache misses with the following
between them:

* 12 total instructions, up from 8 on the A53

* 6 FP instructions, up from 4 on the A53. This includes 128-bit
vector instructions on the A510 but not on the A53. A53 finds
vector operations scary and will stall immediately on encountering
one

* 3 branches, unchanged from A53

* 5 loads. The A53 would stall on any memory access past a cache miss.

And that's for a LITTLE core.

- anton

That 510 backend is not in-order, it's light weight OoO.
That 3-way superscalar CDC6600 style backend allows a younger instruction
to proceed to its next processing stage even though an older instruction
is still executing. That's fine, and it might be possible even to forward
pending function unit results to other function unit inputs,
and as long as writeback happens in-order interrupts will be precise.
But that is a form of bypassing.

Not OoO in my book. By your definition anything is OoO that allows
some execution overlap of an architecturally earlier instruction with
an architecturally later instruction. With your definition, all
pipelined CPUs are OoO, including the MIPS R2000 with its delayed
branch, delayed load, and especially the multiply/divide unit.

Also, the 21064 which even allowed to issue two instructions at the
same time, as well as having instructions with more than one cycle of load-to-use latency; e.g., there could be an FP multiplication
followed by a load followed by an add, and the add would actually
finish using its ALU before the FP multiplication or the load
finishes.

As described above, the A53 would be OoO by your definition, too.

Last, but not least, all IA-64 implementations would be OoO by your
definition.

A definition that classifies everything as OoO and nothing as in-order
is neither helpful nor is it the commonly understood meaning of
"in-order" and OoO. I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes to a functional unit earlier than an architecturally earlier instruction).
Execution can overlap.

Concerning precise interrupts, that is certainly a problem for CPUs
without reorder buffers; the Alpha architects even put imprecise FP
interrupts and the trapb instruction (IIRC) in the architecture
because of that.

That uArch is distinct from a dual or triple InO pipeline because
in those if one pipeline stage stalls, they all stall.

That's a somewhat different definition. AFAIK the R2000 stalls the
whole (integer) pipeline on a cache miss despite allowing overlap
between instruction executions.

AFAIK microarchitects got rid of this limitation as soon as there were
enough transistors available. The problem with this limitation is
that it makes it pointless to schedule a load further ahead to reduce
the impact of a cache-miss latency, or to use a prefetch instruction,
because either one would stop the whole machine during the cache miss.
A prefetch could actually be counterproductive, but it would
definitely never help.

So this definition may describe some historical designs, but it's not
the difference between in-order and OoO as commonly understood.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jan 25 16:45:19 2024

Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

Furthermore: IO machines are always latency bound, while GBOoO machines
are schedule bound, capable of absorbing L1 cache misses, long cycle
count instructions, ... that significantly harm IO machines.

What does "schedule bound" mean?

I have seen enough cases where a chain of dependent instructions
(whether it is a chain of multiplications, a chain of L1-hitting
loads, or even a chain of integer adds mixed with occasional
L1-hitting loads) determines the performance of an OoO machine, in
particular a wide OoO machine.

Those ARE schedule bound--the speed of the scheduler(s) launching data dependent instructions is the limiting performance.

If branch mispredictions are low enough, what limits the performance
of an OoO machine is

* either its resources (functional units, rename width, or somesuch),
and I call that "resource bound",

Those are simply the targets of the scheduler(s). Getting the instruction launched is somewhat harder.

The FUs are easy to pipeline, data-dependent operations cannot use the available BW of the FUs when schedule bound.

* or a dependence chain is so long (and the rest of the instructions
consume so few resources) that eventually the reorder buffers are
filled with the rest of the instructions or the schedulers are
filled with instructions from the dependence chain. Then the
machine has to wait for an instruction from the dependence chain to
retire (for unclogging the ROB) or to produce a result (for freeing
a scheduler slot). I call that latency-bound or dependence-bound.

This is the other end of the schedule pipeline.

The wider the OoO engine, the fewer programs will be resource-bound on
that machine.

And the more will be schedule bound.

Hardware designers use deep ROBs and deep schedulers on
wide OoO engines to reduce the number or impact of dependence-bound
cases, and indeed, with a bigger scheduling window, one may be able to
see more parallelism than with a smaller window.

Yes, Indeed.

And at some point there will be a branch misprediction, which acts as
an in-order constraint for the dependence-bound case. In the
resource-bound case, if the machine starts resolving the branch
misprediction before retiring the branch, there are still instructions waiting for their functional unit, so the misprediction penalty will
be lower than otherwise.

As for in-order machines, for data-parallel stuff like, say, matrix multiplication, they can also be resource bound, and indeed, these are
the kinds of codes where IA-64 performed particularly well.

- anton

From MitchAlsup1@21:1/5 to Quadibloc on Thu Jan 25 16:47:49 2024

Quadibloc wrote:

On Wed, 24 Jan 2024 20:03:15 +0000, MitchAlsup1 wrote:

Vector machines fell out of fashion when the length of the
vector register could no longer absorb the latency to memory.
{{Although NEC persisted for longer}}

Hmm.

If the latency to memory is bigger, then having more vector
registers lets you access stuff for a bigger percentage of
the time that is faster than memory.

And lose code compatibility with your predecessors.

Just like cache, or regular register files, therefore, one
would expect the utility of vector registers to increase,
not decrease, when memory becomes slower by comparison.

So I'm missing something here.

Amdahl's law still applies.

One possibility is that vector registers are usually used to
facilitate operations between vectors in memory - vectors
that are several times longer than the length of a vector
register. So the speed of memory controls the speed of the
overall calculation - in part. The vector registers multiply
it by a factor of how much work gets done on values once
they're read in - but perhaps if memory gets slow enough,
there's not much benefit over less elaborate local storage.

Speed of memory ~== bisection bandwidth.

John Savard

From MitchAlsup1@21:1/5 to EricP on Thu Jan 25 16:49:32 2024

EricP wrote:

Stefan Monnier wrote:

So here in-order provided lower performance at thrice the power
consumption, two years later.

What is clear is that currently, no one knows how to make in-order CPUs
as fast as OoO for "general purpose" computing (i.e. not things you can
run on things like GPGPUs or TPUs).

But indeed, the more interesting aspect is that even in terms of
efficiency, in-order seems to be a losing proposition.
I'd be interested to hear opinions about why that is the case.

I can think of two factors, tho there are probably more:
- in-order CPUs spend more time waiting (which is the cause for their
lower performance), and they still burn Joules while they wait,
which throws away the Joules they presumably saved by staying clear of
the OoO "baggage".
- OoO execution is naturally more asynchronous, making it possible to
make decisions about what to do when in a more local way, thus wasting
less energy on costly whole-chip synchronization.

Stefan

In-order serializes when operations start,

And remain serialized while traversing the pipeline.

OoO synchronizes after they finish.
The latter creates more potential opportunities for asynchronous concurrency, and this potential propagates through the whole system design.

From EricP@21:1/5 to Anton Ertl on Fri Jan 26 10:06:35 2024

Anton Ertl wrote:

EricP <[email protected]> writes:

Anton Ertl wrote:

EricP <[email protected]> writes:

And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.

InO simply can't do that.

If it is designed accordingly (and I am sure that all IA-64
implementations are), it can: It starts a load, starts the next load
etc. The in-order property only comes into play when it wants to use
the result of one of these loads.

E.g., looking at
<https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

Specifically, the A510 can overlap two cache misses with the following
between them:

* 12 total instructions, up from 8 on the A53

* 6 FP instructions, up from 4 on the A53. This includes 128-bit
vector instructions on the A510 but not on the A53. A53 finds
vector operations scary and will stall immediately on encountering
one

* 3 branches, unchanged from A53

* 5 loads. The A53 would stall on any memory access past a cache miss.

And that's for a LITTLE core.

- anton

That 510 backend is not in-order, it's light weight OoO.
That 3-way superscalar CDC6600 style backend allows a younger instruction
to proceed to its next processing stage even though an older instruction
is still executing. That's fine, and it might be possible even to forward
pending function unit results to other function unit inputs,
and as long as writeback happens in-order interrupts will be precise.
But that is a form of bypassing.

Not OoO in my book. By your definition anything is OoO that allows
some execution overlap of an architecturally earlier instruction with
an architecturally later instruction. With your definition, all
pipelined CPUs are OoO, including the MIPS R2000 with its delayed
branch, delayed load, and especially the multiply/divide unit.

No, not overlap, bypassing. Multiple parallel pipelines is still in-order. Instructions enter each pipeline in program order, each maintains fifo
order internally, and results exit from each in program order.

I got the impression from the description of the 510 that it allowed
a limited form of bypass where it says under Execution Engine
"Instructions can co-issue if they're independent, have their inputs ready..". It depends on exactly what that box labeled "Issue" does.

Also the figure in section 2.1 Pipeline Overview gave me the impression
that bypassing might be allowed.

Anyway, as long as the register file is updated in-order then the only one
that matters is the load store queue. While the LSQ allows 2 outstanding
cache misses, as long as it finishes each load/store in order then none
of this is visible.

Also, the 21064 which even allowed to issue two instructions at the
same time, as well as having instructions with more than one cycle of load-to-use latency; e.g., there could be an FP multiplication
followed by a load followed by an add, and the add would actually
finish using its ALU before the FP multiplication or the load
finishes.

As described above, the A53 would be OoO by your definition, too.

21164 was two parallel integer pipelines. I don't know about A53.

Last, but not least, all IA-64 implementations would be OoO by your definition.

A definition that classifies everything as OoO and nothing as in-order
is neither helpful nor is it the commonly understood meaning of
"in-order" and OoO. I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes to a functional unit earlier than an architecturally earlier instruction). Execution can overlap.

And it sounded like the 510 might do exactly that: send instructions
to a function unit OoO.

Concerning precise interrupts, that is certainly a problem for CPUs
without reorder buffers; the Alpha architects even put imprecise FP interrupts and the trapb instruction (IIRC) in the architecture
because of that.

And in that regard the Alpha made itself the poster boy for what not to do.

That uArch is distinct from a dual or triple InO pipeline because
in those if one pipeline stage stalls, they all stall.

That's a somewhat different definition. AFAIK the R2000 stalls the
whole (integer) pipeline on a cache miss despite allowing overlap
between instruction executions.

I thought the R2000 only has one pipeline.
Anyway, I was thinking the 21064 had two integer pipelines but
it only has one integer/memory and one float.
The 21164 has two int/mem, one float add, one float multiply.

AFAIK microarchitects got rid of this limitation as soon as there were
enough transistors available. The problem with this limitation is
that it makes it pointless to schedule a load further ahead to reduce
the impact of a cache-miss latency, or to use a prefetch instruction,
because either one would stop the whole machine during the cache miss.
A prefetch could actually be counterproductive, but it would
definitely never help.

So this definition may describe some historical designs, but it's not
the difference between in-order and OoO as commonly understood.

- anton

Yes, my brain fart. Forget I said that.
I wasn't thinking it was the difference between IO and OoO,
I was thinking there is no point in keeping two parallel integer pipelines running when one stalls for a long duration operation because the running
one will have to stall at the pipe end to synchronize writeback anyway.
But that ignores the possibility that the pipelines are asymmetric.

From MitchAlsup1@21:1/5 to EricP on Fri Jan 26 21:46:42 2024

EricP wrote:

Anton Ertl wrote:

EricP <[email protected]> writes:

Anton Ertl wrote:

EricP <[email protected]> writes:

And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.

InO simply can't do that.

If it is designed accordingly (and I am sure that all IA-64
implementations are), it can: It starts a load, starts the next load
etc. The in-order property only comes into play when it wants to use
the result of one of these loads.

E.g., looking at
<https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:

Specifically, the A510 can overlap two cache misses with the following
between them:

* 12 total instructions, up from 8 on the A53

* 6 FP instructions, up from 4 on the A53. This includes 128-bit
vector instructions on the A510 but not on the A53. A53 finds
vector operations scary and will stall immediately on encountering
one

* 3 branches, unchanged from A53

* 5 loads. The A53 would stall on any memory access past a cache miss.

And that's for a LITTLE core.

- anton

That 510 backend is not in-order, it's light weight OoO.
That 3-way superscalar CDC6600 style backend allows a younger instruction
to proceed to its next processing stage even though an older instruction
is still executing. That's fine, and it might be possible even to forward
pending function unit results to other function unit inputs,
and as long as writeback happens in-order interrupts will be precise.
But that is a form of bypassing.

Not OoO in my book. By your definition anything is OoO that allows
some execution overlap of an architecturally earlier instruction with
an architecturally later instruction. With your definition, all
pipelined CPUs are OoO, including the MIPS R2000 with its delayed
branch, delayed load, and especially the multiply/divide unit.

No, not overlap, bypassing. Multiple parallel pipelines is still in-order.

Note:: Mc88100 had multiple parallel pipelines and was not In-Order !!
An older LD stall would allow younger instructions to complete!

From Anton Ertl@21:1/5 to EricP on Sat Jan 27 15:15:43 2024

EricP <[email protected]> writes:

Anton Ertl wrote:

[...]

Anyway, as long as the register file is updated in-order

This discussion resulted in the unearthing of memories of the things I
read about the microarchitectures of the advanced in-order machines of
last century (and I think that in-order machines in this century tend
to work in the same way). My memory may be unreliable here, but
anyway:

The way things worked was that in those machines, instructions were
issued in-order but the results could be written back out-of-order.
However, each register has a bit that tells whether the register is
up-to-date, or will be updated in the future by a currently in-flight instruction. This was often called scoreboarding (although Mitch
Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
the CDC 6600 scoreboard was a more sophisticated mechanism; given the
I in MIPS, one could also call it an interlock). So each instruction
checks whether all its source and destination registers are
up-to-date, and if not, it waits until they are (forwarding changes
the notion of "up-to-date" a bit, but I'll skip this here).
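
The per-register "up-to-date" check can be sketched as follows, in Python. This is a toy model of the mechanism as described above, not a reconstruction of any particular machine: it is single-issue, ignores forwarding, and the latencies, register names, and the function name `scoreboard` are assumptions chosen for illustration.

```python
def scoreboard(program):
    """Issue in order; let results write back whenever they are done.

    program: list of (dest, srcs, latency) tuples in program order.
    Returns a list of (issue_cycle, complete_cycle), one per instruction.
    """
    pending = {}   # register -> cycle at which its in-flight write lands
    cycle = 0      # earliest cycle at which the next instruction may issue
    timeline = []
    for dest, srcs, latency in program:
        # An instruction waits until all of its source registers AND its
        # destination register are up-to-date (no in-flight writer left).
        regs = list(srcs) + [dest]
        issue = max([cycle] + [pending.get(r, 0) for r in regs])
        complete = issue + latency
        pending[dest] = complete
        timeline.append((issue, complete))
        cycle = issue + 1          # in-order: successors issue later
    return timeline

program = [
    ("f1", ["f2", "f3"], 6),   # FP multiply, assumed 6-cycle latency
    ("r1", ["r9"], 3),         # load, assumed 3-cycle load-to-use latency
    ("r4", ["r5", "r6"], 1),   # integer add, independent of the above
    ("r7", ["r1", "r4"], 1),   # add that needs the load result
]

timeline = scoreboard(program)
completion_order = sorted(range(len(program)), key=lambda i: timeline[i][1])
print(timeline, completion_order)
```

In this run the issue cycles are 0, 1, 2, 4, so issue stays in program order, while the completion order is add, load, dependent add, multiply. The last instruction waits at issue until the load's register is up-to-date; nothing ever issues ahead of an architecturally earlier instruction.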

With out-of-order completion, how is architectural execution and, in particular, precise exceptions, ensured? For ordinary execution, it
does not matter for an instruction whether an unrelated register is
not up-to-date, and if the register is mentioned in the instruction,
the instruction and all that follow wait until the register is
up-to-date.

I don't remember how loads and stores were handled, but again, as long
as they were to non-overlapping addresses (and for weak memory
ordering in multiprocessors) one can do quite a bit in parallel
without destroying architectural order.

I also don't remember how flags registers were handled on
architectures that have them, but it needs something cleverer than the "up-to-date" scheme described above, or there would be lots of stalls
due to write-after-write dependences. I am sure the microarchitects
found something appropriate.

For precise exceptions, I remember discussions about the importance of
knowing early in the instruction that an exception happens; i.e., so
early that the writebacks of architecturally later instructions can be cancelled. For loads, the exception is known early, when the TLB
lookup has happened; I expect that the whole machine is stalled on a
TLB miss (or, with a software-managed TLB, the exception happens right
there). Alpha has imprecise FP exceptions because the architecture
wanted to allow implementing denormals through trapping, but it takes
several cycles to know whether an FP result is normal or not.

[Cortex-A510]

then the only one
that matters is the load store queue. While the LSQ allows 2 outstanding
cache misses, as long as it finishes each load/store in order then none
of this is visible.

I expect that the A510 uses the mechanism described above, which means
that loads can finish out of order, but none of this is visible
nonetheless.

Also, the 21064 which even allowed to issue two instructions at the
same time, as well as having instructions with more than one cycle of
load-to-use latency; e.g., there could be an FP multiplication
followed by a load followed by an add, and the add would actually
finish using its ALU before the FP multiplication or the load
finishes.

As described above, the A53 would be OoO by your definition, too.

21164 was two parallel integer pipelines. I don't know about A53.

The Cortex-A53 has two ALU ports <https://chipsandcheese.com/2023/05/28/arms-cortex-a53-tiny-but-important/>.

It's interesting to compare the A53 (2012) to the 21164 (1995). Both
have roughly similar execution resources (2 integer (one of which can
be a branch), 2 FP, 1LSU (not sure about that for the 21164)), but the
21164 has a four-wide decoder, while the A53 only has a two-wide
decoder. I guess the cost of decoding all of A64, A32, and especially
T32 caused them to limit the decoding capabilities.

For the A510 ARM expanded that to a three-wide decode, but the A510 is
an A64-only core. ARM also provided a third ALU and an additional
load unit to the A510. Given that an ALU was not that expensive even
in the 21164 timeframe, my guess is that the 21164 architects provided
only two because of register port or forwarding path limitations,
something that the ARM designers apparently have a solution for (more
metal layers?).

That's a somewhat different definition. AFAIK the R2000 stalls the
whole (integer) pipeline on a cache miss despite allowing overlap
between instruction executions.

I thought the R2000 only has one pipeline.

My memories from last century tell me that there was some concept
like "squashing pipeline bubbles" being discussed at the time, i.e.,
that instructions in earlier stages could advance until the first of
them reaches the stalled instruction. Conversely, instructions in
later stages could continue, filling the stages they left with bubbles
(I don't remember this being discussed). But of course none of that
is used in the R2000. The R2000 has a multiply/divide unit that takes
many cycles, and actually with interlocks. I don't know if that
continues working while a cache miss is served; the R2010 FPU
certainly continues working while a cache miss is served.

And then we got the 88100 with three pipelines, and then the 21064
with dual-issue and three pipelines.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to All on Sun Jan 28 13:48:24 2024

MitchAlsup1 wrote:

EricP wrote:

Anton Ertl wrote:

EricP <[email protected]> writes:

Anton Ertl wrote:

EricP <[email protected]> writes:

And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.

InO simply can't do that.

If it is designed accordingly (and I am sure that all IA-64
implementations are), it can: It starts a load, starts the next load
etc. The in-order property only comes into play when it wants to use
the result of one of these loads.

E.g., looking at
<https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>,
the A510 has a 5-entry load buffer. The text says:

Specifically, the A510 can overlap two cache misses with the following
between them:

* 12 total instructions, up from 8 on the A53
* 6 FP instructions, up from 4 on the A53. This includes 128-bit
  vector instructions on the A510 but not on the A53. A53 finds
  vector operations scary and will stall immediately on encountering
  one
* 3 branches, unchanged from A53
* 5 loads. The A53 would stall on any memory access past a cache
  miss.

And that's for a LITTLE core.

- anton

That 510 backend is not in-order, it's light weight OoO.
That 3-way superscalar CDC6600 style backend allows a younger instruction
to proceed to its next processing stage even though an older instruction
is still executing. That's fine, and it might be possible even to forward
pending function unit results to other function unit inputs,
and as long as writeback happens in-order interrupts will be precise.
But that is a form of bypassing.

Not OoO in my book. By your definition anything is OoO that allows
some execution overlap of an architecturally earlier instruction with
an architecturally later instruction. With your definition, all
pipelined CPUs are OoO, including the MIPS R2000 with its delayed
branch, delayed load, and especially the multiply/divide unit.

No, not overlap, bypassing. Multiple parallel pipelines is still
in-order.

Note: the Mc88100 had multiple parallel pipelines and was not in-order!
An older LD stall would still allow younger instructions to complete!

Multiple parallel pipelines is fine but it has to sequence the pipe exits
so the results retire in order for precise exceptions and interrupts.

Also each pipeline can be a source for forwarding so it can wind up
with many forwarding buses which have to be checked for each source
operand on each issue lane.
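
To get a feel for how quickly that grows, here is a rough count of tag comparators, assuming every source operand on every issue lane is checked against every forwarding bus. The function name, the two-operand default, and the bus counts are made up for illustration and are not taken from any particular core.

```python
def forwarding_comparators(forwarding_buses, issue_lanes, operands_per_insn=2):
    """Count tag comparators if every source operand on every issue lane
    must be checked against every forwarding bus (assumed model)."""
    return forwarding_buses * issue_lanes * operands_per_insn


# Hypothetical configurations; the numbers are illustrative only.
for buses, lanes in [(2, 1), (4, 2), (6, 3)]:
    print(buses, "buses,", lanes, "lanes ->",
          forwarding_comparators(buses, lanes), "comparators")
```

The count is a product, so adding a pipeline and an issue lane at the same time costs more than either change alone.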


From EricP@21:1/5 to Anton Ertl on Sun Jan 28 20:57:27 2024

Anton Ertl wrote:

EricP <[email protected]> writes:

Anton Ertl wrote:

[...]

Anyway, as long as the register file is updated in-order

This discussion resulted in the unearthing of memories of the things I
read about the microarchitectures of the advanced in-order machines of
last century (and I think that in-order machines in this century tend
to work in the same way). My memory may be unreliable here, but
anyway:

The way things worked was that in those machines, instructions were
issued in-order but the results could be written back out-of-order.
However, each register has a bit that tells whether the register is up-to-date, or will be updated in the future by a currently in-flight instruction. This was often called scoreboarding (although Mitch
Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
the CDC 6600 scoreboard was a more sophisticated mechanism; given the
I in MIPS, one could also call it an interlock). So each instruction
checks whether all its source and destination registers are
up-to-date, and if not, it waits until they are (forwarding changes
the notion of "up-to-date" a bit, but I'll skip this here).
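
A minimal sketch of that per-register "up-to-date" bit, assuming no forwarding and a single bit per register. The class and method names are hypothetical, and this is not the CDC 6600 scoreboard.

```python
class ReadyBitScoreboard:
    """One 'up-to-date' bit per register; in-order issue, OoO writeback."""

    def __init__(self, num_regs):
        self.ready = [True] * num_regs

    def can_issue(self, srcs, dsts):
        # Sources must be up-to-date (RAW) and so must destinations (WAW).
        return all(self.ready[r] for r in list(srcs) + list(dsts))

    def issue(self, srcs, dsts):
        if not self.can_issue(srcs, dsts):
            return False  # this instruction, and all later ones, wait
        for r in dsts:
            self.ready[r] = False  # result is now in flight
        return True

    def writeback(self, reg):
        self.ready[reg] = True  # may happen in any order


sb = ReadyBitScoreboard(8)
assert sb.issue(srcs=[1], dsts=[2])      # long-latency load into r2
assert sb.issue(srcs=[3, 4], dsts=[5])   # independent add issues behind it
assert not sb.issue(srcs=[2], dsts=[6])  # consumer of r2 has to wait
sb.writeback(5)                          # the add writes back before the load
sb.writeback(2)
assert sb.issue(srcs=[2], dsts=[6])      # now the consumer can issue
```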

With out-of-order completion, how is architectural execution and, in particular, precise exceptions, ensured? For ordinary execution, it
does not matter for an instruction whether an unrelated register is
not up-to-date, and if the register is mentioned in the instruction,
the instruction and all that follow wait until the register is
up-to-date.

I don't remember how loads and stores were handled, but again, as long
as they were to non-overlapping addresses (and for weak memory
ordering in multiprocessors) one can do quite a bit in parallel
without destroying architectural order.

I also don't remember how flags registers were handled on
architectures that have them, but it needs something cleverer than the "up-to-date" scheme described above, or there would be lots of stalls
due to write-after-write dependences. I am sure the microarchitects
found something appropriate.

For precise exceptions, I remember discussions about the importance of knowing early in the instruction that an exception happens; i.e., so
early that the writebacks of architecturally later instructions can be cancelled. For loads, the exception is known early, when the TLB
lookup has happened; I expect that the whole machine is stalled on a
TLB miss (or, with a software-managed TLB, the exception happens right there). Alpha has imprecise FP exceptions because the architecture
wanted to allow implementing denormals through trapping, but it takes
several cycles to know whether an FP result is normal or not.

That scoreboard allows OoO execution and completion,
and avoids RAW, WAW, and WAR hazards,
but it doesn't write back results in program order.
Exceptions can be made precise by (a) always writing results in-order,
and (b) only recognizing exceptions at Writeback.

To write the results back in order one could attach a sequence counter
to each uOp - a counter with enough bits so that each possible in-flight
uOp in any stage has a unique number plus 1 bit for a wrap flag.
The uOps can then flow down separate parallel pipelines.

Writeback also has a sequence counter so it knows which uOp is
next to write its register. I would want two register write ports
so it at least has a chance of catching up after a bubble.
WB checks the exits of all the pipelines for the next two sequence
numbers, removes those uOps from their pipelines and writes the results.
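
A rough sketch of that writeback step, assuming one exit slot per pipeline and ignoring counter wraparound. `writeback_cycle` and its arguments are names invented for this example.

```python
def writeback_cycle(exits, next_seq, write_ports=2):
    """Retire up to `write_ports` uOps in sequence order.

    exits: list with one slot per pipeline, holding the sequence number of
    the uOp waiting at that pipeline's exit, or None if the exit is empty.
    The list is updated in place. Returns (retired numbers, new next_seq).
    """
    retired = []
    for _ in range(write_ports):
        if next_seq not in exits:
            break  # the next uOp in program order has not finished yet
        exits[exits.index(next_seq)] = None
        retired.append(next_seq)
        next_seq += 1
    return retired, next_seq


exits = [1, 0, 3]                       # ALU, load and FP pipeline exits
print(writeback_cycle(exits, 0))        # ([0, 1], 2)
print(writeback_cycle(exits, 2))        # ([], 2): uOp 3 waits for uOp 2
```

In the second call uOp 3 is finished but stays at its pipeline exit, which is the stall case described in the next paragraph.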

Each pipeline takes care of its own stalls internally and compacts out
NULL uOps from stages if possible. So the only time a pipeline has to completely stall is when all stages are full and the end result is not
the oldest uOp and so cannot be written back.

The uOp sequence numbers also allow branch mispredict to purge just those in-flight uOps that are younger than the branch. The Branch Control Unit detects a mispredicted conditional branch and transmits its own sequence
number on the Cancel Bus which goes to all stages of all pipelines.
Each stage compares its own sequence number to the cancel number and
if higher (younger) then it nullifies that entry.
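
The "higher (younger)" compare has to survive counter wrap. One common way to do that, sketched here with an assumed 4-bit counter (3 bits of uOp number plus the wrap bit), is a modular subtract. `SEQ_BITS`, `is_younger` and `cancel` are illustrative names only.

```python
SEQ_BITS = 4                 # assumed: at most 8 uOps in flight, plus 1 wrap bit
MOD = 1 << SEQ_BITS
HALF = MOD >> 1


def is_younger(seq, ref):
    """True if seq was allocated after ref, allowing for counter wrap.
    Only valid while at most HALF uOps are in flight."""
    return 0 < ((seq - ref) % MOD) < HALF


def cancel(stages, branch_seq):
    """Model of the Cancel Bus: nullify (None) every stage entry that is
    younger than the mispredicted branch."""
    return [None if s is not None and is_younger(s, branch_seq) else s
            for s in stages]


# The counter wrapped from 15 to 0 between the branch (14) and later uOps.
print(cancel([13, 14, 15, 0, 1], branch_seq=14))   # [13, 14, None, None, None]
```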

[Cortex-A510]

then the only one
that matters is the load store queue. While the LSQ allows 2 outstanding
cache misses, as long as it finishes each load/store in order then none
of this is visible.

I expect that the A510 uses the mechanism described above, which means
that loads can finish out of order, but none of this is visible
nonetheless.

Also, the 21064 which even allowed to issue two instructions at the
same time, as well as having instructions with more than one cycle of
load-to-use latency; e.g., there could be an FP multiplication
followed by a load followed by an add, and the add would actually
finish using its ALU before the FP multiplication or the load
finishes.

As described above, the A53 would be OoO by your definition, too.

21164 was two parallel integer pipelines. I don't know about A53.

The Cortex-A53 has two ALU ports <https://chipsandcheese.com/2023/05/28/arms-cortex-a53-tiny-but-important/>.

It's interesting to compare the A53 (2012) to the 21164 (1995). Both
have roughly similar execution resources (2 integer (one of which can
be a branch), 2 FP, 1LSU (not sure about that for the 21164)), but the
21164 has a four-wide decoder, while the A53 only has a two-wide
decoder. I guess the cost of decoding all of A64, A32, and especially
T32 caused them to limit the decoding capabilities.

For the A510 ARM expanded that to a three-wide decode, but the A510 is
an A64-only core. ARM also provided a third ALU and an additional
load unit to the A510. Given that an ALU was not that expensive even
in the 21164 timeframe, my guess is that the 21164 architects provided
only two because of register port or forwarding path limitations,
something that the ARM designers apparently have a solution for (more
metal layers?).

That's a somewhat different definition. AFAIK the R2000 stalls the
whole (integer) pipeline on a cache miss despite allowing overlap
between instruction executions.

I thought the R2000 only has one pipeline.

My memories from last century tell me that there was some concept
like "squashing pipeline bubbles" being discussed at the time, i.e.,
that instructions in earlier stages could advance until the first of
them reaches the stalled instruction. Conversely, instructions in
later stages could continue, filling the stages they left with bubbles
(I don't remember this being discussed). But of course none of that
is used in the R2000. The R2000 has a multiply/divide unit that takes
many cycles, and actually with interlocks. I don't know if that
continues working while a cache miss is served; the R2010 FPU
certainly continues working while a cache miss is served.

And then we got the 88100 with three pipelines, and then the 21064
with dual-issue and three pipelines.

- anton

A simple way to squash bubbles (NULL uOps) out of pipeline stages is:

generate stage N stall signal and inhibit clocking its input buffer if
- the stage N buffer valid flag is set
- and stage N would generate a valid output (eg resources are available)
- and stage N+1 is generating a stall

The stage N stall signal propagates back to stage N-1.

Unfortunately in this simple design the stall signal serially propagates backwards through all the stages. Also the pipeline can stretch a long
way across the chip which means long wires.
This total stall signal delay is added to the worst case stage calculation delay and cuts into the max frequency.

Other designs called elastic buffers are possible where the stall is only between adjacent stages and they can squash bubbles but those require more
than double the cost of a stage buffer. One can also alternate the simple
stage design with elastic buffer to limit the stall propagation delay.
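
The simple stall rule above can be modelled in a few lines. This is only a behavioural sketch: "would generate a valid output" is folded into "the stage holds a valid uOp", and `step` is a name invented for the example.

```python
def step(stages, new_uop, exit_stalled):
    """Advance a linear pipeline by one clock.

    stages[0] is the first stage, stages[-1] the last; None is a bubble.
    A stage stalls only if it holds a valid uOp AND the next stage stalls,
    so a bubble absorbs the stall and gets squashed.
    Returns (new stage contents, whether new_uop was accepted).
    """
    n = len(stages)
    stall = [False] * n
    downstream = exit_stalled
    for i in reversed(range(n)):       # the stall ripples from back to front
        stall[i] = stages[i] is not None and downstream
        downstream = stall[i]
    nxt = list(stages)
    for i in range(n):
        if not stall[i]:               # clock the input buffer
            nxt[i] = new_uop if i == 0 else stages[i - 1]
    return nxt, not stall[0]


pipe = ["D", None, "B", "A"]           # "A" is oldest, one bubble in stage 1
pipe, ok = step(pipe, "E", exit_stalled=True)
print(pipe, ok)                        # ['E', 'D', 'B', 'A'] True
pipe, ok = step(pipe, "F", exit_stalled=True)
print(pipe, ok)                        # ['E', 'D', 'B', 'A'] False
```

The backwards loop that computes `stall` is the serial propagation path that costs cycle time in real hardware.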


From Anton Ertl@21:1/5 to EricP on Mon Jan 29 18:08:42 2024

EricP <[email protected]> writes:

Anton Ertl wrote:

EricP <[email protected]> writes:

Anton Ertl wrote:

[...]

Anyway, as long as the register file is updated in-order

This discussion resulted in the unearthing of memories of the things I
read about the microarchitectures of the advanced in-order machines of
last century (and I think that in-order machines in this century tend
to work in the same way). My memory may be unreliable here, but
anyway:

The way things worked was that in those machines, instructions were
issued in-order but the results could be written back out-of-order.
However, each register has a bit that tells whether the register is
up-to-date, or will be updated in the future by a currently in-flight
instruction. This was often called scoreboarding (although Mitch
Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
the CDC 6600 scoreboard was a more sophisticated mechanism; given the
I in MIPS, one could also call it an interlock). So each instruction
checks whether all its source and destination registers are
up-to-date, and if not, it waits until they are (forwarding changes
the notion of "up-to-date" a bit, but I'll skip this here).

With out-of-order completion, how is architectural execution and, in
particular, precise exceptions, ensured? For ordinary execution, it
does not matter for an instruction whether an unrelated register is
not up-to-date, and if the register is mentioned in the instruction,
the instruction and all that follow wait until the register is
up-to-date.

I don't remember how loads and stores were handled, but again, as long
as they were to non-overlapping addresses (and for weak memory
ordering in multiprocessors) one can do quite a bit in parallel
without destroying architectural order.

I also don't remember how flags registers were handled on
architectures that have them, but it needs something cleverer than the
"up-to-date" scheme described above, or there would be lots of stalls
due to write-after-write dependences. I am sure the microarchitects
found something appropriate.

For precise exceptions, I remember discussions about the importance of
knowing early in the instruction that an exception happens; i.e., so
early that the writebacks of architecturally later instructions can be
cancelled. For loads, the exception is known early, when the TLB
lookup has happened; I expect that the whole machine is stalled on a
TLB miss (or, with a software-managed TLB, the exception happens right
there). Alpha has imprecise FP exceptions because the architecture
wanted to allow implementing denormals through trapping, but it takes
several cycles to know whether an FP result is normal or not.

That scoreboard allows OoO execution and completion,
and avoids RAW, WAW, and WAR hazards,
but it doesn't write back results in program order.
Exceptions can be made precise by (a) always writing results in-order,
and (b) only recognizing exceptions at Writeback.

AFAIK the ROB for in-order completion only came with the modern wave
of microarchitectures with OoO execution (the 360/91 has no reorder
buffer).

The approach I described above is different: Results are written
out-of-order; precise exceptions are recognized in the first cycle of
the instruction, before any architecturally later instruction writes
back; the writebacks of these architecturally later instructions are
then suppressed. The question is how later writebacks are suppressed
without suppressing the writebacks of architecturally earlier
instructions. I can think of some mechanism, but I don't know if that
was used.

The same mechanism is needed for dealing with branches without delay
slot: Either they are predicted to go in some direction (as in, e.g.,
the 21064), or fallthrough is preferred (as in the 486), which is
equivalent to predicting not-taken. In either case, when the
prediction is wrong, the instructions along the predicted path must
not write back, and in this case the recovery must be fast (whereas
exceptions are so rare that a few cycles more would be acceptable).

To write the results back in order one could attach a sequence counter
to each uOp - a counter with enough bits so that each possible in-flight
uOp in any stage has a unique number plus 1 bit for a wrap flag.

A sequence counter is also the first solution for suppressing the
writebacks of architecturally later instructions in the OoO writeback
setup.

Writeback also has a sequence counter so it knows which uOp is
next to write its register. I would want two register write ports
so it at least has a chance of catching up after a bubble.

The 88100 has only one writeback port and writes results back
out-of-order. My first refereed paper <https://www.complang.tuwien.ac.at/papers/ertl%26krall91.ps.gz> was
about scheduling for the 88100, and utilizing writeback slots better
than the usual schedulers was the major benefit of our scheduler.

Concerning cache misses, the in-order scheme described above also can
live with setting aside loads for cache misses, and processing
architecturally later loads in the meantime, and I am sure that a
number of in-order microarchitectures have done that; recently the
A510. Of course, stores to overlapping addresses have to be processed
in-order wrt the loads and each other.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From MitchAlsup1@21:1/5 to EricP on Mon Jan 29 21:04:54 2024

EricP wrote:

Anton Ertl wrote:

EricP <[email protected]> writes:

Anton Ertl wrote:

[...]

Anyway, as long as the register file is updated in-order

<snip>

For precise exceptions, I remember discussions about the importance of
knowing early in the instruction that an exception happens; i.e., so
early that the writebacks of architecturally later instructions can be
cancelled. For loads, the exception is known early, when the TLB
lookup has happened; I expect that the whole machine is stalled on a
TLB miss (or, with a software-managed TLB, the exception happens right
there). Alpha has imprecise FP exceptions because the architecture
wanted to allow implementing denormals through trapping, but it takes
several cycles to know whether an FP result is normal or not.

That scoreboard allows OoO execution and completion,
and avoids RAW, WAW, and WAR hazards,

Register hazards are obeyed, memory hazards are not necessarily obeyed.

but it doesn't write back results in program order.

Exceptions can be made precise by (a) always writing results in-order,

OR by allowing younger writes only after knowing older results will not
raise exceptions.

and (b) only recognizing exceptions at Writeback.

A bit restrictive, but it does work.


From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Feb 25 21:58:12 2024

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the pipe exits
so the results retire in order for precise exceptions and interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

(From Computer Architecture: A Quantitative Approach, 3rd Ed.,
Appendix H, "One approach to this problem, used in the MIPS R3010,
is to identify instructions that may cause an exception early in
the instruction cycle. For example, an addition can overflow only
if one of the operands has an exponent of Emax, and so on. This
early check is conservative: It might flag an operation that
doesn’t actually cause an exception. However, if such false
positives are rare, then this technique will have excellent
performance. When an instruction is tagged as being possibly
exceptional, special code in a trap handler can compute it without
destroying any state. Remember that all these problems occur only
when trap handlers are enabled.")
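
As a small illustration of such a conservative filter, here is the addition case for IEEE doubles. It shows the idea only and does not describe the R3010's actual logic; `add_may_overflow` is a hypothetical name.

```python
import math
import sys


def add_may_overflow(a, b):
    """Conservative early check: flag a + b if either finite operand has
    the maximum exponent. False positives are possible, false negatives
    are not (for finite operands)."""
    def has_max_exponent(x):
        # frexp returns e with 0.5 <= |m| < 1, so Emax shows up as max_exp.
        return math.isfinite(x) and math.frexp(x)[1] == sys.float_info.max_exp
    return has_max_exponent(a) or has_max_exponent(b)


big = sys.float_info.max
assert add_may_overflow(big, big) and math.isinf(big + big)   # real overflow
assert add_may_overflow(big, -big) and big + -big == 0.0      # false positive
assert not add_may_overflow(1.0, 2.0)                         # not flagged
```

Only the exponent fields are inspected, so the check can be made before the mantissa addition has produced anything.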

Not writing results in order would require suppressing earlier
writes to the same register (a singular writeback stage design
would also have this). With simple in-order issue, this would
(I think) only occur when the result was never used (e.g., a
slow operation started before a conditional branch that
determines its use, or in a "free" delay slot, or if two results
are produced and one is unused such as unused flag settings).
Out-of-order writeback also presents register write port hazards;
more write ports might be needed than available.

It _might_ be practical to allow store instructions that use a
delayed result to issue before the result is available — similar

ST instructions are special in that one can compute the address
as soon as operand dependencies resolve, and then only access the
value to be aligned and stored after the ST instruction is retired.
This way, ST.data is never latent. HP has a patent on this circa
1986±.

to the classic store-address-generation/store-data split for
out-of-order execution. A store buffer entry could be marked as
not having valid data (similar to ready bits for registers) and
the slow operation could "forward" to the store buffer.

My pipelines don't even bother to fetch the data to be stored until
the ST instruction retires.

Multiply-
add instructions can also conceivably exploit delayed availability
of the addend. There might also be some cases where necessary
latency is data dependent and knowing that the computation can
be done faster the operations might be "issued" early as if it
had the normal/worst-case latency; that communication complexity
seems unlikely to be worthwhile but it is conceivably possible.

Since low-end out-of-order is not extraordinarily complex or resource-intensive, heroic efforts to provide slightly less
constrained but still in-order execution seem rather questionable.

The only counter point is that every time one allows the front of the
pipeline and the end of the pipeline to determine advance differently,
you add 1 to the exponent of test vector complexity. If you allow the
center of the pipeline to crush out bubbles, you have now added 2 to
the test vector complexity of the pipeline.


From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Feb 25 22:22:04 2024

Paul A. Clayton wrote:

On 1/24/24 2:47 AM, Anton Ertl wrote:

"Paul A. Clayton" <[email protected]> writes:

When I looked at the pipeline design presented in the Arm Cortex-
A55 Software Optimization Guide, I was surprised by the design.
Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
(ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
DIV/SQRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
stage before an ALU stage (clearly for AArch32).

Almost like an Mc88100 which had 5 pipelines.

The separation of MAC and DIV is mildly questionable — from my
very amateur perspective — not supporting dual issue of a MAC-DIV
pair seems very unlikely to hurt performance but the cost may be
trivial.

Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;

The Chips and Cheese article also indicated that branches are only
resolved at writeback, two cycles later than if branch direction
was resolved in the first execution stage. The difference between
a six stage misprediction penalty and an eight stage one is not
huge, but it seems to indicate a difference in focus. With

In an 8 stage pipeline, the 2 cycles of added delay should hurt by ~5%-7%

condition code based branches and in-order execution, I would have
been tempted to try resolving such branches by the end of the
issue stage. (MIPS R2000 resolved register-compare branches at the
end of decode, so resolving branches based on a condition code —
if the data is available — in the cycle after decode does not seem incredibly difficult. It may be that condition codes are generally
not set early enough to justify such effort, but it seems
obviously "possible".)

I would have *guessed* that an AGLU (a functional unit providing
address generation and "simple" ALU functions, like AMD's Bobcat?)
would be more area and power efficient than having separate
pipelines, at least for store address generation.

Be careful with assumptions like that. Silicon area with no moving
signals is remarkably power efficient.

I may be misinterpreting/misunderstanding the information. While I
believe I am not entirely incompetent in general
microarchitectural design, it is difficult to believe that any
professional (much less a team of professionals) would do worse
than I would. Other tradeoffs (like design reuse) may also justify
design choices that seem worse.

snip

They had to choose the L1 size. Cortex-A55 supports L1 sizes of 16
KiB, 32 KiB, and 64 KiB. With a fixed three-cycle latency (and
other pipeline stages fixed in their work), the size of the L1
caches will affect not only cycle time. If the pipeline diagram is interpreted extremely literally, address generation takes one
cycle, data cache output takes one cycle, and align and extend
takes one cycle. If cache access itself takes one cycle and if
that latency increases by sqrt(2) with each capacity doubling,

A capacity doubling adds SQRT(2) to wire latency and 1 gate to gate-delay
latency. Depending on which is more critical, you may choose
to go one way or the other.

then implementations with the largest *either* data or instruction
cache would have twice as much time in a cycle as implementations
with both L1s being 16 KiB *if* the pipeline was designed for the
smallest cache.

(I would **GUESS** that ARM designed the pipeline for 32 KiB
caches and smaller caches mostly mean unused time within the cache
access cycle and larger caches mostly mean unused time within all
the other stages. The time to complete a certain amount of logical
operation can be adjusted, e.g., using a faster adder, but not
shifting the clock boundaries constrains such changes as not all
chunks of logic can be made faster — intentional clock skew might
allow borrowing time — and synthesized designs might not get all
the possible changes.)

According to the AnandTech article, Samsung chose not to implement
an L2 for the A55 cores. Since accessing the L3 means crossing a
clocking domain, this would seem to have a significant impact on
performance for workloads like SPEC and, I suspect, a noticeable
impact on energy-efficiency. If this choice also led to using 64
KiB L1 caches **and if** ARM optimized the pipeline for 32 KiB
caches, this might also have noticeably impacted performance and energy-efficiency.

Crossing a clock domain boundary is 2.5 clocks of latency.

(For SPEC, I would guess that even the 256 KiB maximum
configuration L2 size for A55 would have a significant performance
impact. SPEC2006 used by AnandTech might be friendlier to modest
L2 size than SPEC2017. If the software is "tuned" for workstation
hardware of five years before the SPEC benchmark, 2019 smart
phones might not be that far from 2001 workstations in terms of L2
sizes.)

-----------------------------

If my above guess that a 64 KiB L1 was used and that this impacts
frequency, voltage and frequency scaling may have been affected.
(I seem to recall reading that caches have poorer voltage-
frequency scaling; that *might* incline a larger L1 cache to
further hurt energy efficiency if a single voltage is used for the
whole core.)

SRAMs do not operate (well) below a certain voltage. At voltage
the sense amplifier will have a gain > 50× while below that voltage
the SA gain may decrease to 10× and there is a range of voltages
where the change in gain vs. change in voltage is quadratic.
------------------

With respect to sticking with in-order, there also seems to be a
tendency to go "all in" when switching to out-of-order, i.e., the
initial out-of-order design seems to be relatively "beefy" in its out-of-order resources. This may result from having delayed the
transition well beyond where performance or efficiency estimates
would have justified the change or perhaps from crossover being a
large enough region by the time a change is fully justified the
out-of-order design would be relatively beefy.

OoO gains in both ILP and frequency, giving a quadratic uplift.

Perhaps mildly out-of-order designs (say a little more than the
PowerPC 750) are not actually useful (other than as a starting
point for understanding out-of-order design). I do not understand
why such an intermediate design (between in-order and 30+
scheduling window out-of-order) is not useful. It may be that

It is useful, just not all that much.

going from say 10 to 30 scheduler entries gives so much benefit
for relatively little extra cost (and no design is so precisely
area constrained — even doubling core size would not mean pushing
L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
emotional investment in intermediate and mixed designs.

10 does not accommodate much ILP beyond that of a 10 deep pipeline.
30 accommodates L1 cache misses and typical FP latencies.
90 accommodates "almost everything else"
250 accommodates multiple L1 misses with L2 hits and "everything else".

And now you write that ARM did not
design it for power efficiency. If you are right, that supports the
position that in-order is uncompetitive not just wrt performance, but
also perf/W as soon as there are relatively low performance
requirements.

If ARM designed A55 for power efficiency (at that performance
level) over all other concerns, the L1 caches would be fixed size.
Users of ARM designs are obviously willing to sacrifice some power
efficiency for the benefit from flexible L1 size. With different functionality differing in timing and energy costs with different
processes, energy-efficiency at all costs would seem to lead to
different designs for different processes. Presumably this is not
cost effective.

The memory system, on-chip network, and such would also affect the
energy efficiency. Exynos9820's memory system might _reasonably_
be optimized for high power/high performance use; that would tend
to hurt the efficiency of wimpy cores.

What scenario do you imagine where one would want these in-order
cores? ARM's niche for them is the little cores in a big.LITTLE
design; that is necessarily coupled with a memory system with a high
bandwidth. There are also SoCs with only A55 cores (no BIG ones) like
the RK3566, but they are only bought because of the price, not because
of their power-efficiency.

For something like a smart phone, one or two small cores might be
useful for background activity, tasks whose latency (within a
broad range) is not related to system responsiveness for the user.

For a server expected to run embarrassingly parallel workloads, if

Servers are not expected to run embarrassingly parallel applications,
they are expected to run an embarrassingly large number of essentially
serial applications.

a wimpy core provides sufficient responsiveness, I would expect
most of the cores (possibly even all of the cores) to be wimpy.
There might not be many workloads with such characteristics;

Talk to Google about that....

although fundamental network latency has not improved that much
over the last decade, bandwidth has increased and server-side
processing complexity has increased. Even splitting a request
across multiple threads can leave wimpy cores less useful than one
might expect because work will not be perfectly distributed and
tail latency increases.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Paul A. Clayton on Mon Feb 26 10:48:39 2024

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the pipe exits
so the results retire in order for precise exceptions and interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?

As Mitch has pointed out many times, uOps with exceptions might look
ahead to see if all older uOps that might throw exceptions have executed
and did not indicate an exception. However I believe that exceptions are
exceptional (unusual) and find the extra logic needed to do this to be
not justified for the benefits of early prefetching of an exception handler.

My only exception handler that is triggered with any regularity is
page fault (assuming a hardware table walker so no TLB miss exceptions),
and it typically invokes a handler with many thousands of instructions
so prefetching that code a few clocks earlier won't make any difference.

(From Computer Architecture: A Quantitative Approach, 3rd Ed.,
Appendix H, "One approach to this problem, used in the MIPS R3010,
is to identify instructions that may cause an exception early in
the instruction cycle. For example, an addition can overflow only
if one of the operands has an exponent of Emax, and so on. This
early check is conservative: It might flag an operation that
doesn’t actually cause an exception. However, if such false
positives are rare, then this technique will have excellent
performance. When an instruction is tagged as being possibly
exceptional, special code in a trap handler can compute it without
destroying any state. Remember that all these problems occur only
when trap handlers are enabled.")
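
The conservative early check in the quote can be sketched in a few lines. This is only an illustration for IEEE 754 double-precision addition, not the R3010's actual logic; the names `biased_exponent` and `may_overflow_on_add` are made up for this example.

```python
import math
import struct
import sys

EMAX_BIASED = 2046  # largest biased exponent of a finite IEEE 754 double


def biased_exponent(x: float) -> int:
    """Return the 11-bit biased exponent field of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits >> 52) & 0x7FF


def may_overflow_on_add(a: float, b: float) -> bool:
    """Conservative filter: True if a + b could overflow.

    A sum of two finite doubles can only overflow when at least one
    operand already has the maximum exponent, so only the exponent
    fields need to be inspected. False positives are allowed.
    """
    return max(biased_exponent(a), biased_exponent(b)) >= EMAX_BIASED


big = sys.float_info.max
print(may_overflow_on_add(1.0, 2.0))   # small operands are never flagged
print(may_overflow_on_add(big, big))   # flagged, and the sum really overflows
print(may_overflow_on_add(big, -big))  # flagged, but the sum is 0.0 (false positive)
```

The last call shows the price of being conservative: the operation is tagged as possibly exceptional even though it is harmless.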

Ok but their problem was they used the exception mechanism for Usuals,
TLB misses and in this case floating point fix-ups. And a consequence of
the exception mechanism is a pipeline drain, which doesn't matter if it
only happens rarely but does if it happens often.

This was in the early RISC days when they used traps for all kinds of
normal management, misaligned memory accesses or Sparc register windows.
And they all suffered performance problems.

So rather than fix the actual problems by adding in a HW table walker
and HW float fix-ups, it sounds like they added a complicated mechanism
to sort-of-almost-but-not-quite-multi-threaded to execute the trap handler
and avoid the pipeline drain. I had the same idea for Alpha's software
TLB miss handler, which sapped up to 25% of performance, but decided that
software-managed TLBs are a dead end and a HW table walker was best.

Moral of the story: don't use the exception mechanism for usuals
and then complain about the performance.

From MitchAlsup1@21:1/5 to EricP on Mon Feb 26 19:49:06 2024

EricP wrote:

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the pipe exits
so the results retire in order for precise exceptions and interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?

As Mitch has pointed out many times, uOps with exceptions might look
ahead to see if all older uOps that might throw exceptions have executed
and did not indicate an exception. However I believe that exceptions are
exceptional (unusual) and find the extra logic needed to do this to be
not justified for the benefits of early prefetching of an exception handler.

My only exception handler that is triggered with any regularity is
page fault (assuming a hardware table walker so no TLB miss exceptions),
and it typically invokes a handler with many thousands of instructions
so prefetching that code a few clocks earlier won't make any difference.

(From Computer Architecture: A Quantitative Approach, 3rd Ed.,
Appendix H, "One approach to this problem, used in the MIPS R3010,
is to identify instructions that may cause an exception early in
the instruction cycle. For example, an addition can overflow only
if one of the operands has an exponent of Emax, and so on. This
early check is conservative: It might flag an operation that
doesn’t actually cause an exception. However, if such false
positives are rare, then this technique will have excellent
performance. When an instruction is tagged as being possibly
exceptional, special code in a trap handler can compute it without
destroying any state. Remember that all these problems occur only
when trap handlers are enabled.")

Ok but their problem was they used the exception mechanism for Usuals,
TLB misses and in this case floating point fix-ups. And a consequence of
the exception mechanism is a pipeline drain, which doesn't matter if it
only happens rarely but does if it happens often.

Note: the requirement of pipeline drain cost MIPS R2000 5 instructions
and a modern x86 up to 200 instructions. The bigger the execution window
the less you want to take exceptions.
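
A back-of-the-envelope model shows why the window size matters. The sketch assumes that each exception throws away roughly one window's worth of instruction slots and nothing else; the exception rate is a made-up input, and only the figures 5 and 200 come from the note above.

```python
def slots_lost_fraction(window: int, exceptions_per_kinst: float) -> float:
    """Fraction of issue slots lost to pipeline drains.

    Assumes (hypothetically) that every exception discards `window`
    instruction slots and that nothing else is lost.
    """
    lost = window * exceptions_per_kinst   # slots lost per 1000 instructions
    return lost / (1000.0 + lost)


for name, window in (("MIPS R2000", 5), ("modern x86", 200)):
    frac = slots_lost_fraction(window, exceptions_per_kinst=1.0)
    print(f"{name}: {frac:.1%} of slots lost at 1 exception per 1000 instructions")
```

Under these assumptions the same exception rate costs about 0.5% on the small pipeline and about 16.7% on the large window.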

This was in the early RISC days when they used traps for all kinds of
normal management, misaligned memory accesses or Sparc register windows.
And they all suffered performance problems.

Maybe--relative to the performance they had--but compared to the x86s and
Mc68Ks of the competition, the RISCs outclassed them.

So rather than fix the actual problems by adding in a HW table walker
and HW float fix-ups, it sounds like they added a complicated mechanism
to sort-of-almost-but-not-quite-multi-threaded to execute the trap handler
and avoid the pipeline drain. I had the same idea for Alpha's software
TLB miss handler, which sapped up to 25% of performance, but decided that
software-managed TLBs are a dead end and a HW table walker was best.

SW TLB miss handlers were dead the instant one wanted 2-level translation
(GuestOS and HyperVisor), which is the norm today (outside of µControllers).

Moral of the story: don't use the exception mechanism for usuals
and then complain about the performance.

From MitchAlsup1@21:1/5 to EricP on Sat Mar 2 20:55:18 2024

EricP wrote:

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the pipe exits
so the results retire in order for precise exceptions and interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

Interrupts have to be restartable so in-order retire,

Either restartable or completable.

where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.

It is not the cheapest to implement, for that you need to review Mc 88100.

Yes you could make it more complicated, but why?

You CAN make it less complicated in HW and toss the burden off to SW.
{similar to VLIW, RISC vs CISC, ...}

As Mitch has pointed out many times, uOps with exceptions might look
ahead to see if all older uOps that might throw exceptions have executed
and did not indicate an exception.

You look backwards (not ahead) to see if older instructions can no longer
raise exceptions.

However I believe that exceptions are exceptional (unusual) and find the
extra logic needed to do this to be
not justified for the benefits of early prefetching of an exception handler.

My only exception handler that is triggered with any regularity is
page fault (assuming a hardware table walker so no TLB miss exceptions),
and it typically invokes a handler with many thousands of instructions
so prefetching that code a few clocks earlier won't make any difference.

If you use it often enough it will still be in your cache when you next
need it. {I don't remember exactly who told me this, but it was one of
the original MIPS (the company not Stanford) guys}; so you don't need to
prefetch it.

(From Computer Architecture: A Quantitative Approach, 3rd Ed.,
Appendix H, "One approach to this problem, used in the MIPS R3010,
is to identify instructions that may cause an exception early in
the instruction cycle. For example, an addition can overflow only
if one of the operands has an exponent of Emax, and so on. This
early check is conservative: It might flag an operation that
doesn’t actually cause an exception. However, if such false
positives are rare, then this technique will have excellent
performance. When an instruction is tagged as being possibly
exceptional, special code in a trap handler can compute it without
destroying any state. Remember that all these problems occur only
when trap handlers are enabled.")

Ok but their problem was they used the exception mechanism for Usuals,
TLB misses and in this case floating point fix-ups. And a consequence of
the exception mechanism is a pipeline drain, which doesn't matter if it
only happens rarely but does if it happens often.

It also matters less when the pipeline depth is small (less than 10).

This was in the early RISC days when they used traps for all kinds of
normal management, misaligned memory accesses or Sparc register windows.
And they all suffered performance problems.

Exceptions meant we did not have to build a bunch of mechanisms that
SW could do similarly well. The smarter of us learned our lessons and
modern Si means we have the logic to do it right without scrimping
{FPGA still has not reached this as BGB and another NG member subscribe.}

So rather than fix the actual problems by adding in a HW table walker
and HW float fix-ups, it sounds like they added a complicated mechanism
to sort-of-almost-but-not-quite-multi-threaded to execute the trap handler
and avoid the pipeline drain.

The MIPS guys (above) would state that that mechanism already had to exist
and they could leverage it for TLB refills as they could for page faults or
FP fixups. Only page faults should remain in a modern (non-FPGA) implementation.

I had the same idea for Alpha's software
TLB miss handler, which sapped up to 25% of performance, but decided that
software-managed TLBs are a dead end and a HW table walker was best.

Moral of the story: don't use the exception mechanism for usuals
and then complain about the performance.

From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 2 21:04:35 2024

Paul A. Clayton wrote:

On 1/25/24 10:22 AM, Anton Ertl wrote:
[snip]

I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes to a
functional unit earlier than an architecturally earlier instruction).
Execution can overlap.

What about a skewed pipeline? A simple skewed pipeline that
statically assigned operations to a pipeline-stage/execution unit
has been called in-order (in what I have read). A "second-chance"
pipeline (where many operations can dynamically choose the
pipeline stage based on operand availability) involves dynamic
scheduling (so would seem to fall into out-of-order), but
counterflow pipelines ("Counterflow Pipeline Processor
Architecture", Robert F. Sproull et al., 1994) — which are more
extreme in some ways than pipelines that have two stages in which
operations can start — are stated to have "No overtaking.
Instructions must stay in program order in the instruction
pipeline.", which sounds "in-order" (and the paper was written by
people working at Sun Microsystems).

(I thought counterflow pipelines were weird. Simplifying
communication makes sense, but ...)

I get the impression that early PowerPC out-of-order execution
implementations were really very similar to using the forwarding
network for out-of-order completion (with in-order writeback). If
I recall correctly, renaming was done by appending a version to
the architectural register name and operands would be captured as
soon as they were available rather than passing along the pipeline
with forwarding until the writeback stage.
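
The renaming scheme recalled above (append a version to the architectural register name) can be sketched as follows. This is a toy model under stated assumptions, not a description of any PowerPC or 88110 implementation: the class `VersionRenamer` is hypothetical and its version counters are unbounded, whereas real hardware would have only a few version bits.

```python
from collections import defaultdict


class VersionRenamer:
    """Rename registers by appending a version number to the architectural name."""

    def __init__(self) -> None:
        self.version = defaultdict(int)  # architectural register -> newest version

    def source(self, reg: str) -> str:
        """Sources read the newest version written so far."""
        return f"{reg}.{self.version[reg]}"

    def dest(self, reg: str) -> str:
        """Each write allocates the next version of the register."""
        self.version[reg] += 1
        return f"{reg}.{self.version[reg]}"

    def rename(self, dst: str, srcs: list[str]) -> tuple[str, list[str]]:
        renamed_srcs = [self.source(s) for s in srcs]  # read before bumping dst
        return self.dest(dst), renamed_srcs


r = VersionRenamer()
program = [("r1", ["r2", "r3"]), ("r1", ["r1", "r4"]), ("r5", ["r1"])]
for dst, srcs in program:
    print(r.rename(dst, srcs))
```

The second instruction reads `r1.1` and writes `r1.2`, so the two writes to `r1` no longer conflict and a waiting consumer can capture whichever version it was renamed to as soon as that value appears.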

This sounds more like Mc 88110 rather than PPC 620.

PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
die area. Other things may have been jettisoned at this shrink of design
point. The 620 was originally targeted to be equal to Mc 88120 which was
a 6-wide GBOoO machine full Tomasulo with precise exceptions and 4
external busses named {Data Out, Data In, Address Out, Address In}

Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops

Smart externals could use Data In to send the CPU data before it knew it
needed it. That data could be code or data.

From Michael S@21:1/5 to [email protected] on Sun Mar 3 01:04:44 2024

On Sat, 2 Mar 2024 21:04:35 +0000
[email protected] (MitchAlsup1) wrote:

Paul A. Clayton wrote:

On 1/25/24 10:22 AM, Anton Ertl wrote:
[snip]

I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes
to a functional unit earlier than an architecturally earlier
instruction). Execution can overlap.

What about a skewed pipeline? A simple skewed pipeline that
statically assigned operations to a pipeline-stage/execution unit
has been called in-order (in what I have read). A "second-chance"
pipeline (where many operations can dynamically choose the
pipeline stage based on operand availability) involves dynamic
scheduling (so would seem to fall into out-of-order), but
counterflow pipelines ("Counterflow Pipeline Processor
Architecture", Robert F. Sproull et al., 1994) — which are more
extreme in some ways than pipelines that have two stages in which operations can start — are stated to have "No overtaking.
Instructions must stay in program order in the instruction
pipeline.", which sounds "in-order" (and the paper was written by
people working at Sun Microsystems).

(I thought counterflow pipelines were weird. Simplifying
communication makes sense, but ...)

I get the impression that early PowerPC out-of-order execution
implementations were really very similar to using the forwarding
network for out-of-order completion (with in-order writeback). If
I recall correctly, renaming was done by appending a version to
the architectural register name and operands would be captured as
soon as they were available rather than passing along the pipeline
with forwarding until the writeback stage.

This sounds more like Mc 88110 rather than PPC 620.

Paul A. Clayton probably has in mind 603 and 7xx series rather than
(more ambitious) 604 and its ill-fated never shipped followup 620.

PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
die area. Other things may have been jettisoned at this shrink of
design point. The 620 was originally targeted to be equal to Mc 88120
which was a 6-wide GBOoO machine full Tomasulo with precise
exceptions and 4 external busses named {Data Out, Data In, Address
Out, Address In}

Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops

Smart externals could use Data In to send the CPU data before it knew
it needed it. That data could be code or data.

From Terje Mathisen@21:1/5 to All on Mon Mar 4 21:54:12 2024

MitchAlsup1 wrote:

EricP wrote:

My only exception handler that is triggered with any regularity is
page fault (assuming a hardware table walker so no TLB miss exceptions),
and it typically invokes a handler with many thousands of instructions
so prefetching that code a few clocks earlier won't make any difference.

If you use it often enough it will still be in your cache when you next
need it. {I don't remember exactly who told me this, but it was one of
the original MIPS (the company not Stanford) guys}; so you don't need to
prefetch it.

That has been my rule-of-thumb for lookup tables replacing logic: If the
table is small enough and used often enough that it could make a
significant difference to the total runtime, then it will also stay in
cache nearly all the time.

If it does get evicted between uses most of the time, then it simply
wasn't that important.
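
A small simulation illustrates the rule of thumb. It is a sketch under simplifying assumptions: a fully associative LRU cache, a table that is touched in full once per round, and a stream of never-reused competing lines. The function `table_hit_rate` and all sizes are invented for the example.

```python
from collections import OrderedDict


def table_hit_rate(cache_lines: int, table_lines: int, gap: int, rounds: int = 200) -> float:
    """Hit rate of a lookup table in a fully associative LRU cache.

    Each round touches every table line once, then `gap` distinct
    other lines that compete for the same cache.
    """
    cache: OrderedDict = OrderedDict()
    hits = accesses = 0
    other = 0

    def touch(line) -> bool:
        hit = line in cache
        if hit:
            cache.move_to_end(line)
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
        return hit

    for _ in range(rounds):
        for t in range(table_lines):
            accesses += 1
            hits += touch(("table", t))
        for _ in range(gap):
            other += 1
            touch(("other", other))
    return hits / accesses


print(table_hit_rate(cache_lines=64, table_lines=8, gap=16))    # table used often
print(table_hit_rate(cache_lines=64, table_lines=8, gap=1000))  # table used rarely
```

With a short gap the table misses only on its first use; with a long gap it misses every time, but then it also accounts for well under 1% of all accesses.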

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 9 04:01:49 2024

Paul A. Clayton wrote:

On 2/26/24 10:48 AM, EricP wrote:

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the
pipe exits
so the results retire in order for precise exceptions and
interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?

The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.

This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.

I suspect general out-of-order retirement would not be worthwhile
with precise exceptions; it just sounds complex. My comment was
mainly to point out that such was possible not that it was wise.

We basically all converged on this about 1990.

[snip]

Moral of the story: don't use the exception mechanism for usuals
and then complain about the performance.

☺

From Scott Lurndal@21:1/5 to [email protected] on Sat Mar 9 15:03:05 2024

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

On 2/26/24 10:48 AM, EricP wrote:

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the
pipe exits
so the results retire in order for precise exceptions and
interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?

The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.

This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

The LLC (or memory controller) can optionally support an interrupt
to management software to indicate that an uncorrected fault occurred;
that would, of course, be asynchronous and occur long after the
store had retired.

From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Mar 9 18:45:48 2024

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

On 2/26/24 10:48 AM, EricP wrote:

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the
pipe exits
so the results retire in order for precise exceptions and
interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?

The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.

This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

The LLC (or memory controller) can optionally support an interrupt
to management software to indicate that an uncorrected fault occurred;
that would, of course, be asynchronous and occur long after the
store had retired.

I was going to check ECC on arrival at LLC and request retransmission
on failure. CPU sender cannot free the "miss buffer" until it gets a
release (arrived OK) from LLC. LLC then later writes data into DRAM
through DRAM controller.
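
The hold-until-release handshake can be sketched like this. Everything here is a stand-in: a CRC replaces the real ECC check, `FlakyLink` is a hypothetical link that corrupts a fixed number of transfers, and the retry limit is arbitrary.

```python
import zlib


class FlakyLink:
    """Hypothetical link that corrupts the first `bad_sends` transfers."""

    def __init__(self, bad_sends: int) -> None:
        self.bad_sends = bad_sends

    def transfer(self, payload: bytes, check: int) -> tuple[bytes, int]:
        if self.bad_sends > 0:
            self.bad_sends -= 1
            payload = bytes([payload[0] ^ 0x01]) + payload[1:]  # flip one bit
        return payload, check


def llc_receive(payload: bytes, check: int) -> bool:
    """LLC side: True means 'release' (arrived OK), False means 'retransmit'."""
    return zlib.crc32(payload) == check  # CRC stands in for the real ECC check


def write_back(line: bytes, link: FlakyLink, max_tries: int = 8) -> int:
    """CPU side: keep the line in the miss buffer until the LLC releases it."""
    miss_buffer = line                      # held until release arrives
    check = zlib.crc32(miss_buffer)
    for attempt in range(1, max_tries + 1):
        if llc_receive(*link.transfer(miss_buffer, check)):
            miss_buffer = None              # release received: buffer can be freed
            return attempt
    raise RuntimeError("no release from LLC; line still held in miss buffer")


print(write_back(bytes(64), FlakyLink(bad_sends=2)))  # succeeds on the third send
```

Because the sender keeps its copy until the release arrives, a bad transfer costs only a retransmission and the stored data never needs to be poisoned for a link error.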

From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Mar 9 18:48:26 2024

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

The LLC (or memory controller) can optionally support an interrupt
to management software to indicate that an uncorrected fault occurred;
that would, of course, be asynchronous and occur long after the
store had retired.

The Interrupt Tables are manipulated by LLC (set, clear) and this is
transmitted to CPU[*] by the cache coherence protocol (Invalidate Addr).

From EricP@21:1/5 to All on Sun Mar 10 11:53:37 2024

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.
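
The two store cases can be modelled with a simple flag. This sketch does not compute a real SECDED code; the `poisoned` field stands in for a stored "double error" ECC pattern, and `EccWord`, `store` and `load` are hypothetical names.

```python
from dataclasses import dataclass

WORD_BYTES = 8  # one ECC code covers 8 data bytes (72-bit SECDED assumed)


@dataclass
class EccWord:
    data: bytearray
    poisoned: bool = False  # stands in for a stored "double error" ECC pattern


def store(word: EccWord, offset: int, new_bytes: bytes) -> None:
    """Apply a store that falls within a single 8-byte ECC word."""
    assert 0 <= offset and offset + len(new_bytes) <= WORD_BYTES
    full_overwrite = offset == 0 and len(new_bytes) == WORD_BYTES
    word.data[offset:offset + len(new_bytes)] = new_bytes
    if full_overwrite:
        word.poisoned = False   # every byte is new, so a fresh valid ECC is safe
    # Partial store: unknown which bytes were bad, so the poison must stay.


def load(word: EccWord) -> bytes:
    if word.poisoned:
        raise ValueError("uncorrectable ECC error")
    return bytes(word.data)


w = EccWord(bytearray(WORD_BYTES), poisoned=True)
store(w, 0, b"\x11\x22")            # 2-byte store: line stays poisoned
print(w.poisoned)
store(w, 0, bytes(range(8)))        # aligned 8-byte store: poison cleared
print(w.poisoned, load(w))
```

Only the full aligned overwrite clears the poison; any later load of a partially overwritten word still reports the fault.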

From EricP@21:1/5 to Paul A. Clayton on Sun Mar 10 12:39:15 2024

Paul A. Clayton wrote:

On 2/26/24 10:48 AM, EricP wrote:

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

Multiple parallel pipelines is fine but it has to sequence the pipe
exits
so the results retire in order for precise exceptions and interrupts.

In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.

Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?

The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.

Yes, early OoO retire with precise exception is possible.
The criteria would seem to be that:
- all older instructions that might generate an exception must have
executed without detecting an exception
- plus all older loads and stores translated their virtual addresses
(loads don't need to have completed execution, and stores will not have)
- plus all older conditional branches have executed without mispredicting.
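
The three criteria can be written down as a predicate over the older entries of the window. This is a software sketch of the conditions only, not of the parallel circuit; the `Uop` fields and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    kind: str                 # "alu", "load", "store" or "branch"
    can_fault: bool = False   # might this uop raise an exception?
    executed: bool = False    # finished executing with no exception detected
    translated: bool = False  # loads/stores: virtual address translated OK
    resolved: bool = False    # branches: executed and predicted correctly


def is_safe_older(u: Uop) -> bool:
    """True if this older uop can no longer force a rollback."""
    if u.kind in ("load", "store"):
        return u.translated             # need not have completed
    if u.kind == "branch":
        return u.resolved
    return u.executed or not u.can_fault


def can_retire_early(window: list[Uop], index: int) -> bool:
    """window is in program order, oldest first; index is the candidate."""
    return window[index].executed and all(is_safe_older(u) for u in window[:index])


window = [
    Uop("load", translated=True),                    # still waiting for data
    Uop("branch", resolved=True),
    Uop("alu", can_fault=True, executed=False),      # e.g. a divide in flight
    Uop("alu", executed=True),                       # candidate
]
print(can_retire_early(window, 3))   # blocked by the divide
window[2].executed = True
print(can_retire_early(window, 3))   # now safe
```

In software this is one linear scan; in hardware every candidate needs the equivalent of that scan evaluated in parallel.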

My concern is that the circuit for doing this could be pretty complicated.
Many of the pieces that have to be checked are scattered around the core.
Also many of the states are in circular buffers so determining "older" starts
getting slightly hairy (the Load Store Queue has a similar problem for
disambiguation determining if all older loads and stores have "resolved").
And all this has to run in parallel so it takes less than 1 clock.
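
The "older" test in a circular buffer can be sketched as a comparison of distances from the head pointer. This shows the arithmetic only, assuming both slots hold live entries; `is_older` is a hypothetical helper, and hardware would more likely use an extra wrap bit on the pointers than a modulo.

```python
def is_older(a: int, b: int, head: int, size: int) -> bool:
    """True if slot `a` holds an older entry than slot `b`.

    `head` is the slot of the oldest live entry in a circular buffer of
    `size` slots. Raw indices cannot be compared once the buffer has
    wrapped, so compare distances from the head instead.
    """
    return (a - head) % size < (b - head) % size


# 8-entry buffer whose oldest entry sits in slot 6: age order is 6,7,0,1,...
print(is_older(7, 0, head=6, size=8))  # slot 7 was allocated before slot 0
print(is_older(0, 7, head=6, size=8))
```

A plain `a < b` on the raw indices would give the wrong answer for both calls, which is the hairy part once several such buffers have to be consulted in the same cycle.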

The motivation for early OoO retire is usually early recycling of some
resources, usually physical registers. However note that you can't early
recycle some resources like entries in circular buffers, such as the
Instruction Queue, ROB/Future-File, LSQ, Branch Queue.

So the question I have is: is it really worth it?

This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.

I define exceptions as part of the ISA, internal, and synchronous with
their triggering instruction. Doing so allows the exception mechanism
to focus on doing just its one thing.

An asynchronous event would therefore not be an "exception".

I define interrupts as asynchronous restartable traps,
with model dependent delivery and control.

I define errors as a whole different category from exceptions and
interrupts, and explicitly model dependent, and each error has its
own characteristics.

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

Errors are totally model and situation dependent. A bus parity error
reading a cache line from DRAM might mean logging the error and repeating
the last flit transfer, while a bus parity error reading a device control
register is device dependent whether it can be repeated as some devices
change state on register read (e.g. a UART's received byte FIFO).

I suspect general out-of-order retirement would not be worthwhile
with precise exceptions; it just sounds complex. My comment was
mainly to point out that such was possible not that it was wise.

I agree it sounds complicated. The motivation for doing so which
I have seen is usually to recycle some resources earlier.
But you also have to consider all the resources required to
manage freeing up those resources earlier.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to EricP on Sun Mar 10 13:26:01 2024

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
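
A minimal sketch of the store policy described above, assuming a textbook extended Hamming (72,64) SECDED code. The bit layout, the `poison` encoding (flipping two check bits) and the function names are assumptions for illustration, not any particular memory controller's scheme.

```python
# Extended Hamming (72,64): bit 0 is overall parity, bits at power-of-two
# positions 1..64 are check bits, the other 64 positions hold data.
DATA_POS = [p for p in range(1, 72) if p & (p - 1)]  # 64 non-power-of-two slots

def syndrome(word):
    """XOR of the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s

def encode(data):
    """Return a valid 72-bit codeword for a 64-bit data value."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    syn = syndrome(word)
    for k in range(7):                 # set check bits so the syndrome is 0
        if syn & (1 << k):
            word |= 1 << (1 << k)
    if bin(word).count("1") & 1:       # make overall parity even
        word |= 1
    return word

def check(word):
    """Classify a codeword as 'ok', 'single' (correctable) or 'double'."""
    syn = syndrome(word)
    odd = bin(word).count("1") & 1
    if odd:
        return "single"
    return "ok" if syn == 0 else "double"

def poison(word):
    """Assumed poison marker: flip two check bits to force a double error."""
    return word ^ 0b110

def store(old_word, new_data, full_overwrite):
    """new_data is the merged 8-byte value after the store."""
    if full_overwrite or check(old_word) != "double":
        return encode(new_data)        # bad data fully replaced, or none present
    return poison(encode(new_data))    # partial overwrite: keep the error marker
```

A full aligned 8-byte store clears the error, while a partial store on a doubleword that already had a double error re-encodes the merged data and then deliberately marks it as a double error.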

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.


From Scott Lurndal@21:1/5 to EricP on Sun Mar 10 18:34:12 2024

EricP <[email protected]> writes:

EricP wrote:

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.


From MitchAlsup1@21:1/5 to EricP on Sun Mar 10 19:31:10 2024

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

For my scenario to transpire:: the cache line written back would have
had to be read in the L1/L2-cache with correct ECC (which accompanies
the line to DRAM controller) and the whole line would be written into
DRAM with the original ECC.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

Even uncacheable DRAM is accessed line-at-a-time.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

It knows which DoubleWords contain bad ECC ...

When the modified line is written back to DRAM it retains the
double error ECC.

Straight from the CPU cache.


From MitchAlsup1@21:1/5 to EricP on Sun Mar 10 19:39:21 2024

EricP wrote:

Paul A. Clayton wrote:

On 2/26/24 10:48 AM, EricP wrote:

Paul A. Clayton wrote:

On 1/28/24 1:48 PM, EricP wrote:
[snip]

The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-order
retirement.

Yes, early OoO retire with precise exception is possible.
The criteria would seem to be that:
- all older instructions that might generate an exception must have
executed without detecting an exception
- plus all older loads and stores translated their virtual addresses
(loads don't need to have completed execution, and stores will not have)
- plus all older conditional branches have executed without mispredicting.

You missed
- all inbound cache lines need to have arrived without ECC errors.
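
The criteria listed above can be sketched as a predicate over the older in-flight instructions. This is a minimal illustration under stated assumptions: the `OlderOp` fields and function names are hypothetical, and the model treats a still-outstanding inbound cache line as blocking, per the ECC condition just added.

```python
from dataclasses import dataclass

@dataclass
class OlderOp:
    """Hypothetical per-instruction status bits; not from a real design."""
    may_except: bool = False      # can this instruction raise an exception?
    executed: bool = False
    excepted: bool = False
    is_mem: bool = False          # load or store
    translated: bool = False      # virtual address translated without fault
    is_cond_branch: bool = False
    mispredicted: bool = False
    awaiting_line: bool = False   # inbound cache line has not arrived yet
    ecc_error: bool = False       # line arrived with an uncorrectable error

def blocks_early_retire(op):
    """True if this older instruction prevents a younger one retiring early."""
    if op.may_except and (not op.executed or op.excepted):
        return True
    if op.is_mem and not op.translated:
        return True
    if op.is_cond_branch and (not op.executed or op.mispredicted):
        return True
    if op.awaiting_line or op.ecc_error:
        return True
    return False

def may_retire_early(older_ops):
    """A younger instruction may retire only if no older one blocks it."""
    return not any(blocks_early_retire(op) for op in older_ops)
```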

My concern is that the circuit for doing this could be pretty complicated.

Essentially equal in complexity to an IO retirement µArchitecture.

Many of the pieces that have to be checked are scattered around the core. Also many of the states are in circular buffers so determining "older" starts getting slightly hairy (the Load Store Queue has a similar problem for disambiguation determining if all older loads and stores have "resolved"). And all this has to run in parallel so it takes less than 1 clock.

The motivation for early OoO retire is usually early recycling of some resources, usually physical registers. However note that you can't early recycle some resources like entries in circular buffers, such as the Instruction Queue, ROB/Future-File, LSQ, Branch Queue.

So the question I have is it really worth it?

History says no.

This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.

Which is exactly the IO retire criterion. Why go OoO retire when you have
to be able to IO retire under certain circumstances ?!?

<snip>

I define errors as a whole different category from exceptions and
interrupts, and explicitly model dependent, and each error has its
own characteristics.

Errors are different--machine checks, for example. These things SHOULD
not happen and you really do want to know if they do.


From EricP@21:1/5 to Scott Lurndal on Sun Mar 10 15:44:38 2024

Scott Lurndal wrote:

EricP <[email protected]> writes:

EricP wrote:

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.

And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.

The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.

There is little point in decommissioning the physical page frame for
just one incident as most dram errors are random single event upsets
which can affect multiple bits in adjacent rows or columns.
So there could be multiple errors in a frame due to a single event.

To decommission a bad frame you'd want to see multiple such events,
indicating perhaps a bad row or column.
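
As a rough sketch of that policy, assuming the OS keeps a per-frame count of error events: the class name and the threshold value are made up for illustration, and a real policy would likely also look at which rows or columns the errors fall in.

```python
from collections import Counter

class FrameErrorLog:
    """Hypothetical per-frame error bookkeeping."""

    def __init__(self, threshold=3):   # threshold is an arbitrary example value
        self.threshold = threshold
        self.events = Counter()

    def record_event(self, frame):
        """Log one error event; return True if the frame should be retired."""
        self.events[frame] += 1
        return self.events[frame] >= self.threshold
```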


From MitchAlsup1@21:1/5 to EricP on Sun Mar 10 19:41:15 2024

EricP wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

The line was displaced from an L1/L2 cache and its DRAM landing spot is
not in DRAM ?? but over on some disk/SSD ?!?

How (the frick) did it get into L1/L2 if it was not in DRAM ?? and thus
not on disk (as its only access point). ????


From EricP@21:1/5 to All on Sun Mar 10 17:52:29 2024

MitchAlsup1 wrote:

EricP wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.

What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

The line was displaced from an L1/L2 cache and its DRAM landing spot is
not in DRAM ?? but over on some disk/SSD ?!?
How (the frick) did it get into L1/L2 if it was not in DRAM ?? and thus
not on disk (as its only access point). ????

I'm just pointing out that the erroneous value with its
poisoned ECC that is written from LLC back to DRAM can eventually
lose its ECC error tag when it is out swapped.

What we are left with is probably an error report buried in a
log someplace and a seemingly valid value on disk.


From Scott Lurndal@21:1/5 to EricP on Sun Mar 10 22:14:50 2024

EricP <[email protected]> writes:

Scott Lurndal wrote:

EricP <[email protected]> writes:

EricP wrote:

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.

And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.

The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.

No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.


From EricP@21:1/5 to Scott Lurndal on Mon Mar 11 14:19:19 2024

Scott Lurndal wrote:

EricP <[email protected]> writes:

Scott Lurndal wrote:

EricP <[email protected]> writes:

EricP wrote:

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.

And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.

The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.

No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.

Perhaps but tossing a whole block from an IO expands the size of
the problem by a factor of 1000's.

If that was one byte wrong in a text file then I think most people
would want it written, as opposed to tossing out their work.

If that was one byte wrong in a file system meta data block then
there is no good answer. Many of the meta data blocks are in linked lists
or B+ trees so not writing the block could corrupt a whole file system,
and writing the block could also cause corruption but hopefully less likely.

So you are damned if you do fix the ECC and write the block,
and damned if you don't. But do seems less damning.


From Scott Lurndal@21:1/5 to EricP on Mon Mar 11 18:50:03 2024

EricP <[email protected]> writes:

Scott Lurndal wrote:

EricP <[email protected]> writes:

Scott Lurndal wrote:

EricP <[email protected]> writes:

EricP wrote:

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.

And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.

The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.

No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.

Perhaps but tossing a whole block from an IO expands the size of
the problem by a factor of 1000's.

Not having the data (or at least the data in the I/O block being
written (512/4k) given non-sequential underlying disk sector allocations)
is _far far_ better than having corrupt data. The former can be
repaired. The latter may not even be detected.

If that was one byte wrong in a text file then I think most people
would want it written, as opposed to tossing out their work.

I really doubt that any programmer would prefer bad data to no data.


From EricP@21:1/5 to All on Wed Mar 13 10:24:25 2024

MitchAlsup1 wrote:

EricP wrote:

My concern is that the circuit for doing this could be pretty
complicated.

Essentially equal in complexity to an IO retirement µArchitecture.

For my uArch Retire should be quite straight forward to implement.

Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:

- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch register
- increments IQ tail pointer, freeing the entry.

If the entry's Exception flag is set then it is also straight forward,
with a flush of all in-flight instructions, bulk copy the Committed-RAT
into the Future-RAT to restore renaming, and set a jump address in Fetch.
(Any in-flight cache miss operations are allowed to complete.)

This is also relatively straight forward to do multiple retires per clock,
each mostly costs an extra read port on IQ and extra write ports on the Committed-RAT and the Physical Register Status register.
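
A minimal software model of the single-retire case described above, assuming the IQ and branch queue behave as simple FIFOs. The class and field names are hypothetical, chosen only to mirror the text; this says nothing about how the real circuit is organised.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class IQEntry:
    length: int = 4                  # instruction length in bytes
    taken_branch: bool = False
    arch_reg: Optional[int] = None   # destination architectural register
    new_preg: Optional[int] = None   # physical register holding the result
    done: bool = False
    exception: bool = False

class RetireModel:
    def __init__(self, num_arch, num_phys):
        self.committed_rat = list(range(num_arch))
        self.future_rat = list(range(num_arch))
        self.arch_flag = [p < num_arch for p in range(num_phys)]
        self.committed_rip = 0
        self.iq = deque()            # left end is the tail (oldest entry)
        self.branch_queue = deque()  # resolved branch targets, oldest first
        self.fetch_redirect = None

    def retire_one(self, handler=0):
        """Try to retire the oldest IQ entry; report what happened."""
        if not self.iq or not self.iq[0].done:
            return "stall"
        e = self.iq[0]
        if e.exception:
            self.iq.clear()                       # flush in-flight instructions
            self.branch_queue.clear()
            self.future_rat = list(self.committed_rat)  # restore renaming
            self.fetch_redirect = handler         # jump address for Fetch
            return "exception"
        if e.taken_branch:
            self.committed_rip = self.branch_queue.popleft()
        else:
            self.committed_rip += e.length
        if e.arch_reg is not None:
            old = self.committed_rat[e.arch_reg]
            self.arch_flag[old] = False           # frees the old dest register
            self.arch_flag[e.new_preg] = True
            self.committed_rat[e.arch_reg] = e.new_preg
        self.iq.popleft()                         # advance the IQ tail
        return "retired"
```

Retiring several instructions per clock would correspond to calling `retire_one` on the next few tail entries in the same cycle, which in hardware is where the extra read and write ports come in.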

Many of the pieces that have to be checked are scattered around the core.
Also many of the states are in circular buffers so determining "older" starts
getting slightly hairy (the Load Store Queue has a similar problem for
disambiguation determining if all older loads and stores have
"resolved").
And all this has to run in parallel so it takes less than 1 clock.

Adding the structures to support OoO Retire would greatly complicate this.


From MitchAlsup1@21:1/5 to EricP on Wed Mar 13 15:34:47 2024

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

My concern is that the circuit for doing this could be pretty
complicated.

Essentially equal in complexity to an IO retirement µArchitecture.

For my uArch Retire should be quite straight forward to implement.

Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:

- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch register
- increments IQ tail pointer, freeing the entry.

All of these would have been completed when the instruction comes out
of its function unit, and then retire multiplexes this data onto the
current retired instruction state. {2-gates not 13-gates}

If the entry's Exception flag is set then it is also straight forward,
with a flush of all in-flight instructions, bulk copy the Committed-RAT
into the Future-RAT to restore renaming, and set a jump address in Fetch. (Any in-flight cache miss operations are allowed to complete.)

This is also relatively straight forward to do multiple retires per clock, each mostly costs an extra read port on IQ and extra write ports on the Committed-RAT and the Physical Register Status register.

Many of the pieces that have to be checked are scattered around the core.
Also many of the states are in circular buffers so determining "older" starts
getting slightly hairy (the Load Store Queue has a similar problem for
disambiguation determining if all older loads and stores have
"resolved").
And all this has to run in parallel so it takes less than 1 clock.

Adding the structures to support OoO Retire would greatly complicate this.


From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 13 15:31:50 2024

Scott Lurndal wrote:

EricP <[email protected]> writes:

No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.

Perhaps but tossing a whole block from an IO expands the size of
the problem by a factor of 1000's.

Not having the data (or at least the data in the I/O block being
written (512/4k) given non-sequential underlying disk sector allocations)
is _far far_ better than having corrupt data. The former can be
repaired. The latter may not even be detected.

If that was one byte wrong in a text file then I think most people
would want it written, as opposed to tossing out their work.

I really doubt that any programmer would prefer bad data to no data.

Any application dealing with money will prefer knowing the data is bad
to not knowing if the data is bad.

On the other hand, engine controllers deal with bad data all the time,
and correct any current data problem on the next engine revolution.


From MitchAlsup1@21:1/5 to EricP on Wed Mar 13 15:36:50 2024

EricP wrote:

Scott Lurndal wrote:

EricP <[email protected]> writes:

Scott Lurndal wrote:

EricP <[email protected]> writes:

EricP wrote:

As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.

Storing the bad <arriving> ECC should take care of that.

I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.

However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.

In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.

When the modified line is written back to DRAM it retains the
double error ECC.

And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.

If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.

And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.

The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.

No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.

Consider a page being written out and the last cache line in the page
has a bad ECC. What command does one send the disk to indicate "forget
all that data I just sent you" ??

Perhaps but tossing a whole block from an IO expands the size of
the problem by a factor of 1000's.

If that was one byte wrong in a text file then I think most people
would want it written, as opposed to tossing out their work.

If that was one byte wrong in a file system meta data block then
there is no good answer. Many of the meta data blocks are in linked lists
or B+ trees so not writing the block could corrupt a whole file system,
and writing the block could also cause corruption but hopefully less likely.

So you are damned if you do fix the ECC and write the block,
and damned if you don't. But do seems less damning.


From EricP@21:1/5 to All on Wed Mar 13 12:31:14 2024

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

My concern is that the circuit for doing this could be pretty
complicated.

Essentially equal in complexity to an IO retirement µArchitecture.

For my uArch Retire should be quite straight forward to implement.

Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:

- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch
register
- increments IQ tail pointer, freeing the entry.

All of these would have been completed when the instruction comes out of
its function unit, and then retire multiplexes this data onto the
current retired instruction state. {2-gates not 13-gates}

IIRC the Alpha 21064 was 16 gates per stage so if my Retire unit
could hit 13 gates I'd be extremely chuffed (delighted).
I would likely be targeting 20 gates per stage anyway.


From Scott Lurndal@21:1/5 to [email protected] on Wed Mar 13 16:16:39 2024

[email protected] (MitchAlsup1) writes:

EricP wrote:

No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.

Consider a page being written out and the last cache line in the page
has a bad ECC. What command does one send the disk to indicate "forget
all that data I just sent you" ??

You're not sending the data asynchronously. The DMA engine on
the disk controller is requesting the data from DRAM. The
response to the DMA READ indicates an error to the disk
controller and it aborts the write (and if it is buffered in the
RAM disk controller cache, the entire I/O can be aborted
with no change to the sector(s) on disk).
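
A minimal sketch of that flow, assuming each DRAM line comes back from the DMA read with a poison indication and the controller stages the whole sector before committing it. The function name and the `(data, poisoned)` representation are hypothetical.

```python
def dma_write_sector(lines, disk, sector):
    """Model a disk controller pulling a sector's worth of lines from DRAM.

    lines: list of (data_bytes, poisoned) pairs as returned by DMA reads.
    disk:  dict mapping sector number to bytes.
    Returns True if the sector was written, False if the I/O was aborted.
    """
    staged = bytearray()          # stands in for the controller's RAM cache
    for data, poisoned in lines:
        if poisoned:
            return False          # abort: disk[sector] is left unchanged
        staged += data
    disk[sector] = bytes(staged)  # commit only after every line read cleanly
    return True
```

Because nothing reaches the medium until all lines have been read, a poisoned last line leaves the old sector contents intact.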


From MitchAlsup1@21:1/5 to EricP on Wed Mar 13 19:14:51 2024

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

My concern is that the circuit for doing this could be pretty
complicated.

Essentially equal in complexity to an IO retirement µArchitecture.

For my uArch Retire should be quite straight forward to implement.

Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:

- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch
register
- increments IQ tail pointer, freeing the entry.
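The retire steps listed above can be sketched as a small behavioural model. This is only an illustration under stated assumptions: the class and field names (`IQEntry`, `RetireState`, `retire_one`) are hypothetical, the exception path is not modelled, and a Python deque stands in for the IQ and its tail pointer.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IQEntry:
    length: int                      # instruction length in bytes
    done: bool = False
    exception: bool = False
    taken_branch: bool = False
    arch_reg: Optional[int] = None   # architectural dest register, if any
    new_phys: Optional[int] = None   # physical register holding the new value


@dataclass
class RetireState:
    committed_rip: int = 0
    branch_queue: deque = field(default_factory=deque)  # oldest target first
    committed_rat: dict = field(default_factory=dict)   # arch reg -> phys reg
    arch_flag: set = field(default_factory=set)         # phys regs with committed state
    iq: deque = field(default_factory=deque)            # index 0 is the oldest entry


def retire_one(state):
    """Try to retire the oldest IQ entry; return True if one was retired."""
    if not state.iq:
        return False
    entry = state.iq[0]
    if not entry.done or entry.exception:
        return False  # not finished, or needs the exception path (not modelled)
    if entry.taken_branch:
        state.committed_rip = state.branch_queue.popleft()
    else:
        state.committed_rip += entry.length
    if entry.arch_reg is not None:
        old_phys = state.committed_rat.get(entry.arch_reg)
        state.arch_flag.discard(old_phys)  # clearing the flag frees the old register
        state.arch_flag.add(entry.new_phys)
        state.committed_rat[entry.arch_reg] = entry.new_phys
    state.iq.popleft()  # stands in for incrementing the IQ tail pointer
    return True
```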

All of these would have been completed when the instruction comes out of
its function unit, and then retire multiplexes this data onto the
current retired instruction state. {2-gates not 13-gates}

IIRC the Alpha 21064 was 16

carefully tuned

gates per stage so if my Retire unit
could hit 13 gates I'd be extremely chuffed (delighted).
I would likely be targeting 20 gates per stage anyway.

For example, Athlon was a 16-gate machine and Opteron was a 17-gate
machine. The 64-bit* integer adder was 11-gates of delay which had
been carefully tuned so it was at least as fast as 8-random gates
of FO4.

(*) and the 56-bit fraction FADD adder was also 11-gates.

As to gates of delay per stage::

At 20-gates you can run 6-wide forwarding anything goes anywhere and hit
each cache port twice per cycle (generally 1 RD 1 WT). This µArchitecture shortens the number of retire stages. One can also use register file ports twice per cycle so a 6-port RF can do 6 RDs and 6 WTs per cycle.

At 16-gates 3-4-wide machines can perform everything goes everywhere forwarding but cannot run an SRAM twice per cycle {either RD-RD or RD-WT}. {It is right
on the edge of doable to use your register ports twice per cycle--I would recommend not trying.} 30 years ago, with circuit designers tuning gates, you could; now, with gates-only-from-library, you cannot.

At 12-gates per stage you cannot perform anything goes anywhere forwarding
{for example an ADD-Byte (x86) could not be forwarded to a 32-bit or 64-bit integer ADD. Part of the problem is x86 defines byte addition as insert.}

At 8-gates per stage, the integer adder and accessing SRAM both take an
entire cycle, so a LD cannot be shorter than 3-cycles and set associative caches are often 4-cycles. {So DM caches may actually outperform SA cache} Decode is at least 2 cycles even on a 1-wide machine. Decode is at least 3-cycles on a GBOoO machine. Forwarding is approximately ½ cycle.

-----------------------------

Having done designs in each of these arenas:: I lean towards 16-gates on
narrow machines and 20-gates on GBOoO machines.
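To put the gate counts above on one scale, here is a rough sketch that assumes cycle time grows linearly with gates per stage and ignores latch overhead, clock skew and wire delay (all assumptions, not claims from the post). The function names are hypothetical.

```python
def relative_frequency(gates_per_stage, baseline_gates=20):
    """Clock frequency relative to a baseline design, assuming period ~ gate count."""
    return baseline_gates / gates_per_stage


def latency_in_gate_delays(cycles, gates_per_stage):
    """Total latency of an operation expressed in gate delays."""
    return cycles * gates_per_stage


# Relative clock of each design point versus the 20-gate machine.
speedups = {gates: relative_frequency(gates) for gates in (20, 16, 12, 8)}

# The 3-cycle LD at 8 gates per stage, expressed in gate delays.
load_at_8_gates = latency_in_gate_delays(3, 8)
```

Under this model the 8-gate design clocks 2.5 times faster than the 20-gate one, but its minimum 3-cycle LD still costs 24 gate delays, which is part of why the deeper pipeline does not translate directly into performance.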

From EricP@21:1/5 to Paul A. Clayton on Fri Mar 22 10:23:09 2024

Paul A. Clayton wrote:

On 1/22/24 9:44 AM, Paul A. Clayton wrote:
[snip]

Obviously an extremely biased workload like the data analysis
workloads targeted by Intel's research chip would probably show
A55 in a better light (though A55 would likely be very inefficient
compared to the research design, I think it used 4-way threaded
in-order cores with limited cache and narrow memory channels [to avoid
64-byte accesses to access 64-bits or less of data]), but
that would not be "fair".

I (finally) found a reference to the Intel research chip. https://ieeexplore.ieee.org/document/10188866
"The Intel Programmable and Integrated Unified Memory Architecture
Graph Analytics Processor" (Sriram Aananthakrishnan et al., 2023)
A PDF of the paper appears to be available at https://heirman.net/papers/aananthakrishnan2023piuma.pdf

Interesting. Thanks.
I haven't finished reading it but one thing I notice is that since
normally all of the chased pointers are virtual addresses, while they
mention "Address translation tables (ATT)", I didn't see how they
actually DO the virtual address translation during these offloaded chases.

Also interesting are some of the authors' other recent publications. E.g.:

https://scholar.google.com/citations?hl=en&user=bUTgzBUAAAAJ&view_op=list_works&sortby=pubdate

https://scholar.google.com/citations?hl=en&user=ySqvmSQAAAAJ&view_op=list_works&sortby=pubdate

This is a different approach to OoO uArch.
Existing OoO designs work on the basis that most things are serial and predictable. This approach is optimized for sparse: short sequential code segments intermixed with sparse conditional code segments, chasing sparse data.

From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Mar 24 19:00:22 2024

Paul A. Clayton wrote:

On 2/25/24 5:22 PM, MitchAlsup1 wrote:

Paul A. Clayton wrote:

[snip]

When I looked at the pipeline design presented in the Arm Cortex-
A55 Software Optimization Guide, I was surprised by the design.
Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
(ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
DIV/SQRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
stage before an ALU stage (clearly for AArch32).

Almost like an Mc88100 which had 5 pipelines.

I think I have an incorrect conception of data communication
(forwarding and register-to-functional-unit). I also seem to be
conflating somewhat issue port and functional unit. Forwarding
from nine locations to nine locations and the remaining eight
locations to eight locations (counting functional unit as a single
target location even though a functional unit may have three
functionally different input operands).

Much newer µArchitectural literature does not draw a firm box
properly around real function units.

For example, Mc 88120 has 6 function units buffered by 6 reservation
stations. Each function unit had an Integer Adder including things
like the branch resolution unit, FADD, and FMUL. When I drew those
boxes, I would show post-forwarding operands arriving at the FU
and then after arriving either being diverted to the INT unit or
being diverted to the "other" function unit. This way you could
count operand and result busses and end points for fan-in::fan-out
reasons.

This style seems to have fallen from favor; possibly because we made
the transition from value-containing reservation stations to value-
free reservation stations--alleviating register file porting problems.

I am used to functionality being merged; e.g., the multiplier also
having a general ALU. Merged functional units would still need to
route the operands to the appropriate functionality, but selecting
the operation path for two operands *seems* simpler than selecting
distinct operands and separate functional unit independently. This
might also be a nomenclature issue.

The above remains my style in µArchitecture literature, but when
describing block diagram and circuit design levels, only the interior
of the function unit is illustrated.

If one can only begin two operations in a cycle, the generality of
having nine potential paths seems wasteful to me. Having separate
paths for FP/Neon and GPR-using operations makes sense because of
the different register sets (as well as latency/efficiency-
optimized functional units vs. SIMD-optimized functional units;
sharing execution hardware is tempting but there are tradeoffs).

In general, operand timing is tight and you better not screw it up;
while result delivery timing only has to deal with fan-out and data
arrival issues.

My style was conceived back in the days where wires were fast and
metal was precious (3 layers). Now that we have 12-15 layers it
matters less, I suppose.

With nine potential issue ports, it seems strange to me that width
is strictly capped at two.

Likely to be a register porting or a register port analysis limitation. Value-free reservation stations exacerbate this.

Even though AArch64 does not have My
66000's Virtual Vector Method to exploit normally underutilized,
there would be cases where an extra instruction or two could
execute in parallel without increasing resources significantly. As
an outsider, I can only assume that any benefit did not justify
the costs in hardware and design effort. (With in-order execution,
even a nearly free [hardware] increasing of width may not result
in improved performance or efficiency.)

VVM works best with value-containing reservation stations.

The separation of MAC and DIV is mildly questionable — from my
very amateur perspective — not supporting dual issue of a MAC-DIV
pair seems very unlikely to hurt performance but the cost may be
trivial.

Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;

I also assume the other operations are usually available for
parallel execution (though this depends somewhat on compiler
optimization for the microarchitecture), so execution of a
multiply and a divide in parallel is probably uncommon.

In general, any 2 calculations that are not data-dependent, can
be launched into execution without temporal binds.

The FP/Neon section has these operations merged into a functional
unit; I guess — I am not motivated to look this up — that this is
because FP divide/sqrt use the multiplier while integer divide
does not.

The Chips and Cheese article also indicated that branches are only
resolved at writeback, two cycles later than if branch direction
was resolved in the first execution stage. The difference between
a six stage misprediction penalty and an eight stage one is not
huge, but it seems to indicate a difference in focus. With

In an 8 stage pipeline, the 2 cycles of added delay should hurt by
~5%-7%

5% performance loss sounds expensive for something that *seems*
not terribly expensive to fix.
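A back-of-envelope CPI model shows how two extra misprediction cycles can land in the quoted 5%-7% range. The inputs below (a base CPI of 0.6 and 0.02 mispredicted branches per instruction) are hypothetical values chosen for illustration, not measurements of the A55.

```python
def cpi(base_cpi, mispredicts_per_instruction, penalty_cycles):
    """Cycles per instruction with a flat misprediction penalty added."""
    return base_cpi + mispredicts_per_instruction * penalty_cycles


def slowdown(base_cpi, mispredicts_per_instruction, short_penalty, long_penalty):
    """Fractional increase in run time when the penalty grows."""
    fast = cpi(base_cpi, mispredicts_per_instruction, short_penalty)
    slow = cpi(base_cpi, mispredicts_per_instruction, long_penalty)
    return slow / fast - 1.0


# Hypothetical workload: six-stage versus eight-stage misprediction penalty.
loss = slowdown(base_cpi=0.6, mispredicts_per_instruction=0.02,
                short_penalty=6, long_penalty=8)
```

With these assumed inputs the CPI goes from 0.72 to 0.76, a loss of roughly 5.6%. A workload with fewer mispredictions or a higher base CPI would see less.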

[snip]

I would have *guessed* that an AGLU (a functional unit providing
address generation and "simple" ALU functions, like AMD's Bobcat?)
would be more area and power efficient than having separate
pipelines, at least for store address generation.

Be careful with assumptions like that. Silicon area with no moving
signals is remarkably power efficient.

There is also the extra forwarding for separate functional units
(and perhaps some extra costs from increased distance), but I
admit that such factors really expose my complete lack of hardware experience. (I am aware of clock gating as a power saving
technique and that "doing nothing" is cheap, but I have no
intuition of the weights of the tradeoffs.)

Mc 88120 had forwarding into the reservation stations and forwarding
between reservation station output and function unit input. That is
a lot of forwarding.

(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)

Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

[snip interesting stuff]

Perhaps mildly out-of-order designs (say a little more than the
PowerPC 750) are not actually useful (other than as a starting
point for understanding out-of-order design). I do not understand
why such an intermediate design (between in-order and 30+
scheduling window out-of-order) is not useful. It may be that

It is useful, just not all that much.

going from say 10 to 30 scheduler entries gives so much benefit
for relatively little extra cost (and no design is so precisely
area constrained — even doubling core size would not mean pushing
L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
emotional investment in intermediate and mixed designs.

10 does not accommodate much ILP beyond that of a 10 deep pipeline.
30 accommodates L1 cache misses and typical FP latencies.
90 accommodates "almost everything else"
250 accommodates multiple L1 misses with L2 hits and "everything
else".

Presumably the benefit depends on issue width and load-to-use
latency (pipeline depth, cache capacity, etc.). [For a cheap
"general purpose" processor, not covering FP latencies well may
not be very important.] Better hiding L1 _hit_ latency would seem
to provide a significant fraction of the frequency and ILP benefit
of out-of-order for a smallish core. (Some branch resolution
latency can also be hidden; an in-order core can delay resolution
until writeback of control-dependent instructions, but OoO's extra
buffering facilitates deeper speculation.)

If one has a scheduling window of 90 operations, having only three
issue ports seems imbalanced to me.

I agree:: for Mc 88120 we had 96 instructions (max) in flight for
a 6-wide {issue, launch, execute, result, and retire}, we also
had a 16-cycle execution window, so to stream DGEMM {from Matrix300}
we had to execute a LD {which would miss ½ the time} and then have
4 cycles for FMUL and 3 cycles for FADD allowing ST to capture the
FADD result and ship it off to cache. Going backwards; 16-(1+3+4)
meant the LD->L1$->miss->memory->LDalign had only 8 cycles.

The modern version with FMAC would allow 11-cycles LD-Miss-Align.
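The arithmetic in the two paragraphs above can be written out as a short sketch. The function names are hypothetical. The 1-cycle term is taken from the "16-(1+3+4)" expression as given (the post does not label it), and the 4-cycle FMAC latency is an assumption inferred from the stated 11-cycle result.

```python
def instructions_in_flight(width, window_cycles):
    """Window capacity when every slot of every cycle can hold an instruction."""
    return width * window_cycles


def miss_budget(window_cycles, *other_latencies):
    """Cycles left for LD->L1$->miss->memory->LDalign after the rest of the chain."""
    return window_cycles - sum(other_latencies)


in_flight = instructions_in_flight(width=6, window_cycles=16)

# Original chain: 1 + FADD (3) + FMUL (4) taken out of the 16-cycle window.
budget_fmul_fadd = miss_budget(16, 1, 3, 4)

# FMAC variant: FMUL + FADD replaced by one op, assumed here to take 4 cycles.
budget_fmac = miss_budget(16, 1, 4)
```

This reproduces the 96 instructions in flight, the 8-cycle budget for the load miss, and the 11-cycle budget with FMAC.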

Out-of-order execution would also seem to facilitate opportunistic
use of existing functionality. Even just buffering decoded
instructions would seem to allow a 16-byte (aligned) instruction
fetch with two instruction decoders to issue more than two
instructions on some cycles without increasing register port
count, forwarding paths, etc. OoO would further increase the
frequency of being able to do more work with given hardware
resources.

My 66150 does 16B fetch and parses 2 instructions per cycle,
even though it is only 1-wide. By fetching wide, and scanning
ahead, one can identify branches and fetch their targets prior
to executing the branch, eliminating the need for the delay-slot
and reducing branch taken overhead down to about 0.13 cycles
even without branch prediction !!

But anything wider than 1-instruction will need a branch predictor
of some sort.

Perhaps there may even be a case for a 1+ wide OoO core, i.e., an
OoO core which sometimes issues more than one instruction in a
cycle.

For something like a smart phone, one or two small cores might be
useful for background activity, tasks whose latency (within a
broad range) is not related to system responsiveness for the user.

For a server expected to run embarrassingly parallel workloads, if

Servers are not expected to run embarrassingly parallel applications,
they are expected to run an embarrassingly large number of essentially
serial applications.

Shared caching of instructions still seems beneficial in "server
workloads" compared to fully general multiprogram workloads. A
database server might even have more sharing, potentially having a
single process (so page table sharing would be more beneficial),
but that seems a less common use.

a wimpy core provides sufficient responsiveness, I would expect
most of the cores (possibly even all of the cores) to be wimpy.
There might not be many workloads with such characteristics;

Talk to Google about that....

Urs Hölzle of Google put out a paper "Brawny cores still beat
wimpy cores, most of the time"(2010). While some of the points —
such as tail latency effects and software development costs —
made in the paper are (in my opinion) quite significant, I thought
the argument significantly flawed. (I even wrote a blog post about
this paper: https://dandelion-watcher.blogspot.com/2012/01/weak-case-against-wimpy-cores.html)

The microservice programming model (motivated, from what I
understand, by problem-size and performance scaling and service
reliability with moderately reliable hardware without requiring
much programming effort to support scaling) may also have
significant implications on microarchitecture.

The design space is also very large. One can have heterogeneity of
wimpy and brawny cores at the rack level, wimpy-only chips within
a heterogeneous package, heterogeneity within a chip, temporal
heterogeneity (SMT and dynamic partitioning of core resources),
etc. Core strength can vary widely and performance balance can be
diverse (e.g., a core with a quarter of the performance of a
brawny core on general tasks might have — with coprocessors,
tightly coupled accelerators, or general microarchitecture —
approximately equal performance for some tasks).

With a "proper interface" one should be able to off-load any
crypto processing to a place that is both time-constant and
where sensitive data never passes into the cache hierarchy of
an untrusted core.

The performance of weaker cores can also be increased by
increasing communication performance within local groups of such
cores. Exploiting this would likely require significant
programming effort, but some of the effort might be automated
(even before AI replaces programmers). This assumes that there is
significant communication that is less temporally local than
within a core (out-of-order execution changes the temporal
proximity of value communication; a result consumer might be
nearby in program order but substantially more distant in
execution order) and that intermediate resource allocation to
intermediate latency/bandwidth communication can be beneficial.

(I also think that there is an opportunity for optimization in the
on-chip network. Optimizing the on-chip network for any-to-any
communication seems less appropriate for many workloads not only
because of the often limited scale of communication but also
because the communication is, I suspect, often specialized.

And often necessarily serialized.

Getting a network design that is very good for some uses and
adequate for others seems challenging even with software cooperation.

See:: https://www.tachyum.com/media/pdf/tachyum_20isc20.pdf

Rings seem really nice for pipeline-style parallelism and some
other uses, crossbars seem nice for small node groups with heavy communication, grids seem to fit large node counts with nearest
neighbor communication (physical modeling?), etc. Channel width,
flit size, channel count also involve tradeoffs. Some
communication does not require sending an entire cache block of
data, but a smaller flit will have more overhead.)
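The flit-size tradeoff mentioned above can be made concrete with a small sketch. It assumes every flit carries a fixed-size header; the 4-byte header and the 16-byte and 72-byte flit sizes are hypothetical values picked only for illustration.

```python
import math


def flits_needed(payload_bytes, flit_bytes, header_bytes):
    """Number of flits required when each flit spends header_bytes on overhead."""
    per_flit = flit_bytes - header_bytes
    return math.ceil(payload_bytes / per_flit)


def link_efficiency(payload_bytes, flit_bytes, header_bytes):
    """Useful payload bytes divided by total bytes moved over the link."""
    total = flits_needed(payload_bytes, flit_bytes, header_bytes) * flit_bytes
    return payload_bytes / total


# An 8-byte message and a 64-byte cache block, each sent with small and large flits.
small_msg_small_flit = link_efficiency(8, 16, 4)
small_msg_large_flit = link_efficiency(8, 72, 4)
block_small_flit = link_efficiency(64, 16, 4)
block_large_flit = link_efficiency(64, 72, 4)
```

Under these assumptions the small flit wins for the 8-byte message (half the bytes are useful versus about one ninth), while the 64-byte block needs six small flits and is carried more efficiently by the single large one.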

We are arriving at the scale where we want to ship a cache line of data
in a single clock in order to have sufficient coherent BW for 128+ cores.

From Scott Lurndal@21:1/5 to [email protected] on Sun Mar 24 20:39:18 2024

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)

Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

I suspect Paul is referring to what ARMv8 calls "System Registers";
despite the name, most are stored in flops, and in the case of
the ID registers, wires (perhaps anded with local e-fuses).

Accesses to some of them are self-synchronizing[*];
the rest must be followed by an appropriate barrier
instruction for the effects to be architecturally visible.

[*] E.g. ICC_IAR1_EL1 (An interrupt acknowledge register).

From Anton Ertl@21:1/5 to Paul A. Clayton on Mon Mar 25 08:41:06 2024

"Paul A. Clayton" <[email protected]> writes:

On 3/24/24 4:39 PM, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)

...

However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.

Certainly. The A55 is similar to the 21164 (1994), which is much
bigger than the R2000. For competition to the R2000, better look at
the ARM1/ARM2, or, for something more contemporary, maybe the
Cortex-M1.

An argument might be made that some designs would have no use for
most of such extra state. Performance monitoring is useful for
software development (and theoretically for OS decisions for
scheduling, core migration, and other functions), but seems likely
to be highly underutilized for typical use. A55 is presumably
large enough that a synthesis-time removal of much of this
functionality would have a tiny effect on total area.

ARM also has the Cortex-A35 (with a 25% smaller core than the A53 and
80-100% of its performance according to ARM). I am unaware of it
being used in smartphones, though.

Even for a
microcontroller the area cost might not be problematic.

ARM-A is not for microcontrollers. ARM has ARM-M for that, e.g., the
Cortex-M0 if you want it to be really small.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From EricP@21:1/5 to Paul A. Clayton on Mon Mar 25 12:36:27 2024

Paul A. Clayton wrote:

On 3/24/24 4:39 PM, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)

However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly inconceivable for something like the MIPS R2000.

Many of these registers are configuration control that
get set once at boot and never change.
Others are dynamic but not time critical, like debug registers.

Only a small number would be diddled on a regular basis,
like interrupt control.

They don't all need the same access speed -
depending on usage some (most?) can be on "slow" buses
that maybe take multiple clocks to read or write.

From EricP@21:1/5 to EricP on Mon Mar 25 13:03:59 2024

EricP wrote:

Paul A. Clayton wrote:

On 3/24/24 4:39 PM, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)

However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.

Many of these registers are configuration control that
get set once at boot and never change.
Others are dynamic but not time critical, like debug registers.

Only a small number would be diddled on a regular basis,
like interrupt control.

They don't all need the same access speed -
depending on usage some (most?) can be on "slow" buses
that maybe take multiple clocks to read or write.

Also accessing many control registers must not occur out of order
and must be guarded either implicitly or explicitly by instructions
or uOps before and after to drain the pipeline.

From Scott Lurndal@21:1/5 to Paul A. Clayton on Mon Mar 25 17:04:44 2024

"Paul A. Clayton" <[email protected]> writes:

On 3/24/24 4:39 PM, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Paul A. Clayton wrote:

(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)

Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

I suspect Paul is referring to what ARMv8 calls "System Registers";

Yes. (There were also some debug registers, performance monitoring
registers, trace registers, etc.)

despite the name, most are stored in flops, and in the case of
the ID registers, wires (perhaps anded with local e-fuses).

Yes, many of the bits would be implemented as ROM/PROM and many
would presumably be scattered about because they control/interact
with specific functionality. They are similar to I/O device
registers. (I/O devices have also become more complex.)

However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.

Yes, there are over 1000 system registers. Most of them are
only present and implemented if associated feature(s) are supported by the implementation.

The MIPS 2000 was designed three decades ago and implemented in
a 2 micrometer node. Whose law states that logic will expand to
fill the area available :-)?

An argument might be made that some designs would have no use for
most of such extra state. Performance monitoring is useful for
software development (and theoretically for OS decisions for
scheduling, core migration, and other functions), but seems likely
to be highly underutilized for typical use.

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Mar 25 17:38:58 2024

Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

On 3/24/24 4:39 PM, Scott Lurndal wrote:

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

My 66000 Architecture defines 8 performance counters at each layer of
the design:: cores get 8 counters, L1s get 8 counters, L3s get 8
counters, Interconnect gets 8 counters, Memory Controller gets 8 counters,
PCIe root gets 8 counters--and every instance multiplies the counters.
All counters are available via MMI/O space, and can be copied out or reinitialized in a single LDM, STM, or MM instruction. Any thread with
a TLB mapping can read or write based on permission bits.
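To see how "every instance multiplies the counters" adds up, here is a small sketch. The instance counts describe a hypothetical chip and are not taken from any My 66000 implementation; only the figure of 8 counters per instance comes from the post.

```python
COUNTERS_PER_INSTANCE = 8  # from the post: 8 counters at each layer, per instance


def total_counters(instances):
    """Total performance counters given a map of block type -> instance count."""
    return sum(instances.values()) * COUNTERS_PER_INSTANCE


# Hypothetical chip configuration, for illustration only.
example_chip = {
    "core": 16,
    "L1": 16,
    "L3": 4,
    "interconnect": 1,
    "memory controller": 2,
    "PCIe root": 1,
}

example_total = total_counters(example_chip)
```

For this assumed configuration there are 40 instances and therefore 320 counters, which illustrates why mapping them into MMI/O space scales better than naming each one as a system register.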

From Anton Ertl@21:1/5 to Scott Lurndal on Mon Mar 25 18:35:35 2024

[email protected] (Scott Lurndal) writes:

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.

My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world
programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to [email protected] on Mon Mar 25 18:23:50 2024

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

On 3/24/24 4:39 PM, Scott Lurndal wrote:

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

My 66000 Architecture defines 8 performance counters at each layer of
the design:: cores get 8 counters, L1s get 8 counters, L3s get 8
counters, Interconnect gets 8 counters, Memory Controller gets 8 counters,
PCIe root gets 8 counters--and every instance multiplies the counters.

It's not really the number of counters that is important, rather
it is what the counters count (i.e. which events can be counted).

From John Dallman@21:1/5 to Anton Ertl on Mon Mar 25 20:22:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

[email protected] (Scott Lurndal) writes:

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel
or AMD.

The question is if "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the performance of their silicon better.

John

From Scott Lurndal@21:1/5 to John Dallman on Mon Mar 25 20:46:39 2024

[email protected] (John Dallman) writes:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

[email protected] (Scott Lurndal) writes:

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel
or AMD.

The question is if "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

Look at vtune, for example.

From Scott Lurndal@21:1/5 to Terje Mathisen on Mon Mar 25 20:48:08 2024

Terje Mathisen <[email protected]> writes:

Anton Ertl wrote:

[email protected] (Scott Lurndal) writes:

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.

My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world
programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.

Having reverse engineered the original Pentium EMON counters I got a
meeting with Intel about their next cpu (the PentiumPro), what I was
told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )

Scan chains. The modern interface to scan chains (which we used on the
mainframes in the late '70s/early '80s) is JTAG.

From Terje Mathisen@21:1/5 to Anton Ertl on Mon Mar 25 21:42:18 2024

Anton Ertl wrote:

[email protected] (Scott Lurndal) writes:

There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.

Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.

My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world
programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.

Having reverse engineered the original Pentium EMON counters I got a
meeting with Intel about their next cpu (the PentiumPro), what I was
told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Anton Ertl@21:1/5 to John Dallman on Tue Mar 26 09:22:31 2024

[email protected] (John Dallman) writes:

The question is if "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

That might explain why for the AmLogic S922X in the Odroid N2/N2+
there is a Linux 4.9 kernel that supports performance monitoring
counters (AmLogic put that in for their own uses), but the mainline
Linux kernel does not support perf on the S922X (perf was not in the
requirements of whoever integrated the S922X stuff into the mainline).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Terje Mathisen@21:1/5 to Scott Lurndal on Tue Mar 26 10:47:07 2024

Scott Lurndal wrote:

Terje Mathisen <[email protected]> writes:

Having reverse engineered the original Pentium EMON counters I got a
meeting with Intel about their next cpu (the PentiumPro), what I was
told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )

Scan chains. The modern interface to scan chains (which we used on the
mainframes in the late '70s/early '80s) is JTAG.

Thanks!

JTAG was indeed the term I was looking for (and not remembering). Maybe
I'm getting old?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 26 09:27:54 2024

[email protected] (Scott Lurndal) writes:

The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Look at vtune, for example.

And?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to Anton Ertl on Tue Mar 26 14:15:41 2024

[email protected] (Anton Ertl) writes:

[email protected] (Scott Lurndal) writes:

The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths and efficacy of cache coherency algorithms and
cache prefetchers.

Their target is not application analysis.

From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Oct 1 18:45:11 2024

On Tue, 26 Mar 2024 14:15:41 +0000, Scott Lurndal wrote:

[email protected] (Anton Ertl) writes:

[email protected] (Scott Lurndal) writes:

The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths and efficacy of cache coherency algorithms and
cache prefetchers.

Their target is not application analysis.

It is sequence compliance. At this point in the game all the FUs
are known to produce correct results. But we live in a world
where a test case:
a) takes the correct number of cycles,
b) leaves all the right bit patterns in registers and memory,
c) takes the right direction at all the branches,
d) and still goes through an invalid sequence to get there.

HW verification is mostly about proving the sequencers are correct.
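
To make the point concrete, here is a rough sketch of why an end-state check alone is not enough. Everything in it is made up for illustration: the four-state toy sequencer, the LEGAL transition table, and the Run record are assumptions, not anything from a real verification flow.

```python
from dataclasses import dataclass

# Hypothetical legal transitions of a toy sequencer (assumption for illustration).
LEGAL = {
    ("FETCH", "DECODE"),
    ("DECODE", "EXECUTE"),
    ("EXECUTE", "WRITEBACK"),
    ("WRITEBACK", "FETCH"),
}


@dataclass
class Run:
    states: list     # sequencer state observed in each cycle
    registers: dict  # final architectural state
    branches: list   # taken/not-taken outcome of each branch


def end_state_matches(run, ref):
    """Checks a), b) and c): cycle count, final state, branch directions."""
    return (len(run.states) == len(ref.states)
            and run.registers == ref.registers
            and run.branches == ref.branches)


def illegal_transitions(run):
    """Check d): list every state-to-state step that is not in LEGAL."""
    return [(a, b) for a, b in zip(run.states, run.states[1:])
            if (a, b) not in LEGAL]


reference = Run(["FETCH", "DECODE", "EXECUTE", "WRITEBACK"], {"r1": 42}, [True])
# Same cycle count, same results, same branches, but a scrambled sequence.
suspect = Run(["FETCH", "EXECUTE", "DECODE", "WRITEBACK"], {"r1": 42}, [True])

print(end_state_matches(suspect, reference))  # end-state check passes
print(illegal_transitions(suspect))           # sequence check does not
```

The suspect run passes the end-state comparison and only the per-transition check exposes it, which is the situation described in the list above.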
