In AnandTech's Exynos9820 comparison, one knows the process used
is the same, but one does not know how optimized the designs were
at the HDL level nor at the netlist ("compiled") level. It is also
possible to optimize the same HDL for different power-performance-
area targets.
I would not be surprised if ARM did not invest the same design
effort per unit performance (e.g.) in A55 as in A75.
Likewise, I could imagine Samsung putting less effort into
optimizing A55.
Performance optimization likely makes less sense
for background tasks (the likely targeted use for A55 in this
case) and the benefit of core-level power optimization is likely
less significant than the I/O power for many of the targeted
tasks. Even optimizing for low energy cost for bursty workloads
(useless energy for sleep/wakeup, e.g.) would probably not help
much because of system power consumption.
The memory system, on-chip network, and such would also affect the
energy efficiency. Exynos9820's memory system might _reasonably_
be optimized for high power/high performance use; that would tend
to hurt the efficiency of wimpy cores.
I think system power is also less likely to scale well downward
with performance. E.g., the same capacity L2 suited to one A75
core might properly service more than two A55 cores. If the design
had more A55 cores per L2 than A75 cores per L2, the A55 cores
could be at a power disadvantage in single threaded use just from
the L2 cache.
One might be able to adjust for system power scaling factors by
using all cores of a type for a run (e.g., SPECrate), but I
suspect that would be tricky given fixed aspects of the hardware.
<https://images.anandtech.com/doci/14072/Exynos9820-Perf-Estimated_575px.png>
<https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
from the article
<https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/4>
In the Exynos 9820, we see at different points of the DVFS curve:
       A55         |        A75
     in-order      |        OoO
perf   mW   pf/mW  | perf    mW   pf/mW
 1.0   22   0.046  |  3.7    88   0.042   highest efficiency point for each core
 1.4   33   0.042  |  3.7    88   0.042   same pf/mW at highest common efficiency
 2.7   90   0.030  |  3.7    88   0.042   same mW at lowest common mW
 5.1  400   0.013  |  5.1   124   0.041   same perf at highest common performance
 5.1  400   0.013  | 10.5   400   0.027   same mW at highest common mW
 5.1  400   0.013  | 17.2  1270   0.013   highest performance point for each core
"perf" is SPEC2006 Int+FP Geomean. "pf/mW" (shown as "Perf/W" in the
second graph) is SPEC Int+FP Geomean/mW (you can confirm this by
computing corresponding numbers from the first graph).
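As a quick check of that arithmetic, here is a minimal sketch that recomputes pf/mW from the perf and mW columns of the table. The row values are transcribed from the table; the names ROWS, perf_per_mw and efficiency_table are made up for illustration. Because the perf figures are themselves rounded, the recomputed ratios can differ from the table in the last digit.

```python
# Rows transcribed from the table: ((A55 perf, A55 mW), (A75 perf, A75 mW)).
ROWS = [
    ((1.0, 22), (3.7, 88)),      # highest efficiency point for each core
    ((1.4, 33), (3.7, 88)),      # same pf/mW at highest common efficiency
    ((2.7, 90), (3.7, 88)),      # same mW at lowest common mW
    ((5.1, 400), (5.1, 124)),    # same perf at highest common performance
    ((5.1, 400), (10.5, 400)),   # same mW at highest common mW
    ((5.1, 400), (17.2, 1270)),  # highest performance point for each core
]


def perf_per_mw(perf, mw):
    """Efficiency as tabulated: SPEC geomean score divided by milliwatts."""
    return perf / mw


def efficiency_table(rows=ROWS):
    """Return (A55 pf/mW, A75 pf/mW) for every row."""
    return [(perf_per_mw(*a55), perf_per_mw(*a75)) for a55, a75 in rows]


if __name__ == "__main__":
    for (a55, a75), (e55, e75) in zip(ROWS, efficiency_table()):
        print(f"A55 {a55[0]:>4} / {a55[1]:>4} mW = {e55:.3f}   "
              f"A75 {a75[0]:>4} / {a75[1]:>4} mW = {e75:.3f}")
```

At the "same perf" row the A55 needs 400 mW where the A75 needs 124 mW, i.e. roughly 3.2 times the power for the same score.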
The SPEC2006 workload probably also biases in favor of larger
cores, especially the FP portion.
I suspect A55 uses 64-bit width
SIMD execution (which makes sense for the targeted use), which
would substantially reduce SPECFP performance and possibly degrade
SPECINT performance.
Even the gcc component of SPECINT might be more compute dense than
the targeted workloads for A55 (which might often be more
performance constrained by I/O) and gcc is probably less "compute
dense" than other SPECINT components.
Obviously an extremely biased workload like the data analysis
workloads targeted by Intel's research chip would probably show
A55 in a better light (though A55 would likely be very inefficient
compared to the research design, I think it used 4-way threaded
in-order cores with limited cache and narrow memory channels [to
avoid 64-byte accesses to access 64-bits or less of data]), but
that would not be "fair".
Core efficiency cannot be isolated from the system, especially if
measured by system resource use (I *suspect* AnandTech measured
system power and subtracted idle system power).
Fair comparison is difficult, especially when the design targets
are different.
On Wed, 24 Jan 2024 07:47:31 +0000, Anton Ertl wrote:
I think that the comparison is as fair as we can get. Of course if
for some reason you don't want to be convinced, there are always some
straws that you can grasp in the hope that they will save the belief
system you favour. But if you look at it objectively, all evidence
there is (from Transmeta through Intel's E-cores and the lack of
in-order at Apple, Intel, and AMD to Andrei Frumusanu's Exynos 9820
data) supports the position that in-order is not more power-efficient
than OoO above a certain performance level, while the opposite
position cannot point to evidence, but only to some corners where we
don't have evidence, and where in-order fans hope that these corners
will favour in-order.
Above a certain performance level, _all_ cores are out-of-order.
If in-order is more power-efficient than out-of-order at *low*
performance levels, then the basic notion that implementing
out-of-order requires some extra transistors, and transistors
take power, is confirmed. That basic notion is what leads
people to hope that, if in-order could be extended to higher
performance levels, then it would provide power savings there
too.
Let us then imagine what a high-performance in-order CPU would
look like. Its goal would be to achieve what OoO achieves to
improve performance without being OoO.
Thus, such a CPU would have a giant architectural register
file - to match the large hardware register files, including
rename registers, of OoO systems.
So we're talking AMD 29000 or Itanium. AMD sold off the 29000,
and it's still being used for compatibility reasons in some
aviation hardware.
The sample size is small, and so it's not that unreasonable to
argue that although the Itanium failed to meet expectations, this
class of architectures may still deserve some more investigation
and study. Yes, there's no high-performance OoO-beating in-order
chip you can buy off the shelf today, but maybe it's still worth
trying to design one.
What arguments are there against that? I can see a few:
- It's been tried many times, and failed each time. (This
doesn't _seem_ to be the case, but the few times it was
tried may have been enough to prove the point.)
- The benefits of in-order at high performance are known
to be negligible. (That is, the gate cost of OoO at
high performance scales well, and becomes a decreasing
fraction of transistor count in higher-performance designs.
Mitch tells us that the GBOoO direction of progress is
*not sustainable*, so that doesn't seem to be the case.)
- The drawbacks of in-order outweigh their benefits.
(If you have larger register files, you have bigger
instructions, so you fetch more code out of DRAM.
Is that really enough to make the difference?)
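The "bigger instructions" point in the last item is simple arithmetic, and a rough sketch may help size it. The assumptions here are mine, not from the post: three register operands per instruction and a fixed 32-bit instruction word. The function names are hypothetical.

```python
def specifier_bits(num_registers):
    """Bits needed to name one of num_registers architectural registers."""
    return (num_registers - 1).bit_length()


def operand_bits(num_registers, operands=3):
    """Bits spent on register specifiers in one instruction.

    operands=3 assumes a three-address format (two sources, one
    destination), which is an illustrative assumption.
    """
    return operands * specifier_bits(num_registers)


def bits_left(num_registers, instruction_bits=32, operands=3):
    """Bits remaining for opcode and immediates in a fixed-width word."""
    return instruction_bits - operand_bits(num_registers, operands)


if __name__ == "__main__":
    for regs in (32, 128):
        print(regs, "registers:", operand_bits(regs), "specifier bits,",
              bits_left(regs), "bits left in a 32-bit word")
```

Under these assumptions, going from 32 to 128 registers costs six extra bits per three-operand instruction, which is why such designs tend toward wider instruction encodings.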
However, in framing this counterargument in favor of
in-order, the *fatal* drawback of in-order for high
performance has dawned on me. (Although Ivan Godard
in his Mill design is, in fact, making an effort to
address just this particular drawback!)
As Mitch notes, to further increase performance, OoO
has become GBOoO: ever larger hardware register files
and so on.
This means that, even if an in-order design which had
large architectural register files, an exposed pipeline,
and so on, matched _current_ OoO CPUs in performance
for less power...
the performance of OoO CPUs doesn't stand still...
and so the _next generation_ of the in-order design
would have to have *larger* register files (and, no
doubt, all sorts of other things)...
which means it wouldn't be upwards-compatible with
software for the last generation.
That's why in-order RISC ended up being succeeded by
OoO implementations of the same ISA! Going from 32
registers to 128 registers to stay in-order... isn't
just something you can *do only once*, and solve the
problem forever!
It is by looking at the real problem that the false hope
of high-performance in-order can finally be dashed. Maybe
it isn't technically impossible. But for the mass market
that wants to coalesce around a popular and stable
platform, it may not be able to meet *their* requirements,
even if such architectures could still find a niche
(like supercomputers that are only programmed by the
users themselves in FORTRAN).
John Savard
=500MHz), the A75 offers more performance at better efficiency; the
A55 can run at 1800MHz on the Exynos 9820, but one better shouldn't,
So here in-order provided lower performance at thrice the power
consumption, two years later.
What is clear is that currently, no one knows how to make in-order CPUs
as fast as OoO for "general purpose" computing (i.e. not things you can
run on things like GPGPUs or TPUs).
But indeed, the more interesting aspect is that even in terms of
efficiency, in-order seems to be a losing proposition.
I'd be interested to hear opinions about why that is the case.
I can think of two factors, tho there are probably more:
- in-order CPUs spend more time waiting (which is the cause for their
lower performance), and they still burn Joules while they wait,
which throws away the Joules they presumably saved by staying clear of
the OoO "baggage".
- OoO execution is naturally more asynchronous, making it possible to
make decisions about what to do when in a more local way, thus wasting
less energy on costly whole-chip synchronization.
Stefan
Anyway, a major advantage of OoO is that its scheduler can make use of
the dynamic branch predictor and its superior accuracy. (Joshua
Landau pointed out a way that allows static schedulers to make use of
this accuracy, but it's doubtful that this can be achieved without a
code explosion).
And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.
InO simply can't do that.
Furthermore: IO machines are always latency bound, while GBOoO machines
are schedule bound, capable of absorbing L1 cache misses, long cycle
count instructions, ... that significantly harm IO machines.
You can make a GBOoO machine clock faster than the IO machine simply
from less work in each pipe stage--and this makes up for the depth
of the pipeline.
Vector machines fell out of fashion when the length of the
vector register could no longer absorb the latency to memory.
{{Although NEC persisted for longer}}
EricP <[email protected]> writes:
And OoO can queue multiple overlapping cache misses.
This later allows multiple instructions to complete at once,
which allows multiple instructions to retire at once,
which allows it to fill in pipeline bubbles and catch up.
InO simply can't do that.
If it is designed accordingly (and I am sure that all IA-64
implementations are), it can: It starts a load, starts the next load
etc. The in-order property only comes into play when it wants to use
the result of one of these loads.
E.g., looking at <https://chipsandcheese.com/2023/10/01/arms-cortex-a510-two-kids-in-a-trench-coat/>, the A510 has a 5-entry load buffer. The text says:
Specifically, the A510 can overlap two cache misses with the following between them:
* 12 total instructions, up from 8 on the A53
* 6 FP instructions, up from 4 on the A53. This includes 128-bit
vector instructions on the A510 but not on the A53. A53 finds
vector operations scary and will stall immediately on encountering
one
* 3 branches, unchanged from A53
* 5 loads. The A53 would stall on any memory access past a cache miss.
And that's for a LITTLE core.
- anton
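To make the stall-on-use point concrete, here is a toy model (a sketch only, not a model of any real core). It issues one instruction per cycle in program order and compares a pipeline that blocks until each instruction completes with one that only waits when a source register is not yet available. The 100-cycle miss latency and all names are hypothetical, and structural limits such as the load buffer size are ignored.

```python
MISS_LATENCY = 100  # hypothetical cache-miss latency in cycles


def run(program, stall_on_use):
    """Return the cycle at which the last result of `program` is available.

    program: list of (dest, sources, latency) tuples in program order.
    stall_on_use=False: the next instruction issues only after the
        current one has completed (blocking pipeline).
    stall_on_use=True: instructions issue in order, one per cycle, and
        wait only for source registers that are still in flight.
    """
    ready = {}   # register -> cycle at which its value becomes available
    cycle = 0    # earliest cycle at which the next instruction may issue
    finish = 0
    for dest, sources, latency in program:
        issue = cycle
        if stall_on_use:
            for src in sources:
                issue = max(issue, ready.get(src, 0))
        done = issue + latency
        ready[dest] = done
        finish = max(finish, done)
        cycle = issue + 1 if stall_on_use else done
    return finish


# Two independent loads that both miss, then an add that uses both.
PROGRAM = [
    ("r1", (), MISS_LATENCY),
    ("r2", (), MISS_LATENCY),
    ("r3", ("r1", "r2"), 1),
]

if __name__ == "__main__":
    print("blocking:    ", run(PROGRAM, stall_on_use=False), "cycles")
    print("stall-on-use:", run(PROGRAM, stall_on_use=True), "cycles")
```

In this model the two misses overlap almost completely in the stall-on-use case, even though every instruction still issues in program order.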
That 510 backend is not in-order, it's light weight OoO.
That 3-way superscalar CDC6600 style backend allows a younger instruction
to proceed to its next processing stage even though an older instruction
is still executing. That's fine, and it might be possible even to forward
pending function unit results to other function unit inputs,
and as long as writeback happens in-order interrupts will be precise.
But that is a form of bypassing.
That uArch is distinct from a dual or triple InO pipeline because
in those if one pipeline stage stalls, they all stall.
[email protected] (MitchAlsup1) writes:
Furthermore: IO machines are always latency bound, while GBOoO machines
are schedule bound, capable of absorbing L1 cache misses, long cycle
count instructions, ... that significantly harm IO machines.
What does "schedule bound" mean?
I have seen enough cases where a chain of dependent instructions
(whether it is a chain of multiplications, a chain of L1-hitting
loads, or even a chain of integer adds mixed with occasional
L1-hitting loads) determines the performance of an OoO machine, in
particular a wide OoO machine.
If branch mispredictions are low enough, what limits the performance
of an OoO machine is
* either its resources (functional units, rename width, or somesuch),
and I call that "resource bound",
* or a dependence chain is so long (and the rest of the instructions
consume so few resources) that eventually the reorder buffers are
filled with the rest of the instructions or the schedulers are
filled with instructions from the dependence chain. Then the
machine has to wait for an instruction from the dependence chain to
retire (for unclogging the ROB) or to produce a result (for freeing
a scheduler slot). I call that latency-bound or dependence-bound.
The wider the OoO engine, the fewer programs will be resource-bound on
that machine.
Hardware designers use deep ROBs and deep schedulers on
wide OoO engines to reduce the number or impact of dependence-bound
cases, and indeed, with a bigger scheduling window, one may be able to
see more parallelism than with a smaller window.
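The two cases can be put into a back-of-the-envelope bound. This is only a sketch under strong assumptions (perfect branch prediction, an unlimited scheduling window, a single critical dependence chain), and the function name and example numbers are invented for illustration.

```python
def lower_bound_cycles(n_instructions, width, chain_length, chain_latency):
    """Return (cycles, kind) for an idealized OoO machine.

    resource bound:   every instruction needs an issue slot, `width` per cycle
    dependence bound: the longest chain must execute serially
    The larger of the two limits the run time; `kind` names which one.
    """
    resource = -(-n_instructions // width)      # ceiling division
    dependence = chain_length * chain_latency
    if resource >= dependence:
        return resource, "resource"
    return dependence, "dependence"


if __name__ == "__main__":
    # 1000 instructions containing a chain of 100 dependent 3-cycle multiplies.
    for width in (2, 4, 8):
        print(width, lower_bound_cycles(1000, width, 100, 3))
```

With these made-up numbers the 2-wide machine is resource-bound at 500 cycles, while the 4-wide and 8-wide machines are both stuck at the 300 cycles of the dependence chain, so widening beyond 4 buys nothing for this program.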
And at some point there will be a branch misprediction, which acts as
an in-order constraint for the dependence-bound case. In the
resource-bound case, if the machine starts resolving the branch
misprediction before retiring the branch, there are still instructions
waiting for their functional unit, so the misprediction penalty will
be lower than otherwise.
As for in-order machines, for data-parallel stuff like, say, matrix
multiplication, they can also be resource bound, and indeed, these are
the kinds of codes where IA-64 performed particularly well.
- anton
On Wed, 24 Jan 2024 20:03:15 +0000, MitchAlsup1 wrote:
Vector machines fell out of fashion when the length of the
vector register could no longer absorb the latency to memory.
{{Although NEC persisted for longer}}
Hmm.
If the latency to memory is bigger, then having more vector
registers lets you access stuff for a bigger percentage of
the time that is faster than memory.
Just like cache, or regular register files, therefore, one
would expect the utility of vector registers to increase,
not decrease, when memory becomes slower by comparison.
So I'm missing something here.
One possibility is that vector registers are usually used to
facilitate operations between vectors in memory - vectors
that are several times longer than the length of a vector
register. So the speed of memory controls the speed of the
overall calculation - in part. The vector registers multiply
it by a factor of how much work gets done on values once
they're read in - but perhaps if memory gets slow enough,
there's not much benefit over less elaborate local storage.
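One possible way to quantify "absorbing the latency" is Little's law: to keep a memory pipe streaming, latency times bandwidth elements must be in flight, and those need somewhere to land. The sketch below is my reading of the argument, not something either poster spelled out, and the register length and latencies are hypothetical.

```python
def elements_in_flight(latency_cycles, elements_per_cycle):
    """Little's law: outstanding elements needed to sustain the stream."""
    return latency_cycles * elements_per_cycle


def registers_to_cover(latency_cycles, elements_per_cycle, vector_length):
    """Vector registers tied up as load targets just to cover the latency."""
    needed = elements_in_flight(latency_cycles, elements_per_cycle)
    return -(-needed // vector_length)  # ceiling division


if __name__ == "__main__":
    # Hypothetical machine: 64-element vector registers, 1 element per cycle.
    for latency in (16, 100, 500):
        print(latency, "cycles ->",
              registers_to_cover(latency, 1, 64), "registers")
```

With these numbers a 16-cycle memory is hidden by part of one register, while a 500-cycle memory ties up eight 64-element registers before any computation can overlap with the loads.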
John Savard
Stefan Monnier wrote:
- OoO execution is naturally more asynchronous, making it possible to
make decisions about what to do when in a more local way, thus wasting
less energy on costly whole-chip synchronization.
Stefan
In-order serializes when operations start,
OoO synchronizes after they finish.
The latter creates more potential opportunities for asynchronous
concurrency, and this potential propagates through the whole system design.
EricP <[email protected]> writes:
That 510 backend is not in-order, it's light weight OoO.
That 3-way superscalar CDC6600 style backend allows a younger instruction
to proceed to its next processing stage even though an older instruction
is still executing. That's fine, and it might be possible even to forward
pending function unit results to other function unit inputs,
and as long as writeback happens in-order interrupts will be precise.
But that is a form of bypassing.
Not OoO in my book. By your definition anything is OoO that allows
some execution overlap of an architecturally earlier instruction with
an architecturally later instruction. With your definition, all
pipelined CPUs are OoO, including the MIPS R2000 with its delayed
branch, delayed load, and especially the multiply/divide unit.
Also, the 21064 which even allowed to issue two instructions at the
same time, as well as having instructions with more than one cycle of
load-to-use latency; e.g., there could be an FP multiplication
followed by a load followed by an add, and the add would actually
finish using its ALU before the FP multiplication or the load
finishes.
As described above, the A53 would be OoO by your definition, too.
Last, but not least, all IA-64 implementations would be OoO by your
definition.
A definition that classifies everything as OoO and nothing as in-order
is neither helpful nor is it the commonly understood meaning of
"in-order" and OoO. I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes to a
functional unit earlier than an architecturally earlier instruction).
Execution can overlap.
Concerning precise interrupts, that is certainly a problem for CPUs
without reorder buffers; the Alpha architects even put imprecise FP
interrupts and the trapb instruction (IIRC) in the architecture
because of that.
That uArch is distinct from a dual or triple InO pipeline because
in those if one pipeline stage stalls, they all stall.
That's a somewhat different definition. AFAIK the R2000 stalls the
whole (integer) pipeline on a cache miss despite allowing overlap
between instruction executions.
AFAIK microarchitects got rid of this limitation as soon as there were
enough transistors available. The problem with this limitation is
that it makes it pointless to schedule a load further ahead to reduce
the impact of a cache-miss latency, or to use a prefetch instruction,
because either one would stop the whole machine during the cache miss.
A prefetch could actually be counterproductive, but it would
definitely never help.
So this definition may describe some historical designs, but it's not
the difference between in-order and OoO as commonly understood.
- anton
Anton Ertl wrote:
Not OoO in my book. By your definition anything is OoO that allows
some execution overlap of an architecturally earlier instruction with
an architecturally later instruction. With your definition, all
pipelined CPUs are OoO, including the MIPS R2000 with its delayed
branch, delayed load, and especially the multiply/divide unit.
No, not overlap, bypassing. Multiple parallel pipelines is still in-order.
Anton Ertl wrote:[...]
Anyway, as long as the register file is updated in-order then the only
one that matters is the load store queue. While the LSQ allows 2
outstanding cache misses, as long as it finishes each load/store in
order then none of this is visible.
Also, the 21064 which even allowed to issue two instructions at the
same time, as well as having instructions with more than one cycle of
load-to-use latency; e.g., there could be an FP multiplication
followed by a load followed by an add, and the add would actually
finish using its ALU before the FP multiplication or the load
finishes.
As described above, the A53 would be OoO by your definition, too.
21164 was two parallel integer pipelines. I don't know about A53.
That's a somewhat different definition. AFAIK the R2000 stalls the
whole (integer) pipeline on a cache miss despite allowing overlap
between instruction executions.
I thought the R2000 only has one pipeline.
EricP wrote:
No, not overlap, bypassing. Multiple parallel pipelines is still
in-order.
Note: the Mc88100 had multiple parallel pipelines and was not in-order!
An older LD stall would allow younger instructions to complete!
EricP <[email protected]> writes:
Anton Ertl wrote:[...]
Anyway, as long as the register file is updated in-order
This discussion resulted in the unearthing of memories of the things I
read about the microarchitectures of the advanced in-order machines of
last century (and I think that in-order machines in this century tend
to work in the same way). My memory may be unreliable here, but
anyway:
The way things worked was that in those machines, instructions were
issued in-order but the results could be written back out-of-order.
However, each register has a bit that tells whether the register is
up-to-date, or will be updated in the future by a currently in-flight
instruction. This was often called scoreboarding (although Mitch
Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
the CDC 6600 scoreboard was a more sophisticated mechanism; given the
I in MIPS, one could also call it an interlock). So each instruction
checks whether all its source and destination registers are
up-to-date, and if not, it waits until they are (forwarding changes
the notion of "up-to-date" a bit, but I'll skip this here).
With out-of-order completion, how is architectural execution and, in
particular, precise exceptions, ensured? For ordinary execution, it
does not matter for an instruction whether an unrelated register is
not up-to-date, and if the register is mentioned in the instruction,
the instruction and all that follow wait until the register is
up-to-date.
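The "up-to-date" bit scheme described above can be sketched as a small
timing model. This is only an illustration under stated assumptions: the
Instr record, the fixed latencies, and the register names are made up for
the example, forwarding is ignored, and there is no limit on register
write ports. Instructions issue strictly in order, one per cycle, and each
waits until all of its source and destination registers are up-to-date.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dst: str
    srcs: tuple
    latency: int  # assumed fixed number of cycles from issue to writeback


def simulate(program):
    """In-order single issue with one 'pending write' marker per register.

    pending[r] holds the cycle at which register r becomes up-to-date again.
    Results may be written back out of program order.
    """
    pending = {}
    issue_cycle = {}
    writeback_cycle = {}
    cycle = 0
    for ins in program:
        # Wait until every source and the destination register is up-to-date.
        regs = ins.srcs + (ins.dst,)
        cycle = max([pending.get(r, 0) for r in regs] + [cycle])
        issue_cycle[ins.name] = cycle
        writeback_cycle[ins.name] = cycle + ins.latency
        pending[ins.dst] = cycle + ins.latency
        cycle += 1  # the next instruction issues at least one cycle later
    return issue_cycle, writeback_cycle


# Hypothetical sequence: FP multiply, load, independent add, then a
# consumer of the load result.
program = [
    Instr("fmul", "f2", ("f0", "f1"), 4),
    Instr("load", "r3", ("r4",), 3),
    Instr("add", "r5", ("r6", "r7"), 1),
    Instr("use", "r8", ("r3",), 1),
]
issue, writeback = simulate(program)
```

In this model the add writes back before the older fmul and load, while
the instruction that reads the load result (and everything after it)
waits at issue until the register is up-to-date.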
I don't remember how loads and stores were handled, but again, as long
as they were to non-overlapping addresses (and for weak memory
ordering in multiprocessors) one can do quite a bit in parallel
without destroying architectural order.
I also don't remember how flags registers were handled on
architectures that have them, but it needs something cleverer than the
"up-to-date" scheme described above, or there would be lots of stalls
due to write-after-write dependences. I am sure the microarchitects
found something appropriate.
For precise exceptions, I remember discussions about the importance of
knowing early in the instruction that an exception happens; i.e., so
early that the writebacks of architecturally later instructions can be
cancelled. For loads, the exception is known early, when the TLB
lookup has happened; I expect that the whole machine is stalled on a
TLB miss (or, with a software-managed TLB, the exception happens right
there). Alpha has imprecise FP exceptions because the architecture
wanted to allow implementing denormals through trapping, but it takes
several cycles to know whether an FP result is normal or not.
[Cortex-A510]
then the only one
that matters is the load store queue. While the LSQ allows 2 outstanding
cache misses, as long as it finishes each load/store in order then none
of this is visible.
I expect that the A510 uses the mechanism described above, which means
that loads can finish out of order, but none of this is visible
nonetheless.
Also, the 21064 which even allowed to issue two instructions at the
same time, as well as having instructions with more than one cycle of
load-to-use latency; e.g., there could be an FP multiplication
followed by a load followed by an add, and the add would actually
finish using its ALU before the FP multiplication or the load
finishes.
21164 was two parallel integer pipelines. I don't know about A53.
As described above, the A53 would be OoO by your definition, too.
The Cortex-A53 has two ALU ports <https://chipsandcheese.com/2023/05/28/arms-cortex-a53-tiny-but-important/>.
It's interesting to compare the A53 (2012) to the 21164 (1995). Both
have roughly similar execution resources (2 integer (one of which can
be a branch), 2 FP, 1LSU (not sure about that for the 21164)), but the
21164 has a four-wide decoder, while the A53 only has a two-wide
decoder. I guess the cost of decoding all of A64, A32, and especially
T32 caused them to limit the decoding capabilities.
For the A510 ARM expanded that to a three-wide decode, but the A510 is
an A64-only core. ARM also provided a third ALU and an additional
load unit to the A510. Given that an ALU was not that expensive even
in the 21164 timeframe, my guess is that the 21164 architects provided
only two because of register port or forwarding path limitations,
something that the ARM designers apparently have a solution for (more
metal layers?).
That's a somewhat different definition. AFAIK the R2000 stalls the
whole (integer) pipeline on a cache miss despite allowing overlap
between instruction executions.
I thought the R2000 only has one pipeline.
My memories from last century tell me that there was some concept
like "squashing pipeline bubbles" being discussed at the time, i.e.,
that instructions in earlier stages could advance until the first of
them reaches the stalled instruction. Conversely, instructions in
later stages could continue, filling the stages they left with bubbles
(I don't remember this being discussed). But of course none of that
is used in the R2000. The R2000 has a multiply/divide unit that takes
many cycles, and actually with interlocks. I don't know if that
continues working while a cache miss is served; the R2010 FPU
certainly continues working while a cache miss is served.
And then we got the 88100 with three pipelines, and then the 21064
with dual-issue and three pipelines.
- anton
Anton Ertl wrote:
EricP <[email protected]> writes:
Anton Ertl wrote:[...]
Anyway, as long as the register file is updated in-order
This discussion resulted in the unearthing of memories of the things I
read about the microarchitectures of the advanced in-order machines of
last century (and I think that in-order machines in this century tend
to work in the same way). My memory may be unreliable here, but
anyway:
The way things worked was that in those machines, instructions were
issued in-order but the results could be written back out-of-order.
However, each register has a bit that tells whether the register is
up-to-date, or will be updated in the future by a currently in-flight
instruction. This was often called scoreboarding (although Mitch
Alsup and <https://en.wikipedia.org/wiki/Scoreboarding> tell us that
the CDC 6600 scoreboard was a more sophisticated mechanism; given the
I in MIPS, one could also call it an interlock). So each instruction
checks whether all its source and destination registers are
up-to-date, and if not, it waits until they are (forwarding changes
the notion of "up-to-date" a bit, but I'll skip this here).
With out-of-order completion, how is architectural execution and, in
particular, precise exceptions, ensured? For ordinary execution, it
does not matter for an instruction whether an unrelated register is
not up-to-date, and if the register is mentioned in the instruction,
the instruction and all that follow wait until the register is
up-to-date.
I don't remember how loads and stores were handled, but again, as long
as they were to non-overlapping addresses (and for weak memory
ordering in multiprocessors) one can do quite a bit in parallel
without destroying architectural order.
I also don't remember how flags registers were handled on
architectures that have them, but it needs something cleverer than the
"up-to-date" scheme described above, or there would be lots of stalls
due to write-after-write dependences. I am sure the microarchitects
found something appropriate.
For precise exceptions, I remember discussions about the importance of
knowing early in the instruction that an exception happens; i.e., so
early that the writebacks of architecturally later instructions can be
cancelled. For loads, the exception is known early, when the TLB
lookup has happened; I expect that the whole machine is stalled on a
TLB miss (or, with a software-managed TLB, the exception happens right
there). Alpha has imprecise FP exceptions because the architecture
wanted to allow implementing denormals through trapping, but it takes
several cycles to know whether an FP result is normal or not.
That scoreboard allows OoO execution and completion,
and avoids RAW, WAW, and WAR hazards,
but it doesn't write back results in program order.
Exceptions can be made precise by (a) always writing results in-order,
and (b) only recognizing exceptions at Writeback.
To write the results back in order one could attach a sequence counter
to each uOp - a counter with enough bits so that each possible in-flight
uOp in any stage has a unique number plus 1 bit for a wrap flag.
Writeback also has a sequence counter so it knows which uOp is
next to write its register. I would want two register write ports
so it at least has a chance of catching up after a bubble.
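The sequence-counter idea can be sketched roughly as follows. This is a
toy model, not a description of any real core: MAX_IN_FLIGHT, the class
name, and the dictionary register file are assumptions made for the
example. Results may complete in any order, writeback retires them
strictly by sequence number through up to two write ports per cycle, and
an exception is recognized only when the faulting uOp reaches writeback.

```python
MAX_IN_FLIGHT = 8               # assumed maximum number of in-flight uOps
SEQ_MOD = 2 * MAX_IN_FLIGHT     # unique number per uOp plus one wrap bit


class InOrderWriteback:
    def __init__(self, ports=2):
        self.ports = ports      # register write ports usable per cycle
        self.next_seq = 0       # sequence number allowed to write next
        self.done = {}          # seq -> (reg, value, fault), completed only
        self.regfile = {}
        self.fault_seq = None

    def complete(self, seq, reg, value, fault=False):
        """A function unit finished uOp `seq`, possibly out of order."""
        self.done[seq] = (reg, value, fault)

    def cycle(self):
        """Write back up to `ports` results, strictly in sequence order."""
        written = []
        for _ in range(self.ports):
            if self.next_seq not in self.done:
                break           # oldest uOp not finished: younger ones wait
            reg, value, fault = self.done.pop(self.next_seq)
            if fault:
                # Exception recognized at writeback: squash younger results.
                self.fault_seq = self.next_seq
                self.done.clear()
                break
            self.regfile[reg] = value
            written.append(self.next_seq)
            self.next_seq = (self.next_seq + 1) % SEQ_MOD
        return written
```

With two ports, a younger result that finished early can retire in the
same cycle as the older one it was waiting for, which is how writeback
catches up after a bubble.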
On 1/28/24 1:48 PM, EricP wrote:
[snip]
Multiple parallel pipelines is fine but it has to sequence the pipe exits
so the results retire in order for precise exceptions and interrupts.
In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.
(From Computer Architecture: A Quantitative Approach, 3rd Ed.,
Appendix H, "One approach to this problem, used in the MIPS R3010,
is to identify instructions that may cause an exception early in
the instruction cycle. For example, an addition can overflow only
if one of the operands has an exponent of Emax, and so on. This
early check is conservative: It might flag an operation that
doesn’t actually cause an exception. However, if such false
positives are rare, then this technique will have excellent
performance. When an instruction is tagged as being possibly
exceptional, special code in a trap handler can compute it without
destroying any state. Remember that all these problems occur only
when trap handlers are enabled.")
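The conservative early check described in the quote can be illustrated
for IEEE 754 doubles. This is only a sketch: the function names are
invented for the example, operands are assumed to be finite, and it says
nothing about how the R3010 actually implements the test in hardware.

```python
import struct

# Largest biased exponent of a finite IEEE 754 double (Emax).
EMAX_BIASED = 0x7FE


def biased_exponent(x: float) -> int:
    """Return the 11-bit biased exponent field of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF


def may_overflow_on_add(a: float, b: float) -> bool:
    """Conservative filter: flag the add if either exponent equals Emax.

    A False result guarantees no overflow for finite operands; a True
    result only means overflow is possible.
    """
    return (biased_exponent(a) == EMAX_BIASED
            or biased_exponent(b) == EMAX_BIASED)
```

The filter has false positives by design: adding the largest finite
double to its negation is flagged even though the result is zero.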
Not writing results in order would require suppressing earlier
writes to the same register (a singular writeback stage design
would also have this). With simple in-order issue, this would
(I think) only occur when the result was never used (e.g., a
slow operation started before a conditional branch that
determines its use, or in a "free" delay slot, or if two results
are produced and one is unused such as unused flag settings).
Out-of-order writeback also presents register write port hazards;
more write ports might be needed than available.
It _might_ be practical to allow store instructions that use a
delayed result to issue before the result is available — similar
to the classic store-address-generation/store-data split for
out-of-order execution. A store buffer entry could be marked as
not having valid data (similar to ready bits for registers) and
the slow operation could "forward" to the store buffer.
Multiply-add instructions can also conceivably exploit delayed
availability of the addend. There might also be some cases where the
necessary latency is data dependent and, knowing that the computation
can be done faster, the operations might be "issued" early as if they
had the normal/worst-case latency. That communication complexity
seems unlikely to be worthwhile, but it is conceivably possible.
Since low-end out-of-order is not extraordinarily complex or
resource-intensive, heroic efforts to provide slightly less
constrained but still in-order execution seem rather questionable.
On 1/24/24 2:47 AM, Anton Ertl wrote:
"Paul A. Clayton" <[email protected]> writes:
When I looked at the pipeline design presented in the Arm Cortex-
A55 Software Optimization Guide, I was surprised by the design.
Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
(ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
DIV/SWRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
stage before an ALU stage (clearly for AArch32).
The separation of MAC and DIV is mildly questionable — from my
very amateur perspective — not supporting dual issue of a MAC-DIV
pair seems very unlikely to hurt performance but the cost may be
trivial.
The Chips and Cheese article also indicated that branches are only
resolved at writeback, two cycles later than if branch direction
was resolved in the first execution stage. The difference between
a six stage misprediction penalty and an eight stage one is not
huge, but it seems to indicate a difference in focus. With
condition code based branches and in-order execution, I would have
been tempted to try resolving such branches by the end of the
issue stage. (MIPS R2000 resolved register-compare branches at the
end of decode, so resolving branches based on a condition code,
if the data is available, in the cycle after decode does not seem
incredibly difficult. It may be that condition codes are generally
not set early enough to justify such effort, but it seems
obviously "possible".)
I would have *guessed* that an AGLU (a functional unit providing
address generation and "simple" ALU functions, like AMD's Bobcat?)
would be more area and power efficient than having separate
pipelines, at least for store address generation.
I may be misinterpreting/misunderstanding the information. While I
believe I am not entirely incompetent in general
microarchitectural design, it is difficult to believe that any
professional (much less a team of professionals) would do worse
than I would. Other tradeoffs (like design reuse) may also justify
design choices that seem worse.
They had to choose the L1 size. Cortex-A55 supports L1 sizes of 16
KiB, 32 KiB, and 64 KiB. With a fixed three-cycle latency (and
other pipeline stages fixed in their work), the size of the L1
caches will affect not only cycle time. If the pipeline diagram is
interpreted extremely literally, address generation takes one
cycle, data cache output takes one cycle, and align and extend
takes one cycle. If cache access itself takes one cycle and if
that latency increases by sqrt(2) with each capacity doubling,
then implementations with the largest *either* data or instruction
cache would have twice as much time in a cycle as implementations
with both L1s being 16 KiB *if* the pipeline was designed for the
smallest cache.
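The arithmetic behind that estimate can be written out. The sqrt(2)
factor per capacity doubling is the assumption stated above, not a
measured property of any process, and the function name is invented for
the example.

```python
import math


def relative_access_time(size_kib, base_kib=16):
    """Cache access time relative to the base size, assuming the latency
    grows by a factor of sqrt(2) with each doubling of capacity."""
    doublings = math.log2(size_kib / base_kib)
    return math.sqrt(2) ** doublings


# Relative access time for the three supported L1 sizes.
for size in (16, 32, 64):
    print(size, "KiB:", round(relative_access_time(size), 3))
```

Going from 16 KiB to 64 KiB is two doublings, so the access stage needs
about sqrt(2) * sqrt(2) = 2 times as long, which is where the "twice as
much time in a cycle" figure comes from.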
(I would **GUESS** that ARM designed the pipeline for 32 KiB
caches and smaller caches mostly mean unused time within the cache
access cycle and larger caches mostly mean unused time within all
the other stages. The time to complete a certain amount of logical
operation can be adjusted, e.g., using a faster adder, but not
shifting the clock boundaries constrains such changes as not all
chunks of logic can be made faster — intentional clock skew might
allow borrowing time — and synthesized designs might not get all
the possible changes.)
According to the AnandTech article, Samsung chose not to implement
an L2 for the A55 cores. Since accessing the L3 means crossing a
clocking domain, this would seem to have a significant impact on
performance for workloads like SPEC and, I suspect, a noticeable
impact on energy-efficiency. If this choice also led to using 64
KiB L1 caches **and if** ARM optimized the pipeline for 32 KiB
caches, this might also have noticeably impacted performance and
energy-efficiency.
(For SPEC, I would guess that even the 256 KiB maximum
configuration L2 size for A55 would have a significant performance
impact. SPEC2006 used by AnandTech might be friendlier to modest
L2 size than SPEC2017. If the software is "tuned" for workstation
hardware of five years before the SPEC benchmark, 2019 smart
phones might not be that far from 2001 workstations in terms of L2
sizes.)
If my above guess that a 64 KiB L1 was used is right and this impacts
frequency, voltage and frequency scaling may have been affected.
(I seem to recall reading that caches have poorer voltage-
frequency scaling; that *might* incline a larger L1 cache to
further hurt energy efficiency if a single voltage is used for the
whole core.)
With respect to sticking with in-order, there also seems to be a
tendency to go "all in" when switching to out-of-order, i.e., the
initial out-of-order design seems to be relatively "beefy" in its
out-of-order resources. This may result from having delayed the
transition well beyond where performance or efficiency estimates
would have justified the change, or perhaps from the crossover being a
large enough region that by the time a change is fully justified the
out-of-order design would be relatively beefy.
Perhaps mildly out-of-order designs (say a little more than the
PowerPC 750) are not actually useful (other than as a starting
point for understanding out-of-order design). I do not understand
why such an intermediate design (between in-order and 30+
scheduling window out-of-order) is not useful. It may be that
going from say 10 to 30 scheduler entries gives so much benefit
for relatively little extra cost (and no design is so precisely
area constrained — even doubling core size would not mean pushing
L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
emotional investment in intermediate and mixed designs.
And now you write that ARM did not
design it for power efficiency. If you are right, that supports the
position that in-order is uncompetitive not just wrt performance, but
also perf/W as soon as there are relatively low performance
requirements.
If ARM designed A55 for power efficiency (at that performance
level) over all other concerns, the L1 caches would be fixed size.
Users of ARM designs are obviously willing to sacrifice some power
efficiency for the benefit from flexible L1 size. With different
functionality differing in timing and energy costs with different
processes, energy-efficiency at all costs would seem to lead to
different designs for different processes. Presumably this is not
cost effective.
The memory system, on-chip network, and such would also affect the
energy efficiency. Exynos9820's memory system might _reasonably_
be optimized for high power/high performance use; that would tend
to hurt the efficiency of wimpy cores.
What scenario do you imagine where one would want these in-order
cores? ARM's niche for them is the little cores in a big.LITTLE
design; that is necessarily coupled with a memory system with a high
bandwidth. There are also SoCs with only A55 cores (no BIG ones) like
the RK3566, but they are only bought because of the price, not because
of their power-efficiency.
For something like a smart phone, one or two small cores might be
useful for background activity, tasks whose latency (within a
broad range) is not related to system responsiveness for the user.
For a server expected to run embarrassingly parallel workloads, if
a wimpy core provides sufficient responsiveness, I would expect
most of the cores (possibly even all of the cores) to be wimpy.
There might not be many workloads with such characteristics;
although fundamental network latency has not improved that much
over the last decade, bandwidth has increased and server-side
processing complexity has increased. Even splitting a request
across multiple threads can leave wimpy cores less useful than one
might expect because work will not be perfectly distributed and
tail latency increases.
Paul A. Clayton wrote:
On 1/28/24 1:48 PM, EricP wrote:
[snip]
Multiple parallel pipelines is fine but it has to sequence the pipe exits
so the results retire in order for precise exceptions and interrupts.
In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.
Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?
As Mitch has pointed out many times, uOps with exceptions might look
ahead to see if all older uOps that might throw exceptions have executed
and did not indicate an exception. However I believe that exceptions are
exceptional (unusual) and find the extra logic needed to do this to be
not justified for the benefits of early prefetching of an exception handler.
My only exception handler that is triggered with any regularity is
page fault (assuming a hardware table walker so no TLB miss exceptions),
and it typically invokes a handler with many thousands of instructions
so prefetching that code a few clocks earlier won't make any difference.
(From Computer Architecture: A Quantitative Approach, 3rd Ed.,
Appendix H, "One approach to this problem, used in the MIPS R3010,
is to identify instructions that may cause an exception early in
the instruction cycle. For example, an addition can overflow only
if one of the operands has an exponent of Emax, and so on. This
early check is conservative: It might flag an operation that
doesn’t actually cause an exception. However, if such false
positives are rare, then this technique will have excellent
performance. When an instruction is tagged as being possibly
exceptional, special code in a trap handler can compute it without
destroying any state. Remember that all these problems occur only
when trap handlers are enabled.")
Ok but their problem was they used the exception mechanism for Usuals,
TLB misses and in this case floating point fix-ups. And a consequence of
the exception mechanism is a pipeline drain, which doesn't matter if it
only happens rarely but does if it happens often.
This was in the early RISC days when they used traps for all kinds of
normal management, misaligned memory accesses or Sparc register windows.
And they all suffered performance problems.
So rather than fix the actual problems by adding in a HW table walker
and HW float fix-ups, it sounds like they added a complicated mechanism
to sort-of-almost-but-not-quite-multi-threaded to execute the trap
handler and avoid the pipeline drain. I had the same idea for Alpha's
software TLB miss handler, which sapped up to 25% of performance, but
decided that software managed TLBs are a dead end and a HW table
walker was best.
Moral of the story: don't use the exception mechanism for usuals
and then complain about the performance.
On 1/25/24 10:22 AM, Anton Ertl wrote:
[snip]
I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes to a
functional unit earlier than an architecturally earlier instruction).
Execution can overlap.
What about a skewed pipeline? A simple skewed pipeline that
statically assigned operations to a pipeline-stage/execution unit
has been called in-order (in what I have read). A "second-chance"
pipeline (where many operations can dynamically choose the
pipeline stage based on operand availability) involves dynamic
scheduling (so would seem to fall into out-of-order), but
counterflow pipelines ("Counterflow Pipeline Processor
Architecture", Robert F. Sproull et al., 1994), which are more
extreme in some ways than pipelines that have two stages in which
operations can start, are stated to have "No overtaking.
Instructions must stay in program order in the instruction
pipeline.", which sounds "in-order" (and the paper was written by
people working at Sun Microsystems).
(I thought counterflow pipelines were weird. Simplifying
communication makes sense, but ...)
I get the impression that early PowerPC out-of-order execution
implementations were really very similar to using the forwarding
network for out-of-order completion (with in-order writeback). If
I recall correctly, renaming was done by appending a version to
the architectural register name and operands would be captured as
soon as they were available rather than passing along the pipeline
with forwarding until the writeback stage.
Paul A. Clayton wrote:
On 1/25/24 10:22 AM, Anton Ertl wrote:
[snip]
I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes
to a functional unit earlier than an architecturally earlier
instruction). Execution can overlap.
What about a skewed pipeline? A simple skewed pipeline that
statically assigned operations to a pipeline-stage/execution unit
has been called in-order (in what I have read). A "second-chance"
pipeline (where many operations can dynamically choose the
pipeline stage based on operand availability) involves dynamic
scheduling (so would seem to fall into out-of-order), but
counterflow pipelines ("Counterflow Pipeline Processor
Architecture", Robert F. Sproull et al., 1994), which are more
extreme in some ways than pipelines that have two stages in which
operations can start, are stated to have "No overtaking.
Instructions must stay in program order in the instruction
pipeline.", which sounds "in-order" (and the paper was written by
people working at Sun Microsystems).
(I thought counterflow pipelines were weird. Simplifying
communication makes sense, but ...)
I get the impression that early PowerPC out-of-order execution
implementations were really very similar to using the forwarding
network for out-of-order completion (with in-order writeback). If
I recall correctly, renaming was done by appending a version to
the architectural register name and operands would be captured as
soon as they were available rather than passing along the pipeline
with forwarding until the writeback stage.
This sounds more like Mc 88110 rather than PPC 620.
PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
die area. Other things may have been jettisoned at this shrink of
design point. The 620 was originally targeted to be equal to Mc 88120
which was a 6-wide GBOoO machine full Tomasulo with precise
exceptions and 4 external busses named {Data Out, Data In, Address
Out, Address In}
Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops
Smart externals could use Data In to send the CPU data before it knew
it needed it. That data could be code or data.
EricP wrote:
My only exception handler that is triggered with any regularity is
page fault (assuming a hardware table walker so no TLB miss exceptions),
and it typically invokes a handler with many thousands of instructions
so prefetching that code a few clocks earlier won't make any difference.
If you use it often enough it will still be in your cache when you next
need it. {I don't remember exactly who told me this, but it was one of
the original MIPS (the company not Stanford) guys}; so you don't need to prefetch it.
On 2/26/24 10:48 AM, EricP wrote:
Paul A. Clayton wrote:
On 1/28/24 1:48 PM, EricP wrote:
[snip]
Multiple parallel pipelines is fine but it has to sequence the
pipe exits
so the results retire in order for precise exceptions and
interrupts.
In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.
Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?
The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.
This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
I suspect general out-of-order retirement would not be worthwhile
with precise exceptions; it just sounds complex. My comment was
mainly to point out that such was possible not that it was wise.
[snip]
Moral of the story: don't use the exception mechanism for usuals
and then complain about the performance.
☺
Paul A. Clayton wrote:
On 2/26/24 10:48 AM, EricP wrote:
Paul A. Clayton wrote:
On 1/28/24 1:48 PM, EricP wrote:
[snip]
Multiple parallel pipelines is fine but it has to sequence the
pipe exits
so the results retire in order for precise exceptions and
interrupts.
In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.
Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?
The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.
This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
On 2/26/24 10:48 AM, EricP wrote:
Paul A. Clayton wrote:
On 1/28/24 1:48 PM, EricP wrote:
[snip]
Multiple parallel pipelines is fine but it has to sequence the
pipe exits
so the results retire in order for precise exceptions and
interrupts.
In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.
Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?
The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.
This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
The LLC (or memory controller) can optionally support an interrupt
to management software to indicate that an uncorrected fault
occurred; that would, of course, be asynchronous and occur long after
the store had retired.
[email protected] (MitchAlsup1) writes:
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
The LLC (or memory controller) can optionally support an interrupt
to management software to indicate that an uncorrected fault
occurred; that would, of course, be asynchronous and occur long after
the store had retired.
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
On 2/26/24 10:48 AM, EricP wrote:
Paul A. Clayton wrote:
On 1/28/24 1:48 PM, EricP wrote:
[snip]
Multiple parallel pipelines is fine but it has to sequence the pipe
exits
so the results retire in order for precise exceptions and interrupts.
In-order retire is not strictly required for precise exceptions
and certainly is not needed for interrupts. If the exception's
presence is determined before writeback of results from later
instructions, these writebacks can be prevented. One could
alternatively use a conservative filter of exception conditions
to stall writeback of later results (and stall those pipelines)
until it is known whether the exception occurs.
Interrupts have to be restartable so in-order retire, where everything
older than the interrupt RIP is executed and retired and everything
after that RIP is not, is simplest and cheapest to implement.
Yes you could make it more complicated, but why?
The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.
This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
I suspect general out-of-order retirement would not be worthwhile
with precise exceptions; it just sounds complex. My comment was
mainly to point out that such was possible not that it was wise.
MitchAlsup1 wrote:
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
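The partial-overwrite case above can be sketched in code. This is only a toy model under stated assumptions: it uses a textbook extended Hamming (72,64) layout rather than any particular vendor's SECDED code, it picks "valid check bits with two of them flipped" as the deliberately invalid marker (any double-bit flip reads back as uncorrectable in a SECDED code), and the names `encode`, `decode`, `store` and `POISON_FLIP` are made up for illustration. Stores that straddle two 8-byte words are not modelled.

```python
CHECK_POS = [1, 2, 4, 8, 16, 32, 64]   # Hamming check-bit positions; bit 0 is overall parity
DATA_POS = [p for p in range(1, 72) if p not in CHECK_POS]   # 64 data-bit positions
POISON_FLIP = 0b110                    # assumption: flip check bits 1 and 2 to mark poison


def syndrome(word):
    """XOR of the positions of all set bits in positions 1..71."""
    s = 0
    for p in range(1, 72):
        if (word >> p) & 1:
            s ^= p
    return s


def extract(word):
    """Pull the 64 data bits back out of a 72-bit codeword."""
    return sum(((word >> p) & 1) << i for i, p in enumerate(DATA_POS))


def encode(data):
    """Build a valid 72-bit SECDED codeword for a 64-bit value."""
    word = sum(((data >> i) & 1) << p for i, p in enumerate(DATA_POS))
    s = syndrome(word)
    for k, p in enumerate(CHECK_POS):
        word |= ((s >> k) & 1) << p        # check bits force the syndrome to zero
    word |= bin(word).count("1") & 1       # overall parity makes the popcount even
    return word


def decode(word):
    """Return (status, data) where status is ok / corrected / uncorrectable."""
    s = syndrome(word)
    odd = bin(word).count("1") & 1
    if s == 0 and not odd:
        return "ok", extract(word)
    if odd and s <= 71:                    # single-bit error at position s (0 = parity bit)
        return "corrected", extract(word ^ (1 << s))
    return "uncorrectable", extract(word)  # double error: data cannot be trusted


def store(word, offset, new_bytes):
    """Store new_bytes (little-endian) at byte offset within one 8-byte word."""
    status, data = decode(word)
    mask = ((1 << (8 * len(new_bytes))) - 1) << (8 * offset)
    merged = (data & ~mask) | (int.from_bytes(new_bytes, "little") << (8 * offset))
    fresh = encode(merged)
    if status == "uncorrectable" and len(new_bytes) < 8:
        # Partial overwrite of a bad word: we cannot tell whether the bad bits
        # were replaced, so write back an intentionally invalid code.
        return fresh ^ POISON_FLIP
    return fresh                           # full overwrite replaces every suspect bit
```

With this model a full 8-byte store over a double-error word decodes as "ok" afterwards, while a 1-byte store into the same word still decodes as "uncorrectable", which is the behaviour argued for above.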
EricP wrote:
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
MitchAlsup1 wrote:
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
Paul A. Clayton wrote:You missed
On 2/26/24 10:48 AM, EricP wrote:
Paul A. Clayton wrote:
On 1/28/24 1:48 PM, EricP wrote:
[snip]
The above described method still provides precise exceptions. The
absence of an earlier exception is required to allow such out-of-
order retirement.
Yes, early OoO retire with precise exception is possible.
The criteria would seem to be that:
- all older instructions that might generate an exception must have
executed without detecting an exception
- plus all older loads and stores translated their virtual addresses
(loads don't need to have completed execution, and stores will not have)
- plus all older conditional branches have executed without mispredicting.
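The three criteria above can be written down as a predicate over the in-flight window. This is just a sketch: the `Uop` record and its field names are invented for illustration, and it treats a load or store as "safe" once its address has translated without a fault, as the list suggests.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    # All field names are hypothetical, chosen only to mirror the list above.
    executed: bool = False        # has produced its result / outcome
    exception: bool = False       # an exception has been detected
    can_fault: bool = False       # non-memory op that might raise (e.g. divide)
    is_mem: bool = False          # load or store
    translated: bool = False      # virtual address translated without fault
    is_cond_branch: bool = False
    mispredicted: bool = False


def blocks_younger_retire(u):
    """True if this older uop could still squash anything younger than it."""
    if u.exception or u.mispredicted:
        return True
    if u.is_mem:
        return not u.translated           # loads/stores need only have translated
    if u.can_fault or u.is_cond_branch:
        return not u.executed             # must have executed cleanly
    return False


def can_retire_early(window, i):
    """window is in program order, oldest first; i indexes the candidate."""
    cand = window[i]
    if not cand.executed or cand.exception:
        return False
    return not any(blocks_younger_retire(u) for u in window[:i])
```

In software this is a simple scan. In hardware every one of those checks lives in a different structure and has to be evaluated in parallel across the whole window, which is where the cost comes from.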
My concern is that the circuit for doing this could be pretty complicated.
Many of the pieces that have to be checked are scattered around the core.
Also many of the states are in circular buffers so determining "older" starts
getting slightly hairy (the Load Store Queue has a similar problem for
disambiguation determining if all older loads and stores have "resolved").
And all this has to run in parallel so it takes less than 1 clock.
The motivation for early OoO retire is usually early recycling of some
resources, usually physical registers. However note that you can't early
recycle some resources like entries in circular buffers, such as the
Instruction Queue, ROB/Future-File, LSQ, Branch Queue.
So the question I have is: is it really worth it?
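To make the "older" problem in circular buffers concrete, here is one common way to express it: compare each index by its distance from the oldest live entry. The function names and the `oldest` parameter are illustrative only; real designs often use an extra wrap bit on the pointers instead of a modulo.

```python
def is_older(a, b, oldest, size):
    """True if slot a was allocated before slot b in a circular buffer.

    oldest is the index of the oldest live entry; both a and b are assumed
    to be live slots, so their distance from oldest is unambiguous.
    """
    return (a - oldest) % size < (b - oldest) % size


def older_entries(idx, oldest, size):
    """Yield the slot indices of every live entry older than idx, oldest first."""
    i = oldest
    while i != idx:
        yield i
        i = (i + 1) % size
```

For example, with `size=8` and `oldest=5`, slot 6 is older than slot 1 even though 6 > 1, because the buffer has wrapped.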
This also means that handling of an asynchronous event might have
to be delayed (if one did not want to have two threads active)
until all instructions before the latest-in-program-order retired
instruction have retired.
I define errors as a whole different category from exceptions and
interrupts, and explicitly model dependent, and each error has its
own characteristics.
EricP <[email protected]> writes:
EricP wrote:
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.
EricP wrote:
MitchAlsup1 wrote:
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC fail.
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
EricP wrote:
EricP wrote:
MitchAlsup1 wrote:
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
For memory reads, the late failure generated by an uncorrectable
ECC error would probably have to be handled differently or there
would probably be little opportunity to exploit out-of-order
retirement. It might not be entirely unreasonable to treat such as
a fatal thread error that is asynchronous.
What about for memory stores where the ECC check on the delivered
data fails ?? This seems to be just as fatal as a LD with an ECC
fail.
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
The line was displaced from an L1/L2 cache and its DRAM landing spot is
not in DRAM ?? but over on some disk/SSD ?!?
How (the frick) did it get into L1/L2 if it was not in DRAM ?? and thus
not on disk (as its only access point). ????
Scott Lurndal wrote:
EricP <[email protected]> writes:
EricP wrote:
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.
And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.
The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.
EricP <[email protected]> writes:
Scott Lurndal wrote:
EricP <[email protected]> writes:
EricP wrote:
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.
And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.
The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.
No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.
Scott Lurndal wrote:
EricP <[email protected]> writes:
Scott Lurndal wrote:
EricP <[email protected]> writes:
EricP wrote:
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.
And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.
The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.
No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.
Perhaps but tossing a whole block from an IO expands the size of
the problem by a factor of 1000's.
If that was one byte wrong in a text file then I think most people
would want it written, as opposed to tossing out their work.
EricP wrote:
My concern is that the circuit for doing this could be pretty
complicated.
Essentially equal in complexity to an IO retirement µArchitecture.
Many of the pieces that have to be checked are scattered around the core.
Also many of states are in circular buffers so determining "older" starts
getting slightly hairy (the Load Store Queue has a similar problem for
disambiguation determining if all older loads and stores have
"resolved").
And all this has to run in parallel so it takes less than 1 clock.
MitchAlsup1 wrote:
EricP wrote:
My concern is that the circuit for doing this could be pretty
complicated.
Essentially equal in complexity to an IO retirement µArchitecture.
For my uArch Retire should be quite straight forward to implement.
Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:
- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch
register
- increments IQ tail pointer, freeing the entry.
If the entry's Exception flag is set then it is also straight forward,
with a flush of all in-flight instructions, bulk copy the Committed-RAT
into the Future-RAT to restore renaming, and set a jump address in Fetch.
(Any in-flight cache miss operations are allowed to complete.)
This is also relatively straight forward to do multiple retires per clock,
each mostly costs an extra read port on IQ and extra write ports on the
Committed-RAT and the Physical Register Status register.
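As a rough software model of the in-order Retire steps just described (one retire per call), something like the following would do. It is only a sketch: the class, field and method names are invented, the "Architecture Reg flag" is modelled as membership in a set, and in-flight cache misses are ignored.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class IQEntry:
    done: bool
    exception: bool = False
    length: int = 4                  # instruction length in bytes
    taken_branch: bool = False
    arch_reg: Optional[int] = None   # destination architectural register, if any
    new_preg: Optional[int] = None   # physical register renamed for it


class RetireModel:
    def __init__(self, rip, rat):
        self.iq = deque()            # left end is the tail (oldest) entry
        self.branch_q = deque()      # targets of taken branches, oldest first
        self.committed_rip = rip
        self.committed_rat = dict(rat)
        self.future_rat = dict(rat)
        self.arch_pregs = set(rat.values())  # pregs holding architectural state
        self.fetch_redirect = None

    def retire_one(self, handler_addr):
        if not self.iq or not self.iq[0].done:
            return "stall"
        e = self.iq[0]
        if e.exception:
            # Flush everything in flight and restore renaming from committed state.
            self.iq.clear()
            self.branch_q.clear()
            self.future_rat = dict(self.committed_rat)
            self.fetch_redirect = handler_addr
            return "exception"
        if e.taken_branch:
            self.committed_rip = self.branch_q.popleft()
        else:
            self.committed_rip += e.length
        if e.arch_reg is not None:
            old = self.committed_rat[e.arch_reg]
            self.arch_pregs.discard(old)      # old dest preg is freed
            self.arch_pregs.add(e.new_preg)
            self.committed_rat[e.arch_reg] = e.new_preg
        self.iq.popleft()                     # advance the tail, freeing the entry
        return "retired"
```

Each call only ever looks at the oldest entry, which is why this form stays cheap. Retiring several per clock would amount to repeating the same steps on the next few entries.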
Many of the pieces that have to be checked are scattered around the core.
Also many of states are in circular buffers so determining "older" starts
getting slightly hairy (the Load Store Queue has a similar problem for
disambiguation determining if all older loads and stores have
"resolved").
And all this has to run in parallel so it takes less than 1 clock.
Adding the structures to support OoO Retire would greatly complicate this.
EricP <[email protected]> writes:
No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.
Perhaps but tossing a whole block from an IO expands the size of
the problem by a factor of 1000's.
Not having the data (or at least the data in the I/O block being
written (512/4k) given non-sequential underlying disk sector allocations)
is _far far_ better than having corrupt data. The former can be
repaired. The latter may not even be detected.
If that was one byte wrong in a text file then I think most people
would want it written, as opposed to tossing out their work.
I really doubt that any programmer would prefer bad data to no data.
Scott Lurndal wrote:
EricP <[email protected]> writes:
Scott Lurndal wrote:
EricP <[email protected]> writes:
EricP wrote:
As most stores are posted, the data stored needs to be 'poisoned'
so that any subsequent use of the data (e.g. a load) will report
a fault.
Storing the bad <arriving> ECC should take care of that.
I don't think that will always work. Assuming we are using a
72-bit SECDED ECC and a cache line is read with a double error,
then if the ST overwrites an 8 byte aligned value it will generate
a new valid ECC and correct the error.
However if the ST is less than 8 bytes or misaligned, it won't know which
of the 8 bytes was invalid so can't tell if the bad data was overwritten.
If it keeps the old ECC as an error indicator, that code might actually be
correct for the new data. If it generates a new valid ECC then it loses
track of the fact that the data MAY be invalid.
In this second case of partial overwrite I think it has to generate a
new invalid ECC for the new 8 byte data indicating a double error.
When the modified line is written back to DRAM it retains the
double error ECC.
And if the page is out swapped and recycled we lose track of
the error indicator on that 8-byte value.
If it was properly poisoned, the access by the DMA engine will
cause a RAS error to be signalled and the DMA aborted.
And the OS does what with the page and its data?
This could happen long after the owner process terminated,
maybe part of a lazy file cache write back.
The only option for the OS might be to log the error and just reset
the ECC to valid for the current data so the IO can complete.
No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.
Perhaps but tossing a whole block from an IO expands the size of
the problem by a factor of 1000's.
If that was one byte wrong in a text file then I think most people
would want it written, as opposed to tossing out their work.
If that was one byte wrong in a file system meta data block then
there is no good answer. Many of the meta data blocks are in linked lists
or B+ trees so not writing the block could corrupt a whole file system,
and writing the block could also cause corruption but hopefully less likely.
So you are damned if you do fix the ECC and write the block,
and damned if you don't. But do seems less damning.
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
My concern is that the circuit for doing this could be pretty
complicated.
Essentially equal in complexity to an IO retirement µArchitecture.
For my uArch Retire should be quite straight forward to implement.
Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:
- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch
register
- increments IQ tail pointer, freeing the entry.
All of these would have been completed when the instruction comes out of
its function unit, and then retire multiplexes this data onto the
current retired instruction state. {2-gates not 13-gates}
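The retire steps listed above can be sketched as a small software model. This is only an illustration of the described sequence, not EricP's actual design: the class and field names are made up, every instruction is assumed to have one destination register, and exception handling is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class IQEntry:
    done: bool
    exception: bool
    length: int         # instruction length in bytes
    taken_branch: bool
    arch_reg: int       # architectural destination register
    new_preg: int       # physical register holding the new value


class RetireModel:
    def __init__(self, num_arch, num_phys):
        self.rip = 0                                  # committed RIP
        self.committed_rat = list(range(num_arch))    # arch -> phys map
        self.is_arch = [p < num_arch for p in range(num_phys)]
        self.iq = deque()            # oldest entry (the "tail") at the left
        self.branch_queue = deque()  # targets of resolved taken branches

    def retire_one(self):
        """Retire the oldest entry if it is Done and has no exception."""
        if not self.iq:
            return False
        e = self.iq[0]
        if not e.done or e.exception:
            return False
        if e.taken_branch:
            self.rip = self.branch_queue.popleft()
        else:
            self.rip += e.length
        old_preg = self.committed_rat[e.arch_reg]
        self.is_arch[old_preg] = False      # also frees the old register
        self.is_arch[e.new_preg] = True
        self.committed_rat[e.arch_reg] = e.new_preg
        self.iq.popleft()                   # advance the IQ tail pointer
        return True
```

A real implementation would do all of this in parallel within a cycle; the sequential code only shows which state changes at commit.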
EricP wrote:
No, the I/O must be aborted. RAS 101 - do not propagate
poisoned data.
Consider a page being written out and the last cache line in the page
has a bad ECC. What command does one send the disk to indicate "forget
all that data I just sent you" ??
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
My concern is that the circuit for doing this could be pretty
complicated.
Essentially equal in complexity to an IO retirement µArchitecture.
For my uArch Retire should be quite straightforward to implement.
Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:
- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch
register
- increments IQ tail pointer, freeing the entry.
All of these would have been completed when the instruction comes out of
its function unit, and then retire multiplexes this data onto the
current retired instruction state. {2-gates not 13-gates}
IIRC the Alpha 21064 was 16 gates per stage so if my Retire unit
could hit 13 gates I'd be extremely chuffed (delighted).
I would likely be targeting 20 gates per stage anyway.
On 1/22/24 9:44 AM, Paul A. Clayton wrote:
[snip]
Obviously an extremely biased workload like the data analysis
workloads targeted by Intel's research chip would probably show
A55 in a better light (though A55 would likely be very inefficient
compared to the research design, I think it used 4-way threaded
in-order cores with limited cache and narrow memory channels [to avoid
64-byte accesses to access 64-bits or less of data]), but
that would not be "fair".
I (finally) found a reference to the Intel research chip. https://ieeexplore.ieee.org/document/10188866
"The Intel Programmable and Integrated Unified Memory Architecture
Graph Analytics Processor" (Sriram Aananthakrishnan et al., 2023)
A PDF of the paper appears to be available at https://heirman.net/papers/aananthakrishnan2023piuma.pdf
On 2/25/24 5:22 PM, MitchAlsup1 wrote:
Paul A. Clayton wrote:
[snip]
When I looked at the pipeline design presented in the Arm Cortex-
A55 Software Optimization Guide, I was surprised by the design.
Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
(ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
DIV/SWRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
stage before an ALU stage (clearly for AArch32).
Almost like an Mc88100 which had 5 pipelines.
I think I have an incorrect conception of data communication
(forwarding and register-to-functional-unit). I also seem to be
conflating somewhat issue port and functional unit. I am picturing
forwarding from nine locations to nine locations and the remaining
eight locations to eight locations (counting functional unit as a
single target location even though a functional unit may have three
functionally different input operands).
I am used to functionality being merged; e.g., the multiplier also
having a general ALU. Merged functional units would still need to
route the operands to the appropriate functionality, but selecting
the operation path for two operands *seems* simpler than selecting
distinct operands and separate functional unit independently. This
might also be a nomenclature issue.
If one can only begin two operations in a cycle, the generality of
having nine potential paths seems wasteful to me. Having separate
paths for FP/Neon and GPR-using operations makes sense because of
the different register sets (as well as latency/efficiency-
optimized functional units vs. SIMD-optimized functional units;
sharing execution hardware is tempting but there are tradeoffs).
With nine potential issue ports, it seems strange to me that width
is strictly capped at two. Even though AArch64 does not have My
66000's Virtual Vector Method to exploit normally underutilized
resources, there would be cases where an extra instruction or two could
execute in parallel without increasing resources significantly. As
an outsider, I can only assume that any benefit did not justify
the costs in hardware and design effort. (With in-order execution,
even a nearly free [hardware] increasing of width may not result
in improved performance or efficiency.)
The separation of MAC and DIV is mildly questionable — from my
very amateur perspective — not supporting dual issue of a MAC-DIV
pair seems very unlikely to hurt performance but the cost may be
trivial.
Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;
I also assume the other operations are usually available for
parallel execution (though this depends somewhat on compiler
optimization for the microarchitecture), so execution of a
multiply and a divide in parallel is probably uncommon.
The FP/Neon section has these operations merged into a functional
unit; I guess (I am not motivated to look this up) that this is
because FP divide/sqrt use the multiplier while integer divide
does not.
The Chips and Cheese article also indicated that branches are only
resolved at writeback, two cycles later than if branch direction
was resolved in the first execution stage. The difference between
a six stage misprediction penalty and an eight stage one is not
huge, but it seems to indicate a difference in focus. With
In an 8 stage pipeline, the 2 cycles of added delay should hurt by
~5%-7%
5% performance loss sounds expensive for something that *seems*
not terribly expensive to fix.
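As a back-of-the-envelope check on how two extra penalty cycles turn into a percentage, here is the usual CPI model: base CPI plus branch fraction times misprediction rate times penalty. All of the input values are illustrative assumptions, not A55 measurements, and the function names are made up.

```python
def cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Cycles per instruction including branch misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def slowdown(base_cpi, branch_fraction, mispredict_rate,
             short_penalty, long_penalty):
    """Fractional run-time increase going from short to long penalty."""
    c_short = cpi(base_cpi, branch_fraction, mispredict_rate, short_penalty)
    c_long = cpi(base_cpi, branch_fraction, mispredict_rate, long_penalty)
    return c_long / c_short - 1.0


# Assumed inputs: base CPI 1.0, 20% branches, 10% of them mispredicted.
loss = slowdown(1.0, 0.2, 0.1, short_penalty=6, long_penalty=8)
print(f"estimated loss: {loss:.1%}")
```

With these made-up inputs the loss comes out near 3.6%; a lower base CPI or a higher misprediction rate pushes it toward the 5%-7% range quoted above, so the figure depends heavily on the workload.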
[snip]
I would have *guessed* that an AGLU (a functional unit providing
address generation and "simple" ALU functions, like AMD's Bobcat?)
would be more area and power efficient than having separate
pipelines, at least for store address generation.
Be careful with assumptions like that. Silicon area with no moving
signals is remarkably power efficient.
There is also the extra forwarding for separate functional units
(and perhaps some extra costs from increased distance), but I
admit that such factors really expose my complete lack of hardware
experience. (I am aware of clock gating as a power saving
technique and that "doing nothing" is cheap, but I have no
intuition of the weights of the tradeoffs.)
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
[snip interesting stuff]
Perhaps mildly out-of-order designs (say a little more than the
PowerPC 750) are not actually useful (other than as a starting
point for understanding out-of-order design). I do not understand
why such an intermediate design (between in-order and 30+
scheduling window out-of-order) is not useful. It may be that
It is useful, just not all that much.
going from say 10 to 30 scheduler entries gives so much benefit
for relatively little extra cost (and no design is so precisely
area constrained — even doubling core size would not mean pushing
L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
emotional investment in intermediate and mixed designs.
10 does not accommodate much ILP beyond that of a 10 deep pipeline.
30 accommodates L1 cache misses and typical FP latencies.
90 accommodates "almost everything else"
250 accommodates multiple L1 misses with L2 hits and "everything
else".
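One way to see where window sizes like these might come from is Little's law: the number of operations in flight is roughly the sustained issue rate times the latency one wants to hide. The sketch below uses that rule with assumed latencies; it is not the derivation behind the figures above, and `window_entries` is a made-up helper.

```python
import math


def window_entries(sustained_ipc, latency_cycles):
    """Little's law estimate of scheduler entries needed to hide a latency."""
    return math.ceil(sustained_ipc * latency_cycles)


# Assumed issue rate of 3 per cycle; the latencies are illustrative only.
for label, latency in [("short pipeline latency", 3),
                       ("L1 miss that hits in L2", 10),
                       ("longer stall", 30)]:
    print(label, window_entries(3, latency))
```

Under these assumptions a 10-cycle latency at three per cycle needs about 30 entries and a 30-cycle latency needs about 90, which is at least the same order as the numbers listed.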
Presumably the benefit depends on issue width and load-to-use
latency (pipeline depth, cache capacity, etc.). [For a cheap
"general purpose" processor, not covering FP latencies well may
not be very important.] Better hiding L1 _hit_ latency would seem
to provide a significant fraction of the frequency and ILP benefit
of out-of-order for a smallish core. (Some branch resolution
latency can also be hidden; an in-order core can delay resolution
until writeback of control-dependent instructions, but OoO's extra
buffering facilitates deeper speculation.)
If one has a scheduling window of 90 operations, having only three
issue ports seems imbalanced to me.
Out-of-order execution would also seem to facilitate opportunistic
use of existing functionality. Even just buffering decoded
instructions would seem to allow a 16-byte (aligned) instruction
fetch with two instruction decoders to issue more than two
instructions on some cycles without increasing register port
count, forwarding paths, etc. OoO would further increase the
frequency of being able to do more work with given hardware
resources.
Perhaps there may even be a case for a 1+ wide OoO core, i.e., an
OoO core which sometimes issues more than one instruction in a
cycle.
For something like a smart phone, one or two small cores might be
useful for background activity, tasks whose latency (within a
broad range) is not related to system responsiveness for the user.
For a server expected to run embarrassingly parallel workloads, if
Servers are not expected to run embarrassingly parallel applications,
they are expected to run an embarrassingly large number of essentially
serial applications.
Shared caching of instructions still seems beneficial in "server
workloads" compared to fully general multiprogram workloads. A
database server might even have more sharing, potentially having a
single process (so page table sharing would be more beneficial),
but that seems a less common use.
a wimpy core provides sufficient responsiveness, I would expect
most of the cores (possibly even all of the cores) to be wimpy.
There might not be many workloads with such characteristics;
Talk to Google about that....
Urs Hölzle of Google put out a paper "Brawny cores still beat
wimpy cores, most of the time" (2010). While some of the points
made in the paper (such as tail latency effects and software
development costs) are (in my opinion) quite significant, I thought
the argument significantly flawed. (I even wrote a blog post about
this paper: https://dandelion-watcher.blogspot.com/2012/01/weak-case-against-wimpy-cores.html)
The microservice programming model (motivated, from what I
understand, by problem-size and performance scaling and service
reliability with moderately reliable hardware without requiring
much programming effort to support scaling) may also have
significant implications on microarchitecture.
The design space is also very large. One can have heterogeneity of
wimpy and brawny cores at the rack level, wimpy-only chips within
a heterogeneous package, heterogeneity within a chip, temporal
heterogeneity (SMT and dynamic partitioning of core resources),
etc. Core strength can vary widely and performance balance can be
diverse (e.g., a core with a quarter of the performance of a
brawny core on general tasks might have — with coprocessors,
tightly coupled accelerators, or general microarchitecture —
approximately equal performance for some tasks).
The performance of weaker cores can also be increased by
increasing communication performance within local groups of such
cores. Exploiting this would likely require significant
programming effort, but some of the effort might be automated
(even before AI replaces programmers). This assumes that there is
significant communication that is less temporally local than
within a core (out-of-order execution changes the temporal
proximity of value communication; a result consumer might be
nearby in program order but substantially more distant in
execution order) and that intermediate resource allocation to
intermediate latency/bandwidth communication can be beneficial.
(I also think that there is an opportunity for optimization in the
on-chip network. Optimizing the on-chip network for any-to-any
communication seems less appropriate for many workloads not only
because of the often limited scale of communication but also
because the communication is, I suspect, often specialized.
Getting a network design that is very good for some uses and
adequate for others seems challenging even with software cooperation.
Rings seem really nice for pipeline-style parallelism and some
other uses, crossbars seem nice for small node groups with heavy
communication, grids seem to fit large node counts with nearest
neighbor communication (physical modeling?), etc. Channel width,
flit size, channel count also involve tradeoffs. Some
communication does not require sending an entire cache block of
data, but a smaller flit will have more overhead.)
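The flit-size tradeoff can be put in numbers with a toy model. The 8-byte per-flit header is purely an assumption for illustration, and the function names are made up.

```python
def flits_needed(message_bytes, payload_bytes):
    """Number of flits required to carry a message (ceiling division)."""
    return -(-message_bytes // payload_bytes)


def wire_bytes(message_bytes, payload_bytes, header_bytes):
    """Total bytes on the link, counting a header on every flit."""
    return flits_needed(message_bytes, payload_bytes) * (
        payload_bytes + header_bytes)


HEADER = 8  # assumed per-flit header size in bytes

# An 8-byte value versus a full 64-byte cache block, on two flit sizes.
for msg in (8, 64):
    for payload in (8, 64):
        print(msg, payload, wire_bytes(msg, payload, HEADER))
```

With this header size, an 8-byte message costs 16 bytes on the wire with small flits and 72 with cache-block flits, while a 64-byte block costs 128 bytes with small flits and 72 with large ones.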
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!
On 3/24/24 4:39 PM, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.
An argument might be made that some designs would have no use for
most of such extra state. Performance monitoring is useful for
software development (and theoretically for OS decisions for
scheduling, core migration, and other functions), but seems likely
to be highly underutilized for typical use. A55 is presumably
large enough that a synthesis-time removal of much of this
functionality would have a tiny effect on total area. Even for a
microcontroller the area cost might not be problematic.
Paul A. Clayton wrote:
On 3/24/24 4:39 PM, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.
Many of these registers are configuration control that
get set once at boot and never change.
Others are dynamic but not time critical, like debug registers.
Only a small number would be diddled on a regular basis,
like interrupt control.
They don't all need the same access speed -
depending on usage some (most?) can be on "slow" buses
that maybe take multiple clocks to read or write.
On 3/24/24 4:39 PM, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!
I suspect Paul is referring to what ARMv8 calls "System Registers";
Yes. (There were also some debug registers, performance monitoring
registers, trace registers, etc.)
despite the name, most are stored in flops, and in the case of
the ID registers, wires (perhaps anded with local e-fuses).
Yes, many of the bits would be implemented as ROM/PROM and many
would presumably be scattered about because they control/interact
with specific functionality. They are similar to I/O device
registers. (I/O devices have also become more complex.)
However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.
An argument might be made that some designs would have no use for
most of such extra state. Performance monitoring is useful for
software development (and theoretically for OS decisions for
scheduling, core migration, and other functions), but seems likely
to be highly underutilized for typical use.
"Paul A. Clayton" <[email protected]> writes:
On 3/24/24 4:39 PM, Scott Lurndal wrote:
There is a significant demand for performance monitoring. Note
that in addition to standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Scott Lurndal wrote:
"Paul A. Clayton" <[email protected]> writes:
On 3/24/24 4:39 PM, Scott Lurndal wrote:
There is a significant demand for performance monitoring. Note
that in addition to standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
My 66000 Architecture defines 8 performance counters at each layer of
the design:: cores get 8 counters, L1s get 8 counters, L3s get 8
counters, Interconnect gets 8 counters, Memory Controller gets 8 counters,
PCIe root gets 8 counters--and every instance multiplies the counters.
[email protected] (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel
or AMD.
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
[email protected] (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel
or AMD.
The question is if "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.
Anton Ertl wrote:
[email protected] (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.
My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world
programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.
Having reverse engineered the original Pentium EMON counters I got a
meeting with Intel about their next cpu (the PentiumPro), what I was
told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )
[email protected] (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.
My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world
programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.
The question is if "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.
Terje Mathisen <[email protected]> writes:
Having reverse engineered the original Pentium EMON counters I got a
meeting with Intel about their next cpu (the PentiumPro), what I was
told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )
Scan chains. The modern interface to scan chains (which we used on the
mainframes in the late 70's/early 80's) is JTAG.
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.
Look at vtune, for example.
[email protected] (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.
You don't want to use a full-blown microarchitectural emulator for a
long-running program.
[email protected] (Anton Ertl) writes:
[email protected] (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.
You don't want to use a full-blown microarchitectural emulator for a
long-running program.
Generally hardware folks don't run 'long-running programs' when
analyzing performance, they use the emulator for determining latencies,
bandwidths and efficacy of cache coherency algorithms and
cache prefetchers.
Their target is not application analysis.