AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
to their execution units; they are done directly via register renaming,
up to a certain limit. This will, of course, decrease latencies,
especially on an OoO machine.
POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).
So, what are the tradeoffs? Will a zero-cycle register move make
the pipeline deeper?
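To put a number on the latency effect described above, here is a minimal sketch of the critical path through a purely serial dependency chain. Only the 2-cycle POWER figure comes from the post; the function name, the `renamed` table and its 1-cycle add are assumptions made for illustration.

```python
def chain_latency(ops, latency):
    """Cycles until the last result of a purely serial dependency chain is ready.

    ops     -- sequence of mnemonics, each depending on the previous one
    latency -- dict mapping mnemonic to result latency in cycles
    """
    return sum(latency[op] for op in ops)

# Hypothetical latency tables; only the 2-cycle POWER figure is from the post.
power_like = {"add": 2, "mr": 2}  # move goes through the ALU
renamed = {"add": 1, "mr": 0}     # move handled at rename (assumed 1-cycle add)

chain = ["add", "mr", "add", "mr", "add"]
print(chain_latency(chain, power_like))  # 10
print(chain_latency(chain, renamed))     # 3
```

The model ignores everything except the dependency chain (issue width, ports, the renamer limit), so it only shows how the per-move latency adds up on the critical path.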
Thomas Koenig <[email protected]> writes:
AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
to their execution units; they are done directly via register renaming,
up to a certain limit.
The limit in recent CPUs seems to be the width of the register renamer
(6 on Golden Cove and Zen3). For Golden Cove, that optimization
includes constant adds in the range -1024..1023 with the intermediate
sum not exceeding -4096..4095.
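As a rough illustration of the rename-time handling described above, the following toy model keeps an (physical register, offset) pair per architectural register. The class and method names are hypothetical, and the real Golden Cove mechanism is not public; only the two ranges are taken from the post. The per-cycle renamer width limit is not modelled.

```python
class Renamer:
    """Toy rename table: architectural register -> (physical register, offset)."""

    IMM_RANGE = range(-1024, 1024)  # foldable immediate, per the post
    SUM_RANGE = range(-4096, 4096)  # allowed accumulated offset, per the post

    def __init__(self, arch_regs):
        self.table = {r: (i, 0) for i, r in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)
        self.executed = 0  # ops that had to go to an execution unit

    def _allocate(self, dst):
        # Fallback path: the op executes and writes a fresh physical register.
        self.table[dst] = (self.next_phys, 0)
        self.next_phys += 1
        self.executed += 1

    def mov(self, dst, src):
        # Move elimination: dst simply aliases src's mapping.
        self.table[dst] = self.table[src]

    def add_imm(self, dst, imm):
        phys, off = self.table[dst]
        if imm in self.IMM_RANGE and (off + imm) in self.SUM_RANGE:
            self.table[dst] = (phys, off + imm)  # folded at rename
        else:
            self._allocate(dst)                  # falls back to the ALU


r = Renamer(["rax", "rbx"])
r.mov("rbx", "rax")      # rbx now shares rax's physical register
r.add_imm("rbx", 1000)   # folded: offset becomes 1000
print(r.table["rbx"], r.executed)  # (0, 1000) 0
```

In this sketch a consumer of `rbx` would have to apply the stored offset when it reads the physical register; how and where real hardware does that is an assumption left out here.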
This will, of course, decrease latencies,
especially on an OoO machine.
POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).
Two cycles of latency for arithmetic instructions like integer adds?
Ouch!
So, what are the tradeoffs? Will a zero-cycle register move make
the pipeline deeper?
Pipeline depths have not been published for Intel and AMD CPUs in
recent years. ARM publishes its pipeline lengths. One could compare
the last ARM of a line without this feature to the first with this
feature, and get an indication whether it made the pipeline deeper.
Michael S <[email protected]> wrote:
On Tue, 02 Jan 2024 10:51:40 GMT
[email protected] (Anton Ertl) wrote:
Thomas Koenig <[email protected]> writes:
AFAIK, modern Intel, AMD and ARM CPUs do not forward register
moves to their execution units; they are done directly via
register renaming, up to a certain limit.
The limit in recent CPUs seems to be the width of the register
renamer (6 on Golden Cove and Zen3). For Golden Cove, that
optimization includes constant adds in the range -1024..1023 with
the intermediate sum not exceeding -4096..4095.
This will, of course, decrease latencies,
especially on an OoO machine.
POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).
Two cycles of latency for arithmetic instructions like integer
adds? Ouch!
The same as all recent Apple 'performance' cores. Which didn't
prevent them from being pretty damn good 'latency' engines.
I speak only a little ARM, but if I read
https://dougallj.github.io/applecpu/firestorm-int.html correctly,
then add is only two cycles if one of the operands needs to be
extended (at least for the M1 chip). Was this changed in later
versions?
On Tue, 02 Jan 2024 10:51:40 GMT
[email protected] (Anton Ertl) wrote:
Thomas Koenig <[email protected]> writes:
AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
to their execution units; they are done directly via register
renaming, up to a certain limit.
The limit in recent CPUs seems to be the width of the register renamer
(6 on Golden Cove and Zen3). For Golden Cove, that optimization
includes constant adds in the range -1024..1023 with the intermediate
sum not exceeding -4096..4095.
This will, of course, decrease latencies,
especially on an OoO machine.
POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).
Two cycles of latency for arithmetic instructions like integer adds?
Ouch!
The same as all recent Apple 'performance' cores. Which didn't prevent
them from being pretty damn good 'latency' engines.
Anton Ertl <[email protected]> wrote:
Thomas Koenig <[email protected]> writes:
AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
to their execution units; they are done directly via register renaming,
up to a certain limit.
The limit in recent CPUs seems to be the width of the register renamer
(6 on Golden Cove and Zen3). For Golden Cove, that optimization
includes constant adds in the range -1024..1023 with the intermediate
sum not exceeding -4096..4095.
This will, of course, decrease latencies,
especially on an OoO machine.
POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).
Two cycles of latency for arithmetic instructions like integer adds?
Ouch!
Yes, ouch. I don't know what they spend that extra cycle on.
Probably, their die just got too big, their timing too aggressive,
or rather a combination of both.
By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
actually do register copying through the ALU, like architectures
of old.
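The alias can be checked at the encoding level. This is a small sketch with hypothetical helper names that builds the 32-bit word for the X-form `or RA,RS,RB` (primary opcode 31, extended opcode 444) and derives `mr` from it.

```python
def encode_or(ra, rs, rb, rc=0):
    """Encode Power ISA X-form `or RA,RS,RB` as a 32-bit word."""
    for reg in (ra, rs, rb):
        if not 0 <= reg <= 31:
            raise ValueError("register number out of range")
    # Fields from the most significant end: opcode 31, RS, RA, RB, XO 444, Rc.
    return (31 << 26) | (rs << 21) | (ra << 16) | (rb << 11) | (444 << 1) | rc


def encode_mr(ra, rs):
    """`mr RA,RS` is the extended mnemonic for `or RA,RS,RS`."""
    return encode_or(ra, rs, rs)


print(hex(encode_mr(3, 4)))  # 0x7c832378
```

Since there is no separate opcode for the move, the hardware sees an ordinary `or`; a core that wanted to eliminate it at rename would have to recognize the case where both source fields name the same register.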
So, what are the tradeoffs? Will a zero-cycle register move make
the pipeline deeper?
Pipeline depths have not been published for Intel and AMD CPUs in
recent years. ARM publishes its pipeline lengths. One could compare
the last ARM of a line without this feature to the first with this
feature, and get an indication whether it made the pipeline deeper.
Does anybody (Scott?) have an indication of which chips this
might be?
I can't speak to anything non-public. The Wikipedia page for
neoverse shows a pipeline depth of 10 cycles for the N2 family.
https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2
Scott Lurndal wrote:
I can't speak to anything non-public. The Wikipedia page for
neoverse shows a pipeline depth of 10 cycles for the N2 family.
https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2
The 4-cycle LD-use latency indicates a high-frequency, wide-issue
design; together with that, the 10-cycle pipeline depth indicates
little time for instruction fusing or register write elision.