Forum: >>> Magnum BBS <<<

Benefit vs cost of zero-cycle register moves

From Thomas Koenig@21:1/5 to All on Mon Jan 1 21:16:11 2024

AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
to their execution units; they are done directly via register renaming,
up to a certain limit. This will, of course, decrease latencies,
especially on an OoO machine.

POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).

So, what are the tradeoffs? Will a zero-cycle register move make
the pipeline deeper?
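
As a rough illustration of why handling moves in the renamer shortens dependency chains, here is a minimal Python sketch. It is a toy model, not any vendor's design: the names `Renamer` and `chain_latency` are made up for this example, and it ignores the reference counting that real hardware needs before it can free a physical register shared by several architectural registers.

```python
# Toy model of register renaming with optional move elimination.
# All names are hypothetical; this is a sketch, not a real microarchitecture.

class Renamer:
    def __init__(self, num_arch_regs):
        # Each architectural register starts mapped to its own physical register.
        self.map = {f"r{i}": i for i in range(num_arch_regs)}
        self.next_phys = num_arch_regs
        # Cycle at which each physical register's value becomes available.
        self.ready = {p: 0 for p in range(num_arch_regs)}

    def rename(self, op, dst, srcs, latency, eliminate_moves):
        """Rename one instruction; return the cycle its result is ready."""
        src_phys = [self.map[s] for s in srcs]
        if op == "mov" and eliminate_moves:
            # Zero-cycle move: dst just aliases the source's physical register.
            self.map[dst] = src_phys[0]
            return self.ready[src_phys[0]]
        # Otherwise allocate a fresh physical register and execute in an ALU.
        p = self.next_phys
        self.next_phys += 1
        start = max(self.ready[s] for s in src_phys)
        self.ready[p] = start + latency
        self.map[dst] = p
        return self.ready[p]


def chain_latency(n_moves, mov_latency, eliminate_moves):
    """Critical-path length of n dependent moves r1<-r0, r2<-r1, ..."""
    r = Renamer(n_moves + 1)
    done = 0
    for i in range(n_moves):
        done = r.rename("mov", f"r{i + 1}", [f"r{i}"], mov_latency, eliminate_moves)
    return done


print(chain_latency(10, 2, False))  # moves go through a 2-cycle ALU
print(chain_latency(10, 2, True))   # moves handled in the renamer
```

With a 2-cycle move latency (the figure reported above for POWER), ten dependent moves cost 20 cycles in this model when they execute in an ALU and 0 cycles when the renamer aliases them.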

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Thomas Koenig on Mon Jan 1 23:59:35 2024

Thomas Koenig wrote:

> AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
> to their execution units; they are done directly via register renaming,
> up to a certain limit. This will, of course, decrease latencies,
> especially on an OoO machine.
>
> POWER is an exception (surprising to me); a dependency in an
> MR instruction will introduce two cycles of latency, the usual
> latency for an arithmetic instruction (also on Power10, I measured
> that today).
>
> So, what are the tradeoffs? Will a zero-cycle register move make
> the pipeline deeper?

If you have 3 stages of register rename in your pipeline you can 0-cycle
MOVs (equivalent to 4-5 stages between Fetch and Issue).

If you have a thinner Decode pipeline (say 1 cycle) you cannot.

There is also a dependency on the style of register file you have.

A CAM read decoder with a binary write decoder cannot perform MOVs in
0-cycles, whereas reading the RF after reservation station launch can.

Mostly, whether MOVs take 0 cycles or not makes little difference to
performance when the depth of the execution window is 16+ cycles, or
when calculation latency takes multiple cycles (FP) or incurs memory
latency (pointer chasing, high cache-miss rates).

Also note: x86 has more MOV instructions than most RISCs.

From Thomas Koenig@21:1/5 to Anton Ertl on Tue Jan 2 12:05:58 2024

Anton Ertl wrote:

> Thomas Koenig writes:

>> AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>> to their execution units; they are done directly via register renaming,
>> up to a certain limit.

> The limit in recent CPUs seems to be the width of the register renamer
> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
> includes constant adds in the range -1024..1023 with the intermediate
> sum not exceeding -4096..4095.

>> This will, of course, decrease latencies,
>> especially on an OoO machine.
>>
>> POWER is an exception (surprising to me); a dependency in an
>> MR instruction will introduce two cycles of latency, the usual
>> latency for an arithmetic instruction (also on Power10, I measured
>> that today).

> Two cycles of latency for arithmetic instructions like integer adds?
> Ouch!

Yes, ouch. I don't know what they spend that extra cycle on.
Probably, their die just got too big, their timing too aggressive,
or rather a combination of both.

By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
actually do register copying through the ALU, like architectures
of old.
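
To make the alias concrete, here is a small sketch of the encoding. It assumes the Power ISA X-form layout for `or RA,RS,RB` (primary opcode 31, extended opcode 444); the helper names `encode_or` and `encode_mr` are made up for illustration and are not part of any assembler.

```python
def encode_or(ra, rs, rb, rc=0):
    """Encode `or RA,RS,RB` as a 32-bit word (X-form, opcode 31, XO 444)."""
    for r in (ra, rs, rb):
        if not 0 <= r < 32:
            raise ValueError("register number out of range")
    return (31 << 26) | (rs << 21) | (ra << 16) | (rb << 11) | (444 << 1) | rc


def encode_mr(ra, rs):
    """`mr RA,RS` is the same word as `or RA,RS,RS`."""
    return encode_or(ra, rs, rs)


print(hex(encode_mr(3, 4)))  # 0x7c832378, i.e. "mr r3,r4"
```

There is no separate move opcode for the hardware to recognise here; a core that wanted to eliminate such moves would have to detect an `or` whose two source fields are equal.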

>> So, what are the tradeoffs? Will a zero-cycle register move make
>> the pipeline deeper?

> Pipeline depths have not been published for Intel and AMD CPUs in
> recent years. ARM publishes its pipeline lengths. One could compare
> the last ARM of a line without this feature to the first with this
> feature, and get an indication whether it made the pipeline deeper.

Does anybody (Scott?) have an indication of which chips this
might be?

From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jan 2 10:51:40 2024

Thomas Koenig writes:

> AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
> to their execution units; they are done directly via register renaming,
> up to a certain limit.

The limit in recent CPUs seems to be the width of the register renamer
(6 on Golden Cove and Zen3). For Golden Cove, that optimization
includes constant adds in the range -1024..1023 with the intermediate
sum not exceeding -4096..4095.
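
One way to picture the two limits is a renamer that tracks a small offset alongside each register mapping. The sketch below is purely an assumption about how such folding could be modelled; only the two ranges come from the text above, and the function name and the reset-to-zero behaviour are hypothetical.

```python
IMM_MIN, IMM_MAX = -1024, 1023   # foldable immediate range quoted above
SUM_MIN, SUM_MAX = -4096, 4095   # bound on the accumulated offset quoted above


def fold_constant_adds(immediates):
    """Count how many adds in a dependent chain `add r, r, imm` get folded.

    Hypothetical model: the renamer keeps (physical register, offset) for
    the register. An add is folded while both limits hold; an add that is
    not folded executes in an ALU and the tracked offset restarts at 0.
    """
    offset = 0
    folded = 0
    for imm in immediates:
        new_offset = offset + imm
        if IMM_MIN <= imm <= IMM_MAX and SUM_MIN <= new_offset <= SUM_MAX:
            offset = new_offset
            folded += 1
        else:
            offset = 0  # value is materialised by the ALU
    return folded


print(fold_constant_adds([1] * 10))     # 10: small increments all fold
print(fold_constant_adds([1000] * 6))   # 5: the fifth add would reach 5000
print(fold_constant_adds([2000]))       # 0: immediate outside -1024..1023
```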

> This will, of course, decrease latencies,
> especially on an OoO machine.
>
> POWER is an exception (surprising to me); a dependency in an
> MR instruction will introduce two cycles of latency, the usual
> latency for an arithmetic instruction (also on Power10, I measured
> that today).

Two cycles of latency for arithmetic instructions like integer adds?
Ouch!

> So, what are the tradeoffs? Will a zero-cycle register move make
> the pipeline deeper?

Pipeline depths have not been published for Intel and AMD CPUs in
recent years. ARM publishes its pipeline lengths. One could compare
the last ARM of a line without this feature to the first with this
feature, and get an indication whether it made the pipeline deeper.

The main tradeoff seems to be in putting the effort in to implement
this optimization. Even Gracemont (the current Intel E-Core) can
perform 5 dependent moves (but not constant adds) in one cycle, so it
probably does not cost much area or energy compared to its benefits.

My guess is that Power10 is designed more for throughput computing
where lots of instruction-level parallelism is available so you can
live with long latencies (fill it with independent instructions),
while Intel, AMD, ARM and Apple design also for code where latency
plays a bigger role. As expressed in the LaTeX benchmark (lower is
better) <https://www.complang.tuwien.ac.at/franz/latex-bench>:

Machine                         Clock    OS, TeX distribution                    Time
Power 10                        3900MHz  AlmaLinux 9.2, TeX Live 2020            0.468
Core i3-1315U, Gracemont        2600MHz  Ub.22.04, texlive-latex-base            0.388
Apple M1 Firestorm              3000MHz  Asahi Linux Debian pre12                0.27
Core i3-1315U, Golden Cove      3800MHz  Ub.22.04, texlive-latex-base            0.221
Ryzen 7 5800X                   4800MHz  Debian 11 (64-bit), texlive-latex-base  0.191
Xeon W-1370P (=Core i7-11700K)  5200MHz  Debian 11 (64-bit)                      0.175

I.e., a current Intel E-Core running (for unknown reasons) 700MHz
below its nominal speed is faster on this benchmark than Power10.
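
To separate clock speed from per-clock efficiency, one can multiply each result by the listed clock. The sketch below does that arithmetic; it assumes the results are run times in seconds and that each machine held the listed clock for the whole run, neither of which is stated in the table itself.

```python
# (machine, clock in MHz, benchmark result) copied from the table above
results = [
    ("Power10", 3900, 0.468),
    ("Gracemont", 2600, 0.388),
    ("Apple M1 Firestorm", 3000, 0.27),
    ("Golden Cove", 3800, 0.221),
    ("Ryzen 7 5800X", 4800, 0.191),
    ("Xeon W-1370P", 5200, 0.175),
]


def mega_cycles(clock_mhz, seconds):
    """Clock cycles spent on the run, in millions (MHz * s = Mcycles)."""
    return clock_mhz * seconds


for name, mhz, t in results:
    print(f"{name:20s} {mega_cycles(mhz, t):7.0f} Mcycles")
```

Under those assumptions Power10 spends roughly 1825 Mcycles on the benchmark, against about 1009 for Gracemont and 810 to 920 for the other four cores, so the gap is in work per clock rather than in frequency.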

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup

From Michael S@21:1/5 to Anton Ertl on Tue Jan 2 15:58:26 2024

On Tue, 02 Jan 2024 10:51:40 GMT
Anton Ertl wrote:

> Thomas Koenig writes:

>> AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>> to their execution units; they are done directly via register
>> renaming, up to a certain limit.

> The limit in recent CPUs seems to be the width of the register renamer
> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
> includes constant adds in the range -1024..1023 with the intermediate
> sum not exceeding -4096..4095.

>> This will, of course, decrease latencies,
>> especially on an OoO machine.
>>
>> POWER is an exception (surprising to me); a dependency in an
>> MR instruction will introduce two cycles of latency, the usual
>> latency for an arithmetic instruction (also on Power10, I measured
>> that today).

> Two cycles of latency for arithmetic instructions like integer adds?
> Ouch!

The same as all recent Apple 'performance' cores. Which didn't prevent
them from being pretty damn good 'latency' engines.

From Michael S@21:1/5 to Thomas Koenig on Tue Jan 2 17:57:04 2024

On Tue, 2 Jan 2024 15:15:53 -0000 (UTC)
Thomas Koenig wrote:

> Michael S wrote:

>> On Tue, 02 Jan 2024 10:51:40 GMT
>> Anton Ertl wrote:

>>> Thomas Koenig writes:

>>>> AFAIK, modern Intel, AMD and ARM CPUs do not forward register
>>>> moves to their execution units; they are done directly via
>>>> register renaming, up to a certain limit.

>>> The limit in recent CPUs seems to be the width of the register
>>> renamer (6 on Golden Cove and Zen3). For Golden Cove, that
>>> optimization includes constant adds in the range -1024..1023 with
>>> the intermediate sum not exceeding -4096..4095.

>>>> This will, of course, decrease latencies,
>>>> especially on an OoO machine.
>>>>
>>>> POWER is an exception (surprising to me); a dependency in an
>>>> MR instruction will introduce two cycles of latency, the usual
>>>> latency for an arithmetic instruction (also on Power10, I measured
>>>> that today).

>>> Two cycles of latency for arithmetic instructions like integer
>>> adds? Ouch!

>> The same as all recent Apple 'performance' cores. Which didn't
>> prevent them from being pretty damn good 'latency' engines.

> I speak only a little ARM, but if I read
> https://dougallj.github.io/applecpu/firestorm-int.html correctly,
> then add is only two cycles if one of the operands needs to be
> extended (at least for the M1 chip). Was this changed in later
> versions?

You are right. Somehow I misremembered.

From Thomas Koenig@21:1/5 to Michael S on Tue Jan 2 15:15:53 2024

Michael S wrote:

> On Tue, 02 Jan 2024 10:51:40 GMT
> Anton Ertl wrote:

>> Thomas Koenig writes:

>>> AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>>> to their execution units; they are done directly via register
>>> renaming, up to a certain limit.

>> The limit in recent CPUs seems to be the width of the register renamer
>> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
>> includes constant adds in the range -1024..1023 with the intermediate
>> sum not exceeding -4096..4095.

>>> This will, of course, decrease latencies,
>>> especially on an OoO machine.
>>>
>>> POWER is an exception (surprising to me); a dependency in an
>>> MR instruction will introduce two cycles of latency, the usual
>>> latency for an arithmetic instruction (also on Power10, I measured
>>> that today).

>> Two cycles of latency for arithmetic instructions like integer adds?
>> Ouch!

> The same as all recent Apple 'performance' cores. Which didn't prevent
> them from being pretty damn good 'latency' engines.

I speak only a little ARM, but if I read
https://dougallj.github.io/applecpu/firestorm-int.html correctly,
then add is only two cycles if one of the operands needs to be
extended (at least for the M1 chip). Was this changed in later
versions?

From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 2 19:58:29 2024

Thomas Koenig writes:

> Anton Ertl wrote:

>> Thomas Koenig writes:

>>> AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>>> to their execution units; they are done directly via register renaming,
>>> up to a certain limit.

>> The limit in recent CPUs seems to be the width of the register renamer
>> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
>> includes constant adds in the range -1024..1023 with the intermediate
>> sum not exceeding -4096..4095.

>>> This will, of course, decrease latencies,
>>> especially on an OoO machine.
>>>
>>> POWER is an exception (surprising to me); a dependency in an
>>> MR instruction will introduce two cycles of latency, the usual
>>> latency for an arithmetic instruction (also on Power10, I measured
>>> that today).

>> Two cycles of latency for arithmetic instructions like integer adds?
>> Ouch!

> Yes, ouch. I don't know what they spend that extra cycle on.
> Probably, their die just got too big, their timing too aggressive,
> or rather a combination of both.
>
> By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
> actually do register copying through the ALU, like architectures
> of old.

>>> So, what are the tradeoffs? Will a zero-cycle register move make
>>> the pipeline deeper?

>> Pipeline depths have not been published for Intel and AMD CPUs in
>> recent years. ARM publishes its pipeline lengths. One could compare
>> the last ARM of a line without this feature to the first with this
>> feature, and get an indication whether it made the pipeline deeper.

> Does anybody (Scott?) have an indication of which chips this
> might be?

I can't speak to anything non-public. The Wikipedia page for
Neoverse shows a pipeline depth of 10 cycles for the N2 family.

https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

From MitchAlsup@21:1/5 to Scott Lurndal on Thu Jan 4 00:15:13 2024

Scott Lurndal wrote:

> I can't speak to anything non-public. The Wikipedia page for
> Neoverse shows a pipeline depth of 10 cycles for the N2 family.
>
> https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

The 4-cycle LD-use latency indicates a high-frequency, wide-issue
design; along with that, the 10-cycle pipeline depth indicates little
time for instruction fusing or register write elision.

From Thomas Koenig@21:1/5 to MitchAlsup on Thu Jan 4 09:22:46 2024

MitchAlsup wrote:

> Scott Lurndal wrote:

>> I can't speak to anything non-public. The Wikipedia page for
>> Neoverse shows a pipeline depth of 10 cycles for the N2 family.
>>
>> https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

> The 4-cycle LD-use latency indicates a high-frequency, wide-issue
> design; along with that, the 10-cycle pipeline depth indicates little
> time for instruction fusing or register write elision.

The ARM Neoverse N2 Software Optimization Guide gives a one-cycle
execution latency for register to register moves (with four in
parallel). Constant loads take zero cycles; and simple register
moves are also listed under "Zero Latency MOVs" with the somewhat
less than illuminating caveat

"The last 3 instructions may not be executed with zero latency
under certain conditions".

https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/aarch64/tuning_models/neoversen2.h;hb=HEAD

gives that cost as 1 (so presumably these conditions happen).

They also fuse some instruction pairs for AArch64:

CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
CMP (immediate) + CSEL
CMP (register) + CSEL
CMP (immediate) + CSET
CMP (register) + CSET
TST (immediate) + B.cond
TST (register) + B.cond
BICS (register) + B.cond
NOP + Any instruction

plus for both 64-bit and 32-bit

AESE + AESMC (see Section 4.6 on AES Encryption/Decryption)
AESD + AESIMC (see Section 4.6 on AES Encryption/Decryption)
CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
TST (immediate) + B.cond
TST (register) + B.cond
BICS (register) + B.cond

where conditions apply which they actually spell out.
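
As a sketch of what pairwise fusion means for an instruction stream, the following Python groups adjacent mnemonics using the AArch64 pairs listed above. It is deliberately simplified: the function name `fuse` is made up, `B.cond` stands in for any conditional branch, immediate and register forms are not distinguished, and the extra conditions the guide spells out are not modelled.

```python
# Pairs taken from the AArch64 list above; the guide's conditions are NOT modelled.
FUSIBLE = {
    ("CMP", "B.cond"), ("CMN", "B.cond"),
    ("CMP", "CSEL"), ("CMP", "CSET"),
    ("TST", "B.cond"), ("BICS", "B.cond"),
}


def fuse(mnemonics):
    """Greedily group adjacent fusible pairs; returns a list of tuples."""
    out = []
    i = 0
    while i < len(mnemonics):
        has_next = i + 1 < len(mnemonics)
        # "NOP + Any instruction" is listed as fusible as well.
        if has_next and (mnemonics[i] == "NOP"
                         or (mnemonics[i], mnemonics[i + 1]) in FUSIBLE):
            out.append((mnemonics[i], mnemonics[i + 1]))
            i += 2
        else:
            out.append((mnemonics[i],))
            i += 1
    return out


print(fuse(["CMP", "B.cond", "ADD", "TST", "B.cond"]))
# [('CMP', 'B.cond'), ('ADD',), ('TST', 'B.cond')]
```

In this model five instructions occupy three slots, which is the kind of saving fusion is after; whether a real core achieves it depends on the conditions omitted here.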
