• Benefit vs cost of zero-cycle register moves

    From Thomas Koenig@21:1/5 to All on Mon Jan 1 21:16:11 2024
    AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
    to their execution units; they are done directly via register renaming,
    up to a certain limit. This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    So, what are the tradeoffs? Will a zero-cycle register move make
    the pipeline deeper?
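
    The renaming approach described above can be sketched as a toy model.
    This is only an illustration under stated assumptions: the function
    name chain_latency, the tuple format of the program and the default
    1-cycle ALU latency are all hypothetical, and a real renamer also has
    to track how many architectural registers share a physical register
    before it can free it.

```python
# Toy model of move elimination at register rename (illustrative only).
# Each architectural register maps to a physical register; each physical
# register records the cycle at which its value becomes available.

def chain_latency(program, eliminate_moves, alu_latency=1):
    """Return the cycle at which the last result of `program` is ready.

    program: list of (op, dst, src) tuples, where op is "mov" or "add".
    eliminate_moves: if True, a "mov" only copies the rename mapping and
    never occupies an execution unit.
    """
    rename = {}     # architectural register -> physical register id
    ready = {}      # physical register id -> cycle its value is ready
    next_phys = 0

    def lookup(arch):
        # Registers that are read before being written are ready at cycle 0.
        nonlocal next_phys
        if arch not in rename:
            rename[arch] = next_phys
            ready[next_phys] = 0
            next_phys += 1
        return rename[arch]

    last = 0
    for op, dst, src in program:
        src_phys = lookup(src)
        if op == "mov" and eliminate_moves:
            # Zero-cycle move: dst simply aliases the source's physical register.
            rename[dst] = src_phys
            last = ready[src_phys]
            continue
        # Everything else (including a non-eliminated mov) goes through the ALU.
        rename[dst] = next_phys
        ready[next_phys] = ready[src_phys] + alu_latency
        last = ready[next_phys]
        next_phys += 1
    return last


# A dependency chain that alternates adds and moves.
program = [
    ("add", "r1", "r0"),
    ("mov", "r2", "r1"),
    ("add", "r3", "r2"),
    ("mov", "r4", "r3"),
    ("add", "r5", "r4"),
]

print(chain_latency(program, eliminate_moves=True))
print(chain_latency(program, eliminate_moves=False))
print(chain_latency(program, eliminate_moves=False, alu_latency=2))
```

    In this model the moves disappear from the critical path when they are
    handled at rename, and cost a full ALU latency each when they are not.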

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Mon Jan 1 23:59:35 2024
    Thomas Koenig wrote:

    AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
    to their execution units; they are done directly via register renaming,
    up to a certain limit. This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    So, what are the tradeoffs? Will a zero-cycle register move make
    the pipeline deeper?

    If you have 3 stages of register rename in your pipeline you can 0-cycle
    MOVs (equivalent to 4-5 stages between Fetch and Issue).

    If you have a thinner Decode pipeline (say 1 cycle) you cannot.

    There is also a dependency on the style of register file you have.

    A CAM read decoder with a binary write decoder cannot perform MOVs in
    0-cycles, whereas reading the RF after reservation station launch can.

    Whether MOVs take 0 cycles or not mostly makes little difference to
    performance when the depth of the execution window is 16+ cycles, when
    calculation latency takes multiple cycles (FP), or when the code incurs
    memory latency (pointer chasing, high cache-miss rates).
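
    The latency half of that argument can be put into numbers with a rough
    sketch. The function name and the example latencies (4 cycles for an
    FP operation, 100 cycles for a cache miss) are assumptions chosen for
    illustration, not measurements, and the model ignores the effect of
    the execution window entirely.

```python
def move_share_of_critical_path(n_moves, other_latency, mov_latency=1):
    """Fraction of a dependency chain's latency that is spent in register moves."""
    total = n_moves * mov_latency + other_latency
    return (n_moves * mov_latency) / total


# One non-eliminated 1-cycle move sitting in chains of different lengths.
# The latencies below are assumed values, used only to show the trend.
for label, other in [
    ("1-cycle integer op", 1),
    ("4-cycle FP op", 4),
    ("100-cycle cache miss", 100),
]:
    share = move_share_of_critical_path(1, other)
    print(f"{label:22s} move share of chain latency: {share:.1%}")
```

    The longer the rest of the chain, the smaller the share that a
    zero-cycle move could remove.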

    Also note: x86 has more MOV instructions than most RISCs.

  • From Thomas Koenig@21:1/5 to Anton Ertl on Tue Jan 2 12:05:58 2024
    Anton Ertl <[email protected]> wrote:
    Thomas Koenig <[email protected]> writes:
    AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
    to their execution units; they are done directly via register renaming,
    up to a certain limit.

    The limit in recent CPUs seems to be the width of the register renamer
    (6 on Golden Cove and Zen3). For Golden Cove, that optimization
    includes constant adds in the range -1024..1023 with the intermediate
    sum not exceeding -4096..4095.

    This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    Two cycles of latency for arithmetic instructions like integer adds?
    Ouch!

    Yes, ouch. I don't know what they spend that extra cycle on.
    Probably, their die just got too big, their timing too aggressive,
    or rather a combination of both.

    By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
    actually do register copying through the ALU, like architectures
    of old.
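
    The alias can be made concrete with a small encoder. This is a sketch
    based on my reading of the Power ISA X-form layout (primary opcode 31,
    extended opcode 444 for "or"); the helper names encode_or and
    encode_mr are hypothetical, and the field positions are worth checking
    against the ISA document before relying on them.

```python
def encode_or(ra, rs, rb, rc=0):
    """Encode `or RA,RS,RB` as a 32-bit Power ISA X-form instruction word.

    Assumed layout (most significant bit first): primary opcode 31,
    RS, RA, RB, extended opcode 444, then the Rc bit.
    """
    for reg in (ra, rs, rb):
        if not 0 <= reg <= 31:
            raise ValueError("register number out of range")
    if rc not in (0, 1):
        raise ValueError("Rc must be 0 or 1")
    return (31 << 26) | (rs << 21) | (ra << 16) | (rb << 11) | (444 << 1) | rc


def encode_mr(ra, rs):
    """`mr RA,RS` is the same instruction word as `or RA,RS,RS`."""
    return encode_or(ra, rs, rs)


print(f"mr r3,r4    -> {encode_mr(3, 4):08x}")
print(f"or r3,r4,r4 -> {encode_or(3, 4, 4):08x}")
```

    Since both mnemonics produce the same word, the hardware would have
    to recognise "or" with identical source registers if it wanted to
    treat the move specially.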

    So, what are the tradeoffs? Will a zero-cycle register move make
    the pipeline deeper?

    Pipeline depths have not been published for Intel and AMD CPUs in
    recent years. ARM publishes its pipeline lengths. One could compare
    the last ARM of a line without this feature to the first with this
    feature, and get an indication whether it made the pipeline deeper.

    Does anybody (Scott?) have an indication of which chips this
    might be?

  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jan 2 10:51:40 2024
    Thomas Koenig <[email protected]> writes:
    AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
    to their execution units; they are done directly via register renaming,
    up to a certain limit.

    The limit in recent CPUs seems to be the width of the register renamer
    (6 on Golden Cove and Zen3). For Golden Cove, that optimization
    includes constant adds in the range -1024..1023 with the intermediate
    sum not exceeding -4096..4095.
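
    The two ranges can be read as a pair of checks in the renamer. The
    following is a guess at the bookkeeping, not a description of the
    actual hardware: the function name is hypothetical, and the rule that
    a non-foldable add goes to an ALU and resets the pending offset to
    zero is an assumption.

```python
IMM_MIN, IMM_MAX = -1024, 1023    # range of a single foldable constant
SUM_MIN, SUM_MAX = -4096, 4095    # range of the accumulated offset


def count_executed_adds(immediates):
    """Count how many adds in a dependent chain need an execution unit.

    immediates: constants of successive `add r, r, imm` instructions on
    the same register. An add is folded at rename if its constant and the
    running offset both stay in range. Otherwise it is assumed to execute
    normally, which materialises the value and clears the offset.
    """
    pending = 0      # offset currently tracked alongside the physical register
    executed = 0
    for imm in immediates:
        total = pending + imm
        if IMM_MIN <= imm <= IMM_MAX and SUM_MIN <= total <= SUM_MAX:
            pending = total      # folded: no execution unit needed
        else:
            executed += 1        # assumed fallback: run on an ALU
            pending = 0
    return executed


print(count_executed_adds([1] * 100))
print(count_executed_adds([1023] * 5))
print(count_executed_adds([2000]))
```

    Under these assumptions a loop counter incremented by 1 never leaves
    the renamer, while a chain of large constants falls back to the ALU
    once the running sum would leave the allowed range.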

    This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    Two cycles of latency for arithmetic instructions like integer adds?
    Ouch!

    So, what are the tradeoffs? Will a zero-cycle register move make
    the pipeline deeper?

    Pipeline depths have not been published for Intel and AMD CPUs in
    recent years. ARM publishes its pipeline lengths. One could compare
    the last ARM of a line without this feature to the first with this
    feature, and get an indication whether it made the pipeline deeper.

    The main tradeoff seems to be in putting the effort in to implement
    this optimization. Even Gracemont (the current Intel E-Core) can
    perform 5 dependent moves (but not constant adds) in one cycle, so it
    probably does not cost much area or energy compared to its benefits.

    My guess is that Power10 is designed more for throughput computing
    where lots of instruction-level parallelism is available so you can
    live with long latencies (fill it with independent instructions),
    while Intel, AMD, ARM and Apple design also for code where latency
    plays a bigger role. As expressed in the LaTeX benchmark (lower is
    better) <https://www.complang.tuwien.ac.at/franz/latex-bench>:

    Machine                                  OS and TeX distribution                Time
    Power 10 (3900 MHz)                      AlmaLinux 9.2 TeX Live 2020            0.468
    Core i3-1315U, Gracemont 2600MHz         Ub.22.04 texlive-latex-base            0.388
    Apple M1 Firestorm 3000MHz               Asahi Linux Debian pre12               0.27
    Core i3-1315U, Golden Cove 3800MHz       Ub.22.04 texlive-latex-base            0.221
    Ryzen 7 5800X, 4800MHz                   Debian 11 (64-bit) texlive-latex-base  0.191
    Xeon W-1370P (=Core i7-11700K), 5200MHz  Debian 11 (64-bit)                     0.175

    I.e., a current Intel E-Core running (for unknown reasons) 700MHz
    below its nominal speed is faster on this benchmark than Power10.
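
    To separate clock speed from per-cycle work, the figures above can be
    multiplied out. This sketch assumes the listed times are run times in
    seconds and that each core actually ran at the listed clock for the
    whole benchmark; neither is confirmed here, so the results are only a
    rough indication.

```python
# (machine, clock in MHz, benchmark time) copied from the table above.
results = [
    ("Power10", 3900, 0.468),
    ("Gracemont (Core i3-1315U)", 2600, 0.388),
    ("Apple M1 Firestorm", 3000, 0.27),
    ("Golden Cove (Core i3-1315U)", 3800, 0.221),
    ("Ryzen 7 5800X", 4800, 0.191),
    ("Xeon W-1370P", 5200, 0.175),
]


def megacycles(mhz, seconds):
    """Clock cycles spent, in millions, assuming a constant clock."""
    return mhz * seconds


for name, mhz, seconds in results:
    print(f"{name:30s} {megacycles(mhz, seconds):8.1f} Mcycles")
```

    Under those assumptions Power10 needs roughly 1825 million cycles for
    the benchmark against roughly 1009 million for Gracemont, so the gap
    is not explained by clock speed.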

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Tue Jan 2 15:58:26 2024
    On Tue, 02 Jan 2024 10:51:40 GMT
    [email protected] (Anton Ertl) wrote:

    Thomas Koenig <[email protected]> writes:
    AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
    to their execution units; they are done directly via register
    renaming, up to a certain limit.

    The limit in recent CPUs seems to be the width of the register renamer
    (6 on Golden Cove and Zen3). For Golden Cove, that optimization
    includes constant adds in the range -1024..1023 with the intermediate
    sum not exceeding -4096..4095.

    This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    Two cycles of latency for arithmetic instructions like integer adds?
    Ouch!


    The same as all recent Apple 'performance' cores. Which didn't prevent
    them from being pretty damn good 'latency' engines.

  • From Michael S@21:1/5 to Thomas Koenig on Tue Jan 2 17:57:04 2024
    On Tue, 2 Jan 2024 15:15:53 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> wrote:
    On Tue, 02 Jan 2024 10:51:40 GMT
    [email protected] (Anton Ertl) wrote:

    Thomas Koenig <[email protected]> writes:
    AFAIK, modern Intel, AMD and ARM CPUs do not forward register
    moves to their execution units; they are done directly via
    register renaming, up to a certain limit.

    The limit in recent CPUs seems to be the width of the register
    renamer (6 on Golden Cove and Zen3). For Golden Cove, that
    optimization includes constant adds in the range -1024..1023 with
    the intermediate sum not exceeding -4096..4095.

    This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    Two cycles of latency for arithmetic instructions like integer
    adds? Ouch!


    The same as all recent Apple 'performance' cores. Which didn't
    prevent them from being pretty damn good 'latency' engines.

    I speak only little ARM, but if I read https://dougallj.github.io/applecpu/firestorm-int.html correctly,
    then add is only two cycles if one of the operands needs to be
    extended (at least for the M1 chip). Was this changed in later
    versions?

    You are right. Somehow I misremembered.

  • From Thomas Koenig@21:1/5 to Michael S on Tue Jan 2 15:15:53 2024
    Michael S <[email protected]> wrote:
    On Tue, 02 Jan 2024 10:51:40 GMT
    [email protected] (Anton Ertl) wrote:

    Thomas Koenig <[email protected]> writes:
    AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
    to their execution units; they are done directly via register
    renaming, up to a certain limit.

    The limit in recent CPUs seems to be the width of the register renamer
    (6 on Golden Cove and Zen3). For Golden Cove, that optimization
    includes constant adds in the range -1024..1023 with the intermediate
    sum not exceeding -4096..4095.

    This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    Two cycles of latency for arithmetic instructions like integer adds?
    Ouch!


    The same as all recent Apple 'performance' cores. Which didn't prevent
    them from being pretty damn good 'latency' engines.

    I speak only little ARM, but if I read https://dougallj.github.io/applecpu/firestorm-int.html correctly,
    then add is only two cycles if one of the operands needs to be
    extended (at least for the M1 chip). Was this changed in later
    versions?

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 2 19:58:29 2024
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> wrote:
    Thomas Koenig <[email protected]> writes:
    AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
    to their execution units; they are done directly via register renaming,
    up to a certain limit.

    The limit in recent CPUs seems to be the width of the register renamer
    (6 on Golden Cove and Zen3). For Golden Cove, that optimization
    includes constant adds in the range -1024..1023 with the intermediate
    sum not exceeding -4096..4095.

    This will, of course, decrease latencies,
    especially on an OoO machine.

    POWER is an exception (surprising to me); a dependency in an
    MR instruction will introduce two cycles of latency, the usual
    latency for an arithmetic instruction (also on Power10, I measured
    that today).

    Two cycles of latency for arithmetic instructions like integer adds?
    Ouch!

    Yes, ouch. I don't know what they spend that extra cycle on.
    Probably, their die just got too big, their timing too aggressive,
    or rather a combination of both.

    By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
    actually do register copying through the ALU, like architectures
    of old.

    So, what are the tradeoffs? Will a zero-cycle register move make
    the pipeline deeper?

    Pipeline depths have not been published for Intel and AMD CPUs in
    recent years. ARM publishes its pipeline lengths. One could compare
    the last ARM of a line without this feature to the first with this
    feature, and get an indication whether it made the pipeline deeper.

    Does anybody (Scott?) have an indication of which chips this
    might be?

    I can't speak to anything non-public. The Wikipedia page for
    Neoverse shows a pipeline depth of 10 cycles for the N2 family.

    https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

  • From MitchAlsup@21:1/5 to Scott Lurndal on Thu Jan 4 00:15:13 2024
    Scott Lurndal wrote:

    I can't speak to anything non-public. The Wikipedia page for
    Neoverse shows a pipeline depth of 10 cycles for the N2 family.

    https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

    Taken together with the 4-cycle LD-use latency, which indicates a
    high-frequency wide-issue design, the 10-cycle pipeline depth indicates
    little time for instruction fusing or register write elision.

  • From Thomas Koenig@21:1/5 to MitchAlsup on Thu Jan 4 09:22:46 2024
    MitchAlsup <[email protected]> wrote:
    Scott Lurndal wrote:

    I can't speak to anything non-public. The Wikipedia page for
    Neoverse shows a pipeline depth of 10 cycles for the N2 family.

    https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

    Taken together with the 4-cycle LD-use latency, which indicates a
    high-frequency wide-issue design, the 10-cycle pipeline depth indicates
    little time for instruction fusing or register write elision.

    The ARM Neoverse N2 Software Optimization Guide gives a one-cycle
    execution latency for register to register moves (with four in
    parallel). Constant loads take zero cycles; and simple register
    moves are also listed under "Zero Latency MOVs" with the somewhat
    less than illuminating caveat

    "The last 3 instructions may not be executed with zero latency
    under certain conditions".

    https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/aarch64/tuning_models/neoversen2.h;hb=HEAD

    gives that cost as 1 (so presumably these conditions happen).

    They also fuse some instructions for aarch64:

    CMP/CMN (immediate) + B.cond
    CMP/CMN (register) + B.cond
    CMP (immediate) + CSEL
    CMP (register) + CSEL
    CMP (immediate) + CSET
    CMP (register) + CSET
    TST (immediate) + B.cond
    TST (register) + B.cond
    BICS (register) + B.cond
    NOP + Any instruction

    plus, for both 64-bit and 32-bit:

    AESE + AESMC (see Section 4.6 on AES Encryption/Decryption)
    AESD + AESIMC (see Section 4.6 on AES Encryption/Decryption)
    CMP/CMN (immediate) + B.cond
    CMP/CMN (register) + B.cond
    TST (immediate) + B.cond
    TST (register) + B.cond
    BICS (register) + B.cond

    where conditions apply which they actually spell out.
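
    As a rough illustration of how the lists above could be used, the
    following sketch scans a sequence of mnemonics for adjacent pairs that
    appear in them. It deliberately ignores the extra conditions from the
    optimization guide and the immediate/register distinction, so it only
    finds candidates; the table layout and function names are hypothetical.

```python
# First mnemonic -> set of second mnemonics it may fuse with, transcribed
# from the lists above. "b.cond" stands for any conditional branch (b.eq, ...).
FUSIBLE = {
    "cmp": {"b.cond", "csel", "cset"},
    "cmn": {"b.cond"},
    "tst": {"b.cond"},
    "bics": {"b.cond"},
    "aese": {"aesmc"},
    "aesd": {"aesimc"},
}


def normalise(mnemonic):
    """Lower-case a mnemonic and map b.eq, b.ne, ... onto 'b.cond'."""
    m = mnemonic.lower()
    return "b.cond" if m.startswith("b.") else m


def find_fusion_candidates(mnemonics):
    """Return (index, first, second) for each adjacent candidate pair.

    An instruction is used in at most one pair. "NOP + any instruction"
    is treated as always fusible, as in the list above.
    """
    pairs = []
    i = 0
    while i + 1 < len(mnemonics):
        first = normalise(mnemonics[i])
        second = normalise(mnemonics[i + 1])
        if first == "nop" or second in FUSIBLE.get(first, ()):
            pairs.append((i, first, second))
            i += 2      # both instructions are consumed by this pair
        else:
            i += 1
    return pairs


stream = ["add", "cmp", "b.ne", "nop", "mov", "tst", "csel"]
for index, first, second in find_fusion_candidates(stream):
    print(f"instruction {index}: {first} + {second}")
```

    A real decoder would additionally have to check the conditions that
    the guide spells out before fusing a pair.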
