• SHA3 - 32 registers required for speed?

    From Terje Mathisen@21:1/5 to All on Thu Aug 21 12:17:13 2025
    I looked at the (newish) SHA3 backup algorithm for the current SHA2, it
    is based on a 5x5x64 bit internal core, with mixing operations that have
    been based on having each of those 5x5=25 stripes in a separate 64-bit (unsigned) variable, and then do a lot of logic ops across them, and
    using rotations for vertical motion.

    With a core containing 25 named variables, you need an easy way to
    access any of them, so even using SIMD might be a stretch?

    I see a github SHA3 project that simply defines an array of uint64_t,
    and all the ops index into this, but unless the compiler can extract
    them all into registers, this is going to generate a lot of extra $L1
    traffic.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Thu Aug 21 13:23:11 2025
    Terje Mathisen <[email protected]> writes:
    I looked at the (newish) SHA3 backup algorithm for the current SHA2, it
    is based on a 5x5x64 bit internal core, with mixing operations that have
    been based on having each of those 5x5=25 stripes in a separate 64-bit >(unsigned) variable, and then do a lot of logic ops across them, and
    using rotations for vertical motion.

    With a core containing 25 named variables, you need an easy way to
    access any of them, so even using SIMD might be a stretch?

    Maybe the SIMD shuffle operations can help.

    Even if not, AMD64 has 15 GPRs (+RSP), and 16 XMM registers, with
    instructions for moving between GPRs and XMM, so that should fit.

    x86-64-v4 has 32 XMM registers.

    What other instruction set with 64-bit GPRs and less than 32 GPRs do
    you have in mind?

    If you want to limit yourself to the GPRs of AMD64, keeping a part of
    the variables in memory is not that bad for latency on recent
    high-performance microarchitectures, thanks to 0-cycle store-to-load forwarding; it still costs slots in the load/store units and in the
    renamer (I think, not sure how AMD's macro-ops consume renamer
    resources), so avoiding loads and stores through register allocation
    is still a good idea.

    I see a github SHA3 project that simply defines an array of uint64_t,
    and all the ops index into this, but unless the compiler can extract
    them all into registers, this is going to generate a lot of extra $L1 >traffic.

    I would not trust the compiler to be good at converting these accesses
    to registers. If you are unlucky, it auto-vectorizes the memory
    accesses and runs into a slow path of the CPU while doing so <[email protected]>.

    Daniel J. Bernstein has analysed the performane of software
    implementations of SHA3 candidates (the one that eventually won is a
    variant of Keccak) in <https://cr.yp.to/hash/sha3opt-20120104.pdf> and
    measured around 12 cycles/byte for keccak512 on AMD64 CPUs of
    2007-2010 vintage, with a theoretical minimum of 7.824 cycles/byte
    (apparently based on resource limits); he suspects that the extra 4
    cycles per byte are due to loads and stores, so apparently he did not
    keep all values in registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Aug 21 15:26:56 2025
    Terje Mathisen <[email protected]> writes:
    I looked at the (newish) SHA3 backup algorithm for the current SHA2, it
    is based on a 5x5x64 bit internal core, with mixing operations that have
    been based on having each of those 5x5=25 stripes in a separate 64-bit >(unsigned) variable, and then do a lot of logic ops across them, and
    using rotations for vertical motion.

    With a core containing 25 named variables, you need an easy way to
    access any of them, so even using SIMD might be a stretch?

    I see a github SHA3 project that simply defines an array of uint64_t,
    and all the ops index into this, but unless the compiler can extract
    them all into registers, this is going to generate a lot of extra $L1 >traffic.

    The ARMv8 architecture includes optional instructions to accelerate
    SHA3 generation: BCAX, EOR3, RAX1 and XAR.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)