Forum: >>> Magnum BBS <<<

SHA3 - 32 registers required for speed?

From Terje Mathisen@21:1/5 to All on Thu Aug 21 12:17:13 2025

I looked at the (newish) SHA3 backup algorithm for the current SHA2, it
is based on a 5x5x64 bit internal core, with mixing operations that have
been based on having each of those 5x5=25 stripes in a separate 64-bit (unsigned) variable, and then do a lot of logic ops across them, and
using rotations for vertical motion.

With a core containing 25 named variables, you need an easy way to
access any of them, so even using SIMD might be a stretch?

I see a github SHA3 project that simply defines an array of uint64_t,
and all the ops index into this, but unless the compiler can extract
them all into registers, this is going to generate a lot of extra $L1
traffic.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Terje Mathisen on Thu Aug 21 13:23:11 2025

Terje Mathisen <[email protected]> writes:

I looked at the (newish) SHA3 backup algorithm for the current SHA2, it
is based on a 5x5x64 bit internal core, with mixing operations that have
been based on having each of those 5x5=25 stripes in a separate 64-bit >(unsigned) variable, and then do a lot of logic ops across them, and
using rotations for vertical motion.

With a core containing 25 named variables, you need an easy way to
access any of them, so even using SIMD might be a stretch?

Maybe the SIMD shuffle operations can help.

Even if not, AMD64 has 15 GPRs (+RSP), and 16 XMM registers, with
instructions for moving between GPRs and XMM, so that should fit.

x86-64-v4 has 32 XMM registers.

What other instruction set with 64-bit GPRs and less than 32 GPRs do
you have in mind?

If you want to limit yourself to the GPRs of AMD64, keeping a part of
the variables in memory is not that bad for latency on recent
high-performance microarchitectures, thanks to 0-cycle store-to-load forwarding; it still costs slots in the load/store units and in the
renamer (I think, not sure how AMD's macro-ops consume renamer
resources), so avoiding loads and stores through register allocation
is still a good idea.

I see a github SHA3 project that simply defines an array of uint64_t,
and all the ops index into this, but unless the compiler can extract
them all into registers, this is going to generate a lot of extra $L1 >traffic.

I would not trust the compiler to be good at converting these accesses
to registers. If you are unlucky, it auto-vectorizes the memory
accesses and runs into a slow path of the CPU while doing so <[email protected]>.

Daniel J. Bernstein has analysed the performane of software
implementations of SHA3 candidates (the one that eventually won is a
variant of Keccak) in <https://cr.yp.to/hash/sha3opt-20120104.pdf> and
measured around 12 cycles/byte for keccak512 on AMD64 CPUs of
2007-2010 vintage, with a theoretical minimum of 7.824 cycles/byte
(apparently based on resource limits); he suspects that the extra 4
cycles per byte are due to loads and stores, so apparently he did not
keep all values in registers.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Aug 21 15:26:56 2025

Terje Mathisen <[email protected]> writes:

I looked at the (newish) SHA3 backup algorithm for the current SHA2, it
is based on a 5x5x64 bit internal core, with mixing operations that have
been based on having each of those 5x5=25 stripes in a separate 64-bit >(unsigned) variable, and then do a lot of logic ops across them, and
using rotations for vertical motion.

With a core containing 25 named variables, you need an easy way to
access any of them, so even using SIMD might be a stretch?

I see a github SHA3 project that simply defines an array of uint64_t,
and all the ops index into this, but unless the compiler can extract
them all into registers, this is going to generate a lot of extra $L1 >traffic.

The ARMv8 architecture includes optional instructions to accelerate
SHA3 generation: BCAX, EOR3, RAX1 and XAR.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Guest
  Wed Jul 29 14:26:54 2026
  from Balkans via Telnet
- Rixter
  Wed Jul 29 14:18:17 2026
  from Madison, Nc via Telnet
- Rixter
  Wed Jul 29 02:00:40 2026
  from Madison, Nc via Telnet
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (3 / 13)
Uptime:	67:00:10
Calls:	12,448
Calls today:	3
Files:	15,194
Messages:	6,537,574

SHA3 - 32 registers required for speed?

Who's Online

Recent Visitors

System Info