Terje Mathisen <
[email protected]> writes:
I looked at the (newish) SHA3 backup algorithm for the current SHA2, it
is based on a 5x5x64 bit internal core, with mixing operations that have
been based on having each of those 5x5=25 stripes in a separate 64-bit >(unsigned) variable, and then do a lot of logic ops across them, and
using rotations for vertical motion.
With a core containing 25 named variables, you need an easy way to
access any of them, so even using SIMD might be a stretch?
Maybe the SIMD shuffle operations can help.
Even if not, AMD64 has 15 GPRs (+RSP), and 16 XMM registers, with
instructions for moving between GPRs and XMM, so that should fit.
x86-64-v4 has 32 XMM registers.
What other instruction set with 64-bit GPRs and less than 32 GPRs do
you have in mind?
If you want to limit yourself to the GPRs of AMD64, keeping a part of
the variables in memory is not that bad for latency on recent
high-performance microarchitectures, thanks to 0-cycle store-to-load forwarding; it still costs slots in the load/store units and in the
renamer (I think, not sure how AMD's macro-ops consume renamer
resources), so avoiding loads and stores through register allocation
is still a good idea.
I see a github SHA3 project that simply defines an array of uint64_t,
and all the ops index into this, but unless the compiler can extract
them all into registers, this is going to generate a lot of extra $L1 >traffic.
I would not trust the compiler to be good at converting these accesses
to registers. If you are unlucky, it auto-vectorizes the memory
accesses and runs into a slow path of the CPU while doing so <
[email protected]>.
Daniel J. Bernstein has analysed the performane of software
implementations of SHA3 candidates (the one that eventually won is a
variant of Keccak) in <
https://cr.yp.to/hash/sha3opt-20120104.pdf> and
measured around 12 cycles/byte for keccak512 on AMD64 CPUs of
2007-2010 vintage, with a theoretical minimum of 7.824 cycles/byte
(apparently based on resource limits); he suspects that the extra 4
cycles per byte are due to loads and stores, so apparently he did not
keep all values in registers.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <
[email protected]>
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)