[email protected] (MitchAlsup1) writes:
> On Sat, 1 Feb 2025 22:42:39 +0000, BGB wrote:
>> Whereas, if performance is dominated by a piece of code that looks like,
>> say:
>> v0=dytf_int2fixnum(123);
>> v1=dytf_int2fixnum(456);
>> v2=dytf_mul(v0, v1);
>> v3=dytf_int2fixnum(789);
>> v4=dytf_add(v2, v3);
>> v5=dytf_wrapsymbol("x");
>> dytf_storeindex(obj, v5, v4);
>> ...
>> With, say, N levels of call-graph in each called function, but with this
>> sort of code still managing to dominate the total CPU ("Self%" time).
>> This seems to be a situation where callee-save registers are a big win
>> for performance IME.
>
> With callee save registers, the prologue and epilogue of subroutines
> sees all the save/restore memory traffic

No. Only if a callee-saved register is used, does it need to be
saved.

> sometimes saving a register
> that is not "in use" and restoring it later.

That's possible. That's the cost of using calling conventions instead
of whole-program register allocation.

> With caller save registers, the caller saves exactly the registers
> it needs preserved, while the callee saves/restores none. Moreover
> it only saves registers currently "in use" and may defer restoring
> since it does not need that value in that register for a while.
> So, the instruction path length has a better story in caller saves
> than callee saves. Nothing that was "Not live" is ever saved or
> restored.
> The arguments for callee save have to do with I cache footprint.

No. Callee-saved registers are useful for local variables that live
across more than one call; if a variable lives across exactly one
call, it does not matter; if a variable lives across no call,
caller-saved registers are better. Here's a simple example:
int f2(int);
void f3(int);
int f1(int a, int b)
{
  int c=a+b;
  int d=f2(a);
  c=c+d;
  f3(d);
  return c;
}
Here a, b, and d live across no calls and are therefore best placed in
caller-saved registers, whereas c lives across two calls and is best
placed in a callee-saved register. With that placement, the code
produced by gcc on RISC-V is:
  addi    sp,sp,-16
  sd      ra,8(sp)        #return address
  sd      s0,0(sp)        #make room in a callee-saved reg for c
  addw    s0,a0,a1        #c is in s0, a in a0, b in a1
  call    f2
  addw    s0,s0,a0        #d is in a0 here, c still in s0
  call    f3
  mv      a0,s0           #mv c into result register
  ld      ra,8(sp)        #restore return address
  ld      s0,0(sp)        #restore s0
  addi    sp,sp,16
  jr      ra
Let's compare this with having c in a caller-saved register:
c in callee-saved s0          c in caller-saved t0
  addi    sp,sp,-16             addi    sp,sp,-16
  sd      ra,8(sp)              sd      ra,8(sp)
  sd      s0,0(sp)
  addw    s0,a0,a1              addw    t0,a0,a1
                                sd      t0,0(sp)
  call    f2                    call    f2
                                ld      t0,0(sp)
  addw    s0,s0,a0              addw    t0,t0,a0
                                sd      t0,0(sp)
  call    f3                    call    f3
                                ld      a0,0(sp)
  mv      a0,s0
  ld      ra,8(sp)              ld      ra,8(sp)
  ld      s0,0(sp)
  addi    sp,sp,16              addi    sp,sp,16
  jr      ra                    jr      ra
So if you keep c in a caller-saved register, you save one store and
one load of a callee-saved register, but you have to add two stores
and two loads of the caller-saved register; you also avoid moving the
result from the callee-saved register to the result register. But the
bottom line is that it's cheaper to put c in a callee-saved register:
three extra instructions (two of them memory accesses) versus four
extra memory accesses.
The return address demonstrates that even with a caller-saved register
you don't need to save and restore around every call; you only need to
load it after a call if the variable is read between the call and the
next call, and only need to store it before a call if the variable is
written between the previous call and the call under consideration.
It becomes more complex with control flow, but nothing that's still a
research topic.
Back to BGB's example:
v0=dytf_int2fixnum(123);
v1=dytf_int2fixnum(456);
v2=dytf_mul(v0, v1);
v3=dytf_int2fixnum(789);
v4=dytf_add(v2, v3);
v5=dytf_wrapsymbol("x");
dytf_storeindex(obj, v5, v4);
If we assume that none of the variables are read after this fragment:
caller-saved: v1, v3, v5; can all live in the same register
callee-saved: v0, v2, v4; can all live in the same register
But are v0, v2, v4 all only used across one call each? Yes, but they
can share the same register, so with a caller-saved register that
register lives across three calls and has to be stored and loaded
three times, whereas with a callee-saved register it has to be saved
and restored only once.
Interestingly, I am not aware of a paper that gives a satisfying
treatment of register allocation in the light of these aspects.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>