• Caller-saved vs. callee-saved registers (was: Misc: Ongoing status...)

    From Anton Ertl@21:1/5 to [email protected] on Sun Feb 2 14:52:27 2025
    [email protected] (MitchAlsup1) writes:
    >On Sat, 1 Feb 2025 22:42:39 +0000, BGB wrote:
    >> Whereas, if performance is dominated by a piece of code that looks like,
    >> say:
    >>   v0=dytf_int2fixnum(123);
    >>   v1=dytf_int2fixnum(456);
    >>   v2=dytf_mul(v0, v1);
    >>   v3=dytf_int2fixnum(789);
    >>   v4=dytf_add(v2, v3);
    >>   v5=dytf_wrapsymbol("x");
    >>   dytf_storeindex(obj, v5, v4);
    >>   ...
    >> With, say, N levels of call-graph in each called function, but with this
    >> sort of code still managing to dominate the total CPU ("Self%" time).
    >>
    >> This seems to be a situation where callee-save registers are a big win
    >> for performance IME.

    >With callee save registers, the prologue and epilogue of subroutines
    >sees all the save/restore memory traffic

    No. A callee-saved register needs to be saved only if the callee
    actually uses it.

    >sometimes saving a register
    >that is not "in use" and restoring it later.

    That's possible. That's the cost of using calling conventions instead
    of whole-program register allocation.

    >With caller save registers, the caller saves exactly the registers
    >it needs preserved, while the callee saves/restores none. Moreover
    >it only saves registers currently "in use" and may defer restoring
    >since it does not need that value in that register for a while.
    >
    >So, the instruction path length has a better story in caller saves
    >than callee saves. Nothing that was "Not live" is ever saved or
    >restored.
    >
    >The arguments for callee save have to do with I cache footprint.

    No. Callee-saved registers are useful for local variables that live
    across more than one call; if a variable lives across exactly one
    call, it does not matter; if a variable lives across no call,
    caller-saved registers are better. Here's a simple example:

    int f2(int);
    void f3(int);
    int f1(int a, int b)
    {
      int c=a+b;
      int d=f2(a);
      c=c+d;
      f3(d);
      return c;
    }

    Here a, b, and d live across no calls and are therefore best placed in
    caller-saved registers, whereas c lives across two calls and is best
    placed in a callee-saved register. With that placement the code
    produced by gcc on RISC-V is:

    addi sp,sp,-16
    sd ra,8(sp)      #return address
    sd s0,0(sp)      #make room in a callee-saved reg for c
    addw s0,a0,a1    #c is in s0, a in a0, b in a1
    call f2
    addw s0,s0,a0    #d is in a0 here, c still in s0
    call f3
    mv a0,s0         #mv c into result register
    ld ra,8(sp)      #restore return address
    ld s0,0(sp)      #restore s0
    addi sp,sp,16
    jr ra


    Let's compare this with having c in a caller-saved register:

    c in callee-saved s0     c in caller-saved t0
    addi sp,sp,-16           addi sp,sp,-16
    sd ra,8(sp)              sd ra,8(sp)
    sd s0,0(sp)
    addw s0,a0,a1            addw t0,a0,a1
                             sd t0,0(sp)
    call f2                  call f2
                             ld t0,0(sp)
    addw s0,s0,a0            addw t0,t0,a0
                             sd t0,0(sp)
    call f3                  call f3
                             ld a0,0(sp)
    mv a0,s0
    ld ra,8(sp)              ld ra,8(sp)
    ld s0,0(sp)
    addi sp,sp,16            addi sp,sp,16
    jr ra                    jr ra

    So if you keep c in a caller-saved register, you save one store and
    one load of a callee-saved register, but you have to put in two stores
    and two loads of caller-saved registers; you also save the need for
    moving the result from the callee-saved register to the result
    register. But the bottom line is that it's cheaper to put c in a
    callee-saved register.
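
    The rule of thumb behind this count can be written down as a minimal
    sketch in C. The cost model is an assumption made for illustration: it
    counts only loads and stores, uses the naive scheme of one store and
    one load around every call for a caller-saved register, and ignores
    the mv. The function names are made up.

```c
#include <assert.h>

/* Memory operations (loads + stores) needed to keep one variable alive
   across n_calls calls, under an assumed, simplified cost model. */

/* Callee-saved register: one save in the prologue and one restore in
   the epilogue, independent of the number of calls. */
int callee_saved_mem_ops(int n_calls)
{
    assert(n_calls >= 0);
    return 2;
}

/* Caller-saved register, naive scheme: one store before and one load
   after every call the variable lives across. */
int caller_saved_mem_ops(int n_calls)
{
    assert(n_calls >= 0);
    return 2 * n_calls;
}

/* Returns <0 if the caller-saved register is cheaper, 0 on a tie,
   >0 if the callee-saved register is cheaper. */
int prefer_callee_saved(int n_calls)
{
    return caller_saved_mem_ops(n_calls) - callee_saved_mem_ops(n_calls);
}
```

    For c in f1 (n_calls = 2) this gives 4 memory operations for t0
    against 2 for s0, matching the listing; for zero calls the
    caller-saved register wins, and for one call the two tie.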

    The return address demonstrates that even with a caller-saved register
    you don't need to save and restore around every call; you only need to
    load it after a call if the variable is read between the call and the
    next call, and only need to store it before a call if the variable is
    written between the previous call and the call under consideration.
    It becomes more complex with control flow, but none of that is still
    a research topic.
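
    For straight-line code the load/store rule just described can be
    sketched in C as follows. The region encoding and all names are
    hypothetical, and the sketch assumes the variable is live across
    every call considered; control flow is not handled.

```c
#include <assert.h>
#include <stdbool.h>

/* Straight-line code with n_calls calls is split into n_calls + 1
   regions: region 0 precedes the first call, region k follows call k
   (calls numbered from 1).
   writes[k]: the variable is written in region k.
   reads[k]:  the value that survived call k is read in region k
              (reads[0] is not used). */
struct spill_count { int stores; int loads; };

struct spill_count caller_saved_spills(const bool *reads,
                                       const bool *writes, int n_calls)
{
    struct spill_count sc = {0, 0};
    assert(n_calls >= 0);
    for (int k = 1; k <= n_calls; k++) {
        /* Store before call k only if the variable was written since
           the previous call; otherwise memory is already up to date. */
        if (writes[k - 1])
            sc.stores++;
        /* Load after call k only if the variable is read before the
           next call. */
        if (reads[k])
            sc.loads++;
    }
    return sc;
}
```

    Applied to f1 with two calls: for c, writes = {true, true, false} and
    reads = {false, true, true} give two stores and two loads; for the
    return address (treating its arrival in ra at function entry as a
    write in region 0), writes = {true, false, false} and
    reads = {false, false, true} give one store and one load, as in the
    gcc output.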

    Back to BGB's example:

    v0=dytf_int2fixnum(123);
    v1=dytf_int2fixnum(456);
    v2=dytf_mul(v0, v1);
    v3=dytf_int2fixnum(789);
    v4=dytf_add(v2, v3);
    v5=dytf_wrapsymbol("x");
    dytf_storeindex(obj, v5, v4);

    If we assume that none of the variables are read after this fragment:

    caller-saved: v1, v3, v5; can all live in the same register
    callee-saved: v0, v2, v4; can all live in the same register

    But are v0, v2, v4 all only used across one call each? Yes, but they
    can share the same register, so with a caller-saved register that
    register lives across three calls and has to be stored and loaded
    three times, whereas with a callee-saved register it has to be
    saved and restored only once.
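
    The counting for this fragment can be checked with a small sketch in
    C. The encoding (each value is produced by one call and last consumed
    by a later call, calls numbered in program order) and all names are
    assumptions for illustration; obj is not modelled.

```c
#include <assert.h>

/* One value in the fragment: produced as the result of call def_call
   and last consumed as an argument of call last_use_call. */
struct value { const char *name; int def_call; int last_use_call; };

/* Number of calls the value has to survive: those strictly between the
   call that produces it and the call that consumes it. */
int calls_crossed(struct value v)
{
    assert(v.def_call < v.last_use_call);
    return v.last_use_call - v.def_call - 1;
}

/* The seven calls of the fragment:
   1 int2fixnum(123) -> v0    2 int2fixnum(456) -> v1
   3 mul(v0, v1)     -> v2    4 int2fixnum(789) -> v3
   5 add(v2, v3)     -> v4    6 wrapsymbol("x") -> v5
   7 storeindex(obj, v5, v4) */
static const struct value fragment[] = {
    {"v0", 1, 3}, {"v1", 2, 3}, {"v2", 3, 5},
    {"v3", 4, 5}, {"v4", 5, 7}, {"v5", 6, 7},
};

/* Total number of call crossings over all values in vals[0..n-1]. */
int total_calls_crossed(const struct value *vals, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += calls_crossed(vals[i]);
    return total;
}
```

    Here v1, v3, v5 cross no call, while v0, v2, v4 cross one call each,
    three crossings in total. If those three share one caller-saved
    register, that is three store/load pairs, against a single
    save/restore pair for one shared callee-saved register.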

    Interestingly, I am not aware of a paper that gives a satisfying
    treatment of register allocation in the light of these aspects.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
