• Re: Compiling Fibonacci

    From MitchAlsup1@21:1/5 to BGB-Alt on Tue Feb 6 22:23:57 2024
    BGB-Alt wrote:

    On 2/5/2024 10:42 PM, Robert Finch wrote:
    Got around to Compiling the Fibonacci function using the Arpl compiler
    with the same optimization settings for riscv and Qupls:
    Riscv:    30 instructions, 120 bytes
    Qupls:    13 instructions, 65 bytes

    Most of the code savings was due to the use of subroutine linkage
    instructions, ENTER and LEAVE. For a small routine these made up a
    significant portion of the code. Otherwise, the code is almost the same.



    I was at least pleasantly surprised when debugging to note that a
    trivial function, namely:
    int LittleLong(int x)
    { return x; }
    Compiled down to a pretty much optimal:
    LittleLong:
    MOV R4, R2
    RTS

    Compiles to:

    LittleLong:
    RET

    on My 66000. {I placed the first argument and the first return result in
    the same register.}

    Like, the little optimizations here and there have seemingly now
    combined that some trivial functions like this, actually generate near-optimal code sequences...

    Modulo screwed up ABIs.

    However, there was a bug in that a narrowing conversion in the caller
    was being skipped, resulting in the function passing along a value that
    was out-of-range for the type (could have fixed this if the compiler had
    used EXTx.x instead of MOV here, this is possible and would prevent out-of-range values from propagating as easily, but ATM the EXTx.x
    operations may be 2-cycle for some types (if trying to use the value immediately), vs 1-cycle for MOV; would likely require EXTx.x always be 1-cycle to justify this).

    LLVM for My 66000 has the caller and the called smash the value into range
    if any arithmetic has transpired.

    Had modified the logic for the caller-side type-conversion to actually
    do the conversion in this case (for some reason, had previously added a
    case to skip over narrowing conversions for integer types, not sure what
    my thinking was here...).

    Granted, it was specific to signed narrowing conversions, which was
    still technically allowed by the C standard, but is not valid if one
    assumes wrap-on-overflow semantics.

    Say, if you define a value variable as 'int', at no point should the
    program be able to observe values outside of 'int' range.

    Modulo the notion that int should be the fastest arithmetic form. Many
    64-bit RISCs are slower with ints that longs due to the smashes.

    But, then counter-balanced by cases where they had caused bugs, like
    recently finding that my recent optimization of "&&" and "||" had
    introduced a bug, where things like:
    if((ptr!=NULL) && (ptr[0]!=0))
    ...
    Had concluded that both sides were logical expressions, and would
    incorrectly load from memory before dealing with the possibility of the pointer being NULL.

    Shame.

    Had to modify the checks to verify that the expressions would not access memory through pointers, which did reduce the effectiveness of the optimization but also avoids some incorrect memory accesses.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Feb 7 18:05:22 2024
    BGB wrote:

    On 2/6/2024 4:23 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 2/5/2024 10:42 PM, Robert Finch wrote:
    Got around to Compiling the Fibonacci function using the Arpl
    compiler with the same optimization settings for riscv and Qupls:
    Riscv:    30 instructions, 120 bytes
    Qupls:    13 instructions, 65 bytes

    Most of the code savings was due to the use of subroutine linkage
    instructions, ENTER and LEAVE. For a small routine these made up a
    significant portion of the code. Otherwise, the code is almost the same. >>>>


    I was at least pleasantly surprised when debugging to note that a
    trivial function, namely:
       int LittleLong(int x)
         { return x; }
    Compiled down to a pretty much optimal:
       LittleLong:
       MOV R4, R2
       RTS

    Compiles to:

    LittleLong:
        RET

    on My 66000. {I placed the first argument and the first return result in
    the same register.}

    Like, the little optimizations here and there have seemingly now
    combined that some trivial functions like this, actually generate
    near-optimal code sequences...

    Modulo screwed up ABIs.


    The ABI design in my case was indirectly derived from the WinCE SH-4 ABI (which in some ways has a certain level of similarity to the Windows X64 ABI).

    Say:
    Arguments in R4..R7
    Return value in R0/R1;

    Arguments in R1..R9
    Results in R1..R9

    -------------------

    For vararg and missing prototype functions, would need to always assume
    the full 16-argument spill space with the 16 argument ABI.

    Called is responsible for spill space.
    varargs is fully functional without prototypes.

    The spill-space issue does effect stack-frame overhead, but harder to determine if it has much practical effect on performance (effect seems
    fairly small in any case).

    No extraneous stack overhead.

    Granted, yes, theoretically bigger stack frames does mean more
    stack-related L1 misses, but stack L1 hit rate seems to be pretty good
    on average.


    However, there was a bug in that a narrowing conversion in the caller
    was being skipped, resulting in the function passing along a value
    that was out-of-range for the type (could have fixed this if the
    compiler had used EXTx.x instead of MOV here, this is possible and
    would prevent out-of-range values from propagating as easily, but ATM
    the EXTx.x operations may be 2-cycle for some types (if trying to use
    the value immediately), vs 1-cycle for MOV; would likely require
    EXTx.x always be 1-cycle to justify this).

    LLVM for My 66000 has the caller and the called smash the value into range >> if any arithmetic has transpired.


    Yeah, this is probably a good strategy.
    I had partly relied on having a few special-case instructions for
    working with 32-bit values.

    But, for some operations, only 64-bit instructions exist.
    Say, unlike RISC-V, in BJX2 I did not provide alternate FP converter ops
    for each pair of FP and Integer types.

    Say, BJX2:
    FLDCI: Int64->Binary64
    FSTCI: Binary64->Int64
    FLDCF: Binary32->Binary64
    FSTCF: Binary64->Binary32

    RISC-V:
    FCVT.W.S //Binary32->Int32
    FCVT.WU.S //Binary32->UInt32
    FCVT.L.S //Binary32->Int64
    FCVT.LU.S //Binary32->UInt64
    FCVT.W.D //Binary64->Int32
    FCVT.WU.D //Binary64->UInt32
    FCVT.L.D //Binary64->Int64
    FCVT.LU.D //Binary64->UInt64
    FCVT.S.W //Int32->Binary32
    FCVT.S.WU //UInt32->Binary32
    FCVT.S.L //Int64->Binary32
    FCVT.S.LU //UInt64->Binary32
    FCVT.D.W //Int32->Binary64
    FCVT.D.WU //UInt32->Binary64
    FCVT.D.L //Int64->Binary64
    FCVT.D.LU //UInt64->Binary64
    ...

    But, as I see it, these don't gain much other than burning encoding
    space and making the FPU more expensive...

    And, admittedly, my implementation doesn't entirely distinguish these
    cases (which may in turn potentially generate out-of-range values).

    What if the FPU supports Binary16 or similar? Now you have even more,
    because the number of converter ops in RV's scheme is apparently X^2
    relative to the number of types involved, whereas for BJX2, it was
    closer to linear (it adds FLDCH and FSTCH and similar).

    And, say, converting Binary16->Int32 is going to take multiple
    instructions (but, as I see it, this isn't too big of an issue).

    There was a difference between RV and BJX2 regarding single-precision Load/Store, since RV loads it into a register as-is, and BJX2 had an
    optional "FMOV.S" instruction which basically glues on Single<->Double
    format converters (not free, but the cost is justifiable).


    Also the converters are supposed to range-clamp the outputs for
    conversion to integer types, but like, no...
    Say: W: -(1<<31) .. ((1<<31)-1)
    Say: WU: 0 .. ((1<<32)-1)
    Say: L: -(1<<63) .. ((1<<63)-1)
    Say: LU: 0 .. ((1<<64)-1)

    Shift Right performs extractions and works both signed and unsigned
    on any size::

    S12 -(1<<11)..(1<<11)-1
    U12 0..(1<<12)-1

    With CARRY the field may span a 65-bit register boundary.

    {8, 16, 32} are simply subsets of SR.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Feb 8 18:10:13 2024
    BGB wrote:

    On 2/7/2024 12:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    The ABI design in my case was indirectly derived from the WinCE SH-4
    ABI (which in some ways has a certain level of similarity to the
    Windows X64 ABI).

    Say:
       Arguments in R4..R7
       Return value in R0/R1;

    Arguments in R1..R9
    Results   in R1..R9


    It is hit/miss.

    -------------------

    For vararg and missing prototype functions, would need to always
    assume the full 16-argument spill space with the 16 argument ABI.

    Called is responsible for spill space.
    varargs is fully functional without prototypes.


    If you provide spill space for 16 args, and the called function only
    assumes 8, there is no issue.

    But you see, I do NOT provide space for varargs ! If the called function
    is not varargs there is 0 wasted space. If the called function is varagrs
    then it stores the register arguments on the stack area it allocates, and
    it deallocates.

    For fixed argument functions, when the arguments use fewer than 8
    registers, there is no practical difference between the 8 and 16
    register ABI (apart from the difference in the spill space).

    NO spill space.

    For varargs, it is the same either way (and the missing prototype
    doesn't matter).


    Shrinking the spill space for the args<=8 only case could help, but
    mostly has the tradeoff of needing to add code to check for and handle
    this case in my compiler (and uncertainty as to how much it could help,
    or if its practicality would be defeated by the probability of a given function also containing a call to printf or similar, ...).

    No spill space.

    The main exception case if one has a function that returns a struct, and
    one calls it with a missing prototype, the program will very likely have unpredictable behavior and/or crash (since the caller will not have
    provided a valid address for the returned struct).

    Though, the Win64 ABI has the same issue, and it mostly doesn't matter.


    The spill-space issue does effect stack-frame overhead, but harder to
    determine if it has much practical effect on performance (effect seems
    fairly small in any case).

    No extraneous stack overhead.


    It is a tradeoff.

    It can potentially simplify varargs handling, since then one can dump
    all the argument registers to the stack and then generate a pointer to
    the just after the last fixed argument, or allow for dealing with code
    that takes the address of a function argument and expects to be able to
    use the argument list as an array.


    It does effect stack usage, so will potentially exhaust the stack
    slightly faster when using 128K or 256K of stack space.

    No unneeded argument or unit of data is on the stack if it does not need
    to be on the stack. Thus, no wasted space and no spill space, either.
    Also the converters are supposed to range-clamp the outputs for
    conversion to integer types, but like, no...
       Say: W: -(1<<31) .. ((1<<31)-1)
       Say: WU: 0 .. ((1<<32)-1)
       Say: L: -(1<<63) .. ((1<<63)-1)
       Say: LU: 0 .. ((1<<64)-1)

    Shift Right performs extractions and works both signed and unsigned on
    any size::

            S12 -(1<<11)..(1<<11)-1
            U12 0..(1<<12)-1

    With CARRY the field may span a 65-bit register boundary.

    {8, 16, 32} are simply subsets of SR.


    Yeah, but in this case I was mostly dealing with the annoyance of there
    being an unreasonable number of floating-point converter special cases
    in RISC-V, since they decided to going from every possible format to
    every other possible format, rather than taking a more "hub and spokes" approach.


    IOW:
    16 different converter ops *just* for scalar Int<->Float conversion.

    Whereas, I had done this with 2 converter ops...

    How did you get all the needed rounding modes into 2 conversions ??
    {{even int->double needs rounding modes when int > 2^52 !!}}

    I could almost get away with hand-waving the issue and not dealing with
    these cases in the RV64 support, except that RV64's wonk of
    "sign-extended unsigned values" kinda throws a wrench into the mix.

    Though, this mostly effects Int->Float, for Float->Int, can still be
    like "meh whatever", I am not inclined to spend the LUT cost of
    bothering with range-clamping the conversions.

    Congratulations, you are not IEEE 754 compliant !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Feb 8 22:35:17 2024
    BGB wrote:

    On 2/8/2024 12:10 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 2/7/2024 12:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    The ABI design in my case was indirectly derived from the WinCE SH-4 >>>>> ABI (which in some ways has a certain level of similarity to the
    Windows X64 ABI).

    Say:
       Arguments in R4..R7
       Return value in R0/R1;

    Arguments in R1..R9
    Results   in R1..R9


    It is hit/miss.

    -------------------

    For vararg and missing prototype functions, would need to always
    assume the full 16-argument spill space with the 16 argument ABI.

    Called is responsible for spill space.
    varargs is fully functional without prototypes.


    If you provide spill space for 16 args, and the called function only
    assumes 8, there is no issue.

    But you see, I do NOT provide space for varargs ! If the called function
    is not varargs there is 0 wasted space. If the called function is varagrs
    then it stores the register arguments on the stack area it allocates, and
    it deallocates.

    For fixed argument functions, when the arguments use fewer than 8
    registers, there is no practical difference between the 8 and 16
    register ABI (apart from the difference in the spill space).

    NO spill space.

    For varargs, it is the same either way (and the missing prototype
    doesn't matter).


    Shrinking the spill space for the args<=8 only case could help, but
    mostly has the tradeoff of needing to add code to check for and handle
    this case in my compiler (and uncertainty as to how much it could
    help, or if its practicality would be defeated by the probability of a
    given function also containing a call to printf or similar, ...).

    No spill space.


    Whether or not one thinks spill space is bad, doesn't at the moment seem
    like enough of an issue to change the ABI over it.

    Change, yes: agreed. But when facing a clean sheet, it is not necessary.
    One of the whole points in RISC architecture is to leave out that which
    is unnecessary. Spill space is demonstratively unnecessary. Even if it
    is performance neutral.

    Adding a stat to my compiler:
    Average stack-frame size is ~ 425 bytes (though, 416 and 432 are the
    closest valid sizes, this is excluding leaf functions).

    So (for 128 bytes), roughly 30% of this would be the spill-space (and,
    for "no spill space" could potentially be closer to 297, so ~ 43%
    relative overhead...).

    And a few exceptions to the GuestOS to allocate more pages to the stack
    area.



    IOW:
       16 different converter ops *just* for scalar Int<->Float conversion. >>
    Whereas, I had done this with 2 converter ops...

    How did you get all the needed rounding modes into 2 conversions ??
    {{even int->double needs rounding modes when int > 2^52 !!}}


    Typically, Double->Int needs only one type of rounding:
    Truncate towards zero.

    What about Int->Double?
    Round Nearest.
    Though, only relevant for values outside the 2^52 range.
    But, presumably, everyone knows this is not exact.


    If you want Single, well, use two converter ops in a row.

    Makes you subject to double rounding errors--thus not IEEE 754
    compliant.

    But realistically you still have to have efficient conversions specified
    in FORTRAN.

    But, was less of an issue if scalar math was internally always in
    Binary64 format (regardless of the in-memory representation).

    Lessening the scale of the issue does not achieve IEEE 754 compliance.
    You can get away with it in an FPGA that only you use, not so much
    when you want to sell parts against a similar performant chip that is compliant.

    Maybe a case could be made for a round-truncate converter, don't have
    this as of yet.

    Maybe a case could also be made for an unsigned UInt64->Binary64
    converter op, vs, doing it in software:
    if(x>>63)
    {
    f=(double)((int64_t)(x>>1));
    f=f+f;
    }else
    {
    f=(double)((int64_t)x);
    }
    return(f);

    RNDPI Rint,Rfp

    round towards positive infinity.

    Say, with a cast to a signed type being needed so that the runtime call doesn't go into infinite recursion.

    Since, granted, one does need a converter op to deal with this case for RISC-V.

    Along with ops to deal with conversion between integer types and Binary32.


    So, it is possible I could justify adding a few more converter ops to BJX2.



    For the SIMD converters, a different strategy was used.

    Using Double and Int32 because easier to demonstrate in hex:
    Say, we convert a Int32 to Binary64 by first repacking:
    ZXXXXXXX
    To, say:
    400ZXXXXXXX00000
    Where: Z has high bit flipped for signed.
    Value is mapped between 2.0 and 4.0.
    Then add a bias:
    -2.0 for unsigned, -3.0 for signed.
    Then scale, if value isn't unit-range.

    Though, this mechanism was defined in terms of 8/16 bit integer values
    and 16/32 bit floating point, but same basic idea.

    Though, mostly this was for cost reasons.


    I could almost get away with hand-waving the issue and not dealing
    with these cases in the RV64 support, except that RV64's wonk of
    "sign-extended unsigned values" kinda throws a wrench into the mix.

    Though, this mostly effects Int->Float, for Float->Int, can still be
    like "meh whatever", I am not inclined to spend the LUT cost of
    bothering with range-clamping the conversions.

    Congratulations, you are not IEEE 754 compliant !!

    I never claimed it was.

    A fully IEEE compliant FPU would be too expensive...

    And yet, it seams that everyone but FPGA implementers found otherwise

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Feb 9 03:03:01 2024
    BGB wrote:

    On 2/8/2024 4:35 PM, MitchAlsup1 wrote:



    Potentially.

    Did get around to implementing the "check if calling a varargs function
    and only then reserve space to spill 16 registers" logic.

    I need no prototypes to get varargs working.

    So, now, most functions (excluding those that call printf or similar)
    will only need to reserve 64 bytes of spill space (though, 128+ is still needed if a function is called that needs more than 8 arguments, or is undeclared).

    0 in this case is always better than more than 0.




    Main issue here is that for unsigned values larger than 2^63, the normal FLDCI will see them as negative.

    Not mine, I have {signed, unsigned, single and double}^2 conversions
    in all rounding modes.

    Workaround is to shift them right by 1 bit, convert, and then scale by
    2. Here, "f=f+f;" being a fairly cheap way to scale the value by 2.

    yawn.

    But, may consider adding an FLDCIU instruction or similar.

    There are 48 cases to consider in conversion^2

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Sat Feb 10 10:54:26 2024
    All the Fib messages made me wonder:

    How small can a naive (doubly recursive) implementation be?

    fn fib(u:u32)->u32
    {
    if u <= 2 {return u;}
    fib(u-1)+fib(u-2)
    }

    ;; eax = inout
    proc fib
    cmp eax,2 ; 3 bytes
    jbe done ; 2 bytes
    dec eax ; 1
    push eax ; 1
    dec eax ; 1
    call fib ; 3 (or 5?)
    xchg eax,[esp]; 2
    call fib ; 3 / 5
    pop ebx ; 1
    add eax,ebx ; 2
    done:
    ret ; 1
    endp

    Seems like 20 or (more likely) 24 bytes?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Paul A. Clayton on Mon Feb 26 12:39:10 2024
    On 23/02/2024 23:44, Paul A. Clayton wrote:
    On 2/6/24 5:23 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    I was at least pleasantly surprised when debugging to note that a
    trivial function, namely:
       int LittleLong(int x)
         { return x; }
    Compiled down to a pretty much optimal:
       LittleLong:
       MOV R4, R2
       RTS

    Compiles to:

    LittleLong:
         RET

    on My 66000. {I placed the first argument and the first return result
    in the same register.}

    With C (and other languages?) not preserving the arguments (no
    "inout" — "inout" would seem to present the opportunity to have
    dummy arguments that are callee saved though that seems bad
    programming practice as these are not really arguments), I have
    wondered why the arguments and return value were separated in
    ABIs. (I could also imagine cases where the return value is used
    fairly soon as an argument — quite possibly the first argument —
    of a called function.)

    It's common that ABIs have the results in the same registers as the
    first couple of arguments. It's not that way on all ABIs, but it is
    very common.


    Of course, an ABI only needs to be used for external functions.
    (This depends on debugging tools being up to handling non-ABI
    calls.)

    Most compilers follow the ABI for any independently generated function -
    it quickly gets complicated otherwise. But if you've got static or
    inline functions, or are using LTO, compilers can and do mess around a
    bit - such as using constant propagation, inlining, function cloning,
    and re-arranging arguments.


    I have also wondered how difficult it would be to have each
    function call define its interface with less strict binding to the
    argument ordering (and perhaps even content) of the programmer-
    facing argument list. With respect to content, a call interface
    might be improved, e.g., by passing the only two members of a
    structure used in a function rather than a pointer to the
    structure while maintaining the programmer comprehensibility of
    passing the entire structure by reference. "Prefetching" structure
    members into registers might also make sense, particularly if
    callers are very likely to have the values already in registers.

    Implementing interprocedural optimizations in external library
    functions has the obvious problems of needing a means to
    communicate the interface details, making assumptions about common
    and/or critical uses of the interface, and making assumptions
    about future extensions. If a structure pointer was removed as an
    argument (replaced by the used members), the function could not be
    compatibly extended to use additional members. (Multithreading
    would constrain the ability to pass members by value rather than
    the structure by reference.)

    I suspect such "optimizations" would not be worth the required
    effort, but they are obvious possibilities.

    I think you are right here. Such things would be very complicated to implement, and give relatively little performance improvement. You get
    a lot more, for less compiler effort, with link-time optimisation
    (allowing some kinds of inter-procedural optimisations across modules).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)