On 2/5/2024 10:42 PM, Robert Finch wrote:
Got around to Compiling the Fibonacci function using the Arpl compiler
with the same optimization settings for riscv and Qupls:
Riscv: 30 instructions, 120 bytes
Qupls: 13 instructions, 65 bytes
Most of the code savings was due to the use of subroutine linkage
instructions, ENTER and LEAVE. For a small routine these made up a
significant portion of the code. Otherwise, the code is almost the same.
I was at least pleasantly surprised when debugging to note that a
trivial function, namely:
int LittleLong(int x)
{ return x; }
Compiled down to a pretty much optimal:
LittleLong:
MOV R4, R2
RTS
Like, the little optimizations here and there have seemingly now
combined that some trivial functions like this, actually generate near-optimal code sequences...
However, there was a bug in that a narrowing conversion in the caller
was being skipped, resulting in the function passing along a value that
was out-of-range for the type (could have fixed this if the compiler had
used EXTx.x instead of MOV here, this is possible and would prevent out-of-range values from propagating as easily, but ATM the EXTx.x
operations may be 2-cycle for some types (if trying to use the value immediately), vs 1-cycle for MOV; would likely require EXTx.x always be 1-cycle to justify this).
Had modified the logic for the caller-side type-conversion to actually
do the conversion in this case (for some reason, had previously added a
case to skip over narrowing conversions for integer types, not sure what
my thinking was here...).
Granted, it was specific to signed narrowing conversions, which was
still technically allowed by the C standard, but is not valid if one
assumes wrap-on-overflow semantics.
Say, if you define a value variable as 'int', at no point should the
program be able to observe values outside of 'int' range.
But, then counter-balanced by cases where they had caused bugs, like
recently finding that my recent optimization of "&&" and "||" had
introduced a bug, where things like:
if((ptr!=NULL) && (ptr[0]!=0))
...
Had concluded that both sides were logical expressions, and would
incorrectly load from memory before dealing with the possibility of the pointer being NULL.
Had to modify the checks to verify that the expressions would not access memory through pointers, which did reduce the effectiveness of the optimization but also avoids some incorrect memory accesses.
On 2/6/2024 4:23 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
On 2/5/2024 10:42 PM, Robert Finch wrote:
Got around to Compiling the Fibonacci function using the Arpl
compiler with the same optimization settings for riscv and Qupls:
Riscv: 30 instructions, 120 bytes
Qupls: 13 instructions, 65 bytes
Most of the code savings was due to the use of subroutine linkage
instructions, ENTER and LEAVE. For a small routine these made up a
significant portion of the code. Otherwise, the code is almost the same. >>>>
I was at least pleasantly surprised when debugging to note that a
trivial function, namely:
int LittleLong(int x)
{ return x; }
Compiled down to a pretty much optimal:
LittleLong:
MOV R4, R2
RTS
Compiles to:
LittleLong:
RET
on My 66000. {I placed the first argument and the first return result in
the same register.}
Like, the little optimizations here and there have seemingly now
combined that some trivial functions like this, actually generate
near-optimal code sequences...
Modulo screwed up ABIs.
The ABI design in my case was indirectly derived from the WinCE SH-4 ABI (which in some ways has a certain level of similarity to the Windows X64 ABI).
Say:
Arguments in R4..R7
Return value in R0/R1;
-------------------
For vararg and missing prototype functions, would need to always assume
the full 16-argument spill space with the 16 argument ABI.
The spill-space issue does effect stack-frame overhead, but harder to determine if it has much practical effect on performance (effect seems
fairly small in any case).
Granted, yes, theoretically bigger stack frames does mean more
stack-related L1 misses, but stack L1 hit rate seems to be pretty good
on average.
However, there was a bug in that a narrowing conversion in the caller
was being skipped, resulting in the function passing along a value
that was out-of-range for the type (could have fixed this if the
compiler had used EXTx.x instead of MOV here, this is possible and
would prevent out-of-range values from propagating as easily, but ATM
the EXTx.x operations may be 2-cycle for some types (if trying to use
the value immediately), vs 1-cycle for MOV; would likely require
EXTx.x always be 1-cycle to justify this).
LLVM for My 66000 has the caller and the called smash the value into range >> if any arithmetic has transpired.
Yeah, this is probably a good strategy.
I had partly relied on having a few special-case instructions for
working with 32-bit values.
But, for some operations, only 64-bit instructions exist.
Say, unlike RISC-V, in BJX2 I did not provide alternate FP converter ops
for each pair of FP and Integer types.
Say, BJX2:
FLDCI: Int64->Binary64
FSTCI: Binary64->Int64
FLDCF: Binary32->Binary64
FSTCF: Binary64->Binary32
RISC-V:
FCVT.W.S //Binary32->Int32
FCVT.WU.S //Binary32->UInt32
FCVT.L.S //Binary32->Int64
FCVT.LU.S //Binary32->UInt64
FCVT.W.D //Binary64->Int32
FCVT.WU.D //Binary64->UInt32
FCVT.L.D //Binary64->Int64
FCVT.LU.D //Binary64->UInt64
FCVT.S.W //Int32->Binary32
FCVT.S.WU //UInt32->Binary32
FCVT.S.L //Int64->Binary32
FCVT.S.LU //UInt64->Binary32
FCVT.D.W //Int32->Binary64
FCVT.D.WU //UInt32->Binary64
FCVT.D.L //Int64->Binary64
FCVT.D.LU //UInt64->Binary64
...
But, as I see it, these don't gain much other than burning encoding
space and making the FPU more expensive...
And, admittedly, my implementation doesn't entirely distinguish these
cases (which may in turn potentially generate out-of-range values).
What if the FPU supports Binary16 or similar? Now you have even more,
because the number of converter ops in RV's scheme is apparently X^2
relative to the number of types involved, whereas for BJX2, it was
closer to linear (it adds FLDCH and FSTCH and similar).
And, say, converting Binary16->Int32 is going to take multiple
instructions (but, as I see it, this isn't too big of an issue).
There was a difference between RV and BJX2 regarding single-precision Load/Store, since RV loads it into a register as-is, and BJX2 had an
optional "FMOV.S" instruction which basically glues on Single<->Double
format converters (not free, but the cost is justifiable).
Also the converters are supposed to range-clamp the outputs for
conversion to integer types, but like, no...
Say: W: -(1<<31) .. ((1<<31)-1)
Say: WU: 0 .. ((1<<32)-1)
Say: L: -(1<<63) .. ((1<<63)-1)
Say: LU: 0 .. ((1<<64)-1)
On 2/7/2024 12:05 PM, MitchAlsup1 wrote:
BGB wrote:
The ABI design in my case was indirectly derived from the WinCE SH-4
ABI (which in some ways has a certain level of similarity to the
Windows X64 ABI).
Say:
Arguments in R4..R7
Return value in R0/R1;
Arguments in R1..R9
Results in R1..R9
It is hit/miss.
-------------------
For vararg and missing prototype functions, would need to always
assume the full 16-argument spill space with the 16 argument ABI.
Called is responsible for spill space.
varargs is fully functional without prototypes.
If you provide spill space for 16 args, and the called function only
assumes 8, there is no issue.
For fixed argument functions, when the arguments use fewer than 8
registers, there is no practical difference between the 8 and 16
register ABI (apart from the difference in the spill space).
For varargs, it is the same either way (and the missing prototype
doesn't matter).
Shrinking the spill space for the args<=8 only case could help, but
mostly has the tradeoff of needing to add code to check for and handle
this case in my compiler (and uncertainty as to how much it could help,
or if its practicality would be defeated by the probability of a given function also containing a call to printf or similar, ...).
The main exception case if one has a function that returns a struct, and
one calls it with a missing prototype, the program will very likely have unpredictable behavior and/or crash (since the caller will not have
provided a valid address for the returned struct).
Though, the Win64 ABI has the same issue, and it mostly doesn't matter.
The spill-space issue does effect stack-frame overhead, but harder to
determine if it has much practical effect on performance (effect seems
fairly small in any case).
No extraneous stack overhead.
It is a tradeoff.
It can potentially simplify varargs handling, since then one can dump
all the argument registers to the stack and then generate a pointer to
the just after the last fixed argument, or allow for dealing with code
that takes the address of a function argument and expects to be able to
use the argument list as an array.
It does effect stack usage, so will potentially exhaust the stack
slightly faster when using 128K or 256K of stack space.
Also the converters are supposed to range-clamp the outputs for
conversion to integer types, but like, no...
Say: W: -(1<<31) .. ((1<<31)-1)
Say: WU: 0 .. ((1<<32)-1)
Say: L: -(1<<63) .. ((1<<63)-1)
Say: LU: 0 .. ((1<<64)-1)
Shift Right performs extractions and works both signed and unsigned on
any size::
S12 -(1<<11)..(1<<11)-1
U12 0..(1<<12)-1
With CARRY the field may span a 65-bit register boundary.
{8, 16, 32} are simply subsets of SR.
Yeah, but in this case I was mostly dealing with the annoyance of there
being an unreasonable number of floating-point converter special cases
in RISC-V, since they decided to going from every possible format to
every other possible format, rather than taking a more "hub and spokes" approach.
IOW:
16 different converter ops *just* for scalar Int<->Float conversion.
Whereas, I had done this with 2 converter ops...
I could almost get away with hand-waving the issue and not dealing with
these cases in the RV64 support, except that RV64's wonk of
"sign-extended unsigned values" kinda throws a wrench into the mix.
Though, this mostly effects Int->Float, for Float->Int, can still be
like "meh whatever", I am not inclined to spend the LUT cost of
bothering with range-clamping the conversions.
On 2/8/2024 12:10 PM, MitchAlsup1 wrote:
BGB wrote:
On 2/7/2024 12:05 PM, MitchAlsup1 wrote:
BGB wrote:
The ABI design in my case was indirectly derived from the WinCE SH-4 >>>>> ABI (which in some ways has a certain level of similarity to the
Windows X64 ABI).
Say:
Arguments in R4..R7
Return value in R0/R1;
Arguments in R1..R9
Results in R1..R9
It is hit/miss.
-------------------
For vararg and missing prototype functions, would need to always
assume the full 16-argument spill space with the 16 argument ABI.
Called is responsible for spill space.
varargs is fully functional without prototypes.
If you provide spill space for 16 args, and the called function only
assumes 8, there is no issue.
But you see, I do NOT provide space for varargs ! If the called function
is not varargs there is 0 wasted space. If the called function is varagrs
then it stores the register arguments on the stack area it allocates, and
it deallocates.
For fixed argument functions, when the arguments use fewer than 8
registers, there is no practical difference between the 8 and 16
register ABI (apart from the difference in the spill space).
NO spill space.
For varargs, it is the same either way (and the missing prototype
doesn't matter).
Shrinking the spill space for the args<=8 only case could help, but
mostly has the tradeoff of needing to add code to check for and handle
this case in my compiler (and uncertainty as to how much it could
help, or if its practicality would be defeated by the probability of a
given function also containing a call to printf or similar, ...).
No spill space.
Whether or not one thinks spill space is bad, doesn't at the moment seem
like enough of an issue to change the ABI over it.
Adding a stat to my compiler:
Average stack-frame size is ~ 425 bytes (though, 416 and 432 are the
closest valid sizes, this is excluding leaf functions).
So (for 128 bytes), roughly 30% of this would be the spill-space (and,
for "no spill space" could potentially be closer to 297, so ~ 43%
relative overhead...).
IOW:
16 different converter ops *just* for scalar Int<->Float conversion. >>
Whereas, I had done this with 2 converter ops...
How did you get all the needed rounding modes into 2 conversions ??
{{even int->double needs rounding modes when int > 2^52 !!}}
Typically, Double->Int needs only one type of rounding:
Truncate towards zero.
What about Int->Double?
Round Nearest.
Though, only relevant for values outside the 2^52 range.
But, presumably, everyone knows this is not exact.
If you want Single, well, use two converter ops in a row.
But, was less of an issue if scalar math was internally always in
Binary64 format (regardless of the in-memory representation).
Maybe a case could be made for a round-truncate converter, don't have
this as of yet.
Maybe a case could also be made for an unsigned UInt64->Binary64
converter op, vs, doing it in software:
if(x>>63)
{
f=(double)((int64_t)(x>>1));
f=f+f;
}else
{
f=(double)((int64_t)x);
}
return(f);
Say, with a cast to a signed type being needed so that the runtime call doesn't go into infinite recursion.
Since, granted, one does need a converter op to deal with this case for RISC-V.
Along with ops to deal with conversion between integer types and Binary32.
So, it is possible I could justify adding a few more converter ops to BJX2.
For the SIMD converters, a different strategy was used.
Using Double and Int32 because easier to demonstrate in hex:
Say, we convert a Int32 to Binary64 by first repacking:
ZXXXXXXX
To, say:
400ZXXXXXXX00000
Where: Z has high bit flipped for signed.
Value is mapped between 2.0 and 4.0.
Then add a bias:
-2.0 for unsigned, -3.0 for signed.
Then scale, if value isn't unit-range.
Though, this mechanism was defined in terms of 8/16 bit integer values
and 16/32 bit floating point, but same basic idea.
Though, mostly this was for cost reasons.
I could almost get away with hand-waving the issue and not dealing
with these cases in the RV64 support, except that RV64's wonk of
"sign-extended unsigned values" kinda throws a wrench into the mix.
Though, this mostly effects Int->Float, for Float->Int, can still be
like "meh whatever", I am not inclined to spend the LUT cost of
bothering with range-clamping the conversions.
Congratulations, you are not IEEE 754 compliant !!
I never claimed it was.
A fully IEEE compliant FPU would be too expensive...
On 2/8/2024 4:35 PM, MitchAlsup1 wrote:
Potentially.
Did get around to implementing the "check if calling a varargs function
and only then reserve space to spill 16 registers" logic.
So, now, most functions (excluding those that call printf or similar)
will only need to reserve 64 bytes of spill space (though, 128+ is still needed if a function is called that needs more than 8 arguments, or is undeclared).
Main issue here is that for unsigned values larger than 2^63, the normal FLDCI will see them as negative.
Workaround is to shift them right by 1 bit, convert, and then scale by
2. Here, "f=f+f;" being a fairly cheap way to scale the value by 2.
But, may consider adding an FLDCIU instruction or similar.
On 2/6/24 5:23 PM, MitchAlsup1 wrote:
BGB-Alt wrote:[snip]
I was at least pleasantly surprised when debugging to note that a
trivial function, namely:
int LittleLong(int x)
{ return x; }
Compiled down to a pretty much optimal:
LittleLong:
MOV R4, R2
RTS
Compiles to:
LittleLong:
RET
on My 66000. {I placed the first argument and the first return result
in the same register.}
With C (and other languages?) not preserving the arguments (no
"inout" — "inout" would seem to present the opportunity to have
dummy arguments that are callee saved though that seems bad
programming practice as these are not really arguments), I have
wondered why the arguments and return value were separated in
ABIs. (I could also imagine cases where the return value is used
fairly soon as an argument — quite possibly the first argument —
of a called function.)
Of course, an ABI only needs to be used for external functions.
(This depends on debugging tools being up to handling non-ABI
calls.)
I have also wondered how difficult it would be to have each
function call define its interface with less strict binding to the
argument ordering (and perhaps even content) of the programmer-
facing argument list. With respect to content, a call interface
might be improved, e.g., by passing the only two members of a
structure used in a function rather than a pointer to the
structure while maintaining the programmer comprehensibility of
passing the entire structure by reference. "Prefetching" structure
members into registers might also make sense, particularly if
callers are very likely to have the values already in registers.
Implementing interprocedural optimizations in external library
functions has the obvious problems of needing a means to
communicate the interface details, making assumptions about common
and/or critical uses of the interface, and making assumptions
about future extensions. If a structure pointer was removed as an
argument (replaced by the used members), the function could not be
compatibly extended to use additional members. (Multithreading
would constrain the ability to pass members by value rather than
the structure by reference.)
I suspect such "optimizations" would not be worth the required
effort, but they are obvious possibilities.
| Sysop: | Keyop |
|---|---|
| Location: | Huddersfield, West Yorkshire, UK |
| Users: | 715 |
| Nodes: | 16 (3 / 13) |
| Uptime: | 28:37:53 |
| Calls: | 12,107 |
| Calls today: | 7 |
| Files: | 15,006 |
| Messages: | 6,518,234 |