Forum: >>> Magnum BBS <<<

Re: Compiling Fibonacci

From MitchAlsup1@21:1/5 to BGB-Alt on Tue Feb 6 22:23:57 2024

BGB-Alt wrote:

On 2/5/2024 10:42 PM, Robert Finch wrote:

Got around to Compiling the Fibonacci function using the Arpl compiler
with the same optimization settings for riscv and Qupls:
Riscv: 30 instructions, 120 bytes
Qupls: 13 instructions, 65 bytes

Most of the code savings was due to the use of subroutine linkage
instructions, ENTER and LEAVE. For a small routine these made up a
significant portion of the code. Otherwise, the code is almost the same.

I was at least pleasantly surprised when debugging to note that a
trivial function, namely:
int LittleLong(int x)
{ return x; }
Compiled down to a pretty much optimal:
LittleLong:
MOV R4, R2
RTS

Compiles to:

LittleLong:
RET

on My 66000. {I placed the first argument and the first return result in
the same register.}

Like, the little optimizations here and there have seemingly now
combined that some trivial functions like this, actually generate near-optimal code sequences...

Modulo screwed up ABIs.

However, there was a bug in that a narrowing conversion in the caller
was being skipped, resulting in the function passing along a value that
was out-of-range for the type (could have fixed this if the compiler had
used EXTx.x instead of MOV here, this is possible and would prevent out-of-range values from propagating as easily, but ATM the EXTx.x
operations may be 2-cycle for some types (if trying to use the value immediately), vs 1-cycle for MOV; would likely require EXTx.x always be 1-cycle to justify this).

LLVM for My 66000 has the caller and the called smash the value into range
if any arithmetic has transpired.

Had modified the logic for the caller-side type-conversion to actually
do the conversion in this case (for some reason, had previously added a
case to skip over narrowing conversions for integer types, not sure what
my thinking was here...).

Granted, it was specific to signed narrowing conversions, which was
still technically allowed by the C standard, but is not valid if one
assumes wrap-on-overflow semantics.

Say, if you define a value variable as 'int', at no point should the
program be able to observe values outside of 'int' range.

Modulo the notion that int should be the fastest arithmetic form. Many
64-bit RISCs are slower with ints that longs due to the smashes.

But, then counter-balanced by cases where they had caused bugs, like
recently finding that my recent optimization of "&&" and "||" had
introduced a bug, where things like:
if((ptr!=NULL) && (ptr[0]!=0))
...
Had concluded that both sides were logical expressions, and would
incorrectly load from memory before dealing with the possibility of the pointer being NULL.

Shame.

Had to modify the checks to verify that the expressions would not access memory through pointers, which did reduce the effectiveness of the optimization but also avoids some incorrect memory accesses.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Wed Feb 7 18:05:22 2024

BGB wrote:

On 2/6/2024 4:23 PM, MitchAlsup1 wrote:

BGB-Alt wrote:

On 2/5/2024 10:42 PM, Robert Finch wrote:

Got around to Compiling the Fibonacci function using the Arpl
compiler with the same optimization settings for riscv and Qupls:
Riscv:    30 instructions, 120 bytes
Qupls:    13 instructions, 65 bytes

Most of the code savings was due to the use of subroutine linkage
instructions, ENTER and LEAVE. For a small routine these made up a
significant portion of the code. Otherwise, the code is almost the same. >>>>

I was at least pleasantly surprised when debugging to note that a
trivial function, namely:
   int LittleLong(int x)
     { return x; }
Compiled down to a pretty much optimal:
   LittleLong:
   MOV R4, R2
   RTS

Compiles to:

LittleLong:
    RET

on My 66000. {I placed the first argument and the first return result in
the same register.}

Like, the little optimizations here and there have seemingly now
combined that some trivial functions like this, actually generate
near-optimal code sequences...

Modulo screwed up ABIs.

The ABI design in my case was indirectly derived from the WinCE SH-4 ABI (which in some ways has a certain level of similarity to the Windows X64 ABI).

Say:
Arguments in R4..R7
Return value in R0/R1;

Arguments in R1..R9
Results in R1..R9

-------------------

For vararg and missing prototype functions, would need to always assume
the full 16-argument spill space with the 16 argument ABI.

Called is responsible for spill space.
varargs is fully functional without prototypes.

The spill-space issue does effect stack-frame overhead, but harder to determine if it has much practical effect on performance (effect seems
fairly small in any case).

No extraneous stack overhead.

Granted, yes, theoretically bigger stack frames does mean more
stack-related L1 misses, but stack L1 hit rate seems to be pretty good
on average.

However, there was a bug in that a narrowing conversion in the caller
was being skipped, resulting in the function passing along a value
that was out-of-range for the type (could have fixed this if the
compiler had used EXTx.x instead of MOV here, this is possible and
would prevent out-of-range values from propagating as easily, but ATM
the EXTx.x operations may be 2-cycle for some types (if trying to use
the value immediately), vs 1-cycle for MOV; would likely require
EXTx.x always be 1-cycle to justify this).

LLVM for My 66000 has the caller and the called smash the value into range >> if any arithmetic has transpired.

Yeah, this is probably a good strategy.
I had partly relied on having a few special-case instructions for
working with 32-bit values.

But, for some operations, only 64-bit instructions exist.
Say, unlike RISC-V, in BJX2 I did not provide alternate FP converter ops
for each pair of FP and Integer types.

Say, BJX2:
FLDCI: Int64->Binary64
FSTCI: Binary64->Int64
FLDCF: Binary32->Binary64
FSTCF: Binary64->Binary32

RISC-V:
FCVT.W.S //Binary32->Int32
FCVT.WU.S //Binary32->UInt32
FCVT.L.S //Binary32->Int64
FCVT.LU.S //Binary32->UInt64
FCVT.W.D //Binary64->Int32
FCVT.WU.D //Binary64->UInt32
FCVT.L.D //Binary64->Int64
FCVT.LU.D //Binary64->UInt64
FCVT.S.W //Int32->Binary32
FCVT.S.WU //UInt32->Binary32
FCVT.S.L //Int64->Binary32
FCVT.S.LU //UInt64->Binary32
FCVT.D.W //Int32->Binary64
FCVT.D.WU //UInt32->Binary64
FCVT.D.L //Int64->Binary64
FCVT.D.LU //UInt64->Binary64
...

But, as I see it, these don't gain much other than burning encoding
space and making the FPU more expensive...

And, admittedly, my implementation doesn't entirely distinguish these
cases (which may in turn potentially generate out-of-range values).

What if the FPU supports Binary16 or similar? Now you have even more,
because the number of converter ops in RV's scheme is apparently X^2
relative to the number of types involved, whereas for BJX2, it was
closer to linear (it adds FLDCH and FSTCH and similar).

And, say, converting Binary16->Int32 is going to take multiple
instructions (but, as I see it, this isn't too big of an issue).

There was a difference between RV and BJX2 regarding single-precision Load/Store, since RV loads it into a register as-is, and BJX2 had an
optional "FMOV.S" instruction which basically glues on Single<->Double
format converters (not free, but the cost is justifiable).

Also the converters are supposed to range-clamp the outputs for
conversion to integer types, but like, no...
Say: W: -(1<<31) .. ((1<<31)-1)
Say: WU: 0 .. ((1<<32)-1)
Say: L: -(1<<63) .. ((1<<63)-1)
Say: LU: 0 .. ((1<<64)-1)

Shift Right performs extractions and works both signed and unsigned
on any size::

S12 -(1<<11)..(1<<11)-1
U12 0..(1<<12)-1

With CARRY the field may span a 65-bit register boundary.

{8, 16, 32} are simply subsets of SR.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Thu Feb 8 18:10:13 2024

BGB wrote:

On 2/7/2024 12:05 PM, MitchAlsup1 wrote:

BGB wrote:

The ABI design in my case was indirectly derived from the WinCE SH-4
ABI (which in some ways has a certain level of similarity to the
Windows X64 ABI).

Say:
   Arguments in R4..R7
   Return value in R0/R1;

Arguments in R1..R9
Results   in R1..R9

It is hit/miss.

-------------------

For vararg and missing prototype functions, would need to always
assume the full 16-argument spill space with the 16 argument ABI.

Called is responsible for spill space.
varargs is fully functional without prototypes.

If you provide spill space for 16 args, and the called function only
assumes 8, there is no issue.

But you see, I do NOT provide space for varargs ! If the called function
is not varargs there is 0 wasted space. If the called function is varagrs
then it stores the register arguments on the stack area it allocates, and
it deallocates.

For fixed argument functions, when the arguments use fewer than 8
registers, there is no practical difference between the 8 and 16
register ABI (apart from the difference in the spill space).

NO spill space.

For varargs, it is the same either way (and the missing prototype
doesn't matter).

Shrinking the spill space for the args<=8 only case could help, but
mostly has the tradeoff of needing to add code to check for and handle
this case in my compiler (and uncertainty as to how much it could help,
or if its practicality would be defeated by the probability of a given function also containing a call to printf or similar, ...).

No spill space.

The main exception case if one has a function that returns a struct, and
one calls it with a missing prototype, the program will very likely have unpredictable behavior and/or crash (since the caller will not have
provided a valid address for the returned struct).

Though, the Win64 ABI has the same issue, and it mostly doesn't matter.

The spill-space issue does effect stack-frame overhead, but harder to
determine if it has much practical effect on performance (effect seems
fairly small in any case).

No extraneous stack overhead.

It is a tradeoff.

It can potentially simplify varargs handling, since then one can dump
all the argument registers to the stack and then generate a pointer to
the just after the last fixed argument, or allow for dealing with code
that takes the address of a function argument and expects to be able to
use the argument list as an array.

It does effect stack usage, so will potentially exhaust the stack
slightly faster when using 128K or 256K of stack space.

No unneeded argument or unit of data is on the stack if it does not need
to be on the stack. Thus, no wasted space and no spill space, either.

Also the converters are supposed to range-clamp the outputs for
conversion to integer types, but like, no...
   Say: W: -(1<<31) .. ((1<<31)-1)
   Say: WU: 0 .. ((1<<32)-1)
   Say: L: -(1<<63) .. ((1<<63)-1)
   Say: LU: 0 .. ((1<<64)-1)

Shift Right performs extractions and works both signed and unsigned on
any size::

        S12 -(1<<11)..(1<<11)-1
        U12 0..(1<<12)-1

With CARRY the field may span a 65-bit register boundary.

{8, 16, 32} are simply subsets of SR.

Yeah, but in this case I was mostly dealing with the annoyance of there
being an unreasonable number of floating-point converter special cases
in RISC-V, since they decided to going from every possible format to
every other possible format, rather than taking a more "hub and spokes" approach.

IOW:
16 different converter ops *just* for scalar Int<->Float conversion.

Whereas, I had done this with 2 converter ops...

How did you get all the needed rounding modes into 2 conversions ??
{{even int->double needs rounding modes when int > 2^52 !!}}

I could almost get away with hand-waving the issue and not dealing with
these cases in the RV64 support, except that RV64's wonk of
"sign-extended unsigned values" kinda throws a wrench into the mix.

Though, this mostly effects Int->Float, for Float->Int, can still be
like "meh whatever", I am not inclined to spend the LUT cost of
bothering with range-clamping the conversions.

Congratulations, you are not IEEE 754 compliant !!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Thu Feb 8 22:35:17 2024

BGB wrote:

On 2/8/2024 12:10 PM, MitchAlsup1 wrote:

BGB wrote:

On 2/7/2024 12:05 PM, MitchAlsup1 wrote:

BGB wrote:

The ABI design in my case was indirectly derived from the WinCE SH-4 >>>>> ABI (which in some ways has a certain level of similarity to the
Windows X64 ABI).

Say:
   Arguments in R4..R7
   Return value in R0/R1;

Arguments in R1..R9
Results   in R1..R9

It is hit/miss.

-------------------

For vararg and missing prototype functions, would need to always
assume the full 16-argument spill space with the 16 argument ABI.

Called is responsible for spill space.
varargs is fully functional without prototypes.

If you provide spill space for 16 args, and the called function only
assumes 8, there is no issue.

But you see, I do NOT provide space for varargs ! If the called function
is not varargs there is 0 wasted space. If the called function is varagrs
then it stores the register arguments on the stack area it allocates, and
it deallocates.

For fixed argument functions, when the arguments use fewer than 8
registers, there is no practical difference between the 8 and 16
register ABI (apart from the difference in the spill space).

NO spill space.

For varargs, it is the same either way (and the missing prototype
doesn't matter).

Shrinking the spill space for the args<=8 only case could help, but
mostly has the tradeoff of needing to add code to check for and handle
this case in my compiler (and uncertainty as to how much it could
help, or if its practicality would be defeated by the probability of a
given function also containing a call to printf or similar, ...).

No spill space.

Whether or not one thinks spill space is bad, doesn't at the moment seem
like enough of an issue to change the ABI over it.

Change, yes: agreed. But when facing a clean sheet, it is not necessary.
One of the whole points in RISC architecture is to leave out that which
is unnecessary. Spill space is demonstratively unnecessary. Even if it
is performance neutral.

Adding a stat to my compiler:
Average stack-frame size is ~ 425 bytes (though, 416 and 432 are the
closest valid sizes, this is excluding leaf functions).

So (for 128 bytes), roughly 30% of this would be the spill-space (and,
for "no spill space" could potentially be closer to 297, so ~ 43%
relative overhead...).

And a few exceptions to the GuestOS to allocate more pages to the stack
area.

IOW:
   16 different converter ops *just* for scalar Int<->Float conversion. >>
Whereas, I had done this with 2 converter ops...

How did you get all the needed rounding modes into 2 conversions ??
{{even int->double needs rounding modes when int > 2^52 !!}}

Typically, Double->Int needs only one type of rounding:
Truncate towards zero.

What about Int->Double?
Round Nearest.
Though, only relevant for values outside the 2^52 range.
But, presumably, everyone knows this is not exact.

If you want Single, well, use two converter ops in a row.

Makes you subject to double rounding errors--thus not IEEE 754
compliant.

But realistically you still have to have efficient conversions specified
in FORTRAN.

But, was less of an issue if scalar math was internally always in
Binary64 format (regardless of the in-memory representation).

Lessening the scale of the issue does not achieve IEEE 754 compliance.
You can get away with it in an FPGA that only you use, not so much
when you want to sell parts against a similar performant chip that is compliant.

Maybe a case could be made for a round-truncate converter, don't have
this as of yet.

Maybe a case could also be made for an unsigned UInt64->Binary64
converter op, vs, doing it in software:
if(x>>63)
{
f=(double)((int64_t)(x>>1));
f=f+f;
}else
{
f=(double)((int64_t)x);
}
return(f);

RNDPI Rint,Rfp

round towards positive infinity.

Say, with a cast to a signed type being needed so that the runtime call doesn't go into infinite recursion.

Since, granted, one does need a converter op to deal with this case for RISC-V.

Along with ops to deal with conversion between integer types and Binary32.

So, it is possible I could justify adding a few more converter ops to BJX2.

For the SIMD converters, a different strategy was used.

Using Double and Int32 because easier to demonstrate in hex:
Say, we convert a Int32 to Binary64 by first repacking:
ZXXXXXXX
To, say:
400ZXXXXXXX00000
Where: Z has high bit flipped for signed.
Value is mapped between 2.0 and 4.0.
Then add a bias:
-2.0 for unsigned, -3.0 for signed.
Then scale, if value isn't unit-range.

Though, this mechanism was defined in terms of 8/16 bit integer values
and 16/32 bit floating point, but same basic idea.

Though, mostly this was for cost reasons.

I could almost get away with hand-waving the issue and not dealing
with these cases in the RV64 support, except that RV64's wonk of
"sign-extended unsigned values" kinda throws a wrench into the mix.

Though, this mostly effects Int->Float, for Float->Int, can still be
like "meh whatever", I am not inclined to spend the LUT cost of
bothering with range-clamping the conversions.

Congratulations, you are not IEEE 754 compliant !!

I never claimed it was.

A fully IEEE compliant FPU would be too expensive...

And yet, it seams that everyone but FPGA implementers found otherwise

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Fri Feb 9 03:03:01 2024

BGB wrote:

On 2/8/2024 4:35 PM, MitchAlsup1 wrote:

Potentially.

Did get around to implementing the "check if calling a varargs function
and only then reserve space to spill 16 registers" logic.

I need no prototypes to get varargs working.

So, now, most functions (excluding those that call printf or similar)
will only need to reserve 64 bytes of spill space (though, 128+ is still needed if a function is called that needs more than 8 arguments, or is undeclared).

0 in this case is always better than more than 0.

Main issue here is that for unsigned values larger than 2^63, the normal FLDCI will see them as negative.

Not mine, I have {signed, unsigned, single and double}^2 conversions
in all rounding modes.

Workaround is to shift them right by 1 bit, convert, and then scale by
2. Here, "f=f+f;" being a fairly cheap way to scale the value by 2.

yawn.

But, may consider adding an FLDCIU instruction or similar.

There are 48 cases to consider in conversion^2

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to All on Sat Feb 10 10:54:26 2024

All the Fib messages made me wonder:

How small can a naive (doubly recursive) implementation be?

fn fib(u:u32)->u32
{
if u <= 2 {return u;}
fib(u-1)+fib(u-2)
}

;; eax = inout
proc fib
cmp eax,2 ; 3 bytes
jbe done ; 2 bytes
dec eax ; 1
push eax ; 1
dec eax ; 1
call fib ; 3 (or 5?)
xchg eax,[esp]; 2
call fib ; 3 / 5
pop ebx ; 1
add eax,ebx ; 2
done:
ret ; 1
endp

Seems like 20 or (more likely) 24 bytes?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Paul A. Clayton on Mon Feb 26 12:39:10 2024

On 23/02/2024 23:44, Paul A. Clayton wrote:

On 2/6/24 5:23 PM, MitchAlsup1 wrote:

BGB-Alt wrote:

[snip]

I was at least pleasantly surprised when debugging to note that a
trivial function, namely:
   int LittleLong(int x)
     { return x; }
Compiled down to a pretty much optimal:
   LittleLong:
   MOV R4, R2
   RTS

Compiles to:

LittleLong:
     RET

on My 66000. {I placed the first argument and the first return result
in the same register.}

With C (and other languages?) not preserving the arguments (no
"inout" — "inout" would seem to present the opportunity to have
dummy arguments that are callee saved though that seems bad
programming practice as these are not really arguments), I have
wondered why the arguments and return value were separated in
ABIs. (I could also imagine cases where the return value is used
fairly soon as an argument — quite possibly the first argument —
of a called function.)

It's common that ABIs have the results in the same registers as the
first couple of arguments. It's not that way on all ABIs, but it is
very common.

Of course, an ABI only needs to be used for external functions.
(This depends on debugging tools being up to handling non-ABI
calls.)

Most compilers follow the ABI for any independently generated function -
it quickly gets complicated otherwise. But if you've got static or
inline functions, or are using LTO, compilers can and do mess around a
bit - such as using constant propagation, inlining, function cloning,
and re-arranging arguments.

I have also wondered how difficult it would be to have each
function call define its interface with less strict binding to the
argument ordering (and perhaps even content) of the programmer-
facing argument list. With respect to content, a call interface
might be improved, e.g., by passing the only two members of a
structure used in a function rather than a pointer to the
structure while maintaining the programmer comprehensibility of
passing the entire structure by reference. "Prefetching" structure
members into registers might also make sense, particularly if
callers are very likely to have the values already in registers.

Implementing interprocedural optimizations in external library
functions has the obvious problems of needing a means to
communicate the interface details, making assumptions about common
and/or critical uses of the interface, and making assumptions
about future extensions. If a structure pointer was removed as an
argument (replaced by the used members), the function could not be
compatibly extended to use additional members. (Multithreading
would constrain the ability to pass members by value rather than
the structure by reference.)

I suspect such "optimizations" would not be worth the required
effort, but they are obvious possibilities.

I think you are right here. Such things would be very complicated to implement, and give relatively little performance improvement. You get
a lot more, for less compiler effort, with link-time optimisation
(allowing some kinds of inter-procedural optimisations across modules).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Rixter
  Thu Jul 30 14:17:17 2026
  from Madison, Nc via Telnet
- Krenn
  Thu Jul 30 13:16:49 2026
  from Sydney, Nsw via Telnet
- Bob Worm
  Thu Jul 30 09:03:28 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:47:34 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:36:06 2026
  from Wales, Uk via Telnet
- Rixter
  Thu Jul 30 02:32:09 2026
  from Madison, Nc via Telnet
- Bob Worm
  Wed Jul 29 22:26:45 2026
  from Wales, Uk via Telnet
- Zenobyte
  Wed Jul 29 21:08:05 2026
  from San Juan, Pr via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	93:54:25
Calls:	12,456
Calls today:	6
Files:	15,197
Messages:	6,537,905

Re: Compiling Fibonacci

Who's Online

Recent Visitors

System Info