Does anyone know why libm defines
double lgamma(x) { }
with an extern int signgam
instead of:
typedef struct { double result;
int sign; } gammaresult;
gammaresult lgamma( double x );
Struct returns from subroutines were part of C back in 1980...
{when I started using C}
One could add to this discussion as to why errno was not
done with struct return
[email protected] (MitchAlsup1) writes:
Does anyone know why libm defines
double lgamma(x) { }
with an extern int signgam
instead of:
typedef struct { double result;
int sign; } gammaresult;
gammaresult lgamma( double x );
Struct returns from subroutines were part of C back in 1980...
{when I started using C}
Possibly, but they were not part of early C, are not particularly
efficient on many ABIs, and are inconvenient to use if you want to use
all the components of the struct. So there were lots of reasons why
API designers avoided the use of struct returns. An alternative would
have been
double lgamma(double gamma, int *signgam);
however.
On 10/07/2025 18:58, Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Does anyone know why libm defines
double lgamma(x) { }
with an extern int signgam
instead of:
typedef struct { double result;
int sign; } gammaresult;
gammaresult lgamma( double x );
Struct returns from subroutines were part of C back in 1980...
{when I started using C}
Possibly, but they were not part of early C, are not particularly
efficient on many ABIs, and are inconvenient to use if you want to use
all the components of the struct. So there were lots of reasons why
API designers avoided the use of struct returns. An alternative would
have been
double lgamma(double gamma, int *signgam);
however.
Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution.
After all, the typical simplistic struct return here would be roughly equivalent to :
void lgamma(gammaresult * result, double gamma);
It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient. Then the habit of decent ABI's could have continued
when new architectures were developed. Instead, many current ABI's are
at least sub-optimal for structs - a particular pain for C++.
One could add to this discussion as to why errno was not
done with struct return; ala::
typedef struct { int fides;
int error; } openresult;
openresult open( char *string, int modes );
as are many of the Linux OS entry points.
This has to be a better solution compared to errno and signgam.
On Thu, 10 Jul 2025 15:50:38 +0000, MitchAlsup1 wrote:
One could add to this discussion as to why errno was not
done with struct return; ala::
typedef struct { int fides;
int error; } openresult;
openresult open( char *string, int modes );
as are many of the Linux OS entry points.
This has to be a better solution compared to errno and signgam.
The actual Linux kernel entry point for open(2) returns a non-negative FD
number on success, and a negative error code on failure.
In any case, these days errno is a perversity kept alive by backwards compatibility: The C wrapper for the system call has to check whether
there is an error, then has to compute the error number and
expensively store it to the thread-local storage where errno resides.
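The work the C wrapper does can be sketched as follows. This is an illustration of the convention, not the code of any particular C library: the function name syscall_ret_to_libc is hypothetical, and the raw value is passed in as a parameter instead of coming from a real system call. The bound of 4095 is the range Linux reserves for encoded error numbers.

```c
#include <errno.h>

/* Largest error number the Linux kernel encodes in a return value. */
#define MAX_ERRNO 4095

/* raw is the value the kernel left in the return register
   (supplied by the caller in this sketch). */
long syscall_ret_to_libc(long raw)
{
    if (raw < 0 && raw >= -MAX_ERRNO) {
        errno = (int)-raw;  /* the store to thread-local errno */
        return -1;          /* the classic "failed" return value */
    }
    return raw;             /* success: pass the value through */
}
```

The compare, the negation and the store to thread-local storage are the extra steps that a direct error return would avoid.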
The open(2) API and the errno mechanism were defined in very early Unix, half a century ago.
They were standardized in the System V Interface Definition (SVID) in the
1980s and in POSIX a few years later, followed by the X/Open Portability
Guide (XPG) and finally the Single Unix Specification. In all cases
backward compatibility at the source level was a requirement.
Extensions and new capabilities related to opening a file are encapsulated
in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.
Yes, there are likely different possible APIs; all new standardized Unix C APIs
(e.g. posix_spawn, pthreads, et alia) return the E* error number directly
(for thread safety), or zero for success, eschewing errno completely. Any other
data returned by an API is via pointer parameters (often with 'restrict' qualification).
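The style described here (error number as the return value, data through pointer parameters) might look like this in user code. parse_port is a hypothetical function invented for the illustration; it is not part of POSIX. It saves and restores errno around strtol() so that the caller's errno is left alone.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical API in the newer POSIX style: returns 0 on success or an
   E* number on failure; the parsed value goes through the out-pointer. */
int parse_port(const char *restrict text, int *restrict port)
{
    char *end;
    int saved = errno;          /* keep the caller's errno intact */
    errno = 0;
    long v = strtol(text, &end, 10);
    int out_of_range = (errno == ERANGE);
    errno = saved;

    if (end == text || *end != '\0')
        return EINVAL;          /* empty or not a decimal number */
    if (out_of_range || v < 0 || v > 65535)
        return ERANGE;          /* does not fit a port number */
    *port = (int)v;             /* written only on success */
    return 0;
}
```

The caller tests the return value directly, as with pthread_mutex_lock() or posix_spawn(), and never has to consult errno.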
On Thu, 10 Jul 2025 19:18:05 +0000, David Brown wrote:
On 10/07/2025 18:58, Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Does anyone know why libm defines
double lgamma(x) { }
with an extern int signgam
instead of:
typedef struct { double result;
int sign; } gammaresult;
gammaresult lgamma( double x );
Struct returns from subroutines were part of C back in 1980...
{when I started using C}
Possibly, but they were not part of early C, are not particularly
efficient on many ABIs, and are inconvenient to use if you want to use
all the components of the struct. So there were lots of reasons why
API designers avoided the use of struct returns. An alternative would
have been
double lgamma(double gamma, int *signgam);
however.
Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution.
Given that one has to
a) look at something (return value or flag)
b) and if set "bad" go find errno
c) set errno to negative of value
d) return
e) look at return value
f) if "bad" go find errno
g) read errno
h) go do something about it
I think it is easy to make the argument that a structure return
is almost always less expensive, as it only takes:
a) return 2 values
b) if second value is "bad"
c) go do something about it
And this is thread safe, too.
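From the caller's side, the three-step version could look like the sketch below. The names openresult and open_r are hypothetical. Note the limitation: this wrapper is layered on the existing POSIX open() and therefore still reads errno internally, so it only shows the shape of the interface, not the cost saving a native two-value kernel return would give.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical two-value result, after the struct proposed in the thread. */
typedef struct {
    int fd;     /* valid descriptor, or -1 on failure */
    int error;  /* 0 on success, otherwise an E* number */
} openresult;

/* Hypothetical wrapper over POSIX open(); flags must not include O_CREAT,
   since no mode argument is passed. */
openresult open_r(const char *path, int flags)
{
    openresult r;
    r.fd = open(path, flags);
    r.error = (r.fd < 0) ? errno : 0;  /* capture errno once, here */
    return r;
}
```

A caller then writes `openresult r = open_r(path, O_RDONLY); if (r.error) ...` and has nothing else to look up.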
After all, the
typical simplistic struct return here would be roughly equivalent to :
void lgamma(gammaresult * result, double gamma);
It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient. Then the habit of decent ABI's could have continued
when new architectures were developed. Instead, many current ABI's are
at least sub-optimal for structs - a particular pain for C++.
Do you think it is time to make another layer of wrappers::
// for illustrative purposes
typedef int fides;   // file descriptor
typedef struct { int first, second; } two_returns;
enum System_Calls { ..., file_open, ... };
// new entry point: descriptor and error code come back together
two_returns new_open( char *string, int flags )
{
    return SYSCALL( string, flags, file_open );
}
// old interface, rebuilt on top of the new one
fides open( char *string, int flags )
{
    two_returns old = new_open( string, flags );
    if( old.second )
    {
        errno = -old.second;
        old.first = -1;
    }
    return (fides)old.first;
}
This results in a system call that is easily inlined by the compiler and
results in 2 or 3 instructions on many new architectures, instead of "lots",
including additional control transfers (call and return) along with
accessing errno (signgam), ...
One would not want to inline the old way. So, now we can let the compiler
inline SYSCALLs with reasonable safety.
In article <Ih_bQ.958133$[email protected]>,
Scott Lurndal <[email protected]> wrote:
The open(2) API and the errno mechanism were defined in very early Unix, half a century ago.
They were standardized in the System V Interface Definition (SVID) in the
1980s and in POSIX a few years later, followed by the X/Open Portability
Guide (XPG) and finally the Single Unix Specification. In all cases
backward compatibility at the source level was a requirement.
Extensions and new capabilities related to opening a file are encapsulated
in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.
Yes, there are likely different possible APIs; all new standardized Unix C APIs
(e.g. posix_spawn, pthreads, et alia) return the E* error number directly
(for thread safety), or zero for success, eschewing errno completely. Any other
data returned by an API is via pointer parameters (often with 'restrict' qualification).
POSIX mandates that `errno` be (essentially) thread-local, so
thread safety isn't much of a consideration here. Traditionally
Unix kernels have returned a single value in a register, and set
a flag (in the PSW or whatever) to indicate failure, leaving it
to the syscall stubs in e.g. the C library to take whatever the
kernel gives back from the actual syscall exit and make sure
that `errno` is set appropriately.
I can imagine that a kernel call interface where `errno` is not
set is a bit more direct, but I don't think concurrency plays a
huge role there; but maybe these interfaces were designed in
that awkward time before `errno` was thread safe by mandate.
And the case of `posix_spawn` might be special, since it is so
often written in terms of `vfork`, which has its own bizarre
semantics.
Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to :
void lgamma(gammaresult * result, double gamma);
It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient.
Then the habit of decent ABI's could have continued
when new architectures were developed.
Instead, many current ABI's are
at least sub-optimal for structs
[email protected] (Dan Cross) writes:
In article <Ih_bQ.958133$[email protected]>,
Scott Lurndal <[email protected]> wrote:
The open(2) API and the errno mechanism were defined in very early Unix, half a century ago.
They were standardized in the System V Interface Definition (SVID) in the
1980s and in POSIX a few years later, followed by the X/Open Portability
Guide (XPG) and finally the Single Unix Specification. In all cases
backward compatibility at the source level was a requirement.
Extensions and new capabilities related to opening a file are encapsulated
in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.
Yes, there are likely different possible APIs; all new standardized Unix C APIs
(e.g. posix_spawn, pthreads, et alia) return the E* error number directly
(for thread safety), or zero for success, eschewing errno completely. Any other
data returned by an API is via pointer parameters (often with 'restrict' qualification).
POSIX mandates that `errno` be (essentially) thread-local, so
thread safety isn't much of a consideration here. Traditionally
Unix kernels have returned a single value in a register, and set
a flag (in the PSW or whatever) to indicate failure, leaving it
to the syscall stubs in e.g. the C library to take whatever the
kernel gives back from the actual syscall exit and make sure
that `errno` is set appropriately.
I can imagine that a kernel call interface where `errno` is not
set is a bit more direct, but I don't think concurrency plays a
huge role there; but maybe these interfaces were designed in
that awkward time before `errno` was thread safe by mandate.
I was on the XPG working group in those years, and yes, they
were designed in that awkward time as 1003.4a was being
developed.
And the case of `posix_spawn` might be special, since it is so
often written in terms of `vfork`, which has its own bizarre
semantics.
posix_spawn was modeled somewhat after ADA process creation primitives.
The rationale is included in the standard page.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html
"Instead, posix_spawn() and posix_spawnp()
In article <av8cQ.984246$[email protected]>,
Scott Lurndal <[email protected]> wrote:
The rationale is included in the standard page.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html
Thanks, but I don't think that directly addresses why they chose
to return error status directly in the return value, and not set
errno as a side-effect.
David Brown <[email protected]> writes:
Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to :
void lgamma(gammaresult * result, double gamma);
Let's see:
#include <stdio.h>
typedef struct { double result;
int sign; } gammaresult;
gammaresult lgamma_alsup1( double x )
{
gammaresult r;
r.result = x+1.;
r.sign = -1;
return r;
}
double lgamma_ertl1(double x, int *signgam)
{
*signgam = -1;
return x+1.;
}
extern gammaresult lgamma_alsup2( double x );
void call_alsup()
{
gammaresult r=lgamma_alsup2(1.);
printf("%f ",r.result);
printf("%d ",r.sign);
}
extern double lgamma_ertl2(double x, int *signgam);
void call_ertl()
{
int sign;
printf("%f ",lgamma_ertl2(1.,&sign));
printf("%d ",sign);
}
Here the calls are to a differently-named function with the same
interface such that we see what happens without inlining. The first
thing to note is that the source code for the struct-returning
function is longer. The calling code is slightly longer.
I have compiled that on AMD64 with:
gcc -fpcc-struct-return -Wall -O -c lgamma.c
The output of "objdump -d lgamma.o" for lgamma_*1 is:
0000000000000000 <lgamma_alsup1>:
0: 48 89 f8 mov %rdi,%rax
3: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # b <lgamma_alsup1+0xb>
a: 00
b: f2 0f 11 07 movsd %xmm0,(%rdi)
f: c7 47 08 ff ff ff ff movl $0xffffffff,0x8(%rdi)
16: c3 ret
0000000000000017 <lgamma_ertl1>:
17: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
1d: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # 25 <lgamma_ertl1+0xe>
24: 00
25: c3 ret
So with the typical simplistic struct return (aka pcc-struct-return)
the code of the function is longer.
The code for the call_* functions is:
0000000000000026 <call_alsup>:
26: 53 push %rbx
27: 48 83 ec 10 sub $0x10,%rsp
2b: 48 89 e7 mov %rsp,%rdi
2e: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 36 <call_alsup+0x10>
35: 00
36: e8 00 00 00 00 call 3b <call_alsup+0x15>
3b: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
40: f2 0f 10 04 24 movsd (%rsp),%xmm0
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 4c <call_alsup+0x26>
4c: b8 01 00 00 00 mov $0x1,%eax
51: e8 00 00 00 00 call 56 <call_alsup+0x30>
56: 89 de mov %ebx,%esi
58: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 5f <call_alsup+0x39>
5f: b8 00 00 00 00 mov $0x0,%eax
64: e8 00 00 00 00 call 69 <call_alsup+0x43>
69: 48 83 c4 10 add $0x10,%rsp
6d: 5b pop %rbx
6e: c3 ret
000000000000006f <call_ertl>:
6f: 48 83 ec 18 sub $0x18,%rsp
73: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi
78: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 80 <call_ertl+0x11>
7f: 00
80: e8 00 00 00 00 call 85 <call_ertl+0x16>
85: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 8c <call_ertl+0x1d>
8c: b8 01 00 00 00 mov $0x1,%eax
91: e8 00 00 00 00 call 96 <call_ertl+0x27>
96: 8b 74 24 0c mov 0xc(%rsp),%esi
9a: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # a1 <call_ertl+0x32>
a1: b8 00 00 00 00 mov $0x0,%eax
a6: e8 00 00 00 00 call ab <call_ertl+0x3c>
ab: 48 83 c4 18 add $0x18,%rsp
af: c3 ret
18 instructions for call_alsup() vs. 14 for call_ertl(),
so again the struct-return variant leads to longer code with pcc-struct-return.
It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient.
Given the name of the calling convention variant, this was introduced
in PCC (and probably struct returns themselves were introduced in
PCC). PCC was released in 1979 on the machines of the day, such as
the PDP-11; I am sure Johnson implemented a calling convention for
struct passing and struct returns that used the least amount of code.
If Johnson had had more space to play with, he probably would have had
other things on the agenda before improving the struct return calling convention. E.g., the calling conventions at the time passed all
parameters on the stack, and we still have this in the Intel calling convention for IA-32.
Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns.
But struct returns were so rare in libraries that gcc added an option -freg-struct-return which returns
small structs in registers, and this option used to be usable, because libraries or system calls did not use struct-returns at the time.
Eventually, ABI specifications went for more efficient, but also more
complex and less forgiving calling conventions, so on AMD64 without -fpcc-struct-return gammaresult is actually returned in registers,
leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
13 instructions for call_alsup (shorter than call_ertl).
Then the habit of decent ABI's could have continued
when new architectures were developed.
It seems to me that that's what happened (except that it was not a continuation): When new architectures were introduced, ABIs were
introduced that made use of the additional memory, but also took compatibility with existing practice into account.
E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).
As time progressed, calling conventions tried to keep stuff more in
registers and in the right kind of registers, at the cost of a more
complex implementation and breaking programs without prototypes.
E.g., the AMD64 ABI specifies register struct returns for small
structs.
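Why gammaresult qualifies as a "small struct" on AMD64 can be checked from its layout. The sketch below assumes the System V AMD64 ABI rule that an aggregate larger than two eightbytes (16 bytes) is returned in memory; fitting in 16 bytes is necessary but not sufficient, since the ABI also classifies each eightbyte. The helper name may_return_in_registers is hypothetical.

```c
#include <stddef.h>

typedef struct { double result;
                 int sign; } gammaresult;

/* Layout facts this sketch relies on (they hold on common ABIs):
   the double occupies the first eightbyte, the int starts the second. */
_Static_assert(offsetof(gammaresult, result) == 0, "double comes first");
_Static_assert(offsetof(gammaresult, sign) == 8, "int starts at byte 8");
_Static_assert(sizeof(gammaresult) <= 16, "fits in two eightbytes");

/* Size test only: a necessary condition for register return under the
   System V AMD64 ABI, not the full classification algorithm. */
int may_return_in_registers(size_t size)
{
    return size <= 16;
}
```

With this layout the double falls in the SSE class and the int in the INTEGER class, which is consistent with the register return described above.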
Instead, many current ABI's are
at least sub-optimal for structs
Which ones do you have in mind?
- anton
Niklas Holsti <[email protected]d> schrieb:
"Instead, posix_spawn() and posix_spawnp()
For a second, I read that as posix_swamp().
But then again, I have been known to write about unsinged numbers.
[email protected] (Dan Cross) writes:
In article <av8cQ.984246$[email protected]>,
Scott Lurndal <[email protected]> wrote:
<snip posix_spawn discussion>
The rationale is included in the standard page.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html
Thanks, but I don't think that directly addresses why they chose
to return error status directly in the return value, and not set
errno as a side-effect.
My recollection is the choice to return errno directly was made
because we were aware of the pending 1003.4a specification (I sat
in on a couple of those meetings as well when our regular posix rep
wasn't available).
Unsinged numbers are cool :-)
Yeah, I find that singed numbers make it harder to concentrate.
Stefan
On Fri, 11 Jul 2025 16:58:07 +0000, Thomas Koenig wrote:
Niklas Holsti <[email protected]d> schrieb:
"Instead, posix_spawn() and posix_spawnp()
For a second, I read that as posix_swamp().
It might very well be.....
But then again, I have been known to write about unsinged numbers.
It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient. ...
Given that the PDP-11 had 6 usable general-purpose registers, and x86
started out with a similar number, it would have been quite difficult to
pass the first few arguments in registers. On the PDP-11 and x86 it was
easy to push arguments onto the stack and to address them in the callee
from the stack.
E.g., the calling conventions at the time passed all parameters on the
stack, and we still have this in the Intel calling convention for IA-32.
On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:
David Brown <[email protected]> writes:
Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to :
void lgamma(gammaresult * result, double gamma);
Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns.
Greenhills compiler for 88K use register struct returns (1983)
IIRC 4 registers; so that complex doubles were in registers
both calling and returning.
According to MitchAlsup1 <[email protected]>:
Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers.
PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.
The C compilers at that time were not very sophisticated. They compiled
one statement at a time, and the only way to tell them to leave values
in registers was an explicit "register" declaration. Except in the most
trivial routines, it'd usually have to stash the argument in memory to
make room for something else, so there'd have been no benefit.
[email protected] (MitchAlsup1) writes:
On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:
David Brown <[email protected]> writes:
Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to :
void lgamma(gammaresult * result, double gamma);
Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns.
Greenhills compiler for 88K use register struct returns (1983)
IIRC 4 registers; so that complex doubles were in registers
both calling and returning.
The formal definition for the 88k Unix ABI was the 88Open BCS[*] (I was the
Unisys rep on the 88Open committee). I don't recall four register
returns, but all my documentation from those days is boxed up. I think I
have a copy of the 88k PCC sources around somewhere...
[*] Binary Compatibility Standard. There was also an Object Compatibility
Standard (OCS) to support link-time compatibility between compiler vendors
(e.g. Unisoft, DG, Motorola, Unisys, Greenhills, Diab Data, et alia).
John Levine <[email protected]> writes:
According to MitchAlsup1 <[email protected]>:
Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers.
It's not any more difficult to pass, say, 4 arguments in registers if
you have 6 registers available than it is if you have 30 registers
available.
PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.
I think neither PDP-11 nor IA-32 has instructions that push, say, the
sum of two other registers, whereas at least IA-32 has an instruction
that computes the sum of two registers and puts it in a third
register.
Concerning the implicit memory access: it costs more than using
registers on all IA-32 implementations I am aware of, and I expect
that's also true of the PDP-11.
The C compilers at that time were not very sophisticated. They compiled
one statement at a time, and the only way to tell them to leave values
in registers was an explicit "register" declaration. Except in the most
trivial routines, it'd usually have to stash the argument in memory to
make room for something else, so there'd have been no benefit.
Many frequently-called library routines, such as strlen() or
memcpy()[1] can easily keep all their parameters, variables, and
intermediate results in 6 registers or less.
Therefore I expect that many of the frequently-called library routines compiled with PCC made extensive use of the register storage class.
In that scenario passing the arguments in registers avoids the cost of pushing them in the caller and the cost of loading them from memory at
the start of the callee.
As for the functions that do not use the register storage class for parameters, pushing or storing them at the start of the callee is not
slower than doing it right before the call, and it can lead to shorter
code.
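A routine of the kind meant here, written with the register storage class, might look like the following. This is a sketch in modern C, not PCC-era source; my_strlen is a hypothetical name chosen to avoid clashing with the library function.

```c
#include <stddef.h>

/* Length of a NUL-terminated string. The only working variable is the
   scanning pointer, declared "register" as one would have done for a
   compiler that allocated registers only on request. */
size_t my_strlen(const char *s)
{
    register const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The function needs two values (the start pointer and the scanning pointer), so with arguments passed in registers nothing would have to touch the stack apart from the return address.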
Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.
But Intel had a clean slate when they designed the Intel calling
convention for IA-32. When the 386 came out in 1985, Wulf et
al. [wulf+75] was a decade old, and Chaitin's graph-coloring paper was
4 years old, and the 386 typically had much more memory available than
Wulf et al. MIPS introduced a calling convention that passed 4 words
in registers shortly after, and Intel could have done so, too.
And it seems that they paid dearly for their decision, as I find lots
of documentation on alternative calling conventions for IA-32 and how
to tell the compiler about them.
@Book{wulf+75,
author = {William Wulf and Richard K. Johnsson and Charles
B. Weinstock and Steven O. Hobbs and Charles M.
Geschke},
title = {The Design of an Optimizing Compiler},
publisher = {Elsevier},
year = {1975},
isbn = {0-444-0164-6},
annote = {Describes a complete Bliss/11 compiler for the
PDP-11. It uses some interesting techniques: it
uses a (hand-constructed) tree parsing automaton for
parts of the code selection (Section~3.4); it
optimizes the use of unary complement operators
(Section~3.3); it uses a smart scheme to represent
a conservative approximation of the lifetime of
variables in constant space and uses that for
register allocation (Sections~4.1.3 and~4.3).}
}
This book cannot be praised enough, and it's celebrating its 50th
anniversary this year.
- anton
According to Anton Ertl <[email protected]>:
Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.
It was two passes each about 24K bytes and a third optional optimizer
that slightly rewrote the assembler code.
The Ritchie compiler and I think PCC reserved up to three registers
for declared register variables, and used the rest as a stack for
temporaries. It used Sethi-Ullman numbering to do the more complex
subexpressions first to avoid running out of registers. If it did
run out of registers I think it just gave up, but I don't ever
remember that happening.
Reserving more registers would have been really hard.
I agree that on the 386 it would probably have been practical to pass arguments in registers, but I suspect that for whatever reason they
wanted to make the calling sequence similar to the 8086 and 286.
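The Sethi-Ullman numbering mentioned above can be sketched in a few lines. This is the simplified textbook form in which every leaf needs one register, not a reconstruction of what the Ritchie compiler or PCC actually did; the expr type and both function names are hypothetical, and only binary operators are handled.

```c
#include <stddef.h>

/* Expression tree node: a leaf has no children. */
typedef struct expr {
    char op;                    /* operator character, 0 for a leaf */
    const struct expr *left;
    const struct expr *right;
} expr;

/* Number of registers needed to evaluate e without spilling. */
int su_label(const expr *e)
{
    if (e->left == NULL && e->right == NULL)
        return 1;                       /* a leaf is loaded into one register */
    int l = su_label(e->left);
    int r = su_label(e->right);
    if (l == r)
        return l + 1;                   /* one side's result is held while the other runs */
    return l > r ? l : r;               /* do the needier side first, reuse its registers */
}

/* Nonzero if the left operand should be evaluated before the right. */
int su_left_first(const expr *e)
{
    return su_label(e->left) >= su_label(e->right);
}
```

For (a+b)*(c+d) the label is 3, while the right-leaning a+(b+(c+d)) needs only 2 if the deeper subtree is evaluated first, which is the point of doing the more complex subexpression first.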
parameters on the stack, and we still have this in the Intel calling convention for IA-32.
Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns. But struct returns were so rare in libraries that gcc added an option -freg-struct-return which returns
small structs in registers, and this option used to be usable, because libraries or system calls did not use struct-returns at the time.
Eventually, ABI specifications went for more efficient, but also more
complex and less forgiving calling conventions, so on AMD64 without -fpcc-struct-return gammaresult is actually returned in registers,
leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
13 instructions for call_alsup (shorter than call_ertl).
Then the habit of decent ABI's could have continued
when new architectures were developed.
It seems to me that that's what happened (except that it was not a continuation): When new architectures were introduced, ABIs were
introduced that made use of the additional memory, but also took compatibility with existing practice into account.
E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).
As time progressed, calling conventions tried to keep stuff more in
registers and in the right kind of registers, at the cost of a more
complex implementation and breaking programs without prototypes.
E.g., the AMD64 ABI specifies register struct returns for small
structs.
Instead, many current ABI's are
at least sub-optimal for structs
Which ones do you have in mind?
I agree that on the 386 it would probably have been practical to pass arguments in registers, but I suspect that for whatever reason they
wanted to make the calling sequence similar to the 8086 and 286.
On Sat, 12 Jul 2025 15:25:43 +0000, Anton Ertl wrote:
John Levine <[email protected]> writes:
According to MitchAlsup1 <[email protected]>:
Given that PDP-11 had 6 general purpose useable registers, and x86 started out with similar, it would have been quite difficult to
pass the first few arguments in registers.
It's not any more difficult to pass, say, 4 arguments in registers if
you have 6 registers available than it is if you have 30 registers
available.
There comes a point where it becomes harder than the compilers of that
era could perform--for example, consider an expression to be passed
as an argument that requires 3 registers to compute.
If you only have
6 registers and you want to pass 4 in registers, you might have to
calculate several arguments, push them on the stack, then calculate
the last one (3 registers) into the right register, then pop the others
off the stack in order to perform the call.
I think neither PDP-11 nor IA-32 has instructions that push, say, the
sum of two other registers, whereas at least IA-32 has an instruction
that computes the sum of two registers and puts it in a third
register.
Neither was a non-destructive register model (a = b + c); both were
a destructive model (a = a + b).
Concerning the implicit memory access: it costs more than using
registers on all IA-32 implementations I am aware of, and I expect
that's also true of the PDP-11.
Time: yes, instruction space: somewhat--but you had (r5) and (r5)+
and @(r5)+ and -(r5) and @-(r5) which cost no space but did cost time.
The key metrics are not, I think, instruction counts - but memory
accesses and how likely they are to cause delays.
As you show, having a pointer to "int * signgam" means that there will
be only one extra write to memory (in the callee) and one extra read (in
the caller) - while for a "pcc-struct-return" API you have two. However,
those will be adjacent and probably combined.
In theory, even if a struct return needs to pass a hidden pointer, the
compiler knows more about it than for a general "int *" pointer
parameter. It knows that there are no aliasing issues or "escapes" -
when you have a local variable whose address is passed on to
"lgamma_ertl", the compiler has to assume that the function might store
the address and later functions might use it to change the value of the
local variable "sign". With the hidden struct pointer, the compiler
knows that access via the pointer is much more restricted.
(With C23, a function like "lgamma_ertl" would be marked
[[unsequenced]], or at least [[reproducible]], which would let the
compiler make similar assumptions for optimisation.)
Would struct returns have been used more if they were not so
inefficient?
E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).
Instead, many current ABI's are
at least sub-optimal for structs
Which ones do you have in mind?
The architecture that is most relevant for my daily work, and where
efficiency matters to me, is 32-bit ARM for embedded systems.
I realise 32-bit ARM was around before much of this was relevant (I
first played with ARM assembly in 1988 as a schoolkid). But it is
surely possible to modernise things a little?
I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s)
John Levine <[email protected]> writes:
According to MitchAlsup1 <[email protected]>:
Given that PDP-11 had 6 general purpose useable registers, and x86 started out with similar, it would have been quite difficult to
pass the first few arguments in registers.
It's not any more difficult to pass, say, 4 arguments in registers if
you have 6 registers available than it is if you have 30 registers
available.
PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.
I think neither PDP-11 nor IA-32 has instructions that push, say, the
sum of two other registers, whereas at least IA-32 has an instruction
that computes the sum of two registers and puts it in a third
register.
I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":
"If you know what bubble sort is, wipe it from your mind; if you don't
know, make a point of never finding out!"
Those compilers were so space constrained that they compiled a statement at a
time, keeping only a stack of open loops so they knew where to jump back to. For
a procedure call it evaluated each argument expression and pushed it. Trying to
figure out which registers might be available for what was way beyond what it
could do.
The BLISS-11 compiler might have done more clever register allocation
but it ran on a PDP-10 which could address the equivalent of a
megabyte, not the 11's 64K.
PDP-11 instructions were all one or two operand, with all operands being fully
general.
Anton Ertl <[email protected]> schrieb:
I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s)
I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":
"If you know what bubble sort is, wipe it from your mind; if you
don't know, make a point of never finding out!"
David Brown <[email protected]> writes:
The key metrics are not, I think, instruction counts - but memory
accesses and how likely they are to cause delays.
And one might also wonder what hardware one should look at. AMD64
does not use pcc-struct-returns by default, so finding out in how many
cases 0-cycle store-to-load forwarding (implemented in recent cores) eliminates the delays does not tell us the performance characteristics
on hardware that mostly executed IA-32 code where pcc-struct-returns
are the default.
As you show, having a pointer to "int * signgam" means that there will
be only one extra write to memory (in the callee) and one extra read (in
the caller) - while for a "pcc-struct-return" API you have two. However,
those will be adjacent and probably combined.
The stores go separately to the store units (and consume the resources there), and the stores are to write-back cache, not write-combining
memory. The loads go separately to the load units and consume the
resources there; no combining happens. The data will be in the
D-cache in the usual case, and on recent hardware there could even be
0-cycle store-to-load-forwarding.
If you are thinking about autovectorization by the compiler, yes, that
could happen, but IMO it costs more than it buys.
I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s) by auto-vectorizing the adjacent accesses of bubble-sort. Not only does
the code execute significantly more instructions, it also hits a slow hardware path in store-to-load-forwarding on every store it performs
in this way.
But even without this slow path, my expectation is that the auto-vectorization overhead would slow the benchmark down compared to
the -O1 version (which is just scalar code), but how could I measure
this?
The slow path should not occur in the struct-return case, though.
Another combining idea is the use of ARM A64's store pair and load
pair instructions, which result in only one memory access for each
such instruction and result in fewer instructions than doing unpaired
loads and stores, while the code resulting from auto-vectorization on
AMD64 is longer than two scalar stores and two scalar loads.
Unfortunately, store-pair and load-pair do not support storing or
loading an FP and an integer value AFAIK.
In theory, even if a struct return needs to pass a hidden pointer, the
compiler knows more about it than for a general "int *" pointer
parameter. It knows that there are no aliasing issues or "escapes" -
when you have a local variable whose address is passed on to
"lgamma_ertl", the compiler has to assume that the function might store
the address and later functions might use it to change the value of the
local variable "sign". With the hidden struct pointer, the compiler
knows that access via the pointer is much more restricted.
(With C23, a function like "lgamma_ertl" would be marked
[[unsequenced]], or at least [[reproducible]], which would let the
compiler make similar assumptions for optimisation.)
You mean that the programmer could mark the function in that way?
Wouldn't some use of "restrict" give the compiler similar information?
I just don't know where in the code to apply "restrict". Maybe
double lgamma_ertl2(double x, int *restrict signgam);
?
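To make the "escape" concern concrete, here is a minimal sketch. It assumes my reading of the C standard is right: restrict on a parameter constrains aliasing only while the callee runs, and does not by itself promise that the callee keeps no copy of the pointer. The names lgamma_escaping, poke_saved and saved are hypothetical and exist only for this illustration.

```c
static int *saved;  /* hypothetical global where the callee stashes the pointer */

/* Same interface as lgamma_ertl2 with restrict added; body is a stand-in. */
double lgamma_escaping(double x, int *restrict signgam)
{
    *signgam = -1;
    saved = signgam;    /* the address of the caller's local escapes here */
    return x + 1.0;
}

/* A later, unrelated call can still change the caller's variable. */
void poke_saved(void)
{
    if (saved)
        *saved = 1;
}
```

A caller that passes &sign and later calls poke_saved() can see sign change behind its back, so the compiler cannot keep sign in a register across such calls. The hidden struct-return pointer gives the callee no such opportunity at the source level.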
Would struct returns have been used more if they were not so
inefficient?
Possibly. I certainly remember wanting to use them for something Gforth-internal, and then deciding against them after seeing the
generated code.
E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).
I think it's more complicated: If the first parameter is an integer
one, then do everything in integer registers, otherwise pass FP stuff
in FP registers. Probably the idea is that varargs functions always
start with an integer parameter.
Later I saw a calling convention (IIRC Alpha) where parameter n was
passed in integer register n if it was integer and FP register n if it
was an FP value. The respective other register went unused.
Recently I have seen a calling convention (IIRC RISC-V) where the used
integer registers are allocated one after the other whether there were
FP parameters interleaved or not, and the same on the FP side. I
don't remember what happens if the call runs out of one kind of
register, and the other kind is still available.
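The two register-assignment policies recalled above can be modelled in a few lines. This is only a sketch of the descriptions in this post, not of any actual ABI document: the names kind, assign_positional and assign_sequential are made up, and running out of registers is deliberately not modelled.

```c
#include <stddef.h>

enum kind { K_INT, K_FP };   /* class of each parameter */

/* Positional policy (as recalled for Alpha): parameter n uses register n
   of its own class; register n of the other class goes unused. */
void assign_positional(const enum kind *params, size_t n, int *slot)
{
    (void)params;            /* the class does not affect the slot number */
    for (size_t i = 0; i < n; i++)
        slot[i] = (int)i;
}

/* Sequential policy (as recalled for RISC-V): each class has its own
   counter, so interleaved parameters do not leave holes. */
void assign_sequential(const enum kind *params, size_t n, int *slot)
{
    int next_int = 0, next_fp = 0;
    for (size_t i = 0; i < n; i++)
        slot[i] = (params[i] == K_INT) ? next_int++ : next_fp++;
}
```

For parameters (int, double, int, double) the positional policy yields slots 0, 1, 2, 3, while the sequential policy yields 0, 0, 1, 1, i.e. two integer and two FP registers.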
Instead, many current ABI's are
at least sub-optimal for structs
Which ones do you have in mind?
The architecture that is most relevant for my daily work, and where
efficiency matters to me, is 32-bit ARM for embedded systems.
ARM A32 (and T32 uses the same calling conventions) is from around the
same time as MIPS, so similar calling conventions are to be expected. However, I see various ABIs mentioned in the descriptions of various
things (eABI, oABI, etc.). So apparently they did several.
I realise 32-bit ARM was around before much of this was relevant (I
first played with ARM assembly in 1988 as a schoolkid). But it is
surely possible to modernise things a little?
Breaking compatibility has an immediate cost and (hopefully) a
long-term return. It's a really hard sell. But apparently ARM with
their several ABIs has gone there. Too little?
PDP-11 instructions were all one or two operand, with all operands being
fully general.
It's interesting that VAX generalized this to general three-address operations (and added a proper indexed mode), while the 68K and IA-32 architects decided to support only one memory operand for most
instructions (but with more addressing modes, including proper indexed addressing modes). For the 68k the limitation to one memory operand
for most instructions probably was not a matter of principle (it has a
move instruction that supports two memory operands); my guess is that
they decided that for encoding reasons.
- anton
Thomas Koenig <[email protected]> writes:
Anton Ertl <[email protected]> schrieb:
I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s)
I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":
"If you know what bubble sort is, wipe it from your mind; if you
don't know, make a point of never finding out!"
Unless you can prove that this kind of bad code generation by gcc can
only occur for bubble sort, this benchmark is a reason to ignore this
advice.
According to Anton Ertl <[email protected]>:
Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.
It was two passes each about 24K bytes and a third optional optimizer
that slightly rewrote the assembler code.
The Ritchie compiler and I think PCC reserved up to three registers
for declared register variables, and used the rest as a stack for temporaries. It used Sethi-Ullman numbering to do the more complex subexpressions first to avoid running out of registers. If it did
run out of registers I think it just gave up, but I don't ever
remember that happening.
Reserving more registers would have been really hard.
I agree that on the 386 it would probably have been practical to pass arguments in registers, but I suspect that for whatever reason they
wanted to make the calling sequence similar to the 8086 and 286.
Ease of adapting 16-bit compilers and library routines might have been reasons.
I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.
On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:
I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.
I remember this rather large (6:1 code size ratio) counterexample from
the VAX ...
On Mon, 14 Jul 2025 22:47:02 +0000, Lawrence D'Oliveiro wrote:
On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:
I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.
I remember this rather large (6:1 code size ratio) counterexample from
the VAX ...
As I remember::
[examples omitted]
POLY could be faster in instructions when there were enough terms for Estrin's method to pay dividends.
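For readers who have not met Estrin's method: it splits the polynomial into independent sub-expressions that can be evaluated in parallel and then combined with powers of x. A minimal sketch for a fixed degree of 7 follows; the function name poly7_estrin and the coefficient layout (a[0] is the constant term) are assumptions for illustration and have nothing to do with how POLY itself was implemented.

```c
/* Estrin evaluation of a[0] + a[1]*x + ... + a[7]*x^7. */
double poly7_estrin(const double a[8], double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    /* Four independent first-level terms. */
    double p01 = a[0] + a[1] * x;
    double p23 = a[2] + a[3] * x;
    double p45 = a[4] + a[5] * x;
    double p67 = a[6] + a[7] * x;
    /* Two independent second-level terms, then the final combine. */
    double lo = p01 + p23 * x2;
    double hi = p45 + p67 * x2;
    return lo + hi * x4;
}
```

The dependency chain is three multiply-add levels deep instead of the seven of a serial evaluation, at the cost of the extra multiplications for x2 and x4, which is why it only pays off with enough terms.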
VAX was admired and beloved for a decade, before sliding off into insignificance.
When I was doing 88100 at Motorola, the 68020 guys would say that
once there were sufficient resources, they could make a MOV-CALK
run just as fast as a 2-operand 1-result instruction model
68020
MOV D3,D2 // first 16-bits
CALK D3,D1 // 32-bits
88100
CALK D3,D2,D1 // 32-bits
I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.
But I'm sure that you have filled out a PR, because you are such
a constructive person bent on helping others instead of whining.
We have been over that before: I have reported gcc bugs in the past, but
my experience in the last few decades is that it is not at all
constructive, but a waste of time. See, e.g., PR93811.
[email protected] (MitchAlsup1) writes:
When I was doing 88100 at Motorola, the 68020 guys would say that
once there were sufficient resources, they could make a MOV-CALK
run just as fast as a 2-operand 1-result instruction model
68020
MOV D3,D2 // first 16-bits
CALK D3,D1 // 32-bits
88100
CALK D3,D2,D1 // 32-bits
That day arrived at the latest when Sandy Bridge was released in 2011
with its separate physical register files and register renamer. It
usually handles the register-register mov in the renamer, resulting in 0-cycle movs, especially in cases like these where the result of the
mov is overwritten soon.
Another option would be to let the decoder
combine the MOV and the CALK into one three-address microinstruction.
I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.
Intel apparently thinks so; they introduce three-address encodings for
the existing instructions with APX.
What is the advantage of APX over the register renamer approach? It
takes fewer resources in the register renamer (which is often the
narrowest part of a core).
What is the advantage of APX over combining the instructions in the
decoder? If the CALK part traps (e.g, because it includes a memory
access), the architecture requires that the exception handler is
presented with the architectural state between the MOV and the CALK,
and this requires additional complications, while an architectural three-address instruction does not have this complication.
IIRC there are code size advantages to the APX three-address encodings
over the MOV-CALK combination in some, but not all cases.
- anton
Thomas Koenig <[email protected]> writes:
But I'm sure that you have filled out a PR, because you are such
a constructive person bent on helping others instead of whining.
We have been over that before: I have reported gcc bugs in the past,
but my experience in the last few decades is that it is not at all constructive, but a waste of time. See, e.g., PR93811.
But if you think that it is useful, spend your own time on it. In the meantime I still amuse myself by making fun of gcc and clang failures.
- anton
On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:
POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.
The problem with polynomial evaluation is, at least in the examples I came across in my numerical-analysis courses, evaluation terminated much more commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a predetermined number of terms. So it didn’t seem that useful in real life.
Lawrence D'Oliveiro wrote:
On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:
POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.
The problem with polynomial evaluation is, at least in the examples I
came across in my numerical-analysis courses, evaluation terminated much
more commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a
predetermined number of terms. So it didn’t seem that useful in real
life.
You obviously have never implemented any fp library:
When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use fixed-number-of-term polys.
Terje
On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:
POLY could be faster in instructions when there were enough terms
for Estrin's method to pay dividends.
The problem with polynomial evaluation is, at least in the
examples I came
across in my numerical-analysis courses, evaluation terminated
much more commonly based on convergence to the final result, not
on some predetermined number of terms. But the VAX instruction
only did a predetermined number of terms. So it didn’t seem that
useful in real life.
You obviously have never implemented any fp library:
When you write code for things like
log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use fixed-number-of-term polys.
Certainly when following Cody and Waite or J.M. Muller. But there are
ways of implementing the same list as above by testing whether the
significance has leveled off and exiting early. It is generally slower
in the worst case and not much faster in the typical case--but it is a
method taught in Numerical Methods classes.
Terje
On Tue, 15 Jul 2025 17:44:19 +0000
[email protected] (MitchAlsup1) wrote:
On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:
POLY could be faster in instructions when there were enough terms
for Estrin's method to pay dividends.
The problem with polynomial evaluation is, at least in the
examples I came
across in my numerical-analysis courses, evaluation terminated
much more commonly based on convergence to the final result, not
on some predetermined number of terms. But the VAX instruction
only did a predetermined number of terms. So it didn’t seem that
useful in real life.
You obviously have never implemented any fp library:
When you write code for things like
log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use
fixed-number-of-term polys.
Certainly when following Cody and Waite or J.M. Muller. But there are
ways of implementing the same list as above by testing whether the
significance has leveled off and exiting early. It is generally slower
in the worst case and not much faster in the typical case--but it is a
method taught in Numerical Methods classes.
Terje
You mean, to sum starting from bigger terms to smaller terms?
Something like:
double sum = a[0];
double xx = x;
for (int i = 1; ; ++i) {
  double sum1 = sum + xx * a[i];
  if (sum == sum1)  /* adding this term no longer changes the sum */
    break;
  sum = sum1;
  xx *= x;
}
That is the worst possible order of evaluation from perspective of
precision.
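For contrast, the fixed-number-of-terms form that math libraries typically use can be sketched with Horner's rule, which starts at the highest-order coefficient so that the smallest contributions are accumulated first (for the usual case of small |x| and decaying coefficients). The name poly_horner and the coefficient layout (a[0] is the constant term, n >= 1) are assumptions for illustration, not code from any particular library.

```c
#include <stddef.h>

/* Horner evaluation of a[0] + a[1]*x + ... + a[n-1]*x^(n-1), n >= 1. */
double poly_horner(const double *a, size_t n, double x)
{
    double sum = a[n - 1];
    /* Walk the remaining coefficients from a[n-2] down to a[0]. */
    for (size_t i = n - 1; i-- > 0; )
        sum = sum * x + a[i];
    return sum;
}
```

The trip count is known in advance, there is no data-dependent exit, and each step is a single multiply-add.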
On Tue, 15 Jul 2025 13:46:10 +0200, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:
POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.
The problem with polynomial evaluation is, at least in the examples I came
across in my numerical-analysis courses, evaluation terminated much more
commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a
predetermined number of terms. So it didn’t seem that useful in real life.
You obviously have never implemented any fp library:
When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you
pretty much always use fixed-number-of-term polys.
Computing π to a given precision: <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.
Conclusion: What is the value of continued fractions?
Clearly mathematicians have a lot of fun with them. But speaking as
someone who does computation on a daily basis, I have to say I don’t
think they’re a practical way of evaluating anything. Maybe I’m wrong,
and someone who has delved more deeply into them can offer better
examples of how to use them ...
Lawrence D'Oliveiro wrote:
Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.
If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.
On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.
If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.
Quote:
Or compare this function, adapted from the recipes section of the
decimal module documentation:
[code omitted -- see reference]
As you can see, this converges a lot quicker.
Terje Mathisen <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:
Another, somewhat important consideration:
If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.
For security purposes, all instruction timing must be data independent.
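At the source level, the usual way to respect that requirement is to avoid branches and early exits that depend on secret data. A minimal sketch of the idea follows; ct_equal is a hypothetical name, and nothing in the C standard guarantees that a compiler or a given core keeps the timing data-independent, so this only illustrates the coding style.

```c
#include <stddef.h>

/* Compare two buffers without a data-dependent early exit.
   Returns 1 if all n bytes are equal, 0 otherwise. */
int ct_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= (unsigned char)(a[i] ^ b[i]);  /* accumulate any difference */
    return diff == 0;
}
```

The loop always touches all n bytes, unlike memcmp-style code that may stop at the first mismatch; it still relies on the underlying instructions (here XOR and OR) having data-independent latency.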
Lawrence D'Oliveiro wrote:
On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.
If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.
Quote:
Or compare this function, adapted from the recipes section of the
decimal module documentation:
[code omitted -- see reference]
As you can see, this converges a lot quicker.
Another, somewhat important consideration:
If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.
This was definitely a requirement for the Mill fp emulation work I did.
Terje
On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.
If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.
Quote:
Or compare this function, adapted from the recipes section of the
decimal module documentation:
[code omitted -- see reference]
As you can see, this converges a lot quicker.
Another, somewhat important consideration:
If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.
Can I get your definition of "auto-vectorize"?
A wide-decode and a set of reservation stations can "vectorize" a
loop or straight line of code. Does this qualify as "auto-vectorize" ??
Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
directive.
This was definitely a requirement for the Mill fp emulation work I did.
Given that there are a few instructions which can have variable latency
and a spattering that HAVE TO HAVE variable latency this requirement
causes "problems".
In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
at cycle 12, and it took 5 more cycles to KNOW that the result was
properly rounded (all RMs). So, instead of having FDIV have 17 cycle
latency, we allowed it to have 12 cycles of latency 87.5% of the time
and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
This is usefully faster than fixed 17 cycles.
The same argument applies to SQRT.
Any LD instruction backed by a cache HAS TO HAVE variable latency.
Any memory ref with a translated address HAS TO HAVE variable
latency (TLB miss).
Store instruction waiting on long latency result data HAS TO HAVE
variable latency between AGEN and Write.
MitchAlsup1 wrote:
On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.
If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.
Quote:
Or compare this function, adapted from the recipes section of the
decimal module documentation:
[code omitted -- see reference]
As you can see, this converges a lot quicker.
Another, somewhat important consideration:
If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.
Can I get your definition of "auto-vectorize"?
A wide-decode and a set of reservation stations can "vectorize" a
loop or straight line of code. Does this qualify as "auto-vectorize" ??
Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
directive.
This was definitely a requirement for the Mill fp emulation work I did.
Given that there are a few instructions which can have variable latency
and a spattering that HAVE TO HAVE variable latency this requirement
causes "problems".
Yeah, I do know that. Memory ops in SIMD-style short vectors typically
have all slots residing in the same cache line, so even though the
latency is not predictable, it will probably be the same for all
elements.
In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
at cycle 12, and it took 5 more cycles to KNOW that the result was
properly rounded (all RMs). So, instead of having FDIV have 17 cycle
latency, we allowed it to have 12 cycles of latency 87.5% of the time
and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
This is usefully faster than fixed 17 cycles.
So if 87.5% of all divisions finish in 12 cycles, and you do 8 of them
in parallel, then (for random inputs), all 8 will finish in 12 with a
34% probability, leaving 17 cycles as the actual latency in 66% of all
cases. Total average latency becomes 15.3 cycles, so most of the gain is lost.
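The arithmetic above can be checked with a few lines of C. This is only a sketch under the same assumptions as the post: each lane independently takes the fast path with probability 0.875, and a lockstep group finishes only when its slowest lane does. The function name `lockstep_latency` is made up for illustration.

```c
#include <assert.h>

/* Expected latency, in cycles, of `lanes` independent operations issued in
   lockstep: the group takes `fast_cycles` only if every lane takes the fast
   path (probability p_fast per lane), otherwise `slow_cycles`. */
static double lockstep_latency(int lanes, double p_fast,
                               double fast_cycles, double slow_cycles)
{
    assert(lanes >= 1 && p_fast >= 0.0 && p_fast <= 1.0);

    double all_fast = 1.0;              /* becomes p_fast^lanes */
    for (int i = 0; i < lanes; i++)
        all_fast *= p_fast;

    return all_fast * fast_cycles + (1.0 - all_fast) * slow_cycles;
}
```

With these inputs, `lockstep_latency(1, 0.875, 12.0, 17.0)` gives 12.625 cycles, and `lockstep_latency(8, 0.875, 12.0, 17.0)` gives about 15.28 cycles, since 0.875^8 is roughly 0.344.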
The same argument applies to SQRT.
Any LD instruction backed by a cache HAS TO HAVE variable latency.
Any memory ref with a translated address HAS TO HAVE variable
latency (TLB miss).
Store instruction waiting on long latency result data HAS TO HAVE
variable latency between AGEN and Write.
I don't think we disagree, Mitch. I'm just stating that if you have a
lockstep programming model, then variable latency per slot tends to end
up with worst-case latency all over, so if you could have done the Mc
88K FDIV in a fixed 16 cycles, that might have been better for this
particular programming model.
Terje
On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:
David Brown <[email protected]> writes:
<snip>
It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient.
Given the name of the calling convention variant, this was introduced
in PCC (and probably struct returns themselves were introduced in
PCC). PCC was released in 1979 on the machines of the day, such as
the PDP-11; I am sure Johnson implemented a calling convention for
struct passing and struct returns that used the least amount of code.
If Johnson had had more space to play with, he probably would have had
other things on the agenda before improving the struct return calling
convention. E.g., the calling conventions at the time passed all
parameters on the stack, and we still have this in the Intel calling
convention for IA-32.
Given that the PDP-11 had 6 usable general-purpose registers, and x86
started out with a similar number, it would have been quite difficult to
pass the first few arguments in registers. On the PDP-11 and x86 it was
easy to push arguments onto the stack and to address them from the
stack in the callee.
[1] I have wondered about the selection of registers in the calling
convention of the System V ABI for AMD64: the first 6
arguments go in RDI, RSI, RDX, RCX, R8, R9. The first two are optimal
for memcpy() implemented with REP MOVSB, but then RCX would be better
in third position. RDI is also good for memset() with REP STOSB, RDI
and RSI are also good for memcmp() with REP CMPSB, and I expect that
there are other uses of REP instructions for implementing memory-block
or string functions where the placement in RDI and RSI is
helpful. Except that the library routines then often do not use the
REP instructions.
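To make the register point concrete, here is a sketch of a byte copy built on REP MOVSB using GCC/Clang extended asm. The helper name `copy_bytes` is hypothetical, and the code assumes the direction flag is clear on entry, as the System V ABI specifies. The constraints "D", "S" and "c" name RDI, RSI and RCX, the registers the instruction requires. Compiled as a standalone function under the System V AMD64 convention, dst and src already arrive in RDI and RSI, while the count arrives in RDX and has to be moved into RCX first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy n bytes from src to dst, returning dst. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* REP MOVSB copies RCX bytes from [RSI] to [RDI]. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}
```

Had the count been the third register argument in RCX, the x86-64 path of a standalone function would need no register shuffling at all, which is the observation made above.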
Watcom C for 386 offered a register passing convention; IIRC the first 3
integer (or equivalent) arguments were passed in registers.
And this convention gave a measurable speedup compared to the standard
convention.