Forum: >>> Magnum BBS <<<

Speculation from the past

From MitchAlsup1@21:1/5 to All on Thu Jul 10 15:50:38 2025

Does anyone know why libm defines

double lgamma(x) { }
with an extern int signgam

instead of:

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma( double x );

Struct returns from subroutines were part of C back in 1980...
{when I started using C}

One could add to this discussion as to why errno was not
done with struct return; ala::

typedef struct { int fides;
int error; } openresult;

openresult open( char *string, int modes );

as are many of the Linux OS entry points.

This has to be a better solution compared to errno and signgam.

Speculations welcome.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to [email protected] on Thu Jul 10 16:58:46 2025

[email protected] (MitchAlsup1) writes:

Does anyone know why libm defines

double lgamma(x) { }
with an extern int signgam

instead of:

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma( double x );

Struct returns from subroutines were part of C back in 1980...
{when I started using C}

Possibly, but they were not part of early C, are not particularly
efficient on many ABIs, and are inconvenient to use if you want to use
all the components of the struct. So there were lots of reasons why
API designers avoided the use of struct returns. An alternative would
have been

double lgamma(double gamma, int *signgam);

however.

One could add to this discussion as to why errno was not
done with struct return

Struct return did not exist in early C, so C wrappers for system calls
(which existed from the start) do not use it.

However, the actual system call interface does not have errno, but
either returns the result in one register, or in one register and a
flag (IIRC the carry flag in some system call interfaces I looked at).
If the result is returned in one register, the usual indication of an
error is that the sign bit is set; in that case the value of the
register is the negated error number. For a separate flag, the value
of the register is the error number. If you look at the original
system calls of Unix, the limitation to positive numbers is not a
problem.

To a large degree, that is still the case, although, e.g., mmap() on a
32-bit system can return a negative address, so the condition for an
error of mmap() is a little bit more complicated than just checking
the sign bit.
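
A minimal sketch of why the sign-bit test is not enough on a 32-bit system, assuming the Linux convention that only raw results in the range -4095..-1 are negated error numbers. The constant and helper names are illustrative and not taken from any header; the 32-bit register is modelled with uint32_t so the code behaves the same on any host:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: error numbers occupy -MAX_ERRNO..-1, as on Linux. */
#define MAX_ERRNO 4095u

/* Naive check: treats any value with the sign bit set as an error. */
static inline bool is_error_signbit(uint32_t raw)
{
    return (raw & 0x80000000u) != 0;
}

/* Range check: only the top MAX_ERRNO values of the register are errors,
   so a high mmap() address is still recognised as a valid result. */
static inline bool is_error_range(uint32_t raw)
{
    return raw >= (uint32_t)(0u - MAX_ERRNO);
}

/* Recover the positive error number from a raw value known to be an error. */
static inline int raw_to_errno(uint32_t raw)
{
    return (int)(0u - raw);
}
```

With these helpers, an address such as 0xB7000000 is misclassified by is_error_signbit() but accepted by is_error_range().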

In any case, these days errno is a perversity kept alive by backwards compatibility: The C wrapper for the system call has to check whether
there is an error, then has to compute the error number and
expensively store it to the thread-local storage where errno resides.
Then the caller tests the return value of the C wrapper for indicating
an error, and then accesses errno expensively in thread-local storage.
If the C wrapper directly returned the return value of the system
call, with some macros for finding out if there is an error and what
the errno is, the whole system call would be more efficient.
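
A rough sketch of the two styles, assuming a raw system call result in a long where -4095..-1 are negated error numbers (Linux style). All three function names are hypothetical; no libc exposes them under these names:

```c
#include <errno.h>

/* Assumption: raw results in -4095..-1 are negated error numbers. */
static inline int raw_is_error(long raw)
{
    return (unsigned long)raw >= (unsigned long)-4095L;
}

/* Positive error number for a raw result that is an error, else 0.
   This is the "macro" style: no store to errno is needed. */
static inline int raw_error(long raw)
{
    return raw_is_error(raw) ? (int)-raw : 0;
}

/* What a classic wrapper does with the raw result: check it, store the
   error number into thread-local errno, and return -1. */
static inline long classic_wrapper_result(long raw)
{
    if (raw_is_error(raw)) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

A caller of the first style tests raw_is_error() and reads raw_error() from the same value; a caller of the second style tests for -1 and then reads errno.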

You might wonder about the architectures that use the carry flag to
indicate that there is an error. But given that all maintained OSs
for these architectures have to also work on architectures that do not
pass the error indication in that way, I expect that the C wrapper
could transform that into the variant that uses the same error
indication as on the architectures that do not use the carry bit.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to [email protected] on Thu Jul 10 17:37:55 2025

[email protected] (MitchAlsup1) writes:

Does anyone know why libm defines

double lgamma(x) { }
with an extern int signgam

instead of:

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma( double x );

Because the original lgamma was defined long before
the committee added the 'signgam' feature, which was
defined before pthreads was adopted from 1003.4 into
XPG.

The committee doesn't change existing function
definitions in order to avoid breaking existing applications,
so the extern was added. In retrospect, given the
subsequent adoption of pthreads, it would have been better
to create a new interface, not named 'lgamma' to support
returning the sign value.

From David Brown@21:1/5 to Anton Ertl on Thu Jul 10 21:18:05 2025

On 10/07/2025 18:58, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

Does anyone know why libm defines

double lgamma(x) { }
with an extern int signgam

instead of:

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma( double x );

Struct returns from subroutines were part of C back in 1980...
{when I started using C}

Possibly, but they were not part of early C, are not particularly
efficient on many ABIs, and are inconvenient to use if you want to use
all the components of the struct. So there were lots of reasons why
API designers avoided the use of struct returns. An alternative would
have been

double lgamma(double gamma, int *signgam);

however.

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the typical simplistic struct return here would be roughly equivalent to :

void lgamma(gammaresult * result, double gamma);

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient. Then the habit of decent ABI's could have continued
when new architectures were developed. Instead, many current ABI's are
at least sub-optimal for structs - a particular pain for C++.

From MitchAlsup1@21:1/5 to David Brown on Thu Jul 10 21:30:22 2025

On Thu, 10 Jul 2025 19:18:05 +0000, David Brown wrote:

On 10/07/2025 18:58, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

Does anyone know why libm defines

double lgamma(x) { }
with an extern int signgam

instead of:

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma( double x );

Struct returns from subroutines were part of C back in 1980...
{when I started using C}

Possibly, but they were not part of early C, are not particularly
efficient on many ABIs, and are inconvenient to use if you want to use
all the components of the struct. So there were lots of reasons why
API designers avoided the use of struct returns. An alternative would
have been

double lgamma(double gamma, int *signgam);

however.

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution.

Given that one has to
a) look at something (return value or flag)
b) and if set "bad" go find errno
c) set errno to negative of value
d) return
e) look at return value
f) if "bad" go find errno
g) read errno
h) go do something about it

I think it is easy to make the argument that structure returns
are almost always less expensive, as:

a) return 2 values
b) if second value is "bad"
c) go do something about it

And this is thread safe, too.
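
Seen from the caller's side, the three steps above might look like the following sketch. It assumes a POSIX system; openresult follows the typedef proposed earlier in the thread, and open_result() is a hypothetical name. Because it is layered on the existing open(), it still goes through errno internally, so it only illustrates the interface, not the cost saving:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>

/* Hypothetical result type, following the openresult idea in the thread. */
typedef struct {
    int fd;     /* valid descriptor, or -1 on failure */
    int error;  /* 0 on success, otherwise an E* number */
} openresult;

/* Hypothetical wrapper: the error travels in the return value, so the
   caller never consults errno. Only for flags without O_CREAT, since no
   mode argument is passed. */
static inline openresult open_result(const char *path, int flags)
{
    openresult r;
    r.fd = open(path, flags);
    r.error = (r.fd < 0) ? errno : 0;
    return r;
}
```

A caller writes `openresult r = open_result(path, O_RDONLY); if (r.error) ...` and both values live in locals of the calling thread.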

After all, the typical simplistic struct return here would be roughly equivalent to :

void lgamma(gammaresult * result, double gamma);

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient. Then the habit of decent ABI's could have continued
when new architectures were developed. Instead, many current ABI's are
at least sub-optimal for structs - a particular pain for C++.

Do you think it is time to make another layer of wrappers::
// for illustrative purposes

typedef struct { int first, second; } two_returns;

two_returns new_open( char *string, int flags );   // defined below

fides open( char *string, int flags )
{
    two_returns old = new_open( string, flags );
    if( old.second )
    {
        errno = -old.second;
        old.first = -1;
    }
    return (fides)old.first;
}

enum System_Calls { ..., file_open, ... };

two_returns new_open( char *string, int flags )
{
    return SYSCALL( string, flags, file_open );
}

This results in a system call that is easily inlined by the compiler and
results in 2 or 3 instructions in many new architectures, instead of "lots"
including additional control transfers (call and return) along with
accessing errno (signgam), ...

One would not want to inline the old way. So, now we can let the compiler
inline SYSCALLs with reasonable safety.

From Lawrence D'Oliveiro@21:1/5 to All on Thu Jul 10 22:10:56 2025

On Thu, 10 Jul 2025 15:50:38 +0000, MitchAlsup1 wrote:

One could add to this discussion as to why errno was not
done with struct return; ala::

typedef struct { int fides;
int error; } openresult;

openresult open( char *string, int modes );

as are many of the Linux OS entry points.

This has to be a better solution compared to errno and signgam.

The actual Linux kernel entry point for open(2) returns a non-negative FD
number on success, and a negative error code on failure. Other calls do
similar things; it is the C runtime library wrapper that implements errno
(as defined by the C and POSIX APIs); it is not something the kernel knows
(or cares) about.

errno is a hack. It can’t even be treated as a simple global variable, because of the interaction with multithreading -- each thread has to have
its own errno.

There are some Linux-specific kernel calls where the userland API doesn’t even bother going through errno.

From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Fri Jul 11 02:03:20 2025

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 10 Jul 2025 15:50:38 +0000, MitchAlsup1 wrote:

One could add to this discussion as to why errno was not
done with struct return; ala::

typedef struct { int fides;
int error; } openresult;

openresult open( char *string, int modes );

as are many of the Linux OS entry points.

This has to be a better solution compared to errno and signgam.

The actual Linux kernel entry point for open(2) returns a non-negative FD
number on success, and a negative error code on failure.

Irrelevant.

The open(2) API and errno mechanism was defined in very early unix a half century ago.

It was standardized in the System V Interface Definition (SVID) in the
1980s and in POSIX a few years later, followed by the X Portability
Guide (XPG) and finally the Single Unix specification. In all cases
backward compatibility at the source level was a requirement.

Extensions and new capabilities related to opening a file are encapsulated
in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

Yes, there are likely different possible APIs; all new standardized Unix C APIs
(e.g. posix_spawn, pthreads, et alia) return the E* error number directly
(for thread safety) or zero for success eschewing errno completely. Any other
data returned by an API is via pointer parameters (often with 'restrict' qualification).
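
As a small illustration of that convention, here is a sketch using posix_memalign(), which is not mentioned in the post but is a POSIX function that returns the error number directly and hands back its data through a pointer parameter. The wrapper name alloc_aligned() is hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>

/* Returns 0 on success or an E* number on failure; the buffer comes back
   through the pointer parameter. errno is never consulted. */
static inline int alloc_aligned(void **out, size_t alignment, size_t size)
{
    int err = posix_memalign(out, alignment, size);
    if (err != 0)
        *out = NULL;   /* do not rely on *out after a failed call */
    return err;
}
```

An alignment that is not a power of two (for example 3) yields EINVAL as the return value.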

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Jul 11 02:48:33 2025

On Thu, 10 Jul 2025 16:58:46 GMT, Anton Ertl wrote:

In any case, these days errno is a perversity kept alive by backwards compatibility: The C wrapper for the system call has to check whether
there is an error, then has to compute the error number and
expensively store it to the thread-local storage where errno resides.

On the assumption that error conditions are less common than success, the
fact that errno retains its previous value on the success case helps
reduce the cost.
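
The flip side of errno keeping its old value is that a caller who relies on errno alone has to clear it first. A minimal sketch with strtol(), where parse_long() is a hypothetical helper name:

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal long; returns 0 on success or an E* number on failure.
   errno is cleared first because a successful strtol() does not reset it. */
static inline int parse_long(const char *text, long *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0)
        return errno;      /* e.g. ERANGE on overflow */
    if (end == text)
        return EINVAL;     /* no digits were consumed */
    *out = value;
    return 0;
}
```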

From Dan Cross@21:1/5 to Scott Lurndal on Fri Jul 11 11:03:32 2025

In article <Ih_bQ.958133$[email protected]>,
Scott Lurndal <[email protected]> wrote:

The open(2) API and errno mechanism was defined in very early unix a half century ago.

It was standardized in the System V Interface Definition (SVID) in the
1980s and in POSIX a few years later, followed by the X Portability
Guide (XPG) and finally the Single Unix specification. In all cases
backward compatibility at the source level was a requirement.

Extensions and new capabilities related to opening a file are encapsulated
in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

Yes, there are likely different possible APIs; all new standardized Unix C APIs
(e.g. posix_spawn, pthreads, et alia) return the E* error number directly
(for thread safety) or zero for success eschewing errno completely. Any other
data returned by an API is via pointer parameters (often with 'restrict' qualification).

POSIX mandates that `errno` be (essentially) thread-local, so
thread safety isn't much of a consideration here. Traditionally
Unix kernels have returned a single value in a register, and set
a flag (in the PSW or whatever) to indicate failure, leaving it
to the syscall stubs in e.g. the C library to take whatever the
kernel gives back from the actual syscall exit and make sure
that `errno` is set appropriately.

I can imagine that a kernel call interface where `errno` is not
set is a bit more direct, but I don't think concurrency plays a
huge role there; but maybe these interfaces were designed in
that awkward time before `errno` was thread safe by mandate.
And the case of `posix_spawn` might be special, since it is so
often written in terms of `vfork`, which has its own bizarre
semantics.

- Dan C.

From David Brown@21:1/5 to All on Fri Jul 11 12:53:36 2025

On 10/07/2025 23:30, MitchAlsup1 wrote:

On Thu, 10 Jul 2025 19:18:05 +0000, David Brown wrote:

On 10/07/2025 18:58, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

Does anyone know why libm defines

double lgamma(x) { }
with an extern int signgam

instead of:

typedef struct { double result;
                 int    sign; } gammaresult;

gammaresult lgamma( double x );

Struct returns from subroutines were part of C back in 1980...
{when I started using C}

Possibly, but they were not part of early C, are not particularly
efficient on many ABIs, and are inconvenient to use if you want to use
all the components of the struct. So there were lots of reasons why
API designers avoided the use of struct returns. An alternative would
have been

double lgamma(double gamma, int *signgam);

however.

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution.

Given that one has to
a) look at something (return value or flag)
b) and if set "bad" go find errno
c) set errno to negative of value
d) return
e) look at return value
f) if "bad" go find errno
g) read errno
h) go do something about it

I think it is easy to make the argument that structure returns
are almost always less expensive, as:

a) return 2 values
b) if second value is "bad"
c) go do something about it

And this is thread safe, too.

Sure. I was comparing struct returns to pointer-to-return-value
functions. I agree that using errno is usually less efficient. (errno
can be thread-safe using thread-specific errno - but then it is even
more overhead in use.)

Where errno can be a good idea is if you are doing a lot of calculations
and then check errno once at the end. I don't know how often that is
done in practice.

To me, the real benefit of having functions return a struct rather than
use errno (or some other global variable) or take a
pointer-to-return-value parameter, is that the function becomes "pure".
The outputs depend solely on the inputs, and are consistent from call to
call, with no side-effects. Now you can re-arrange them like any
arithmetic code (with the same provisos about IEEE accuracy for floating point), pre-calculate results, skip duplicate calls, and do any other
kinds of manipulation that suits. And it is far easier to reason about
the correctness of code that has no side-effects.
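
A small sketch of what "pure" means here, using a hypothetical checked_div() instead of a maths library function so that it needs nothing beyond the C standard headers. The error codes are illustrative:

```c
#include <limits.h>

/* Hypothetical result type: value plus status, no global state involved. */
typedef struct {
    int quotient;
    int error;      /* 0 = ok, 1 = division by zero, 2 = overflow */
} divresult;

/* Pure: the result depends only on the arguments, and nothing outside the
   function is read or written, so duplicate calls can be merged and calls
   can be reordered freely. */
static inline divresult checked_div(int a, int b)
{
    divresult r = { 0, 0 };

    if (b == 0)
        r.error = 1;
    else if (a == INT_MIN && b == -1)
        r.error = 2;
    else
        r.quotient = a / b;
    return r;
}
```

An errno-based version of the same function would write to thread-local storage on failure, which is exactly the side effect that rules out such transformations.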

                                                          After all, the
typical simplistic struct return here would be roughly equivalent to :

    void lgamma(gammaresult * result, double gamma);

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient. Then the habit of decent ABI's could have continued
when new architectures were developed. Instead, many current ABI's are
at least sub-optimal for structs - a particular pain for C++.

Do you think it is time to make another layer of wrappers::
// for illustrative purposes

typedef struct { int first, second; } two_returns;

fides                  open( char *string, int flags )
{
    two_returns old = new_open( string, flags );
    if( old.second )
    {
        errno = -old.second;
        old.first = -1;
    }
    return (fides)old.first;
}

enum System_Calls { ..., file_open, ... };

two_returns new_open( char *string, int flags )
{
    return SYSCALL( string, flags, file_open );
}

This results in a system call that is easily inlined by the compiler and
results in 2 or 3 instructions in many new architectures, instead of "lots"
including additional control transfers (call and return) along with
accessing errno (signgam), ...

One would not want to inline the old way. So, now we can let the compiler
inline SYSCALLs with reasonable safety.

I think that for "big" functions - like most system calls - it's not
worth the effort from an efficiency viewpoint. And it does not make the function "pure". So for C, that would be a waste of time for something
like "open()".

For maths functions and similar code, on the other hand, it can make a
much bigger difference.

For C++, the difference in usability is significant. Handling struct
returns is somewhat inconvenient in C, though the C23 "auto" type
inference helps a bit. C++ has significantly better support, especially
if the struct types are std::expected<>, std::variant<> or
std::optional<>. But even with plain old structs, C++ has structured
binding and std::tie<> that make it all easier to use (especially with
the new anonymous _ in C++26). Add to that, C++ has been gaining
steadily more compile-time calculations (constexpr, consteval, and now
the beginnings of reflection) which cannot work with side-effect functions.

From Scott Lurndal@21:1/5 to Dan Cross on Fri Jul 11 13:40:22 2025

[email protected] (Dan Cross) writes:

In article <Ih_bQ.958133$[email protected]>,
Scott Lurndal <[email protected]> wrote:

The open(2) API and errno mechanism was defined in very early unix a half century ago.

It was standardized in the System V Interface Definition (SVID) in the
1980s and in POSIX a few years later, followed by the X Portability
Guide (XPG) and finally the Single Unix specification. In all cases
backward compatibility at the source level was a requirement.

Extensions and new capabilities related to opening a file are encapsulated
in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

Yes, there are likely different possible APIs; all new standardized Unix C APIs
(e.g. posix_spawn, pthreads, et alia) return the E* error number directly
(for thread safety) or zero for success eschewing errno completely. Any other
data returned by an API is via pointer parameters (often with 'restrict' qualification).

POSIX mandates that `errno` be (essentially) thread-local, so
thread safety isn't much of a consideration here. Traditionally
Unix kernels have returned a single value in a register, and set
a flag (in the PSW or whatever) to indicate failure, leaving it
to the syscall stubs in e.g. the C library to take whatever the
kernel gives back from the actual syscall exit and make sure
that `errno` is set appropriately.

I can imagine that a kernel call interface where `errno` is not
set is a bit more direct, but I don't think concurrency plays a
huge role there; but maybe these interfaces were designed in
that awkward time before `errno` was thread safe by mandate.

I was on the XPG working group in those years, and yes, they
were designed in that awkward time as 1003.4a was being
developed.

And the case of `posix_spawn` might be special, since it is so
often written in terms of `vfork`, which has its own bizarre
semantics.

posix_spawn was modeled somewhat after ADA process creation primitives.

The rationale is included in the standard page.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html

From Anton Ertl@21:1/5 to David Brown on Fri Jul 11 14:50:58 2025

David Brown <[email protected]> writes:

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to :

void lgamma(gammaresult * result, double gamma);

Let's see:

#include <stdio.h>

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma_alsup1( double x )
{
gammaresult r;
r.result = x+1.;
r.sign = -1;
return r;
}

double lgamma_ertl1(double x, int *signgam)
{
*signgam = -1;
return x+1.;
}

extern gammaresult lgamma_alsup2( double x );

void call_alsup()
{
gammaresult r=lgamma_alsup2(1.);
printf("%f ",r.result);
printf("%d ",r.sign);
}

extern double lgamma_ertl2(double x, int *signgam);

void call_ertl()
{
int sign;
printf("%f ",lgamma_ertl2(1.,&sign));
printf("%d ",sign);
}

Here the calls are to a differently-named function with the same
interface such that we see what happens without inlining. The first
thing to note is that the source code for the struct-returning
function is longer. The calling code is slightly longer.

I have compiled that on AMD64 with:

gcc -fpcc-struct-return -Wall -O -c lgamma.c

The output of "objdump -d lgamma.o" for lgamma_*1 is:

0000000000000000 <lgamma_alsup1>:
0: 48 89 f8 mov %rdi,%rax
3: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # b <lgamma_alsup1+0xb>
a: 00
b: f2 0f 11 07 movsd %xmm0,(%rdi)
f: c7 47 08 ff ff ff ff movl $0xffffffff,0x8(%rdi)
16: c3 ret

0000000000000017 <lgamma_ertl1>:
17: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
1d: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # 25 <lgamma_ertl1+0xe>
24: 00
25: c3 ret

So with the typical simplistic struct return (aka pcc-struct-return)
the code of the function is longer.

The code for the call_* functions is:

0000000000000026 <call_alsup>:
26: 53 push %rbx
27: 48 83 ec 10 sub $0x10,%rsp
2b: 48 89 e7 mov %rsp,%rdi
2e: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 36 <call_alsup+0x10>
35: 00
36: e8 00 00 00 00 call 3b <call_alsup+0x15>
3b: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
40: f2 0f 10 04 24 movsd (%rsp),%xmm0
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 4c <call_alsup+0x26>
4c: b8 01 00 00 00 mov $0x1,%eax
51: e8 00 00 00 00 call 56 <call_alsup+0x30>
56: 89 de mov %ebx,%esi
58: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 5f <call_alsup+0x39>
5f: b8 00 00 00 00 mov $0x0,%eax
64: e8 00 00 00 00 call 69 <call_alsup+0x43>
69: 48 83 c4 10 add $0x10,%rsp
6d: 5b pop %rbx
6e: c3 ret

000000000000006f <call_ertl>:
6f: 48 83 ec 18 sub $0x18,%rsp
73: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi
78: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 80 <call_ertl+0x11>
7f: 00
80: e8 00 00 00 00 call 85 <call_ertl+0x16>
85: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 8c <call_ertl+0x1d>
8c: b8 01 00 00 00 mov $0x1,%eax
91: e8 00 00 00 00 call 96 <call_ertl+0x27>
96: 8b 74 24 0c mov 0xc(%rsp),%esi
9a: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # a1 <call_ertl+0x32>
a1: b8 00 00 00 00 mov $0x0,%eax
a6: e8 00 00 00 00 call ab <call_ertl+0x3c>
ab: 48 83 c4 18 add $0x18,%rsp
af: c3 ret

18 instructions for call_alsup() vs. 14 for call_ertl(), so again the struct-return variant leads to longer code with pcc-struct-return.

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient.

Given the name of the calling convention variant, this was introduced
in PCC (and probably struct returns themselves were introduced in
PCC). PCC was released in 1979 on the machines of the day, such as
the PDP-11; I am sure Johnson implemented a calling convention for
struct passing and struct returns that used the least amount of code.
If Johnson had had more space to play with, he probably would have had
other things on the agenda before improving the struct return calling convention. E.g., the calling conventions at the time passed all
parameters on the stack, and we still have this in the Intel calling
convention for IA-32.

Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns. But struct returns were so rare in libraries that gcc added an option -freg-struct-return which returns
small structs in registers, and this option used to be usable, because libraries or system calls did not use struct-returns at the time.

Eventually, ABI specifications went for more efficient, but also more
complex and less forgiving calling conventions, so on AMD64 without -fpcc-struct-return gammaresult is actually returned in registers,
leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
13 instructions for call_alsup (shorter than call_ertl).

Then the habit of decent ABI's could have continued
when new architectures were developed.

It seems to me that that's what happened (except that it was not a continuation): When new architectures were introduced, ABIs were
introduced that made use of the additional memory, but also took
compatibility with existing practice into account.

E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).

As time progressed, calling conventions tried to keep stuff more in
registers and in the right kind of registers, at the cost of a more
complex implementation and breaking programs without prototypes.
E.g., the AMD64 ABI specifies register struct returns for small
structs.

Instead, many current ABI's are
at least sub-optimal for structs

Which ones do you have in mind?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Niklas Holsti@21:1/5 to Scott Lurndal on Fri Jul 11 18:33:19 2025

On 2025-07-11 16:40, Scott Lurndal wrote:

[email protected] (Dan Cross) writes:

[snip]

And the case of `posix_spawn` might be special, since it is so
often written in terms of `vfork`, which has its own bizarre
semantics.

posix_spawn was modeled somewhat after ADA process creation primitives.

The rationale is included in the standard page.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.htm

To clarify: the models for posix_spawn were not Ada /language/
primitives, but process-creation operations provided in a standard
Ada-to-POSIX binding. Quoting from the page referenced above:

"Instead, posix_spawn() and posix_spawnp() are process creation
primitives like the Start_Process and Start_Process_Search Ada language bindings [in] package POSIX_Process_Primitives and also like those in
many operating systems that are not UNIX systems, but with some
POSIX-specific additions."

The Ada language itself does not have a "process" concept. Ada has
"tasks" that are execution threads that run in a shared address space.
Tasks in Ada are created by dedicated syntax and not by calling some task-creating operations.

From Dan Cross@21:1/5 to Scott Lurndal on Fri Jul 11 16:11:00 2025

In article <av8cQ.984246$[email protected]>,
Scott Lurndal <[email protected]> wrote:

[email protected] (Dan Cross) writes:

In article <Ih_bQ.958133$[email protected]>,
Scott Lurndal <[email protected]> wrote:

The open(2) API and errno mechanism was defined in very early unix a half century ago.

It was standardized in the System V Interface Definition (SVID) in the
1980s and in POSIX a few years later, followed by the X Portability
Guide (XPG) and finally the Single Unix specification. In all cases
backward compatibility at the source level was a requirement.

Extensions and new capabilities related to opening a file are encapsulated
in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

Yes, there are likely different possible APIs; all new standardized Unix C APIs
(e.g. posix_spawn, pthreads, et alia) return the E* error number directly
(for thread safety) or zero for success eschewing errno completely. Any other
data returned by an API is via pointer parameters (often with 'restrict' qualification).

POSIX mandates that `errno` be (essentially) thread-local, so
thread safety isn't much of a consideration here. Traditionally
Unix kernels have returned a single value in a register, and set
a flag (in the PSW or whatever) to indicate failure, leaving it
to the syscall stubs in e.g. the C library to take whatever the
kernel gives back from the actual syscall exit and make sure
that `errno` is set appropriately.

I can imagine that a kernel call interface where `errno` is not
set is a bit more direct, but I don't think concurrency plays a
huge role there; but maybe these interfaces were designed in
that awkward time before `errno` was thread safe by mandate.

I was on the XPG working group in those years, and yes, they
were designed in that awkward time as 1003.4a was being
developed.

Thanks for the confirmation; that makes sense.

And the case of `posix_spawn` might be special, since it is so
often written in terms of `vfork`, which has its own bizarre
semantics.

posix_spawn was modeled somewhat after ADA process creation primitives.

The rationale is included in the standard page.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html

Thanks, but I don't think that directly addresses why they chose
to return error status directly in the return value, and not set
errno as a side-effect.

Perhaps a hint is given here, from the rationale you pointed to
earlier:

|The posix_spawn() function is implementable as a library
|routine, but both posix_spawn() and posix_spawnp() are designed
|as kernel operations.

...one presumes that, on systems where it is implemented as a
library routine, it is written in terms of fork/exec and
capturing the value of errno in the case of a successful fork/
failed exec might be challenging. On existing Unix-y systems, I
suspect it is almost always implemented in terms of vfork/exec,
which has its own issues, but since the child "borrows" its
parent's address space until it either exec's or exits, maybe it
would be _easier_ to bubble errno values back up.
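
For the plain fork/exec case, one common idiom for getting the child's errno back to the parent is a pipe whose write end is marked close-on-exec. The sketch below assumes a POSIX system; spawn_simple() is a hypothetical helper and is not claimed to be how any particular libc implements posix_spawn():

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start path with argv. Returns 0 and stores the
   child's pid in *child, or returns an E* number directly. */
static inline int spawn_simple(pid_t *child, const char *path, char *const argv[])
{
    int fds[2], err = 0;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) != 0)
        return errno;
    /* The write end closes by itself when exec succeeds. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0 || (pid = fork()) < 0) {
        err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }
    if (pid == 0) {                        /* child */
        close(fds[0]);
        execv(path, argv);
        err = errno;                       /* exec failed: report why */
        if (write(fds[1], &err, sizeof err) < 0) {
            /* nothing more can be done here */
        }
        _exit(127);
    }
    close(fds[1]);                         /* parent */
    do {
        n = read(fds[0], &err, sizeof err);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);
    if (n == (ssize_t)sizeof err) {        /* child reported an exec failure */
        waitpid(pid, NULL, 0);
        return err;
    }
    *child = pid;                          /* end of file: exec succeeded */
    return 0;
}
```

The parent's errno is only used as a transport inside the helper; the caller sees the error number as the return value, in the same style as posix_spawn().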

- Dan C.

From Thomas Koenig@21:1/5 to Niklas Holsti on Fri Jul 11 16:58:07 2025

Niklas Holsti <[email protected]d> schrieb:

"Instead, posix_spawn() and posix_spawnp()

For a second, I read that as posix_swamp().

But then again, I have been known to write about unsinged numbers.

From Scott Lurndal@21:1/5 to Dan Cross on Fri Jul 11 17:21:25 2025

[email protected] (Dan Cross) writes:

In article <av8cQ.984246$[email protected]>,
Scott Lurndal <[email protected]> wrote:

<snip posix_spawn discussion>

The rationale is included in the standard page.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html

Thanks, but I don't think that directly addresses why they chose
to return error status directly in the return value, and not set
errno as a side-effect.

My recollection is the choice to return errno directly was made
because we were aware of the pending 1003.4a specification (I sat
in on a couple of those meetings as well when our regular posix rep
wasn't available).

From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 11 19:25:21 2025

On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

David Brown <[email protected]> writes:

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to:

void lgamma(gammaresult * result, double gamma);

Let's see:

#include <stdio.h>

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma_alsup1( double x )
{
gammaresult r;
r.result = x+1.;
r.sign = -1;
return r;
}

lgamma_alsup1:
FADD R1,R1,#1.
MOV R2,#-1
RET // 3 instructions no memory

double lgamma_ertl1(double x, int *signgam)
{
*signgam = -1;
return x+1.;
}

lgamma_ertl1:
ST #-1,[R2]
FADD R1,R1,#1.
RET // 3 instructions 1 memory == more power

extern gammaresult lgamma_alsup2( double x );

void call_alsup()
{
gammaresult r=lgamma_alsup2(1.);
printf("%f ",r.result);
printf("%d ",r.sign);
}

call_alsup:
ENTER R0,R0,#16
CVTSD R1,#1 // 4 bytes instead of MOV R1,#1.0D0 as 12 bytes
CALX [IP,,GOT[lgamma_alsup2#]-.]
STD R2,[SP,#8]
MOV R2,R1
LDA R1,&"%f "
CALL printf
LD R2,[SP,8]
LDA R1,&"%d "
CALL printf
EXIT R0,R0,#8 // 11 instructions 1 STD 1 LDD 2 LDA

extern double lgamma_ertl2(double x, int *signgam);

void call_ertl()
{
int sign;
printf("%f ",lgamma_ertl2(1.,&sign));
printf("%d ",sign);
}

call_ertl:
ENTER R0,R0,#16
CVTSD R1,#1
LDA R2,[SP,16]
CALX [IP,,GOT[lgamma_ertl2#]-.]
MOV R2,R1
LDA R1,&"%f "
CALL printf
LDA R2,[SP,16]
LDA R1,&"%d "
CALL printf
EXIT R0,R0,#16 // 11 instructions 4 LDA

With the same instruction count the argument is a wash on My 66000 architecture.

Here the calls are to a differently-named function with the same
interface such that we see what happens without inlining. The first
thing to note is that the source code for the struct-returning
function is longer. The calling code is slightly longer.

I have compiled that on AMD64 with:

gcc -fpcc-struct-return -Wall -O -c lgamma.c

The output of "objdump -d lgamma.o" for lgamma_*1 is:

0000000000000000 <lgamma_alsup1>:
0: 48 89 f8 mov %rdi,%rax
3: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # b <lgamma_alsup1+0xb>
a: 00
b: f2 0f 11 07 movsd %xmm0,(%rdi)
f: c7 47 08 ff ff ff ff movl $0xffffffff,0x8(%rdi)
16: c3 ret

// 5 instructions

0000000000000017 <lgamma_ertl1>:
17: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
1d: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # 25 <lgamma_ertl1+0xe>
24: 00
25: c3 ret

3 instructions: just like both My 66000 compilations

So with the typical simplistic struct return (aka pcc-struct-return)
the code of the function is longer.

The code for the call_* functions is:

0000000000000026 <call_alsup>:
26: 53 push %rbx
27: 48 83 ec 10 sub $0x10,%rsp
2b: 48 89 e7 mov %rsp,%rdi
2e: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 36 <call_alsup+0x10>
35: 00
36: e8 00 00 00 00 call 3b <call_alsup+0x15>
3b: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
40: f2 0f 10 04 24 movsd (%rsp),%xmm0
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 4c <call_alsup+0x26>
4c: b8 01 00 00 00 mov $0x1,%eax
51: e8 00 00 00 00 call 56 <call_alsup+0x30>
56: 89 de mov %ebx,%esi
58: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 5f <call_alsup+0x39>
5f: b8 00 00 00 00 mov $0x0,%eax
64: e8 00 00 00 00 call 69 <call_alsup+0x43>
69: 48 83 c4 10 add $0x10,%rsp
6d: 5b pop %rbx
6e: c3 ret

17 instructions

000000000000006f <call_ertl>:
6f: 48 83 ec 18 sub $0x18,%rsp
73: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi
78: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 80 <call_ertl+0x11>
7f: 00
80: e8 00 00 00 00 call 85 <call_ertl+0x16>
85: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 8c <call_ertl+0x1d>
8c: b8 01 00 00 00 mov $0x1,%eax
91: e8 00 00 00 00 call 96 <call_ertl+0x27>
96: 8b 74 24 0c mov 0xc(%rsp),%esi
9a: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # a1 <call_ertl+0x32>
a1: b8 00 00 00 00 mov $0x0,%eax
a6: e8 00 00 00 00 call ab <call_ertl+0x3c>
ab: 48 83 c4 18 add $0x18,%rsp
af: c3 ret

13 instructions

Both longer than my 66000 versions.

18 instructions for call_alsup() vs. 14 for call_ertl(),

I got 1 fewer for both when counting the AMD64 instructions. Am I missing
something?!? PLUS I used dynamically loaded calling for lgamma*, not static
loading.

so again the struct-return variant leads to longer code with pcc-struct-return.

For AMD64 yes, for My 66000 no.

But this has been my constant argument for the last 6 years:: you don't
finish the ISA development until after the compiler has been written.
When you find an awkward code sequence--figure out how to fix it, then
teach the compiler to use that.

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient.

Someone did!

Given the name of the calling convention variant, this was introduced
in PCC (and probably struct returns themselves were introduced in
PCC). PCC was released in 1979 on the machines of the day, such as
the PDP-11; I am sure Johnson implemented a calling convention for
struct passing and struct returns that used the least amount of code.
If Johnson had had more space to play with, he probably would have had
other things on the agenda before improving the struct return calling
convention. E.g., the calling conventions at the time passed all
parameters on the stack, and we still have this in the Intel calling
convention for IA-32.

Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns.

Greenhills compiler for 88K use register struct returns (1983)
IIRC 4 registers; so that complex doubles were in registers
both calling and returning.

But struct returns were so rare in libraries that gcc added an option
-freg-struct-return which returns small structs in registers, and this
option used to be usable, because libraries or system calls did not use
struct-returns at the time.

Eventually, ABI specifications went for more efficient, but also more
complex and less forgiving calling conventions, so on AMD64 without -fpcc-struct-return gammaresult is actually returned in registers,
leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
13 instructions for call_alsup (shorter than call_ertl).
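
A small sketch of why gammaresult qualifies: as a rule of thumb, the System V AMD64 ABI can return aggregates of up to two eightbytes (16 bytes) in registers, here the double in XMM0 and the int in RAX. The 16-byte figure is a simplification used only for illustration (the real classification also depends on the field types), and make_result is a hypothetical stand-in for lgamma_alsup1.

```c
#include <assert.h>

typedef struct { double result;
                 int sign; } gammaresult;

/* Simplified size check (assumption for illustration): a struct no
 * larger than two eightbytes is a candidate for register return. */
static_assert(sizeof(gammaresult) <= 16,
              "gammaresult should fit in two eightbytes");

/* Hypothetical stand-in for lgamma_alsup1 from the listing above. */
gammaresult make_result(double x)
{
    gammaresult r;
    r.result = x + 1.0;
    r.sign = -1;
    return r;   /* by value; no hidden result pointer is required */
}
```

Compiling this with and without -fpcc-struct-return and comparing the objdump output should reproduce the difference described above.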

My 66000 ABI provides up to 8 doublewords of register struct return
values.

Then the habit of decent ABI's could have continued
when new architectures were developed.

It seems to me that that's what happened (except that it was not a
continuation): When new architectures were introduced, ABIs were
introduced that made use of the additional memory, but also took
compatibility with existing practice into account.

E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.

My 66000 first 8 DoubleWords in registers calling and returning,
the rest on the stack.

That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).
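
For readers who have not written one, here is a minimal varargs callee in portable C. The function name sum_ints is hypothetical. The point is that va_arg walks the arguments as if they were laid out one after another; on the convention described above, spilling the four argument registers into their reserved stack slots is what makes that view true.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example: add up `count` int arguments. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);

    va_start(ap, count);            /* start after the last named parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* fetch the next argument as an int */
    va_end(ap);

    return total;
}
```

A call such as sum_ints(3, 10, 20, 12) passes the count and the values exactly as a non-varargs call would, which is why such calls could work without a prototype in scope under that convention.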

My 66000 does not have FP registers, just GPRs. (a topic for another
day)

As time progressed, calling conventions tried to keep stuff more in
registers and in the right kind of registers, at the cost of a more
complex implementation and breaking programs without prototypes.
E.g., the AMD64 ABI specifies register struct returns for small
structs.

So does My 66000, except small == 1 cache line.

Instead, many current ABI's are
at least sub-optimal for structs

Which ones do you have in mind?

- anton

From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jul 11 19:26:15 2025

On Fri, 11 Jul 2025 16:58:07 +0000, Thomas Koenig wrote:

Niklas Holsti <[email protected]d> schrieb:

"Instead, posix_spawn() and posix_spawnp()

For a second, I read that as posix_swamp().

It might very well be.....

But then again, I have been known to write about unsinged numbers.

From Dan Cross@21:1/5 to Scott Lurndal on Fri Jul 11 20:40:56 2025

In article <pKbcQ.592388$[email protected]>,
Scott Lurndal <[email protected]> wrote:

[email protected] (Dan Cross) writes:

In article <av8cQ.984246$[email protected]>,
Scott Lurndal <[email protected]> wrote:

<snip posix_spawn discussion>

The rationale is included in the standard page.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html

Thanks, but I don't think that directly addresses why they chose
to return error status directly in the return value, and not set
errno as a side-effect.

My recollection is the choice to return errno directly was made
because we were aware of the pending 1003.4a specification (I sat
in on a couple of those meetings as well when our regular posix rep
wasn't available).

That makes some sense, I suppose. Thanks.

- Dan C.

From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jul 11 14:27:06 2025

On 7/11/2025 9:58 AM, Thomas Koenig wrote:

Niklas Holsti <[email protected]d> schrieb:

"Instead, posix_spawn() and posix_spawnp()

For a second, I read that as posix_swamp().

But then again, I have been known to write about unsinged numbers.

Unsinged numbers are cool :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 11 21:56:17 2025

On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

David Brown <[email protected]> writes:

<snip>

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient.

Given the name of the calling convention variant, this was introduced
in PCC (and probably struct returns themselves were introduced in
PCC). PCC was released in 1979 on the machines of the day, such as
the PDP-11; I am sure Johnson implemented a calling convention for
struct passing and struct returns that used the least amount of code.
If Johnson had had more space to play with, he probably would have had
other things on the agenda before improving the struct return calling
convention. E.g., the calling conventions at the time passed all
parameters on the stack, and we still have this in the Intel calling
convention for IA-32.

Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers. PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.

The thing is:: we learned (most of the good ones of us).
There needs to be a lot of GPRs
we should be able to use 1/4-1/2 of them passing arguments
and returning results
while preserving ~1/2 of them across call/return boundaries
One needs IP-relative addressing to data
And one needs efficient dynamically linked subroutines and data
And we should not allocate ANY registers to the dynamic linker.

A few tidbits I picked up along the way::
a) When a dynamically linked subroutine has not been linked, the
faulting instruction access needs to contain a means to directly
derive its GOT[index] without knowing the IP of the instruction.

b) tabularized switch tables should use bytes or halfwords instead
of doublewords.
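
Point (b) is about jump tables emitted by the compiler, which C cannot express directly, so the following is only a source-level analogy under that assumption: a dense selector is mapped through a table of one-byte indices rather than a table of full-width pointers. All names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef int (*handler_fn)(int);

static int op_nop(int x) { return x; }
static int op_inc(int x) { return x + 1; }
static int op_dbl(int x) { return x * 2; }
static int op_neg(int x) { return -x; }

/* The few distinct targets are stored once at full pointer width. */
static const handler_fn handlers[] = { op_nop, op_inc, op_dbl, op_neg };

/* One byte per case instead of one 8-byte pointer per case. */
static const uint8_t case_to_handler[8] = { 1, 2, 3, 0, 1, 1, 2, 0 };

int dispatch(unsigned selector, int x)
{
    assert(selector < sizeof case_to_handler);
    return handlers[case_to_handler[selector]](x);
}
```

With eight cases, the byte table occupies 8 bytes where a table of 64-bit pointers would occupy 64; the saving grows with the number of cases.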

<snip>

Then the habit of decent ABI's could have continued
when new architectures were developed.

It seems to me that that's what happened (except that it was not a
continuation): When new architectures were introduced, ABIs were
introduced that made use of the additional memory, but also took
compatibility with existing practice into account.

Another account of the architects having not been exposed to enough
of the disease before crafting their design.

E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).

My 66000 compilation environment does not need varargs prototypes
in scope to build correct calling sequences. The calling sequences
are independent of the caller's requirements.

As time progressed, calling conventions tried to keep stuff more in
registers and in the right kind of registers,

This is simple when there is only 1 kind of register !!

at the cost of a more
complex implementation and breaking programs without prototypes.
E.g., the AMD64 ABI specifies register struct returns for small
structs.

Instead, many current ABI's are at least sub-optimal for structs

Which ones do you have in mind?

- anton

From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jul 11 22:03:16 2025

On Fri, 11 Jul 2025 22:00:44 +0000, Stefan Monnier wrote:

Unsinged numbers are cool :-)

Yeah, I find that singed numbers make it harder to concentrate.

But they are easier to eat!

Stefan

From Stefan Monnier@21:1/5 to All on Fri Jul 11 18:00:44 2025

Unsinged numbers are cool :-)

Yeah, I find that singed numbers make it harder to concentrate.

Stefan

From moi@21:1/5 to All on Fri Jul 11 23:23:18 2025

On 11/07/2025 20:26, MitchAlsup1 wrote:

On Fri, 11 Jul 2025 16:58:07 +0000, Thomas Koenig wrote:

Niklas Holsti <[email protected]d> schrieb:

"Instead, posix_spawn() and posix_spawnp()

For a second, I read that as posix_swamp().

It might very well be.....

But then again, I have been known to write about unsinged numbers.

Just so long as they are not unhinged!

--
Bill F.

From John Levine@21:1/5 to All on Fri Jul 11 23:02:17 2025

According to MitchAlsup1 <[email protected]>:

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient. ...

Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers. PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.

The C compilers at that time were not very sophisticated. They compiled
one statement at a time, and the only way to tell them to leave values
in registers was an explicit "register" declaration. Except in the most
trivial routines, it'd usually have to stash the argument in memory to
make room for something else, so there'd have been no benefit.

SPARC used the PCC compiler, which still wasn't very clever, so it had
register windows with separate groups of registers for input arguments,
output arguments, and temporaries. The IBM 801 had the first graph
coloring compiler so I expect it passed all sorts of stuff in registers.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Jul 11 23:19:35 2025

On Fri, 11 Jul 2025 14:50:58 GMT, Anton Ertl wrote:

E.g., the calling conventions at the time passed all parameters on the
stack, and we still have this in the Intel calling convention for IA-32.

No choice. What registers were there to use for passing arguments?

From Scott Lurndal@21:1/5 to [email protected] on Sat Jul 12 00:30:14 2025

[email protected] (MitchAlsup1) writes:

On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

David Brown <[email protected]> writes:

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to:

void lgamma(gammaresult * result, double gamma);

Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns.

Greenhills compiler for 88K use register struct returns (1983)
IIRC 4 registers; so that complex doubles were in registers
both calling and returning.

The formal definition for the 88k Unix ABI was the 88Open BCS[*] (I was
the Unisys rep on the 88Open committee). I don't recall four register
returns, but all my documentation from those days is boxed up. I think I
have a copy of the 88k PCC sources around somewhere...

[*] Binary Compatibility Standard. There was also an Object Compatibility
Standard (OCS) to support link-time compatibility between compiler vendors
(e.g. Unisoft, DG, Motorola, Unisys, Greenhills, Diab Data, et alia).

From Terje Mathisen@21:1/5 to Thomas Koenig on Sat Jul 12 09:22:08 2025

Thomas Koenig wrote:

Niklas Holsti <[email protected]d> schrieb:

"Instead, posix_spawn() and posix_spawnp()

For a second, I read that as posix_swamp().

But then again, I have been known to write about unsinged numbers.

So have I, multiple times.

Still better than unhinged numbers?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Anton Ertl@21:1/5 to John Levine on Sat Jul 12 15:25:43 2025

John Levine <[email protected]> writes:

According to MitchAlsup1 <[email protected]>:

Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers.

It's not any more difficult to pass, say, 4 arguments in registers if
you have 6 registers available than it is if you have 30 registers
available.

PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.

I think neither PDP-11 nor IA-32 has instructions that push, say, the
sum of two other registers, whereas at least IA-32 has an instruction
that computes the sum of two registers and puts it in a third
register.

Concerning the implicit memory access: it costs more than using
registers on all IA-32 implementations I am aware of, and I expect
that's also true of the PDP-11.

The C compilers at that time were not very sophisticated. They compiled
one statement at a time, and the only way to tell them to leave values
in registers was an explicit "register" declaration. Except in the most
trivial routines, it'd usually have to stash the argument in memory to
make room for something else, so there'd have been no benefit.

Many frequently-called library routines, such as strlen() or
memcpy()[1] can easily keep all their parameters, variables, and
intermediate results in 6 registers or less.

Therefore I expect that many of the frequently-called library routines
compiled with PCC made extensive use of the register storage class.

In that scenario passing the arguments in registers avoids the cost of
pushing them in the caller and the cost of loading them from memory at
the start of the callee.
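
To make the scenario concrete, this is roughly what such a routine might have looked like with the register storage class. It is a sketch in modern C, not code from any historical library, and my_strlen is a hypothetical name. Current compilers treat `register` as little more than a hint, but for PCC-era compilers it decided whether the pointer lived in a register or in a stack slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strlen-style routine using the register storage class. */
size_t my_strlen(const char *s)
{
    register const char *p = s;   /* ask for the cursor to stay in a register */

    assert(s != NULL);

    while (*p != '\0')
        p++;

    return (size_t)(p - s);
}
```

The whole routine needs only the argument and one cursor, so with register argument passing nothing in it would have to touch the stack at all.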

As for the functions that do not use the register storage class for
parameters, pushing or storing them at the start of the callee is not
slower than doing it right before the call, and it can lead to shorter
code.

Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.

But Intel had a clean slate when they designed the Intel calling
convention for IA-32. When the 386 came out in 1985, Wulf et
al. [wulf+75] was a decade old, and Chaitin's graph-coloring paper was
4 years old, and the 386 typically had much more memory available than
Wulf et al. MIPS introduced a calling convention that passed 4 words
in registers shortly after, and Intel could have done so, too.
And it seems that they paid dearly for their decision, as I find lots
of documentation on alternative calling conventions for IA-32 and how
to tell the compiler about them.

@Book{wulf+75,
author = {William Wulf and Richard K. Johnsson and Charles
B. Weinstock and Steven O. Hobbs and Charles M. Geschke},
title = {The Design of an Optimizing Compiler},
publisher = {Elsevier},
year = {1975},
isbn = {0-444-0164-6},
annote = {Describes a complete Bliss/11 compiler for the
PDP-11. It uses some interesting techniques: it
uses a (hand-constructed) tree parsing automaton for
parts of the code selection (Section~3.4); it
optimizes the use of unary complement operators
(Section~3.3); it uses a smart scheme to represent
a conservative approximation of the lifetime of
variables in constant space and uses that for
register allocation (Sections~4.1.3 and~4.3).}
}

This book cannot be praised enough, and it's celebrating its 50th
anniversary this year.

While this book came out before Stephen C. Johnson wrote PCC, I can
understand why Johnson avoided going for an optimizing compiler.
Johnson had enough on his plate with adding features to the language
and designing for retargetability, and AFAIK he wrote PCC
single-handedly, while Wulf et al. seem to have been 5 people. And
given that Geschke graduated from CMU in 1972, they may have worked on
the compiler for several years even with five people. Plus, as I just
read, BLISS/11 was a cross-compiler from the PDP-10 to the PDP-11, so
these optimization techniques may have needed too much memory for a
PDP-11.

[1] I have wondered about the selection of registers for the System V
calling convention for the System V ABI for AMD64: the first 6
arguments go in RDI, RSI, RDX, RCX, R8, R9. The first two are optimal
for memcpy() implemented with REP MOVSB, but then RCX would be better
in third position. RDI is also good for memset() with REP STOSB, RDI
and RSI are also good for memcmp() with REP CMPSB, and I expect that
there are other uses of REP instructions for implementing memory-block
or string functions where the placement in RDI and RSI is
helpful. Except that the library routines then often do not use the
REP instructions.
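
A sketch of the register match described in this footnote, assuming GCC or Clang extended inline assembly on x86-64 (other targets fall back to a plain byte loop). The function name copy_rep_movsb is hypothetical. The "D", "S" and "c" constraints pin the operands to RDI, RSI and RCX, which is where REP MOVSB expects them; the first two already arrive there under the System V convention, while the count arrives in RDX and has to be moved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memcpy-like routine built on REP MOVSB. */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;

    assert(dst != NULL && src != NULL);

#if defined(__x86_64__) && defined(__GNUC__)
    /* RDI = destination, RSI = source, RCX = byte count.
     * The ABI guarantees the direction flag is clear on entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback so the sketch still builds elsewhere. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}
```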

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jul 12 19:13:16 2025

On Sat, 12 Jul 2025 0:30:14 +0000, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

David Brown <[email protected]> writes:

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to:

void lgamma(gammaresult * result, double gamma);

Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns.

Greenhills compiler for 88K use register struct returns (1983)
IIRC 4 registers; so that complex doubles were in registers
both calling and returning.

The formal definition for the 88k Unix ABI was the 88Open BCS[*] (I was
the Unisys rep on the 88Open committee). I don't recall four register
returns, but all my documentation from those days is boxed up. I think
I have a copy of the 88k PCC sources around somewhere...

I was the Moto Architect. 4 registers (of 32-bits) were used to be able
to return a complex double precision value (2 doubles).

[*] Binary Compatibility Standard. There was also an Object Compatibility
Standard (OCS) to support link-time compatibility between compiler vendors
(e.g. Unisoft, DG, Motorola, Unisys, Greenhills, Diab Data, et alia).

From MitchAlsup1@21:1/5 to Anton Ertl on Sat Jul 12 19:32:10 2025

On Sat, 12 Jul 2025 15:25:43 +0000, Anton Ertl wrote:

John Levine <[email protected]> writes:

According to MitchAlsup1 <[email protected]>:

Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers.

It's not any more difficult to pass, say, 4 arguments in registers if
you have 6 registers available than it is if you have 30 registers
available.

There comes a point where it becomes harder than the compilers of that
era could perform--for example, consider an expression to be passed
as an argument that requires 3 registers to compute. If you only have
6 registers and you want to pass 4 in registers, you might have to
calculate several arguments, push them on the stack, then calculate
the last one (3 registers) into the right register, then pop the others
off the stack in order to perform the call.

At a certain point, it's easier not to do this.

PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.

I think neither PDP-11 nor IA-32 has instructions that push, say, the
sum of two other registers, whereas at least IA-32 has an instruction
that computes the sum of two registers and puts it in a third
register.

Neither had a non-destructive register model (a = b + c); both had
a destructive model (a = a + b).

Concerning the implicit memory access: it costs more than using
registers on all IA-32 implementations I am aware of, and I expect
that's also true of the PDP-11.

Time: yes, instruction space: somewhat--but you had (r5) and (r5)+
and @(r5)+ and -(r5) and @-(r5) which cost no space but did cost time.

The C compilers at that time were not very sophisticated. They compiled
one statement at a time, and the only way to tell them to leave values
in registers was an explicit "register" declaration. Except in the most
trivial routines, it'd usually have to stash the argument in memory to
make room for something else, so there'd have been no benefit.

My point from above.

Many frequently-called library routines, such as strlen() or
memcpy()[1] can easily keep all their parameters, variables, and
intermediate results in 6 registers or less.

IIRC only SP and IP were preserved across a call/return

Therefore I expect that many of the frequently-called library routines compiled with PCC made extensive use of the register storage class.

A necessary evil. The first thing a modern C compiler does is to remove
"register" sub-types from variables.

In that scenario passing the arguments in registers avoids the cost of pushing them in the caller and the cost of loading them from memory at
the start of the callee.

As for the functions that do not use the register storage class for parameters, pushing or storing them at the start of the callee is not
slower than doing it right before the call, and it can lead to shorter
code.

Less total code but equal number of instructions executed.
When saved at entry, everyone who calls this subroutine shares
the memory reference instructions.

Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.

The Denelcor C compiler I built had big trouble fitting in the PDP-11
memory. I had to remove all the superfluous "I wrote this" strings
at the start of the ASM modules to get it to fit.

But Intel had a clean slate when they designed the Intel calling
convention for IA-32. When the 386 came out in 1985, Wulf et
al. [wulf+75] was a decade old, and Chaitin's graph-coloring paper was
4 years old, and the 386 typically had much more memory available than
Wulf et al. MIPS introduced a calling convention that passed 4 words
in registers shortly after, and Intel could have done so, too.
And it seems that they paid dearly for their decision, as I find lots
of documentation on alternative calling conventions for IA-32 and how
to tell the compiler about them.

I agree they paid dearly. The marketplace does not.

@Book{wulf+75,
author = {William Wulf and Richard K. Johnsson and Charles
B. Weinstock and Steven O. Hobbs and Charles M.
Geschke},
title = {The Design of an Optimizing Compiler},
publisher = {Elsevier},
year = {1975},
isbn = {0-444-0164-6},
annote = {Describes a complete Bliss/11 compiler for the
PDP-11. It uses some interesting techniques: it
uses a (hand-constructed) tree parsing automaton for
parts of the code selection (Section~3.4); it
optimizes the use of unary complement operators
(Section~3.3); it uses a smart scheme to represent
a conservative approximation of the lifetime of
variables in constant space and uses that for
register allocation (Sections~4.1.3 and~4.3).}
}

This book cannot be praised enough, and it's celebrating its 50th
anniversary this year.

I have an original.

<snip>

- anton

From John Levine@21:1/5 to All on Sun Jul 13 00:06:42 2025

According to Anton Ertl <[email protected]>:

Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.

It was two passes each about 24K bytes and a third optional optimizer
that slightly rewrote the assembler code.

The Ritchie compiler and I think PCC reserved up to three registers
for declared register variables, and used the rest as a stack for
temporaries. It used Sethi-Ullman numbering to do the more complex
subexpressions first to avoid running out of registers. If it did
run out of registers I think it just gave up, but I don't ever
remember that happening.
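
For anyone unfamiliar with Sethi-Ullman numbering, here is a minimal sketch of the labelling step. It uses the simplified textbook variant in which every leaf counts as 1 and every interior node is a binary operator; it is not the Ritchie compiler's actual code, and the type and function names are hypothetical. The label is the number of registers needed to evaluate the subtree without spilling, provided the child with the larger label is evaluated first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expression-tree node: a leaf has no children,
 * an operator node has exactly two. */
typedef struct expr {
    const struct expr *left;
    const struct expr *right;
} expr;

/* Simplified Sethi-Ullman label of the tree rooted at e. */
int su_number(const expr *e)
{
    assert(e != NULL);

    if (e->left == NULL && e->right == NULL)
        return 1;                      /* a leaf needs one register */

    assert(e->left != NULL && e->right != NULL);

    int l = su_number(e->left);
    int r = su_number(e->right);

    if (l == r)
        return l + 1;                  /* equal needs: one extra register */
    return l > r ? l : r;              /* otherwise do the bigger side first */
}
```

Under this variant (a+b)*(c+d) is labelled 3, while the lopsided a+(b+(c+d)) is labelled 2, which is why doing the more complex subexpression first keeps the register count down.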

Reserving more registers would have been really hard.

I agree that on the 386 it would probably have been practical to pass
arguments in registers, but I suspect that for whatever reason they
wanted to make the calling sequence similar to the 8086 and 286.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From MitchAlsup1@21:1/5 to John Levine on Sun Jul 13 02:19:47 2025

On Sun, 13 Jul 2025 0:06:42 +0000, John Levine wrote:

According to Anton Ertl <[email protected]>:

Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.

It was two passes each about 24K bytes and a third optional optimizer
that slightly rewrote the assembler code.

The Ritchie compiler and I think PCC reserved up to three registers
for declared register variables, and used the rest as a stack for
temporaries. It used Sethi-Ullman numbering to do the more complex
subexpressions first to avoid running out of registers. If it did
run out of registers I think it just gave up, but I don't ever
remember that happening.

Reserving more registers would have been really hard.

I agree that on the 386 it would probably have been practical to pass
arguments in registers, but I suspect that for whatever reason they
wanted to make the calling sequence similar to the 8086 and 286.

Register arguments and results were not common until after MIPS R2000,
although I did use register arguments and results on the Denelcor HEP C
compiler (which was the same code generator as HEP Fortran.)


From David Brown@21:1/5 to Anton Ertl on Sun Jul 13 14:25:11 2025

On 11/07/2025 16:50, Anton Ertl wrote:

David Brown <[email protected]> writes:

Struct returns, even on poorer ABI's (and there are /many/ ABI's that
are bad for struct handling), are unlikely to be noticeably less
efficient than using a pointer-to-return-value solution. After all, the
typical simplistic struct return here would be roughly equivalent to :

void lgamma(gammaresult * result, double gamma);

Let's see:

#include <stdio.h>

typedef struct { double result;
int sign; } gammaresult;

gammaresult lgamma_alsup1( double x )
{
gammaresult r;
r.result = x+1.;
r.sign = -1;
return r;
}

double lgamma_ertl1(double x, int *signgam)
{
*signgam = -1;
return x+1.;
}

extern gammaresult lgamma_alsup2( double x );

void call_alsup()
{
gammaresult r=lgamma_alsup2(1.);
printf("%f ",r.result);
printf("%d ",r.sign);
}

extern double lgamma_ertl2(double x, int *signgam);

void call_ertl()
{
int sign;
printf("%f ",lgamma_ertl2(1.,&sign));
printf("%d ",sign);
}

Here the calls are to a differently-named function with the same
interface such that we see what happens without inlining. The first
thing to note is that the source code for the struct-returning
function is longer. The calling code is slightly longer.

I have compiled that on AMD64 with:

gcc -fpcc-struct-return -Wall -O -c lgamma.c

The output of "objdump -d lgamma.o" for lgamma_*1 is:

0000000000000000 <lgamma_alsup1>:
0: 48 89 f8 mov %rdi,%rax
3: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # b <lgamma_alsup1+0xb>
a: 00
b: f2 0f 11 07 movsd %xmm0,(%rdi)
f: c7 47 08 ff ff ff ff movl $0xffffffff,0x8(%rdi)
16: c3 ret

0000000000000017 <lgamma_ertl1>:
17: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
1d: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # 25 <lgamma_ertl1+0xe>
24: 00
25: c3 ret

So with the typical simplistic struct return (aka pcc-struct-return)
the code of the function is longer.

The code for the call_* functions is:

0000000000000026 <call_alsup>:
26: 53 push %rbx
27: 48 83 ec 10 sub $0x10,%rsp
2b: 48 89 e7 mov %rsp,%rdi
2e: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 36 <call_alsup+0x10>
35: 00
36: e8 00 00 00 00 call 3b <call_alsup+0x15>
3b: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
40: f2 0f 10 04 24 movsd (%rsp),%xmm0
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 4c <call_alsup+0x26>
4c: b8 01 00 00 00 mov $0x1,%eax
51: e8 00 00 00 00 call 56 <call_alsup+0x30>
56: 89 de mov %ebx,%esi
58: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 5f <call_alsup+0x39>
5f: b8 00 00 00 00 mov $0x0,%eax
64: e8 00 00 00 00 call 69 <call_alsup+0x43>
69: 48 83 c4 10 add $0x10,%rsp
6d: 5b pop %rbx
6e: c3 ret

000000000000006f <call_ertl>:
6f: 48 83 ec 18 sub $0x18,%rsp
73: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi
78: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 80 <call_ertl+0x11>
7f: 00
80: e8 00 00 00 00 call 85 <call_ertl+0x16>
85: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 8c <call_ertl+0x1d>
8c: b8 01 00 00 00 mov $0x1,%eax
91: e8 00 00 00 00 call 96 <call_ertl+0x27>
96: 8b 74 24 0c mov 0xc(%rsp),%esi
9a: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # a1 <call_ertl+0x32>
a1: b8 00 00 00 00 mov $0x0,%eax
a6: e8 00 00 00 00 call ab <call_ertl+0x3c>
ab: 48 83 c4 18 add $0x18,%rsp
af: c3 ret

18 instructions for call_alsup() vs. 14 for call_ertl(), so again the
struct-return variant leads to longer code with pcc-struct-return.

<https://godbolt.org/z/j9jMT5ave>

(I find godbolt clearer for looking at these things, and I prefer to
avoid using printf - it can easily complicate the code.)

The key metrics are not, I think, instruction counts - but memory
accesses and how likely they are to cause delays. (I know you have much
more experience than I do about the relative timings of assembly code
sequences, especially on "big" processors. My work is mainly with
simpler processors - generally single-scalar, and for important code it
is all on-chip static ram.)

As you show, having a pointer to "int * signgam" means that there will
be only one extra write to memory (in the callee) and one extra read (in
the caller) - while for a "pcc-struct-return" API you have two. However,
those will be adjacent and probably combined.

In theory, even if a struct return needs to pass a hidden pointer, the
compiler knows more about it than for a general "int *" pointer
parameter. It knows that there are no aliasing issues or "escapes" -
when you have a local variable whose address is passed on to
"lgamma_ertl", the compiler has to assume that the function might store
the address and later functions might use it to change the value of the
local variable "sign". With the hidden struct pointer, the compiler
knows that access via the pointer is much more restricted.

(With C23, a function like "lgamma_ertl" would be marked
[[unsequenced]], or at least [[reproducible]], which would let the
compiler make similar assumptions for optimisation.)

However, the best code (for caller and callee) is when there is a good
ABI for structure returns, and they are returned in registers.

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABI's to make
them efficient.

Given the name of the calling convention variant, this was introduced
in PCC (and probably struct returns themselves were introduced in
PCC). PCC was released in 1979 on the machines of the day, such as
the PDP-11; I am sure Johnson implemented a calling convention for
struct passing and struct returns that used the least amount of code.
If Johnson had had more space to play with, he probably would have had
other things on the agenda before improving the struct return calling
convention. E.g., the calling conventions at the time passed all
parameters on the stack, and we still have this in the Intel calling
convention for IA-32.

Early RISC calling conventions passed several parameters in registers,
but still used pcc-struct-returns. But struct returns were so rare in
libraries that gcc added an option -freg-struct-return which returns
small structs in registers, and this option used to be usable, because
libraries or system calls did not use struct-returns at the time.

Would struct returns have been used more if they were not so
inefficient? (There are standard library functions like "div", "ldiv",
and "lldiv" that return structs.)
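
For reference, div() is one standard library function that does return a struct by value. Here is a minimal sketch of using it; the wrapper name split_minutes is hypothetical and only there for illustration.

```c
#include <stdlib.h>

/* div() returns a div_t by value; its members are quot and rem.
   This hypothetical wrapper splits a minute count into hours (quot)
   and remaining minutes (rem), and itself returns the struct by value. */
div_t split_minutes(int total_minutes)
{
    return div(total_minutes, 60);
}
```

A caller would write something like "div_t t = split_minutes(135);" and then use t.quot and t.rem, which is the "use all the components" pattern discussed earlier in the thread.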

Eventually, ABI specifications went for more efficient, but also more
complex and less forgiving calling conventions, so on AMD64 without
-fpcc-struct-return gammaresult is actually returned in registers,
leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
13 instructions for call_alsup (shorter than call_ertl).

Then the habit of decent ABI's could have continued
when new architectures were developed.

It seems to me that that's what happened (except that it was not a
continuation): When new architectures were introduced, ABIs were
introduced that made use of the additional memory, but also took
compatibility with existing practice into account.

That sounds reasonable.

E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).

vararg functions are a real PITA for register-based ABI's ! They are
fine for stack-based parameter ABI's, but not ABI's that are more
efficient on modern devices and modern code.

As time progressed, calling conventions tried to keep stuff more in
registers and in the right kind of registers, at the cost of a more
complex implementation and breaking programs without prototypes.
E.g., the AMD64 ABI specifies register struct returns for small
structs.

Instead, many current ABI's are
at least sub-optimal for structs

Which ones do you have in mind?

The architecture that is most relevant for my daily work, and where
efficiency matters to me, is 32-bit ARM for embedded systems. It's fine
for calling functions with a few simple parameters and returning a
single scalar. But beyond that, it is often suboptimal - and with
modern C++ coding, you are often doing something beyond that.

ARM32 ABI can pass arguments in r0 to r3. (I'm ignoring floating point
for simplification.) r4 to r11 must be preserved by the caller. Why
then can they not also be used for passing arguments? I am no supporter
of having lots of parameters in a single function, but a function could
take a small number of larger parameters (64-bit integers, or structs of
various kinds).

Normally only r0 is used for return values, but r0:r1 can be used for a
fundamental type that is 64-bit (a long long int, for example). A
struct is only returned in a register if it fits in r0 - all other
structs are handled by passing a pointer to a stack block. That means
that you cannot, for example, make a C++ wrapper class around a uint64_t
without suffering significant inefficiencies. Given that functions are
already allowed to change r0 to r3 without preserving them, it would
make sense to use all of r0 to r3 for return values.

C++ tag types - types with no values, used only in parameters to choose
particular overloads for a function - are treated like "unsigned char"
by the ABI and thus cost a parameter register or force passing via stack
parameters, when they could easily be omitted entirely.

I realise 32-bit ARM was around before much of this was relevant (I
first played with ARM assembly in 1988 as a schoolkid). But it is
surely possible to modernise things a little?

It is particularly galling for developers in small-systems embedded
programming, where sometimes every cycle counts - and where we have
virtually no concern for backwards compatibility or interaction with
existing binary code, because we can happily re-compile everything on
the target.


From Anton Ertl@21:1/5 to John Levine on Sun Jul 13 14:22:55 2025

John Levine <[email protected]> writes:

I agree that on the 386 it would probably have been practical to pass
arguments in registers, but I suspect that for whatever reason they
wanted to make the calling sequence similar to the 8086 and 286.

Ease of adapting 16-bit compilers and library routines might have been
reasons.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to [email protected] on Sun Jul 13 13:07:10 2025

[email protected] (MitchAlsup1) writes:

On Sat, 12 Jul 2025 15:25:43 +0000, Anton Ertl wrote:

John Levine <[email protected]> writes:

According to MitchAlsup1 <[email protected]>:

Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers.

It's not any more difficult to pass, say, 4 arguments in registers if
you have 6 registers available than it is if you have 30 registers
available.

There comes a point where it becomes harder than the compilers of that
era could perform--for example, consider an expression to be passed
as an argument that requires 3 registers to compute.

In the worst case you are computing the fourth argument: three
registers are occupied with arguments, and three registers are
available for the computation.

Things become more challenging if you have an expression that "needs"
4 or more registers. A way to deal with that is to push the deepest
entry in your register stack on the memory stack when you run out of
registers. Then you can use the register where that intermediate
result resided. When the intermediate result is needed, pop it from
the memory stack.
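
As a rough sketch of this spilling scheme, the following C code simulates a register stack of three temporaries that pushes its deepest entry onto a memory stack when it runs out of registers. All names (EvalStack, es_push, es_pop, es_add) and the sizes are assumptions for illustration; a real compiler would track which register holds which entry instead of shifting values around, and this sketch does no bounds or underflow checking.

```c
#include <string.h>

#define NREGS   3    /* registers available for temporaries (assumed) */
#define MEMMAX 16    /* capacity of the simulated memory stack (assumed) */

typedef struct {
    int reg[NREGS];  /* reg[0] is the deepest entry still in a register */
    int nreg;        /* number of entries currently in registers */
    int mem[MEMMAX]; /* spilled entries, deepest first */
    int nmem;        /* number of spilled entries */
    int spills;      /* how often a register had to be freed */
} EvalStack;

void es_init(EvalStack *s) { memset(s, 0, sizeof *s); }

void es_push(EvalStack *s, int v)
{
    if (s->nreg == NREGS) {
        /* out of registers: push the deepest register entry to memory */
        s->mem[s->nmem++] = s->reg[0];
        for (int i = 1; i < NREGS; i++)
            s->reg[i - 1] = s->reg[i];
        s->nreg--;
        s->spills++;
    }
    s->reg[s->nreg++] = v;
}

int es_pop(EvalStack *s)
{
    if (s->nreg == 0)   /* needed value was spilled: pop it from memory */
        s->reg[s->nreg++] = s->mem[--s->nmem];
    return s->reg[--s->nreg];
}

/* Replace the two topmost entries by their sum. */
void es_add(EvalStack *s)
{
    int b = es_pop(s);
    int a = es_pop(s);
    es_push(s, a + b);
}
```

Evaluating 1+(2+(3+(4+5))) by pushing the five operands and then adding four times needs five temporaries at its deepest point, so with three registers two entries get spilled and later reloaded.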

The combination of register variables and parameter passing in
registers is also interesting. Let's assume we use the same registers
for register variables and parameter passing (useful if the parameters
are register variables, and it also means that we do not have to deal
with all registers being occupied by ). Just before a parameter is
computed, store the variable in its register into memory, and any
later accesses to the variable access that memory location. Just
before the call, write the remaining register variables to memory
(caller-saved). After the call, load all the register variables from
memory to their register again.

There are, of course, ways to improve on this, but my point is that it
is feasible to pass 4 parameters in registers and use 3 register
variables on the PDP-11. It makes the compiler a little longer and a
lot harder to test.

If you only have
6 registers and you want to pass 4 in registers, you might have to
calculate several arguments, push them on the stack, then calculate
the last one (3 registers) into the right register, then pop the others
off the stack in order to perform the call.

For the numbers you have mentioned, that's not necessary, but in
general, that's another viable approach. Some of my students
implement parameter passing (for AMD64, i.e., with passing in
registers) by pushing each argument as it is computed and pulling them
all from the stack into the appropriate registers right before the
actual call. That may be less than optimal, but getting the
assignment done in time and correctly is more important.

I think neither PDP-11 nor IA-32 has instructions that push, say, the
sum of two other registers, whereas at least IA-32 has an instruction
that computes the sum of two registers and puts it in a third
register.

Neither had a non-destructive register model (a = b + c); both used
a destructive model (a = a + b).

IA-32 has

lea eax, [ebx+ecx]

which computes the sum of ebx and ecx and stores the result into eax.
Admittedly, this only works for addition.

But it's also the case that only a limited number of operations are
supported for memory operands. E.g., consider

int r,i,a[];
r += a[i];

On IA-32 that's one instruction if r, a, and i are in registers:

add ecx, [edx+esi*4] # ecx=r, edx=a, esi=i

If they are all in memory, it's four instructions:

mov eax, a(esp)
mov ebx, i(esp)
mov eax, [eax+ebx*4]
add r(esp), eax

I leave the PDP-11 variant to more knowledgeable people.

Concerning the implicit memory access: it costs more than using
registers on all IA-32 implementations I am aware of, and I expect
that's also true of the PDP-11.

Time: yes, instruction space: somewhat--but you had (r5) and (r5)+
and @(r5)+ and -(r5) and @-(r5) which cost no space but did cost time.

While I have read papers about automatically arranging variables such
that this kind of technique can be used for accessing variables in
memory, that's a complicated technique, like register allocation, but
with less reward. Of course, in the spirit of explicit register
declarations, one can also leave it to the programmer to produce a
good order, and let the compiler just use autoincrement/decrement for
accessing the variable when the opportunity occurs. This still
requires more global analysis than PCC had AFAIK, so I doubt that PCC
used this technique.

I expect that PCC used the Indexed addressing mode (EA=SP+const) for
accessing non-register variables on the PDP-11, and in that case
non-register variables are also more expensive in code size. There is
a reason why these compilers supported a register storage class.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to David Brown on Sun Jul 13 14:24:53 2025

David Brown <[email protected]> writes:

The key metrics are not, I think, instruction counts - but memory
accesses and how likely they are to cause delays.

And one might also wonder what hardware one should look at. AMD64
does not use pcc-struct-returns by default, so finding out in how many
cases 0-cycle store-to-load forwarding (implemented in recent cores)
eliminates the delays does not tell us the performance characteristics
on hardware that mostly executed IA-32 code where pcc-struct-returns
are the default.

As you show, having a pointer to "int * signgam" means that there will
be only one extra write to memory (in the callee) and one extra read (in
the caller) - while for a "pcc-struct-return" API you have two. However,
those will be adjacent and probably combined.

The stores go separately to the store units (and consume the resources
there), and the stores are to write-back cache, not write-combining
memory. The loads go separately to the load units and consume the
resources there; no combining happens. The data will be in the
D-cache in the usual case, and on recent hardware there could even be
0-cycle store-to-load-forwarding.

If you are thinking about autovectorization by the compiler, yes, that
could happen, but IMO it costs more than it buys. I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s) by
auto-vectorizing the adjacent accesses of bubble-sort. Not only does
the code execute significantly more instructions, it also hits a slow
hardware path in store-to-load-forwarding on every store it performs
in this way.

But even without this slow path, my expectation is that the
auto-vectorization overhead would slow the benchmark down compared to
the -O1 version (which is just scalar code), but how could I measure
this?

The slow path should not occur in the struct-return case, though.

Another combining idea is the use of ARM A64's store pair and load
pair instructions, which result in only one memory access for each
such instruction and result in fewer instructions than doing unpaired
loads and stores, while the code resulting from auto-vectorization on
AMD64 is longer than two scalar stores and two scalar loads.

Unfortunately, store-pair and load-pair do not support storing or
loading an FP and an integer value AFAIK.

In theory, even if a struct return needs to pass a hidden pointer, the
compiler knows more about it than for a general "int *" pointer
parameter. It knows that there are no aliasing issues or "escapes" -
when you have a local variable whose address is passed on to
"lgamma_ertl", the compiler has to assume that the function might store
the address and later functions might use it to change the value of the
local variable "sign". With the hidden struct pointer, the compiler
knows that access via the pointer is much more restricted.

(With C23, a function like "lgamma_ertl" would be marked
[[unsequenced]], or at least [[reproducible]], which would let the
compiler make similar assumptions for optimisation.)

You mean that the programmer could mark the function in that way?

Wouldn't some use of "restrict" give the compiler similar information?
I just don't know where in the code to apply "restrict". Maybe

double lgamma_ertl2(double x, int *restrict signgam);

?

Would struct returns have been used more if they were not so
inefficient?

Possibly. I certainly remember wanting to use them for something
Gforth-internal, and then deciding against them after seeing the
generated code.

E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).

I think it's more complicated: If the first parameter is an integer
one, then do everything in integer registers, otherwise pass FP stuff
in FP registers. Probably the idea is that varargs functions always
start with an integer parameter.

Later I saw a calling convention (IIRC Alpha) where parameter n was
passed in integer register n if it was integer and FP register n if it
was an FP value. The respective other register went unused.

Recently I have seen a calling convention (IIRC RISC-V) where the
integer registers used are allocated one after the other whether there were
FP parameters interleaved or not, and the same on the FP side. I
don't remember what happens if the call runs out of one kind of
register, and the other kind is still available.
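
The two schemes described in the last two paragraphs can be contrasted with a small C sketch. This only models the descriptions given here, not the actual Alpha or RISC-V ABI documents; the names ArgKind, ArgReg, assign_by_position and assign_by_counter are hypothetical, and running out of registers is deliberately not handled.

```c
#include <stddef.h>

typedef enum { ARG_INT, ARG_FP } ArgKind;

typedef struct {
    ArgKind kind;   /* which register file the parameter goes to */
    int     index;  /* register number within that file */
} ArgReg;

/* Position-based scheme (as described for Alpha above): parameter n uses
   register n of the matching file; register n of the other file is unused. */
void assign_by_position(const ArgKind *args, size_t n, ArgReg *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].kind = args[i];
        out[i].index = (int)i;
    }
}

/* Counter-based scheme (as described for RISC-V above): each register file
   hands out its registers in order, regardless of interleaving. */
void assign_by_counter(const ArgKind *args, size_t n, ArgReg *out)
{
    int next_int = 0, next_fp = 0;
    for (size_t i = 0; i < n; i++) {
        out[i].kind = args[i];
        out[i].index = (args[i] == ARG_INT) ? next_int++ : next_fp++;
    }
}
```

For a signature (int, double, int, double) the position-based scheme uses integer registers 0 and 2 and FP registers 1 and 3, while the counter-based scheme uses integer registers 0 and 1 and FP registers 0 and 1.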

Instead, many current ABI's are
at least sub-optimal for structs

Which ones do you have in mind?

The architecture that is most relevant for my daily work, and where
efficiency matters to me, is 32-bit ARM for embedded systems.

ARM A32 (and T32 uses the same calling conventions) is from around the
same time as MIPS, so similar calling conventions are to be expected.
However, I see various ABIs mentioned in the descriptions of various
things (eABI, oABI, etc.). So apparently they did several.

I realise 32-bit ARM was around before much of this was relevant (I
first played with ARM assembly in 1988 as a schoolkid). But it is
surely possible to modernise things a little?

Breaking compatibility has an immediate cost and (hopefully) a
long-term return. It's a really hard sell. But apparently ARM with
their several ABIs has gone there. Too little?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Thomas Koenig@21:1/5 to Anton Ertl on Sun Jul 13 17:00:44 2025

Anton Ertl <[email protected]> schrieb:

I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s)

I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":

"If you know what bubble sort is, wipe it from your mind; if you
don't know, make a point of never finding out!"


From John Levine@21:1/5 to It appears that Anton Ertl on Sun Jul 13 21:22:02 2025

It appears that Anton Ertl <[email protected]> said:

John Levine <[email protected]> writes:

According to MitchAlsup1 <[email protected]>:

Given that PDP-11 had 6 general purpose useable registers, and x86
started out with similar, it would have been quite difficult to
pass the first few arguments in registers.

It's not any more difficult to pass, say, 4 arguments in registers if
you have 6 registers available than it is if you have 30 registers
available.

Those compilers were so space constrained that they compiled a statement at a
time, keeping only a stack of open loops so they knew where to jump back to. For
a procedure call it evaluated each argument expression and pushed it. Trying to
figure out which register might be available for what was way beyond what it
could do.

This could produce fairly tangled code since the code that naturally came at the
end of a for(;;) loop was generated at the top. The separate optimization pass
did read in the generated assembler a routine at a time, and somewhat untangled
the code. It removed jumps to jumps, and moved a block of code reached by
an unconditional jump to where the jump was. I don't recall it doing anything
with registers.
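
The "jumps to jumps" cleanup can be sketched in a few lines of C. This is not the code of the Unix optimizer pass; the array representation and the names NOT_A_JUMP, final_target and thread_jumps are assumptions made for illustration.

```c
#include <stddef.h>

#define NOT_A_JUMP (-1)

/* jump_target[i] is the index of the instruction that instruction i
   unconditionally jumps to, or NOT_A_JUMP if i is not an unconditional
   jump. Follow a chain of jumps starting at 'start' to its final
   destination; the hop limit keeps the loop finite if jumps form a cycle. */
int final_target(const int *jump_target, size_t n, int start)
{
    int t = start;
    size_t hops = 0;
    while (jump_target[t] != NOT_A_JUMP && hops < n) {
        t = jump_target[t];
        hops++;
    }
    return t;
}

/* Retarget every unconditional jump to the end of its jump chain. */
void thread_jumps(int *jump_target, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (jump_target[i] != NOT_A_JUMP)
            jump_target[i] = final_target(jump_target, n, jump_target[i]);
}
```

With instruction 0 jumping to 2, 2 jumping to 4 and 4 jumping to 5, all three jumps end up targeting instruction 5 directly.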

The BLISS-11 compiler might have done more clever register allocation
but it ran on a PDP-10 which could address the equivalent of a
megabyte, not the 11's 64K.

PDP-11 and x86 were
easy to push arguments onto the stack, and address in callee from
the stack.

I think neither PDP-11 nor IA-32 has instructions that push, say, the
sum of two other registers, whereas at least IA-32 has an instruction
that computes the sum of two registers and puts it in a third
register.

PDP-11 instructions were all one or two operand, with all operands being fully
general. To push the sum of two registers on the stack without clobbering the
registers you could do this:

MOV R1,-(SP) ; 2 mem cycles
ADD R2,(SP) ; 2 mem cycles

since the -11 ran mostly at the speed of its memory this would
be no faster and the code was longer:

MOV R1,R0 ; 1 mem cycle
ADD R2,R0 ; 1 mem cycle
MOV R0,-(SP) ; 2 mem cycles
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Sun Jul 13 22:20:27 2025

On Sun, 13 Jul 2025 17:00:44 -0000 (UTC), Thomas Koenig wrote:

I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":

"If you know what bubble sort is, wipe it from your mind; if you don't
know, make a point of never finding out!"

But Shellsort is basically “bubble sort done right”. And that is, or was,
certainly worth using: a decent sort algorithm that didn’t require a lot
of code to implement.


From Anton Ertl@21:1/5 to John Levine on Mon Jul 14 06:22:29 2025

John Levine <[email protected]> writes:

Those compilers were so space constrained that they compiled a statement at a
time, keeping only a stack of open loops so they knew where to jump back to. For
a procedure call it evaluated each argument expression and pushed it. Trying to
figure out which register might be available for what was way beyond what it
could do.

What I outlined is in the context of a statement-at-a-time compiler;
it costs very little data space: The compiler already records for each
local variable whether it is in a register (and which one) or in
memory (and where); the storage class "static" also needs to be
represented, but that does not affect the present discussion. The
additional data for when you store the register variable to memory
while evaluating an argument in that register can be as small as 8
bits for each of the three registers used for variables: these 8 bits
would tell if the variable currently resides in its register, or in
memory, and if in memory, where.
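
One possible shape for that per-register byte, as a hedged C sketch: the encoding (0 for "in its register", otherwise a 1-based stack slot number) and all names (RegVarState, rv_spill, rv_reload, and so on) are assumptions for illustration, not something any of the compilers discussed is known to have used.

```c
#include <stdint.h>

#define NREGVARS    3   /* registers used for register variables */
#define IN_REGISTER 0   /* assumed encoding: 0 = variable is in its register */

/* One byte per register variable. A nonzero value is the (1-based)
   stack slot that currently holds the variable's value. */
typedef struct {
    uint8_t where[NREGVARS];
} RegVarState;

void rv_init(RegVarState *s)
{
    for (int i = 0; i < NREGVARS; i++)
        s->where[i] = IN_REGISTER;
}

/* The register is about to receive an argument: the variable now
   lives in the given stack slot (slot must be nonzero). */
void rv_spill(RegVarState *s, int var, uint8_t slot) { s->where[var] = slot; }

/* After the call the variable has been loaded back into its register. */
void rv_reload(RegVarState *s, int var) { s->where[var] = IN_REGISTER; }

int rv_in_register(const RegVarState *s, int var)
{
    return s->where[var] == IN_REGISTER;
}

int rv_slot(const RegVarState *s, int var) { return s->where[var]; }
```

The code generator would consult rv_in_register() on every access to a register variable and emit either a register operand or a stack-relative operand accordingly.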

It would cost somewhat more code space, so given that they were so
heavily space-constrained, I understand that they did not want to go
there.

Another cost is that the potential for bugs increases quite
significantly, so one would have to use quite a bit more testing for
the same kind of reliability. Another reason not to go there.

The BLISS-11 compiler might have done more clever register allocation
but it ran on a PDP-10 which could address the equivalent of a
megabyte, not the 11's 64K.

The BLISS-11 compiler does global register allocation. It uses a very
compact way to represent the necessary information: For each variable
the start of its first live range and the end of its last live range
is remembered, and that was used as approximate liveness information
for determining whether two variables conflict. It will not allocate
two variables to the same register where the second variable's live
range fits in a live range hole of the first, but it can allocate two
variables to the same register if the last use of one variable is
before the first store to the other variable.
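
The conflict test described here boils down to an interval-overlap check. A minimal C sketch, assuming program positions are numbered as integers; the names LiveSpan and may_share_register are hypothetical and not taken from the BLISS-11 compiler.

```c
/* Constant-size approximation of a variable's lifetime: everything from
   the first store to the last use counts as live, holes included. */
typedef struct {
    int first_def;  /* position of the first store to the variable */
    int last_use;   /* position of the last use of the variable */
} LiveSpan;

/* Two variables may share a register only if one span ends strictly
   before the other begins. */
int may_share_register(LiveSpan a, LiveSpan b)
{
    return a.last_use < b.first_def || b.last_use < a.first_def;
}
```

A variable live over positions 2..3 therefore conflicts with one spanning 1..10 even if the latter has a live range hole at 2..3, which is the conservative part of the approximation.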

So, again, this does not cost a lot in data space, at least as far as
variables are concerned. It does mean that one has to look at the
whole function and do the register allocation before making any
compilation decisions, though. It also costs code space that a simple
statement-at-a-time compiler does not need. I guess one could do this
on a PDP-11 with several passes, but if I have the choice to do it on
a PDP-10, keeping the whole function and the data about its variables
in memory, I would do it on the PDP-10; and I may have developed a
BLISS-10 compiler on the PDP-10 already.

But in any case, the global register allocation of BLISS-11 is far
beyond what I was discussing.

PDP-11 instructions were all one or two operand, with all operands being fully
general.

It's interesting that VAX generalized this to general three-address
operations (and added a proper indexed mode), while the 68K and IA-32
architects decided to support only one memory operand for most
instructions (but with more addressing modes, including proper indexed
addressing modes). For the 68k the limitation to one memory operand
for most instructions probably was not a matter of principle (it has a
move instruction that supports two memory operands); my guess is that
they decided that for encoding reasons.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Thomas Koenig on Mon Jul 14 10:30:06 2025

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s)

I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":

"If you know what bubble sort is, wipe it from your mind; if you
don't know, make a point of never finding out!"

Unless you can prove that this kind of bad code generation by gcc can
only occur for bubble sort, this benchmark is a reason to ignore this
advice.

Of course, an alternative is to close your eyes and ears and find some
excuse for every case where gcc does something undesirable.
"Undefined behaviour" is the default excuse, but you can vary the
excuses by quoting from books; appeal to authority is a good argument
in these times.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From David Brown@21:1/5 to Anton Ertl on Mon Jul 14 16:51:57 2025

On 13/07/2025 16:24, Anton Ertl wrote:

David Brown <[email protected]> writes:

The key metrics are not, I think, instruction counts - but memory
accesses and how likely they are to cause delays.

And one might also wonder what hardware one should look at. AMD64
does not use pcc-struct-returns by default, so finding out in how many
cases 0-cycle store-to-load forwarding (implemented in recent cores)
eliminates the delays does not tell us the performance characteristics
on hardware that mostly executed IA-32 code where pcc-struct-returns
are the default.

As you show, having a pointer to "int * signgam" means that there will
be only one extra write to memory (in the callee) and one extra read (in
the caller) - while for a "pcc-struct-return" API you have two. However,
those will be adjacent and probably combined.

The stores go separately to the store units (and consume the resources
there), and the stores are to write-back cache, not write-combining
memory. The loads go separately to the load units and consume the
resources there; no combining happens. The data will be in the
D-cache in the usual case, and on recent hardware there could even be
0-cycle store-to-load-forwarding.

OK. (That is all, of course, very dependent on the processor in question.)

If you are thinking about autovectorization by the compiler, yes, that
could happen, but IMO it costs more than it buys.

No, I was not thinking of that. I was thinking that adjacent memory
accesses can be handled more efficiently in hardware than separate ones.
You will probably avoid two cache misses, for example. And I would
expect that on some processors at least, adjacent writes could be
combined when there are data buses that are wider than the individual
writes.

I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s) by
auto-vectorizing the adjacent accesses of bubble-sort. Not only does
the code execute significantly more instructions, it also hits a slow
hardware path in store-to-load-forwarding on every store it performs
in this way.

Yes, I have also seen enthusiastic autovectorisation being
counter-productive, especially if you are actually using small amounts
of data. (clang/llvm seems keener on autovectorising code than gcc,
IME.) And I've seen other situations in which "gcc -O3" generates
slower code than "gcc -O2" - "gcc -O3" should only be used with care
and extensive testing on the real code and real target.

But even without this slow path, my expectation is that the
auto-vectorization overhead would slow the benchmark down compared to
the -O1 version (which is just scalar code), but how could I measure
this?

The slow path should not occur in the struct-return case, though.

Another combining idea is the use of ARM A64's store pair and load
pair instructions, which result in only one memory access for each
such instruction and result in fewer instructions than doing unpaired
loads and stores, while the code resulting from auto-vectorization on
AMD64 is longer than two scalar stores and two scalar loads.

Yes, that is another possibility.

Unfortunately, store-pair and load-pair do not support storing or
loading an FP and an integer value AFAIK.

There are other circumstances where one might want to return a struct
than just calling a gamma function!

In theory, even if a struct return needs to pass a hidden pointer, the
compiler knows more about it than for a general "int *" pointer
parameter. It knows that there are no aliasing issues or "escapes" -
when you have a local variable whose address is passed on to
"lgamma_ertl", the compiler has to assume that the function might store
the address and later functions might use it to change the value of the
local variable "sign". With the hidden struct pointer, the compiler
knows that access via the pointer is much more restricted.
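
To make the three interface shapes under discussion concrete, here is a minimal C sketch. The names split_global, split_outptr, split_struct and split_result are hypothetical, and the body is a trivial stand-in (absolute value plus sign) rather than a real log-gamma; only the calling shapes matter.

```c
/* Stand-in for lgamma(): returns |x| and reports the sign separately.
   Hypothetical helpers, used only to show the three API shapes. */

/* Shape 1: second result in a global variable, like signgam. */
int sign_global;

double split_global(double x)
{
    sign_global = (x < 0.0) ? -1 : 1;
    return (x < 0.0) ? -x : x;
}

/* Shape 2: second result through an out-pointer supplied by the caller. */
double split_outptr(double x, int *sign)
{
    *sign = (x < 0.0) ? -1 : 1;
    return (x < 0.0) ? -x : x;
}

/* Shape 3: both results returned together in a struct. */
typedef struct {
    double result;
    int sign;
} split_result;

split_result split_struct(double x)
{
    split_result r;
    r.sign = (x < 0.0) ? -1 : 1;
    r.result = (x < 0.0) ? -x : x;
    return r;
}
```

With the out-pointer form the caller has to take the address of a local variable, which is where the escape concern described above comes from. With the struct form the only pointer involved is the hidden one that the ABI may introduce.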

(With C23, a function like "lgamma_ertl" would be marked
[[unsequenced]], or at least [[reproducible]], which would let the
compiler make similar assumptions for optimisation.)

You mean that the programmer could mark the function in that way?

Yes. Or, for a library function, the library header would mark it that
way in the declaration.

Wouldn't some use of "restrict" give the compiler similar information?
I just don't know where in the code to apply "restrict". Maybe

double lgamma_ertl2(double x, int *restrict signgam);

?

I don't see how "restrict" would help here.

If the lgamma_ertl2 function is declared "[[unsequenced]]", then the
compiler knows that it will not store the "signgam" pointer anywhere
else. Thus it knows any other functions called after lgamma_ertl2
cannot change the variable that "signgam" pointed to.

(Marking it as [[unsequenced]] or [[reproducible]] gives other
optimisation advantages for the calling code, and would be a good idea
anyway even if the function returned a struct. But a function that
changes a global variable cannot be thus marked.)

Would struct returns have been used more if they were not so
inefficient?

Possibly. I certainly remember wanting to use them for something
Gforth-internal, and then deciding against them after seeing the
generated code.

E.g., MIPS (1986) got a calling convention that passes the first four
words of parameters in integer registers and the rest on the stack.
That's not particularly efficient for passing FP parameters, but it
meant that calls to functions, including varargs functions like
printf() would work without prototypes (C89 only came later) and
varargs functions could be implemented simply by storing these four
registers to the stack (IIRC the four slots for these parameter words
were reserved).

I think it's more complicated: If the first parameter is an integer
one, then do everything in integer registers, otherwise pass FP stuff
in FP registers. Probably the idea is that varargs functions always
start with an integer parameter.

Later I saw a calling convention (IIRC Alpha) where parameter n was
passed in integer register n if it was integer and FP register n if it
was an FP value. The respective other register went unused.

Recently I have seen a calling convention (IIRC RISC-V) where the
integer registers used are allocated one after the other whether there were
FP parameters interleaved or not, and the same on the FP side. I
don't remember what happens if the call runs out of one kind of
register, and the other kind is still available.
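
The two register-assignment schemes recalled above can be sketched in C. This is only a model of the descriptions as given (IIRC Alpha, IIRC RISC-V), not of either real ABI; the function names and the 'i'/'f' encoding of parameter kinds are made up for illustration, and running out of registers is deliberately not modelled.

```c
#include <stddef.h>

/* kinds is a string with one character per parameter:
   'i' = integer parameter, 'f' = floating-point parameter.
   reg_out[n] receives the register number used for parameter n,
   within the register class that matches its kind. */

/* Position-based scheme: parameter n uses register n of its own class;
   register n of the other class simply goes unused. */
void assign_by_position(const char *kinds, int *reg_out)
{
    for (size_t n = 0; kinds[n] != '\0'; ++n)
        reg_out[n] = (int)n;
}

/* Per-class scheme: integer and FP registers are handed out from two
   independent counters, regardless of how the kinds are interleaved. */
void assign_by_class(const char *kinds, int *reg_out)
{
    int next_int = 0;
    int next_fp = 0;
    for (size_t n = 0; kinds[n] != '\0'; ++n)
        reg_out[n] = (kinds[n] == 'f') ? next_fp++ : next_int++;
}
```

For the parameter list "ifif" the first scheme yields registers 0, 1, 2, 3 (half of each class unused), while the second yields 0, 0, 1, 1.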

Instead, many current ABIs are
at least sub-optimal for structs

Which ones do you have in mind?

The architecture that is most relevant for my daily work, and where
efficiency matters to me, is 32-bit ARM for embedded systems.

ARM A32 (and T32 uses the same calling conventions) is from around the
same time as MIPS, so similar calling conventions are to be expected.
However, I see various ABIs mentioned in the descriptions of various
things (eABI, oABI, etc.). So apparently they did several.

Yes, there have been a few modifications to the ARM32 ABI - there are
also small differences, I believe, between the details for Linux,
Windows and embedded toolchains. (It's only the last one that is really
relevant now at 32-bit.)

I realise 32-bit ARM was around before much of this was relevant (I
first played with ARM assembly in 1988 as a schoolkid). But it is
surely possible to modernise things a little?

Breaking compatibility has an immediate cost and (hopefully) a
long-term return. It's a really hard sell. But apparently ARM with
their several ABIs has gone there. Too little?

Yes, too little.

The immediate cost for embedded toolchains would not be too high -
certainly not in comparison to hosted targets. You need to add a
compiler flag for the new ABI (along with support via __attribute__,
#pragma, etc.), and add it to the list of static library builds you make
for the toolchain. Then developers can use it simply by adding the flag
to their CCFLAGS in their makefile, or whatever build system they like.
Those who have existing pre-compiled binaries (such as commercial
libraries or RTOSes) won't be able to use it easily until their supplier
updates the libraries, but that would happen sooner or later.

From Michael S@21:1/5 to Thomas Koenig on Mon Jul 14 19:30:06 2025

On Sun, 13 Jul 2025 17:00:44 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Anton Ertl <[email protected]> schrieb:

I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s)

I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":

"If you know what bubble sort is, wipe it from your mind; if you
don't know, make a point of never finding out!"

The same can be said (with stronger justification) of many of their
recipes. Less so of the algorithms, more so of the code.

From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jul 14 17:33:34 2025

On Mon, 14 Jul 2025 6:22:29 +0000, Anton Ertl wrote:
------------

PDP-11 instructions were all one or two operand, with all operands being
fully general.

It's interesting that VAX generalized this to general three-address
operations (and added a proper indexed mode), while the 68K and IA-32
architects decided to support only one memory operand for most
instructions (but with more addressing modes, including proper indexed
addressing modes). For the 68k the limitation to one memory operand
for most instructions probably was not a matter of principle (it has a
move instruction that supports two memory operands); my guess is that
they decided on that for encoding reasons.

When I was doing 88100 at Motorola, the 68020 guys would say that
once there were sufficient resources, they could make a MOV-CALK
run just as fast as a 2-operand 1-result instruction model

68020
MOV D3,D2 // first 16-bits
CALK D3,D1 // 32-bits

88100
CALK D3,D2,D1 // 32-bits

I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.

- anton

From Thomas Koenig@21:1/5 to Anton Ertl on Mon Jul 14 19:03:32 2025

Anton Ertl <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

I have also seen
gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
part of Hennessy's small integer benchmarks (from the 1980s)

I would like to quote Press, Teukolsky, Vetterling and Flannery,
from "Numerical Recipes":

"If you know what bubble sort is, wipe it from your mind; if you
don't know, make a point of never finding out!"

Unless you can prove that this kind of bad code generation by gcc can
only occur for bubble sort, this benchmark is a reason to ignore this
advice.

Not for me to prove anything.

But as I'm sure that you have filled out a PR, because you are such
a constructive person bent on helping others instead of whining.

Could you give me the PR number? I could then re-check and
(if necessary) re-confirm.

--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Terje Mathisen@21:1/5 to John Levine on Mon Jul 14 21:34:33 2025

John Levine wrote:

According to Anton Ertl <[email protected]>:

Anyway, I expect that Unix already had a calling convention on PDP-11
and several other machines, and of course PCC followed that
convention. As for the C compiler that introduced these calling
conventions (probably by Ritchie), my guess is that he was happy to
produce a working C compiler that ran in the little RAM they had.

It was two passes each about 24K bytes and a third optional optimizer
that slightly rewrote the assembler code.

The Ritchie compiler and I think PCC reserved up to three registers
for declared register variables, and used the rest as a stack for
temporaries. It used Sethi-Ullman numbering to do the more complex
subexpressions first to avoid running out of registers. If it did
run out of registers I think it just gave up, but I don't ever
remember that happening.

Reserving more registers would have been really hard.

I agree that on the 386 it would probably have been practical to pass
arguments in registers, but I suspect that for whatever reason they
wanted to make the calling sequence similar to the 8086 and 286.

Not only that, but the 386 still had just 8 minus 1 or 2 total registers:

If you only have eax, ebx, ecx, edx, esi, edi as regular registers, ebp
as either frame pointer or (typically for leaf functions) another reg by
making do without a frame pointer, then you had just 6 more-or-less
general registers.

Several had to be used by many instructions: edx+eax was always the
target for 32x32->64-bit MUL, source for DIV, ecx (cl) had to be used
for all variable shift counts etc.

ESI/EDI/ECX were used for all string ops and block moves.

In short, even for my own asm code I very rarely used more than two
register variables as function parameters.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Jul 14 22:41:44 2025

On Sun, 13 Jul 2025 14:22:55 GMT, Anton Ertl wrote:

Ease of adapting 16-bit compilers and library routines might have been
reasons.

This is why I always felt that Intel took a short-sighted approach to each
new generation of chips from 8086/80186 to 80286 to 80386.

Contrast Motorola, where the original 16-bit 68000 was clearly a cut-down
32-bit design to begin with. The progression to the 68020 was largely a
matter of filling in obvious gaps, which made the software transition so
much easier.

From Lawrence D'Oliveiro@21:1/5 to All on Mon Jul 14 22:47:02 2025

On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:

I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.

I remember this rather large (6:1 code size ratio) counterexample from the
VAX ...

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Jul 14 23:14:19 2025

On Mon, 14 Jul 2025 22:47:02 +0000, Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:

I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.

I remember this rather large (6:1 code size ratio) counterexample from
the VAX ...

As I remember::

CALLS and RET could be faster when using JSR and JMP and SW pushes
and pops of preserved registers.

Coroutines would use JSR +(SP) between co-routines. {pop one off
then push the new return address on}.

POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.

Simple (i.e., COBOL picture) EDIT and MARK could be faster with
just instructions.

VAX was admired and beloved for a decade, before sliding off into
insignificance.

From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 15 00:58:52 2025

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

On Mon, 14 Jul 2025 22:47:02 +0000, Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:

I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.

I remember this rather large (6:1 code size ratio) counterexample from
the VAX ...

As I remember::

[examples omitted]

Maybe true, but I doubt any of them made this much difference. The big one
was this: saving registers R0-R5 on entry to a kernel routine (which
happened quite commonly) could be done most compactly as

PUSHR #^M<R0,R1,R2,R3,R4,R5>

which was a single instruction of just 2 bytes. Or it could be done much
more verbosely as

PUSHL R5
PUSHL R4
PUSHL R3
PUSHL R2
PUSHL R1
PUSHL R0

which was 6 instructions totalling 12 bytes.

The latter was faster.
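
As a small aside on the sizes quoted above, the register mask for R0-R5 can be worked out in C. The helper names are hypothetical, and the remark about the encoding is an assumption on my part: if the mask is encoded as a 6-bit short literal (values 0..63), the operand takes a single byte, which would account for the 2-byte PUSHR.

```c
#include <stdint.h>

/* Build a register mask with bit n set for register Rn, covering the
   contiguous range Rfirst..Rlast (0 <= first <= last <= 15). */
uint16_t reg_mask(unsigned first, unsigned last)
{
    uint16_t mask = 0;
    for (unsigned r = first; r <= last; ++r)
        mask |= (uint16_t)(1u << r);
    return mask;
}

/* Assumption: a mask that fits in 6 bits can be encoded as a one-byte
   short-literal operand. Returns 1 if it fits, 0 otherwise. */
int fits_short_literal(uint16_t mask)
{
    return mask <= 63;
}
```

For R0-R5 the mask is 0x3F, the largest value that still fits in 6 bits; adding R6 would push it past that limit.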

POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the examples I came
across in my numerical-analysis courses, evaluation terminated much more
commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a
predetermined number of terms. So it didn’t seem that useful in real life.

VAX was admired and beloved for a decade, before sliding off into
insignificance.

Remember it straddled those transitions between instruction sets that had
annoying arbitrary restrictions because of hardware limitations, to the
intermediate era when the hardware limitations went away, then onto the
RISC era, when instruction sets went back to simplicity, but in a
different direction and for a new reason: because that was the way to
maximize performance.

From Anton Ertl@21:1/5 to [email protected] on Tue Jul 15 05:34:44 2025

[email protected] (MitchAlsup1) writes:

When I was doing 88100 at Motorola, the 68020 guys would say that
once there were sufficient resources, they could make a MOV-CALK
run just as fast as a 2-operand 1-result instruction model

68020
MOV D3,D2 // first 16-bits
CALK D3,D1 // 32-bits

88100
CALK D3,D2,D1 // 32-bits

That day arrived at the latest when Sandy Bridge was released in 2011
with its separate physical register files and register renamer. It
usually handles the register-register mov in the renamer, resulting in
0-cycle movs, especially in cases like these where the result of the
mov is overwritten soon. Another option would be to let the decoder
combine the MOV and the CALK into one three-address microinstruction.

I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.

Intel apparently thinks so; they introduce three-address encodings for
the existing instructions with APX.

What is the advantage of APX over the register renamer approach? It
takes fewer resources in the register renamer (which is often the
narrowest part of a core).

What is the advantage of APX over combining the instructions in the
decoder? If the CALK part traps (e.g., because it includes a memory
access), the architecture requires that the exception handler is
presented with the architectural state between the MOV and the CALK,
and this requires additional complications, while an architectural
three-address instruction does not have this complication.

IIRC there are code size advantages to the APX three-address encodings
over the MOV-CALK combination in some, but not all cases.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jul 15 06:04:03 2025

Thomas Koenig <[email protected]> writes:

But as I'm sure that you have filled out a PR, because you are such
a constructive person bent on helping others instead of whining.

We have been over that before: I have reported gcc bugs in the past,
but my experience in the last few decades is that it is not at all
constructive, but a waste of time. See, e.g., PR93811.

But if you think that it is useful, spend your own time on it. In the
meantime I still amuse myself by making fun of gcc and clang failures.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Jul 15 06:53:59 2025

On Tue, 15 Jul 2025 06:04:03 GMT, Anton Ertl wrote:

We have been over that before: I have reported gcc bugs in the past, but
my experience in the last few decades is that it is not at all
constructive, but a waste of time. See, e.g., PR93811.

They seem to think it is not needed on PowerPC <https://gcc.gnu.org/pipermail/gcc-bugs/2020-February/690898.html>.

From Michael S@21:1/5 to Anton Ertl on Tue Jul 15 13:14:05 2025

On Tue, 15 Jul 2025 05:34:44 GMT
[email protected] (Anton Ertl) wrote:

[email protected] (MitchAlsup1) writes:

When I was doing 88100 at Motorola, the 68020 guys would say that
once there were sufficient resources, they could make a MOV-CALK
run just as fast as a 2-operand 1-result instruction model

68020
MOV D3,D2 // first 16-bits
CALK D3,D1 // 32-bits

88100
CALK D3,D2,D1 // 32-bits

That day arrived at the latest when Sandy Bridge was released in 2011
with its separate physical register files and register renamer. It
usually handles the register-register mov in the renamer, resulting in
0-cycle movs, especially in cases like these where the result of the
mov is overwritten soon.

All that is great for low-IPC latency-bound code. It helps little in
high-IPC code where the rename stage tends to be the narrowest bottleneck.

Another option would be to let the decoder
combine the MOV and the CALK into one three-address microinstruction.

I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.

Intel apparently thinks so; they introduce three-address encodings for
the existing instructions with APX.

What is the advantage of APX over the register renamer approach? It
takes fewer resources in the register renamer (which is often the
narrowest part of a core).

What is the advantage of APX over combining the instructions in the
decoder? If the CALK part traps (e.g., because it includes a memory
access), the architecture requires that the exception handler is
presented with the architectural state between the MOV and the CALK,
and this requires additional complications, while an architectural
three-address instruction does not have this complication.

IIRC there are code size advantages to the APX three-address encodings
over the MOV-CALK combination in some, but not all cases.

- anton

The biggest question about APX is "Will it ship?"

X86S is canceled. Which is a good thing.

AVX10 is canceled except for a few minor bits. Which is, maybe, a good thing
from the point of view of software fragmentation, because now Intel is
forced to implement AVX512 on their future E cores.
I still think that from a technical perspective full-featured 256-bit
SIMD is a better solution than the neither-here-nor-there 512-bit
thing, but what do they say about water under the bridge?

If APX does not ship in Panther Cove cores (not to be confused with
Panther Lake SoC that is based on previous-generation cores) then it is
dead. We will know how it is going pretty soon, no later than early
2027, but likely earlier.

From Michael S@21:1/5 to Anton Ertl on Tue Jul 15 13:31:15 2025

On Tue, 15 Jul 2025 06:04:03 GMT
[email protected] (Anton Ertl) wrote:

Thomas Koenig <[email protected]> writes:

But as I'm sure that you have filled out a PR, because you are such
a constructive person bent on helping others instead of whining.

We have been over that before: I have reported gcc bugs in the past,
but my experience in the last few decades is that it is not at all
constructive, but a waste of time. See, e.g., PR93811.

My personal experience with pessimization-related PRs is that the
resolution rate is low, but above zero. Something like 10-15% of my PRs
were solved over a time span of a couple of gcc generations.
Of course, in two generations' time many new pessimization cases pop up :(
But still, I think that submitting this sort of PR is not totally
useless.

But if you think that it is useful, spend your own time on it. In the
meantime I still amuse myself by making fun of gcc and clang failures.

- anton

I never submitted a PR to clang. Certainly not because I had never seen
it generating horrendous code. I simply never cared to learn how to do
it.

From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Jul 15 13:46:10 2025

Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the examples I came
across in my numerical-analysis courses, evaluation terminated much more
commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a
predetermined number of terms. So it didn’t seem that useful in real life.

You obviously have never implemented any fp library:

When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you
pretty much always use fixed-number-of-term polys.
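
For readers who have not written such code, here is a minimal C sketch of fixed-number-of-term evaluation, once with Horner's rule and once with Estrin's scheme for a degree-7 polynomial. The function names and the lowest-degree-first coefficient layout are my own choices for illustration; nothing here is taken from an actual libm or from the VAX POLY definition.

```c
/* Coefficients are stored lowest degree first:
   p(x) = a[0] + a[1]*x + ... + a[n-1]*x^(n-1), with n >= 1. */

/* Horner's rule: one multiply and one add per coefficient,
   forming a single serial dependency chain. */
double poly_horner(const double *a, int n, double x)
{
    double r = a[n - 1];
    for (int i = n - 2; i >= 0; --i)
        r = r * x + a[i];
    return r;
}

/* Estrin's scheme for exactly 8 coefficients (degree 7):
   the four inner pairs are independent of each other, so a machine
   with enough functional units can evaluate them in parallel. */
double poly_estrin8(const double a[8], double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = a[0] + a[1] * x;
    double p23 = a[2] + a[3] * x;
    double p45 = a[4] + a[5] * x;
    double p67 = a[6] + a[7] * x;
    double lo = p01 + p23 * x2;
    double hi = p45 + p67 * x2;
    return lo + hi * x4;
}
```

Both functions do a fixed amount of work for a given degree. Estrin's scheme spends a few extra multiplies on the powers of x in exchange for a shorter dependency chain, which only pays off once there are enough terms. The two can differ in the last bits for general inputs because the rounding order differs.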

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From MitchAlsup1@21:1/5 to Anton Ertl on Tue Jul 15 17:38:36 2025

On Tue, 15 Jul 2025 5:34:44 +0000, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

When I was doing 88100 at Motorola, the 68020 guys would say that
once there were sufficient resources, they could make a MOV-CALK
run just as fast as a 2-operand 1-result instruction model

68020
MOV D3,D2 // first 16-bits
CALK D3,D1 // 32-bits

88100
CALK D3,D2,D1 // 32-bits

That day arrived at the latest when Sandy Bridge was released in 2011
with its separate physical register files and register renamer. It
usually handles the register-register mov in the renamer, resulting in
0-cycle movs, especially in cases like these where the result of the
mov is overwritten soon. Another option would be to let the decoder
combine the MOV and the CALK into one three-address microinstruction.

AMD K9 would have done that circa 2006--but I digress.

I am still of the opinion that fewer instructions remains better;
especially if they occupy the same code footprint.

Intel apparently thinks so; they introduce three-address encodings for
the existing instructions with APX.

What is the advantage of APX over the register renamer approach? It
takes fewer resources in the register renamer (which is often the
narrowest part of a core).

Having the compiler (an already NP-complete piece of work) do it
is vastly better than having HW stumble over it and catch the
ones it can.

What is the advantage of APX over combining the instructions in the
decoder? If the CALK part traps (e.g., because it includes a memory
access), the architecture requires that the exception handler is
presented with the architectural state between the MOV and the CALK,
and this requires additional complications, while an architectural
three-address instruction does not have this complication.

An example from My 66000 ISA that is illustrative::

CALX--this is basically a LDD IP,[address] with R0=next instruction
address (that is, it's a CALL from a table in memory).

When CALX reads in a zero (ld.so has not loaded the dynamic library)
the trap is presented with the CALX (not just the JMP Rk instruction)
So there is a 5 instruction sequence that results in the GOT[index#]
allowing ld.so to know which library was called and go do its job.

The advantage is speed equal in the "it works case" and usefully
faster in the "didn't work" cases. And of course the side advantages
of not consuming a register, .....

IIRC there are code size advantages to the APX three-address encodings
over the MOV-CALK combination in some, but not all cases.

- anton

From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jul 15 17:44:19 2025

On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the examples I
came across in my numerical-analysis courses, evaluation terminated much
more commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a
predetermined number of terms. So it didn’t seem that useful in real
life.

You obviously have never implemented any fp library:

When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you
pretty much always use fixed-number-of-term polys.

Certainly when following Cody and Waite or J.M. Muller. But there are
ways of implementing the same list as above, testing whether the
significance has leveled off and exiting early. It is generally slower
in the worst case and not much faster in the typical case--but it is a
method taught in Numerical Methods classes.

Terje

From Michael S@21:1/5 to [email protected] on Tue Jul 15 23:52:01 2025

On Tue, 15 Jul 2025 17:44:19 +0000
[email protected] (MitchAlsup1) wrote:

On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

POLY could be faster in instructions when there were enough terms
for Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the
examples I came across in my numerical-analysis courses,
evaluation terminated much more commonly based on convergence
to the final result, not on some predetermined number of terms.
But the VAX instruction only did a predetermined number of
terms. So it didn’t seem that useful in real life.

You obviously have never implemented any fp library:

When you write code for things like
log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use
fixed-number-of-term polys.

Certainly when following Cody and Waite or J.M. Muller. But there are
ways of implementing the same list as above, testing whether the
significance has leveled off and exiting early. It is generally slower
in the worst case and not much faster in the typical case--but it is a
method taught in Numerical Methods classes.

Terje

You mean, to sum starting from the bigger terms and moving to the smaller terms?
Something like:

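double sum = a[0];
double xx = x;
for (int i = 1; i < n; ++i) {      /* n = number of coefficients in a[] */
    double sum1 = sum + xx * a[i];
    if (sum == sum1)               /* term no longer changes the sum */
        break;
    sum = sum1;
    xx *= x;
}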

That is the worst possible order of evaluation from perspective of
precision.

That's the worst possible meth
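
To see how the early-exit loop quoted above behaves, here is a self-contained C sketch that wraps it in a function and also reports how many terms were consumed. The names poly_early_exit and fill_exp_taylor are hypothetical, the upper bound n on the loop is my addition, and the Taylor coefficients of exp are used purely as sample input.

```c
/* Fill a[0..n-1] with the Taylor coefficients of exp: a[k] = 1/k!. */
void fill_exp_taylor(double *a, int n)
{
    a[0] = 1.0;
    for (int k = 1; k < n; ++k)
        a[k] = a[k - 1] / k;
}

/* Sum a[0] + a[1]*x + a[2]*x^2 + ... from the lowest degree upwards,
   stopping as soon as adding the next term leaves the sum unchanged.
   *terms_used receives the number of coefficients that contributed. */
double poly_early_exit(const double *a, int n, double x, int *terms_used)
{
    double sum = a[0];
    double xx = x;
    int i;
    for (i = 1; i < n; ++i) {
        double sum1 = sum + xx * a[i];
        if (sum == sum1)        /* next term is below the rounding level */
            break;
        sum = sum1;
        xx *= x;
    }
    *terms_used = i;
    return sum;
}
```

With 25 coefficients available and x = 0.5, the loop stops well before the end of the table, and the result agrees with exp(0.5) to within rounding. The number of terms depends on x, which is the data-dependent behaviour that a fixed-term instruction like POLY cannot express.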

From Thomas Koenig@21:1/5 to Anton Ertl on Tue Jul 15 21:01:06 2025

Anton Ertl <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

But as I'm sure that you have filled out a PR, because you are such
a constructive person bent on helping others instead of whining.

We have been over that before: I have reported gcc bugs in the past,
but my experience in the last few decades is that it is not at all
constructive, but a waste of time. See, e.g., PR93811.

You can also submit a patch, you know.

But if you have a self-contained test case, post it here, I'll submit
it for you.

But if you think that it is useful, spend your own time on it. In the
meantime I still amuse myself by making fun of gcc and clang failures.

Non-constructive whining, paired with a heavy dose of arrogance.
Oh well.

--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From EricP@21:1/5 to Lawrence D'Oliveiro on Tue Jul 15 17:44:04 2025

Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the examples I came across in my numerical-analysis courses, evaluation terminated much more commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a predetermined number of terms. So it didn’t seem that useful in real life.

The problem with VAX POLY was that it was implemented differently on
different models, with different mistakes. To save microcode it was
eventually eliminated from hardware (traps to emulate if used).

How the VAX Lost Its POLY (and EMOD and ACB_floating too), 2011 https://simh.trailing-edge.com/docs/vax_poly.pdf
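
For readers who have not met the Estrin's method mentioned in the quoted text above, here is a minimal C sketch comparing it with Horner's rule for a fixed eight-coefficient polynomial. The function names horner8 and estrin8 and the fixed degree are assumptions made for illustration; nothing here reflects how VAX POLY or any particular libm is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Horner's rule: seven multiply-adds, each depending on the previous one. */
static double horner8(const double a[8], double x)
{
    assert(a != NULL);
    double s = a[7];
    for (int i = 6; i >= 0; --i)
        s = s * x + a[i];
    return s;
}

/* Estrin's scheme for the same polynomial: the four pair terms are
   independent of each other, so hardware with several FP units (or a
   pipelined FMA) can overlap them. The dependency chain is shorter,
   at the cost of the extra multiplies for x^2 and x^4. */
static double estrin8(const double a[8], double x)
{
    assert(a != NULL);
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = a[0] + a[1] * x;
    double p23 = a[2] + a[3] * x;
    double p45 = a[4] + a[5] * x;
    double p67 = a[6] + a[7] * x;
    double q0 = p01 + p23 * x2;   /* a0 .. a3 part */
    double q1 = p45 + p67 * x2;   /* a4 .. a7 part */
    return q0 + q1 * x4;
}
```

Both functions compute a[0] + a[1]*x + ... + a[7]*x^7. In floating point the two orders can round differently, so results need not be bit-identical in general.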

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Wed Jul 16 01:21:44 2025

On Tue, 15 Jul 2025 20:52:01 +0000, Michael S wrote:

On Tue, 15 Jul 2025 17:44:19 +0000
[email protected] (MitchAlsup1) wrote:

On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

POLY could be faster in instructions when there were enough terms
for Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the
examples I came
across in my numerical-analysis courses, evaluation terminated
much more commonly based on convergence to the final result, not
on some predetermined number of terms. But the VAX instruction
only did a predetermined number of terms. So it didn’t seem that
useful in real life.

You obviously have never implemented any fp library:

When you write code for things like
log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use
fixed-number-of-term polys.

Certainly when following Cody and Waite or J.M. Muller. But there are
ways of implementing the same list as above, testing if the significance
has leveled off and taking an early out. It is generally slower in the
worst case and not much faster in the typical case--but it is a method
taught in Numerical Methods classes.

Terje

You mean, to summate starting from bigger terms to smaller terms?
Something like:

sum = a[0];
xx = x;
for (int i = 1; ; ++i) {
    sum1 = sum + xx * a[i];
    if (sum == sum1)
        break;
    sum = sum1;
    xx *= x;
}

That is the worst possible order of evaluation from perspective of
precision.

Not in HW where you have a minimum of 2× fraction width.
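
To make the two evaluation orders under discussion concrete, here is a minimal C sketch, not taken from any poster's code. The names poly_horner and poly_early_out, the explicit coefficient count n, and the terms_used output are assumptions added for illustration; the early-out version follows the quoted loop but adds a bound so it cannot run past the end of the table.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed number of terms, Horner's rule. Evaluation starts at the
   highest-order coefficient, so for a converging series the smallest
   contributions enter the sum first. */
static double poly_horner(const double *a, int n, double x)
{
    assert(a != NULL && n >= 1);
    double sum = a[n - 1];
    for (int i = n - 2; i >= 0; --i)
        sum = sum * x + a[i];
    return sum;
}

/* Early-out forward summation, largest term first. Stops as soon as
   adding a term no longer changes the sum, or after n coefficients. */
static double poly_early_out(const double *a, int n, double x, int *terms_used)
{
    assert(a != NULL && n >= 1 && terms_used != NULL);
    double sum = a[0];
    double xx = x;
    int i;
    for (i = 1; i < n; ++i) {
        double sum1 = sum + xx * a[i];
        if (sum1 == sum)
            break;              /* term fell below the rounding threshold */
        sum = sum1;
        xx *= x;
    }
    *terms_used = i;            /* coefficients actually folded into sum */
    return sum;
}
```

In plain double arithmetic the forward version rounds after every addition to an already large partial sum, which is the precision concern raised above; the reply about hardware with at least twice the fraction width refers to an accumulator wide enough that this rounding does not matter.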

That's the worst possible meth

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Wed Jul 16 05:47:08 2025

On Tue, 15 Jul 2025 13:46:10 +0200, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the examples I came
across in my numerical-analysis courses, evaluation terminated much more
commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a
predetermined number of terms. So it didn’t seem that useful in real life.

You obviously have never implemented any fp library:

When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use fixed-number-of-term polys.

Computing π to a given precision: <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Wed Jul 16 14:44:33 2025

Lawrence D'Oliveiro wrote:

On Tue, 15 Jul 2025 13:46:10 +0200, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

POLY could be faster in instructions when there were enough terms for
Estrin's method to pay dividends.

The problem with polynomial evaluation is, at least in the examples I came
across in my numerical-analysis courses, evaluation terminated much more
commonly based on convergence to the final result, not on some
predetermined number of terms. But the VAX instruction only did a
predetermined number of terms. So it didn’t seem that useful in real life.

You obviously have never implemented any fp library:

When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you
pretty much always use fixed-number-of-term polys.

Computing π to a given precision: <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.

Quoting from your own link:

Conclusion: What is the value of continued fractions?

Clearly mathematicians have a lot of fun with them. But speaking as
someone who does computation on a daily basis, I have to say I don’t
think they’re a practical way of evaluating anything. Maybe I’m wrong,
and someone who has delved more deeply into them can offer better
examples of how to use them ...

If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Thu Jul 17 01:54:13 2025

On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.

If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.

Quote:

Or compare this function, adapted from the recipes section of the
decimal module documentation:

[code omitted -- see reference]

As you can see, this converges a lot quicker.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Thu Jul 17 11:18:00 2025

Lawrence D'Oliveiro wrote:

On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.

If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.

Quote:

Or compare this function, adapted from the recipes section of the
decimal module documentation:

[code omitted -- see reference]

As you can see, this converges a lot quicker.

Another, somewhat important consideration:

If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.

This was definitely a requirement for the Mill fp emulation work I did.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jul 17 14:37:20 2025

On Thu, 17 Jul 2025 14:15:30 +0000, Scott Lurndal wrote:

Terje Mathisen <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

Another, somewhat important consideration:

If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.

For security purposes, all instruction timing must be data independent.

I like this wording better than auto-vectorize.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Jul 17 14:15:30 2025

Terje Mathisen <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

Another, somewhat important consideration:

If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.

For security purposes, all instruction timing must be data independent.
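
As a software-side illustration of why data-independent timing matters, here is a minimal C sketch of a comparison routine written to avoid data-dependent branches and early exits. The name ct_compare is hypothetical. The sketch assumes that the byte XOR and OR operations themselves take the same time for all operand values, which is exactly the hardware property being asked for, and that the compiler does not reintroduce an early exit.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 0 if the two n-byte buffers are equal, nonzero otherwise.
   Every byte is examined regardless of where the first mismatch is,
   so the loop trip count does not depend on the buffer contents. */
static int ct_compare(const unsigned char *a, const unsigned char *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    unsigned char diff = 0;
    for (size_t i = 0; i < n; ++i)
        diff |= (unsigned char)(a[i] ^ b[i]);   /* accumulate any difference */
    return diff;
}
```

A memcmp-style loop that returns at the first differing byte leaks the position of the mismatch through its running time; the version above trades that early exit for a fixed amount of work.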

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Terje Mathisen on Thu Jul 17 14:36:41 2025

On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.

If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.

Quote:

Or compare this function, adapted from the recipes section of the
decimal module documentation:

[code omitted -- see reference]

As you can see, this converges a lot quicker.

Another, somewhat important consideration:

If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.

Can I get your definition of "auto-vectorize"

A wide-decode and a set of reservation stations can "vectorize" a
loop or straight line of code. Does this qualify as "auto-vectorize" ??

Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
directive.

This was definitely a requirement for the Mill fp emulation work I did.

Given that there are a few instructions which can have variable latency
and a spattering that HAVE TO HAVE variable latency this requirement
causes "problems".

In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
at cycle 12, and it took 5 more cycles to KNOW that the result was
properly rounded (all RMs). So, instead of having FDIV have 17 cycle
latency, we allowed it to have 12 cycles of latency 87.5% of the time
and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
This is usefully faster than fixed 17 cycles.

The same argument applies to SQRT.

Any LD instruction backed by a cache HAS TO HAVE variable latency.
Any memory ref with a translated address HAS TO HAVE variable
latency (TLB miss).
Store instruction waiting on long latency result data HAS TO HAVE
variable latency between AGEN and Write.

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to All on Fri Jul 18 15:06:49 2025

MitchAlsup1 wrote:

On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.

If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.

Quote:

     Or compare this function, adapted from the recipes section of the
     decimal module documentation:

     [code omitted -- see reference]

     As you can see, this converges a lot quicker.

Another, somewhat important consideration:

If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.

Can I get your definition of "auto-vectorize"

A wide-decode and a set of reservation stations can "vectorize" a
loop or straight line of code. Does this qualify as "auto-vectorize" ??

Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
directive.

This was definitely a requirement for the Mill fp emulation work I did.

Given that there are a few instructions which can have variable latency
and a spattering that HAVE TO HAVE variable latency this requirement
causes "problems".

Yeah, I do know that. Memory ops in SIMD style short vectors typically
have all slots residing in the same cache line, so even though the
latency is not predictable, it will probably be the same for all elements.

In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
at cycle 12, and it took 5 more cycles to KNOW that the result was
properly rounded (all RMs). So, instead of having FDIV have 17 cycle
latency, we allowed it to have 12 cycles of latency 87.5% of the time
and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
This is usefully faster than fixed 17 cycles.

So if 87.5% of all divisions finish in 12 cycles, and you do 8 of them
in parallel, then (for random inputs), all 8 will finish in 12 with a
34% probability, leaving 17 cycles as the actual latency in 66% of all
cases. Total average latency becomes 15.3 cycles, so most of the gain is
lost.
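
The arithmetic above can be checked with a small C sketch. The function name lockstep_latency and its parameters are assumptions made for illustration; the model is simply that each lane independently takes the fast path with probability p_fast and that a lockstep group finishes only when its slowest lane does.

```c
#include <assert.h>

/* Expected latency of `lanes` independent operations issued in lockstep.
   The group completes in fast_cycles only if every lane takes the fast
   path; otherwise it waits for slow_cycles. */
static double lockstep_latency(double p_fast, int lanes,
                               double fast_cycles, double slow_cycles)
{
    assert(lanes >= 1);
    double all_fast = 1.0;
    for (int i = 0; i < lanes; ++i)
        all_fast *= p_fast;     /* P(every lane takes the fast path) */
    return all_fast * fast_cycles + (1.0 - all_fast) * slow_cycles;
}
```

With p_fast = 0.875, 12 fast cycles and 17 slow cycles, one lane gives 0.875 * 12 + 0.125 * 17 = 12.625 cycles, and eight lanes give 0.875^8 ≈ 0.344, so roughly 0.344 * 12 + 0.656 * 17 ≈ 15.3 cycles, matching the figures quoted.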

The same argument applies to SQRT.

Any LD instruction backed by a cache HAS TO HAVE variable latency.
Any memory ref with a translated address HAS TO HAVE variable
latency (TLB miss).
Store instruction waiting on long latency result data HAS TO HAVE
variable latency between AGEN and Write.

I don't think we disagree Mitch, I'm just stating that if you have a
lockstep programming model, then variable latency per slot tends to end
up with worst case latency all over, so if you could have done the Mc
88K FDIV in a fixed 16-cycles, that might have been better for this
particular programming model.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Terje Mathisen on Fri Jul 18 15:16:47 2025

On Fri, 18 Jul 2025 13:06:49 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

Computing π to a given precision:
<https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
No fixed number of terms in the common algorithms, as you can see.

If this was supposed to show how you would use variable number of terms
for common library functions, then I failed to understand it.

Quote:

     Or compare this function, adapted from the recipes section of the
     decimal module documentation:

     [code omitted -- see reference]

     As you can see, this converges a lot quicker.

Another, somewhat important consideration:

If you want to make it possible to auto-vectorize code, then you pretty
much need for all instructions to have constant latency, maybe with a
few exceptions that will then cause pipeline bubbles.

Can I get your definition of "auto-vectorize"

A wide-decode and a set of reservation stations can "vectorize" a
loop or straight line of code. Does this qualify as "auto-vectorize" ??

Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
directive.

This was definitely a requirement for the Mill fp emulation work I did.

Given that there are a few instructions which can have variable latency
and a spattering that HAVE TO HAVE variable latency this requirement
causes "problems".

Yeah, I do know that. Memory ops in SIMD style short vectors typically
have all slots residing in the same cache line, so even though the
latency is not predictable, it will probably be the same for all
elements.

In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
at cycle 12, and it took 5 more cycles to KNOW that the result was
properly rounded (all RMs). So, instead of having FDIV have 17 cycle
latency, we allowed it to have 12 cycles of latency 87.5% of the time
and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
This is usefully faster than fixed 17 cycles.

So if 87.5% of all divisions finish in 12 cycles, and you do 8 of them
in parallel, then (for random inputs), all 8 will finish in 12 with a
34% probability, leaving 17 cycles as the actual latency in 66% of all
cases. Total average latency becomes 15.3 cycles, so most of the gain is lost.

If you are doing enough FDIVs to matter, the long counts and the short
counts will be randomly distributed across the lanes. So the long-term
average (OoO style) will approximate the previous cycle counts.

You don't do this if all 8 lanes have to remain in lock step.

The same argument applies to SQRT.

Any LD instruction backed by a cache HAS TO HAVE variable latency.
Any memory ref with a translated address HAS TO HAVE variable
latency (TLB miss).
Store instruction waiting on long latency result data HAS TO HAVE
variable latency between AGEN and Write.

I don't think we disagree Mitch, I'm just stating that if you have a
lockstep programming model, then variable latency per slot tends to end
up with worst case latency all over, so if you could have done the Mc
88K FDIV in a fixed 16-cycles, that might have been better for this particular programming model.

I whole-heartedly agree with that paragraph.

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Waldek Hebisch@21:1/5 to [email protected] on Fri Jul 18 20:01:16 2025

MitchAlsup1 <[email protected]> wrote:

On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

David Brown <[email protected]> writes:

<snip>

It would have been nice if, when struct returns and struct parameters
were added to C, someone had taken time to improve the ABIs to make
them efficient.

Given the name of the calling convention variant, this was introduced
in PCC (and probably struct returns themselves were introduced in
PCC). PCC was released in 1979 on the machines of the day, such as
the PDP-11; I am sure Johnson implemented a calling convention for
struct passing and struct returns that used the least amount of code.
If Johnson had had more space to play with, he probably would have had
other things on the agenda before improving the struct return calling
convention. E.g., the calling conventions at the time passed all
parameters on the stack, and we still have this in the Intel calling
convention for IA-32.

Given that PDP-11 had 6 usable general-purpose registers, and x86
started out with a similar number, it would have been quite difficult
to pass the first few arguments in registers. On PDP-11 and x86 it was
easy to push arguments onto the stack and to address them in the
callee from the stack.

Watcom C for 386 offered a register passing convention, IIRC first
3 integer (or equivalent) arguments were passed in registers.
And this convention gave measurable speedup compared to standard
convention.

--
Waldek Hebisch

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Jul 18 20:12:38 2025

Anton Ertl <[email protected]> wrote:

[1] I have wondered about the selection of registers for the System V
calling convention for the System V ABI for AMD64: the first 6
arguments go in RDI, RSI, RDX, RCX, R8, R9. The first two are optimal
for memcpy() implemented with REP MOVSB, but then RCX would be better
in third position. RDI is also good for memset() with REP STOSB, RDI
and RSI are also good for memcmp() with REP CMPSB, and I expect that
there are other uses of REP instructions for implementing memory-block
or string functions where the placement in RDI and RSI is
helpful. Except that the library routines then often do not use the
REP instructions.

There is a paper by (IIRC) Jan Hubicka for the GCC developers summit
(probably in 2005) about targeting AMD64. This paper explains
several ABI design decisions (but possibly not the ordering
between RDX and RCX). The ABI was developed before the team doing
the port had access to actual hardware, so they mostly looked at
code size. IIRC the number of registers was chosen based on code
size for a collection of benchmarks; 6 gave the smallest size.

--
Waldek Hebisch

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Fri Jul 25 01:59:31 2025

On Fri, 18 Jul 2025 20:01:16 -0000 (UTC), Waldek Hebisch wrote:

Watcom C for 386 offered a register passing convention, IIRC first 3
integer (or equivalent) arguments were passed in registers.
And this convention gave measurable speedup compared to standard
convention.

WATCOM C was also used to compile FoxBase. When Microsoft acquired that,
they tried switching to their own C compiler. Unfortunately this produced larger code, which made the program overflow the 640K RAM limit.

Yes, it was that long ago.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
