• Speculation from the past

    From MitchAlsup1@21:1/5 to All on Thu Jul 10 15:50:38 2025
    Does anyone know why libm defines

    double lgamma( double x );
    with an extern int signgam

    instead of:

    typedef struct { double result;
    int sign; } gammaresult;

    gammaresult lgamma( double x );

    Struct returns from subroutines were part of C back in 1980...
    {when I started using C}

    One could add to this discussion as to why errno was not
    done with struct return; ala::

    typedef struct { int fides;
    int error; } openresult;

    openresult open( char *string, int modes );

    as are many of the Linux OS entry points.

    This has to be a better solution compared to errno and signgam.

    Speculations welcome.
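
    The two call-site styles can be compared directly. A minimal sketch,
    assuming a hypothetical wrapper named open_result() layered on the
    existing POSIX open(); the struct mirrors the openresult typedef above,
    and nothing here is an existing libc interface:

```c
#include <errno.h>
#include <fcntl.h>

/* Mirrors the openresult typedef proposed above. */
typedef struct { int fides;    /* file descriptor, or -1 on failure */
                 int error;    /* 0 on success, else an E* number   */
               } openresult;

/* Hypothetical wrapper (not a libc function): folds errno into the
   return value so the caller never has to read the global afterwards.
   Sketch only: flags must not include O_CREAT, since no mode is passed. */
openresult open_result(const char *path, int flags)
{
    openresult r = { -1, 0 };
    r.fides = open(path, flags);
    if (r.fides < 0)
        r.error = errno;       /* capture before anything can overwrite it */
    return r;
}
```

    A caller then tests r.error and uses r.fides, without consulting errno.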

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to [email protected] on Thu Jul 10 16:58:46 2025
    [email protected] (MitchAlsup1) writes:
    Does anyone know why libm defines

    double lgamma(x) { }
    with an extern int signgam

    instead of:

    typedef struct { double result;
    int sign; } gammaresult;

    gammaresult lgamma( double x );

    Struct returns from subroutines were part of C back in 1980...
    {when I started using C}

    Possibly, but they were not part of early C, are not particularly
    efficient on many ABIs, and are inconvenient to use if you want to use
    all the components of the struct. So there were lots of reasons why
    API designers avoided the use of struct returns. An alternative would
    have been

    double lgamma(double gamma, int *signgam);

    however.

    One could add to this discussion as to why errno was not
    done with struct return

    Struct return did not exist in early C, so C wrappers for system calls
    (which existed from the start) do not use it.

    However, the actual system call interface does not have errno, but
    either returns the result in one register, or in one register and a
    flag (IIRC the carry flag in some system call interfaces I looked at).
    If the result is returned in one register, the usual indication of an
    error is that the sign bit is set; in that case the value of the
    register is the negated error number. For a separate flag, the value
    of the register is the error number. If you look at the original
    system calls of Unix, the limitation to positive numbers is not a
    problem.

    To a large degree, that is still the case, although, e.g., mmap() on a
    32-bit system can return a negative address, so the condition for an
    error of mmap() is a little bit more complicated than just checking
    the sign bit.

    In any case, these days errno is a perversity kept alive by backwards
    compatibility: The C wrapper for the system call has to check whether
    there is an error, then has to compute the error number and
    expensively store it to the thread-local storage where errno resides.
    Then the caller tests the return value of the C wrapper for indicating
    an error, and then accesses errno expensively in thread-local storage.
    If the C wrapper directly returned the return value of the system
    call, with some macros for finding out if there is an error and what
    the errno is, the whole system call would be more efficient.
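
    Such macros (written here as inline functions) might look like the
    following sketch. It assumes the Linux convention that a raw system call
    return value in the range -4095..-1 is a negated error number and that
    anything else is a valid result; the names raw_is_error() and raw_errno()
    are made up for illustration. Comparing as unsigned also covers the
    mmap() case, where a valid address can have the sign bit set:

```c
/* Assumption: Linux-style raw return values, where -4095..-1 encode -errno. */
#define RAW_MAX_ERRNO 4095L

/* Nonzero if the raw system call result encodes an error. */
static inline int raw_is_error(long raw)
{
    /* Seen as unsigned, the error range is the top 4095 values, so a
       "negative" address is not mistaken for an error. */
    return (unsigned long)raw >= (unsigned long)-RAW_MAX_ERRNO;
}

/* The E* number carried by an error result.
   Only meaningful when raw_is_error(raw) is nonzero. */
static inline int raw_errno(long raw)
{
    return (int)-raw;
}
```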

    You might wonder about the architectures that use the carry flag to
    indicate that there is an error. But given that all maintained OSs
    for these architectures have to also work on architectures that do not
    pass the error indication in that way, I expect that the C wrapper
    could transform that into the variant that uses the same error
    indication as on the architectures that do not use the carry bit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to [email protected] on Thu Jul 10 17:37:55 2025
    [email protected] (MitchAlsup1) writes:
    Does anyone know why libm defines

    double lgamma(x) { }
    with an extern int signgam

    instead of:

    typedef struct { double result;
    int sign; } gammaresult;

    gammaresult lgamma( double x );

    Because the original lgamma was defined long before
    the committee added the 'signgam' feature, which was
    defined before pthreads was adopted from 1003.4 into
    XPG.

    The committee doesn't change existing function
    definitions in order to avoid breaking existing applications,
    so the extern was added. In retrospect, given the
    subsequent adoption of pthreads, it would have been better
    to create a new interface, not named 'lgamma' to support
    returning the sign value.

  • From David Brown@21:1/5 to Anton Ertl on Thu Jul 10 21:18:05 2025
    On 10/07/2025 18:58, Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Does anyone know why libm defines

    double lgamma(x) { }
    with an extern int signgam

    instead of:

    typedef struct { double result;
    int sign; } gammaresult;

    gammaresult lgamma( double x );

    Struct returns from subroutines were part of C back in 1980...
    {when I started using C}

    Possibly, but they were not part of early C, are not particularly
    efficient on many ABIs, and are inconvenient to use if you want to use
    all the components of the struct. So there were lots of reasons why
    API designers avoided the use of struct returns. An alternative would
    have been

    double lgamma(double gamma, int *signgam);

    however.


    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution. After all, the
    typical simplistic struct return here would be roughly equivalent to:

    void lgamma(gammaresult * result, double gamma);

    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient. Then the habit of decent ABI's could have continued
    when new architectures were developed. Instead, many current ABI's are
    at least sub-optimal for structs - a particular pain for C++.

  • From MitchAlsup1@21:1/5 to David Brown on Thu Jul 10 21:30:22 2025
    On Thu, 10 Jul 2025 19:18:05 +0000, David Brown wrote:

    On 10/07/2025 18:58, Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Does anyone know why libm defines

    double lgamma(x) { }
    with an extern int signgam

    instead of:

    typedef struct { double result;
    int sign; } gammaresult;

    gammaresult lgamma( double x );

    Struct returns from subroutines were part of C back in 1980...
    {when I started using C}

    Possibly, but they were not part of early C, are not particularly
    efficient on many ABIs, and are inconvenient to use if you want to use
    all the components of the struct. So there were lots of reasons why
    API designers avoided the use of struct returns. An alternative would
    have been

    double lgamma(double gamma, int *signgam);

    however.


    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution.

    Given that one has to
    a) look at something (return value or flag)
    b) and if set "bad" go find errno
    c) set errno to negative of value
    d) return
    e) look at return value
    f) if "bad" go find errno
    g) read errno
    h) go do something about it

    I think it is easy to make the argument that structure returns
    is almost always less expensive:: as::

    a) return 2 values
    b) if second value is "bad"
    c) go do something about it

    And this is thread safe, too.

    After all, the typical simplistic struct return here would be roughly equivalent to :

    void lgamma(gammaresult * result, double gamma);

    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient. Then the habit of decent ABI's could have continued
    when new architectures were developed. Instead, many current ABI's are
    at least sub-optimal for structs - a particular pain for C++.

    Do you think it is time to make another layer of wrappers::
    // for illustrative purposes

    typedef struct { int first, second; } two_returns;

    two_returns new_open( char *string, int flags );

    fides open( char *string, int flags )
    {
        two_returns old = new_open( string, flags );
        if( old.second )
        {
            errno = -old.second;
            old.first = -1;
        }
        return (fides)old.first;
    }

    enum System_Calls { ..., file_open, ... };

    two_returns new_open( char *string, int flags )
    {
        return SYSCALL( string, flags, file_open );
    }

    This results in a system call that is easily inlined by the compiler and
    results in 2 or 3 instructions in many new architectures, instead of
    "lots" including additional control transfers (call and return) along
    with accessing errno (signgam), ...

    One would not want to inline the old way. So, now we can let the
    compiler inline SYSCALLs with reasonable safety.

  • From Lawrence D'Oliveiro@21:1/5 to All on Thu Jul 10 22:10:56 2025
    On Thu, 10 Jul 2025 15:50:38 +0000, MitchAlsup1 wrote:

    One could add to this discussion as to why errno was not
    done with struct return; ala::

    typedef struct { int fides;
    int error; } openresult;

    openresult open( char *string, int modes );

    as are many of the Linux OS entry points.

    This has to be a better solution compared to errno and signgam.

    The actual Linux kernel entry point for open(2) returns a non-negative FD
    number on success, and a negative error code on failure. Other calls do
    similar things; it is the C runtime library wrapper that implements errno
    (as defined by C and POSIX APIs), it is not something the kernel knows
    (or cares) about.

    errno is a hack. It can’t even be treated as a simple global variable,
    because of the interaction with multithreading -- each thread has to have
    its own errno.

    There are some Linux-specific kernel calls where the userland API doesn’t
    even bother going through errno.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Fri Jul 11 02:03:20 2025
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 10 Jul 2025 15:50:38 +0000, MitchAlsup1 wrote:

    One could add to this discussion as to why errno was not
    done with struct return; ala::

    typedef struct { int fides;
    int error; } openresult;

    openresult open( char *string, int modes );

    as are many of the Linux OS entry points.

    This has to be a better solution compared to errno and signgam.

    The actual Linux kernel entry point for open(2) returns a non-negative FD
    number on success, and a negative error code on failure.

    Irrelevant.

    The open(2) API and errno mechanism were defined in very early Unix, half
    a century ago.

    They were standardized in the System V Interface Definition (SVID) in the
    1980s and in POSIX a few years later, followed by the X/Open Portability
    Guide (XPG) and finally the Single Unix Specification. In all cases
    backward compatibility at the source level was a requirement.

    Extensions and new capabilities related to opening a file are encapsulated
    in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

    Yes, there are likely different possible APIs; all new standardized Unix C
    APIs (e.g. posix_spawn, pthreads, et alia) return the E* error number
    directly (for thread safety), or zero for success, eschewing errno
    completely. Any other data returned by an API is via pointer parameters
    (often with 'restrict' qualification).
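
    posix_memalign() is a small, easily tested instance of this convention:
    it reports failure through its return value (0 or an E* number) rather
    than through errno, and hands back its result through a pointer
    parameter. A minimal sketch; the wrapper name alloc_aligned() is
    hypothetical:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask the headers for posix_memalign() */
#endif
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns 0 on success or an E* number on failure.
   The allocation itself comes back through *out, which is left untouched
   when the call fails. */
int alloc_aligned(void **out, size_t alignment, size_t size)
{
    void *p = NULL;
    int err = posix_memalign(&p, alignment, size);
    if (err != 0)
        return err;               /* e.g. EINVAL for a bad alignment, or ENOMEM */
    *out = p;
    return 0;
}
```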

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Jul 11 02:48:33 2025
    On Thu, 10 Jul 2025 16:58:46 GMT, Anton Ertl wrote:

    In any case, these days errno is a perversity kept alive by backwards
    compatibility: The C wrapper for the system call has to check whether
    there is an error, then has to compute the error number and
    expensively store it to the thread-local storage where errno resides.

    On the assumption that error conditions are less common than success, the
    fact that errno retains its previous value on the success case helps
    reduce the cost.

  • From Dan Cross@21:1/5 to Scott Lurndal on Fri Jul 11 11:03:32 2025
    In article <Ih_bQ.958133$[email protected]>,
    Scott Lurndal <[email protected]> wrote:
    The open(2) API and errno mechanism were defined in very early Unix, half
    a century ago.

    They were standardized in the System V Interface Definition (SVID) in the
    1980s and in POSIX a few years later, followed by the X/Open Portability
    Guide (XPG) and finally the Single Unix Specification. In all cases
    backward compatibility at the source level was a requirement.

    Extensions and new capabilities related to opening a file are encapsulated
    in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

    Yes, there are likely different possible APIs; all new standardized Unix C
    APIs (e.g. posix_spawn, pthreads, et alia) return the E* error number
    directly (for thread safety), or zero for success, eschewing errno
    completely. Any other data returned by an API is via pointer parameters
    (often with 'restrict' qualification).

    POSIX mandates that `errno` be (essentially) thread-local, so
    thread safety isn't much of a consideration here. Traditionally
    Unix kernels have returned a single value in a register, and set
    a flag (in the PSW or whatever) to indicate failure, leaving it
    to the syscall stubs in e.g. the C library to take whatever the
    kernel gives back from the actual syscall exit and make sure
    that `errno` is set appropriately.

    I can imagine that a kernel call interface where `errno` is not
    set is a bit more direct, but I don't think concurrency plays a
    huge role there; but maybe these interfaces were designed in
    that awkward time before `errno` was thread safe by mandate.
    And the case of `posix_spawn` might be special, since it is so
    often written in terms of `vfork`, which has its own bizarre
    semantics.

    - Dan C.

  • From David Brown@21:1/5 to All on Fri Jul 11 12:53:36 2025
    On 10/07/2025 23:30, MitchAlsup1 wrote:
    On Thu, 10 Jul 2025 19:18:05 +0000, David Brown wrote:

    On 10/07/2025 18:58, Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Does anyone know why libm defines

    double lgamma(x) { }
    with an extern int signgam

    instead of:

    typedef struct { double result;
                     int    sign; } gammaresult;

    gammaresult lgamma( double x );

    Struct returns from subroutines were part of C back in 1980...
    {when I started using C}

    Possibly, but they were not part of early C, are not particularly
    efficient on many ABIs, and are inconvenient to use if you want to use
    all the components of the struct.  So there were lots of reasons why
    API designers avoided the use of struct returns.  An alternative would
    have been

    double lgamma(double gamma, int *signgam);

    however.


    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution.

    Given that one has to
    a) look at something (return value or flag)
    b) and if set "bad" go find errno
    c) set errno to negative of value
    d) return
    e) look at return value
    f) if "bad" go find errno
    g) read errno
    h) go do something about it

    I think it is easy to make the argument that structure returns
    is almost always less expensive:: as::

    a) return 2 values
    b) if second value is "bad"
    c) go do something about it

    And this is thread safe, too.


    Sure. I was comparing struct returns to pointer-to-return-value
    functions. I agree that using errno is usually less efficient. (errno
    can be thread-safe using thread-specific errno - but then it is even
    more overhead in use.)

    Where errno can be a good idea is if you are doing a lot of calculations
    and then check errno once at the end. I don't know how often that is
    done in practice.
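
    The check-once-at-the-end pattern might look like this sketch, which uses
    strtol() as the errno-setting operation. The helper name parse_all() is
    hypothetical, and the pattern only works because a successful strtol()
    leaves errno untouched:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: converts n decimal strings and checks errno once.
   Returns 0 if every conversion was in range, else the E* number seen. */
int parse_all(const char *const *text, size_t n, long *out)
{
    errno = 0;                     /* clear first: success never resets it */
    for (size_t i = 0; i < n; i++)
        out[i] = strtol(text[i], NULL, 10);
    return errno;                  /* ERANGE if any value was out of range */
}
```

    The cost of reading errno is paid once per batch rather than once per
    call, but the caller no longer learns which conversion failed.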

    To me, the real benefit of having functions return a struct rather than
    use errno (or some other global variable) or take a
    pointer-to-return-value parameter, is that the function becomes "pure".
    The outputs depend solely on the inputs, and are consistent from call to
    call, with no side-effects. Now you can re-arrange them like any
    arithmetic code (with the same provisos about IEEE accuracy for floating
    point), pre-calculate results, skip duplicate calls, and do any other
    kinds of manipulation that suits. And it is far easier to reason about
    the correctness of code that has no side-effects.
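
    The C library already has one small example of this style: div() returns
    both quotient and remainder in a div_t struct. A sketch of a
    side-effect-free variant that also carries its error status in the return
    value; checked_div() and divresult are hypothetical names, and the choice
    of EDOM and ERANGE as error codes is an assumption:

```c
#include <errno.h>     /* only for the EDOM and ERANGE constants */
#include <limits.h>
#include <stdlib.h>

typedef struct { div_t value;    /* quotient and remainder */
                 int   error;    /* 0, or an E* number     */
               } divresult;

/* Hypothetical pure function: the result depends only on the arguments,
   and no global state is read or written. */
divresult checked_div(int num, int den)
{
    divresult r = { { 0, 0 }, 0 };
    if (den == 0)
        r.error = EDOM;
    else if (num == INT_MIN && den == -1)
        r.error = ERANGE;        /* quotient would not fit in an int */
    else
        r.value = div(num, den);
    return r;
}
```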


    After all, the typical simplistic struct return here would be roughly
    equivalent to:

        void lgamma(gammaresult * result, double gamma);

    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient.  Then the habit of decent ABI's could have continued
    when new architectures were developed.  Instead, many current ABI's are
    at least sub-optimal for structs - a particular pain for C++.

    Do you think it is time to make another layer of wrappers::
    // for illustrative purposes

    typedef struct { int first, second; } two_returns;

    two_returns new_open( char *string, int flags );

    fides open( char *string, int flags )
    {
        two_returns old = new_open( string, flags );
        if( old.second )
        {
            errno = -old.second;
            old.first = -1;
        }
        return (fides)old.first;
    }

    enum System_Calls { ..., file_open, ... };

    two_returns new_open( char *string, int flags )
    {
        return SYSCALL( string, flags, file_open );
    }

    This results in a system call that is easily inlined by the compiler and
    results in 2 or 3 instructions in many new architectures, instead of
    "lots" including additional control transfers (call and return) along
    with accessing errno (signgam), ...

    One would not want to inline the old way. So, now we can let the
    compiler inline SYSCALLs with reasonable safety.

    I think that for "big" functions - like most system calls - it's not
    worth the effort from an efficiency viewpoint. And it does not make the
    function "pure". So for C, that would be a waste of time for something
    like "open()".

    For maths functions and similar code, on the other hand, it can make a
    much bigger difference.

    For C++, the difference in usability is significant. Handling struct
    returns is somewhat inconvenient in C, though the C23 "auto" type
    inference helps a bit. C++ has significantly better support, especially
    if the struct types are std::expected<>, std::variant<> or
    std::optional<>. But even with plain old structs, C++ has structured
    binding and std::tie<> that make it all easier to use (especially with
    the new anonymous _ in C++26). Add to that, C++ has been gaining
    steadily more compile-time calculations (constexpr, consteval, and now
    the beginnings of reflection) which cannot work with side-effect functions.

  • From Scott Lurndal@21:1/5 to Dan Cross on Fri Jul 11 13:40:22 2025
    [email protected] (Dan Cross) writes:
    In article <Ih_bQ.958133$[email protected]>,
    Scott Lurndal <[email protected]> wrote:
    The open(2) API and errno mechanism were defined in very early Unix, half
    a century ago.

    They were standardized in the System V Interface Definition (SVID) in the
    1980s and in POSIX a few years later, followed by the X/Open Portability
    Guide (XPG) and finally the Single Unix Specification. In all cases
    backward compatibility at the source level was a requirement.

    Extensions and new capabilities related to opening a file are encapsulated
    in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

    Yes, there are likely different possible APIs; all new standardized Unix C
    APIs (e.g. posix_spawn, pthreads, et alia) return the E* error number
    directly (for thread safety), or zero for success, eschewing errno
    completely. Any other data returned by an API is via pointer parameters
    (often with 'restrict' qualification).

    POSIX mandates that `errno` be (essentially) thread-local, so
    thread safety isn't much of a consideration here. Traditionally
    Unix kernels have returned a single value in a register, and set
    a flag (in the PSW or whatever) to indicate failure, leaving it
    to the syscall stubs in e.g. the C library to take whatever the
    kernel gives back from the actual syscall exit and make sure
    that `errno` is set appropriately.

    I can imagine that a kernel call interface where `errno` is not
    set is a bit more direct, but I don't think concurrency plays a
    huge role there; but maybe these interfaces were designed in
    that awkward time before `errno` was thread safe by mandate.

    I was on the XPG working group in those years, and yes, they
    were designed in that awkward time as 1003.4a was being
    developed.

    And the case of `posix_spawn` might be special, since it is so
    often written in terms of `vfork`, which has its own bizarre
    semantics.

    posix_spawn was modeled somewhat after ADA process creation primitives.

    The rationale is included in the standard page.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html

  • From Anton Ertl@21:1/5 to David Brown on Fri Jul 11 14:50:58 2025
    David Brown <[email protected]> writes:
    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution. After all, the
    typical simplistic struct return here would be roughly equivalent to:

    void lgamma(gammaresult * result, double gamma);

    Let's see:

    #include <stdio.h>

    typedef struct { double result;
                     int    sign; } gammaresult;


    gammaresult lgamma_alsup1( double x )
    {
        gammaresult r;
        r.result = x+1.;
        r.sign = -1;
        return r;
    }

    double lgamma_ertl1(double x, int *signgam)
    {
        *signgam = -1;
        return x+1.;
    }

    extern gammaresult lgamma_alsup2( double x );

    void call_alsup()
    {
        gammaresult r=lgamma_alsup2(1.);
        printf("%f ",r.result);
        printf("%d ",r.sign);
    }

    extern double lgamma_ertl2(double x, int *signgam);

    void call_ertl()
    {
        int sign;
        printf("%f ",lgamma_ertl2(1.,&sign));
        printf("%d ",sign);
    }

    Here the calls are to a differently-named function with the same
    interface such that we see what happens without inlining. The first
    thing to note is that the source code for the struct-returning
    function is longer. The calling code is slightly longer.

    I have compiled that on AMD64 with:

    gcc -fpcc-struct-return -Wall -O -c lgamma.c

    The output of "objdump -d lgamma.o" for lgamma_*1 is:

    0000000000000000 <lgamma_alsup1>:
    0: 48 89 f8 mov %rdi,%rax
    3: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # b <lgamma_alsup1+0xb>
    a: 00
    b: f2 0f 11 07 movsd %xmm0,(%rdi)
    f: c7 47 08 ff ff ff ff movl $0xffffffff,0x8(%rdi)
    16: c3 ret

    0000000000000017 <lgamma_ertl1>:
    17: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
    1d: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # 25 <lgamma_ertl1+0xe>
    24: 00
    25: c3 ret

    So with the typical simplistic struct return (aka pcc-struct-return)
    the code of the function is longer.

    The code for the call_* functions is:

    0000000000000026 <call_alsup>:
    26: 53 push %rbx
    27: 48 83 ec 10 sub $0x10,%rsp
    2b: 48 89 e7 mov %rsp,%rdi
    2e: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 36 <call_alsup+0x10>
    35: 00
    36: e8 00 00 00 00 call 3b <call_alsup+0x15>
    3b: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
    40: f2 0f 10 04 24 movsd (%rsp),%xmm0
    45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 4c <call_alsup+0x26>
    4c: b8 01 00 00 00 mov $0x1,%eax
    51: e8 00 00 00 00 call 56 <call_alsup+0x30>
    56: 89 de mov %ebx,%esi
    58: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 5f <call_alsup+0x39>
    5f: b8 00 00 00 00 mov $0x0,%eax
    64: e8 00 00 00 00 call 69 <call_alsup+0x43>
    69: 48 83 c4 10 add $0x10,%rsp
    6d: 5b pop %rbx
    6e: c3 ret

    000000000000006f <call_ertl>:
    6f: 48 83 ec 18 sub $0x18,%rsp
    73: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi
    78: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 80 <call_ertl+0x11>
    7f: 00
    80: e8 00 00 00 00 call 85 <call_ertl+0x16>
    85: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 8c <call_ertl+0x1d>
    8c: b8 01 00 00 00 mov $0x1,%eax
    91: e8 00 00 00 00 call 96 <call_ertl+0x27>
    96: 8b 74 24 0c mov 0xc(%rsp),%esi
    9a: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # a1 <call_ertl+0x32>
    a1: b8 00 00 00 00 mov $0x0,%eax
    a6: e8 00 00 00 00 call ab <call_ertl+0x3c>
    ab: 48 83 c4 18 add $0x18,%rsp
    af: c3 ret

    18 instructions for call_alsup() vs. 14 for call_ertl(), so again the
    struct-return variant leads to longer code with pcc-struct-return.

    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient.

    Given the name of the calling convention variant, this was introduced
    in PCC (and probably struct returns themselves were introduced in
    PCC). PCC was released in 1979 on the machines of the day, such as
    the PDP-11; I am sure Johnson implemented a calling convention for
    struct passing and struct returns that used the least amount of code.
    If Johnson had had more space to play with, he probably would have had
    other things on the agenda before improving the struct return calling
    convention. E.g., the calling conventions at the time passed all
    parameters on the stack, and we still have this in the Intel calling
    convention for IA-32.

    Early RISC calling conventions passed several parameters in registers,
    but still used pcc-struct-returns. But struct returns were so rare in
    libraries that gcc added an option -freg-struct-return which returns
    small structs in registers, and this option used to be usable, because
    libraries or system calls did not use struct-returns at the time.

    Eventually, ABI specifications went for more efficient, but also more
    complex and less forgiving calling conventions, so on AMD64 without
    -fpcc-struct-return gammaresult is actually returned in registers,
    leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
    13 instructions for call_alsup (shorter than call_ertl).
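
    Whether a struct qualifies for register return under the System V AMD64
    ABI depends in part on its size: aggregates larger than 16 bytes are
    returned in memory. A sketch that checks this for gammaresult; the helper
    name is hypothetical, the 16-byte limit is specific to that ABI, and size
    is only one of its classification rules:

```c
typedef struct { double result;
                 int    sign; } gammaresult;

/* Assumption: System V AMD64 ABI, where a struct of at most two
   eightbytes (16 bytes) can be returned in registers. */
#define SYSV_AMD64_REG_RETURN_MAX 16u

/* Returns nonzero if gammaresult is small enough for register return. */
int gammaresult_fits_in_registers(void)
{
    return sizeof(gammaresult) <= SYSV_AMD64_REG_RETURN_MAX;
}
```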

    Then the habit of decent ABI's could have continued
    when new architectures were developed.

    It seems to me that that's what happened (except that it was not a
    continuation): When new architectures were introduced, ABIs were
    introduced that made use of the additional memory, but also took
    compatibility with existing practice into account.

    E.g., MIPS (1986) got a calling convention that passes the first four
    words of parameters in integer registers and the rest on the stack.
    That's not particularly efficient for passing FP parameters, but it
    meant that calls to functions, including varargs functions like
    printf() would work without prototypes (C89 only came later) and
    varargs functions could be implemented simply by storing these four
    registers to the stack (IIRC the four slots for these parameter words
    were reserved).

    As time progressed, calling conventions tried to keep stuff more in
    registers and in the right kind of registers, at the cost of a more
    complex implementation and breaking programs without prototypes.
    E.g., the AMD64 ABI specifies register struct returns for small
    structs.

    Instead, many current ABI's are
    at least sub-optimal for structs

    Which ones do you have in mind?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Niklas Holsti@21:1/5 to Scott Lurndal on Fri Jul 11 18:33:19 2025
    On 2025-07-11 16:40, Scott Lurndal wrote:
    [email protected] (Dan Cross) writes:

    [snip]

    And the case of `posix_spawn` might be special, since it is so
    often written in terms of `vfork`, which has its own bizarre
    semantics.

    posix_spawn was modeled somewhat after ADA process creation primitives.

    The rationale is included in the standard page.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.htm

    To clarify: the models for posix_spawn were not Ada /language/
    primitives, but process-creation operations provided in a standard
    Ada-to-POSIX binding. Quoting from the page referenced above:

    "Instead, posix_spawn() and posix_spawnp() are process creation
    primitives like the Start_Process and Start_Process_Search Ada language
    bindings [in] package POSIX_Process_Primitives and also like those in
    many operating systems that are not UNIX systems, but with some
    POSIX-specific additions."

    The Ada language itself does not have a "process" concept. Ada has
    "tasks" that are execution threads that run in a shared address space.
    Tasks in Ada are created by dedicated syntax and not by calling some
    task-creating operations.

  • From Dan Cross@21:1/5 to Scott Lurndal on Fri Jul 11 16:11:00 2025
    In article <av8cQ.984246$[email protected]>,
    Scott Lurndal <[email protected]> wrote:
    [email protected] (Dan Cross) writes:
    In article <Ih_bQ.958133$[email protected]>,
    Scott Lurndal <[email protected]> wrote:
    The open(2) API and errno mechanism were defined in very early Unix, half
    a century ago.

    They were standardized in the System V Interface Definition (SVID) in the
    1980s and in POSIX a few years later, followed by the X/Open Portability
    Guide (XPG) and finally the Single Unix Specification. In all cases
    backward compatibility at the source level was a requirement.

    Extensions and new capabilities related to opening a file are encapsulated
    in additional APIs such as fcntl(2), ioctl(2), tcsetattr(2), et alia.

    Yes, there are likely different possible APIs; all new standardized Unix C
    APIs (e.g. posix_spawn, pthreads, et alia) return the E* error number
    directly (for thread safety), or zero for success, eschewing errno
    completely. Any other data returned by an API is via pointer parameters
    (often with 'restrict' qualification).

    POSIX mandates that `errno` be (essentially) thread-local, so
    thread safety isn't much of a consideration here. Traditionally
    Unix kernels have returned a single value in a register, and set
    a flag (in the PSW or whatever) to indicate failure, leaving it
    to the syscall stubs in e.g. the C library to take whatever the
    kernel gives back from the actual syscall exit and make sure
    that `errno` is set appropriately.
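
    A minimal sketch of such a stub, assuming a made-up kernel reply of
    one register value plus a failure flag. The names raw_reply,
    fake_sys_open() and stub_open() are hypothetical stand-ins; a real
    stub is a few lines of assembly around the trap instruction.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical raw kernel reply: one register value plus a failure flag
   (standing in for, e.g., a carry bit in the PSW). */
typedef struct {
    long value;   /* result on success, error number on failure */
    int  failed;  /* nonzero if the kernel signalled an error */
} raw_reply;

/* Stand-in for the trap into the kernel; invented for this example. */
static raw_reply fake_sys_open(const char *path)
{
    raw_reply r;
    if (path == NULL || path[0] == '\0') {
        r.value = ENOENT;
        r.failed = 1;
    } else {
        r.value = 3;      /* pretend descriptor 3 was allocated */
        r.failed = 0;
    }
    return r;
}

/* C-library-style stub: translate (value, flag) into the familiar
   "return -1 and set errno" convention. */
int stub_open(const char *path)
{
    raw_reply r = fake_sys_open(path);
    assert(r.value >= 0);
    if (r.failed) {
        errno = (int)r.value;
        return -1;
    }
    return (int)r.value;
}
```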

    I can imagine that a kernel call interface where `errno` is not
    set is a bit more direct, but I don't think concurrency plays a
    huge role there; but maybe these interfaces were designed in
    that awkward time before `errno` was thread safe by mandate.

    I was on the XPG working group in those years, and yes, they
    were designed in that awkward time as 1003.4a was being
    developed.

    Thanks for the confirmation; that makes sense.

    And the case of `posix_spawn` might be special, since it is so
    often written in terms of `vfork`, which has its own bizarre
    semantics.

    posix_spawn was modeled somewhat after ADA process creation primitives.

    The rationale is included in the standard page.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html

    Thanks, but I don't think that directly addresses why they chose
    to return error status directly in the return value, and not set
    errno as a side-effect.

    Perhaps a hint is given here, from the rationale you pointed to
    earlier:

    |The posix_spawn() function is implementable as a library
    |routine, but both posix_spawn() and posix_spawnp() are designed
    |as kernel operations.

    ...one presumes that, on systems where it is implemented as a
    library routine, it is written in terms of fork/exec and
    capturing the value of errno in the case of a successful fork/
    failed exec might be challenging. On existing Unix-y systems, I
    suspect it is almost always implemented in terms of vfork/exec,
    which has its own issues, but since the child "borrows" its
    parent's address space until it either exec's or exits, maybe it
    would be _easier_ to bubble errno values back up.
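
    For the plain fork/exec case, one common way to get the child's errno
    back to the parent is a close-on-exec pipe. The following is only a
    sketch under that assumption, not how any particular C library
    implements posix_spawn(); spawn_sketch() is a hypothetical name, and
    signal handling and file actions are left out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not the real posix_spawn(): returns 0 on success
   or an errno value directly. */
int spawn_sketch(pid_t *pid_out, const char *path,
                 char *const argv[], char *const envp[])
{
    int fds[2];
    int e;

    assert(pid_out != NULL);

    if (pipe(fds) != 0)
        return errno;
    /* Close-on-exec: a successful exec closes the write end for us. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0) {
        e = errno;
        close(fds[0]);
        close(fds[1]);
        return e;
    }

    pid_t pid = fork();
    if (pid < 0) {
        e = errno;
        close(fds[0]);
        close(fds[1]);
        return e;
    }
    if (pid == 0) {
        /* Child: only reaches the write() if exec failed. */
        close(fds[0]);
        execve(path, argv, envp);
        e = errno;
        ssize_t w = write(fds[1], &e, sizeof e);
        (void)w;
        _exit(127);
    }

    /* Parent: EOF means exec succeeded, data means it failed. */
    close(fds[1]);
    int child_errno = 0;
    ssize_t n;
    do {
        n = read(fds[0], &child_errno, sizeof child_errno);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    if (n > 0) {
        /* Reap the failed child so it does not linger as a zombie. */
        while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
            ;
        return child_errno;
    }
    *pid_out = pid;
    return 0;
}
```

    With vfork the pipe would be unnecessary, since the child could store
    the error into a variable in the shared address space before exiting.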

    - Dan C.

  • From Thomas Koenig@21:1/5 to Niklas Holsti on Fri Jul 11 16:58:07 2025
    Niklas Holsti <[email protected]d> schrieb:

    "Instead, posix_spawn() and posix_spawnp()

    For a second, I read that as posix_swamp().

    But then again, I have been known to write about unsinged numbers.

  • From Scott Lurndal@21:1/5 to Dan Cross on Fri Jul 11 17:21:25 2025
    [email protected] (Dan Cross) writes:
    In article <av8cQ.984246$[email protected]>,
    Scott Lurndal <[email protected]> wrote:

    <snip posix_spawn discussion>

    The rationale is included in the standard page.
    https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html

    Thanks, but I don't think that directly addresses why they chose
    to return error status directly in the return value, and not set
    errno as a side-effect.

    My recollection is the choice to return errno directly was made
    because we were aware of the pending 1003.4a specification (I sat
    in on a couple of those meetings as well when our regular posix rep
    wasn't available).

  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 11 19:25:21 2025
    On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

    David Brown <[email protected]> writes:
    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution. After all, the
    typical simplistic struct return here would be roughly equivalent to :

    void lgamma(gammaresult * result, double gamma);

    Let's see:

    #include <stdio.h>

    typedef struct { double result;
    int sign; } gammaresult;


    gammaresult lgamma_alsup1( double x )
    {
    gammaresult r;
    r.result = x+1.;
    r.sign = -1;
    return r;
    }

    lgamma_alsup1:
    FADD R1,R1,#1.
    MOV R2,#-1
    RET // 3 instructions no memory

    double lgamma_ertl1(double x, int *signgam)
    {
    *signgam = -1;
    return x+1.;
    }

    lgamma_ertl1:
    ST #-1,[R2]
    FADD R1,R1,#1.
    RET // 3 instructions 1 memory == more power

    extern gammaresult lgamma_alsup2( double x );

    void call_alsup()
    {
    gammaresult r=lgamma_alsup2(1.);
    printf("%f ",r.result);
    printf("%d ",r.sign);
    }


    call_alsup:
    ENTER R0,R0,#16
    CVTSD R1,#1 // 4 bytes instead of MOV R1,#1.0D0 as 12 bytes
    CALX [IP,,GOT[lgamma_alsup2#]-.]
    STD R2,[SP,#8]
    MOV R2,R1
    LDA R1,&"%f "
    CALL printf
    LD R2,[SP,8]
    LDA R1,&"%d "
    CALL printf
    EXIT R0,R0,#8 // 11 instructions 1 STD 1 LDD 2 LDA

    extern double lgamma_ertl2(double x, int *signgam);

    void call_ertl()
    {
    int sign;
    printf("%f ",lgamma_ertl2(1.,&sign));
    printf("%d ",sign);
    }

    call_ertl:
    ENTER R0,R0,#16
    CVTSD R1,#1
    LDA R2,[SP,16]
    CALX [IP,,GOT[lgamma_ertl2#]-.]
    MOV R2,R1
    LDA R1,&"%f "
    CALL printf
    LDA R2,[SP,16]
    LDA R1,&"%d "
    CALL printf
    EXIT R0,R0,#16 // 11 instructions 4 LDA

    With the same instruction count the argument is a wash on My 66000 architecture.


    Here the calls are to a differently-named function with the same
    interface such that we see what happens without inlining. The first
    thing to note is that the source code for the struct-returning
    function is longer. The calling code is slightly longer.

    I have compiled that on AMD64 with:

    gcc -fpcc-struct-return -Wall -O -c lgamma.c

    The output of "objdump -d lgamma.o" for lgamma_*1 is:

    0000000000000000 <lgamma_alsup1>:
    0: 48 89 f8 mov %rdi,%rax
    3: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # b <lgamma_alsup1+0xb>
    a: 00
    b: f2 0f 11 07 movsd %xmm0,(%rdi)
    f: c7 47 08 ff ff ff ff movl $0xffffffff,0x8(%rdi)
    16: c3 ret

    // 5 instructions

    0000000000000017 <lgamma_ertl1>:
    17: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
    1d: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # 25 <lgamma_ertl1+0xe>
    24: 00
    25: c3 ret

    3 instructions: just like both My 66000 compilations

    So with the typical simplistic struct return (aka pcc-struct-return)
    the code of the function is longer.

    The code for the call_* functions is:

    0000000000000026 <call_alsup>:
    26: 53 push %rbx
    27: 48 83 ec 10 sub $0x10,%rsp
    2b: 48 89 e7 mov %rsp,%rdi
    2e: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 36 <call_alsup+0x10>
    35: 00
    36: e8 00 00 00 00 call 3b <call_alsup+0x15>
    3b: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
    40: f2 0f 10 04 24 movsd (%rsp),%xmm0
    45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 4c <call_alsup+0x26>
    4c: b8 01 00 00 00 mov $0x1,%eax
    51: e8 00 00 00 00 call 56 <call_alsup+0x30>
    56: 89 de mov %ebx,%esi
    58: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 5f <call_alsup+0x39>
    5f: b8 00 00 00 00 mov $0x0,%eax
    64: e8 00 00 00 00 call 69 <call_alsup+0x43>
    69: 48 83 c4 10 add $0x10,%rsp
    6d: 5b pop %rbx
    6e: c3 ret

    17 instructions

    000000000000006f <call_ertl>:
    6f: 48 83 ec 18 sub $0x18,%rsp
    73: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi
    78: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 80 <call_ertl+0x11>
    7f: 00
    80: e8 00 00 00 00 call 85 <call_ertl+0x16>
    85: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 8c <call_ertl+0x1d>
    8c: b8 01 00 00 00 mov $0x1,%eax
    91: e8 00 00 00 00 call 96 <call_ertl+0x27>
    96: 8b 74 24 0c mov 0xc(%rsp),%esi
    9a: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # a1 <call_ertl+0x32>
    a1: b8 00 00 00 00 mov $0x0,%eax
    a6: e8 00 00 00 00 call ab <call_ertl+0x3c>
    ab: 48 83 c4 18 add $0x18,%rsp
    af: c3 ret

    13 instructions

    Both longer than the My 66000 versions.

    18 instructions for call_alsup() vs. 14 for call_ertl(),

    I got 1 less for both in counting AMD64 instructions. Am I missing
    something ?!? PLUS I used dynamically loaded Calling for lgamma* not static loading.

    so again the struct-return variant leads to longer code with pcc-struct-return.

    For AMD64 yes, for My 66000 no.

    But this has been my constant argument for the last 6 years:: you don't
    finish the ISA development until after the compiler has been written.
    When you find an awkward code sequence--figure out how to fix it, then
    teach the compiler to use that.

    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient.

    Someone did!

    Given the name of the calling convention variant, this was introduced
    in PCC (and probably struct returns themselves were introduced in
    PCC). PCC was released in 1979 on the machines of the day, such as
    the PDP-11; I am sure Johnson implemented a calling convention for
    struct passing and struct returns that used the least amount of code.
    If Johnson had had more space to play with, he probably would have had
    other things on the agenda before improving the struct return calling convention. E.g., the calling conventions at the time passed all
    parameters on the stack, and we still have this in the Intel calling convention for IA-32.

    Early RISC calling conventions passed several parameters in registers,
    but still used pcc-struct-returns.

    Greenhills compiler for 88K use register struct returns (1983)
    IIRC 4 registers; so that complex doubles were in registers
    both calling and returning.

    But struct returns were so rare in libraries that gcc added an option -freg-struct-return which returns
    small structs in registers, and this option used to be usable, because libraries or system calls did not use struct-returns at the time.

    Eventually, ABI specifications went for more efficient, but also more
    complex and less forgiving calling conventions, so on AMD64 without -fpcc-struct-return gammaresult is actually returned in registers,
    leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
    13 instructions for call_alsup (shorter than call_ertl).
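
    To make "small" concrete: as far as I know, the System V AMD64 ABI
    returns aggregates of up to 16 bytes in registers when each 8-byte
    half classifies as integer or SSE, and gammaresult fits that. A small
    sketch follows; lgamma_stub() is a hypothetical stand-in with the same
    toy body as lgamma_alsup1 above, not a real lgamma.

```c
#include <assert.h>

typedef struct {
    double result;  /* first 8 bytes: floating point */
    int    sign;    /* next 4 bytes, followed by padding on LP64 targets */
} gammaresult;

/* Same toy body as the example under discussion. */
gammaresult lgamma_stub(double x)
{
    gammaresult r;
    r.result = x + 1.0;
    r.sign = -1;
    return r;
}

/* Fails to compile if the struct ever grows past the 16-byte limit that
   this sketch assumes for register returns. */
static_assert(sizeof(gammaresult) <= 16,
              "gammaresult no longer fits in two 8-byte halves");
```

    Compiling this with and without -fpcc-struct-return and comparing the
    objdump output should show the difference described above.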

    My 66000 ABI provides up to 8 doublewords of register struct return
    values.

    Then the habit of decent ABI's could have continued
    when new architectures were developed.

    It seems to me that that's what happened (except that it was not a continuation): When new architectures were introduced, ABIs were
    introduced that made use of the additional memory, but also took compatibility with existing practice into account.

    E.g., MIPS (1986) got a calling convention that passes the first four
    words of parameters in integer registers and the rest on the stack.

    My 66000 first 8 DoubleWords in registers calling and returning,
    the rest on the stack.

    That's not particularly efficient for passing FP parameters, but it
    meant that calls to functions, including varargs functions like
    printf() would work without prototypes (C89 only came later) and
    varargs functions could be implemented simply by storing these four
    registers to the stack (IIRC the four slots for these parameter words
    were reserved).
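
    For readers who have not written one, here is a minimal varargs
    function in portable C. Under a convention like the one described,
    the callee can spill its four argument registers into the reserved
    slots next to the stack-passed arguments, after which va_arg only has
    to walk memory. sum_ints() is an invented example name.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` int arguments passed after `count`. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* fetch the next argument in order */
    va_end(ap);
    return total;
}
```

    A call such as sum_ints(6, 1, 2, 3, 4, 5, 6) passes more words than
    fit in four registers, so part of the list arrives in registers and
    the rest on the stack; the spill makes that split invisible to va_arg.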

    My 66000 does not have FP registers, just GPRs. (a topic for another
    day)

    As time progressed, calling conventions tried to keep stuff more in
    registers and in the right kind of registers, at the cost of a more
    complex implementation and breaking programs without prototypes.
    E.g., the AMD64 ABI specifies register struct returns for small
    structs.

    So does My 66000, except small == 1 cache line.

    Instead, many current ABI's are
    at least sub-optimal for structs

    Which ones do you have in mind?

    - anton

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jul 11 19:26:15 2025
    On Fri, 11 Jul 2025 16:58:07 +0000, Thomas Koenig wrote:

    Niklas Holsti <[email protected]d> schrieb:

    "Instead, posix_spawn() and posix_spawnp()

    For a second, I read that as posix_swamp().

    It might very well be.....

    But then again, I have been known to write about unsinged numbers.

  • From Dan Cross@21:1/5 to Scott Lurndal on Fri Jul 11 20:40:56 2025
    In article <pKbcQ.592388$[email protected]>,
    Scott Lurndal <[email protected]> wrote:
    [email protected] (Dan Cross) writes:
    In article <av8cQ.984246$[email protected]>,
    Scott Lurndal <[email protected]> wrote:

    <snip posix_spawn discussion>

    The rationale is included in the standard page.
    https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html
    Thanks, but I don't think that directly addresses why they chose
    to return error status directly in the return value, and not set
    errno as a side-effect.

    My recollection is the choice to return errno directly was made
    because we were aware of the pending 1003.4a specification (I sat
    in on a couple of those meetings as well when our regular posix rep
    wasn't available).

    That makes some sense, I suppose. Thanks.

    - Dan C.

  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jul 11 14:27:06 2025
    On 7/11/2025 9:58 AM, Thomas Koenig wrote:
    Niklas Holsti <[email protected]d> schrieb:

    "Instead, posix_spawn() and posix_spawnp()

    For a second, I read that as posix_swamp().

    But then again, I have been known to write about unsinged numbers.


    Unsinged numbers are cool :-)


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 11 21:56:17 2025
    On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

    David Brown <[email protected]> writes:
    <snip>
    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient.

    Given the name of the calling convention variant, this was introduced
    in PCC (and probably struct returns themselves were introduced in
    PCC). PCC was released in 1979 on the machines of the day, such as
    the PDP-11; I am sure Johnson implemented a calling convention for
    struct passing and struct returns that used the least amount of code.
    If Johnson had had more space to play with, he probably would have had
    other things on the agenda before improving the struct return calling convention. E.g., the calling conventions at the time passed all
    parameters on the stack, and we still have this in the Intel calling convention for IA-32.

    Given that PDP-11 had 6 general purpose useable registers, and x86
    started out with similar, it would have been quite difficult to
    pass the first few arguments in registers. PDP-11 and x86 were
    easy to push arguments onto the stack, and address in callee from
    the stack.

    The thing is:: we learned (most of the good ones of us).
    There needs to be a lot of GPRs
    we should be able to use 1/4-1/2 of them passing arguments
    and returning results
    while preserving ~1/2 of them across call/return boundaries
    One needs IP-relative addressing to data
    And one needs efficient dynamically linked subroutines and data
    And we should not allocate ANY registers to the dynamic linker.

    A few tidbits I picked up along the way::
    a) When a dynamically linked subroutine has not been linked, the
    faulting instruction access needs to contain a means to directly
    derive its GOT[index] without knowing the IP of the instruction.

    b) tabularized switch tables should use bytes or halfwords instead
    of doublewords.
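
    Point (b) is about what the compiler emits, but the space saving can
    be approximated at source level. In this sketch a byte-wide table maps
    each case to one of a few handlers, instead of storing a full pointer
    per case. The handler names and table contents are invented for
    illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { N_CASES = 16 };

static int op_nop(int x) { return x; }
static int op_inc(int x) { return x + 1; }
static int op_neg(int x) { return -x; }

typedef int (*handler)(int);

/* Only the distinct targets need pointer-sized entries. */
static const handler handlers[] = { op_nop, op_inc, op_neg };

/* One byte per case instead of one pointer-sized entry per case. */
static const uint8_t case_to_handler[N_CASES] = {
    0, 1, 1, 2, 0, 0, 1, 2, 2, 2, 0, 1, 0, 0, 2, 1
};

int dispatch(unsigned selector, int x)
{
    if (selector >= N_CASES)
        return x;                       /* default case */
    return handlers[case_to_handler[selector]](x);
}
```

    Here the per-case table costs 16 bytes rather than 16 pointers; a
    compiler-generated table of byte or halfword code offsets saves space
    in the same way without the extra indirection.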

    <snip>
    Then the habit of decent ABI's could have continued
    when new architectures were developed.

    It seems to me that that's what happened (except that it was not a continuation): When new architectures were introduced, ABIs were
    introduced that made use of the additional memory, but also took compatibility with existing practice into account.

    Another account of the architects having not been exposed to enough
    of the disease before crafting their design.

    E.g., MIPS (1986) got a calling convention that passes the first four
    words of parameters in integer registers and the rest on the stack.
    That's not particularly efficient for passing FP parameters, but it
    meant that calls to functions, including varargs functions like
    printf() would work without prototypes (C89 only came later) and
    varargs functions could be implemented simply by storing these four
    registers to the stack (IIRC the four slots for these parameter words
    were reserved).

    My 66000 compilation environment does not need varargs prototypes
    in scope to build correct calling sequences. The calling sequences
    are independent of the callers requirements.

    As time progressed, calling conventions tried to keep stuff more in
    registers and in the right kind of registers,

    This is simple when there is only 1 kind of register !!

    at the cost of a more
    complex implementation and breaking programs without prototypes.
    E.g., the AMD64 ABI specifies register struct returns for small
    structs.

    Instead, many current ABI's are at least sub-optimal for structs

    Which ones do you have in mind?

    - anton

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jul 11 22:03:16 2025
    On Fri, 11 Jul 2025 22:00:44 +0000, Stefan Monnier wrote:

    Unsinged numbers are cool :-)

    Yeah, I find that singed numbers make it harder to concentrate.

    But they are easier to eat!


    Stefan

  • From Stefan Monnier@21:1/5 to All on Fri Jul 11 18:00:44 2025
    Unsinged numbers are cool :-)

    Yeah, I find that singed numbers make it harder to concentrate.


    Stefan

  • From moi@21:1/5 to All on Fri Jul 11 23:23:18 2025
    On 11/07/2025 20:26, MitchAlsup1 wrote:
    On Fri, 11 Jul 2025 16:58:07 +0000, Thomas Koenig wrote:

    Niklas Holsti <[email protected]d> schrieb:

    "Instead, posix_spawn() and posix_spawnp()

    For a second, I read that as posix_swamp().

    It might very well be.....

    But then again, I have been known to write about unsinged numbers.

    Just so long as they are not unhinged!

    --
    Bill F.

  • From John Levine@21:1/5 to All on Fri Jul 11 23:02:17 2025
    According to MitchAlsup1 <[email protected]>:
    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient. ...

    Given that PDP-11 had 6 general purpose useable registers, and x86
    started out with similar, it would have been quite difficult to
    pass the first few arguments in registers. PDP-11 and x86 were
    easy to push arguments onto the stack, and address in callee from
    the stack.

    The C compilers at that time were not very sophisticated. They compiled
    one statement at a time, and the only way to tell them to leave values
    in registers was an explicit "register" declaration. Except in the most trivial routines, it'd usually have to stash the argument in memory to
    make room for something else, so there'd have been no benefit.

    SPARC used the PCC compiler, which still wasn't very clever, so it had
    register windows with separate groups of registers for input arguments,
    output arguments, and temporaries. The IBM 801 had the first graph
    coloring compiler so I expect it passed all sorts of stuff in registers.


    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Jul 11 23:19:35 2025
    On Fri, 11 Jul 2025 14:50:58 GMT, Anton Ertl wrote:

    E.g., the calling conventions at the time passed all parameters on the
    stack, and we still have this in the Intel calling convention for IA-32.

    No choice. What registers were there to use for passing arguments?

  • From Scott Lurndal@21:1/5 to [email protected] on Sat Jul 12 00:30:14 2025
    [email protected] (MitchAlsup1) writes:
    On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

    David Brown <[email protected]> writes:
    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution. After all, the
    typical simplistic struct return here would be roughly equivalent to :

    void lgamma(gammaresult * result, double gamma);


    Early RISC calling conventions passed several parameters in registers,
    but still used pcc-struct-returns.

    Greenhills compiler for 88K use register struct returns (1983)
    IIRC 4 registers; so that complex doubles were in registers
    both calling and returning.

    The formal definition for the 88k Unix ABI was the 88Open BCS[*] (I was
    the Unisys rep on the 88Open committee). I don't recall four register
    returns, but all my documentation from those days is boxed up. I think I
    have a copy of the 88k PCC sources around somewhere...

    [*] Binary Compatibility Standard. There was also an Object Compatibility
    Standard (OCS) to support link-time compatibility between compiler vendors
    (e.g. Unisoft, DG, Motorola, Unisys, Greenhills, Diab Data, et alia).

  • From Terje Mathisen@21:1/5 to Thomas Koenig on Sat Jul 12 09:22:08 2025
    Thomas Koenig wrote:
    Niklas Holsti <[email protected]d> schrieb:

    "Instead, posix_spawn() and posix_spawnp()

    For a second, I read that as posix_swamp().

    But then again, I have been known to write about unsinged numbers.


    So have I, multiple times.

    Still better than unhinged numbers?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to John Levine on Sat Jul 12 15:25:43 2025
    John Levine <[email protected]> writes:
    According to MitchAlsup1 <[email protected]>:
    Given that PDP-11 had 6 general purpose useable registers, and x86
    started out with similar, it would have been quite difficult to
    pass the first few arguments in registers.

    It's not any more difficult to pass, say, 4 arguments in registers if
    you have 6 registers available than it is if you have 30 registers
    available.

    PDP-11 and x86 were
    easy to push arguments onto the stack, and address in callee from
    the stack.

    I think neither PDP-11 nor IA-32 has instructions that push, say, the
    sum of two other registers, whereas at least IA-32 has an instruction
    that computes the sum of two registers and puts it in a third
    register.

    Concerning the implicit memory access: it costs more than using
    registers on all IA-32 implementations I am aware of, and I expect
    that's also true of the PDP-11.

    The C compilers at that time were not very sophisticated. They compiled
    one statement at a time, and the only way to tell them to leave values
    in registers was an explicit "register" declaration. Except in the most
    trivial routines, it'd usually have to stash the argument in memory to
    make room for something else, so there'd have been no benefit.

    Many frequently-called library routines, such as strlen() or
    memcpy()[1] can easily keep all their parameters, variables, and
    intermediate results in 6 registers or less.

    Therefore I expect that many of the frequently-called library routines
    compiled with PCC made extensive use of the register storage class.
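
    A sketch of what such routines look like with the register storage
    class, written in present-day C syntax rather than K&R style. The
    names my_strlen() and my_memcpy() are chosen here to avoid clashing
    with the library versions; this is an illustration, not the original
    Unix source.

```c
#include <assert.h>
#include <stddef.h>

/* strlen-like routine: one pointer parameter, one register variable. */
size_t my_strlen(const char *s)
{
    register const char *p = s;

    assert(s != NULL);
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* memcpy-like routine: three live values, all asked to stay in registers. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    register unsigned char *d = dst;
    register const unsigned char *q = src;
    register size_t left = n;

    while (left-- > 0)
        *d++ = *q++;
    return dst;
}
```

    Neither routine needs more than three or four values live at once,
    which is why arguments arriving in registers could have stayed there.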

    In that scenario passing the arguments in registers avoids the cost of
    pushing them in the caller and the cost of loading them from memory at
    the start of the callee.

    As for the functions that do not use the register storage class for
    parameters, pushing or storing them at the start of the callee is not
    slower than doing it right before the call, and it can lead to shorter
    code.

    Anyway, I expect that Unix already had a calling convention on PDP-11
    and several other machines, and of course PCC followed that
    convention. As for the C compiler that introduced these calling
    conventions (probably by Ritchie), my guess is that he was happy to
    produce a working C compiler that ran in the little RAM they had.

    But Intel had a clean slate when they designed the Intel calling
    convention for IA-32. When the 386 came out in 1985, Wulf et
    al. [wulf+75] was a decade old, and Chaitin's graph-coloring paper was
    4 years old, and the 386 typically had much more memory available than
    Wulf et al. MIPS introduced a calling convention that passed 4 words
    in registers shortly after, and Intel could have done so, too.
    And it seems that they paid dearly for their decision, as I find lots
    of documentation on alternative calling conventions for IA-32 and how
    to tell the compiler about them.

    @Book{wulf+75,
    author = {William Wulf and Richard K. Johnsson and Charles
    B. Weinstock and Steven O. Hobbs and Charles M. Geschke},
    title = {The Design of an Optimizing Compiler},
    publisher = {Elsevier},
    year = {1975},
    isbn = {0-444-0164-6},
    annote = {Describes a complete Bliss/11 compiler for the
    PDP-11. It uses some interesting techniques: it
    uses a (hand-constructed) tree parsing automaton for
    parts of the code selection (Section~3.4); it
    optimizes the use of unary complement operators
    (Section~3.3); it uses a smart scheme to represent
    a conservative approximation of the lifetime of
    variables in constant space and uses that for
    register allocation (Sections~4.1.3 and~4.3).}
    }

    This book cannot be praised enough, and it's celebrating its 50th
    anniversary this year.

    While this book came out before Stephen C. Johnson wrote PCC, I can
    understand why Johnson avoided going for an optimizing compiler.
    Johnson had enough on his plate with adding features to the language
    and designing for retargetability, and AFAIK he wrote PCC
    single-handedly, while Wulf et al. seem to have been 5 people. And
    given that Geschke graduated from CMU in 1972, they may have worked on
    the compiler for several years even with five people. Plus, as I just
    read, BLISS/11 was a cross-compiler from the PDP-10 to the PDP-11, so
    these optimization techniques may have needed too much memory for a
    PDP-11.

    [1] I have wondered about the selection of registers for the System V
    calling convention for the System V ABI for AMD64: the first 6
    arguments go in RDI, RSI, RDX, RCX, R8, R9. The first two are optimal
    for memcpy() implemented with REP MOVSB, but then RCX would be better
    in third position. RDI is also good for memset() with REP STOSB, RDI
    and RSI are also good for memcmp() with REP CMPSB, and I expect that
    there are other uses of REP instructions for implementing memory-block
    or string functions where the placement in RDI and RSI is
    helpful. Except that the library routines then often do not use the
    REP instructions.
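
    The register mismatch can be seen in a small sketch using GCC-style
    inline assembly: the constraints "D", "S" and "c" pin the operands to
    RDI, RSI and RCX, so dst and src are already in place on entry while
    the count has to be moved from RDX to RCX. copy_bytes() is a
    hypothetical name, and the fallback loop is there only so the example
    builds on other targets.

```c
#include <assert.h>
#include <stddef.h>

/* memcpy-like copy built on REP MOVSB where available. Assumes the
   direction flag is clear, as the ABI requires at function entry. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    void *ret = dst;

    assert(dst != NULL && src != NULL);
#if defined(__GNUC__) && defined(__x86_64__)
    /* RDI = destination, RSI = source, RCX = byte count. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback for compilers or targets without this asm. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n-- > 0)
        *d++ = *s++;
#endif
    return ret;
}
```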

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jul 12 19:13:16 2025
    On Sat, 12 Jul 2025 0:30:14 +0000, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

    David Brown <[email protected]> writes:
    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution. After all, the
    typical simplistic struct return here would be roughly equivalent to :

    void lgamma(gammaresult * result, double gamma);


    Early RISC calling conventions passed several parameters in registers,
    but still used pcc-struct-returns.

    Greenhills compiler for 88K use register struct returns (1983)
    IIRC 4 registers; so that complex doubles were in registers
    both calling and returning.

    The formal definition for the 88k Unix ABI was the 88Open BCS[*] (I was
    the Unisys rep on the 88Open committee). I don't recall four register
    returns, but all my documentation from those days is boxed up. I think
    I have a copy of the 88k PCC sources around somewhere...

    I was the Moto Architect. 4 registers (of 32-bits) were used to be able
    to return a complex double precision value (2 doubles).

    [*] Binary Compatibility Standard. There was also an Object Compatibility
    Standard (OCS) to support link-time compatibility between compiler vendors
    (e.g. Unisoft, DG, Motorola, Unisys, Greenhills, Diab Data, et alia).

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Jul 12 19:32:10 2025
    On Sat, 12 Jul 2025 15:25:43 +0000, Anton Ertl wrote:

    John Levine <[email protected]> writes:
    According to MitchAlsup1 <[email protected]>:
    Given that PDP-11 had 6 general purpose useable registers, and x86
    started out with similar, it would have been quite difficult to
    pass the first few arguments in registers.

    It's not any more difficult to pass, say, 4 arguments in registers if
    you have 6 registers available than it is if you have 30 registers
    available.

    There comes a point where it becomes harder than the compilers of that
    era could perform--for example, consider an expression to be passed
    as an argument that requires 3 registers to compute. If you only have
    6 registers and you want to pass 4 in registers, you might have to
    calculate several arguments, push them on the stack, then calculate
    the last one (3 registers) into the right register, then pop the others
    off the stack in order to perform the call.

    At a certain point, it's easier not to do this.

    PDP-11 and x86 were
    easy to push arguments onto the stack, and address in callee from
    the stack.

    I think neither PDP-11 nor IA-32 has instructions that push, say, the
    sum of two other registers, whereas at least IA-32 has an instruction
    that computes the sum of two registers and puts it in a third
    register.

    Neither was a non-destructive register model (a = b + c) both were
    a destruction model (a = a + b)

    Concerning the implicit memory access: it costs more than using
    registers on all IA-32 implementations I am aware of, and I expect
    that's also true of the PDP-11.

    Time: yes, instruction space: somewhat--but you had (r5) and (r5)+
    and @(r5)+ and -(r5) and @-(r5) which cost no space but did cost time.

    The C compilers at that time were not very sophisticated. They compiled
    one statement at a time, and the only way to tell them to leave values
    in registers was an explicit "register" declaration. Except in the most
    trivial routines, it'd usually have to stash the argument in memory to
    make room for something else, so there'd have been no benefit.

    My point from above.

    Many frequently-called library routines, such as strlen() or
    memcpy()[1] can easily keep all their parameters, variables, and
    intermediate results in 6 registers or less.

    IIRC only SP and IP were preserved across a call/return

    Therefore I expect that many of the frequently-called library routines compiled with PCC made extensive use of the register storage class.

    A necessary evil. The first thing a modern C compiler does is to remove
    "register" sub-types from variables.

    In that scenario passing the arguments in registers avoids the cost of pushing them in the caller and the cost of loading them from memory at
    the start of the callee.

    As for the functions that do not use the register storage class for parameters, pushing or storing them at the start of the callee is not
    slower than doing it right before the call, and it can lead to shorter
    code.

    Less total code but equal number of instructions executed.
    When saved at entry, everyone who calls this subroutine shares
    the memory reference instructions.

    Anyway, I expect that Unix already had a calling convention on PDP-11
    and several other machines, and of course PCC followed that
    convention. As for the C compiler that introduced these calling
    conventions (probably by Ritchie), my guess is that he was happy to
    produce a working C compiler that ran in the little RAM they had.

    The Denelcor C compiler I built had big trouble fitting in the PDP-11
    memory. I had to remove all the superfluous "I wrote this" strings
    at the start of the ASM modules to get it to fit.

    But Intel had a clean slate when they designed the Intel calling
    convention for IA-32. When the 386 came out in 1985, Wulf et
    al. [wulf+75] was a decade old, and Chaitin's graph-coloring paper was
    4 years old, and the 386 typically had much more memory available than
    Wulf et al. MIPS introduced a calling convention that passed 4 words
    in registers shortly after, and Intel could have done so, too.
    And it seems that they paid dearly for their decision, as I find lots
    of documentation on alternative calling conventions for IA-32 and how
    to tell the compiler about them.

    I agree they paid dearly. The marketplace does not.

    @Book{wulf+75,
    author = {William Wulf and Richard K. Johnsson and Charles
    B. Weinstock and Steven O. Hobbs and Charles M.
    Geschke},
    title = {The Design of an Optimizing Compiler},
    publisher = {Elsevier},
    year = {1975},
    isbn = {0-444-0164-6},
    annote = {Describes a complete Bliss/11 compiler for the
    PDP-11. It uses some interesting techniques: it
    uses a (hand-constructed) tree parsing automaton for
    parts of the code selection (Section~3.4); it
    optimizes the use of unary complement operators
    (Section~3.3); it uses a smart scheme to represent
    a conservative approximation of the lifetime of
    variables in constant space and uses that for
    register allocation (Sections~4.1.3 and~4.3).}
    }

    This book cannot be praised enough, and it's celebrating its 50th
    anniversary this year.

    I have an original.

    <snip>

    - anton

  • From John Levine@21:1/5 to All on Sun Jul 13 00:06:42 2025
    According to Anton Ertl <[email protected]>:
    Anyway, I expect that Unix already had a calling convention on PDP-11
    and several other machines, and of course PCC followed that
    convention. As for the C compiler that introduced these calling
    conventions (probably by Ritchie), my guess is that he was happy to
    produce a working C compiler that ran in the little RAM they had.

    It was two passes each about 24K bytes and a third optional optimizer
    that slightly rewrote the assembler code.

    The Ritchie compiler and I think PCC reserved up to three registers
    for declared register variables, and used the rest as a stack for
    temporaries. It used Sethi-Ullman numbering to do the more complex
    subexpressions first to avoid running out of registers. If it did
    run out of registers I think it just gave up, but I don't ever
    remember that happening.
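
    The Sethi-Ullman numbering mentioned above can be sketched in a few
    lines of C. This is a minimal illustration, not the code of the Ritchie
    compiler or PCC: the node layout and the names su_label and
    eval_right_first are hypothetical, and it uses the simplified variant in
    which every leaf needs one register.

```c
#include <stddef.h>

/* Hypothetical expression-tree node; a leaf has both children NULL,
   a unary operator has only a left child. */
struct node {
    struct node *left;
    struct node *right;
};

/* Sethi-Ullman number: how many registers are needed to evaluate the
   subtree without spilling, assuming each leaf is loaded into a register. */
int su_label(const struct node *n)
{
    if (n->left == NULL && n->right == NULL)
        return 1;                      /* leaf */
    if (n->right == NULL)
        return su_label(n->left);      /* unary operator */

    int l = su_label(n->left);
    int r = su_label(n->right);
    if (l == r)
        return l + 1;                  /* both sides equally demanding */
    return l > r ? l : r;              /* do the bigger side first */
}

/* For a binary node: nonzero if the right operand is the more
   register-hungry subexpression and should be evaluated first. */
int eval_right_first(const struct node *n)
{
    return su_label(n->right) > su_label(n->left);
}
```

    For (a+b)*(c+d) this gives 3, so three temporaries suffice; for
    a*(c+d) it gives 2, provided the right operand is evaluated first.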

    Reserving more registers would have been really hard.

    I agree that on the 386 it would probably have been practical to pass
    arguments in registers, but I suspect that for whatever reason they
    wanted to make the calling sequence similar to the 8086 and 286.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to John Levine on Sun Jul 13 02:19:47 2025
    On Sun, 13 Jul 2025 0:06:42 +0000, John Levine wrote:

    According to Anton Ertl <[email protected]>:
    Anyway, I expect that Unix already had a calling convention on PDP-11
    and several other machines, and of course PCC followed that
    convention. As for the C compiler that introduced these calling
    conventions (probably by Ritchie), my guess is that he was happy to
    produce a working C compiler that ran in the little RAM they had.

    It was two passes each about 24K bytes and a third optional optimizer
    that slightly rewrote the assembler code.

    The Ritchie compiler and I think PCC reserved up to three registers
    for declared register variables, and used the rest as a stack for
    temporaries. It used Sethi-Ullman numbering to do the more complex
    subexpressions first to avoid running out of registers. If it did
    run out of registers I think it just gave up, but I don't ever
    remember that happening.

    Reserving more registers would have been really hard.

    I agree that on the 386 it would probably have been practical to pass arguments in registers, but I suspect that for whatever reason they
    wanted to make the calling sequence similar to the 8086 and 286.

    Register arguments and results were not common until after the MIPS
    R2000, although I did use register arguments and results in the
    Denelcor HEP C compiler (which used the same code generator as HEP
    Fortran).

  • From David Brown@21:1/5 to Anton Ertl on Sun Jul 13 14:25:11 2025
    On 11/07/2025 16:50, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    Struct returns, even on poorer ABI's (and there are /many/ ABI's that
    are bad for struct handling), are unlikely to be noticeably less
    efficient than using a pointer-to-return-value solution. After all, the
    typical simplistic struct return here would be roughly equivalent to :

    void lgamma(gammaresult * result, double gamma);

    Let's see:

    #include <stdio.h>

    typedef struct { double result;
                     int sign; } gammaresult;


    gammaresult lgamma_alsup1( double x )
    {
      gammaresult r;
      r.result = x+1.;
      r.sign = -1;
      return r;
    }

    double lgamma_ertl1(double x, int *signgam)
    {
      *signgam = -1;
      return x+1.;
    }

    extern gammaresult lgamma_alsup2( double x );

    void call_alsup()
    {
      gammaresult r=lgamma_alsup2(1.);
      printf("%f ",r.result);
      printf("%d ",r.sign);
    }

    extern double lgamma_ertl2(double x, int *signgam);

    void call_ertl()
    {
      int sign;
      printf("%f ",lgamma_ertl2(1.,&sign));
      printf("%d ",sign);
    }

    Here the calls are to a differently-named function with the same
    interface such that we see what happens without inlining. The first
    thing to note is that the source code for the struct-returning
    function is longer. The calling code is slightly longer.

    I have compiled that on AMD64 with:

    gcc -fpcc-struct-return -Wall -O -c lgamma.c

    The output of "objdump -d lgamma.o" for lgamma_*1 is:

    0000000000000000 <lgamma_alsup1>:
    0: 48 89 f8 mov %rdi,%rax
    3: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # b <lgamma_alsup1+0xb>
    a: 00
    b: f2 0f 11 07 movsd %xmm0,(%rdi)
    f: c7 47 08 ff ff ff ff movl $0xffffffff,0x8(%rdi)
    16: c3 ret

    0000000000000017 <lgamma_ertl1>:
    17: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
    1d: f2 0f 58 05 00 00 00 addsd 0x0(%rip),%xmm0 # 25 <lgamma_ertl1+0xe>
    24: 00
    25: c3 ret

    So with the typical simplistic struct return (aka pcc-struct-return)
    the code of the function is longer.

    The code for the call_* functions is:

    0000000000000026 <call_alsup>:
    26: 53 push %rbx
    27: 48 83 ec 10 sub $0x10,%rsp
    2b: 48 89 e7 mov %rsp,%rdi
    2e: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 36 <call_alsup+0x10>
    35: 00
    36: e8 00 00 00 00 call 3b <call_alsup+0x15>
    3b: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
    40: f2 0f 10 04 24 movsd (%rsp),%xmm0
    45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 4c <call_alsup+0x26>
    4c: b8 01 00 00 00 mov $0x1,%eax
    51: e8 00 00 00 00 call 56 <call_alsup+0x30>
    56: 89 de mov %ebx,%esi
    58: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 5f <call_alsup+0x39>
    5f: b8 00 00 00 00 mov $0x0,%eax
    64: e8 00 00 00 00 call 69 <call_alsup+0x43>
    69: 48 83 c4 10 add $0x10,%rsp
    6d: 5b pop %rbx
    6e: c3 ret

    000000000000006f <call_ertl>:
    6f: 48 83 ec 18 sub $0x18,%rsp
    73: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi
    78: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 80 <call_ertl+0x11>
    7f: 00
    80: e8 00 00 00 00 call 85 <call_ertl+0x16>
    85: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 8c <call_ertl+0x1d>
    8c: b8 01 00 00 00 mov $0x1,%eax
    91: e8 00 00 00 00 call 96 <call_ertl+0x27>
    96: 8b 74 24 0c mov 0xc(%rsp),%esi
    9a: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # a1 <call_ertl+0x32>
    a1: b8 00 00 00 00 mov $0x0,%eax
    a6: e8 00 00 00 00 call ab <call_ertl+0x3c>
    ab: 48 83 c4 18 add $0x18,%rsp
    af: c3 ret

    18 instructions for call_alsup() vs. 14 for call_ertl(), so again the struct-return variant leads to longer code with pcc-struct-return.


    <https://godbolt.org/z/j9jMT5ave>

    (I find godbolt clearer for looking at these things, and I prefer to
    avoid using printf - it can easily complicate the code.)

    The key metrics are not, I think, instruction counts - but memory
    accesses and how likely they are to cause delays. (I know you have much
    more experience than I do about the relative timings of assembly code
    sequences, especially on "big" processors. My work is mainly with
    simpler processors - generally scalar, and for important code it
    is all on-chip static ram.)

    As you show, having a pointer to "int * signgam" means that there will
    be only one extra write to memory (in the callee) and one extra read (in
    the caller) - while for a "pcc-struct-return" API you have two. However,
    those will be adjacent and probably combined.

    In theory, even if a struct return needs to pass a hidden pointer, the
    compiler knows more about it than for a general "int *" pointer
    parameter. It knows that there are no aliasing issues or "escapes" -
    when you have a local variable whose address is passed on to
    "lgamma_ertl", the compiler has to assume that the function might store
    the address and later functions might use it to change the value of the
    local variable "sign". With the hidden struct pointer, the compiler
    knows that access via the pointer is much more restricted.

    (With C23, a function like "lgamma_ertl" would be marked
    [[unsequenced]], or at least [[reproducible]], which would let the
    compiler make similar assumptions for optimisation.)

    However, the best code (for caller and callee) is when there is a good
    ABI for structure returns, and they are returned in registers.
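
    To make the hidden-pointer form concrete, here is a minimal sketch of
    the two shapes being compared. It reuses the gammaresult type and the
    placeholder arithmetic (x+1.) from the test program earlier in the
    thread; the names lgamma_struct and lgamma_hidden are hypothetical, and
    the second function only approximates what a pcc-struct-return style
    ABI does behind the scenes.

```c
typedef struct {
    double result;
    int sign;
} gammaresult;

/* Struct-return form, as the programmer writes it. */
gammaresult lgamma_struct(double x)
{
    gammaresult r;
    r.result = x + 1.0;   /* placeholder arithmetic, not a real lgamma */
    r.sign = -1;
    return r;
}

/* Roughly what a hidden-pointer ABI turns it into: the caller
   allocates the result object and passes its address as an extra
   argument; the callee writes both fields through that pointer. */
void lgamma_hidden(gammaresult *out, double x)
{
    out->result = x + 1.0;
    out->sign = -1;
}
```

    With a register-returning ABI the first form needs no memory traffic
    for the result at all, which is the case described as best above.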


    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABI's to make
    them efficient.

    Given the name of the calling convention variant, this was introduced
    in PCC (and probably struct returns themselves were introduced in
    PCC). PCC was released in 1979 on the machines of the day, such as
    the PDP-11; I am sure Johnson implemented a calling convention for
    struct passing and struct returns that used the least amount of code.
    If Johnson had had more space to play with, he probably would have had
    other things on the agenda before improving the struct return calling convention. E.g., the calling conventions at the time passed all
    parameters on the stack, and we still have this in the Intel calling convention for IA-32.

    Early RISC calling conventions passed several parameters in registers,
    but still used pcc-struct-returns. But struct returns were so rare in libraries that gcc added an option -freg-struct-return which returns
    small structs in registers, and this option used to be usable, because libraries or system calls did not use struct-returns at the time.


    Would struct returns have been used more if they were not so
    inefficient? (There are standard library functions like "div", "ldiv",
    and "lldiv" that return structs.)
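
    As a small illustration of a struct-returning standard function, the
    following sketch uses div() from <stdlib.h>, which returns quotient and
    remainder together in a div_t. The wrapper name split_seconds is
    hypothetical and only there to show a struct passed straight back to
    the caller.

```c
#include <stdlib.h>

/* Hypothetical helper: split a number of seconds into minutes
   (quot) and leftover seconds (rem) with a single call. */
div_t split_seconds(int total_seconds)
{
    return div(total_seconds, 60);
}
```

    A caller would write "div_t t = split_seconds(125);" and then use
    t.quot and t.rem; whether t travels in registers or through a hidden
    pointer is decided by the ABI, not by the source code.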

    Eventually, ABI specifications went for more efficient, but also more
    complex and less forgiving calling conventions, so on AMD64 without -fpcc-struct-return gammaresult is actually returned in registers,
    leading to 3 instructions for lgamma_alsup1 (same as lgamma_ertl1) and
    13 instructions for call_alsup (shorter than call_ertl).

    Then the habit of decent ABI's could have continued
    when new architectures were developed.

    It seems to me that that's what happened (except that it was not a continuation): When new architectures were introduced, ABIs were
    introduced that made use of the additional memory, but also took compatibility with existing practice into account.


    That sounds reasonable.

    E.g., MIPS (1986) got a calling convention that passes the first four
    words of parameters in integer registers and the rest on the stack.
    That's not particularly efficient for passing FP parameters, but it
    meant that calls to functions, including varargs functions like
    printf() would work without prototypes (C89 only came later) and
    varargs functions could be implemented simply by storing these four
    registers to the stack (IIRC the four slots for these parameter words
    were reserved).


    Varargs functions are a real PITA for register-based ABI's! They are
    fine for stack-based parameter ABI's, but not ABI's that are more
    efficient on modern devices and modern code.

    As time progressed, calling conventions tried to keep stuff more in
    registers and in the right kind of registers, at the cost of a more
    complex implementation and breaking programs without prototypes.
    E.g., the AMD64 ABI specifies register struct returns for small
    structs.

    Instead, many current ABI's are
    at least sub-optimal for structs

    Which ones do you have in mind?


    The architecture that is most relevant for my daily work, and where
    efficiency matters to me, is 32-bit ARM for embedded systems. It's fine
    for calling functions with a few simple parameters and returning a
    single scalar. But beyond that, it is often suboptimal - and with
    modern C++ coding, you are often doing something beyond that.

    ARM32 ABI can pass arguments in r0 to r3. (I'm ignoring floating point
    for simplification.) r4 to r11 must be preserved by the caller. Why
    then can they not also be used for passing arguments? I am no supporter
    of having lots of parameters in a single function, but a function could
    take a small number of larger parameters (64-bit integers, or structs of various kinds).

    Normally only r0 is used for return values, but r0:r1 can be used for a fundamental type that is 64-bit (a long long int, for example). A
    struct is only returned in a register if it fits in r0 - all other
    structs are handled by passing a pointer to a stack block. That means
    that you cannot, for example, make a C++ wrapper class around a uint64_t without suffering significant inefficiencies. Given that functions are
    already allowed to change r0 to r3 without preserving them, it would
    make sense to use all of r0 to r3 for return values.

    C++ tag types - types with no values, used only in parameters to choose particular overloads for a function - are treated like "unsigned char"
    by the ABI and thus cost a parameter register or force passing via stack parameters, when they could easily be omitted entirely.

    I realise 32-bit ARM was around before much of this was relevant (I
    first played with ARM assembly in 1988 as a schoolkid). But it is
    surely possible to modernise things a little?

    It is particularly galling for developers in small-systems embedded programming, where sometimes every cycle counts - and where we have
    virtually no concern for backwards compatibility or interaction with
    existing binary code, because we can happily re-compile everything on
    the target.

  • From Anton Ertl@21:1/5 to John Levine on Sun Jul 13 14:22:55 2025
    John Levine <[email protected]> writes:
    I agree that on the 386 it would probably have been practical to pass
    arguments in registers, but I suspect that for whatever reason they
    wanted to make the calling sequence similar to the 8086 and 286.

    Ease of adapting 16-bit compilers and library routines might have been
    reasons.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to [email protected] on Sun Jul 13 13:07:10 2025
    [email protected] (MitchAlsup1) writes:
    On Sat, 12 Jul 2025 15:25:43 +0000, Anton Ertl wrote:

    John Levine <[email protected]> writes:
    According to MitchAlsup1 <[email protected]>:
    Given that PDP-11 had 6 general purpose useable registers, and x86
    started out with similar, it would have been quite difficult to
    pass the first few arguments in registers.

    It's not any more difficult to pass, say, 4 arguments in registers if
    you have 6 registers available than it is if you have 30 registers
    available.

    There comes a point where it becomes harder than the compilers of that
    era could perform--for example, consider an expression to be passed
    as an argument that requires 3 registers to compute.

    In the worst case you are computing the fourth argument: three
    registers are occupied with arguments, and three registers are
    available for the computation.

    Things become more challenging if you have an expression that "needs"
    4 or more registers. A way to deal with that is to push the deepest
    entry in your register stack on the memory stack when you run out of
    registers. Then you can use the register where that intermediate
    result resided. When the intermediate result is needed, pop it from
    the memory stack.
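
    The spilling scheme just described can be modelled in C. This is a
    sketch under stated assumptions, not PCC's code: the struct regstack,
    the functions rs_push and rs_pop, and the limit of three temporaries
    are all hypothetical, and the "registers" are just array slots. A real
    code generator would rotate register numbers rather than shift values.

```c
#define NREGS    3     /* registers available for temporaries (assumed) */
#define MAXSPILL 64    /* size of the modelled memory stack; no overflow check */

struct regstack {
    int reg[NREGS];    /* reg[0] is the deepest entry held in a register */
    int nreg;          /* number of entries currently in registers */
    int mem[MAXSPILL]; /* spilled entries; mem[nmem-1] is the most recent */
    int nmem;
    int spills;        /* how many times an entry had to be spilled */
};

/* Push a new intermediate result. If all registers are occupied,
   the deepest register entry is moved to the memory stack first. */
void rs_push(struct regstack *s, int v)
{
    if (s->nreg == NREGS) {
        s->mem[s->nmem++] = s->reg[0];
        for (int i = 1; i < NREGS; i++)
            s->reg[i - 1] = s->reg[i];
        s->nreg--;
        s->spills++;
    }
    s->reg[s->nreg++] = v;
}

/* Pop the most recent intermediate result. If it was spilled,
   it is reloaded from the memory stack. The caller must not pop
   more entries than were pushed. */
int rs_pop(struct regstack *s)
{
    if (s->nreg == 0)
        s->reg[s->nreg++] = s->mem[--s->nmem];
    return s->reg[--s->nreg];
}
```

    Pushing five values onto a zero-initialised regstack causes two
    spills, and popping still returns the values in reverse order.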

    The combination of register variables and parameter passing in
    registers is also interesting. Let's assume we use the same registers
    for register variables and parameter passing (useful if the parameters
    are register variables, and it also means that we do not have to deal
    with all registers being occupied by ). Just before a parameter is
    computed, store the variable in its register into memory, and any
    later accesses to the variable access that memory location. Just
    before the call, write the remaining register variables to memory (caller-saved). After the call, load all the register variables from
    memory to their register again.

    There are, of course, ways to improve on this, but my point is that it
    is feasible to pass 4 parameters in registers and use 3 register
    variables on the PDP-11. It makes the compiler a little longer and a
    lot harder to test.

    If you only have
    6 registers and you want to pass 4 in registers, you might have to
    calculate several arguments, push them on the stack, then calculate
    the last one (3 registers) into the right register, then pop the others
    off the stack in order to perform the call.

    For the numbers you have mentioned, that's not necessary, but in
    general, that's another viable approach. Some of my students
    implement parameter passing (for AMD64, i.e., with passing in
    registers) by pushing each argument as it is computed and pulling them
    all from the stack into the appropriate registers right before the
    actual call. That may be less than optimal, but getting the
    assignment done in time and correctly is more important.
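
    A minimal sketch of that push-then-pop approach, written as a tiny
    code emitter in C. The function name emit_call and the placeholder
    comment lines it emits are hypothetical; the register order is the
    integer argument order of the System V AMD64 calling convention. Stack
    alignment and more than six arguments are deliberately ignored.

```c
#include <stdio.h>

/* Integer argument registers of the System V AMD64 calling convention. */
static const char *const arg_reg[6] = {
    "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"
};

/* Writes a call sequence for nargs (0..6) arguments into out.
   Each argument is computed into %rax and pushed; right before the
   call, the arguments are popped into their registers in reverse
   order. Returns the number of characters written, or -1 if nargs
   is out of range or the buffer is too small. */
int emit_call(char *out, size_t size, const char *callee, int nargs)
{
    size_t used = 0;
    int n;

    if (nargs < 0 || nargs > 6)
        return -1;

    for (int i = 0; i < nargs; i++) {
        n = snprintf(out + used, size - used,
                     "  # compute argument %d into %%rax\n  push %%rax\n", i);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    for (int i = nargs - 1; i >= 0; i--) {
        n = snprintf(out + used, size - used, "  pop %s\n", arg_reg[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    n = snprintf(out + used, size - used, "  call %s\n", callee);
    if (n < 0 || (size_t)n >= size - used)
        return -1;
    used += (size_t)n;
    return (int)used;
}
```

    The attraction is that computing one argument can never clobber an
    already-computed one, at the cost of a push and a pop per argument.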

    I think neither PDP-11 nor IA-32 has instructions that push, say, the
    sum of two other registers, whereas at least IA-32 has an instruction
    that computes the sum of two registers and puts it in a third
    register.

    Neither was a non-destructive register model (a = b + c) both were
    a destruction model (a = a + b)

    IA-32 has

    lea eax, [ebx+ecx]

    which computes the sum of ebx and ecx and stores the result into eax. Admittedly, this only works for addition.

    But it's also the case that only a limited number of operations are
    supported for memory operands. E.g., consider

    int r,i,a[];
    r += a[i];

    On IA-32 that's one instruction if r, a, and i are in registers:

    add ecx, [edx+esi*4] # ecx=r, edx=a, esi=i

    If they are all in memory, it's four instructions:

    mov eax, [esp+a]
    mov ebx, [esp+i]
    mov eax, [eax+ebx*4]
    add [esp+r], eax

    I leave the PDP-11 variant to more knowledgeable people.

    Concerning the implicit memory access: it costs more than using
    registers on all IA-32 implementations I am aware of, and I expect
    that's also true of the PDP-11.

    Time: yes, instruction space: somewhat--but you had (r5) and (r5)+
    and @(r5)+ and -(r5) and @-(r5) which cost no space but did cost time.

    While I have read papers about automatically arranging variables such
    that this kind of technique can be used for accessing variables in
    memory, that's a complicated technique, like register allocation, but
    with less reward. Of course, in the spirit of explicit register
    declarations, one can also leave it to the programmer to produce a
    good order, and let the compiler just use autoincrement/decrement for
    accessing the variable when the opportunity occurs. This still
    requires more global analysis than PCC had AFAIK, so I doubt that PCC
    used this technique.

    I expect that PCC used the Indexed addressing mode (EA=SP+const) for
    accessing non-register variables on the PDP-11, and in that case
    non-register variables are also more expensive in code size. There is
    a reason why these compilers supported a register storage class.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to David Brown on Sun Jul 13 14:24:53 2025
    David Brown <[email protected]> writes:
    The key metrics are not, I think, instruction counts - but memory
    accesses and how likely they are to cause delays.

    And one might also wonder what hardware one should look at. AMD64
    does not use pcc-struct-returns by default, so finding out in how many
    cases 0-cycle store-to-load forwarding (implemented in recent cores)
    eliminates the delays does not tell us the performance characteristics
    on hardware that mostly executed IA-32 code where pcc-struct-returns
    are the default.

    As you show, having a pointer to "int * signgam" means that there will
    be only one extra write to memory (in the callee) and one extra read (in
    the caller) - while for a "pcc-struct-return" API you have two. However,
    those will be adjacent and probably combined.

    The stores go separately to the store units (and consume the resources
    there), and the stores are to write-back cache, not write-combining
    memory. The loads go separately to the load units and consume the
    resources there; no combining happens. The data will be in the
    D-cache in the usual case, and on recent hardware there could even be
    0-cycle store-to-load-forwarding.

    If you are thinking about autovectorization by the compiler, yes, that
    could happen, but IMO it costs more than it buys. I have also seen
    gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
    part of Hennessy's small integer benchmarks (from the 1980s) by
    auto-vectorizing the adjacent accesses of bubble-sort. Not only does
    the code execute significantly more instructions, it also hits a slow
    hardware path in store-to-load-forwarding on every store it performs
    in this way.

    But even without this slow path, my expectation is that the
    auto-vectorization overhead would slow the benchmark down compared to
    the -O1 version (which is just scalar code), but how could I measure
    this?

    The slow path should not occur in the struct-return case, though.

    Another combining idea is the use of ARM A64's store pair and load
    pair instructions, which result in only one memory access for each
    such instruction and result in fewer instructions than doing unpaired
    loads and stores, while the code resulting from auto-vectorization on
    AMD64 is longer than two scalar stores and two scalar loads.

    Unfortunately, store-pair and load-pair do not support storing or
    loading an FP and an integer value AFAIK.

    In theory, even if a struct return needs to pass a hidden pointer, the
    compiler knows more about it than for a general "int *" pointer
    parameter. It knows that there are no aliasing issues or "escapes" -
    when you have a local variable whose address is passed on to
    "lgamma_ertl", the compiler has to assume that the function might store
    the address and later functions might use it to change the value of the
    local variable "sign". With the hidden struct pointer, the compiler
    knows that access via the pointer is much more restricted.

    (With C23, a function like "lgamma_ertl" would be marked
    [[unsequenced]], or at least [[reproducible]], which would let the
    compiler make similar assumptions for optimisation.)

    You mean that the programmer could mark the function in that way?

    Wouldn't some use of "restrict" give the compiler similar information?
    I just don't know where in the code to apply "restrict". Maybe

    double lgamma_ertl2(double x, int *restrict signgam);

    ?
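
    For reference, a sketch of that declaration with a definition and a
    caller. The names lgamma_restrict_sketch and call_sketch are
    hypothetical and the body is placeholder arithmetic, not a real
    lgamma. One caveat, stated as an assumption rather than a ruling: as
    the C standard defines it, "restrict" on a parameter constrains
    accesses made during the execution of the callee, so it is not obvious
    that it promises the caller that the pointer is not retained.

```c
/* Hypothetical stand-in for the lgamma_ertl2 interface, with the
   out-parameter declared restrict. */
double lgamma_restrict_sketch(double x, int *restrict signgam)
{
    *signgam = (x < 0.0) ? -1 : 1;   /* placeholder, not the sign of gamma */
    return x + 1.0;                  /* placeholder arithmetic */
}

/* A caller that takes the address of a local variable, which is the
   situation where the compiler has to reason about escapes. */
double call_sketch(double x, int *sign_out)
{
    int sign;
    double r = lgamma_restrict_sketch(x, &sign);
    *sign_out = sign;
    return r;
}
```

    Whether a given compiler actually keeps "sign" in a register across
    later calls because of this qualifier would have to be checked in the
    generated code.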

    Would struct returns have been used more if they were not so
    inefficient?

    Possibly. I certainly remember wanting to use them for something Gforth-internal, and then deciding against them after seeing the
    generated code.

    E.g., MIPS (1986) got a calling convention that passes the first four
    words of parameters in integer registers and the rest on the stack.
    That's not particularly efficient for passing FP parameters, but it
    meant that calls to functions, including varargs functions like
    printf() would work without prototypes (C89 only came later) and
    varargs functions could be implemented simply by storing these four
    registers to the stack (IIRC the four slots for these parameter words
    were reserved).

    I think it's more complicated: If the first parameter is an integer
    one, then do everything in integer registers, otherwise pass FP stuff
    in FP registers. Probably the idea is that varargs functions always
    start with an integer parameter.

    Later I saw a calling convention (IIRC Alpha) where parameter n was
    passed in integer register n if it was integer and FP register n if it
    was an FP value. The respective other register went unused.

    Recently I have seen a calling convention (IIRC RISC-V) where the used
    integer registers are allocated one after the other whether there were
    FP parameters interleaved or not, and the same on the FP side. I
    don't remember what happens if the call runs out of one kind of
    register, and the other kind is still available.
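
    The two allocation schemes recalled above can be contrasted with a
    small model in C. It follows only the descriptions in this post, not
    the actual Alpha or RISC-V ABI documents; the types and the functions
    assign_positional and assign_sequential are hypothetical, and an index
    of -1 just means "no register of that kind left", leaving open what
    the real ABI does then.

```c
enum kind { INT_ARG, FP_ARG };

struct slot {
    enum kind bank;   /* register file the parameter belongs to */
    int index;        /* register number within that file, or -1 */
};

/* Positional scheme: parameter n uses register n of its own bank;
   register n of the other bank stays unused. */
void assign_positional(const enum kind *args, int n, int nregs,
                       struct slot *out)
{
    for (int i = 0; i < n; i++) {
        out[i].bank = args[i];
        out[i].index = (i < nregs) ? i : -1;
    }
}

/* Sequential scheme: integer and FP registers are handed out by
   two independent counters, regardless of interleaving. */
void assign_sequential(const enum kind *args, int n, int nregs,
                       struct slot *out)
{
    int next_int = 0, next_fp = 0;

    for (int i = 0; i < n; i++) {
        int *next = (args[i] == INT_ARG) ? &next_int : &next_fp;
        out[i].bank = args[i];
        out[i].index = (*next < nregs) ? (*next)++ : -1;
    }
}
```

    For the parameter list (int, double, int, double) the positional
    scheme uses registers 0, 1, 2, 3 and the sequential scheme uses
    0, 0, 1, 1, so the latter fits more parameters into the same number
    of registers per bank.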

    Instead, many current ABI's are
    at least sub-optimal for structs

    Which ones do you have in mind?


    The architecture that is most relevant for my daily work, and where
    efficiency matters to me, is 32-bit ARM for embedded systems.

    ARM A32 (and T32 uses the same calling conventions) is from around the
    same time as MIPS, so similar calling conventions are to be expected.
    However, I see various ABIs mentioned in the descriptions of various
    things (eABI, oABI, etc.). So apparently they did several.

    I realise 32-bit ARM was around before much of this was relevant (I
    first played with ARM assembly in 1988 as a schoolkid). But it is
    surely possible to modernise things a little?

    Breaking compatibility has an immediate cost and (hopefully) a
    long-term return. It's a really hard sell. But apparently ARM with
    their several ABIs has gone there. Too little?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Jul 13 17:00:44 2025
    Anton Ertl <[email protected]> schrieb:
    I have also seen
    gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
    part of Hennessy's small integer benchmarks (from the 1980s)

    I would like to quote Press, Teukolsky, Vetterling and Flannery,
    from "Numerical Recipes":

    "If you know what bubble sort is, wipe it from your mind; if you
    don't know, make a point of never finding out!"

  • From John Levine@21:1/5 to It appears that Anton Ertl on Sun Jul 13 21:22:02 2025
    It appears that Anton Ertl <[email protected]> said:
    John Levine <[email protected]> writes:
    According to MitchAlsup1 <[email protected]>:
    Given that PDP-11 had 6 general purpose useable registers, and x86
    started out with similar, it would have been quite difficult to
    pass the first few arguments in registers.

    It's not any more difficult to pass, say, 4 arguments in registers if
    you have 6 registers available than it is if you have 30 registers
    available.

    Those compilers were so space constrained that they compiled a statement
    at a time, keeping only a stack of open loops so they knew where to jump
    back to. For a procedure call it evaluated each argument expression and
    pushed it. Trying to figure out which registers might be available for
    what was way beyond what it could do.

    This could produce fairly tangled code since the code that naturally came
    at the end of a for(;;) loop was generated at the top. The separate
    optimization pass did read in the generated assembler a routine at a
    time, and somewhat untangled the code. It removed jumps to jumps, and
    moved a block of code reached by an unconditional jump to where the jump
    was. I don't recall it doing anything with registers.

    The BLISS-11 compiler might have done more clever register allocation
    but it ran on a PDP-10 which could address the equivalent of a
    megabyte, not the 11's 64K.

    PDP-11 and x86 were
    easy to push arguments onto the stack, and address in callee from
    the stack.

    I think neither PDP-11 nor IA-32 has instructions that push, say, the
    sum of two other registers, whereas at least IA-32 has an instruction
    that computes the sum of two registers and puts it in a third
    register.

    PDP-11 instructions were all one or two operand, with all operands being fully general. To push the sum of two registers on the stack without clobbering the registers you could do this:

    MOV R1,-(SP) ; 2 mem cycles
    ADD R2,(SP) ; 2 mem cycles

    since the -11 ran mostly at the speed of its memory this would
    be no faster and the code was longer:

    MOV R1,R0 ; 1 mem cycle
    ADD R2,R0 ; 1 mem cycle
    MOV R0,-(SP) ; 2 mem cycles
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Sun Jul 13 22:20:27 2025
    On Sun, 13 Jul 2025 17:00:44 -0000 (UTC), Thomas Koenig wrote:

    I would like to quote Press, Teukolsky, Vetterling and Flannery,
    from "Numerical Recipes":

    "If you know what bubble sort is, wipe it from your mind; if you don't
    know, make a point of never finding out!"

    But Shellsort is basically “bubble sort done right”. And that is, or was, certainly worth using: a decent sort algorithm that didn’t require a lot
    of code to implement.

  • From Anton Ertl@21:1/5 to John Levine on Mon Jul 14 06:22:29 2025
    John Levine <[email protected]> writes:
    Those compilers were so space constrained that they compiled a statement at a
    time, keeping only a stack of open loops so they knew where to jump back to. For
    a procedure call it evaluated each argument expression and pushed it. Trying to
    figure out which register might be available for what was way beyond what it
    could do.

    What I outlined is in the context of a statement-at-a-time compiler;
    it costs very little data space: The compiler already records for each
    local variable whether it is in a register (and which one) or in
    memory (and where); the storage class "static" also needs to be
    represented, but that does not affect the present discussion. The
    additional data for when you store the register variable to memory
    while evaluating an argument in that register can be as small as 8
    bits for each of the three registers used for variables: these 8 bits
    would tell if the variable currently resides in its register, or in
    memory, and if in memory, where.
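
    A minimal sketch of that bookkeeping, in C. The encoding (0 for "in its register", 1..255 for a stack slot number) and all names are assumptions made for illustration; the post only fixes the budget of about 8 bits per register variable.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REG_VARS 3   /* registers reserved for register variables */
#define IN_REGISTER  0   /* hypothetical encoding: 0 = still in its register */

/* One byte per register variable: 0 = in register, otherwise the
 * number (1..255) of the stack slot it was saved to. */
struct regvar_state {
    uint8_t where[NUM_REG_VARS];
};

/* Record that register variable r was saved to stack slot `slot`
 * so that its register can carry an argument. */
void note_spill(struct regvar_state *s, unsigned r, uint8_t slot)
{
    assert(r < NUM_REG_VARS && slot != IN_REGISTER);
    s->where[r] = slot;
}

/* Record that register variable r was reloaded into its register. */
void note_reload(struct regvar_state *s, unsigned r)
{
    assert(r < NUM_REG_VARS);
    s->where[r] = IN_REGISTER;
}

/* Nonzero if an access to variable r must go to memory right now. */
int is_spilled(const struct regvar_state *s, unsigned r)
{
    assert(r < NUM_REG_VARS);
    return s->where[r] != IN_REGISTER;
}
```

    The code generator would consult is_spilled() on each access to a register variable while an argument list is being evaluated, which is where the extra code space and the extra bug potential mentioned below come from.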

    It would cost somewhat more code space, so given that they were so
    heavily space-constrained, I understand that they did not want to go
    there.

    Another cost is that the potential for bugs increases quite
    significantly, so one would have to use quite a bit more testing for
    the same kind of reliability. Another reason not to go there.

    The BLISS-11 compiler might have done more clever register allocation
    but it ran on a PDP-10 which could address the equivalent of a
    megabyte, not the 11's 64K.

    The BLISS-11 compiler does global register allocation. It uses a very
    compact way to represent the necessary information: For each variable
    the start of its first live range and the end of its last live range
    is remembered, and that was used as approximate liveness information
    for determining whether two variables conflict. It will not allocate
    two variables to the same register where the second variable's live
    range fits in a live range hole of the first, but it can allocate two
    variables to the same register if the last use of one variable is
    before the first store to the other variable.
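
    The conflict test this implies can be sketched in C as follows. The struct and function names and the use of plain integers as program positions are assumptions for illustration, not taken from the BLISS-11 sources.

```c
#include <assert.h>

/* Approximate liveness: one interval per variable, from the first store
 * to the last use, measured in hypothetical program positions. */
struct extent {
    unsigned first_store;
    unsigned last_use;
};

/* Two variables may share a register only if one extent ends strictly
 * before the other begins. Holes inside an extent are not represented,
 * so a variable that would fit into such a hole still conflicts. */
int can_share_register(struct extent a, struct extent b)
{
    assert(a.first_store <= a.last_use && b.first_store <= b.last_use);
    return a.last_use < b.first_store || b.last_use < a.first_store;
}
```

    Two positions per variable are all the data this needs, which is the compactness the paragraph above refers to.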

    So, again, this does not cost a lot in data space, at least as far as
    variables are concerned. It does mean that one has to look at the
    whole function and do the register allocation before making any
    compilation decisions, though. It also costs code space that a simple statement-at-a-time compiler does not need. I guess one could do this
    on a PDP-11 with several passes, but if I have the choice to do it on
    a PDP-10, keeping the whole function and the data about its variables
    in memory, I would do it on the PDP-10; and I may have developed a
    BLISS-10 compiler on the PDP-10 already.

    But in any case, the global register allocation of BLISS-11 is far
    beyond what I was discussing.

    PDP-11 instructions were all one or two operand, with all operands being fully
    general.

    It's interesting that VAX generalized this to general three-address
    operations (and added a proper indexed mode), while the 68K and IA-32 architects decided to support only one memory operand for most
    instructions (but with more addressing modes, including proper indexed addressing modes). For the 68k the limitation to one memory operand
    for most instructions probably was not a matter of principle (it has a
    move instruction that supports two memory operands); my guess is that
    they decided that for encoding reasons.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Thomas Koenig on Mon Jul 14 10:30:06 2025
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    I have also seen
    gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
    part of Hennessey's small integer benchmarks (from the 1980s)

    I would like to quote Press, Teukolsky, Vetterling and Flannery,
    from "Numerical Recipes":

    "If you know what bubble sort is, wipe it from your mind; if you
    don't know, make a point of never finding out!"

    Unless you can prove that this kind of bad code generation by gcc can
    only occur for bubble sort, this benchmark is a reason to ignore this
    advice.

    Of course, an alternative is to close your eyes and ears and find some
    excuse for every case where gcc does something undesirable.
    "Undefined behaviour" is the default excuse, but you can vary the
    excuses by quoting from books; appeal to authority is a good argument
    in these times.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Anton Ertl on Mon Jul 14 16:51:57 2025
    On 13/07/2025 16:24, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    The key metrics are not, I think, instruction counts - but memory
    accesses and how likely they are to cause delays.

    And one might also wonder what hardware one should look at. AMD64
    does not use pcc-struct-returns by default, so finding out in how many
    cases 0-cycle store-to-load forwarding (implemented in recent cores) eliminates the delays does not tell us the performance characteristics
    on hardware that mostly executed IA-32 code where pcc-struct-returns
    are the default.

    As you show, having a pointer to "int * signgam" means that there will
    be only one extra write to memory (in the callee) and one extra read (in
    the caller) - while for a "pcc-struct-return" API you have two. However,
    those will be adjacent and probably combined.

    The stores go separately to the store units (and consume the resources there), and the stores are to write-back cache, not write-combining
    memory. The loads go separately to the load units and consume the
    resources there; no combining happens. The data will be in the
    D-cache in the usual case, and on recent hardware there could even be
    0-cycle store-to-load-forwarding.

    OK. (That is all, of course, very dependent on the processor in question.)


    If you are thinking about autovectorization by the compiler, yes, that
    could happen, but IMO it costs more than it buys.

    No, I was not thinking of that. I was thinking that adjacent memory
    accesses can be handled more efficiently in hardware than separate ones.
    You will probably avoid two cache misses, for example. And I would
    expect that on some processors at least, adjacent writes could be
    combined when there are databuses that are wider than the individual writes.

    I have also seen
    gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
    part of Hennessey's small integer benchmarks (from the 1980s) by auto-vectorizing the adjacent accesses of bubble-sort. Not only does
    the code execute significantly more instructions, it also hits a slow hardware path in store-to-load-forwarding on every store it performs
    in this way.

    Yes, I have also seen enthusiastic autovectorisation being
    counter-productive, especially if you are actually using small amounts
    of data. (clang/llvm seems keener on autovectorising code than gcc,
    IME.) And I've seen other situations in which "gcc -O3" generates
    slower code than "gcc -O2" - "gcc -O3" should only be used with care and extensive testing on the real code and real target.


    But even without this slow path, my expectation is that the auto-vectorization overhead would slow the benchmark down compared to
    the -O1 version (which is just scalar code), but how could I measure
    this?

    The slow path should not occur in the struct-return case, though.

    Another combining idea is the use of ARM A64's store pair and load
    pair instructions, which result in only one memory access for each
    such instruction and result in fewer instructions than doing unpaired
    loads and stores, while the code resulting from auto-vectorization on
    AMD64 is longer than two scalar stores and two scalar loads.

    Yes, that is another possibility.


    Unfortunately, store-pair and load-pair do not support storing or
    loading an FP and an integer value AFAIK.

    There are other circumstances where one might want to return a struct
    than just calling a gamma function!


    In theory, even if a struct return needs to pass a hidden pointer, the
    compiler knows more about it than for a general "int *" pointer
    parameter. It knows that there are no aliasing issues or "escapes" -
    when you have a local variable whose address is passed on to
    "lgamma_ertl", the compiler has to assume that the function might store
    the address and later functions might use it to change the value of the
    local variable "sign". With the hidden struct pointer, the compiler
    knows that access via the pointer is much more restricted.
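
    The two interface shapes under discussion, side by side in C. This is a sketch only: the computation is a stand-in (absolute value and sign of x) rather than a real log-gamma, and the names magnitude_ptr, magnitude_struct and mag_result are hypothetical.

```c
#include <assert.h>

/* Stand-in for lgamma: the magnitude is just |x| and the sign is the sign
 * of x. Only the shape of the two interfaces matters here. */

/* Variant 1: second result through a pointer out-parameter. */
double magnitude_ptr(double x, int *sign)
{
    assert(sign != 0);
    *sign = (x < 0) ? -1 : 1;
    return (x < 0) ? -x : x;
}

/* Variant 2: both results in a returned struct. */
struct mag_result {
    double result;
    int    sign;
};

struct mag_result magnitude_struct(double x)
{
    struct mag_result r;
    r.sign   = (x < 0) ? -1 : 1;
    r.result = (x < 0) ? -x : x;
    return r;
}
```

    With variant 1 the caller has to take the address of a local int; a compiler that cannot see the callee's body must assume that address may have been stored somewhere. With variant 2 any hidden result pointer is created by the compiler itself, so no user-visible address is handed out.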

    (With C23, a function like "lgamma_ertl" would be marked
    [[unsequenced]], or at least [[reproducible]], which would let the
    compiler make similar assumptions for optimisation.

    You mean that the programmer could mark the function in that way?

    Yes. Or, for a library function, the library header would mark it that
    way in the declaration.


    Wouldn't some use of "restrict" give the compiler similar information?
    I just don't know where in the code to apply "restrict". Maybe

    double lgamma_ertl2(double x, int *restrict signgam);

    ?

    I don't see how "restrict" would help here.

    If the lgamma_ertl2 function is declared "[[unsequenced]]", then the
    compiler knows that it will not store the "signgam" pointer anywhere
    else. Thus it knows any other functions called after lgamma_ertl2
    cannot change the variable that "signgam" pointed to.

    (Marking it as [[unsequenced]] or [[reproducible]] gives other
    optimisation advantages for the calling code, and would be a good idea
    anyway even if the function returned a struct. But a function that
    changes a global variable cannot be thus marked.)


    Would struct returns have been used more if they were not so
    inefficient?

    Possibly. I certainly remember wanting to use them for something Gforth-internal, and then deciding against them after seeing the
    generated code.

    E.g., MIPS (1986) got a calling convention that passes the first four
    words of parameters in integer registers and the rest on the stack.
    That's not particularly efficient for passing FP parameters, but it
    meant that calls to functions, including varargs functions like
    printf() would work without prototypes (C89 only came later) and
    varargs functions could be implemented simply by storing these four
    registers to the stack (IIRC the four slots for these parameter words
    were reserved).

    I think it's more complicated: If the first parameter is an integer
    one, then do everything in integer registers, otherwise pass FP stuff
    in FP registers. Probably the idea is that varargs functions always
    start with an integer parameter.

    Later I saw a calling convention (IIRC Alpha) where parameter n was
    passed in integer register n if it was integer and FP register n if it
    was an FP value. The respective other register went unused.

    Recently I have seen a calling convention (IIRC RISC-V) where the used integer register are allocated one after the other whether there were
    FP parameters interleaved or not, and the same on the FP side. I
    don't remember what happens if the call runs out of one kind of
    register, and the other kind is still available.

    Instead, many current ABI's are
    at least sub-optimal for structs

    Which ones do you have in mind?


    The architecture that is most relevant for my daily work, and where
    efficiency matters to me, is 32-bit ARM for embedded systems.

    ARM A32 (and T32 uses the same calling conventions) is from around the
    same time as MIPS, so similar calling conventions are to be expected. However, I see various ABIs mentioned in the descriptions of various
    things (eABI, oABI, etc.). So apparently they did several.

    Yes, there have been a few modifications to the ARM32 ABI - there are
    also small differences, I believe, between the details for Linux,
    Windows and embedded toolchains. (It's only the last one that is really relevant now at 32-bit.)


    I realise 32-bit ARM was around before much of this was relevant (I
    first played with ARM assembly in 1988 as a schoolkid). But it is
    surely possible to modernise things a little?

    Breaking compatibility has an immediate cost and (hopefully) a
    long-term return. It's a really hard sell. But apparently ARM with
    their several ABIs has gone there. Too little?


    Yes, too little.

    The immediate cost for embedded toolchains would not be too high -
    certainly not in comparison to hosted targets. You need to add a
    compiler flag for the new ABI (along with support via __attribute__,
    #pragma, etc.), and add it to the list of static library builds you make
    for the toolchain. Then developers can use it simply by adding the flag
    to their CCFLAGS in their makefile, or whatever build system they like.
    Those who have existing pre-compiled binaries (such as commercial
    libraries or RTOS's) won't be able to use it easily until their supplier updates the libraries, but that would happen sooner or later.

  • From Michael S@21:1/5 to Thomas Koenig on Mon Jul 14 19:30:06 2025
    On Sun, 13 Jul 2025 17:00:44 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Anton Ertl <[email protected]> schrieb:
    I have also seen
    gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
    part of Hennessey's small integer benchmarks (from the 1980s)

    I would like to quote Press, Teukolsky, Vetterling and Flannery,
    from "Numerical Recipes":

    "If you know what bubble sort is, wipe it from your mind; if you
    don't know, make a point of never finding out!"

    The same can be said (with stronger justification) of many of their
    recipes. Less so of the algorithms, more so of the code.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jul 14 17:33:34 2025
    On Mon, 14 Jul 2025 6:22:29 +0000, Anton Ertl wrote:
    ------------
    PDP-11 instructions were all one or two operand, with all operands being
    fully general.

    It's interesting that VAX generalized this to general three-address operations (and added a proper indexed mode), while the 68K and IA-32 architects decided to support only one memory operand for most
    instructions (but with more addressing modes, including proper indexed addressing modes). For the 68k the limitation to one memory operand
    for most instructions probably was not a matter of principle (it has a
    move instruction that supports two memory operands); my guess is that
    they decided that for encoding reasons.

    When I was doing 88100 at Motorola, the 68020 guys would say that
    once there were sufficient resources, they could make a MOV-CALK
    run just as fast as a 2-operand 1-result instruction model

    68020
    MOV D3,D2 // first 16-bits
    CALK D3,D1 // 32-bits

    88100
    CALK D3,D2,D1 // 32-bits

    I am still of the opinion that fewer instructions remains better;
    especially if they occupy the same code footprint.

    - anton

  • From Thomas Koenig@21:1/5 to Anton Ertl on Mon Jul 14 19:03:32 2025
    Anton Ertl <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    I have also seen
    gcc -O3 slow itself down below the gcc -O0 level on the bubblesort
    part of Hennessey's small integer benchmarks (from the 1980s)

    I would like to quote Press, Teukolsky, Vetterling and Flannery,
    from "Numerical Recipes":

    "If you know what bubble sort is, wipe it from your mind; if you
    don't know, make a point of never finding out!"

    Unless you can prove that this kind of bad code generation by gcc can
    only occur for bubble sort, this benchmark is a reason to ignore this
    advice.

    Not for me to prove anything.

    But as I'm sure that you have filed a PR, because you are such
    a constructive person bent on helping others instead of whining.

    Could you give me the PR number? I could then re-check and
    (if necessary) re-confirm.


    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

  • From Terje Mathisen@21:1/5 to John Levine on Mon Jul 14 21:34:33 2025
    John Levine wrote:
    According to Anton Ertl <[email protected]>:
    Anyway, I expect that Unix already had a calling convention on PDP-11
    and several other machines, and of course PCC followed that
    convention. As for the C compiler that introduced these calling
    conventions (probably by Ritchie), my guess is that he was happy to
    produce a working C compiler that ran in the little RAM they had.

    It was two passes each about 24K bytes and a third optional optimizer
    that slightly rewrote the assembler code.

    The Ritchie compiler and I think PCC reserved up to three registers
    for declared register variables, and used the rest as a stack for temporaries. It used Sethi-Ullman numbering to do the more complex subexpressions first to avoid running out of registers. If it did
    run out of registers I think it just gave up, but I don't ever
    remember that happening.

    Reserving more registers would have been really hard.

    I agree that on the 386 it would probably have been practical to pass arguments in registers, but I suspect that for whatever reason they
    wanted to make the calling sequence similar to the 8086 and 286.

    Not only that, but the 386 still had just 8 minus 1 or 2 total registers:

    If you only have eax, ebx, ecx, edx, esi, edi as regular registers, ebp
    as either frame pointer or (typically for leaf functions) another reg by
    making do without a frame pointer, then you had just 6 more-or-less
    general registers.

    Several had to be used by many instructions: edx+eax was always the
    target for 32x32->64-bit MUL, source for DIV, ecx (cl) had to be used
    for all variable shift counts etc.

    ESI/EDI/ECX were used for all string ops and block moves.

    In short, even for my own asm code I very rarely used more than two
    register variables as function parameters.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Jul 14 22:41:44 2025
    On Sun, 13 Jul 2025 14:22:55 GMT, Anton Ertl wrote:

    Ease of adapting 16-bit compilers and library routines might have been reasons.

    This is why I always felt that Intel took a short-sighted approach to each
    new generation of chips from 8086/80186 to 80286 to 80386.

    Contrast Motorola, where the original 16-bit 68000 was clearly a cut-down 32-bit design to begin with. The progression to the 68020 was largely a
    matter of filling in obvious gaps, which made the software transition so
    much easier.

  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Jul 14 22:47:02 2025
    On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:

    I am still of the opinion that fewer instructions remains better;
    especially if they occupy the same code footprint.

    I remember this rather large (6:1 code size ratio) counterexample from the
    VAX ...

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Jul 14 23:14:19 2025
    On Mon, 14 Jul 2025 22:47:02 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:

    I am still of the opinion that fewer instructions remains better;
    especially if they occupy the same code footprint.

    I remember this rather large (6:1 code size ratio) counterexample from
    the VAX ...

    As I remember::

    CALLS and RET could be faster when using JSR and JMP and SW pushes
    and pops of preserved registers.

    Coroutines would use JSR +(SP) between co-routines. {pop one off
    then push the new return address on}.

    POLY could be faster in instructions when there were enough terms for
    Estrin's method to pay dividends.

    Simple (i.e., COBOL picture) EDIT and MARK could be faster with
    just instructions.

    VAX was admired and beloved for a decade, before sliding off into insignificance.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 15 00:58:52 2025
    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    On Mon, 14 Jul 2025 22:47:02 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 14 Jul 2025 17:33:34 +0000, MitchAlsup1 wrote:

    I am still of the opinion that fewer instructions remains better;
    especially if they occupy the same code footprint.

    I remember this rather large (6:1 code size ratio) counterexample from
    the VAX ...

    As I remember::

    [examples omitted]

    Maybe true, but I doubt any of them made this much difference. The big one
    was this: saving registers R0-R5 on entry to a kernel routine (which
    happened quite commonly) could be done most compactly as

    PUSHR #^M<R0,R1,R2,R3,R4,R5>

    which was a single instruction of just 2 bytes. Or it could be done much
    more verbosely as

    PUSHL R5
    PUSHL R4
    PUSHL R3
    PUSHL R2
    PUSHL R1
    PUSHL R0

    which was 6 instructions totalling 12 bytes.

    The latter was faster.
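
    The 2-versus-12-byte arithmetic can be checked with a small C sketch. The encoding details in the comments (one-byte opcodes, a one-byte short-literal operand for masks up to 63, a one-byte register specifier for PUSHL) are assumptions about the VAX operand encoding, not something stated in the post; the function names are hypothetical.

```c
#include <assert.h>

/* Build a PUSHR-style register mask: bit n set means "push Rn". */
unsigned reg_mask(unsigned first, unsigned last)
{
    assert(first <= last && last <= 14);
    unsigned mask = 0;
    for (unsigned r = first; r <= last; r++)
        mask |= 1u << r;
    return mask;
}

/* Size in bytes of PUSHR with a constant mask. Assumption: one opcode byte
 * plus a one-byte short literal when the mask fits in 6 bits (0..63),
 * otherwise one specifier byte plus a 16-bit immediate. */
unsigned pushr_bytes(unsigned mask)
{
    return mask <= 63 ? 2u : 4u;
}

/* Size in bytes of the equivalent run of PUSHL Rn instructions.
 * Assumption: one opcode byte plus one register specifier byte each. */
unsigned pushl_bytes(unsigned mask)
{
    unsigned n = 0;
    for (; mask != 0; mask >>= 1)
        n += mask & 1u;
    return 2u * n;
}
```

    For R0-R5 the mask is 0x3F, which under these assumptions gives 2 bytes for PUSHR and 12 bytes for six PUSHLs, matching the 6:1 ratio quoted above. The sketch says nothing about speed.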

    POLY could be faster in instructions when there were enough terms for Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the examples I came across in my numerical-analysis courses, evaluation terminated much more commonly based on convergence to the final result, not on some
    predetermined number of terms. But the VAX instruction only did a
    predetermined number of terms. So it didn’t seem that useful in real life.

    VAX was admired and beloved for a decade, before sliding off into insignificance.

    Remember it straddled those transitions between instruction sets that had annoying arbitrary restrictions because of hardware limitations, to the intermediate era when the hardware limitations went away, then onto the
    RISC era, when instruction sets went back to simplicity, but in a
    different direction and for a new reason: because that was the way to
    maximize performance.

  • From Anton Ertl@21:1/5 to [email protected] on Tue Jul 15 05:34:44 2025
    [email protected] (MitchAlsup1) writes:
    When I was doing 88100 at Motorola, the 68020 guys would say that
    once there were sufficient resources, they could make a MOV-CALK
    run just as fast as a 2-operand 1-result instruction model

    68020
    MOV D3,D2 // first 16-bits
    CALK D3,D1 // 32-bits

    88100
    CALK D3,D2,D1 // 32-bits

    That day arrived at the latest when Sandy Bridge was released in 2011
    with its separate physical register files and register renamer. It
    usually handles the register-register mov in the renamer, resulting in
    0-cycle movs, especially in cases like these where the result of the
    mov is overwritten soon. Another option would be to let the decoder
    combine the MOV and the CALK into one three-address microinstruction.

    I am still of the opinion that fewer instructions remains better;
    especially if they occupy the same code footprint.

    Intel apparently thinks so; they introduce three-address encodings for
    the existing instructions with APX.

    What is the advantage of APX over the register renamer approach? It
    takes fewer resources in the register renamer (which is often the
    narrowest part of a core).

    What is the advantage of APX over combining the instructions in the
    decoder? If the CALK part traps (e.g, because it includes a memory
    access), the architecture requires that the exception handler is
    presented with the architectural state between the MOV and the CALK,
    and this requires additional complications, while an architectural three-address instruction does not have this complication.

    IIRC there are code size advantages to the APX three-address encodings
    over the MOV-CALK combination in some, but not all cases.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jul 15 06:04:03 2025
    Thomas Koenig <[email protected]> writes:
    But as I'm sure that you have filed a PR, because you are such
    a constructive person bent on helping others instead of whining.

    We have been over that before: I have reported gcc bugs in the past,
    but my experience in the last few decades is that it is not at all constructive, but a waste of time. See, e.g., PR93811.

    But if you think that it is useful, spend your own time on it. In the
    meantime I still amuse myself by making fun of gcc and clang failures.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Jul 15 06:53:59 2025
    On Tue, 15 Jul 2025 06:04:03 GMT, Anton Ertl wrote:

    We have been over that before: I have reported gcc bugs in the past, but
    my experience in the last few decades is that it is not at all
    constructive, but a waste of time. See, e.g., PR93811.

    They seem to think it is not needed on PowerPC <https://gcc.gnu.org/pipermail/gcc-bugs/2020-February/690898.html>.

  • From Michael S@21:1/5 to Anton Ertl on Tue Jul 15 13:14:05 2025
    On Tue, 15 Jul 2025 05:34:44 GMT
    [email protected] (Anton Ertl) wrote:

    [email protected] (MitchAlsup1) writes:
    When I was doing 88100 at Motorola, the 68020 guys would say that
    once there were sufficient resources, they could make a MOV-CALK
    run just as fast as a 2-operand 1-result instruction model

    68020
    MOV D3,D2 // first 16-bits
    CALK D3,D1 // 32-bits

    88100
    CALK D3,D2,D1 // 32-bits

    That day arrived at the latest when Sandy Bridge was released in 2011
    with its separate physical register files and register renamer. It
    usually handles the register-register mov in the renamer, resulting in 0-cycle movs, especially in cases like these where the result of the
    mov is overwritten soon.

    All that is great for low-IPC latency-bound code. It helps little in
    high-IPC code, where the rename stage tends to be the narrowest bottleneck.

    Another option would be to let the decoder
    combine the MOV and the CALK into one three-address microinstruction.

    I am still of the opinion that fewer instructions remains better;
    especially if they occupy the same code footprint.

    Intel apparently thinks so; they introduce three-address encodings for
    the existing instructions with APX.

    What is the advantage of APX over the register renamer approach? It
    takes fewer resources in the register renamer (which is often the
    narrowest part of a core).

    What is the advantage of APX over combining the instructions in the
    decoder? If the CALK part traps (e.g, because it includes a memory
    access), the architecture requires that the exception handler is
    presented with the architectural state between the MOV and the CALK,
    and this requires additional complications, while an architectural three-address instruction does not have this complication.

    IIRC there are code size advantages to the APX three-address encodings
    over the MOV-CALK combination in some, but not all cases.

    - anton

    The biggest question about APX is "Will it ship?"

    X86S is canceled. Which is a good thing.

    AVX10 is canceled except for a few minor bits. Which is, maybe, a good thing
    from the point of view of software fragmentation, because now Intel is
    forced to implement AVX512 on their future E cores.
    I still think that from a technical perspective full-featured 256-bit
    SIMD is a better solution than the neither-here-nor-there 512-bit
    thing, but what do they say about water under the bridge?

    If APX does not ship in Panther Cove cores (not to be confused with
    Panther Lake SoC that is based on previous-generation cores) then it is
    dead. We will know how it is going pretty soon, no later than early
    2027, but likely earlier.

  • From Michael S@21:1/5 to Anton Ertl on Tue Jul 15 13:31:15 2025
    On Tue, 15 Jul 2025 06:04:03 GMT
    [email protected] (Anton Ertl) wrote:

    Thomas Koenig <[email protected]> writes:
    But I'm sure that you have filed a PR, because you are such
    a constructive person bent on helping others instead of whining.

    We have been over that before: I have reported gcc bugs in the past,
    but my experience in the last few decades is that it is not at all constructive, but a waste of time. See, e.g., PR93811.


    My personal experience with pessimization-related PRs is that the solution
    rate is low, but above zero. Something like 10-15% of my PRs were solved
    over a time span of a couple of gcc generations.
    Of course, in 2 generations' time many new pessimization cases pop up :(
    But still, I think that submitting this sort of PR is not totally
    useless.

    But if you think that it is useful, spend your own time on it. In the meantime I still amuse myself by making fun of gcc and clang failures.

    - anton

    I never submitted a PR to clang. Certainly not because I had never seen
    it generate horrendous code. I simply never cared to learn how to do
    it.

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Jul 15 13:46:10 2025
    Lawrence D'Oliveiro wrote:
    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    POLY could be faster in instructions when there were enough terms for
    Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the examples I came across in my numerical-analysis courses, evaluation terminated much more commonly based on convergence to the final result, not on some
    predetermined number of terms. But the VAX instruction only did a predetermined number of terms. So it didn’t seem that useful in real life.

    You obviously have never implemented any fp library:

    When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you
    pretty much always use fixed-number-of-term polys.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Jul 15 17:38:36 2025
    On Tue, 15 Jul 2025 5:34:44 +0000, Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    When I was doing 88100 at Motorola, the 68020 guys would say that
    once there were sufficient resources, they could make a MOV-CALK
    run just as fast as a 2-operand 1-result instruction model

    68020
    MOV D3,D2 // first 16-bits
    CALK D3,D1 // 32-bits

    88100
    CALK D3,D2,D1 // 32-bits

    That day arrived at the latest when Sandy Bridge was released in 2011
    with its separate physical register files and register renamer. It
    usually handles the register-register mov in the renamer, resulting in 0-cycle movs, especially in cases like these where the result of the
    mov is overwritten soon. Another option would be to let the decoder
    combine the MOV and the CALK into one three-address microinstruction.

    AMD K9 would have done that circa 2006--but I digress.

    I am still of the opinion that fewer instructions remains better;
    especially if they occupy the same code footprint.

    Intel apparently thinks so; they introduce three-address encodings for
    the existing instructions with APX.

    What is the advantage of APX over the register renamer approach? It
    takes fewer resources in the register renamer (which is often the
    narrowest part of a core).

    Having the compiler (an already NP-complete piece of work) do it
    is vastly better than having HW stumble over it and catch the
    ones it can.

    What is the advantage of APX over combining the instructions in the
    decoder? If the CALK part traps (e.g., because it includes a memory
    access), the architecture requires that the exception handler is
    presented with the architectural state between the MOV and the CALK,
    and this requires additional complications, while an architectural three-address instruction does not have this complication.

    An example from the My 66000 ISA that is illustrative:

    CALX--this is basically a LDD IP,[address] with R0=next instruction
    address (that is, it's a CALL from a table in memory).

    When CALX reads in a zero (ld.so has not loaded the dynamic library)
    the trap is presented with the CALX (not just the JMP Rk instruction)
    So there is a 5 instruction sequence that results in the GOT[index#]
    allowing ld.so to know which library was called and go do its job.

    The advantage is speed equal in the "it works case" and usefully
    faster in the "didn't work" cases. And of course the side advantages
    of not consuming a register, .....

    IIRC there are code size advantages to the APX three-address encodings
    over the MOV-CALK combination in some, but not all cases.

    - anton

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jul 15 17:44:19 2025
    On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:
    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    POLY could be faster in instructions when there were enough terms for
    Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the examples I
    came across in my numerical-analysis courses, evaluation terminated much
    more commonly based on convergence to the final result, not on some
    predetermined number of terms. But the VAX instruction only did a
    predetermined number of terms. So it didn’t seem that useful in real life.

    You obviously have never implemented any fp library:

    When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use fixed-number-of-term polys.

    Certainly when following Cody and Waite or J.M. Muller. But there are
    ways of implementing the same list as above, testing whether the
    significance has leveled off and exiting early. It is generally slower in
    the worst case and not much faster in the typical case--but it is a method
    taught in Numerical Methods classes.


    Terje

  • From Michael S@21:1/5 to [email protected] on Tue Jul 15 23:52:01 2025
    On Tue, 15 Jul 2025 17:44:19 +0000
    [email protected] (MitchAlsup1) wrote:

    On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:
    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    POLY could be faster in instructions when there were enough terms
    for Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the
    examples I came
    across in my numerical-analysis courses, evaluation terminated
    much more commonly based on convergence to the final result, not
    on some predetermined number of terms. But the VAX instruction
    only did a predetermined number of terms. So it didn’t seem that
    useful in real life.

    You obviously have never implemented any fp library:

    When you write code for things like
    log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use fixed-number-of-term polys.

    Certainly when following Cody and Waite or J.M. Muller. But there are
    ways of implementing the same list as above, testing whether the
    significance has leveled off and exiting early. It is generally slower in
    the worst case and not much faster in the typical case--but it is a method
    taught in Numerical Methods classes.


    Terje

    You mean, to sum starting from bigger terms to smaller terms?
    Something like:

        sum = a[0];
        xx = x;
        for (int i = 1; ; ++i) {
            sum1 = sum + xx * a[i];
            if (sum == sum1)
                break;
            sum = sum1;
            xx *= x;
        }


    That is the worst possible order of evaluation from perspective of
    precision.







  • From Thomas Koenig@21:1/5 to Anton Ertl on Tue Jul 15 21:01:06 2025
    Anton Ertl <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    But I'm sure that you have filed a PR, because you are such
    a constructive person bent on helping others instead of whining.

    We have been over that before: I have reported gcc bugs in the past,
    but my experience in the last few decades is that it is not at all constructive, but a waste of time. See, e.g., PR93811.

    You can also submit a patch, you know.

    But if you have a self-contained test case, post it here, I'll submit
    it for you.

    But if you think that it is useful, spend your own time on it. In the meantime I still amuse myself by making fun of gcc and clang failures.

    Non-constructive whining, paired with a heavy dose of arrogance.
    Oh well.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Tue Jul 15 17:44:04 2025
    Lawrence D'Oliveiro wrote:
    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    POLY could be faster in instructions when there were enough terms for
    Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the examples I came across in my numerical-analysis courses, evaluation terminated much more commonly based on convergence to the final result, not on some
    predetermined number of terms. But the VAX instruction only did a predetermined number of terms. So it didn’t seem that useful in real life.

    The problem with VAX POLY was that it was implemented differently on
    different models, with different mistakes. To save microcode it was
    eventually eliminated from hardware (traps to emulate if used).

    How the VAX Lost Its POLY (and EMOD and ACB_floating too), 2011 https://simh.trailing-edge.com/docs/vax_poly.pdf

  • From MitchAlsup1@21:1/5 to Michael S on Wed Jul 16 01:21:44 2025
    On Tue, 15 Jul 2025 20:52:01 +0000, Michael S wrote:

    On Tue, 15 Jul 2025 17:44:19 +0000
    [email protected] (MitchAlsup1) wrote:

    On Tue, 15 Jul 2025 11:46:10 +0000, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:
    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    POLY could be faster in instructions when there were enough terms
    for Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the
    examples I came
    across in my numerical-analysis courses, evaluation terminated
    much more commonly based on convergence to the final result, not
    on some predetermined number of terms. But the VAX instruction
    only did a predetermined number of terms. So it didn’t seem that
    useful in real life.

    You obviously have never implemented any fp library:

    When you write code for things like
    log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use
    fixed-number-of-term polys.

    Certainly when following Cody and Waite or J.M. Muller. But there are
    ways of implementing the same list as above, testing whether the
    significance has leveled off and exiting early. It is generally slower in
    the worst case and not much faster in the typical case--but it is a method
    taught in Numerical Methods classes.


    Terje

    You mean, to sum starting from bigger terms to smaller terms?
    Something like:

        sum = a[0];
        xx = x;
        for (int i = 1; ; ++i) {
            sum1 = sum + xx * a[i];
            if (sum == sum1)
                break;
            sum = sum1;
            xx *= x;
        }


    That is the worst possible order of evaluation from perspective of
    precision.

    Not in HW where you have a minimum of 2× fraction width.



  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Wed Jul 16 05:47:08 2025
    On Tue, 15 Jul 2025 13:46:10 +0200, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:

    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    POLY could be faster in instructions when there were enough terms for
    Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the examples I came
    across in my numerical-analysis courses, evaluation terminated much more
    commonly based on convergence to the final result, not on some
    predetermined number of terms. But the VAX instruction only did a
    predetermined number of terms. So it didn’t seem that useful in real life.

    You obviously have never implemented any fp library:

    When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you pretty much always use fixed-number-of-term polys.

    Computing π to a given precision: <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
    No fixed number of terms in the common algorithms, as you can see.

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Wed Jul 16 14:44:33 2025
    Lawrence D'Oliveiro wrote:
    On Tue, 15 Jul 2025 13:46:10 +0200, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:

    On Mon, 14 Jul 2025 23:14:19 +0000, MitchAlsup1 wrote:

    POLY could be faster in instructions when there were enough terms for
    Estrin's method to pay dividends.

    The problem with polynomial evaluation is, at least in the examples I came
    across in my numerical-analysis courses, evaluation terminated much more
    commonly based on convergence to the final result, not on some
    predetermined number of terms. But the VAX instruction only did a
    predetermined number of terms. So it didn’t seem that useful in real life.

    You obviously have never implemented any fp library:

    When you write code for things like log/ln/exp/sin/cos/tan/atan/etc, you
    pretty much always use fixed-number-of-term polys.

    Computing π to a given precision: <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
    No fixed number of terms in the common algorithms, as you can see.

    Quoting from your own link:

    Conclusion: What is the value of continued fractions?

    Clearly mathematicians have a lot of fun with them. But speaking as
    someone who does computation on a daily basis, I have to say I don’t
    think they’re a practical way of evaluating anything. Maybe I’m wrong,
    and someone who has delved more deeply into them can offer better
    examples of how to use them ...

    If this was supposed to show how you would use variable number of terms
    for common library functions, then I failed to understand it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Thu Jul 17 01:54:13 2025
    On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:

    Computing π to a given precision:
    <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
    No fixed number of terms in the common algorithms, as you can see.

    If this was supposed to show how you would use variable number of terms
    for common library functions, then I failed to understand it.

    Quote:

    Or compare this function, adapted from the recipes section of the
    decimal module documentation:

    [code omitted -- see reference]

    As you can see, this converges a lot quicker.

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Thu Jul 17 11:18:00 2025
    Lawrence D'Oliveiro wrote:
    On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:

    Computing π to a given precision:
    <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
    No fixed number of terms in the common algorithms, as you can see.

    If this was supposed to show how you would use variable number of terms
    for common library functions, then I failed to understand it.

    Quote:

    Or compare this function, adapted from the recipes section of the
    decimal module documentation:

    [code omitted -- see reference]

    As you can see, this converges a lot quicker.

    Another, somewhat important consideration:

    If you want to make it possible to auto-vectorize code, then you pretty
    much need for all instructions to have constant latency, maybe with a
    few exceptions that will then cause pipeline bubbles.

    This was definitely a requirement for the Mill fp emulation work I did.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jul 17 14:37:20 2025
    On Thu, 17 Jul 2025 14:15:30 +0000, Scott Lurndal wrote:

    Terje Mathisen <[email protected]> writes:
    Lawrence D'Oliveiro wrote:
    On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

    Another, somewhat important consideration:

    If you want to make it possible to auto-vectorize code, then you pretty
    much need for all instructions to have constant latency, maybe with a
    few exceptions that will then cause pipeline bubbles.

    For security purposes, all instruction timing must be data independent.

    I like this wording better than auto-vectorize.

  • From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Jul 17 14:15:30 2025
    Terje Mathisen <[email protected]> writes:
    Lawrence D'Oliveiro wrote:
    On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

    Another, somewhat important consideration:

    If you want to make it possible to auto-vectorize code, then you pretty
    much need for all instructions to have constant latency, maybe with a
    few exceptions that will then cause pipeline bubbles.

    For security purposes, all instruction timing must be data independent.

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Thu Jul 17 14:36:41 2025
    On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:
    On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:

    Computing π to a given precision:
    <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
    No fixed number of terms in the common algorithms, as you can see.

    If this was supposed to show how you would use variable number of terms
    for common library functions, then I failed to understand it.

    Quote:

    Or compare this function, adapted from the recipes section of the
    decimal module documentation:

    [code omitted -- see reference]

    As you can see, this converges a lot quicker.

    Another, somewhat important consideration:

    If you want to make it possible to auto-vectorize code, then you pretty
    much need for all instructions to have constant latency, maybe with a
    few exceptions that will then cause pipeline bubbles.

    Can I get your definition of "auto-vectorize"

    A wide-decode and a set of reservation stations can "vectorize" a
    loop or straight line of code. Does this qualify as "auto-vectorize" ??

    Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
    directive.

    This was definitely a requirement for the Mill fp emulation work I did.

    Given that there are a few instructions which can have variable latency
    and a spattering that HAVE TO HAVE variable latency this requirement
    causes "problems".

    In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
    at cycle 12, and it took 5 more cycles to KNOW that the result was
    properly rounded (all RMs). So, instead of having FDIV have 17 cycle
    latency, we allowed it to have 12 cycles of latency 87.5% of the time
    and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
    This is usefully faster than fixed 17 cycles.

    The same argument applies to SQRT.

    Any LD instruction backed by a cache HAS TO HAVE variable latency.
    Any memory ref with a translated address HAS TO HAVE variable
    latency (TLB miss).
    Store instruction waiting on long latency result data HAS TO HAVE
    variable latency between AGEN and Write.


    Terje

  • From Terje Mathisen@21:1/5 to All on Fri Jul 18 15:06:49 2025
    MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:
    On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:

    Computing π to a given precision:
    <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
    No fixed number of terms in the common algorithms, as you can see.

    If this was supposed to show how you would use variable number of terms
    for common library functions, then I failed to understand it.

    Quote:

         Or compare this function, adapted from the recipes section of the
         decimal module documentation:

         [code omitted -- see reference]

         As you can see, this converges a lot quicker.

    Another, somewhat important consideration:

    If you want to make it possible to auto-vectorize code, then you pretty
    much need for all instructions to have constant latency, maybe with a
    few exceptions that will then cause pipeline bubbles.

    Can I get your definition of "auto-vectorize"

    A wide-decode and a set of reservation stations can "vectorize" a
    loop or straight line of code. Does this qualify as "auto-vectorize" ??

    Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
    directive.

    This was definitely a requirement for the Mill fp emulation work I did.

    Given that there are a few instructions which can have variable latency
    and a spattering that HAVE TO HAVE variable latency this requirement
    causes "problems".

    Yeah, I do know that. Memory ops in SIMD style short vectors typically
    have all slots residing in the same cache line, so even though the
    latency is not predictable, it will probably be the same for all elements.

    In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
    at cycle 12, and it took 5 more cycles to KNOW that the result was
    properly rounded (all RMs). So, instead of having FDIV have 17 cycle
    latency, we allowed it to have 12 cycles of latency 87.5% of the time
    and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
    This is usefully faster than fixed 17 cycles.

    So if 87.5% of all divisions finish in 12 cycles, and you do 8 of them
    in parallel, then (for random inputs), all 8 will finish in 12 with a
    34% probability, leaving 17 cycles as the actual latency in 66% of all
    cases. Total average latency becomes 15.3 cycles, so most of the gain is
    lost.

    The same argument applies to SQRT.

    Any LD instruction backed by a cache HAS TO HAVE variable latency.
    Any memory ref with a translated address HAS TO HAVE variable
    latency (TLB miss).
    Store instruction waiting on long latency result data HAS TO HAVE
    variable latency between AGEN and Write.

    I don't think we disagree Mitch, I'm just stating that if you have a
    lockstep programming model, then variable latency per slot tends to end
    up with worst case latency all over, so if you could have done the Mc
    88K FDIV in a fixed 16-cycles, that might have been better for this
    particular programming model.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Fri Jul 18 15:16:47 2025
    On Fri, 18 Jul 2025 13:06:49 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 9:18:00 +0000, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:
    On Wed, 16 Jul 2025 14:44:33 +0200, Terje Mathisen wrote:

    Lawrence D'Oliveiro wrote:

    Computing π to a given precision:
    <https://github.com/HamPUG/meetings/tree/master/2022/2022-11-14/ldo>.
    No fixed number of terms in the common algorithms, as you can see.

    If this was supposed to show how you would use variable number of terms
    for common library functions, then I failed to understand it.

    Quote:

         Or compare this function, adapted from the recipes section of the
         decimal module documentation:

         [code omitted -- see reference]

         As you can see, this converges a lot quicker.

    Another, somewhat important consideration:

    If you want to make it possible to auto-vectorize code, then you pretty
    much need for all instructions to have constant latency, maybe with a
    few exceptions that will then cause pipeline bubbles.

    Can I get your definition of "auto-vectorize"

    A wide-decode and a set of reservation stations can "vectorize" a
    loop or straight line of code. Does this qualify as "auto-vectorize" ??

    Whereas, My 66000 VEC-LOOP is definitely a "compiler-vectorize"
    directive.

    This was definitely a requirement for the Mill fp emulation work I did.

    Given that there are a few instructions which can have variable latency
    and a spattering that HAVE TO HAVE variable latency this requirement
    causes "problems".

    Yeah, I do know that. Memory ops in SIMD style short vectors typically
    have all slots residing in the same cache line, so even though the
    latency is not predictable, it will probably be the same for all
    elements.

    In 1991, working on Mc 88120, we had FDIV that was within 0.125 ULP
    at cycle 12, and it took 5 more cycles to KNOW that the result was
    properly rounded (all RMs). So, instead of having FDIV have 17 cycle
    latency, we allowed it to have 12 cycles of latency 87.5% of the time
    and 17 cycles 12.5% of the time for an average latency of 12.625 cycles.
    This is usefully faster than fixed 17 cycles.

    So if 87.5% of all divisions finish in 12 cycles, and you do 8 of them
    in parallel, then (for random inputs), all 8 will finish in 12 with a
    34% probability, leaving 17 cycles as the actual latency in 66% of all
    cases. Total average latency becomes 15.3 cycles, so most of the gain is lost.

    If you are doing enough FDIVs to matter, the long counts and the short
    counts will be randomly distributed across the lanes. So the long-term
    average (OoO style) will approximate the previous cycle counts.

    You don't do this if all 8 lanes have to remain in lock step.

    The same argument applies to SQRT.

    Any LD instruction backed by a cache HAS TO HAVE variable latency.
    Any memory ref with a translated address HAS TO HAVE variable
    latency (TLB miss).
    Store instruction waiting on long latency result data HAS TO HAVE
    variable latency between AGEN and Write.

    I don't think we disagree Mitch, I'm just stating that if you have a
    lockstep programming model, then variable latency per slot tends to end
    up with worst case latency all over, so if you could have done the Mc
    88K FDIV in a fixed 16-cycles, that might have been better for this particular programming model.

    I whole-heartedly agree with that paragraph.

    Terje

  • From Waldek Hebisch@21:1/5 to [email protected] on Fri Jul 18 20:01:16 2025
    MitchAlsup1 <[email protected]> wrote:
    On Fri, 11 Jul 2025 14:50:58 +0000, Anton Ertl wrote:

    David Brown <[email protected]> writes:
    <snip>
    It would have been nice if, when struct returns and struct parameters
    were added to C, someone had taken time to improve the ABIs to make
    them efficient.

    Given the name of the calling convention variant, this was introduced
    in PCC (and probably struct returns themselves were introduced in
    PCC). PCC was released in 1979 on the machines of the day, such as
    the PDP-11; I am sure Johnson implemented a calling convention for
    struct passing and struct returns that used the least amount of code.
    If Johnson had had more space to play with, he probably would have had
    other things on the agenda before improving the struct return calling
    convention. E.g., the calling conventions at the time passed all
    parameters on the stack, and we still have this in the Intel calling
    convention for IA-32.

    Given that the PDP-11 had 6 usable general-purpose registers, and x86
    started out with a similar number, it would have been quite difficult to
    pass the first few arguments in registers. On the PDP-11 and x86 it was
    easy to push arguments onto the stack and address them in the callee from
    the stack.

    Watcom C for 386 offered a register-passing convention; IIRC the first
    3 integer (or equivalent) arguments were passed in registers.
    And this convention gave a measurable speedup compared to the standard
    convention.

    --
    Waldek Hebisch

  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Jul 18 20:12:38 2025
    Anton Ertl <[email protected]> wrote:
    [1] I have wondered about the selection of registers for the System V
    calling convention for the System V ABI for AMD64: the first 6
    arguments go in RDI, RSI, RDX, RCX, R8, R9. The first two are optimal
    for memcpy() implemented with REP MOVSB, but then RCX would be better
    in third position. RDI is also good for memset() with REP STOSB, RDI
    and RSI are also good for memcmp() with REP CMPSB, and I expect that
    there are other uses of REP instructions for implementing memory-block
    or string functions where the placement in RDI and RSI is
    helpful. Except that the library routines then often do not use the
    REP instructions.

    There is a paper by (IIRC) Jan Hubicka for the GCC developers' summit
    (probably in 2005) about targeting AMD64. This paper explains
    several ABI design decisions (but possibly not the ordering
    between RDX and RCX). The ABI was developed before the team doing
    the port had access to actual hardware, so they mostly looked at
    code size. IIRC the number of registers was chosen based on code
    size for a collection of benchmarks; 6 gave the smallest size.

    --
    Waldek Hebisch

  • From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Fri Jul 25 01:59:31 2025
    On Fri, 18 Jul 2025 20:01:16 -0000 (UTC), Waldek Hebisch wrote:

    Watcom C for 386 offered a register-passing convention; IIRC the first 3
    integer (or equivalent) arguments were passed in registers.
    And this convention gave a measurable speedup compared to the standard
    convention.

    WATCOM C was also used to compile FoxBase. When Microsoft acquired that,
    they tried switching to their own C compiler. Unfortunately this produced larger code, which made the program overflow the 640K RAM limit.

    Yes, it was that long ago.
