• Why 8 bit exit status codes?

    From Andreas Kempe@21:1/5 to All on Fri Feb 2 16:05:14 2024
    Hello everyone,

    I'm wondering why, at least on Linux and FreeBSD, a process exit
    status was chosen to be only the lower 8 bits in the C interface, i.e.
    exit() and wait().

    This did bite some colleagues at work at one point who were porting a
    modem manager from a real-time OS to Linux because they were returning
    negative status codes for errors. We fixed it by changing the status
    codes and I never really thought about why this is the state of
    things... until now!

    Having a look at man 3 exit on my FreeBSD system, it states

    Both functions make the low-order eight bits of the status argument
    available to a parent process which has called a wait(2)-family
    function.

    and that it is conforming to the C99 standard

    The exit() and _Exit() functions conform to ISO/IEC 9899:1999 (“ISO C99”).

    C99 7.20.4.3 § 5 states

    Finally, control is returned to the host environment. If the value of
    status is zero or EXIT_SUCCESS, an implementation-defined form of the
    status successful termination is returned. If the value of status is
    EXIT_FAILURE, an implementation-defined form of the status
    unsuccessful termination is returned. Otherwise the status returned
    is implementation-defined.

    which I read as the C standard leaving it to the implementation to
    decide how to handle the int type argument.

    Having a look at man 2 _exit, the system call man page, it says
    nothing about the lower 8 bits, but claims conformance with
    IEEE Std 1003.1-1990 ("POSIX.1") which says
    in Part 1: System Application Program Interface (API) [C Language], 3.2.2.2 § 2

    If the parent process of the calling process is executing a wait() or
    waitpid(), it is notified of the termination of the calling process
    and the low order 8 bits of status are made available to it; see
    3.2.1.

    that only puts a requirement on making the lower 8 bits available.
    Looking at a more modern POSIX, IEEE Std 1003.1-2017, that has
    waitid() defined, it has the following for _exit()

    The value of status may be 0, EXIT_SUCCESS, EXIT_FAILURE, or any
    other value, though only the least significant 8 bits (that is,
    status & 0377) shall be available from wait() and waitpid(); the
    full value shall be available from waitid() and in the siginfo_t
    passed to a signal handler for SIGCHLD.

    so the mystery of why the implementation is the way it is was
    dispelled.

    The question that remains is what the rationale behind using the lower
    8 bits was from the start? Is it historical legacy that no one wanted
    to change for backwards compatibility? Is there no need for exit codes
    larger than 8 bits?

    I don't know if I have ever come into contact with software that deals
    with status codes that actually looks at the full value. My daily
    driver shell, fish, certainly does not.

    --
    Best regards,
    Andreas Kempe

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Andreas Kempe on Fri Feb 2 16:33:40 2024
    Andreas Kempe <[email protected]> writes:
    Hello everyone,

    I'm wondering why, at least on Linux and FreeBSD, a process exit
    status was chosen to be only the lower 8 bits in the C interface, i.e.
    exit() and wait().

    <snip>

    The question that remains is what the rationale behind using the lower
    8 bits was from the start? Is it historical legacy that no one wanted
    to change for backwards compatibility? Is there no need for exit codes
    larger than 8 bits?

    The definition of the wait system call. Recall that the
    PDP-11 was a 16-bit computer and wait needed to be able
    to include metadata along with the exit status.

  • From Andreas Kempe@21:1/5 to All on Fri Feb 2 20:02:16 2024
    Den 2024-02-02 skrev Scott Lurndal <[email protected]>:
    Andreas Kempe <[email protected]> writes:
    Hello everyone,

    I'm wondering why, at least on Linux and FreeBSD, a process exit
    status was chosen to be only the lower 8 bits in the C interface, i.e.
    exit() and wait().

    <snip>

    The question that remains is what the rationale behind using the lower
    8 bits was from the start? Is it historical legacy that no one wanted
    to change for backwards compatibility? Is there no need for exit codes
    larger than 8 bits?

    The definition of the wait system call. Recall that the
    PDP-11 was a 16-bit computer

    I'm afraid that's a tall order. I had yet to learn how to read when
    they went out of production. :) Please excuse my ignorance.

    and wait needed to be able to include metadata along with the exit
    status.

    I'm a bit unclear on the order of things coming into being. Did their
    C implementation already use exit() with an int argument of size 16
    bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
    opting to mask out the lower 8 bits on platforms with wider ints to
    maintain backwards compatibility?

  • From Lawrence D'Oliveiro@21:1/5 to Andreas Kempe on Fri Feb 2 21:13:41 2024
    On Fri, 2 Feb 2024 16:05:14 -0000 (UTC), Andreas Kempe wrote:

    I'm wondering why, at least on Linux and FreeBSD, a process exit status
    was chosen to be only the lower 8 bits in the C interface, i.e.
    exit() and wait().

    I’ve never used that many different values. E.g. 0 for some test condition passed, 1 for failed, 2 for unexpected error.

    This did bite some colleagues at work at one point who were porting a
    modem manager from a real-time OS to Linux because they were returning negative status codes for errors.

    True enough:

    ldo@theon:~> python3 -c "import sys; sys.exit(1)"; echo $?
    1
    ldo@theon:~> python3 -c "import sys; sys.exit(-1)"; echo $?
    255

    But you could always sign-extend it.

  • From Scott Lurndal@21:1/5 to Andreas Kempe on Fri Feb 2 20:15:24 2024
    Andreas Kempe <[email protected]> writes:
    Den 2024-02-02 skrev Scott Lurndal <[email protected]>:
    Andreas Kempe <[email protected]> writes:
    Hello everyone,

    I'm wondering why, at least on Linux and FreeBSD, a process exit
    status was chosen to be only the lower 8 bits in the C interface, i.e.
    exit() and wait().

    <snip>

    The question that remains is what the rationale behind using the lower
    8 bits was from the start? Is it historical legacy that no one wanted
    to change for backwards compatibility? Is there no need for exit codes
    larger than 8 bits?

    The definition of the wait system call. Recall that the
    PDP-11 was a 16-bit computer

    I'm afraid that's a tall order. I had yet to learn how to read when
    they went out of production. :) Please excuse my ignorance.

    and wait needed to be able to include metadata along with the exit
    status.

    I'm a bit unclear on the order of things coming into being. Did their
    C implementation already use exit() with an int argument of size 16
    bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
    opting to mask out the lower 8 bits on platforms with wider ints to
    maintain backwards compatibility?

    The status argument to the wait system call returned
    a two part value; 8 bits of exit status and 8 bits
    that describe the termination conditions (e.g. the
    signal number that stopped or terminated the
    process).


    Here's the modern 32-bit layout (in little endian form):

    unsigned int __w_termsig:7; /* Terminating signal. */
    unsigned int __w_coredump:1; /* Set if dumped core. */
    unsigned int __w_retcode:8; /* Return code if exited normally. */
    unsigned int:16;

    It's just the PDP-11 unix 16-bit version with 16 unused padding bits.

    SVR4 added the waitid(2) system call which via the siginfo argument has
    access to the full 32-bit program exit status.

  • From Lawrence D'Oliveiro@21:1/5 to Andreas Kempe on Fri Feb 2 21:40:32 2024
    On Fri, 2 Feb 2024 21:20:22 -0000 (UTC), Andreas Kempe wrote:

    Why not use a char in exit() instead of int, with wait() returning the
    full 16 bits? If the program itself fills in the upper 8 bits, it makes
    sense, but otherwise I don't understand from an API perspective why one
    would use a data type with the caveat that only half is used.

    The other half contains information like whether the low half is actually
    an explicit exit code, or something else like a signal that killed the
    process. Or an indication that the process has not actually terminated,
    but is just stopped.

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Fri Feb 2 21:38:54 2024
    On Fri, 02 Feb 2024 13:23:52 -0800, Keith Thompson wrote:

    The curl command defines nearly 100 error codes ("man curl" for
    details). That's the most I've seen.

    Another reason for staying away from curl, I would say. It needlessly replicates the functionality of a whole lot of different protocol clients,
    when all you need is HTTP/HTTPS (maybe FTP/FTPS as well). That’s why I
    stick to wget.

    (On Plan 9, a program's exit status is (was?) a string, empty for
    success, a description of the error condition on error. It's a cool
    idea, but I can imagine it introducing some interesting problems.)

    What, not a JSON object?

  • From Andreas Kempe@21:1/5 to All on Fri Feb 2 21:20:22 2024
    Den 2024-02-02 skrev Scott Lurndal <[email protected]>:
    Andreas Kempe <[email protected]> writes:
    I'm a bit unclear on the order of things coming into being. Did their
    C implementation already use exit() with an int argument of size 16
    bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
    opting to mask out the lower 8 bits on platforms with wider ints to
    maintain backwards compatibility?

    The status argument to the wait system call returned
    a two part value; 8 bits of exit status and 8 bits
    that describe the termination conditions (e.g. the
    signal number that stopped or terminated the
    process).


    Here's the modern 32-bit layout (in little endian form):

    unsigned int __w_termsig:7; /* Terminating signal. */
    unsigned int __w_coredump:1; /* Set if dumped core. */
    unsigned int __w_retcode:8; /* Return code if exited normally. */
    unsigned int:16;

    It's just the PDP-11 unix 16-bit version with 16 unused padding bits.


    Thank you for the clarification, but I don't think I have any problem
    grasping how the implementation works. My thoughts are about why they
    did what they did.

    Why not use a char in exit() instead of int, with wait() returning the
    full 16 bits? If the program itself fills in the upper 8 bits, it
    makes sense, but otherwise I don't understand from an API perspective
    why one would use a data type with the caveat that only half is used.

    If we already have exit() and wait() using ints and want to stuff our
    extra information in there without changing the API, it also makes
    sense.

    SVR4 added the waitid(2) system call which via the siginfo argument has access to the full 32-bit program exit status.

  • From Andreas Kempe@21:1/5 to All on Sat Feb 3 13:21:29 2024
    Den 2024-02-03 skrev Keith Thompson <[email protected]>:
    Andreas Kempe <[email protected]> writes:

    Why not use a char in exit() instead of int, with wait() returning the
    full 16 bits? If the program itself fills in the upper 8 bits, it
    makes sense, but otherwise I don't understand from an API perspective
    why one would use a data type with the caveat that only half is used.

    C tends to use int values even for character data (when not an element
    of a string). See for example the return types of getchar(), fgetc(),
    et al, and even the type of character constants ('x' is of type int, not char).


    I thought the reason for the int return type was to have an error code
    outside of the range of the valid data, with EOF being defined as
    being a negative integer. A reason that isn't applicable for the
    argument passing to exit by a program.

    In early C, int was in many ways a kind of default type. Functions with
    no visible declaration were assumed to return int. The signedness of
    plain char is implementation-defined.

    I realised that char was a bad example just as I posted. I should have
    chosen unsigned char instead.

    Supporting exit values from 0 to 255 is fairly reasonable. Using an
    int to store that value is also fairly reasonable -- especially
    since main() returns int, and exit(n) is very nearly equivalent to
    return n in main(). Ignoring all but the low-order 8 bits is not
    specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
    bits of the return value.


    Yes, in my original post, I detailed that the restriction does not
    come from the C standard, but from POSIX. I'm not sure which came
    first.

    If C was first with having an exit() function and an int return for
    main, I can imagine that it went something like this

    - C chooses int for main
    - C uses int in exit() to match main
    - OS folks want to store extra data in the exit status, but they
    want to match the C API
    - let's just stuff it in the upper bits and keep the API the same with
    an imposed restriction on the value in POSIX

    or POSIX exit() was constructed with the int from main in mind, or it
    could just be, as you point out, that int is a nice default integer
    type and there wasn't much thought put into it beyond that.

    I can speculate about a bunch of different reasons, but I'm curious if
    anyone knows what the actual reasoning was.

  • From Janis Papanagnou@21:1/5 to Andreas Kempe on Sat Feb 3 16:38:39 2024
    On 03.02.2024 14:21, Andreas Kempe wrote:
    Den 2024-02-03 skrev Keith Thompson <[email protected]>:
    Andreas Kempe <[email protected]> writes:

    Why not use a char in exit() instead of int, with wait() returning the
    full 16 bits? If the program itself fills in the upper 8 bits, it
    makes sense, but otherwise I don't understand from an API perspective
    why one would use a data type with the caveat that only half is used.

    C tends to use int values even for character data (when not an element
    of a string). See for example the return types of getchar(), fgetc(),
    et al, and even the type of character constants ('x' is of type int, not
    char).


    I thought the reason for the int return type was to have an error code outside of the range of the valid data, with EOF being defined as
    being a negative integer. A reason that isn't applicable for the
    argument passing to exit by a program.

    In early C, int was in many ways a kind of default type. Functions with
    no visible declaration were assumed to return int. The signedness of
    plain char is implementation-defined.

    I realised that char was a bad example just as I posted. I should have
    chosen unsigned char instead.

    Supporting exit values from 0 to 255 is fairly reasonable. Using an
    int to store that value is also fairly reasonable -- especially
    since main() returns int, and exit(n) is very nearly equivalent to
    return n in main(). Ignoring all but the low-order 8 bits is not
    specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
    bits of the return value.


    Yes, in my original post, I detailed that the restriction does not
    come from the C standard, but from POSIX. I'm not sure which came
    first.

    If C was first with having an exit() function and an int return for
    main, I can imagine that it went something like this

    - C chooses int for main
    - C uses int in exit() to match main
    - OS folks want to store extra data in the exit status, but they
    want to match the C API
    - let's just stuff it in the upper bits and keep the API the same with
    an imposed restriction on the value in POSIX

    or POSIX exit() was constructed with the int from main in mind, or it
    could just be, as you point out, that int is a nice default integer
    type and there wasn't much thought put into it beyond that.

    I can speculate about a bunch of different reasons, but I'm curious if
    anyone knows what the actual reasoning was.

    AFAICT: "historical reasons". You have some bits to carry the exit
    status, some bits to carry other termination information (signals),
    and optionally some more bits to carry other supplementary information.
    If you want all that information carried in a single primitive
    data type, you have to draw a line somewhere. Given that these days
    one cannot assume that more than 16 bits are guaranteed in the default
    'int' type, it seems quite obvious to split at 8 bits. (For practical
    reasons, differentiating 255 error codes seems more than enough if
    we consider what evaluating and individually acting on all of them
    at the calling/environment level would mean.)

    Janis

  • From Scott Lurndal@21:1/5 to Andreas Kempe on Sat Feb 3 21:34:29 2024
    Andreas Kempe <[email protected]> writes:
    Den 2024-02-03 skrev Keith Thompson <[email protected]>:
    Andreas Kempe <[email protected]> writes:

    Yes, in my original post, I detailed that the restriction does not
    come from the C standard, but from POSIX. I'm not sure which came
    first.

    The restriction predates both. It was how unix v6 worked; every
    version of unix thereafter continued that so that existing applications
    would not need to be rewritten.

    It was documented in the SVID (System V Interface Definition) which
    was part of the source materials used by X/Open when developing
    the X Portability Guides (xpg) (which became the SuS).

    Ken and Dennis chose to implement the wait system call (which
    the shell uses to collect the exit status) with an 8-bit value
    so they could use the other 8 bits of the 16-bit int for metadata.

    This could never be changed without breaking applications, so
    we still have it today in unix, linux and other POSIX-compliant
    operating environments.

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Sat Feb 3 21:37:55 2024
    On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

    The signedness of plain char is implementation-defined.

    Why? Because the PDP-11 on which C and Unix were originally developed did
    sign extension when loading a byte quantity into a (word-length) register.

    Signed characters make no sense.

  • From Joe Pfeiffer@21:1/5 to Lawrence D'Oliveiro on Sat Feb 3 20:33:19 2024
    Lawrence D'Oliveiro <[email protected]d> writes:

    On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

    The signedness of plain char is implementation-defined.

    Why? Because the PDP-11 on which C and Unix were originally developed did sign extension when loading a byte quantity into a (word-length) register.

    Signed characters make no sense.

    Except in architectures where they do. If you're doing something where
    it matters (or even if you want your code to be more readable) use
    signed char or unsigned char as appropriate.

  • From Lawrence D'Oliveiro@21:1/5 to Joe Pfeiffer on Sun Feb 4 06:41:25 2024
    On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

    The signedness of plain char is implementation-defined.

    Why? Because the PDP-11 on which C and Unix were originally developed
    did sign extension when loading a byte quantity into a (word-length)
    register.

    Signed characters make no sense.

    Except in architectures where they do.

    There are no character encodings which assign meanings to negative codes.

  • From Richard Kettlewell@21:1/5 to Joe Pfeiffer on Sun Feb 4 08:49:13 2024
    Joe Pfeiffer <[email protected]> writes:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:
    The signedness of plain char is implementation-defined.

    Why? Because the PDP-11 on which C and Unix were originally developed did
    sign extension when loading a byte quantity into a (word-length) register.

    Signed characters make no sense.

    Except in architectures where they do.

    Such as?

    If you're doing something where it matters (or even if you want your
    code to be more readable) use signed char or unsigned char as
    appropriate.

    Signed 8-bit integers are perfectly sensible, signed characters not so
    much.

    --
    https://www.greenend.org.uk/rjk/

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Sun Feb 4 16:25:03 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

    The signedness of plain char is implementation-defined.

    Why? Because the PDP-11 on which C and Unix were originally developed
    did sign extension when loading a byte quantity into a (word-length)
    register.

    Signed characters make no sense.

    Except in architectures where they do.

    There are no character encodings which assign meanings to negative codes.

    But then 'signed char' doesn't necessarily need to be used
    for character encoding (consider int8_t, for example, which
    defines a signed arithmetic type from -128..+127).

    On the 16-bit PDP-11, signed 8-bit values would not have been uncommon,
    if only because of the limited address space.

  • From Rainer Weikusat@21:1/5 to Andreas Kempe on Mon Feb 5 16:11:09 2024
    Andreas Kempe <[email protected]> writes:
    Den 2024-02-03 skrev Keith Thompson <[email protected]>:
    Andreas Kempe <[email protected]> writes:

    [...]

    If C was first with having an exit() function and an int return for
    main, I can imagine that it went something like this

    - C chooses int for main
    - C uses int in exit() to match main
    - OS folks want to store extra data in the exit status, but they
    want to match the C API
    - let's just stuff it in the upper bits and keep the API the same with
    an imposed restriction on the value in POSIX

    or POSIX exit() was constructed with the int from main in mind, or it
    could just be, as you point out, that int is a nice default integer
    type and there wasn't much thought put into it beyond that.

    I can speculate about a bunch of different reasons, but I'm curious if
    anyone knows what the actual reasoning was.

    This should be pretty obvious: A C int is really a machine data type in
    disguise, namely, whatever fits into a common general purpose register
    of a certain machine. C was created for porting UNIX to
    the PDP-11 (or rather, rewriting UNIX for the PDP-11 with the goal of
    not having to rewrite it again for the next type of machine which would
    need to be supported by it). Putting a value into a certain register is a
    common convention for returning values from functions (or rather, Dennis
    Ritchie probably thought it would be a sensible convention at that
    time). Hence, having main return an int was the 'natural' idea and
    allocating the lower half of this int to applications wishing to return
    status codes and the upper half to the system for returning
    system-specific metadata was also the 'natural' idea.

    Surely, eight whole bits must be enough for everyone! :-)

  • From Rainer Weikusat@21:1/5 to Keith Thompson on Mon Feb 5 16:12:52 2024
    Keith Thompson <[email protected]> writes:

    [...]

    (On Plan 9, a program's exit status is (was?) a string, empty for
    success, a description of the error condition on error. It's a cool
    idea, but I can imagine it introducing some interesting problems.)

    That's interesting to know as I have been using the same convention for validation functions in Perl for some years: These return nothing when everything was ok or a textual error message otherwise.

  • From Kees Nuyt@21:1/5 to [email protected] on Mon Feb 5 18:22:59 2024
    On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer
    <[email protected]> wrote:

    Signed characters make no sense.

    Nor did 6 bit characters, but in the 1980s we had them:
    3 characters in a 24 bit word.
    Welcome to what was then called mini or midrange computers.

    (Yes, looking at you, Harris, with its Vulcan Operating System)

    --
    Regards,
    Kees Nuyt

  • From Andreas Kempe@21:1/5 to All on Mon Feb 5 19:02:24 2024
    Thank you everyone for the different informative replies and
    historical insight! I think I have gotten what I can out of this
    thread.

  • From Lawrence D'Oliveiro@21:1/5 to Kees Nuyt on Mon Feb 5 22:41:39 2024
    On Mon, 05 Feb 2024 18:22:59 +0100, Kees Nuyt wrote:

    On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:

    Signed characters make no sense.

    Nor did 6 bit characters, but in the 1980s we had them:
    3 characters in a 24 bit word.

    I see your sixbit and raise you Radix-50, which packed 3 characters into a 16-bit word.

    None of these used signed character codes, by the way. So my point still stands.

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Feb 6 00:58:31 2024
    On Mon, 05 Feb 2024 15:51:37 -0800, Keith Thompson wrote:

    My understanding is that on the PDP-11, making plain char signed made
    code that stored character values in int objects more efficient.
    Sign-extension was more efficient than zero-filling or something like
    that.

    The move-byte instruction did sign-extension when loading into a register,
    not storing into memory.

    There was no convert-byte-to-word instruction as such.

  • From Scott Lurndal@21:1/5 to Keith Thompson on Tue Feb 6 00:16:56 2024
    Keith Thompson <[email protected]> writes:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 05 Feb 2024 18:22:59 +0100, Kees Nuyt wrote:
    On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:
    Signed characters make no sense.

    Nor did 6 bit characters, but in the 1980s we had them:
    3 characters in a 24 bit word.

    I see your sixbit and raise you Radix-50, which packed 3 characters into a
    16-bit word.

    None of these used signed character codes, by the way. So my point still
    stands.

    My understanding is that on the PDP-11, making plain char signed made
    code that stored character values in int objects more efficient.
    Sign-extension was more efficient than zero-filling or something like
    that. I don't remember the details, but I'm sure it wouldn't be
    difficult to find out.

    The PDP-11 had two move instructions:

    MOV (r1)+,r2
    MOVB (r2)+,r3

    MOV moved source to destination. MOVB always sign-extended the byte
    to the destination register size (16-bit).

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Feb 6 03:10:52 2024
    On Mon, 05 Feb 2024 18:31:36 -0800, Keith Thompson wrote:

    If the PDP-11 had had an alternative MOVB instruction that did zero-extension, we might not be having this discussion.

    Which is effectively what I said:

    On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

    > The signedness of plain char is implementation-defined.

    Why? Because the PDP-11 on which C and Unix were originally developed did
    sign extension when loading a byte quantity into a (word-length) register.

    Signed characters make no sense.

  • From Richard Kettlewell@21:1/5 to Keith Thompson on Tue Feb 6 17:00:25 2024
    Keith Thompson <[email protected]> writes:
    Lawrence D'Oliveiro <[email protected]d> writes:

    Signed characters make no sense.

    You wrote that "Signed characters make no sense". I was talking about a
    context in which they did make sense. How is that effectively what you
    said? (I was agreeing with and expanding on your statement about the
    PDP-11.)

    I still don’t see any explanation for signed characters as such making
    sense.

    I think the situation is more accurately interpreted as letting a PDP-11-specific optimization influence the language design, and
    (temporarily) getting away with it because the character values they
    cared about at the time happened to lie within a small enough range that negative values didn’t arise.

    --
    https://www.greenend.org.uk/rjk/

  • From Kaz Kylheku@21:1/5 to Rainer Weikusat on Tue Feb 6 18:04:16 2024
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    Richard Kettlewell <[email protected]d> writes:
    Keith Thompson <[email protected]> writes:
    Lawrence D'Oliveiro <[email protected]d> writes:

    Signed characters make no sense.

    You wrote that "Signed characters make no sense". I was talking about a
    context in which they did make sense. How is that effectively what you
    said? (I was agreeing with and expanding on your statement about the
    PDP-11.)

    I still don’t see any explanation for signed characters as such making
    sense.

    I think the situation is more accurately interpreted as letting a
    PDP-11-specific optimization influence the language design, and
    (temporarily) getting away with it because the character values they
    cared about at the time happened to lie within a small enough range that
    negative values didn’t arise.

    I think that's just a (probably traditional) misnomer. A C char isn't a character, it's an integer type and it's a signed integer type because
    all other original C integer types (int and short) were signed as
    well. Unsigned integer types, as something that's different from
    pointer, were a later addition.

    Sure, except for the part where "abcd" denotes an object that is a
    null-terminated array of these *char* integers, that entity being
    formally called a "string" in ISO C, and used for representing text.
    (Or else "abcd" is initializer syntax for a four element (or larger)
    array of *char*).

    If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative
    value, even though the constant has type *int*, and "\xff"[0] does
    likewise.

    This has been connected to needless bugs in C programs. An expression
    like table[str[i]] may result in table[] being negatively indexed.

    The <ctype.h> functions require an argument that is either EOF
    or a value in the range of 0 to UCHAR_MAX, and so are incompatible
    with string elements.

    All this crap could have been avoided if *char* had been unsigned.
    *unsigned char* never needed to exist except as a synonym for plain
    *char*.

    Speaking of synonyms, *char* is a distinct type, and not a synonym for either *signed char* or *unsigned char*. It has to be that way, given the way it is defined, but it's just another complication that need not have existed:

    #include <stdio.h>

    int main(void)
    {
      char *cp = 0;
      unsigned char *ucp = 0;
      signed char *scp = 0;
      printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
      printf("%d\n", '\xff');
    }

    char.c: In function ‘main’:
    char.c:8:27: warning: comparison of distinct pointer types lacks a cast
    printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
    ^~
    char.c:8:38: warning: comparison of distinct pointer types lacks a cast
    printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
    ^~
    char.c:8:50: warning: comparison of distinct pointer types lacks a cast
    printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @[email protected]

  • From Rainer Weikusat@21:1/5 to Richard Kettlewell on Tue Feb 6 17:35:01 2024
    Richard Kettlewell <[email protected]d> writes:
    Keith Thompson <[email protected]> writes:
    Lawrence D'Oliveiro <[email protected]d> writes:

    Signed characters make no sense.

    You wrote that "Signed characters make no sense". I was talking about a
    context in which they did make sense. How is that effectively what you
    said? (I was agreeing with and expanding on your statement about the
    PDP-11.)

    I still don’t see any explanation for signed characters as such making sense.

    I think the situation is more accurately interpreted as letting a PDP-11-specific optimization influence the language design, and
    (temporarily) getting away with it because the character values they
    cared about at the time happened to lie within a small enough range that negative values didn’t arise.

    I think that's just a (probably traditional) misnomer. A C char isn't a character, it's an integer type and it's a signed integer type because
    all other original C integer types (int and short) were signed as
    well. Unsigned integer types, as something that's different from
    pointer, were a later addition.

  • From Kaz Kylheku@21:1/5 to Rainer Weikusat on Tue Feb 6 18:38:06 2024
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    ¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
    traits (fallibility) and not language traits.

    Does that work for all safety devices? Isolation transformers, steel
    toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

    A fractured skull reveals a human trait (accident proneness, weak bone)
    rather than the workplace trait of not enforcing helmet use.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @[email protected]

  • From Rainer Weikusat@21:1/5 to Kaz Kylheku on Tue Feb 6 19:02:00 2024
    Kaz Kylheku <[email protected]> writes:
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    ¹ My personal theory of human fallibility is that humans tend to fuck up
    everything they possibly can. Hence, so-called C pitfalls expose human
    traits (fallibility) and not language traits.

    Does that work for all safety devices? Isolation transformers, steel
    toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

    I wrote about C types and somewhat more generally, programming language features, and not "safety devices" supposed to protect human bodies from physical injury.

  • From Rainer Weikusat@21:1/5 to Kaz Kylheku on Tue Feb 6 18:30:46 2024
    Kaz Kylheku <[email protected]> writes:
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    Richard Kettlewell <[email protected]d> writes:

    [...]

    I still don’t see any explanation for signed characters as such making
    sense.

    I think the situation is more accurately interpreted as letting a
    PDP-11-specific optimization influence the language design, and
    (temporarily) getting away with it because the character values they
    cared about at the time happened to lie within a small enough range that
    negative values didn’t arise.

    I think that's just a (probably traditional) misnomer. A C char isn't a
    character, it's an integer type and it's a signed integer type because
    all other original C integer types (int and short) were signed as
    well. Unsigned integer types, as something that's different from
    pointer, were a later addition.

    Sure, except for the part where "abcd" denotes an object that is a null-terminated array of these *char* integers, that entity being formally called a "string" in ISO C, and used for representing text. (Or else "abcd" is
    initializer syntax for a four element (or larger) array of *char*).

    If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
    even though the constant has type *int*, and "\xff"[0] does likewise.

    This has been connected to needless bugs in C programs. An expression like table[str[i]] may result in table[] being negatively indexed.

    The <ctype.h> functions require an argument that is either EOF
    or a value in the range of 0 to UCHAR_MAX, and so are incompatible
    with string elements.

    All this crap could have been avoided if *char* had been unsigned.
    *unsigned char* never needed to exist except as a synonym for plain
    *char*.

    All of this may be true¹ but it's all beside the point. The original C
    language had three integer types, char, short and int, which were all
    signed types. It further supported declaring pointers to some type and
    pointers were basically unsigned integer indices into a linear memory
    array. Char couldn't have been an unsigned integer type, regardless of
    whether this would have made more sense², because unsigned integer types
    didn't exist in the language.

    ¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
    traits (fallibility) and not language traits. Had they been avoided,
    human ingenuity would have found something else to fuck up.

    ² Being wise in hindsight is always easy. But that's not an option for
    people who need to create something which doesn't yet exist and not be
    wisely critical of something that does.

  • From Lew Pitcher@21:1/5 to Rainer Weikusat on Tue Feb 6 19:25:27 2024
    On Tue, 06 Feb 2024 18:30:46 +0000, Rainer Weikusat wrote:

    Kaz Kylheku <[email protected]> writes:
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    Richard Kettlewell <[email protected]d> writes:

    [...]

    I still don’t see any explanation for signed characters as such making
    sense.

    I think the situation is more accurately interpreted as letting a
    PDP-11-specific optimization influence the language design, and
    (temporarily) getting away with it because the character values they
    cared about at the time happened to lie within a small enough range that
    negative values didn’t arise.

    I think that's just a (probably traditional) misnomer. A C char isn't a
    character, it's an integer type and it's a signed integer type because
    all other original C integer types (int and short) were signed as
    well. Unsigned integer types, as something that's different from
    pointer, were a later addition.

    Sure, except for the part where "abcd" denotes an object that is a
    null-terminated array of these *char* integers, that entity being formally
    called a "string" in ISO C, and used for representing text. (Or else "abcd" is
    initializer syntax for a four element (or larger) array of *char*).

    If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
    even though the constant has type *int*, and "\xff"[0] does likewise.

    This has been connected to needless bugs in C programs. An expression like
    table[str[i]] may result in table[] being negatively indexed.

    The <ctype.h> functions require an argument that is either EOF
    or a value in the range of 0 to UCHAR_MAX, and so are incompatible
    with string elements.

    All this crap could have been avoided if *char* had been unsigned.
    *unsigned char* never needed to exist except as a synonym for plain
    *char*.

    All of this may be true¹ but it's all beside the point. The original C
    language had three integer types, char, short and int, which were all
    signed types.

    This view ignores the early implementation of (K&R) C on IBM 370 systems,
    where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric characters have their high bit set (alphabetics range from 0x80 through
    0xe9, while numerics range from 0xf0 through 0xf9). A char in this implementation, by necessity, was unsigned, as C "guarantees that any
    character in the machine's standard character set will never be negative"
    (K&R "The C Programming Language", p40)


    It further supported declaring pointers to some type and
    pointers were basically unsigned integer indices into a linear memory
    array. Char couldn't have been an unsigned integer type, regardless if
    this would have made more sense², because unsigned integer types didn't exist in the language.

    ¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
    traits (fallibility) and not language traits. Had they been avoided,
    human ingenuity would have found something else to fuck up.

    ² Being wise in hindsight is always easy. But that's not an option for people who need to create something which doesn't yet exist and not be
    wisely critical of something that does.




    --
    Lew Pitcher
    "In Skills We Trust"

  • From Rainer Weikusat@21:1/5 to Lew Pitcher on Tue Feb 6 20:01:43 2024
    Lew Pitcher <[email protected]> writes:
    On Tue, 06 Feb 2024 18:30:46 +0000, Rainer Weikusat wrote:

    [Why-oh-why is char not unsigned?!?]


    All of this may be true¹ but it's all beside the point. The original C
    language had three integer types, char, short and int, which were all
    signed types.

    This view ignores the early implementation of (K&R) C on IBM 370 systems, where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric characters have their high bit set (alphabetics range from 0x80 through
    0xe9, while numerics range from 0xf0 through 0xf9).

    Indeed. It refers to the C language as it existed (was created) when UNIX
    was brought over to the PDP-11. This language didn't have any unsigned
    integer types as the concept didn't yet exist.

  • From Kaz Kylheku@21:1/5 to Rainer Weikusat on Tue Feb 6 21:22:57 2024
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    Kaz Kylheku <[email protected]> writes:
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    ¹ My personal theory of human fallibility is that humans tend to fuck up
    everything they possibly can. Hence, so-called C pitfalls expose human
    traits (fallibility) and not language traits.

    Does that work for all safety devices? Isolation transformers, steel
    toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

    I wrote about C types and somewhat more generally, programming language features, and not "safety devices" supposed to protect human bodies from physical injury.

    Type systems are safety devices. That's why we have terms like "type
    safe" and "unsafe code".

    Type safety helps prevent misbehavior, which results in problems like
    incorrect results and data loss, which can have real economic harm.

    In a safety-critical embedded system, a connection between type safety
    and physical safety is readily identifiable.

    "Type safety" is not just some fanciful metaphor like "debugging";
    there is a literal interpretation which is true.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @[email protected]

  • From Rainer Weikusat@21:1/5 to Kaz Kylheku on Tue Feb 6 21:37:50 2024
    Kaz Kylheku <[email protected]> writes:
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    Kaz Kylheku <[email protected]> writes:
    On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
    ¹ My personal theory of human fallibility is that humans tend to fuck up
    everything they possibly can. Hence, so-called C pitfalls expose human
    traits (fallibility) and not language traits.

    Does that work for all safety devices? Isolation transformers, steel
    toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

    I wrote about C types and somewhat more generally, programming language
    features, and not "safety devices" supposed to protect human bodies from
    physical injury.

    Type systems are safety devices. That's why we have terms like "type
    safe" and "unsafe code".

    They're not, at least not when safety device is supposed to mean
    something like hard hats. That's just an inappropriate analogy some
    people like to employ. This is, however, completely beside the point of
    my original text, which was about providing an explanation of why char is
    signed in C, even though all kinds of smart alecks, with fifty years of
    hindsight Ritchie didn't have in 1972, are extremely convinced that this
    was an extremely bad idea.

  • From Andreas Kempe@21:1/5 to All on Tue Feb 6 23:13:21 2024
    On 2024-02-06, Keith Thompson <[email protected]> wrote:
    Richard Kettlewell <[email protected]d> writes:
    Keith Thompson <[email protected]> writes:
    Lawrence D'Oliveiro <[email protected]d> writes:
    Signed characters make no sense.

    You wrote that "Signed characters make no sense". I was talking about a
    context in which they did make sense. How is that effectively what you
    said? (I was agreeing with and expanding on your statement about the
    PDP-11.)

    I still don’t see any explanation for signed characters as such making
    sense.

    I think the situation is more accurately interpreted as letting a
    PDP-11-specific optimization influence the language design, and
    (temporarily) getting away with it because the character values they
    cared about at the time happened to lie within a small enough range that
    negative values didn’t arise.

    I think we're mostly in agreement, perhaps with different understandings
    of "making sense". What I'm saying is that the decision to make char a signed type made sense for PDP-11 implementation, purely because of performance issues.

    I just did a quick test on x86_64, x86, and ARM. It appears that
    assigning either an unsigned char or a signed char to an int object
    takes a single instruction. (My test didn't distinguish between
    register or memory target.) I suspect there's no longer any performance justification on most modern platforms for making plain char signed.
    But there's likely to be (bad or at least non-portable) code that depends
    on plain char being signed. As it happens, plain char is unsigned in
    gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
    to override the default.


    I wouldn't expect any difference on a modern CPU. I did a microbench
    on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
    and movsbl to move char to int so that's what I benched.

    The bench was done by moving a byte from the stack to eax using a loop
    of 10 movzbl/movsbl running 10M times. Both instructions gave on
    average about 0.7 cycles per instruction measured using rdtsc. The
    highest bit in the byte being set or unset made no difference.

  • From Scott Lurndal@21:1/5 to Andreas Kempe on Tue Feb 6 23:27:23 2024
    Andreas Kempe <[email protected]> writes:
    On 2024-02-06, Keith Thompson <[email protected]> wrote:
    Richard Kettlewell <[email protected]d> writes:


    I think we're mostly in agreement, perhaps with different understandings
    of "making sense". What I'm saying is that the decision to make char a
    signed type made sense for PDP-11 implementation, purely because of
    performance issues.

    I just did a quick test on x86_64, x86, and ARM. It appears that
    assigning either an unsigned char or a signed char to an int object
    takes a single instruction. (My test didn't distinguish between
    register or memory target.) I suspect there's no longer any performance
    justification on most modern platforms for making plain char signed.
    But there's likely to be (bad or at least non-portable) code that depends
    on plain char being signed. As it happens, plain char is unsigned in
    gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
    to override the default.


    I wouldn't expect any difference on a modern CPU. I did a microbench
    on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
    and movsbl to move char to int so that's what I benched.

    A move from register to register isn't even executed on most modern
    processor designs. It is detected at fetch and the register is
    just renamed in the pipeline.

  • From Scott Lurndal@21:1/5 to Andreas Kempe on Wed Feb 7 00:46:17 2024
    Andreas Kempe <[email protected]> writes:
    On 2024-02-06, Scott Lurndal <[email protected]> wrote:
    Andreas Kempe <[email protected]> writes:
    On 2024-02-06, Keith Thompson <[email protected]> wrote:
    Richard Kettlewell <[email protected]d> writes:


    I think we're mostly in agreement, perhaps with different understandings
    of "making sense". What I'm saying is that the decision to make char a
    signed type made sense for PDP-11 implementation, purely because of
    performance issues.

    I just did a quick test on x86_64, x86, and ARM. It appears that
    assigning either an unsigned char or a signed char to an int object
    takes a single instruction. (My test didn't distinguish between
    register or memory target.) I suspect there's no longer any performance
    justification on most modern platforms for making plain char signed.
    But there's likely to be (bad or at least non-portable) code that depends
    on plain char being signed. As it happens, plain char is unsigned in
    gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
    to override the default.


    I wouldn't expect any difference on a modern CPU. I did a microbench
    on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
    and movsbl to move char to int so that's what I benched.

    A move from register to register isn't even executed on most modern
    processor designs. It is detected at fetch and the register is
    just renamed in the pipeline.


    Yeah. I tried some different variations and by adding some data
    dependencies by incrementing the value and moving it around, I could
    get some difference between the two, approx 10 to 30 %, but I'm not
    sure how much is due to the instruction itself or other effects of
    manipulating the data.


    The logic for sign extension (MOVSX) isn't complex, the added gate delay wouldn't affect the instruction timing. Fan the sign bit out
    to the higher bits through a couple of gates to either select the
    sign bit or the high order bits when storing into the new register.

    Sign extension on load (MOV from memory) will happen in the load unit before
    it hits the register file, most likely.

    The x86 MOVBE instruction is a slightly more complex example.


    Funnily enough, the zero extend was the more performant in these tests
    making unsigned char possibly more performant.

    Within what margin of measurement error?


    My intention wasn't really to claim they're exactly the same, but that
    I don't think there is any real performance benefit to be had by
    switching char to unsigned. Even if the 10-30 % are a real thing, I
    wonder how much software is actually using char types in a way where
    it would make a difference?

    We use uint8_t extensively because the data is unsigned in the range 0-255.

    And generally want wrapping behavior modulo 2^8.

  • From Andreas Kempe@21:1/5 to All on Wed Feb 7 00:26:08 2024
    On 2024-02-06, Scott Lurndal <[email protected]> wrote:
    Andreas Kempe <[email protected]> writes:
    On 2024-02-06, Keith Thompson <[email protected]> wrote:
    Richard Kettlewell <[email protected]d> writes:


    I think we're mostly in agreement, perhaps with different understandings
    of "making sense". What I'm saying is that the decision to make char a
    signed type made sense for PDP-11 implementation, purely because of
    performance issues.

    I just did a quick test on x86_64, x86, and ARM. It appears that
    assigning either an unsigned char or a signed char to an int object
    takes a single instruction. (My test didn't distinguish between
    register or memory target.) I suspect there's no longer any performance
    justification on most modern platforms for making plain char signed.
    But there's likely to be (bad or at least non-portable) code that depends
    on plain char being signed. As it happens, plain char is unsigned in
    gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
    to override the default.


    I wouldn't expect any difference on a modern CPU. I did a microbench
    on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
    and movsbl to move char to int so that's what I benched.

    A move from register to register isn't even executed on most modern
    processor designs. It is detected at fetch and the register is
    just renamed in the pipeline.


    Yeah. I tried some different variations and by adding some data
    dependencies by incrementing the value and moving it around, I could
    get some difference between the two, approx 10 to 30 %, but I'm not
    sure how much is due to the instruction itself or other effects of
    manipulating the data.

    Funnily enough, the zero extend was the more performant in these tests
    making unsigned char possibly more performant.

    My intention wasn't really to claim they're exactly the same, but that
    I don't think there is any real performance benefit to be had by
    switching char to unsigned. Even if the 10-30 % are a real thing, I
    wonder how much software is actually using char types in a way where
    it would make a difference?

  • From Andreas Kempe@21:1/5 to All on Wed Feb 7 02:11:26 2024
    On 2024-02-07, Scott Lurndal <[email protected]> wrote:
    Andreas Kempe <[email protected]> writes:
    On 2024-02-06, Scott Lurndal <[email protected]> wrote:
    Andreas Kempe <[email protected]> writes:

    I wouldn't expect any difference on a modern CPU. I did a microbench
    on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
    and movsbl to move char to int so that's what I benched.

    A move from register to register isn't even executed on most modern
    processor designs. It is detected at fetch and the register is
    just renamed in the pipeline.


    Yeah. I tried some different variations and by adding some data
    dependencies by incrementing the value and moving it around, I could
    get some difference between the two, approx 10 to 30 %, but I'm not
    sure how much is due to the instruction itself or other effects of
    manipulating the data.


    The logic for sign extension (MOVSX) isn't complex, the added gate delay wouldn't affect the instruction timing. Fan the sign bit out
    to the higher bits through a couple of gates to either select the
    sign bit or the high order bits when storing into the new register.

    Sign extension on load (MOV from memory) will happen in the load unit before it hits the register file, most likely.

    The x86 MOVBE instruction is a slightly more complex example.


    Funnily enough, the zero extend was the more performant in these tests
    making unsigned char possibly more performant.

    Within what margin of measurement error?


    Here's an example of a test I played around with. The body of my loop
    does this 10M times for this test. movzbl is switched for movsbl when
    testing the other configuration.

    movzbl -24(%rsp), %eax
    movb %al, -25(%rsp)
    movzbl -25(%rsp), %eax
    movb %al, -26(%rsp)
    movzbl -26(%rsp), %eax
    movb %al, -27(%rsp)
    movzbl -27(%rsp), %eax
    movb %al, -28(%rsp)
    movzbl -28(%rsp), %eax
    incl %eax
    movb %al, -24(%rsp)

    This is the data, unit is total cycles for a run, from 2000 runs of
    10M each for the two different instructions:

    movzbl:
    mean = 1.24E+08
    variance = 3.95E+12

    movsbl:
    mean = 1.38E+08
    variance = 3.44E+12

    ratio movsbl/movzbl = 1.11

    Performing a two-tailed Student's t-test gives

    p-value: 0.00E+00

    Something is causing these two test runs to give different performance
    results. I will not pretend I know enough about the inner workings of
    Intel's magic box to explain why.


    My intention wasn't really to claim they're exactly the same, but that
    I don't think there is any real performance benefit to be had by
    switching char to unsigned. Even if the 10-30 % are a real thing, I
    wonder how much software is actually using char types in a way where
    it would make a difference?

    We use uint8_t extensively because the data is unsigned in the range 0-255.

    And generally want wrapping behavior modulo 2^8.

    Sure, but if you are using uint8_t, you have sidestepped the whole
    issue of char being signed or unsigned so a change wouldn't really
    affect you.

  • From Richard Kettlewell@21:1/5 to Keith Thompson on Wed Feb 7 10:29:58 2024
    Keith Thompson <[email protected]> writes:

    Richard Kettlewell <[email protected]d> writes:
    Keith Thompson <[email protected]> writes:
    Lawrence D'Oliveiro <[email protected]d> writes:
    Signed characters make no sense.

    You wrote that "Signed characters make no sense". I was talking about a
    context in which they did make sense. How is that effectively what you
    said? (I was agreeing with and expanding on your statement about the
    PDP-11.)

    I still don’t see any explanation for signed characters as such making
    sense.

    I think the situation is more accurately interpreted as letting a
    PDP-11-specific optimization influence the language design, and
    (temporarily) getting away with it because the character values they
    cared about at the time happened to lie within a small enough range that
    negative values didn’t arise.

    I think we're mostly in agreement, perhaps with different understandings
    of "making sense". What I'm saying is that the decision to make char a signed type made sense for PDP-11 implementation, purely because of performance issues.

    Having a basic 8-bit integer type be a signed type makes sense (in
    context) for performance reasons and perhaps for usability reasons too.

    But that’s really not the same as “signed characters make sense”. For
    signed characters to make sense there has to be an encoding where some
    signs (or control codes, etc) are encoded to negative values. I’ve never
    heard of one.

    “char” isn’t just a random string of symbols. It’s obvious both from the
    name and the way it’s used in the language that it’s intended to
    represent characters, not just small integer values. If the purpose was
    purely the latter it would have been called ‘short short int’ or
    something like that.

    I just did a quick test on x86_64, x86, and ARM. It appears that
    assigning either an unsigned char or a signed char to an int object
    takes a single instruction. (My test didn't distinguish between
    register or memory target.) I suspect there's no longer any performance justification on most modern platforms for making plain char signed.
    But there's likely to be (bad or at least non-portable) code that depends
    on plain char being signed. As it happens, plain char is unsigned in
    gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
    to override the default.

    i.e. we’re still suffering the locked-in side-effects of an ancient
    decision even though the original justification has become irrelevant.
    It might or might not have been a reasonable trade-off at the time, disregarding what were then hypotheticals about the future, but (indeed
    with hindsight) I think from today’s point of view it was clearly the
    wrong decision.

    --
    https://www.greenend.org.uk/rjk/

  • From Scott Lurndal@21:1/5 to Andreas Kempe on Wed Feb 7 15:22:04 2024
    Andreas Kempe <[email protected]> writes:
    On 2024-02-07, Scott Lurndal <[email protected]> wrote:
    Andreas Kempe <[email protected]> writes:

    Funnily enough, the zero extend was the more performant in these tests
    making unsigned char possibly more performant.

    Within what margin of measurement error?


    Here's an example of a test I played around with. The body of my loop
    does this 10M times for this test. movzbl is switched for movsbl when
    testing the other configuration.

    movzbl -24(%rsp), %eax
    movb %al, -25(%rsp)
    movzbl -25(%rsp), %eax
    movb %al, -26(%rsp)
    movzbl -26(%rsp), %eax
    movb %al, -27(%rsp)
    movzbl -27(%rsp), %eax
    movb %al, -28(%rsp)
    movzbl -28(%rsp), %eax
    incl %eax
    movb %al, -24(%rsp)

    This is the data, unit is total cycles for a run, from 2000 runs of
    10M each for the two different instructions:

    movzbl:
    mean = 1.24E+08
    variance = 3.95E+12

    movsbl:
    mean = 1.38E+08
    variance = 3.44E+12

    ratio movsbl/movzbl = 1.11

    Very interesting. I don't know why that is.


    We use uint8_t extensively because the data is unsigned in the range 0-255.
    And generally want wrapping behavior modulo 2^8.

    Sure, but if you are using uint8_t, you have sidestepped the whole
    issues of char being signed or unsigned so a change wouldn't really
    affect you.

    While most C compilers have a compile-time option to select the signed-ness of char, using uint8_t sidesteps the issue completely.
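
    As a minimal sketch of that point, the following C shows uint8_t
    arithmetic wrapping modulo 2^8 and the same byte 0xFF widening to int
    through three different types. The function names are made up for
    illustration, and the plain-char case assumes the usual wrap-around
    conversion that gcc and clang document for out-of-range values.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Returns 1 if plain char is signed in this implementation, 0 otherwise. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Adds two byte values with wraparound modulo 2^8.
   The operands are promoted to int; the cast reduces the sum mod 256. */
uint8_t add_mod_256(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Widens the byte 0xFF to int via uint8_t: always 255. */
int widen_via_uint8(void)
{
    uint8_t v = 0xFF;
    return v;
}

/* Widens via signed char: always -1 (sign extension). */
int widen_via_signed_char(void)
{
    signed char v = -1;
    return v;
}

/* Widens via plain char: 255 where char is unsigned; -1 where it is
   signed (assumption: the out-of-range conversion wraps, as on gcc/clang). */
int widen_via_plain_char(void)
{
    char v = (char)0xFF;
    return v;
}
```

    Building the same file with -fsigned-char and then -funsigned-char
    should change only what widen_via_plain_char() returns; the uint8_t
    and signed char versions are unaffected.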

  • From Rainer Weikusat@21:1/5 to Richard Kettlewell on Wed Feb 7 15:30:23 2024
    Richard Kettlewell <[email protected]d> writes:
    Keith Thompson <[email protected]> writes:

    [...]

    I think we're mostly in agreement, perhaps with different understandings
    of "making sense". What I'm saying is that the decision to make char a
    signed type made sense for PDP-11 implementation, purely because of
    performance issues.

    Having a basic 8-bit integer type be signed type makes sense (in
    context) for performance reasons and perhaps for usability reasons too.

    But that’s really not the same as “signed characters make sense”. For
    signed characters to make sense there has to be an encoding where some
    signs (or control codes, etc) are encoded to negative values. I’ve never
    heard of one.

    “char” isn’t just a random string of symbols. It’s obvious both from the
    name and the way it’s used in the language that it’s intended to
    represent characters, not just small integer values.

    Computers have absolutely no idea of "characters". They handle numbers,
    integer numbers in this case, and humans then interpret them as
    characters based on some convention for encoding characters as
    integers. Hence, a data type suitable for holding an encoded character
    (ie, an integer value from 0 - 127 for the case in question) is not the
    same as a character.

  • From Richard Kettlewell@21:1/5 to Rainer Weikusat on Wed Feb 7 20:20:12 2024
    Rainer Weikusat <[email protected]> writes:
    Richard Kettlewell <[email protected]d> writes:
    “char” isn’t just a random string of symbols. It’s obvious both from the
    name and the way it’s used in the language that it’s intended to
    represent characters, not just small integer values.

    Computers have absolutely no idea of "characters". They handle numbers,
    integer numbers in this case, and humans then interpret them as
    characters based on some convention for encoding characters as
    integers. Hence, a data type suitable for holding an encoded character
    (ie, an integer value from 0 - 127 for the case in question) is not the
    same as a character.

    Language designers do, however, have an idea of “characters”.

    --
    https://www.greenend.org.uk/rjk/

  • From Lawrence D'Oliveiro@21:1/5 to Richard Kettlewell on Wed Feb 7 20:58:01 2024
    On Wed, 07 Feb 2024 20:20:12 +0000, Richard Kettlewell wrote:

    Language designers do, however, have an idea of “characters”.

    Unicode uses the terms “grapheme” and “text element”. Actually it also
    uses “character”, but it seems less clear on what that means. It is not
    the same as a “code point” or “glyph”.

    <https://www.unicode.org/faq/char_combmark.html>

  • From Richard Kettlewell@21:1/5 to Lawrence D'Oliveiro on Thu Feb 8 11:21:56 2024
    Lawrence D'Oliveiro <[email protected]d> writes:

    On Wed, 07 Feb 2024 20:20:12 +0000, Richard Kettlewell wrote:

    Language designers do, however, have an idea of “characters”.

    Unicode uses the terms “grapheme” and “text element”. Actually it also
    uses “character”, but it seems less clear on what that means. It is not
    the same as a “code point” or “glyph”.

    <https://www.unicode.org/faq/char_combmark.html>

    Sure, but this was all happening in the 1970s, long before Unicode
    existed.

    K&R1 explicitly says char is “capable of holding one character in the
    local character set” (and mentions EBCDIC as a concrete example on the
    same page - the problem must have been obvious already).

    --
    https://www.greenend.org.uk/rjk/

  • From Rainer Weikusat@21:1/5 to Richard Kettlewell on Thu Feb 8 16:34:10 2024
    Richard Kettlewell <[email protected]d> writes:
    Rainer Weikusat <[email protected]> writes:
    Richard Kettlewell <[email protected]d> writes:
    “char” isn’t just a random string of symbols. It’s obvious both from the
    name and the way it’s used in the language that it’s intended to
    represent characters, not just small integer values.

    Computers have absolutely no idea of "characters". They handle numbers,
    integer numbers in this case, and humans then interpret them as
    characters based on some convention for encoding characters as
    integers. Hence, a data type suitable for holding an encoded character
    (ie, an integer value from 0 - 127 for the case in question) is not the
    same as a character.

    Language designers do, however, have an idea of “characters”.

    I don't quite understand what that's supposed to communicate. Insofar as
    the machine is concerned, a character is nothing but an integer, and a
    data type sufficient to hold a character is thus necessarily an integer
    type of some size. In a language without unsigned integer types, it'll
    necessarily also be a signed integer type.

  • From Rainer Weikusat@21:1/5 to Keith Thompson on Thu Feb 8 17:46:20 2024
    Keith Thompson <[email protected]> writes:
    Rainer Weikusat <[email protected]> writes:
    Richard Kettlewell <[email protected]d> writes:
    Rainer Weikusat <[email protected]> writes:
    Richard Kettlewell <[email protected]d> writes:
    “char” isn’t just a random string of symbols. It’s obvious both from the
    name and the way it’s used in the language that it’s intended to
    represent characters, not just small integer values.

    Computers have absolutely no idea of "characters". They handle numbers,
    integer numbers in this case, and humans then interpret them as
    characters based on some convention for encoding characters as
    integers. Hence, a data type suitable for holding an encoded character
    (ie, an integer value from 0 - 127 for the case in question) is not the
    same as a character.

    Language designers do, however, have an idea of “characters”.

    I don't quite understand what that's supposed to communicate. Insofar as
    the machine is concerned, a character is nothing but an integer, and a
    data type sufficient to hold a character is thus necessarily an integer
    type of some size. In a language without unsigned integer types, it'll
    necessarily also be a signed integer type.

    Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
    char was effectively unsigned in some implementations, in that
    converting a char value to int would zero-fill the result rather than
    doing sign-extension.

    According to Ritchie's "The Development of the C Language"

    ,----
    | During 1973-1980, the language grew a bit: the type structure gained
    | unsigned
    |
    | [...]
    |
    | the similarity of the arithmetic properties of character pointers and
    | unsigned integers made it hard to resist the temptation to identify
    | them. The unsigned types were added to make unsigned arithmetic
    | available without confusing it with pointer manipulation. Similarly, the
    | early language condoned assignments between integers and pointers
    `----

  • From Kaz Kylheku@21:1/5 to Rainer Weikusat on Thu Feb 8 19:54:12 2024
    On 2024-02-08, Rainer Weikusat <[email protected]> wrote:
    According to Ritchie's "The Development of the C Language"

    ,----
    | During 1973-1980, the language grew a bit: the type structure gained
    | unsigned
    |
    | [...]
    |
    | the similarity of the arithmetic properties of character pointers and
    | unsigned integers made it hard to resist the temptation to identify
    | them. The unsigned types were added to make unsigned arithmetic
    | available without confusing it with pointer manipulation. Similarly, the
    | early language condoned assignments between integers and pointers
    `----

    It seems like a very odd rationale, given how things played out.

    The difference between two pointers ended up signed (ptrdiff_t).
    So pointer arithmetic doesn't work exactly like unsigned. That's mostly
    a good thing, except that pointers farther from each other than half the
    address space cannot be subtracted. (ISO C mostly takes that away anyway
    since pointers to different objects may only be compared for exact
    equality, and cannot be subtracted. If no object is half the address
    space or larger, subtraction overflow will never occur.)

    Moreover, unsigned ended up necessary for representing a simple byte
    in a nice way.

    Not only that, but unsigned types are useful for bit manipulation,
    without running into nonportable behaviors around shifting into and out
    of the sign bit.

    If you have a 32 bit int and want a 32 bit field, you want unsigned int.

    Very odd to see the existence of unsigned math justified in terms of
    some story about pointers.

    It seems Ritchie really didn't think much about portability; he
    probably thought it was fine to do 1 << 15 with a 16 bit signed int
    to calculate a mask for the highest bit, since that happened to work in
    the systems he designed. If someone wanted C on their weird machine
    where that misbehaves, or produces an alternative zero that compares
    equal to regular zero, that was their problem.
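
    A rough sketch of the unsigned alternative, for a 32 bit field rather
    than the 16 bit int mentioned above (the helper names are hypothetical,
    chosen only for this example): building masks from unsigned constants
    never touches a sign bit, so the result is the same on every conforming
    implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for the highest bit of a 32-bit field. Writing 1 << 31 with a
   32-bit signed int would shift into the sign bit, which ISO C leaves
   undefined; the unsigned constant avoids that. */
uint32_t top_bit_mask32(void)
{
    return UINT32_C(1) << 31;
}

/* Extracts `width` bits starting at bit `pos` from a 32-bit field. */
uint32_t extract_bits(uint32_t word, unsigned pos, unsigned width)
{
    assert(width >= 1 && width <= 32 && pos <= 32 - width);
    /* A shift by the full width of the type is undefined, so the
       32-bit-wide case is handled separately. */
    uint32_t mask = (width == 32) ? UINT32_MAX
                                  : ((UINT32_C(1) << width) - 1u);
    return (word >> pos) & mask;
}
```

    For instance, extract_bits(0xDEADBEEF, 8, 8) picks out the second
    lowest byte, 0xBE.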

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @[email protected]

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Thu Feb 8 21:57:57 2024
    On Thu, 08 Feb 2024 10:23:29 -0800, Keith Thompson wrote:

    The sentence "Whether or not sign-extension occurs for characters is
    machine dependent" might be written in more modern terms as "The
    signedness of char is implementation-defined".

    signed char and unsigned char (and unsigned short and unsigned long)
    were added in ANSI C 1989, possibly earlier.

    Here’s an odd thing: what happens when you shift a signed int? K&R allows
    left-shift with the obvious meaning, and says that, for right-shift,
    whether the top bits are zero-filled or sign-extended is
    implementation-defined; newer C specs say that left-shifting a negative
    value is simply “undefined”, and right-shifting a negative value is
    “implementation-defined”.
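
    A small sketch of what that means in practice, with hypothetical helper
    names: one function probes which choice the current implementation
    makes for right-shifting a negative int, and the other two get a
    defined result either way by only ever shifting non-negative or
    unsigned values.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if >> on a negative int sign-extends on this implementation
   (arithmetic shift), 0 if it zero-fills. ISO C leaves this
   implementation-defined. */
int right_shift_sign_extends(void)
{
    int minus_one = -1;
    return (minus_one >> 1) < 0;
}

/* Arithmetic right shift that does not depend on that choice: for a
   negative x, shift the (non-negative) complement and complement back.
   The result is x divided by 2^n, rounded toward minus infinity. */
int32_t asr32(int32_t x, unsigned n)
{
    assert(n < 32);
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Logical right shift: do the shift in unsigned arithmetic, so the
   vacated top bits are always zero. */
uint32_t lsr32(int32_t x, unsigned n)
{
    assert(n < 32);
    return (uint32_t)x >> n;
}
```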

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Feb 8 22:30:52 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 08 Feb 2024 10:23:29 -0800, Keith Thompson wrote:

    The sentence "Whether or not sign-extension occurs for characters is
    machine dependent" might be written in more modern terms as "The
    signedness of char is implementation-defined".

    signed char and unsigned char (and unsigned short and unsigned long)
    were added in ANSI C 1989, possibly earlier.

    Here’s an odd thing: what happens when you shift a signed int? K&R allows
    left-shift with the obvious meaning, and says that, for right-shift,
    whether the top bits are zero-filled or sign-extended is
    implementation-defined; newer C specs say that left-shifting a negative
    value is simply “undefined”, and right-shifting a negative value is
    “implementation-defined”.

    There were extant hardware implementations exhibiting both behaviors. So they
    made the behavior implementation-defined in the compiler.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Thu Feb 8 23:26:55 2024
    On Thu, 08 Feb 2024 22:30:52 GMT, Scott Lurndal wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    Here’s an odd thing: what happens when you shift a signed int? K&R allows
    left-shift with the obvious meaning, and says that, for right-shift,
    whether the top bits are zero-filled or sign-extended is
    implementation-defined; newer C specs say that left-shifting a negative
    value is simply “undefined”, and right-shifting a negative value is
    “implementation-defined”.

    There were extant hardware implementations exhibiting both behaviors.

    Except the current spec doesn’t mention a choice between two behaviours.
