Forum: >>> Magnum BBS <<<

Why 8 bit exit status codes?

From Andreas Kempe@21:1/5 to All on Fri Feb 2 16:05:14 2024

Hello everyone,

I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().

This did bite some colleagues at work at one point who were porting a
modem manager from a real-time OS to Linux because they were returning
negative status codes for errors. We fixed it by changing the status
codes and I never really thought about why this is the state of
things... until now!

Having a look at man 3 exit on my FreeBSD system, it states

Both functions make the low-order eight bits of the status argument
available to a parent process which has called a wait(2)-family
function.

and that it is conforming to the C99 standard

The exit() and _Exit() functions conform to ISO/IEC 9899:1999 (“ISO C99”).

C99 7.20.4.3 § 5 states

Finally, control is returned to the host environment. If the value of
status is zero or EXIT_SUCCESS, an implementation-defined form of the
status successful termination is returned. If the value of status is EXIT_FAILURE, an implementation-defined form of the status
unsuccessful termination is returned. Otherwise the status returned
is implementation-defined.

which I read as the C standard leaving it to the implementation to
decide how to handle the int type argument.

Having a look at man 2 _exit, the system call man page, it says
nothing about the lower 8 bits, but claims conformance with
IEEE Std 1003.1-1990 ("POSIX.1") which says
in Part 1: System Application Program Interface (API) [C Language], 3.2.2.2 § 2

If the parent process of the calling process is executing a wait() or waitpid(), it is notified of the termination of the calling process
and the low order 8 bits of status are made available to it; see
3.2.1.

that only puts a requirement on making the lower 8 bits available.
Looking at a more modern POSIX, IEEE Std 1003.1-2017, that has
waitid() defined, it has the following for _exit()

The value of status may be 0, EXIT_SUCCESS, EXIT_FAILURE, or any
other value, though only the least significant 8 bits (that is,
status & 0377) shall be available from wait() and waitpid(); the
full value shall be available from waitid() and in the siginfo_t
passed to a signal handler for SIGCHLD.

so the mystery of why the implementation is the way it is was
dispelled.
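
The truncation described above can be observed directly. The following is a minimal sketch, assuming a POSIX system; the helper names (spawn_exiting_child, status_seen_by_waitpid, status_seen_by_waitid) are made up for illustration. Note that although POSIX.1-2017 says waitid() shall expose the full value, some kernels (Linux among them, as far as I know) still keep only the low 8 bits, so the sketch does not rely on what si_status holds beyond those bits.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that immediately calls _exit(code).
 * Returns the child's pid, or -1 if fork() failed. */
static pid_t spawn_exiting_child(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);    /* child: terminate right away */
    return pid;
}

/* What the parent sees through waitpid(): the low 8 bits only.
 * Returns -1 on error. */
int status_seen_by_waitpid(int code)
{
    int status;
    pid_t pid = spawn_exiting_child(code);
    if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* What the parent sees through waitid(): stores si_status in *out.
 * Whether this is the full int or a truncated value depends on the
 * kernel. Returns 0 on success, -1 on error. */
int status_seen_by_waitid(int code, int *out)
{
    siginfo_t info;
    pid_t pid;
    assert(out != NULL);
    pid = spawn_exiting_child(code);
    if (pid < 0)
        return -1;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0 ||
        info.si_code != CLD_EXITED)
        return -1;
    *out = info.si_status;
    return 0;
}
```

Calling status_seen_by_waitpid(-1) yields 255, which is the behaviour that surprised the modem manager port; comparing it with the value stored by status_seen_by_waitid() shows whether a given system preserves the upper bits.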

The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?

I don't know if I have ever come into contact with software that deals
with status codes that actually looks at the full value. My daily
driver shell, fish, certainly does not.

--
Best regards,
Andreas Kempe

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Andreas Kempe on Fri Feb 2 16:33:40 2024

Andreas Kempe <[email protected]> writes:

Hello everyone,

I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().

<snip>

The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?

The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer and wait needed to be able
to include metadata along with the exit status.


From Andreas Kempe@21:1/5 to All on Fri Feb 2 20:02:16 2024

On 2024-02-02, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

Hello everyone,

I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().

<snip>

The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?

The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer

I'm afraid that's a tall order. I had yet to learn how to read when
they went out of production. :) Please excuse my ignorance.

and wait needed to be able to include metadata along with the exit
status.

I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask out the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?


From Lawrence D'Oliveiro@21:1/5 to Andreas Kempe on Fri Feb 2 21:13:41 2024

On Fri, 2 Feb 2024 16:05:14 -0000 (UTC), Andreas Kempe wrote:

I'm wondering why, at least on Linux and FreeBSD, a process exit status
was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().

I’ve never used that many different values. E.g. 0 for some test condition passed, 1 for failed, 2 for unexpected error.

This did bite some colleagues at work at one point who were porting a
modem manager from a real-time OS to Linux because they were returning negative status codes for errors.

True enough:

ldo@theon:~> python3 -c "import sys; sys.exit(1)"; echo $?
1
ldo@theon:~> python3 -c "import sys; sys.exit(-1)"; echo $?
255

But you could always sign-extend it.
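
A minimal sketch of that sign extension in C, assuming the parent has already extracted the 0..255 value (for example with WEXITSTATUS). The function name sign_extend8 is made up for illustration. The XOR-and-subtract form avoids relying on implementation-defined conversion to a signed 8-bit type.

```c
#include <assert.h>

/* Map the 0..255 value reported by wait() back to -128..127.
 * Flipping bit 7 and subtracting 128 is the usual portable way to
 * sign-extend an 8-bit quantity held in a wider int. */
int sign_extend8(int status)
{
    assert(status >= 0 && status <= 255);
    return (status ^ 0x80) - 0x80;
}
```

With this, the 255 printed above comes back as -1. The catch is that it only works if both sides agree on the convention: values 128..255 become negative, and many shells also use 128 plus the signal number in $? to report a child killed by a signal.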


From Scott Lurndal@21:1/5 to Andreas Kempe on Fri Feb 2 20:15:24 2024

Andreas Kempe <[email protected]> writes:

On 2024-02-02, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

Hello everyone,

I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().

<snip>

The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?

The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer

I'm afraid that's a tall order. I had yet to learn how to read when
they went out of production. :) Please excuse my ignorance.

and wait needed to be able to include metadata along with the exit
status.

I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask out the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?

The status argument to the wait system call returned
a two part value; 8 bits of exit status and 8 bits
that describe the termination conditions (e.g. the
signal number that stopped or terminated the
process).

Here's the modern 32-bit layout (in little endian form):

unsigned int __w_termsig:7; /* Terminating signal. */
unsigned int __w_coredump:1; /* Set if dumped core. */
unsigned int __w_retcode:8; /* Return code if exited normally. */
unsigned int:16;

It's just the PDP-11 unix 16-bit version with 16 unused padding bits.

SVR4 added the waitid(2) system call which via the siginfo argument has
access to the full 32-bit program exit status.
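
The two-part value can be taken apart either with the POSIX macros or by hand. The following is a sketch only: the struct and function names are hypothetical, and the manual decoder assumes the traditional layout quoted above (bits 0-6 signal, bit 7 core flag, bits 8-15 return code), which POSIX itself does not promise. Portable code should stick to the macros.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
struct decoded_status {
    int exited;   /* nonzero if the child terminated via exit()/_exit() */
    int retcode;  /* low 8 bits of the exit code; meaningful if exited */
    int termsig;  /* terminating signal number; meaningful if !exited */
};

/* Portable decoding through the POSIX macros. */
struct decoded_status decode_portable(int status)
{
    struct decoded_status d = {0, 0, 0};
    if (WIFEXITED(status)) {
        d.exited = 1;
        d.retcode = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        d.termsig = WTERMSIG(status);
    }
    return d;
}

/* Manual decoding. ASSUMPTION: the traditional layout described above;
 * a stopped child (low byte 0x7f on such systems) is not handled. */
struct decoded_status decode_traditional(int status)
{
    struct decoded_status d = {0, 0, 0};
    d.termsig = status & 0x7f;
    d.exited = (d.termsig == 0);
    if (d.exited)
        d.retcode = (status >> 8) & 0xff;
    return d;
}

/* Fork a child that either kills itself with `sig` (if nonzero) or
 * calls _exit(code). Returns the raw waitpid() status, or -1 on error. */
int run_child(int code, int sig)
{
    int status;
    pid_t pid;
    assert(sig >= 0);
    pid = fork();
    if (pid == 0) {
        if (sig != 0)
            kill(getpid(), sig);
        _exit(code);
    }
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

On systems that use the traditional encoding, both decoders agree for a child that exits with 42 and for one that is killed by SIGKILL. The exit code has nowhere to go beyond its 8-bit field because the neighbouring bits are already taken by the termination metadata.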


From Lawrence D'Oliveiro@21:1/5 to Andreas Kempe on Fri Feb 2 21:40:32 2024

On Fri, 2 Feb 2024 21:20:22 -0000 (UTC), Andreas Kempe wrote:

Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it makes sense, but otherwise I don't understand from an API perspective why one
would use a data type with the caveat that only half is used.

The other half contains information like whether the low half is actually
an explicit exit code, or something else like a signal that killed the
process. Or an indication that the process has not actually terminated,
but is just stopped.


From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Fri Feb 2 21:38:54 2024

On Fri, 02 Feb 2024 13:23:52 -0800, Keith Thompson wrote:

The curl command defines nearly 100 error codes ("man curl" for
details). That's the most I've seen.

Another reason for staying away from curl, I would say. It needlessly replicates the functionality of a whole lot of different protocol clients,
when all you need is HTTP/HTTPS (maybe FTP/FTPS as well). That’s why I
stick to wget.

(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)

What, not a JSON object?


From Andreas Kempe@21:1/5 to All on Fri Feb 2 21:20:22 2024

On 2024-02-02, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask out the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?

The status argument to the wait system call returned
a two part value; 8 bits of exit status and 8 bits
that describe the termination conditions (e.g. the
signal number that stopped or terminated the
process).

Here's the modern 32-bit layout (in little endian form):

unsigned int __w_termsig:7; /* Terminating signal. */
unsigned int __w_coredump:1; /* Set if dumped core. */
unsigned int __w_retcode:8; /* Return code if exited normally. */
unsigned int:16;

It's just the PDP-11 unix 16-bit version with 16 unused padding bits.

Thank you for the clarification, but I don't think I have any problem
grasping how the implementation works. My thoughts are about why they did
what they did.

Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.

If we already have exit() and wait() using ints and want to stuff our
extra information in there without changing the API, it also makes
sense.

SVR4 added the waitid(2) system call which via the siginfo argument has access to the full 32-bit program exit status.


From Andreas Kempe@21:1/5 to All on Sat Feb 3 13:21:29 2024

On 2024-02-03, Keith Thompson <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.

C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not char).

I thought the reason for the int return type was to have an error code
outside of the range of the valid data, with EOF being defined as
being a negative integer. A reason that isn't applicable for the
argument passing to exit by a program.

In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined.

I realised that char was a bad example just as I posted. I should have
chosen unsigned char instead.

Supporting exit values from 0 to 255 is fairly reasonable. Using an
int to store that value is also fairly reasonable -- especially
since main() returns int, and exit(n) is very nearly equivalent to
return n in main(). Ignoring all but the low-order 8 bits is not
specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
bits of the return value.

Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.

If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this

- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX

or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.

I can speculate about a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.


From Janis Papanagnou@21:1/5 to Andreas Kempe on Sat Feb 3 16:38:39 2024

On 03.02.2024 14:21, Andreas Kempe wrote:

On 2024-02-03, Keith Thompson <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.

C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not
char).

I thought the reason for the int return type was to have an error code outside of the range of the valid data, with EOF being defined as
being a negative integer. A reason that isn't applicable for the
argument passing to exit by a program.

In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined.

I realised that char was a bad example just as I posted. I should have
chosen unsigned char instead.

Supporting exit values from 0 to 255 is fairly reasonable. Using an
int to store that value is also fairly reasonable -- especially
since main() returns int, and exit(n) is very nearly equivalent to
return n in main(). Ignoring all but the low-order 8 bits is not
specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
bits of the return value.

Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.

If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this

- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX

or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.

I can speculate about a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.

AFAICT; "historical reasons". You have some bits to carry some exit
status, some bits to carry other termination information (signals),
optionally some more bits to carry other supplementary information.
If you want that information all carried across a single primitive
data type you have to draw a line somewhere. Given that even these days
one cannot assume that more than 16 bits are guaranteed in the default
'int' type, it seems quite obvious to split at 8 bits. (For practical
reasons differentiating 255 error codes seems more than enough, if
we consider what evaluating and individually acting on all of them
at the calling/environment level would mean.)

Janis


From Scott Lurndal@21:1/5 to Andreas Kempe on Sat Feb 3 21:34:29 2024

Andreas Kempe <[email protected]> writes:

On 2024-02-03, Keith Thompson <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.

The restriction predates both. It was how unix v6 worked; every
version of unix thereafter continued that so that existing applications
would not need to be rewritten.

It was documented in the SVID (System V Interface Definition) which
was part of the source materials used by X/Open when developing
the X Portability Guides (xpg) (which became the SuS).

Ken and Dennis chose to implement the wait system call (which
the shell uses to collect the exit status) with an 8-bit value
so they could use the other 8 bits of the 16-bit int for metadata.

This could never be changed without breaking applications, so
we still have it today in unix, linux and other POSIX-compliant
operating environments.


From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Sat Feb 3 21:37:55 2024

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.

Signed characters make no sense.


From Joe Pfeiffer@21:1/5 to Lawrence D'Oliveiro on Sat Feb 3 20:33:19 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed did sign extension when loading a byte quantity into a (word-length) register.

Signed characters make no sense.

Except in architectures where they do. If you're doing something where
it matters (or even if you want your code to be more readable) use
signed char or unsigned char as appropriate.


From Lawrence D'Oliveiro@21:1/5 to Joe Pfeiffer on Sun Feb 4 06:41:25 2024

On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed
did sign extension when loading a byte quantity into a (word-length)
register.

Signed characters make no sense.

Except in architectures where they do.

There are no character encodings which assign meanings to negative codes.


From Richard Kettlewell@21:1/5 to Joe Pfeiffer on Sun Feb 4 08:49:13 2024

Joe Pfeiffer <[email protected]> writes:

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.
Signed characters make no sense.

Except in architectures where they do.

Such as?

If you're doing something where it matters (or even if you want your
code to be more readable) use signed char or unsigned char as
appropriate.

Signed 8-bit integers are perfectly sensible, signed characters not so
much.

--
https://www.greenend.org.uk/rjk/


From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Sun Feb 4 16:25:03 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed
did sign extension when loading a byte quantity into a (word-length)
register.

Signed characters make no sense.

Except in architectures where they do.

There are no character encodings which assign meanings to negative codes.

But then 'signed char' doesn't necessarily need to be used
for character encoding (consider int8_t, for example, which
defines a signed arithmetic type from -128..+127).

On the 16-bit PDP-11, signed 8-bit values would not have been uncommon,
if only because of the limited address space.


From Rainer Weikusat@21:1/5 to Andreas Kempe on Mon Feb 5 16:11:09 2024

Andreas Kempe <[email protected]> writes:

On 2024-02-03, Keith Thompson <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

[...]

If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this

- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX

or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.

I can speculate about a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.

This should be pretty obvious: A C int is really a machine data type in disguise, namely, whatever fits into a common general purpose register
of a certain machine. C was created for porting UNIX to
the PDP-11 (or rather, rewriting UNIX for the PDP-11 with the goal of
not having to rewrite it again for the next type of machine which would need
to be supported by it). Putting a value into a certain register is a
common convention for returning values from functions (or rather, Dennis Ritchie probably thought it would be a sensible convention at that
time). Hence, having main return an int was the 'natural' idea and
allocating the lower half of this int to applications wishing to return
status codes and the upper half to the system for returning
system-specific metadata was also the 'natural' idea.

Surely, eight whole bits must be enough for everyone! :-)


From Rainer Weikusat@21:1/5 to Keith Thompson on Mon Feb 5 16:12:52 2024

Keith Thompson <[email protected]> writes:

[...]

(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)

That's interesting to know as I have been using the same convention for validation functions in Perl for some years: These return nothing when everything was ok or a textual error message otherwise.


From Kees Nuyt@21:1/5 to [email protected] on Mon Feb 5 18:22:59 2024

On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer
<[email protected]> wrote:

Signed characters make no sense.

Nor did 6 bit characters, but in the 1980s we had them:
3 characters in a 24 bit word.
Welcome to what was then called mini or midrange computers.

(Yes, looking at you, Harris, with its Vulcan Operating System)

--
Regards,
Kees Nuyt


From Andreas Kempe@21:1/5 to All on Mon Feb 5 19:02:24 2024

Thank you everyone for the different informative replies and
historical insight! I think I have gotten what I can out of this
thread.


From Lawrence D'Oliveiro@21:1/5 to Kees Nuyt on Mon Feb 5 22:41:39 2024

On Mon, 05 Feb 2024 18:22:59 +0100, Kees Nuyt wrote:

On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:

Signed characters make no sense.

Nor did 6 bit characters, but in the 1980s we had them:
3 characters in a 24 bit word.

I see your sixbit and raise you Radix-50, which packed 3 characters into a 16-bit word.

None of these used signed character codes, by the way. So my point still stands.


From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Feb 6 00:58:31 2024

On Mon, 05 Feb 2024 15:51:37 -0800, Keith Thompson wrote:

My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient. Sign-extension was more efficient than zero-filling or something like
that.

The move-byte instruction did sign-extension when loading into a register,
not storing into memory.

There was no convert-byte-to-word instruction as such.


From Scott Lurndal@21:1/5 to Keith Thompson on Tue Feb 6 00:16:56 2024

Keith Thompson <[email protected]> writes:

Lawrence D'Oliveiro <[email protected]d> writes:

On Mon, 05 Feb 2024 18:22:59 +0100, Kees Nuyt wrote:

On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:
Signed characters make no sense.

Nor did 6 bit characters, but in the 1980s we had them:
3 characters in a 24 bit word.

I see your sixbit and raise you Radix-50, which packed 3 characters into a
16-bit word.

None of these used signed character codes, by the way. So my point still
stands.

My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient.
Sign-extension was more efficient than zero-filling or something like
that. I don't remember the details, but I'm sure it wouldn't be
difficult to find out.

The PDP-11 had two move instructions:

MOV (r1)+,r2
MOVB (r2)+,r3

MOV moved source to destination. MOVB always sign-extended the byte
to the destination register size (16-bit).
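
In C terms, the difference between a sign-extending byte load and a zero-extending one is the difference between widening a signed char and an unsigned char to int. A small sketch, with made-up function names and assuming 8-bit bytes and two's complement representation:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Widen a byte to int the way a sign-extending load would:
 * reinterpret the bit pattern as signed char, then convert to int. */
int load_as_signed_char(unsigned char byte)
{
    signed char sc;
    assert(CHAR_BIT == 8);
    memcpy(&sc, &byte, 1);  /* same bits, signed interpretation */
    return sc;              /* conversion to int preserves the value */
}

/* Widen a byte to int with zero extension. */
int load_as_unsigned_char(unsigned char byte)
{
    return byte;
}

/* Reports which of the two behaviours plain char has on this
 * implementation; the C standard leaves the choice open. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

For the byte 0xFF the first function gives -1 and the second gives 255; for 7-bit ASCII values such as 0x41 the two agree, which fits the observation in this thread that the choice caused no visible trouble while only ASCII mattered.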


From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Feb 6 03:10:52 2024

On Mon, 05 Feb 2024 18:31:36 -0800, Keith Thompson wrote:

If the PDP-11 had had an alternative MOVB instruction that did zero-extension, we might not be having this discussion.

Which is effectively what I said:

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

> The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.

Signed characters make no sense.


From Richard Kettlewell@21:1/5 to Keith Thompson on Tue Feb 6 17:00:25 2024

Keith Thompson <[email protected]> writes:

Lawrence D'Oliveiro <[email protected]d> writes:

Signed characters make no sense.

You wrote that "Signed characters make no sense". I was talking about a context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the PDP-11.)

I still don’t see any explanation for signed characters as such making
sense.

I think the situation is more accurately interpreted as letting a PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that negative values didn’t arise.

--
https://www.greenend.org.uk/rjk/


From Kaz Kylheku@21:1/5 to Rainer Weikusat on Tue Feb 6 18:04:16 2024

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

Richard Kettlewell <[email protected]d> writes:

Keith Thompson <[email protected]> writes:

Lawrence D'Oliveiro <[email protected]d> writes:

Signed characters make no sense.

You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)

I still don’t see any explanation for signed characters as such making
sense.

I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.

I think that's just a (probably traditional) misnomer. A C char isn't a character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.

Sure, except for the part where "abcd" denotes an object that is a null-terminated array of these *char* integers, that entity being formally called a "string" in ISO C, and used for representing text. (Or else "abcd" is initializer syntax for a four element (or larger) array of *char*).

If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value, even though the constant has type *int*, and "\xff"[0] does likewise.

This has been connected to needless bugs in C programs. An expression like table[str[i]] may result in table[] being negatively indexed.

The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.
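
The usual defence against both problems is to convert each string element to unsigned char before using it as a <ctype.h> argument or as a table index. A minimal sketch, with function names invented for illustration:

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Count alphabetic characters. The cast keeps the argument within
 * 0..UCHAR_MAX even when plain char is signed and the byte is >= 0x80. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}

/* Tally byte values into a table of UCHAR_MAX + 1 counters.
 * Without the cast, table[*s] could index before the array start. */
void byte_histogram(const char *s, size_t table[UCHAR_MAX + 1])
{
    assert(s != NULL && table != NULL);
    for (; *s != '\0'; s++)
        table[(unsigned char)*s]++;
}
```

The casts are the kind of boilerplate that would be unnecessary if plain char had been unsigned from the start.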

All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.

Speaking of synonyms, *char* is a distinct type, and not a synonym for either *signed char* or *unsigned char*. It has to be that way, given the way it is defined, but it's just another complication that need not have existed:

#include <stdio.h>

int main(void)
{
  char *cp = 0;
  unsigned char *ucp = 0;
  signed char *scp = 0;
  printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
  printf("%d\n", '\xff');
}

char.c: In function ‘main’:
char.c:8:27: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
^~
char.c:8:38: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
^~
char.c:8:50: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @[email protected]


From Rainer Weikusat@21:1/5 to Richard Kettlewell on Tue Feb 6 17:35:01 2024

Richard Kettlewell <[email protected]d> writes:

Keith Thompson <[email protected]> writes:

Lawrence D'Oliveiro <[email protected]d> writes:

Signed characters make no sense.

You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)

I still don’t see any explanation for signed characters as such making sense.

I think the situation is more accurately interpreted as letting a PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that negative values didn’t arise.

I think that's just a (probably traditional) misnomer. A C char isn't a character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.


From Kaz Kylheku@21:1/5 to Rainer Weikusat on Tue Feb 6 18:38:06 2024

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.

Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

A fractured skull reveals a human trait (accident proneness, weak bone)
rather than the workplace trait of not enforcing helmet use.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @[email protected]


From Rainer Weikusat@21:1/5 to Kaz Kylheku on Tue Feb 6 19:02:00 2024

Kaz Kylheku <[email protected]> writes:

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.

Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

I wrote about C types and somewhat more generally, programming language features, and not "safety devices" supposed to protect human bodies from physical injury.


From Rainer Weikusat@21:1/5 to Kaz Kylheku on Tue Feb 6 18:30:46 2024

Kaz Kylheku <[email protected]> writes:

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

Richard Kettlewell <[email protected]d> writes:

[...]

I still don’t see any explanation for signed characters as such making sense.

I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.

I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.

Sure, except for the part where "abcd" denotes an object that is a null-terminated array of these *char* integers, that entity being formally called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).

If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.

This has been connected to needless bugs in C programs. An expression like table[str[i]] may result in table[] being negatively indexed.

The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.

All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.

All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types. It further supported declaring pointers to some type and
pointers were basically unsigned integer indices into a linear memory
array. Char couldn't have been an unsigned integer type, regardless of
whether this would have made more sense², because unsigned integer types
didn't exist in the language.

¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits. Had they been avoided,
human ingenuity would have found something else to fuck up.

² Being wise in hindsight is always easy. But that's not an option for
people who need to create something which doesn't yet exist and not be
wisely critical of something that does.


From Lew Pitcher@21:1/5 to Rainer Weikusat on Tue Feb 6 19:25:27 2024

On Tue, 06 Feb 2024 18:30:46 +0000, Rainer Weikusat wrote:

Kaz Kylheku <[email protected]> writes:

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

Richard Kettlewell <[email protected]d> writes:

[...]

I still don’t see any explanation for signed characters as such making sense.

I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.

I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.

Sure, except for the part where "abcd" denotes an object that is a
null-terminated array of these *char* integers, that entity being formally
called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).

If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.

This has been connected to needless bugs in C programs. An expression like
table[str[i]] may result in table[] being negatively indexed.

The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.

All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.

All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types.

This view ignores the early implementation of (K&R) C on IBM 370 systems,
where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric characters have their high bit set (alphabetics range from 0x80 through
0xe9, while numerics range from 0xf0 through 0xf9). A char in this implementation, by necessity, was unsigned, as C "guarantees that any
character in the machine's standard character set will never be negative"
(K&R "The C Programming Language", p40)

It further supported declaring pointers to some type and
pointers were basically unsigned integer indices into a linear memory
array. Char couldn't have been an unsigned integer type, regardless of
whether this would have made more sense², because unsigned integer types
didn't exist in the language.

¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits. Had they been avoided,
human ingenuity would have found something else to fuck up.

² Being wise in hindsight is always easy. But that's not an option for people who need to create something which doesn't yet exist and not be
wisely critical of something that does.

--
Lew Pitcher
"In Skills We Trust"


From Rainer Weikusat@21:1/5 to Lew Pitcher on Tue Feb 6 20:01:43 2024

Lew Pitcher <[email protected]> writes:

On Tue, 06 Feb 2024 18:30:46 +0000, Rainer Weikusat wrote:

[Why-oh-why is char not unsigned?!?]

All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types.

This view ignores the early implementation of (K&R) C on IBM 370 systems, where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric characters have their high bit set (alphabetics range from 0x80 through
0xe9, while numerics range from 0xf0 through 0xf9).

Indeed. It refers to the C language as it existed/was created when UNIX
was brought over to the PDP-11. This language didn't have any unsigned
integer types as the concept didn't yet exist.


From Kaz Kylheku@21:1/5 to Rainer Weikusat on Tue Feb 6 21:22:57 2024

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

Kaz Kylheku <[email protected]> writes:

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.

Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

I wrote about C types and somewhat more generally, programming language features, and not "safety devices" supposed to protect human bodies from physical injury.

Type systems are safety devices. That's why we have terms like "type
safe" and "unsafe code".

Type safety helps prevent misbehavior, which results in problems like
incorrect results and data loss, which can have real economic harm.

In a safety-critical embedded system, a connection between type safety
and physical safety is readily identifiable.

"Type safety" is not just some fanciful metaphor like "debugging";
there is a literal interpretation which is true.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @[email protected]


From Rainer Weikusat@21:1/5 to Kaz Kylheku on Tue Feb 6 21:37:50 2024

Kaz Kylheku <[email protected]> writes:

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

Kaz Kylheku <[email protected]> writes:

On 2024-02-06, Rainer Weikusat <[email protected]> wrote:

¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.

Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

I wrote about C types and somewhat more generally, programming language
features, and not "safety devices" supposed to protect human bodies from
physical injury.

Type systems are safety devices. That's why we have terms like "type
safe" and "unsafe code".

They're not, at least not when safety device is supposed to mean
something like hard hats. That's just an inappropriate analogy some
people like to employ. This is, however, completely beside the point of
my original text, which was about providing an explanation of why char is
signed in C even though all kinds of smart alecks, with fifty years of
hindsight Ritchie didn't have in 1972, are extremely convinced that this
was an extremely bad idea.


From Andreas Kempe@21:1/5 to All on Tue Feb 6 23:13:21 2024

On 2024-02-06, Keith Thompson <[email protected]> wrote:

Richard Kettlewell <[email protected]d> writes:

Keith Thompson <[email protected]> writes:

Lawrence D'Oliveiro <[email protected]d> writes:

Signed characters make no sense.

You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)

I still don’t see any explanation for signed characters as such making
sense.

I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.

I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a signed type made sense for PDP-11 implementation, purely because of performance issues.

I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
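
A minimal sketch of how a program can check which choice its implementation made, using only <limits.h>. The function name plain_char_is_signed is an assumption made up for illustration.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if plain char is a signed type on
   this implementation, 0 otherwise. CHAR_MIN is 0 when plain char is
   unsigned and equals SCHAR_MIN (a negative value) when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Building the same file with gcc's "-fsigned-char" and "-funsigned-char" options mentioned above should flip the result, since those options change the signedness of plain char.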

I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.

The bench was done by moving a byte from the stack to eax using a loop
of 10 movzbl/movsbl running 10M times. Both instructions gave on
average about 0.7 cycles per instruction measured using rdtsc. The
highest bit in the byte being set or unset made no difference.


From Scott Lurndal@21:1/5 to Andreas Kempe on Tue Feb 6 23:27:23 2024

Andreas Kempe <[email protected]> writes:

On 2024-02-06, Keith Thompson <[email protected]> wrote:

Richard Kettlewell <[email protected]d> writes:

I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.

I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.

I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.

A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.


From Scott Lurndal@21:1/5 to Andreas Kempe on Wed Feb 7 00:46:17 2024

Andreas Kempe <[email protected]> writes:

On 2024-02-06, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

On 2024-02-06, Keith Thompson <[email protected]> wrote:

Richard Kettlewell <[email protected]d> writes:

I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.

I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.

I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.

A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.

Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.

The logic for sign extension (MOVSX) isn't complex, the added gate delay wouldn't affect the instruction timing. Fan the sign bit out
to the higher bits through a couple of gates to either select the
sign bit or the high order bits when storing into the new register.

Sign extension on load (MOV from memory) will happen in the load unit before
it hits the register file, most likely.

The x86 MOVBE instruction is a slightly more complex example.
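
As a rough software analogue of the sign-extension logic described above, the sketch below widens an 8-bit value to 32 bits both ways. The function names are invented for illustration, and this models only the resulting bit patterns, not how any particular CPU implements MOVZX or MOVSX.

```c
#include <stdint.h>

/* Zero extension: the upper 24 bits of the result are all 0. */
uint32_t zero_extend8(uint8_t b)
{
    return b;
}

/* Sign extension: bit 7 of the input is copied into the upper 24 bits.
   (b ^ 0x80) - 0x80 maps 0..127 to themselves and 128..255 to -128..-1;
   converting that to uint32_t yields the two's complement bit pattern. */
uint32_t sign_extend8(uint8_t b)
{
    int32_t v = (int32_t)(b ^ 0x80u) - 0x80;
    return (uint32_t)v;
}
```

For an input of 0xff the first function yields 0x000000ff and the second 0xffffffff; for inputs below 0x80 the two agree.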

Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.

Within what margin of measurement error?

My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?

We use uint8_t extensively because the data is unsigned in the range 0-255.

And generally want wrapping behavior modulo 2^8.
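
A tiny sketch of that wrapping behaviour, assuming a made-up helper add_wrap8. The operands are promoted to int for the addition, and converting the sum back to uint8_t reduces it modulo 2^8.

```c
#include <stdint.h>

/* Hypothetical helper: adds two byte values with wraparound modulo 256.
   a + b is computed in int (at most 510); the conversion to uint8_t
   is well defined and keeps only the low 8 bits. */
uint8_t add_wrap8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

For example, add_wrap8(250, 10) gives 4, because 260 modulo 256 is 4.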


From Andreas Kempe@21:1/5 to All on Wed Feb 7 00:26:08 2024

On 2024-02-06, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

On 2024-02-06, Keith Thompson <[email protected]> wrote:

Richard Kettlewell <[email protected]d> writes:

I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.

I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.

I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.

A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.

Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.

Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.

My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?


From Andreas Kempe@21:1/5 to All on Wed Feb 7 02:11:26 2024

On 2024-02-07, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

On 2024-02-06, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.

A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.

Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.

The logic for sign extension (MOVSX) isn't complex, the added gate delay wouldn't affect the instruction timing. Fan the sign bit out
to the higher bits through a couple of gates to either select the
sign bit or the high order bits when storing into the new register.

Sign extension on load (MOV from memory) will happen in the load unit before it hits the register file, most likely.

The x86 MOVBE instruction is a slightly more complex example.

Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.

Within what margin of measurement error?

Here's an example of a test I played around with. The body of my loop
does this 10M times for this test. movzbl is switched for movsbl when
testing the other configuration.

movzbl -24(%rsp), %eax
movb %al, -25(%rsp)
movzbl -25(%rsp), %eax
movb %al, -26(%rsp)
movzbl -26(%rsp), %eax
movb %al, -27(%rsp)
movzbl -27(%rsp), %eax
movb %al, -28(%rsp)
movzbl -28(%rsp), %eax
incl %eax
movb %al, -24(%rsp)

This is the data, unit is total cycles for a run, from 2000 runs of
10M each for the two different instructions:

movzbl:
mean = 1.24E+08
variance = 3.95E+12

movsbl:
mean = 1.38E+08
variance = 3.44E+12

ratio movsbl/movzbl = 1.11

Performing a two-tail student t-test gives

p-value: 0.00E+00
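
A rough sketch of why the p-value rounds to zero, assuming both samples have the same size (2000 runs each) and using the rounded means and variances listed above. The helper name t_squared is made up for illustration; it returns the square of the t statistic so that no math library is needed. For equal sample sizes the pooled (Student) and Welch forms give the same statistic.

```c
/* Hypothetical helper: squared two-sample t statistic for two samples
   of equal size n, given each sample's mean and variance.
   t = (mean_b - mean_a) / sqrt((var_a + var_b) / n), so
   t^2 = (mean_b - mean_a)^2 / ((var_a + var_b) / n). */
double t_squared(double mean_a, double var_a,
                 double mean_b, double var_b, double n)
{
    double diff = mean_b - mean_a;
    return (diff * diff) / ((var_a + var_b) / n);
}
```

Calling t_squared(1.24e8, 3.95e12, 1.38e8, 3.44e12, 2000.0) gives a value of roughly 5.3e4, i.e. |t| of about 230, which is far out in the tail and consistent with a p-value that prints as zero.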

Something is causing these two test runs to give different performance
results. I will not pretend I know enough about the inner workings of
Intel's magic box to explain why.

My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?

We use uint8_t extensively because the data is unsigned in the range 0-255.

And generally want wrapping behavior modulo 2^8.

Sure, but if you are using uint8_t, you have sidestepped the whole
issue of char being signed or unsigned so a change wouldn't really
affect you.


From Richard Kettlewell@21:1/5 to Keith Thompson on Wed Feb 7 10:29:58 2024

Keith Thompson <[email protected]> writes:

Richard Kettlewell <[email protected]d> writes:

Keith Thompson <[email protected]> writes:

Lawrence D'Oliveiro <[email protected]d> writes:

Signed characters make no sense.

You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)

I still don’t see any explanation for signed characters as such making
sense.

I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.

I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a signed type made sense for PDP-11 implementation, purely because of performance issues.

Having a basic 8-bit integer type be signed type makes sense (in
context) for performance reasons and perhaps for usability reasons too.

But that’s really not the same as “signed characters make sense”. For
signed characters to make sense there has to be an encoding where some
signs (or control codes, etc) are encoded to negative values. I’ve never
heard of one.

“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values. If the purpose was
purely the latter it would have been called ‘short short int’ or
something like that.

I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.

i.e. we’re still suffering the locked-in side-effects of an ancient
decision even though the original justification has become irrelevant.
It might or might not have been a reasonable trade-off at the time, disregarding what were then hypotheticals about the future, but (indeed
with hindsight) I think from today’s point of view it was clearly the
wrong decision.

--
https://www.greenend.org.uk/rjk/


From Scott Lurndal@21:1/5 to Andreas Kempe on Wed Feb 7 15:22:04 2024

Andreas Kempe <[email protected]> writes:

On 2024-02-07, Scott Lurndal <[email protected]> wrote:

Andreas Kempe <[email protected]> writes:

Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.

Within what margin of measurement error?

Here's an example of a test I played around with. The body of my loop
does this 10M times for this test. movzbl is switched for movsbl when
testing the other configuration.

movzbl -24(%rsp), %eax
movb %al, -25(%rsp)
movzbl -25(%rsp), %eax
movb %al, -26(%rsp)
movzbl -26(%rsp), %eax
movb %al, -27(%rsp)
movzbl -27(%rsp), %eax
movb %al, -28(%rsp)
movzbl -28(%rsp), %eax
incl %eax
movb %al, -24(%rsp)

This is the data, unit is total cycles for a run, from 2000 runs of
10M each for the two different instructions:

movzbl:
mean = 1.24E+08
variance = 3.95E+12

movsbl:
mean = 1.38E+08
variance = 3.44E+12

ratio movsbl/movzbl = 1.11

Very interesting. I don't know why that is.

We use uint8_t extensively because the data is unsigned in the range 0-255.
And generally want wrapping behavior modulo 2^8.

Sure, but if you are using uint8_t, you have sidestepped the whole
issue of char being signed or unsigned so a change wouldn't really
affect you.

While most C compilers have a compile-time option to select the signed-ness of char, using uint8_t sidesteps the issue completely.


From Rainer Weikusat@21:1/5 to Richard Kettlewell on Wed Feb 7 15:30:23 2024

Richard Kettlewell <[email protected]d> writes:

Keith Thompson <[email protected]> writes:

[...]

I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.

Having a basic 8-bit integer type be signed type makes sense (in
context) for performance reasons and perhaps for usability reasons too.

But that’s really not the same as “signed characters make sense”. For
signed characters to make sense there has to be an encoding where some
signs (or control codes, etc) are encoded to negative values. I’ve never
heard of one.

“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to represent characters, not just small integer values.

Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.


From Richard Kettlewell@21:1/5 to Rainer Weikusat on Wed Feb 7 20:20:12 2024

Rainer Weikusat <[email protected]> writes:

Richard Kettlewell <[email protected]d> writes:

“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.

Computers have absolutely no idea of "characters". They handle numbers, integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.

Language designers do, however, have an idea of “characters”.

--
https://www.greenend.org.uk/rjk/


From Lawrence D'Oliveiro@21:1/5 to Richard Kettlewell on Wed Feb 7 20:58:01 2024

On Wed, 07 Feb 2024 20:20:12 +0000, Richard Kettlewell wrote:

Language designers do, however, have an idea of “characters”.

Unicode uses the terms “grapheme” and “text element”. Actually it also uses “character”, but it seems less clear on what that means. It is not
the same as a “code point” or “glyph”.

<https://www.unicode.org/faq/char_combmark.html>


From Richard Kettlewell@21:1/5 to Lawrence D'Oliveiro on Thu Feb 8 11:21:56 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Wed, 07 Feb 2024 20:20:12 +0000, Richard Kettlewell wrote:

Language designers do, however, have an idea of “characters”.

Unicode uses the terms “grapheme” and “text element”. Actually it also
uses “character”, but it seems less clear on what that means. It is not the same as a “code point” or “glyph”.

<https://www.unicode.org/faq/char_combmark.html>

Sure, but this was all happening in the 1970s, long before Unicode
existed.

K&R1 explicitly says char is “capable of holding one character in the
local character set” (and mentions EBCDIC as a concrete example on the
same page - the problem must have been obvious already).

--
https://www.greenend.org.uk/rjk/


From Rainer Weikusat@21:1/5 to Richard Kettlewell on Thu Feb 8 16:34:10 2024

Richard Kettlewell <[email protected]d> writes:

Rainer Weikusat <[email protected]> writes:

Richard Kettlewell <[email protected]d> writes:

“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.

Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.

Language designers do, however, have an idea of “characters”.

I don't quite understand what that's supposed to communicate. Insofar as
the machine is concerned, a character is nothing but an integer, and a
data type sufficient to hold a character is thus necessarily an integer
type of some size. In a language without unsigned integer types, it'll
necessarily also be a signed integer type.
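
To make that concrete, here is a minimal C sketch (the function names are made up for illustration, they are not from anybody's post). It only shows that a char-sized value is an ordinary integer and that widening it to int either sign-extends or zero-fills depending on the type; whether plain char behaves like the first or the second case is left to the implementation.

```c
#include <limits.h>

/* Widening from signed char sign-extends: negative values stay negative. */
int widen_signed(signed char c)
{
    return c;
}

/* Widening from unsigned char zero-fills: the result is non-negative
   on ordinary implementations where int is wider than char. */
int widen_unsigned(unsigned char c)
{
    return c;
}

/* Nonzero if plain char behaves like signed char on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* A character takes part in arithmetic like any other integer. */
int next_code(char c)
{
    return c + 1;
}
```

Calling plain_char_is_signed() on a given compiler tells you which of the two widening behaviours plain char gets there.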


From Rainer Weikusat@21:1/5 to Keith Thompson on Thu Feb 8 17:46:20 2024

Keith Thompson <[email protected]> writes:

Rainer Weikusat <[email protected]> writes:

Richard Kettlewell <[email protected]d> writes:

Rainer Weikusat <[email protected]> writes:

Richard Kettlewell <[email protected]d> writes:

“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.

Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.

Language designers do, however, have an idea of “characters”.

I don't quite understand what that's supposed to communicate. Insofar as
the machine is concerned, a character is nothing but an integer, and a
data type sufficient to hold a character is thus necessarily an integer
type of some size. In a language without unsigned integer types, it'll
necessarily also be a signed integer type.

Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
char was effectively unsigned in some implementations, in that
converting a char value to int would zero-fill the result rather than
doing sign-extension.

According to Ritchie's "The Development of the C Language"

,----
| During 1973-1980, the language grew a bit: the type structure gained
| unsigned
|
| [...]
|
| the similarity of the arithmetic properties of character pointers and
| unsigned integers made it hard to resist the temptation to identify
| them. The unsigned types were added to make unsigned arithmetic
| available without confusing it with pointer manipulation. Similarly, the
| early language condoned assignments between integers and pointers
`----


From Kaz Kylheku@21:1/5 to Rainer Weikusat on Thu Feb 8 19:54:12 2024

On 2024-02-08, Rainer Weikusat <[email protected]> wrote:

According to Ritchie's "The Development of the C Language"

,----
| During 1973-1980, the language grew a bit: the type structure gained
| unsigned
|
| [...]
|
| the similarity of the arithmetic properties of character pointers and
| unsigned integers made it hard to resist the temptation to identify
| them. The unsigned types were added to make unsigned arithmetic
| available without confusing it with pointer manipulation. Similarly, the
| early language condoned assignments between integers and pointers
`----

It seems like a very odd rationale, given how things played out.

The difference between two pointers ended up signed (ptrdiff_t).
So pointer arithmetic doesn't work exactly like unsigned. That's mostly
a good thing, except that pointers farther from each other than half the
address space cannot be subtracted. (ISO C mostly takes that away anyway
since pointers to different objects may only be compared for exact
equality, and cannot be subtracted. If no object is half the address
space or larger, subtraction overflow will never occur.)
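
A minimal sketch of the signed-difference point, assuming both pointers refer to elements of the same array (element_distance is a hypothetical helper name, not anything from the standard library):

```c
#include <assert.h>
#include <stddef.h>

/* Signed distance, in elements, from `from` to `to`.
   Both pointers must point into (or one past the end of) the same
   array; subtracting pointers into different objects is undefined. */
ptrdiff_t element_distance(const int *from, const int *to)
{
    assert(from != NULL && to != NULL);
    return to - from;
}
```

Swapping the two arguments flips the sign of the result, which is exactly what an unsigned type could not express.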

Moreover, unsigned ended up necessary for representing a simple byte
in a nice way.

Not only that, but unsigned types are useful for bit manipulation,
without running into nonportable behaviors around shifting into and out
of the sign bit.

If you have a 32 bit int and want a 32 bit field, you want unsigned int.
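
For illustration, a short sketch of a 32 bit field handled through an unsigned type. The helper names are invented for this example, and it assumes the implementation provides uint32_t:

```c
#include <assert.h>
#include <stdint.h>

/* Mask with only the highest bit of a 32 bit field set. The operand is
   unsigned, so nothing is ever shifted into a sign bit. */
uint32_t top_bit_mask(void)
{
    return UINT32_C(1) << 31;
}

/* Return `field` with bit n (0..31) set. */
uint32_t set_bit(uint32_t field, unsigned n)
{
    assert(n < 32);
    return field | (UINT32_C(1) << n);
}

/* Return 1 if bit n (0..31) of `field` is set, 0 otherwise. */
int test_bit(uint32_t field, unsigned n)
{
    assert(n < 32);
    return (int)((field >> n) & 1u);
}
```

The same code written with a plain 32 bit int and 1 << 31 would depend on how the implementation treats the sign bit.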

Very odd to see the existence of unsigned math justified in terms of
some story about pointers.

It seems Ritchie really didn't think much about portability; he
probably thought it was fine to do 1 << 15 with a 16 bit signed int
to calculate a mask for the highest bit, since that happened to work in
the systems he designed. If someone wanted C on their weird machine
where that misbehaves, or produces an alternative zero that compares
equal to regular zero, that was their problem.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @[email protected]


From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Thu Feb 8 21:57:57 2024

On Thu, 08 Feb 2024 10:23:29 -0800, Keith Thompson wrote:

The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".

signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.

Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is
implementation-defined; newer C specs say that left-shifting a negative
value is simply “undefined”, and right-shifting a negative value is
“implementation-defined”.
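
A hedged sketch of how code can sidestep both cases. The names asr and lsr are hypothetical, and the sign-extending variant assumes a two's complement representation, so that ~value is non-negative whenever value is negative:

```c
#include <assert.h>
#include <limits.h>

/* Right shift that always sign-extends, whatever the compiler does
   for >> on a negative int. Only non-negative values are shifted. */
int asr(int value, unsigned count)
{
    assert(count < sizeof(int) * CHAR_BIT);
    if (value >= 0)
        return value >> count;   /* well defined for non-negative values */
    return ~(~value >> count);   /* ~value is non-negative here */
}

/* Right shift that always zero-fills: do the shift on the unsigned
   representation of the value. */
unsigned lsr(int value, unsigned count)
{
    assert(count < sizeof(unsigned) * CHAR_BIT);
    return (unsigned)value >> count;
}
```

With these two, the caller picks sign-extension or zero-fill explicitly instead of inheriting whatever the implementation chose.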


From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Feb 8 22:30:52 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 08 Feb 2024 10:23:29 -0800, Keith Thompson wrote:

The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".

signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.

Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is
implementation-defined; newer C specs say that left-shifting a negative
value is simply “undefined”, and right-shifting a negative value is
“implementation-defined”.

There were extant hardware implementations exhibiting both behaviors. So they
made the behavior implementation-defined in the compiler.


From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Thu Feb 8 23:26:55 2024

On Thu, 08 Feb 2024 22:30:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is
implementation-defined; newer C specs say that left-shifting a negative
value is simply “undefined”, and right-shifting a negative value is
“implementation-defined”.

There were extant hardware implementations exhibiting both behaviors.

Except the current spec doesn’t mention a choice between two behaviours.

