Both functions make the low-order eight bits of the status argument
available to a parent process which has called a wait(2)-family
function.
The exit() and _Exit() functions conform to ISO/IEC 9899:1999 (“ISO C99”).
Finally, control is returned to the host environment. If the value of
status is zero or EXIT_SUCCESS, an implementation-defined form of the
status successful termination is returned. If the value of status is EXIT_FAILURE, an implementation-defined form of the status
unsuccessful termination is returned. Otherwise the status returned
is implementation-defined.
If the parent process of the calling process is executing a wait() or waitpid(), it is notified of the termination of the calling process
and the low order 8 bits of status are made available to it; see
3.2.1.
The value of status may be 0, EXIT_SUCCESS, EXIT_FAILURE, or any
other value, though only the least significant 8 bits (that is,
status & 0377) shall be available from wait() and waitpid(); the
full value shall be available from waitid() and in the siginfo_t
passed to a signal handler for SIGCHLD.
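The truncation described above can be observed directly. The following is a minimal sketch, assuming a POSIX system; the helper name observed_exit_status is hypothetical and not part of any standard API. It only covers the wait()/waitpid() path, not waitid().

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that terminates with `status`
 * and return the exit status the parent sees through waitpid().
 * Returns -1 if fork() or waitpid() fails. */
int observed_exit_status(int status)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;                  /* fork failed */
    if (pid == 0)
        _exit(status);              /* child: terminate immediately */

    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) != pid)
        return -1;                  /* waitpid failed */
    assert(WIFEXITED(wstatus));     /* the child never dies from a signal here */
    return WEXITSTATUS(wstatus);    /* only the low-order 8 bits survive */
}
```

Under these assumptions, observed_exit_status(3) yields 3, while observed_exit_status(-1) yields 255 and observed_exit_status(256) yields 0, which is the behaviour that surprises code returning negative error codes.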
Hello everyone,
I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?
Andreas Kempe <[email protected]> writes:
Hello everyone,
<snip>
I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?
The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer
and wait needed to be able to include metadata along with the exit
status.
I'm wondering why, at least on Linux and FreeBSD, a process exit status
was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
This did bite some colleagues at work at one point who were porting a
modem manager from a real-time OS to Linux because they were returning negative status codes for errors.
Den 2024-02-02 skrev Scott Lurndal <[email protected]>:
Andreas Kempe <[email protected]> writes:
Hello everyone,
<snip>
I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?
The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer
I'm afraid that's a tall order. I had yet to learn how to read when
they went out of production. :) Please excuse my ignorance.
and wait needed to be able to include metadata along with the exit
status.
I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask off all but the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.
The curl command defines nearly 100 error codes ("man curl" for
details). That's the most I've seen.
(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)
Andreas Kempe <[email protected]> writes:
I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask off all but the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?
The status argument to the wait system call returned
a two part value; 8 bits of exit status and 8 bits
that describe the termination conditions (e.g. the
signal number that stopped or terminated the
process).
Here's the modern 32-bit layout (in little endian form):
unsigned int __w_termsig:7; /* Terminating signal. */
unsigned int __w_coredump:1; /* Set if dumped core. */
unsigned int __w_retcode:8; /* Return code if exited normally. */
unsigned int:16;
It's just the PDP-11 unix 16-bit version with 16 unused padding bits.
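The layout quoted above can be unpacked with plain shifts and masks. This is only a sketch of that specific layout; the function names are made up for illustration, and portable code should use the WIFEXITED, WEXITSTATUS, WIFSIGNALED and WTERMSIG macros from <sys/wait.h> rather than hard-coding bit positions.

```c
#include <assert.h>

/* Illustrative extractors for the traditional wait status layout:
 * bits 0-6  terminating signal
 * bit  7    core dump flag
 * bits 8-15 exit code if the process exited normally */
int status_termsig(int wstatus)  { return wstatus & 0x7f; }
int status_coredump(int wstatus) { return (wstatus >> 7) & 1; }
int status_retcode(int wstatus)  { return (wstatus >> 8) & 0xff; }

/* Small self-check exercising the helpers above. */
void check_status_helpers(void)
{
    int exited_with_3 = 3 << 8;      /* exit code 3, no signal */
    int segv_with_core = 11 | 0x80;  /* signal 11 plus the core flag */

    assert(status_retcode(exited_with_3) == 3);
    assert(status_termsig(exited_with_3) == 0);
    assert(status_termsig(segv_with_core) == 11);
    assert(status_coredump(segv_with_core) == 1);
}
```

With eight bits reserved for the signal number and core flag, only eight bits of a 16-bit word were left over for the exit code.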
SVR4 added the waitid(2) system call which via the siginfo argument has access to the full 32-bit program exit status.
Andreas Kempe <[email protected]> writes:
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.
C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not char).
In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined.
Supporting exit values from 0 to 255 is fairly reasonable. Using an
int to store that value is also fairly reasonable -- especially
since main() returns int, and exit(n) is very nearly equivalent to
return n in main(). Ignoring all but the low-order 8 bits is not
specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
bits of the return value.
Den 2024-02-03 skrev Keith Thompson <[email protected]>:
Andreas Kempe <[email protected]> writes:
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.
C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not
char).
I thought the reason for the int return type was to have an error code outside of the range of the valid data, with EOF being defined as
being a negative integer. A reason that isn't applicable for the
argument passing to exit by a program.
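The EOF point can be illustrated with a short sketch. The function name count_bytes is hypothetical. Because fgetc() returns an int, every byte value 0..UCHAR_MAX stays distinguishable from the negative EOF value; storing the result in a plain char first could make a 0xff byte look like end of file on platforms where char is signed.

```c
#include <assert.h>
#include <stdio.h>

/* Count the bytes remaining in a stream. */
long count_bytes(FILE *fp)
{
    assert(fp != NULL);

    long n = 0;
    int c;                          /* int, not char: EOF must stay out of band */
    while ((c = fgetc(fp)) != EOF)
        n++;
    return n;
}
```

No comparable out-of-band value is needed for the argument of exit(), which is the distinction drawn above.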
In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined.
I realised that char was a bad example just as I posted. I should have
chosen unsigned char instead.
Supporting exit values from 0 to 255 is fairly reasonable. Using an
int to store that value is also fairly reasonable -- especially
since main() returns int, and exit(n) is very nearly equivalent to
return n in main(). Ignoring all but the low-order 8 bits is not
specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
bits of the return value.
Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.
If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this
- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX
or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.
I can speculate a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.
Den 2024-02-03 skrev Keith Thompson <[email protected]>:
Andreas Kempe <[email protected]> writes:
Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.
The signedness of plain char is implementation-defined.
On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed did sign extension when loading a byte quantity into a (word-length) register.
Signed characters make no sense.
Lawrence D'Oliveiro <[email protected]d> writes:
On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed
did sign extension when loading a byte quantity into a (word-length)
register.
Signed characters make no sense.
Except in architectures where they do.
Lawrence D'Oliveiro <[email protected]d> writes:
On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.
Signed characters make no sense.
Except in architectures where they do.
If you're doing something where it matters (or even if you want your
code to be more readable) use signed char or unsigned char as
appropriate.
On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed
did sign extension when loading a byte quantity into a (word-length)
register.
Signed characters make no sense.
Except in architectures where they do.
There are no character encodings which assign meanings to negative codes.
Den 2024-02-03 skrev Keith Thompson <[email protected]>:
Andreas Kempe <[email protected]> writes:
If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this
- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX
or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.
I can speculate a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.
(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)
Signed characters make no sense.
On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:
Signed characters make no sense.
Nor did 6 bit characters, but in the 1980s we had them:
3 characters in a 24 bit word.
My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient. Sign-extension was more efficient than zero-filling or something like
that.
Lawrence D'Oliveiro <[email protected]d> writes:
On Mon, 05 Feb 2024 18:22:59 +0100, Kees Nuyt wrote:
On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:
Signed characters make no sense.
Nor did 6 bit characters, but in the 1980s we had them:
3 characters in a 24 bit word.
I see your sixbit and raise you Radix-50, which packed 3 characters into a
16-bit word.
None of these used signed character codes, by the way. So my point still
stands.
My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient.
Sign-extension was more efficient than zero-filling or something like
that. I don't remember the details, but I'm sure it wouldn't be
difficult to find out.
If the PDP-11 had had an alternative MOVB instruction that did zero-extension, we might not be having this discussion.
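The difference between the two widening conversions discussed here shows up in C itself. A minimal sketch, with hypothetical function names: the same 8-bit pattern widens to different int values depending on whether the source type is signed.

```c
#include <assert.h>

/* Widening a signed char sign-extends: the top bit is copied upward. */
int widen_signed(signed char c)     { return c; }

/* Widening an unsigned char zero-extends: the upper bits become 0. */
int widen_unsigned(unsigned char c) { return c; }

/* Small self-check: the byte with all bits set widens two ways. */
void check_widening(void)
{
    assert(widen_signed(-1) == -1);       /* sign extension keeps -1 */
    assert(widen_unsigned(0xff) == 255);  /* zero extension gives 255 */
}
```

On the PDP-11 the first conversion is what a byte move into a register did in hardware, so making plain char signed meant no extra instruction was needed.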
Lawrence D'Oliveiro <[email protected]d> writes:
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the PDP-11.)
Richard Kettlewell <[email protected]d> writes:
Keith Thompson <[email protected]> writes:
Lawrence D'Oliveiro <[email protected]d> writes:
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Keith Thompson <[email protected]> writes:
Lawrence D'Oliveiro <[email protected]d> writes:
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making sense.
I think the situation is more accurately interpreted as letting a PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that negative values didn’t arise.
¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...
On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
Richard Kettlewell <[email protected]d> writes:
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Sure, except for the part where "abcd" denotes an object that is a null-terminated array of these *char* integers, that entity being formally called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).
If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.
This has been connected to needless bugs in C programs. An expression like
table[str[i]] may result in table[] being negatively indexed.
The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.
All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.
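The usual workaround for both pitfalls is to convert each string element to unsigned char before using it as an index or passing it to a <ctype.h> function. A minimal sketch, with hypothetical helper names and assuming CHAR_BIT is 8 and the default "C" locale:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Map a string element to a table index in the range 0..UCHAR_MAX,
 * regardless of whether plain char is signed. */
size_t table_index(char c)
{
    return (size_t)(unsigned char)c;
}

/* Count alphabetic bytes in a string. The cast keeps the argument of
 * isalpha() within the range that <ctype.h> requires. */
size_t count_alpha(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}
```

Without the cast, table_index("\xff"[0]) would be computed from a negative value on platforms where plain char is signed.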
Kaz Kylheku <[email protected]> writes:
On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
Richard Kettlewell <[email protected]d> writes:
[...]
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Sure, except for the part where "abcd" denotes an object that is a
null-terminated array of these *char* integers, that entity being formally
called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).
If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.
This has been connected to needless bugs in C programs. An expression like
table[str[i]] may result in table[] being negatively indexed.
The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.
All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.
All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types.
It further supported declaring pointers to some type and
pointers were basically unsigned integer indices into a linear memory
array. Char couldn't have been an unsigned integer type, regardless if
this would have made more sense², because unsigned integer types didn't exist in the language.
¹ My personal theory of human fallibility is that humans tend to fuck up everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits. Had they been avoided,
human ingenuity would have found something else to fuck up.
² Being wise in hindsight is always easy. But that's not an option for people who need to create something which doesn't yet exist and not be
wisely critical of something that does.
On Tue, 06 Feb 2024 18:30:46 +0000, Rainer Weikusat wrote:
All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types.
This view ignores the early implementation of (K&R) C on IBM 370 systems, where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric characters have their high bit set (alphabetics range from 0x80 through
0xe9, while numerics range from 0xf0 through 0xf9).
Kaz Kylheku <[email protected]> writes:
On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...
I wrote about C types and somewhat more generally, programming language features, and not "safety devices" supposed to protect human bodies from physical injury.
On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
Kaz Kylheku <[email protected]> writes:
On 2024-02-06, Rainer Weikusat <[email protected]> wrote:
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...
I wrote about C types and somewhat more generally, programming language
features, and not "safety devices" supposed to protect human bodies from
physical injury.
Type systems are safety devices. That's why we have terms like "type
safe" and "unsafe code".
Richard Kettlewell <[email protected]d> writes:
Keith Thompson <[email protected]> writes:
Lawrence D'Oliveiro <[email protected]d> writes:
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a signed type made sense for PDP-11 implementation, purely because of performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
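Code that has to know which choice the implementation made can check CHAR_MIN from <limits.h>. This is a small sketch with hypothetical function names, assuming CHAR_BIT is 8 for the second helper.

```c
#include <assert.h>
#include <limits.h>

/* Nonzero if plain char is a signed type on this implementation.
 * CHAR_MIN is 0 when char is unsigned and negative when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Value of the character constant '\xff' as an int: negative where
 * plain char is signed, 255 where it is unsigned (with 8-bit chars). */
int high_byte_value(void)
{
    assert(CHAR_BIT == 8);
    return '\xff';
}
```

Building the same file with gcc's -fsigned-char and -funsigned-char options mentioned above should flip both results.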
Den 2024-02-06 skrev Keith Thompson <[email protected]>:
Richard Kettlewell <[email protected]d> writes:
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
Den 2024-02-06 skrev Scott Lurndal <[email protected]>:
Andreas Kempe <[email protected]> writes:
Den 2024-02-06 skrev Keith Thompson <[email protected]>:
Richard Kettlewell <[email protected]d> writes:
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.
Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.
Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.
My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?
Andreas Kempe <[email protected]> writes:
Den 2024-02-06 skrev Keith Thompson <[email protected]>:
Richard Kettlewell <[email protected]d> writes:
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.
Andreas Kempe <[email protected]> writes:
Den 2024-02-06 skrev Scott Lurndal <[email protected]>:
Andreas Kempe <[email protected]> writes:
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.
Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.
The logic for sign extension (MOVSX) isn't complex, the added gate delay wouldn't affect the instruction timing. Fan the sign bit out
to the higher bits through a couple of gates to either select the
sign bit or the high order bits when storing into the new register.
Sign extension on load (MOV from memory) will happen in the load unit before it hits the register file, most likely.
The x86 MOVBE instruction is a slightly more complex example.
Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.
Within what margin of measurement error?
My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?
We use uint8_t extensively because the data is unsigned in the range 0-255.
And generally want wrapping behavior modulo 2^8.
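The wrapping behaviour mentioned here can be sketched as follows; the function names are made up for illustration. The arithmetic itself happens in int after integer promotion, and it is the conversion back to uint8_t that reduces the result modulo 2^8.

```c
#include <assert.h>
#include <stdint.h>

/* Add two bytes with wraparound modulo 256. */
uint8_t wrap_add(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);   /* a + b is an int; the cast reduces it mod 256 */
}

/* Subtract two bytes with wraparound modulo 256. */
uint8_t wrap_sub(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b);   /* a negative int result wraps to 256 + result */
}

/* Small self-check exercising the helpers above. */
void check_wrapping(void)
{
    assert(wrap_add(250, 10) == 4);
    assert(wrap_sub(0, 1) == 255);
}
```

None of this depends on the signedness of plain char, since uint8_t is unsigned wherever it exists.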
Den 2024-02-07 skrev Scott Lurndal <[email protected]>:
Andreas Kempe <[email protected]> writes:
Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.
Within what margin of measurement error?
Here's an example of a test I played around with. The body of my loop
does this 10M times for this test. movzbl is switched for movsbl when
testing the other configuration.
movzbl -24(%rsp), %eax
movb %al, -25(%rsp)
movzbl -25(%rsp), %eax
movb %al, -26(%rsp)
movzbl -26(%rsp), %eax
movb %al, -27(%rsp)
movzbl -27(%rsp), %eax
movb %al, -28(%rsp)
movzbl -28(%rsp), %eax
incl %eax
movb %al, -24(%rsp)
This is the data, unit is total cycles for a run, from 2000 runs of
10M each for the two different instructions:
movzbl:
mean = 1.24E+08
variance = 3.95E+12
movsbl:
mean = 1.38E+08
variance = 3.44E+12
ratio movsbl/movzbl = 1.11
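As a rough answer to the margin-of-error question, the posted means and variances can be compared directly. The sketch below is an illustration, not part of the original measurements. It assumes 2000 independent runs for each instruction (the post can also be read as 2000 runs in total), and the function names are hypothetical. It compares the squared difference of the means with the variance of that difference, which avoids needing a square root.

```c
/* Variance of the difference between two sample means, assuming the
   runs are independent: var_a / n_a + var_b / n_b. */
double variance_of_mean_difference(double var_a, double n_a,
                                   double var_b, double n_b)
{
    return var_a / n_a + var_b / n_b;
}

/* Returns 1 if the two means differ by more than k standard errors.
   Both sides of the comparison are squared, so no sqrt() is needed. */
int means_differ_by_k_sigma(double mean_a, double mean_b,
                            double var_diff, double k)
{
    double diff = mean_a - mean_b;
    return diff * diff > k * k * var_diff;
}
```

With the figures above and n = 2000 for both, the variance of the difference is about 3.7E+09, i.e. a standard error of roughly 6.1E+04 cycles, against a difference in means of 1.4E+07 cycles. Under these assumptions the gap is far larger than sampling noise. That says nothing about systematic effects such as alignment or frequency scaling, which this kind of check cannot detect.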
We use uint8_t extensively because the data is unsigned in the range 0-255.
And generally want wrapping behavior modulo 2^8.
Sure, but if you are using uint8_t, you have sidestepped the whole
issues of char being signed or unsigned so a change wouldn't really
affect you.
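The wrapping behaviour mentioned above can be sketched as follows. The helper names are hypothetical. In C the `uint8_t` operands are first promoted to int, so the arithmetic itself does not wrap; the reduction modulo 2^8 happens when the result is converted back to `uint8_t`, and that conversion is well defined regardless of whether plain char is signed.

```c
#include <stdint.h>

/* Adds two bytes with wrap-around modulo 256. The sum is computed in
   int (after integer promotion) and reduced by the conversion back
   to uint8_t. */
uint8_t wrapping_add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Subtracts with wrap-around modulo 256. A negative int result is
   converted to uint8_t by adding 256, which is well-defined. */
uint8_t wrapping_sub_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b);
}
```

For example, `wrapping_add_u8(250, 10)` gives 4 and `wrapping_sub_u8(3, 5)` gives 254.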
Keith Thompson <[email protected]> writes:
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
Having a basic 8-bit integer type be signed type makes sense (in
context) for performance reasons and perhaps for usability reasons too.
But that’s really not the same as “signed characters make sense”. For
signed characters to make sense there has to be an encoding where some
signs (or control codes, etc) are encoded to negative values. I’ve never
heard of one.
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to represent characters, not just small integer values.
Richard Kettlewell <[email protected]d> writes:
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers, integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
On Wed, 07 Feb 2024 20:20:12 +0000, Richard Kettlewell wrote:
Language designers do, however, have an idea of “characters”.
Unicode uses the terms “grapheme” and “text element”. Actually it also
uses “character”, but it seems less clear on what that means. It is not the same as a “code point” or “glyph”.
<https://www.unicode.org/faq/char_combmark.html>
Rainer Weikusat <[email protected]> writes:
Richard Kettlewell <[email protected]d> writes:
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
Rainer Weikusat <[email protected]> writes:
Richard Kettlewell <[email protected]d> writes:
Rainer Weikusat <[email protected]> writes:
Richard Kettlewell <[email protected]d> writes:
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
I don't quite understand what that's supposed to communicate. Insofar as
the machine is concerned, a character is nothing but an integer and a
data type sufficient to hold a character is thus necessarily an integer
type of some size. In a language without unsigned integer types, it'll
necessarily also be a signed integer type.
Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
char was effectively unsigned in some implementations, in that
converting a char value to int would zero-fill the result rather than
doing sign-extension.
According to Ritchie's "The Development of the C Language"
,----
| During 1973-1980, the language grew a bit: the type structure gained
| unsigned
|
| [...]
|
| the similarity of the arithmetic properties of character pointers and
| unsigned integers made it hard to resist the temptation to identify
| them. The unsigned types were added to make unsigned arithmetic
| available without confusing it with pointer manipulation. Similarly, the
| early language condoned assignments between integers and pointers
`----
The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".
signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.
On Thu, 08 Feb 2024 10:23:29 -0800, Keith Thompson wrote:
The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".
signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.
Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is
implementation-defined; newer C specs say that left-shifting a negative
value is simply “undefined”, and right-shifting a negative value is
“implementation-defined”.
Lawrence D'Oliveiro <[email protected]d> writes:
Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is
implementation-defined; newer C specs say that left-shifting a negative
value is simply “undefined”, and right-shifting a negative value is
“implementation-defined”.
There were extant hardware implementations exhibiting both behaviors.
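Code that needs one specific behaviour can avoid shifting negative values altogether. The following is a minimal sketch with hypothetical helper names, assuming the fixed-width types from `<stdint.h>` are available. Every shift is performed on an unsigned value, which is fully defined as long as the shift count is smaller than the width.

```c
#include <assert.h>
#include <stdint.h>

/* Logical right shift: the vacated top bits are always zero. */
uint32_t lsr32(int32_t x, unsigned n)
{
    assert(n < 32);
    return (uint32_t)x >> n;
}

/* Arithmetic right shift (sign-extending, rounds toward negative
   infinity) without ever shifting a negative value. For negative x,
   ~(uint32_t)x equals -x - 1, which is non-negative and fits in
   int32_t after the shift. */
int32_t asr32(int32_t x, unsigned n)
{
    assert(n < 32);
    if (x < 0)
        return -1 - (int32_t)(~(uint32_t)x >> n);
    return (int32_t)((uint32_t)x >> n);
}

/* Left shift on the unsigned representation: bits shifted out of the
   top are discarded, i.e. the result is reduced modulo 2^32. */
uint32_t shl32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return (uint32_t)(x << n);
}
```

For instance, `asr32(-8, 1)` gives -4 and `lsr32(-1, 28)` gives 15, whichever way the hardware's native right shift behaves.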