Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.
https://www.tomshardware.com/tech-industry/senior-intel-cpu-architects-splinter-to-develop-risc-v-processors-veterans-establish-aheadcomputing
Maybe a good time to get some developers on board for development.
In article <vajo7i$2s028$[email protected]>, [email protected] (Thomas Koenig) wrote:
Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.
They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.
On 8/27/2024 2:59 PM, John Dallman wrote:
In article <vajo7i$2s028$[email protected]>, [email protected] (Thomas
Koenig) wrote:
Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.
They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.
Making RISC-V "not suck" in terms of performance will probably at least
be easier than making x86-64 "not suck".
Android is apparently waiting for a new RISC-V instruction set
extension; you can run various Linuxes, but I have not heard
about anyone wanting to do so on a large scale.
My thoughts for "major missing features" are still:
Needs register-indexed load;
Needs an intermediate size constant load (such as 17-bit sign extended)
in a 32-bit op.
Where, there is a sizeable chunk of constants between 12 and 17 bits,
but not quite as many between 17 and 32 (and 32-64 bits is comparably infrequent).
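To make the size classes concrete, here is a rough sketch of how a compiler back end might bucket constants by the narrowest sign-extended immediate that holds them. The names fits_signed and classify_constant and the enum are made up for illustration; only the 12-bit width matches an existing RISC-V immediate field, and the 17-bit class is the proposed one, not anything in the current ISA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if v is representable as a sign-extended immediate of
 * the given width (1..64 bits). */
bool fits_signed(int64_t v, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits == 64)
        return true;
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Hypothetical size classes, following the ranges named in the post. */
enum imm_class { IMM_12, IMM_17, IMM_32, IMM_64 };

/* Picks the narrowest class that can hold the constant. */
enum imm_class classify_constant(int64_t v)
{
    if (fits_signed(v, 12)) return IMM_12;
    if (fits_signed(v, 17)) return IMM_17;
    if (fits_signed(v, 32)) return IMM_32;
    return IMM_64;
}
```

Running something like this over the constants in a real code base would be one way to check the claimed distribution between the 12, 17 and 32 bit buckets.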
I could also make a case for an instruction to load a Binary16 value and convert to Binary32 or Binary64 in an FPR, but this is arguably a bit
niche (but, would still beat out using a memory load).
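For reference, a sketch in C of what such a Binary16 to Binary32 widening has to do. This is plain software, not a description of any existing instruction, and the name half_bits_to_float is made up here. Every Binary16 value is exactly representable in Binary32, so no rounding is involved.

```c
#include <stdint.h>
#include <string.h>

/* Widens an IEEE 754 binary16 bit pattern to a binary32 value. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, payload moved up. */
        bits = sign | 0x7F800000u | (frac << 13);
    } else if (exp != 0) {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (frac << 13);
    } else if (frac == 0) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Subnormal: shift until the implicit leading 1 appears. */
        uint32_t e = 113;
        while ((frac & 0x400u) == 0) {
            frac <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((frac & 0x3FFu) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bits as a float */
    return f;
}
```

In hardware this is mostly wiring plus a small normalizer for the subnormal case, which is part of why a 16-bit immediate form looks cheap next to a memory load.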
Big annoying thing with it, is that to have any hope of adoption, one
needs an "actually involved" party to add it. There doesn't seem to be
any sort of aggregated list of "known in-use" opcodes, or any real
mechanism for "informal" extensions.
The closest we have on the latter point is the "Composable Extensions" extension by Jan Gray, which seems to be mostly that part of the ISA's encoding space can be banked out based on a CSR or similar.
Though, bigger immediate values and register-indexed loads do arguably
better belong in the base ISA encoding space.
At present, I am still on the fence about whether or not to support the
C extension in RISC-V mode in the BJX2 Core, mostly because the encoding scheme just sucks bad enough that I don't really want to deal with it.
Realistically, can't likely expect anyone else to adopt BJX2 though.
Though, the bigger issue might be how to make it able to access hardware
devices (seems like part of the physical address space is used as a
PCI Config space, and would need to figure out what sorts of devices the Linux kernel expects to be there in such a scenario).
On 8/27/2024 6:50 PM, MitchAlsup1 wrote:
On Tue, 27 Aug 2024 22:39:02 +0000, BGB wrote:
On 8/27/2024 2:59 PM, John Dallman wrote:
In article <vajo7i$2s028$[email protected]>, [email protected] (Thomas Koenig) wrote:
Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.
They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.
Making RISC-V "not suck" in terms of performance will probably at least
be easier than making x86-64 "not suck".
Yet, these people have decades of experience building complex things
that made x86 (also) not suck. They should have the "drawing power" to get
more people with similar experiences.
The drawback is that they are competing with "everyone else" in
RISC-V-land, and starting several years late.
Though, if anything, they probably have the experience to know how to
make things like the fabled "opcode fusion" work without burning too
many resources.
Android is apparently waiting for a new RISC-V instruction set
extension; you can run various Linuxes, but I have not heard
about anyone wanting to do so on a large scale.
My thoughts for "major missing features" are still:
Needs register-indexed load;
Needs an intermediate size constant load (such as 17-bit sign extended)
in a 32-bit op.
Full access to constants.
That would be better, but is unlikely within the existing encoding constraints.
But, say, if one burned one of the remaining unused "OP Rd, Rs, Imm12s" encodings as an Imm17s, well then...
With the OpCode space already 98% filled there does not need to
be such a list.
One would still need it if multiple parties want to be able to define an extension independently of each other and not step on the same
encodings.
The closest we have on the latter point is the "Composable Extensions"
extension by Jan Gray, which seems to be mostly that part of the ISA's
encoding space can be banked out based on a CSR or similar.
Though, bigger immediate values and register-indexed loads do arguably
better belong in the base ISA encoding space.
Agreed, but there is so much more.
FCMP Rt,#14,R19 // 32-bit instruction
ENTER R16,R0,#400 // 32-bit instruction
..
These are likely a bit further down the priority list.
Prolog/Epilog happens once per function, and often may be skipped for
small leaf functions, so seems like a lower priority. More so, if one
lacks a good way to optimize it much beyond the sequence of load/store
ops which it would be replacing (and maybe not a way to do it much
faster than however many can be moved in a single clock cycle with the
available register ports).
At present, I am still on the fence about whether or not to support the
C extension in RISC-V mode in the BJX2 Core, mostly because the encoding
scheme just sucks bad enough that I don't really want to deal with it.
Realistically, can't likely expect anyone else to adopt BJX2 though.
Captain Obvious strikes again.
This is likely the fate of nearly every hobby class ISA.
In article <VbrzO.74199$[email protected]>, [email protected] (Scott Lurndal) wrote:
[email protected] (John Dallman) writes:
They're presumably intending to develop high-performance cores,
since they have substantial experience in doing that for x86-64.
The question is if demand for those will develop.
Ask Si-Five about demand for high-performance risc-v cores.
SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.
Ahead Computing are presumably not expecting to deliver IP cores for a
year or two, so /maybe/ they have reasons to expect demand then.
But it's also possible they just want to carry on being chip architects
while being in charge of their own company. If so, adopting RISC-V is
more credible in the short term than starting to design a new ISA as a
commercial project. Intel won't sell them an x86 license at any
reasonable price.
Thinking a bit more, they may be trying to go the Nuvia route: design
original cores for an existing ISA and get bought out. Nuvia were bought
by Qualcomm for their ARMv9-A core IP well before they released anything.
If Ahead were to successfully design a fast RISC-V core with
power:performance that was competitive with ARM, /Intel/ might well buy
them.
Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for something to
compete with ARM after having accepted you can't get power:performance to
match ARM out of x86-64. Then it all went quiet, and Intel didn't
manufacture the SiFive SoC ("Horse Creek") that was supposed to blaze the
trail for RISC-V as a consumer and/or enterprise architecture.
If you were a discontented Intel senior engineer, demonstrating that you
could produce what Intel needed, getting your company bought and you
brought back to Intel in a more senior position might seem worth trying.
[email protected] (John Dallman) writes:
They're presumably intending to develop high-performance cores,
since they have substantial experience in doing that for x86-64.
The question is if demand for those will develop.
Ask Si-Five about demand for high-performance risc-v cores.
Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for
something to compete with ARM after having accepted you can't
get power:performance to match ARM out of x86-64. Then it all
went quiet, and Intel didn't manufacture the SiFive SoC
("Horse Creek") that was supposed to blaze the trail for
RISC-V as a consumer and/or enterprise architecture.
The problem with this is that RISC-V isn't currently comparable, feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which
doesn't exist in the RISC-V design space yet.
Scott Lurndal <[email protected]> schrieb:
The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.
What is missing (in broad terms)?
In article <VbrzO.74199$[email protected]>, [email protected] (Scott Lurndal) wrote:
[email protected] (John Dallman) writes:
They're presumably intending to develop high-performance cores,
since they have substantial experience in doing that for x86-64.
The question is if demand for those will develop.
Ask Si-Five about demand for high-performance risc-v cores.
SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.
Ahead Computing are presumably not expecting to deliver IP cores for a
year or two
But it's also possible they just want to carry on being chip architects
while being in charge of their own company.
If so, adopting RISC-V is
more credible in the short term than starting to design a new ISA as a
commercial project.
Thinking a bit more, they may be trying to go the Nuvia route: design
original cores for an existing ISA and get bought out.
Nuvia were bought
by Qualcomm for their ARMv9-A core IP well before they released anything.
If Ahead were to successfully design a fast RISC-V core with
power:performance that was competitive with ARM, /Intel/ might well buy
them.
Android is apparently waiting for a new RISC-V instruction set extension;
you can run various Linuxes, but I have not heard about anyone wanting to
do so on a large scale.
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.
What is missing (in broad terms)?
NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive.
Many of them are related to supporting server-grade
RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
or accelerator interfaces (ST64B, LD64B).
Moreover, they have a mature SoC ecosystem
[email protected] (Scott Lurndal) writes:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.
What is missing (in broad terms)?
NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive.
I think the lack of "extensive" features is a feature of RISC-V. Last
I heard, the ARM manual was >10000 pages.
The RISC-V user manual has put on a lot of weight since Volume I
(unprivileged) Version 2.2 (145 pages) and Volume II (privileged)
20211203 (155 pages). The 20240411 draft of Volume I weighs in at 670
pages, and the 20240411 draft of Volume II at 172 pages, but that's
still quite a long way from 10000.
Many of them are related to supporting server-grade
RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
or accelerator interfaces (ST64B, LD64B).
Can't say I ever missed such instructions.
Are RAS instructions like memory-ordering instructions?
Moreover, they have a mature SoC ecosystem
ARM certainly has that. However, a lot of the SoC ecosystem is only
accessed through drivers that are specific to one kernel and that
nobody maintains, and that's why many smartphones don't get any
updates after a few years. Let's hope it's better for servers.
On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
On Wed, 28 Aug 2024 3:33:40 +0000, BGB wrote:
And what kind of code compatibility would you have between different
designs...
If people can agree as to the encodings, then implementations are more
free to pick which extensions they want or don't want.
If the encodings conflict with each other, no such free choice is
possible.
Prolog/Epilog happens once per function, and often may be skipped for
small leaf functions, so seems like a lower priority. More so, if one
lacks a good way to optimize it much beyond the sequence of load/store
ops which it would be replacing (and maybe not a way to do it much
faster than however many can be moved in a single clock cycle with the
available register ports).
My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
It likely isn't going to happen because a 1-wide machine isn't going to
have the needed register ports.
But, if one doesn't have the register ports, there is likely no viable
way to move 4 registers/cycle to/from memory (and it wouldn't make sense
for the register file to have a path to memory that is wider than what
the pipeline has).
This is likely the fate of nearly every hobby class ISA.
Time to up your game to an industrial quality ISA.
Open question of what an "industrial quality" ISA has that BJX2 lacks...
Limiting the scope to things that RISC-V and ARM have.
On 8/29/2024 11:23 AM, MitchAlsup1 wrote:
With differing instructions, how does a software vendor write
software such that it can run near optimally on any implementation?
They presumably target whatever is common, or the least common
denominator (such as RV64G or RV64GC), and settle with "probably
good enough"...
But, probably not too much different from other ISAs, just with a
lot more parties involved.
The alternative is that one expects that all the software be
rebuilt for the specific configuration being used,
or recompiled from source or some other distribution format on
the local machine which it is to be run (with binaries distributed
as some form of "portable IR").
In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:
On 8/29/2024 11:23 AM, MitchAlsup1 wrote:
With differing instructions, how does a software vendor write
software such that it can run near optimally on any implementation?
They presumably target whatever is common, or the least common
denominator (such as RV64G or RV64GC), and settle with "probably
good enough"...
ISVs can be proactive or passive about adopting a new ISA.
Variant ISAs create fear, uncertainty and doubt, and that means delay.
ISA promotors fear delay, because their investors will run out of
patience.
John Dallman <[email protected]> schrieb:
In article <vaqgtl$3526$[email protected]>, [email protected] (BGB)
wrote:
On 8/29/2024 11:23 AM, MitchAlsup1 wrote:
ISVs can be proactive or passive about adopting a new ISA.
What is an ISV? I assume "SV" is for "software vendor", but what
does the I stand for?
Variant ISAs create fear, uncertainty and doubt, and that means delay.
ISA promotors fear delay, because their investors will run out of
patience.
Which makes me wonder why companies such as Intel introduce new
instructions all the time.
and it can even make sense to have
architecture-optimized core libraries such as BLAS, or switch on
availability of features such as AVX512
But standard software (office applications, browsers...) should
just run everywhere, and there it gets hard to justify.
Thomas Koenig <[email protected]> writes:
John Dallman <[email protected]> schrieb:[...]
What is an ISV? I assume "SV" is for "software vendor", but what
does the I stand for?
<https://en.wikipedia.org/wiki/Independent_software_vendor>
Variant ISAs create fear, uncertainty and doubt, and that means
delay. ISA promotors fear delay, because their investors will run
out of patience.
Which makes me wonder why companies such as Intel introduce new
instructions all the time.
AMD64 already has the buy-in of application vendors for desktops and
servers, so it does not have the problem that extensions create
uncertainty among application vendors.
My guess is that there are the following motivations:
1) The new instructions make technical sense (for certain
applications).
2) Even if the applications that the users use don't benefit from the
extensions, the users think (thanks also to Intel's marketing) that
they might (because of 1); maybe not today, but maybe the next version
or maybe the application that the user will run in a year or two. And
I certainly have seen reports that this or that game does not work on
K10 or whatever because the game uses some SSE4.2 instruction that the
K10 does not have. Intel could have increased this kind of
obsolescence (and the resulting new sales) through instruction set
extensions by supporting AVX across the board early on (as AMD did),
and later by supporting AVX512 across the board, but Intel marketing apparently thinks it's better to get people to buy Core-branded rather
than Pentium-branded CPUs by disabling AVX for a long time on the
latter.
3) I expect that Intel patents the extensions. So these days
everybody could build an AMD64 CPU, because the patent has expired,
but nobody wants to buy such a CPU without the extensions (because of
2), and the extensions are patented.
and it can even make sense to have
architecture-optimized core libraries such as BLAS, or switch on
availability of features such as AVX512
Yes. And given that a lot of software uses some library or other, a
lot of software may benefit from the extensions. Of course, the
question is how big the benefit is.
E.g., glibc has many different versions of memcpy() and memmove() and
selects among them based on the actual CPU used in the run, thanks to
But standard software (office applications, browsers...) should
just run everywhere, and there it gets hard to justify.
That will also benefit from libraries.
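The run-time selection among several versions of one routine, as mentioned for glibc above, can be sketched in portable C with a function pointer that is chosen once. This is not how glibc does it internally; the probe cpu_has_wide_vectors() is a hypothetical stand-in that always reports "not available" here, and copy_wide is just a placeholder for an extension-specific version.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline version: runs on every implementation of the ISA. */
void *copy_baseline(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Placeholder for a version tuned for some ISA extension. */
void *copy_wide(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Hypothetical feature probe; a real one would query the CPU or OS. */
int cpu_has_wide_vectors(void)
{
    return 0;
}

/* Chooses an implementation based on what the running CPU offers. */
copy_fn select_copy(void)
{
    return cpu_has_wide_vectors() ? copy_wide : copy_baseline;
}
```

The application only ever calls through the selected pointer, so the same binary runs everywhere and the extension-specific code stays confined to the library.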
For browsers the JavaScript and WASM JIT compiler can generate code
specific to the extensions present in the hardware; however, no ISA
extension comes to my mind that a JavaScript or current WASM JIT
compiler will benefit from;
IIRC there is discussion about explicit
vector stuff in WASM, and there the extensions may make a difference.
Also, a friend who works on a JavaVM JIT told me he is working on
auto-vectorization, but I don't know if they really went for that.
Auto-vectorization is not just the wrong approach, it also seems
particularly inappropriate for JIT compilers, because it requires a
lot of analysis, i.e., compile time.
- anton
On Fri, 30 Aug 2024 10:26:38 GMT
[email protected] (Anton Ertl) wrote:
Intel could have increased this kind of
obsolescence (and the resulting new sales) through instruction set
extensions by supporting AVX across the board early on (as AMD did),
and later by supporting AVX512 across the board, but Intel marketing
apparently thinks it's better to get people to buy Core-branded rather
than Pentium-branded CPUs by disabling AVX for a long time on the
latter.
I wish it were only marketing, i.e. if it were only fuses in big-core
derived Pentiums and Celerons.
Unfortunately, the bigger problem was poor work (laziness) of Intel's
engineering, which didn't have AVX, or any form of VEX decoding, in their
Atom line until Gracemont.
It's not marketing, it's engineers, who produced a quite capable core
like Tremont with a level of ISA support 10 years behind its time.
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.
What is missing (in broad terms)?
NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive.
Is there any way to get that list? I've looked, but I only got rough
overview articles and links to the full documentation, which is fairly
overwhelming.
Concerning the demand, RISC-V has the advantage of no ARM tax (and
legal costs like those between ARM and Qualcomm over the
developments started at NUVIA)
Another RISC-V advantage is that the government of the USA puts
restrictions on ARM that should not apply to the free RISC-V
architecture.
It would apply to implementations designed in the USA (such as those
by Ahead), but the point is that on the ISA level, and thus the
buy-in into the ecosystem (e.g., from ISVs), RISC-V has an advantage.
RISC-V also has a technical advantage over ARM: It has Ztso (total
store order) as an optional extension, which helps porting of
multi-threaded software from AMD64 (and emulation of AMD64
software). No such thing on ARMv8 or ARMv9 yet, although
implementations like the Apple M1 and Fujitsu A64FX provide
this feature.
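To illustrate why total store order matters for porting: multi-threaded code written on AMD64 sometimes hands data between threads with plain stores and loads and works only because the hardware keeps stores in order. Portable code has to state the ordering explicitly, roughly as in this C11 sketch (the names payload, ready, publish and try_consume are made up for illustration). On a TSO machine the release store and acquire load typically need no extra fence instructions; on a weaker memory model the compiler has to emit fences or special load/store instructions for them.

```c
#include <stdatomic.h>
#include <stdbool.h>

int payload;            /* ordinary data handed from producer to consumer */
atomic_bool ready;      /* flag that publishes the data; starts out false */

/* Producer: write the data, then set the flag with release ordering so
 * the data store cannot become visible after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: read the flag with acquire ordering; if it is set, the
 * data written before the matching release store is visible. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

Legacy binaries and emulated AMD64 code cannot be fixed up this way, which is where a hardware TSO mode helps.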
But it's also possible they just want to carry on being chip
architects while being in charge of their own company.
Sure. But what are the investors seeing in the company?
Even if an architecture has a long track record, like MIPS, that's
not enough, as the switch from the MIPS ISA to RISC-V shows.
What I read is that the Snapdragon X implements ARM v8.7.
In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:
The alternative is that one expects that all the software be
rebuilt for the specific configuration being used,
ISVs /really/ don't like that. It multiplies their testing and QA and
those are expensive. It rarely shows up problems, but convincing
themselves to do without it is hard for them.
or recompiled from source or some other distribution format on
the local machine which it is to be run (with binaries distributed
as some form of "portable IR").
ISVs get sceptical about that, because it's generating code they have not
tested.
ISVs get sceptical about that, because it's generating code they
have not tested.
Yes, that thinking seems to be a result of C/C++ compiler
shenanigans. People advocating "optimization" based on the
assumption that undefined behaviour does not happen have
suggested that I should keep compiler versions around that
compile my source code as I expect it.
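A small example of the kind of thing at issue, as a sketch only (the function names are made up here). An overflow test written as `x + 1 < x` relies on signed overflow, which is undefined behaviour in C, so a compiler is allowed to assume it never happens and fold the test to false. The version below avoids ever performing the overflowing addition and so means the same thing under every conforming compiler and version.

```c
#include <limits.h>
#include <stdbool.h>

/* Well-defined check: compares against the limit instead of
 * performing the addition that might overflow. */
bool increment_would_overflow(int x)
{
    return x == INT_MAX;
}

/* Adds only when the result is representable; returns false and
 * leaves *out untouched otherwise. */
bool checked_increment(int x, int *out)
{
    if (increment_would_overflow(x))
        return false;
    *out = x + 1;
    return true;
}
```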
In article <[email protected]>, >[email protected] (Anton Ertl) wrote:
[email protected] (John Dallman) writes:
Android is apparently waiting for a new RISC-V instruction setWhich one?
extension;
I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:
https://lists.riscv.org/g/tech-unprivileged/topic/92916241
Searching with various terms suggests it might well be the Zabha
extension, ratified in April this year, but that is deduction.
You may not consider it large-scale, but we wanted to have two
RISC-V servers for teaching (in particular, for the compiler
course).
Makes sense. It is not in itself "large-scale," but suitable hardware is
only going to be available if someone wants a lot of it, enough to make
building it worthwhile.
Now it's two years later, and the RISC-V servers are still not
showing up.
Yup. RISC-V established a lot of awareness, and some expectations, but
there hasn't been the equipment to let people start using it.
In article <[email protected]>, >[email protected] (Anton Ertl) wrote:
[email protected] (John Dallman) writes:
Android is apparently waiting for a new RISC-V instruction setWhich one?
extension;
I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:
https://lists.riscv.org/g/tech-unprivileged/topic/92916241
Searching with various terms suggests it might well be the Zabha
extension, ratified in April this year, but that is deduction.
Now it's two years later, and the RISC-V servers are still not
showing up.
Yup. RISC-V established a lot of awareness, and some expectations, but
there hasn't been the equipment to let people start using it.
In article <[email protected]>, [email protected] (Anton Ertl) wrote:
[email protected] (John Dallman) writes:
Android is apparently waiting for a new RISC-V instruction setWhich one?
extension;
I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:
https://lists.riscv.org/g/tech-unprivileged/topic/92916241
[email protected] (John Dallman) writes:
But making the jump from embedded systems and SBCs to servers has not
happened for RISC-V yet, and looking how long it took to establish ARM
in servers, I expect that RISC-V will take quite a while. I guess
that high-performance cores like those that Ahead is probably working
on are one component along the way.
In article <[email protected]>, [email protected] (Anton Ertl) wrote:
ISVs get sceptical about that, because it's generating code they
have not tested.
Yes, that thinking seems to be a result of C/C++ compiler
shenanigans. People advocating "optimization" based on the
assumption that undefined behaviour does not happen have
suggested that I should keep compiler versions around that
compile my source code as I expect it.
Plain old compiler bugs, introduced while fixing other ones, are quite
enough to make me assume that I'll find problems on each change of
compiler. I have had a manager in a very large software company assure me that it was impossible for them to add bugs while making fixes. His
technical people corrected him immediately, because I'd just laughed.
[email protected] (John Dallman) writes:
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
[email protected] (John Dallman) writes:
Android is apparently waiting for a new RISC-V instruction set
extension;
Which one?
I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:
https://lists.riscv.org/g/tech-unprivileged/topic/92916241
Searching with various terms suggests it might well be the Zabha
extension, ratified in April this year, but that is deduction.
You may not consider it large-scale, but we wanted to have two
RISC-V servers for teaching (in particular, for the compiler
course).
Makes sense. It is not in itself "large-scale," but suitable hardware is
only going to be available if someone wants a lot of it, enough to make
building it worthwhile.
Now it's two years later, and the RISC-V servers are still not
showing up.
Yup. RISC-V established a lot of awareness, and some expectations, but
there hasn't been the equipment to let people start using it.
I expect RISC-V to gradually encroach on the embedded market and as microcontroller IP that can be included in SoC accelerators (primarily
to avoid license fees for the alternatives such as cortex m7).
I don't see it replacing ARM64, X86_64/AMD64 or other server-grade processors.
I find it funny to find this on an Element14 page (the company
formerly known as Acorn, the original A in ARM); Element14 has long
since been bought by Broadcom, but apparently some web presence
still exists.
But making the jump from embedded systems and SBCs to servers has
not happened for RISC-V yet, and looking how long it took to
establish ARM in servers, I expect that RISC-V will take quite a
while. I guess that high-performance cores like those that Ahead
is probably working on are one component along the way.
In article <[email protected]>, [email protected] (Anton Ertl) wrote:
AMD64 already has the buy-in of application vendors for desktops and
servers, so it does not have the problem that extensions create
uncertainty among application vendors.
My guess is that there are the following motivations:
1) The new instructions make technical sense (for certain
applications).
This is sometimes true, but manufacturers tend to over-promote them,
claiming wider applicability and bigger effects than show up in real application code. After a few disappointments, ISVs tend to become less
keen on doing work on marketing advice.
Some manufacturers pay bonuses to their technical marketing people for
getting ISVs to adopt new ISA extensions. This is counterproductive,
because it means the ISVs are sure that the marketing advice will take no
account of their interests.
They prefer to wait until an extension has been out for several years
before supporting it, so that it's available in pretty well all the
end-user hardware that hasn't finished its depreciation yet. That's
driven by a facet of the application software industry that most hardware manufacturers don't seem to understand. They appear to assume that
computers are set up with an initial software load and carry on running
that for their entire lives.
In fact, organisations replace about a quarter of their machines each
year, always buying up-to-date ones, and want to run the /same/ version
of software on all of them. They want common software versions for data compatibility, ease of training and so on. That means that a new release
of an application has to run on all the machines sold in the last four
years, sometimes longer.
Some manufacturers expect ISVs to produce multiple versions of software
for different sets of ISA extensions. They'll do that if the gains are
large enough, but they have to be quite large: for my employer, 25% is enough, but 10% isn't. We haven't had to make a decision in between those numbers yet. We've had one 25% case, for Intel SSE2, and many of 10% or
less.
2) Even if the applications that the users use don't benefit from
the extensions, the users think (thanks also to Intel's marketing)
The sheer flood of extensions from Intel means most end-user
organisations have stopped trying to keep track these days.
John
On 8/29/2024 11:23 AM, MitchAlsup1 wrote:
Time to up your game to an industrial quality ISA.
Open question of what an "industrial quality" ISA has that BJX2 lacks...
Limiting the scope to things that RISC-V and ARM have.
Proper handling of exceptions (ignoring them is not proper)
If you mean FPU exceptions, maybe.
As far as general interrupt handling, mechanism isn't too far off from
what SH-4 had used, and apparently also RISC-V's CLINT and MIPS work in
a similar way.
Though, with differences as to how they divide up exceptions.
In my case:
Reset;
General Fault;
External Interrupt;
TLB/MMU;
Syscall.
Proper IEEE 754-2018 handling of FMAC (compute all the bits)
Possibly true.
My FPU can more-or-less pass the 1985 spec, but not the 2018 spec.
Floating Point Transcendentals
Not present in many/most ISA's I have looked at.
HyperVisors/Secure Monitors
Possible. I had considered doing it essentially with emulators, but
granted, this is not quite the same thing.
Seems many of the extant RV implementations don't have this either.
Write Interrupt service routines entirely in HLL
If you mean C... I do have this.
#ifdef TK_REGSAVE_TBR
__interrupt_tbrsave void __isr_syscall(void)
#else
__interrupt void __isr_syscall(void)
#endif
{
....
}
AKA: What exactly is the '__interrupt' for?...
However, the ISR's can't access virtual memory apart from manually translating the pointers.
The various architectural CR's can be accessed from C as well, such as "__arch_tbr" to access TBR, etc.
proper Privileges and Priorities
?...
Multi-location ATOMIC events
Possibly true.
Maybe the "volatile" mechanism is weak.
If you want to write reliable code that can be distributed as source and
compiled by any conforming C/C++ compiler, you need to be very sure that
you avoid relying on behaviour that is not specified and documented.
It is, of course, a lot easier to write software that appears roughly
correct in the source code and passes its tests, than software that is
rigidly accurate.
I see nothing wrong in blaming programmers for using "memcpy" when they
should have used "memmove" - it was those programmers that made the
error.
On 30/08/2024 17:42, John Dallman wrote:
I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project. Since I work with embedded systems, there are significantly fewer users compared to, say,
x86 target compilers. Thus there is a higher risk of bugs being missed
in beta testing and going unreported for longer. (IME bugs are far more likely in vendor SDK's than in gcc or newlib, but I keep everything
archived just in case.) I also like to have reproducible builds -
something that many Linux distributions are aiming for these days -
which requires archiving the toolchain.
If you want to write reliable code that can be distributed as source and compiled by any conforming C/C++ compiler, you need to be very sure that you avoid relying on behaviour that is not specified and documented. You need to write correct code. That means if you want to copy some memory with overlapping source and destination arrays, you use "memmove" - the function for that purpose. You don't use "memcpy", since it is specified explicitly as requiring non-overlapping arrays.
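As a minimal sketch of the memcpy/memmove distinction described above (the helper name shift_right_by_one is made up for illustration and does not come from any post in this thread): shifting bytes within one buffer means source and destination overlap, which is the case memmove is specified for.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: shift the first n bytes of buf one place to the
   right. Source (buf) and destination (buf + 1) overlap, so memmove is
   the function specified for this; memcpy would be undefined behaviour.
   The caller must provide room for n + 1 bytes. */
static void shift_right_by_one(char *buf, size_t n)
{
    assert(buf != NULL);
    memmove(buf + 1, buf, n);
}
```

Called on a char buf[8] holding "abcdef" with n = 6, this leaves "aabcdef" in the buffer. Whether memcpy appears to work in the same place depends on the library version and the hardware, which is the kind of breakage discussed in this thread.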
Be liberal in what you accept, and conservative in what you send.
David Brown <[email protected]> writes:
If you want to write reliable code that can be distributed as source and
compiled by any conforming C/C++ compiler, you need to be very sure that
you avoid relying on behaviour that is not specified and documented.
GCC and Clang/LLVM are distributed in source code, and given that
their maintainers find it ok to compile programs to arbitrary code if
they do not meet your expectations, one should expect that they do not
rely on behaviour that is not specified and documented, and never have
(at least not since adopting this attitude). But even they are not up
to the task. As John Regehr writes
<https://blog.regehr.org/archives/761>:
|LLVM/Clang 3.1 and GCC (SVN head from July 14 2012) [...] execute
|undefined behaviors even when compiling an empty C or C++ program with
|optimizations turned off.
I am not surprised that nobody has risen to my challenge <[email protected]>:
|Write a proof-of-concept Forth interpreter in the language you
|advocate that runs at least one of bubble-sort, matrix-mult or sieve
|from bench/forth in
|<http://www.complang.tuwien.ac.at/forth/bench.zip>
in the last 7 years.
It is, of course, a lot easier to write software that appears roughly
correct in the source code and passes its tests, than software that is
rigidly accurate.
I never heard about "rigidly accurate" as a property of software
(except maybe numeric software).
The practice is that software is either tested (the usual case) or
formally proved correct. For a C program to be formally proved
correct would, first and foremost, require a formal specification of C.
I see nothing wrong in blaming programmers for using "memcpy" when they
should have used "memmove" - it was those programmers that made the
error.
I did not expect *you* to see what's wrong. But I hope that I never
have anything to do with anything that you programmed.
What's wrong with blaming the application programmers is that it does
not help the users of the binary that misbehaved after glibc was
"up"graded. It also does not help users who have a no-longer
maintained piece of source code that used to work with earlier
versions of glibc, but now acts up on some hardware. Sure, there are workarounds, but first the user would have to understand the problem.
- anton
On 30/08/2024 17:42, John Dallman wrote:
Plain old compiler bugs, introduced while fixing other ones, are
quite enough to make me assume that I'll find problems on each
change of compiler.
I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project.
If you want to write software that is "correct because it passed
its tests", you can only expect it to be reliable when it is run
exactly as tested. That means it must be compiled as it was during
tests (same compiler, same options, same library), and arguably
even run only on the same hardware (if you only test on one
particular cpu, OS, etc., you can only be sure it works on that
cpu, OS, etc.).
That's why a lot of pre-compiled commercial software gives
particular versions of particular OS's or Linux distributions in
their lists of requirements - even though the software would
probably work fine on a much wider range.
I assume you work in the high end, as the average desktop PC is
replaced every 8 years on a _use it until it breaks_ policy.
On Fri, 30 Aug 2024 16:28:08 +0000, David Brown wrote:
On 30/08/2024 17:42, John Dallman wrote:
I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project. Since I work with
embedded systems, there are significantly fewer users compared to, say,
x86 target compilers. Thus there is a higher risk of bugs being missed
in beta testing and going unreported for longer. (IME bugs are far more
likely in vendor SDK's than in gcc or newlib, but I keep everything
archived just in case.) I also like to have reproducible builds -
something that many Linux distributions are aiming for these days -
which requires archiving the toolchain.
There was once a CAD software vendor that made the transition from
SunOS to Solaris and we as a major purchaser could not follow due
to several OS differences: SunOS had a license server that counted
licenses while Solaris had a license server that counted the cross
product of licenses*cores. We as a small company could not afford to
upgrade to Solaris. Then their new product simply had different bugs.
We chose to stay with the old SW because we knew where all the bugs
were and how not to stimulate them into nasal demons. Ultimately
they got bought out and disappeared...
On 8/30/2024 1:11 PM, MitchAlsup1 wrote:
On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:
Integer Overflow
Not usually a thing. Pretty much everything seems to treat integer
overflow as silently wrapping.
Bad Instruction encoding--OpCode exists but not as this
instruction uses it. Random code generation can use
every instruction without privilege.
Hit or miss.
Will usually fault on invalid instructions.
There is logic in place to reject privileged instructions in user-mode,
if the CPU is actually run in user-mode. Some of this is still TODO (currently, TestKern is still running everything in Supervisor Mode).
The alternative is to treat them as UB, so they may be one of:
Trap;
Do something else (like, if an instruction was added);
Do something wonky / unintended.
In practice, this seems to be more how it works.
Bad address--address exists but you are not allowed to touch it
with LD or ST instruction or to attempt to execute it.
If the MMU is enabled, it should fault on bad memory accesses.
In physical addressing mode, it does not trap.
IIRC, there was a mechanism on the bus to deal with accesses to bad
physical addresses (returning all zeroes). Otherwise, trying to access
an invalid address would cause the CPU to deadlock.
As I understand it, you don't even get FMUL correctly rounded.
To get it properly rounded you have to compute the full 53*53
product.
AFAICT, this wasn't required for the 1985 spec...
Things like "optional trap on denormal" seems like it should be OK (this
is what MIPS and friends did at the time).
For the most part, seems like the '85 spec was more "uses these formats
and gets more or less the same values, good enough". A lot of the
pedantic rounding stuff, etc, seemed to be more something for the 2008
spec.
The lack of single-rounded FMA shouldn't matter, since this wasn't added until later.
Support for Binary16 is a bonus feature (since 85 spec only gave Single
/ Double / Extended), but Binary16 is useful...
Floating Point Transcendentals
Not present in many/most ISA's I have looked at.
Its time has come.
Then who has done it, besides x87 and similar?...
Not going to put much weight in something if:
The only real known example is the legacy x87 ISA;
Pretty much everyone else (including on x86-64) is using unrolled Taylor-series expansion and similar.
HyperVisors/Secure Monitors
Possible. I had considered doing it essentially with emulators, but
granted, this is not quite the same thing.
How can something of lesser privilege emulate something of greater
privilege ??
Top level OS (or hypervisor layer) runs an emulator, which runs any VMs holding guest OS instances.
Granted, running the main OS in an emulator wouldn't be great for performance. But, in most contexts, this isn't really a thing.
Like, pretty sure Windows and Linux still tend to run bare-metal on most systems, ... (or, if a VM layer exists, it is unclear what if-any
purpose it would serve).
But, in any case, one doesn't need any special ISA level support to make things like QEMU and DOSBox work.
And, if a person wants to essentially use something like QEMU to run the whole OS, nothing really is stopping them.
Well, except maybe how slow QEMU and DOSBox tend to be on something
like a RasPi (on a 50MHz CPU, one would likely be hard-pressed to even
run something like SimCity at acceptable speeds).
Not yet tried porting something like DOSBox to my stuff though...
But, a more clever emulator could likely leverage things like hardware address translation and maybe only JIT parts of the target system (vs,
say, fully emulating the memory access and using JIT compilation or interpretation for "pretty much everything").
Say, for example, if the host system and guest OS are running the same
ISA (vs, say, the guest OS running x86 or x86-64; on a host running a different ISA).
In article <[email protected]>, [email protected] (Anton Ertl) wrote:
Concerning the demand, RISC-V has the advantage of no ARM tax (and
legal costs like those between ARM and Qualcomm over the
developments started at NUVIA)
True, although the market for high-performance application cores is less price-sensitive than the market for low-performance embedded ones.
The clang/gcc maintainers' POV violates the first part of Postel's Law:
Be liberal in what you accept, and conservative in what you send.
Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.
Bernd Linsel <[email protected]> schrieb:
The clang/gcc maintainers' POV violates the first part of Postel's Law:
Be liberal in what you accept, and conservative in what you send.
Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.
Maybe a bit more elaborate:
#include <stdio.h>
int main()
{
int i;
sscanf("%d", &i);
The clang/gcc maintainers' POV violates the first part of Postel's Law:
Be liberal in what you accept, and conservative in what you send.
Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.
Bernd Linsel <[email protected]> schrieb:
The clang/gcc maintainers' POV violates the first part of Postel's Law:
Be liberal in what you accept, and conservative in what you send.
Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.
Maybe a bit more elaborate:
#include <stdio.h>
int main()
{
int i;
scanf("%d", &i);
return 0;
}
Should this be warned about?
Or what about
void foo(int *a)
{
*a ++;
}
Two possible cases of undefined behavior here: a could be an
invalid pointer, and the arithmetic operation could overflow.
On 31.08.24 11:24, Thomas Koenig wrote:
Bernd Linsel <[email protected]> schrieb:
The clang/gcc maintainers' POV violates the first part of Postel's Law:
Be liberal in what you accept, and conservative in what you send.
Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.
Maybe a bit more elaborate:
#include <stdio.h>
int main()
{
int i;
scanf("%d", &i);
return 0;
}
Should this be warned about?
[corrected sscanf -> scanf]
Why? This "program" has the purpose to read one line, presumably
containing an integer number, from stdin and ignore it. No UB anywhere.
So, sorry for the too-quick examples earlier...
What about
int foo (int a)
{
return a + 1;
}
or
int foo(int *a)
{
return *a;
}
Both can exhibit undefined behavior, and for both it
is impossible for the compiler to tell at compile-time.
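A sketch of what the programmer can do by hand for these two cases, since the compiler cannot decide them at compile time. The helper names checked_increment and checked_load are hypothetical, and the pointer check only covers the null case: a dangling or misaligned pointer cannot be detected from inside the function.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: computes a + 1 without ever executing a signed
   overflow. Returns false and leaves *out untouched when a == INT_MAX. */
static bool checked_increment(int a, int *out)
{
    assert(out != NULL);
    if (a == INT_MAX)
        return false;   /* a + 1 would overflow, which is undefined */
    *out = a + 1;
    return true;
}

/* Hypothetical helper: reads through a only after a null check.
   A non-null but invalid pointer still slips through. */
static bool checked_load(const int *a, int *out)
{
    assert(out != NULL);
    if (a == NULL)
        return false;
    *out = *a;
    return true;
}
```

The test against INT_MAX is done before the addition, so the check itself never depends on overflow behaviour.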
If you want to write reliable code that can be distributed as source and
compiled by any conforming C/C++ compiler, you need to be very sure that you
avoid relying on behaviour that is not specified and documented. You need to
write correct code. That means if you want to copy some memory with
overlapping source and destination arrays, you use "memmove" - the function
for that purpose. You don't use "memcpy", since it is specified explicitly
as requiring non-overlapping arrays.
The difficulty here is that the tools provide very little help for that, because all too often it's virtually impossible for the tools to
understand that this particular code can/will hit UB.
So it's all up to the programmer, who often doesn't know either.
Other than using CompCert, I don't know of any reliable way for
a programmer to make sure his C code does not suffer from UB.
In article <vasruo$id3b$[email protected]>, [email protected] (David Brown) wrote:
On 30/08/2024 17:42, John Dallman wrote:
Plain old compiler bugs, introduced while fixing other ones, are
quite enough to make me assume that I'll find problems on each
change of compiler.
I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project.
I always have at least a couple of machines at the previous build
standard of any platform, often more machines and/or older build
standards.
Changing compilers or libraries is done at new major releases.
If you want to write software that is "correct because it passed
its tests", you can only expect it to be reliable when it is run
exactly as tested. That means it must be compiled as it was during
tests (same compiler, same options, same library), and arguably
even run only on the same hardware (if you only test on one
particular cpu, OS, etc., you can only be sure it works on that
cpu, OS, etc.).
This is simpler when you produce closed-source binary software. We only
ship builds we've tested. That means the /same binaries/ as we tested,
not rebuilt or modified. This requires a separate test harness, rather
than testing code compiled into the binaries.
We test on a wide variety of hardware for the most-used platforms, by
putting it into the distributed testing pools and always knowing which machine an individual test case ran on, because it's recorded in the test results.
That's why a lot of pre-compiled commercial software gives
particular versions of particular OS's or Linux distributions in
their lists of requirements - even though the software would
probably work fine on a much wider range.
We specify what we specifically support, because we've tested that, plus
the much broader requirements that it should work on. For Linux those are
a GCC runtimes version (currently 8.x) or later and a glibc version (currently 2.28) or later. We don't seem to have problems with
compatibility since we understood how the compatibility works with those libraries, and started doing it that way.
If there's a problem on a specifically supported Linux, we'll fix it
unless that's impossible. If there's a problem on one where it should
work, we'll investigate it, and fix it if we can, which may cause a distribution to be added to the specifically supported list. If we can't
fix a problem, we'll explain why not, and normally add the problem to the documentation. We can't do miracles, but we do pretty well.
Yes, doing good support is expensive, but it pays off in customer loyalty, which means money.
On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:
Maybe my previous post was not clear enough: It's not a general UB
detector that I'd like to have integrated into the compiler (there are
static checker tools available that can nearly perfectly do that);
instead, I'd like to get a warning when the compiler does something
other than you would expect when reading the code in a "do what I mean"
manner.
Definitely - if you have 512 GB DDR5 memory in your workstation, the
cost of the CPU itself is a relatively small fraction.
Thomas Koenig <[email protected]> writes:
Definitely - if you have 512 GB DDR5 memory in your workstation, the
cost of the CPU itself is a relatively small fraction.
Reality check:
EUR
2400 =8*300 8*64GB MTC40F2046S1RC48BA1R Micron RDIMM 64GB, DDR5-4800
9300 AMD Ryzen Threadripper PRO 7995WX 96C boxed
The Intel side is a little cheaper, but also offers fewer cores:
4100 Intel Xeon w9-3475X, 36C boxed
6800 Intel Xeon w9-3495X, 56C tray
In any case, all three CPUs are significantly more expensive than
512GB of RAM.
Of course the fans of compilers that do what nobody means found a counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.
Anton Ertl <[email protected]> schrieb:
Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.
Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.
What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.
An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.
As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.
On 31.08.24 21:08, Thomas Koenig wrote:
Anton Ertl <[email protected]> schrieb:
Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.
Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.
What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.
An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.
As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements such as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.
I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but because of other reasons.
So the things that Anton mentioned -- using size_t (or suitable other unsigned types) for iteration variables, pulling invariants out of
loops, and many more common optimizations -- can still be found in my
source codes.
PS: I find -fno-strict-overflow and -fno-strict-aliasing of value, too,
while I found that -fdelete-null-pointer-checks together with -Wnull-pointer-dereference has some utility.
On 8/30/2024 7:11 PM, Paul A. Clayton wrote:
On 8/28/24 11:36 PM, BGB wrote:
On 8/28/2024 11:40 AM, MitchAlsup1 wrote:[snip]
My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
It likely isn't going to happen because a 1-wide machine isn't going
to have the needed register ports.
For an in-order implementation, banking could be used for saving
a contiguous range of registers with no bank conflicts.
Mitch Alsup chose to provide four read/write ports with the
typical use being three read, one write instructions. This not
only facilitates faster register save/restore for function calls
(and context switches/interrupts) but presents the opportunity of
limited dual issue ("CoIssue").
I was mostly doing dual-issue with a 4R2W design.
Initially, 6R3W won out mostly because 4R2W disallows an indexed store
to be run in parallel with another op; but 6R3W did allow this. This
scenario made enough of a difference to seemingly justify the added cost
of a 3-wide design with a 3rd lane that goes mostly unused (and is
mostly limited to register MOV's and basic ALU ops and similar).
But, then this leads to an annoyance:
As is, I will need to generate different code for 1W, 2W, and 3W configurations;
It is starting to become tempting to generate code resembling that for
the 1W case (albeit still using the shuffling that would be used when bundling), and then using superscalar (since, it turns out, it is not
quite as expensive as I had thought).
With superscalar, I wouldn't have the issue of 2W and 3W cores having
issues running code built for the other.
Also, on both 2W and 3W configurations, I can have a 128-bit MOV.X (load/store pair) instruction, so if one assumes 2-wide as the minimum,
this instruction can be safely assumed to exist.
I can mostly ignore 1-wide scenarios (2R1W and 3R1W), mostly as I have
ended up mostly deciding to relegate these to RISC-V.
By the time I have stripped down BJX2 enough to fit into a small FPGA,
it essentially has almost nothing to offer that RV wouldn't offer
already (and it makes more practical sense to use something like RV32IM
or similar).
I am not sure how one would efficiently pull off a 4W write operation.
Can note that generally, the GPR part of the register file can be built
with LUTRAMs, which on Xilinx chips have the property:
1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.
This means, the number of LUTRAMs needed for an N-read, M-write register file with G registers can be calculated:
2R1W, 32, Cost=44
3R1W, 32, Cost=66
4R2W, 32, Cost=176
6R3W, 32, Cost=396
4R4W, 32, Cost=352
6R4W, 32, Cost=528
2R1W, 64, Cost=64
3R1W, 64, Cost=96
4R2W, 64, Cost=256
6R3W, 64, Cost=576
4R4W, 64, Cost=512
6R4W, 64, Cost=768
10R5W, 64, cost=1600.
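The costs listed above are consistent with a simple model, which is inferred from the numbers rather than stated in the post: 64-bit registers, one LUTRAM bank per (read port, write port) pair, and 3 data bits per LUTRAM for 32 registers or 2 data bits for 64 registers. A sketch under those assumptions (the function name lutram_cost is made up here):

```c
#include <assert.h>

/* Assumed register width; the post does not state it, but 64 bits
   reproduces every row of the table. */
enum { REG_WIDTH_BITS = 64 };

/* Hypothetical cost model: read_ports * write_ports banks, each bank
   needing enough LUTRAMs to hold one 64-bit word per register.
   32 registers -> 5-bit address, 3 data bits per LUTRAM.
   64 registers -> 6-bit address, 2 data bits per LUTRAM. */
static int lutram_cost(int read_ports, int write_ports, int num_regs)
{
    assert(num_regs == 32 || num_regs == 64);
    int bits_per_lutram = (num_regs == 32) ? 3 : 2;
    /* Round up: 64/3 needs 22 LUTRAMs, 64/2 needs 32. */
    int lutrams_per_bank =
        (REG_WIDTH_BITS + bits_per_lutram - 1) / bits_per_lutram;
    return read_ports * write_ports * lutrams_per_bank;
}
```

For example, lutram_cost(6, 3, 64) gives 6 * 3 * 32 = 576, matching the 6R3W row for 64 registers. The model says nothing about the extra logic needed to arbitrate between write ports.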
I am not sure about ASIC.
For FPGA, pretty sure that bidirectional ports would gain little or
nothing over fixed-direction ports (since bidirectional IO is not a
thing, and the internal logic is almost entirely different between a
read and write port).
Bernd Linsel <[email protected]> schrieb:
On 31.08.24 21:08, Thomas Koenig wrote:
Anton Ertl <[email protected]> schrieb:
Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.
Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.
What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.
An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.
As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements such as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.
A specification is a specification, but it seems you do not grasp
the concept. It seems a curious mental gap in some people who
think that it means fundamentally different things in different fields.
But if you insist on putting some extra constraints on compiler
writers, apart from the official standards, feel free to write them
down (but please in a concise manner) and try to get them accepted,
preferably by the relevant standards committees. But you should know
that writing a specification that is unambiguous and clear is
hard work, and needs a lot of discussion and reviews.
Or fork either gcc or LLVM (or both) and implement whatever
restrictions you want, and if you can convince the maintainers
of these compilers that it is a good idea to fold in your changes,
they may do so.
If you can make your case to enough people (or companies),
then you will find enough volunteers and/or funding to do so.
Snide remarks about compiler writers on comp.arch aren't going
to have any meaningful impact, I'm afraid; if anything, they will
lower your chance of success.
But of course that depends on your definition of success - do
you want to achieve anything, or do you want to aggravate people?
If it is the latter, then your chance of success might be a
bit higher.
I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but because of other reasons.
So you learned programming by ignoring the specifications that
were available. Well, sometimes making progress means unlearning
something.
Bernd Linsel <[email protected]> schrieb:
On 31.08.24 21:08, Thomas Koenig wrote:
Anton Ertl <[email protected]> schrieb:
Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic >>>> powers to know what you mean.
Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.
What compiler writers generlly agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). Howewer, the concept of a specification is
something that you do not appear to understand, and maybe never
will.
An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand only 16 bar,
the customer has no cause for complaint.
As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves rupture discs.
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping to requirements such as vessel wall thickness,
geometry etc., while compiler writers are free in their search for
optimization tricks that let them shine at SPEC benchmarks.
A specification is a specification, but it seems you do not grasp
the concept. It seems a curious mental gap in some people who
think that it means fundamentally different things in different fields.
But if you insist on putting some extra constraints on compiler
writers, apart from the official standards, feel free to write them
down (but please in a concise manner) and try to get them accepted,
preferably by the relevant standards committees. But you should know
that writing a specification that is unambiguous and clear is
hard work, and needs a lot of discussion and reviews.
Or fork either gcc or LLVM (or both) and implement whatever
restrictions you want, and if you can convince the maintainers
of these compilers that it is a good idea to fold in your changes,
they may do so.
If you can make your case to enough people (or companies),
then you will find enough volunteers and/or funding to do so.
Snide remarks about compiler writers on comp.arch aren't going
to have any meaningful impact, I'm afraid; if anything, they will
lower your chance of success.
But of course that depends on your definition of success - do
you want to achieve anything, or do you want to aggravate people?
If it is the latter, then your chance of success might be a
bit higher.
I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but because of other reasons.
So you learned programming by ignoring the specifications that
were available. Well, sometimes making progress means unlearning
something.
On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
On 31.08.24 21:08, Thomas Koenig wrote:
Anton Ertl <[email protected]> schrieb:
Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.
Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.
What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.
An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.
As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping to requirements such as vessel wall thickness,
geometry etc., while compiler writers are free in their search for
optimization tricks that let them shine at SPEC benchmarks.
A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service, due to stress
and strain acting on the base materials.
Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were, perhaps we would have better code today...
On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:
On 8/30/2024 1:11 PM, MitchAlsup1 wrote:
On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:
Integer Overflow
Not usually a thing. Pretty much everything seems to treat integer
overflow as silently wrapping.
Ada wants these.
Bad Instruction encoding--OpCode exists but not as this
  instruction uses it. Random code generation can use
  every instruction without privilege.
Hit or miss.
Will usually fault on invalid instructions.
Must be 100% to guarantee upwards compatibility.
There is logic in place to reject privileged instructions in user-mode,
if the CPU is actually run in user-mode. Some of this is still TODO
(currently, TestKern is still running everything in Supervisor Mode).
Yes, it is a pain--but a pain that is absolutely worth it.
The alternative is to treat them as UB, so they may be one of:
Trap;
Do something else (like, if an instruction was added);
Do something wonky / unintended.
In practice, this seems to be more how it works.
Bad practice == not industrial quality.
Bad address--address exists but you are not allowed to touch it
  with LD or ST instruction or to attempt to execute it.
If the MMU is enabled, it should fault on bad memory accesses.
In physical addressing mode, it does not trap.
YOU FAIL TO UNDERSTAND--there is an area in memory where the
preserved registers are stored--stored in a way that only 3
instructions can access--and the PTE is marked RWE=000
This prevents damaging the contract between callee and caller.
3 instructions can access these pages ENTER, EXIT and RET
nothing else.
IIRC, there was a mechanism on the bus to deal with accesses to bad
physical addresses (returning all zeroes). Otherwise, trying to access
an invalid address would cause the CPU to deadlock.
It is NOT a BAD address--it is a good but inaccessible address
outside those 3 instructions.
As I understand it, you don't even get FMUL correctly rounded.
To get it properly rounded you have to compute the full 53*53
product.
AFAICT, this wasn't required for the 1985 spec...
You cannot get rounding correct unless you "compute as if to
infinite precision" and then follow the rules of rounding
(all modes).
Undefined behaviour is something that is exercised at run-time.
That's why the "undefined behaviour sanitizers" insert run-time
checks. And of course they only detect the behaviour when it is
actually exercised. I.e., they usually will not detect overflowable
buffers, because your usual test inputs don't exercise those.
In article <[email protected]>, [email protected] (Anton Ertl) wrote:
Undefined behaviour is something that is exercised at run-time.
That's why the "undefined behaviour sanitizers" insert run-time
checks. And of course they only detect the behaviour when it is
actually exercised. I.e., they usually will not detect overflowable
buffers, because your usual test inputs don't exercise those.
That's among the many reasons why there is no single way "to make code
secure." For string buffers, you turn on the compiler run-time checks,
and use the length-checking versions of string handling functions. Then
you write tests to check both of those are actually working.
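(To make the "length-checking versions" point concrete, here is a minimal C sketch. The helper name copy_checked is made up for illustration and is not from the thread; it relies only on the standard snprintf(), which never writes more than the given size and reports how long the full result would have been.)

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, which has room for dst_size
 * bytes including the terminating NUL.
 * Returns 0 if the whole string fit, -1 if it was truncated or if
 * dst_size is 0. On truncation dst still holds a NUL-terminated prefix. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return -1;

    /* snprintf returns the length the full output would have had. */
    int n = snprintf(dst, dst_size, "%s", src);
    if (n < 0 || (size_t)n >= dst_size)
        return -1;

    return 0;
}
```

The second half of the advice applies here too: a test that feeds in an over-long string and checks for the -1 is what shows the check is actually working.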
Then you discover that the C++ string[] operator is not bounds-checked,
as per the C++ standard, but string.at() is bounds-checked, and curse a
bit.
Anton Ertl <[email protected]> schrieb:
Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.
Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.
What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.
An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.
As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.
On 01/09/2024 12:21, John Dallman wrote:
Then you discover that the C++ string[] operator is not
bounds-checked, as per the C++ standard, but string.at()
is bounds-checked, and curse a bit.
But surely you would discover that before using the std::string
type? I might do some quick test code using "stuff copied off the
internet", but for any serious programming I would want to read the
specifications of a type or function before using it. That's the
only way to be sure you are writing correct code.
BGB wrote:
I am not sure how one would efficiently pull off a 4W write operation.
Can note that generally, the GPR part of the register file can be
built with LUTRAMs, which on Xilinx chips have the property:
1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.
This means the number of LUTRAMs needed for an NRMW register file
(N read ports, M write ports) with G registers can be calculated:
2R1W, 32, Cost=44
3R1W, 32, Cost=66
4R2W, 32, Cost=176
6R3W, 32, Cost=396
4R4W, 32, Cost=352
6R4W, 32, Cost=528
2R1W, 64, Cost=64
3R1W, 64, Cost=96
4R2W, 64, Cost=256
6R3W, 64, Cost=576
4R4W, 64, Cost=512
6R4W, 64, Cost=768
10R5W, 64, Cost=1600
There is also the MUX logic and similar, but should follow the same
pattern.
There is a bit-array (2b per register) to indicate which of the arrays
holds each register. This ends up turning into FFs, but doesn't matter
as much.
In the Verilog, one can write it as-if there were only 1 array per
write port, with the duplication (for the read ports) handled
transparently by the synthesis stage (convenient), although it still
has a steep resource cost.
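(The cost table above can be reproduced with a small C sketch. Two things here are assumptions rather than statements from the post: that the registers are 64 bits wide, and that the cost is simply read ports x write ports x the LUTRAMs needed to cover one register's width. Those assumptions do match every figure listed. The function names are made up.)

```c
/* Assumption: 64-bit registers; this width reproduces the listed costs. */
enum { REG_BITS = 64 };

/* Data bits one LUTRAM provides for a given register count:
 * 5-bit address -> 3 data bits, 6-bit address -> 2 data bits.
 * Returns 0 for sizes this sketch does not cover. */
static int lutram_data_bits(int num_regs)
{
    if (num_regs > 0 && num_regs <= 32)
        return 3;
    if (num_regs > 32 && num_regs <= 64)
        return 2;
    return 0;
}

/* LUTRAM count for an NRMW register file: one 1R1W array is needed
 * per (read port, write port) pair. Returns -1 on unsupported input. */
int regfile_lutram_cost(int read_ports, int write_ports, int num_regs)
{
    int bits = lutram_data_bits(num_regs);
    if (bits == 0 || read_ports <= 0 || write_ports <= 0)
        return -1;

    int per_array = (REG_BITS + bits - 1) / bits;   /* ceil(64 / bits) */
    return read_ports * write_ports * per_array;
}
```

For example, regfile_lutram_cost(6, 3, 32) gives 6 * 3 * 22 = 396 and regfile_lutram_cost(10, 5, 64) gives 10 * 5 * 32 = 1600, as in the table.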
Since you are targeting 50 MHz, 20 ns per stage, and those LUTRAMs
possibly run at 500 MHz, and assuming the read port numbers are
ready at the start of the cycle, one might multi-pump the register
file read port access and save a pile on read banks and muxes.
For example, you could 4-pump the read port at 5 ns per read,
the LUTRAM read access taking 2 ns and 3 ns for muxing and routing.
That should divide your numbers above by more than 4 because some
muxing becomes simpler too (fewer sources).
You can't multi-pump the write access as the write port data usually
isn't ready until the end of the cycle.
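(The arithmetic behind the multi-pumping suggestion, as a rough C sketch. The function names are made up, and the model is a simplification: it only counts how many physical read banks remain, and ignores the registers needed to hold the pumped read results as well as any mux savings.)

```c
/* How many sequential reads fit into one pipeline cycle,
 * e.g. a 20 ns cycle with 5 ns per read gives 4. */
int pump_factor(int cycle_ns, int read_ns)
{
    if (cycle_ns <= 0 || read_ns <= 0)
        return 0;
    return cycle_ns / read_ns;
}

/* Physical read banks still needed when each bank serves up to
 * `pump` logical read ports per cycle. Returns -1 on bad input. */
int pumped_read_banks(int read_ports, int pump)
{
    if (read_ports <= 0 || pump <= 0)
        return -1;
    return (read_ports + pump - 1) / pump;   /* ceil(read_ports / pump) */
}
```

With a pump factor of 4, a 6R file would need 2 physical read banks and a 10R file 3, so in this simple model the bank count only divides by exactly 4 when the port count is a multiple of 4. Write ports are unaffected, for the reason given above.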
On 9/1/2024 1:34 AM, Terje Mathisen wrote:
MitchAlsup1 wrote:
It was a revelation to me when I wrote my first fp emulation code and
grok'ed how having a single guard bit followed by a sticky bit was
sufficient to do this for all rounding modes.
At that point I only needed to maintain enough intermediate bits to
guarantee I would still have those rounding bits after normalization.
This doesn't mean that I could skip calculating all the bits of the full
NxN->2N mantissa product, only that I didn't need to keep them all
around after normalization.
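(A minimal C sketch of the guard-plus-sticky idea, under assumptions: it uses 24-bit significands as in binary32 so that the exact product fits in a uint64_t, and it leaves out exponents, subnormals, overflow and NaN handling entirely. The names mul_round24 and round_mode are made up for this illustration.)

```c
#include <stdint.h>

/* Nearest-even, toward zero, toward +inf, toward -inf. */
enum round_mode { RNE, RTZ, RUP, RDN };

/* Multiply two normalized 24-bit significands (2^23 <= a, b < 2^24)
 * and round the exact 48-bit product back to 24 bits.
 * negative:    sign of the result, only matters for RUP and RDN.
 * *exp_adjust: how much the result exponent has to be increased. */
uint32_t mul_round24(uint32_t a, uint32_t b, int negative,
                     enum round_mode mode, int *exp_adjust)
{
    uint64_t p = (uint64_t)a * b;        /* exact, in [2^46, 2^48) */
    int shift = (p >> 47) ? 24 : 23;     /* normalize: low bits to drop */

    uint32_t kept = (uint32_t)(p >> shift);
    int guard  = (int)((p >> (shift - 1)) & 1);            /* first dropped bit */
    int sticky = (p & ((1ULL << (shift - 1)) - 1)) != 0;   /* OR of the rest */
    int inexact = guard | sticky;
    int round_up;

    switch (mode) {
    case RNE: round_up = guard && (sticky || (kept & 1)); break;
    case RUP: round_up = inexact && !negative;            break;
    case RDN: round_up = inexact && negative;             break;
    default:  round_up = 0;                               break;  /* RTZ */
    }

    *exp_adjust = shift - 23;
    if (round_up) {
        kept++;
        if (kept == (1u << 24)) {        /* rounding carried out of 24 bits */
            kept >>= 1;
            ++*exp_adjust;
        }
    }
    return kept;
}
```

Note that the sticky bit is the OR of every dropped bit below the guard, so the full product still has to be computed; only the storage after normalization shrinks to two extra bits.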
OK.
It seemed like when I looked over the 1985 spec initially, it only
required that the result be larger than that of the destination
(seemingly missed the point of it also requiring infinite precision).
Say, 54*54 => 68 bits, where 68 > 52, under this interpretation, it
would have worked. Granted, this does turn it into a probability game
whether the result is correct or off by 1.
But, have now since noticed that it did specify computing to infinite
precision (in this version of the standard), which my FPU does not do.
There was mention of some operations that I have generally not seen in
the ISA in real-world FPUs:
An FP remainder operator;
Converters to/from ASCII strings;
An FP->Int truncate operator with the result still in FP format;
Usually, one goes round-trip FP->Int->FP;
...
Seems like pretty much everyone offloaded these tasks to the C library.
I had ended up with coverage of most of the rest, albeit still lacking a
"trap on denormal" handler (seemingly worked for MIPS and friends, *).
So, it seemed like it was getting pretty close to "could maybe pass the
1985 spec if one lawyers it...". Maybe not so much it seems, unless I
fix the FMUL issue (TBD if it can be done without significantly
increasing adder-chain latency).
It is possible I could also add a check to detect and trap multiplies
for cases where both values have non-zero low-order bits (allowing these
to also be emulated in software).
So, went and added a flag for "Trap as needed to emulate full IEEE
semantics" to FPSCR, where the idea is that enabling this will cause it
to trap in cases where the FPU detects that the results would likely not
match the IEEE standard (if using FADDG/FSUBG/FMULG/..., generally if
fenv_access is enabled).
Might make sense to have a compiler option to assume fenv_access is
always enabled.
*: Though, from what I can gather, most of the N64 games and similar had
operated with this disabled (giving DAZ/FTZ semantics) which apparently
posed an annoyance for later emulators (things like moving platforms in
games like SMB64 would apparently slowly drift upwards or away from the
origin if the map was left running for long enough, etc; due to SSE and
similar tending to operate with denormals enabled).
With FMAC (with single rounding, which is the interesting one) you can of
course get catastrophic cancellation, so you need all the 2N mantissa
bits of the multiplication plus the N bits from the addend, then you
either need a normalizer wide enough to take in any possible alignments
of the two parts, or you must have separate logic for each of the major
cases.
Yeah, for the 2008 spec onward, would also need this...
It is possible to provide it as a library call, but granted this makes
it slower.
There are FMAC instructions, but they are currently both slow and
double-rounded (so, not so useful). Well, except for Binary16 and
Binary32 which appear single-rounded mostly because they happen to be
performed internally as Binary64 (but are still slow).
On 31/08/2024 21:08, Thomas Koenig wrote:
Anton Ertl <[email protected]> schrieb:
Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.
Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.
What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.
An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.
As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.
That is very well put.
Specifications are an agreement between the supplier and the client.
The supplier promises particular functionality if the client stays
within those specifications. It is how things work in a huge range of
aspects of life. Sometimes there are agreements in place for what
happens if the specifications are broken (fine if you fail to deliver as
promised, jail sentence if you break the law, etc.), but these are
really just extensions of the agreement and specification.
If we think about computing, we can start with mathematics for examples.
A mathematical function maps one set onto another - it specifies what
value in the output set is produced from each value in the input set.
It does not specify the result for values that are not in the input set,
even if they are in a "reasonable" superset. So the real square root
function specifies an output for all non-negative real numbers - it does
not specify the result for negative real numbers. Attempting to find
the square root of a negative number is undefined behaviour.
Functions in computing are the same. You have a specification - a
pre-condition, and a post-condition. The inputs (including the
environment, if that is relevant) have to satisfy the pre-condition, and
then the function guarantees that the post-condition will hold after the
function call. Try to put anything else into the function without
satisfying the pre-condition, and it's garbage in, garbage out. If you
don't understand "garbage in, garbage out", you really don't understand
the first thing about software development. This has been understood
since the beginning of the programmable computer:
"""
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into
the machine wrong figures, will the right answers come out?' I am not
able rightly to apprehend the kind of confusion of ideas that could
provoke such a question.
"""
In the context of compilers, the specification is the language standard
in use at the time, combined with the specifications of any library
functions or other code being used. If you don't follow those
specifications - your input code does not meet the pre-conditions, or
the pre-conditions are not met when your code is run - you get undefined
behaviour. There is no rational way to expect any particular result
when the input is in essence meaningless.
So if there is a function (or operator, or other feature) specified by
the language or by library or function documentation, and you pass it
something that is not documented as fulfilling the pre-conditions, it's
garbage in, garbage out - your code is wrong. If your code makes
assumptions about the workings of a function that are not specified in
its post-condition, the code is wrong. It might work during testing,
but it is not guaranteed to work. If you try to use a function outside
its specifications, then your code is wrong.
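(A small C sketch of a function with a documented pre-condition and post-condition, plus a checked wrapper for callers who cannot guarantee the pre-condition themselves. The names isqrt and isqrt_checked are illustrative only, and the simple counting loop is chosen for clarity, not speed.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Pre-condition:  x >= 0.
 * Post-condition: the returned r satisfies r*r <= x < (r+1)*(r+1).
 * A negative x is outside the specification; the value returned in
 * that case means nothing (garbage in, garbage out). */
int32_t isqrt(int32_t x)
{
    int32_t r = 0;
    /* 64-bit arithmetic so that (r+1)*(r+1) cannot overflow. */
    while ((int64_t)(r + 1) * (r + 1) <= x)
        ++r;
    return r;
}

/* Checked variant: verifies the pre-condition and reports failure
 * instead of producing a meaningless result. */
bool isqrt_checked(int32_t x, int32_t *out)
{
    if (x < 0)
        return false;
    *out = isqrt(x);
    return true;
}
```

The unchecked version is what you want in an inner loop where the caller already knows x is non-negative; the checked one is for input coming from outside.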
Of course it is not always easy to make sure everything is correct
within specifications. Programming languages and libraries are
complicated, and people make mistakes. And where practical, it can be
good to take that into consideration - if it is possible to give error
messages or help in the case of bad inputs, then that can be very
helpful to people. But it doesn't make sense to try to give the "right"
output for wrong input. And it doesn't make sense to do this to the
significant detriment of efficiency with correct inputs.
To compare this to specifications in other walks of life, imagine an
electricity company. The specification they provide to you, the
customer, has the pre-condition that you pay your bills. The
post-condition is that you get electricity. If you break the
specification - you stop paying your bills - it's perfectly reasonable
that they cut off your electricity. But it is /nice/ if they first send
you warning letters, and offer to re-arrange your debt. But if you are
following the specifications and paying your bills, you would not want
the electricity company to keep providing electricity to those who don't
pay, because that would mean /you/ would have to pay more.
In the same way, I want my compiler to warn about potential problems or
undefined behaviour when it reasonably can, rather than jumping straight
to nasal daemons. But I don't want it to generate slower code than it
otherwise could, just because some people might write incorrect code. I
should not have to pay (in run-time efficiency losses) for other
people's potential failure to follow specifications.
But I am quite happy to have compiler options to control the balance and
behaviour. Compilers generally do little optimisation without flags
explicitly enabling them. And some compilers have flags to change the
language specifications (such as making signed integer arithmetic wrap).
There's not a lot they could do better to satisfy people who want the
tools to conform to their imagined specification rather than the actual
specifications.
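(For illustration: gcc and clang accept -fwrapv, which makes signed overflow wrap. Without such a flag, code that stays inside the standard has to test for overflow before doing the addition. A minimal sketch follows; add_checked is a made-up name.)

```c
#include <limits.h>
#include <stdbool.h>

/* Adds a and b without ever evaluating an overflowing signed addition.
 * Returns true and stores the sum if it is representable as int;
 * returns false and leaves *sum untouched otherwise. */
bool add_checked(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b))
        return false;

    *sum = a + b;   /* cannot overflow here */
    return true;
}
```

The comparisons themselves cannot overflow: INT_MAX - b is only evaluated for positive b, and INT_MIN - b only for negative b.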
I suppose one thing they could do is that when a new compiler version
comes out with new optimisations, they could have a flag that turns
these off even if you have enabled others. Maybe you could have
"-olimit=8" to say "limit optimisations to those in gcc 8". That might
give fewer surprises to people who have got their code wrong.
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
It does seem to me that as the C standard evolved, and as more things
have *explicitly* become documented as UB, compiler developers have
responded largely by dropping whatever the compiler did previously -
sometimes breaking code that relied on it.
John Dallman <[email protected]> wrote:
In fact, organisations replace about a quarter of their machines each
year, always buying up-to-date ones, and want to run the /same/ version
of software on all of them. They want common software versions for data
compatibility, ease of training and so on. That means that a new release
of an application has to run on all the machines sold in the last four
years, sometimes longer.
I assume you work in the high end, as the average desktop PC is replaced
every 8 years on a "use it until it breaks" policy.
Dell will tell you 5 years, and Google is paid to say the same.
And that actually might be true for laptops, but not desktops.
The bulk of the PCs and servers where I work are a dozen years old.
A smattering of new PCs brings the average down to 9 years.
On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping to requirements such as vessel wall thickness,
geometry etc., while compiler writers are free in their search for
optimization tricks that let them shine at SPEC benchmarks.
A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service, due to stress
and strain acting on the base materials.
Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were, perhaps we would have better code today...
Perhaps an analogy is code written in assembler, versus code written in
C versus code written in something like Ada or Rust. Backing away now
. . . :-)
On Sun, 1 Sep 2024 22:07:53 +0200, David Brown
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
It does seem to me that as the C standard evolved, and as more things
have *explicitly* become documented as UB, compiler developers have
responded largely by dropping whatever the compiler did previously -
sometimes breaking code that relied on it.
I have moved on from C (mostly), and I learned long ago to archive
toolchains and to expect that any new version of a tool might break
something that worked previously. I don't like it, but it generally
doesn't annoy me that much.
MMV. Certainly Anton's does. ;-)
Similar to you (David), I came from a - not embedded per se - but
kiosk background: HRT industrial QA/QC systems. I know well the
attraction of a new compiler yielding better performing code. I also
know a large amount of my code was hardware and OS specific, that
those are the things beyond the scope of the compiler, but they also
are things that I don't want to have to revisit every time a new
version of the compiler is released.
13 of one, baker's dozen of the other.
Brett wrote:
John Dallman <[email protected]> wrote:
In fact, organisations replace about a quarter of their machines each
year, always buying up-to-date ones, and want to run the /same/ version
of software on all of them. They want common software versions for data
compatibility, ease of training and so on. That means that a new release
of an application has to run on all the machines sold in the last four
years, sometimes longer.
I assume you work in the high end, as the average desktop PC is replaced
every 8 years on a "use it until it breaks" policy.
Dell will tell you 5 years, and Google is paid to say the same.
And that actually might be true for laptops, but not desktops.
The bulk of the PCs and servers where I work are a dozen years old.
A smattering of new PCs brings the average down to 9 years.
Organizations that rely on commercial licenced software have a much
easier calculation to make:
"I pay 10-100K dollars every year per CPU for my 3D
CAD/modelling/whatever software, if I can buy a new system in 2-4 years
time which is 50% faster (more cores/faster threads), then it could make
sense to upgrade every year, except for the hassle of installing
everything."
ENTER, LEAVE, and RET as the only instructions capable of accessing the
safe stack fascinates me. I would like to try implementing this sort
of thing in my design. I am pondering why the PTE is specially marked
RWE=000. One would think that some other OS-available bits could be
used. Does it make the MMU software easier to implement? Assuming that
faults processed during ENTER, LEAVE, and RET are processed at a higher
privilege level, could it not just check some other internal tables?
Decided to try implementing a capabilities machine in the current
design. Modeled it after the RISC-V capabilities instructions in the
CHERI document. It was either that or a segmentation system. Got to keep
the ole brain working.
Going with an OoO design for Bigfoot.
The rf386 takes an average of about 8 clocks per instruction. Helped out
by the presence of a data cache. IPC of 0.125 is nothing to write home
about. About 5 MIPS at 50 MHz. Stores are fast (2-3 cycles), but loads
are another story (14 ish cycles).
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
Can you give an example?
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
Can you give an example?
Memcopy() with overlapping pointers.
[email protected] (MitchAlsup1) writes:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.
Was it always UB? Or should it be considered ID until it became
UB?
Can you give an example?
Memcopy() with overlapping pointers.
Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.
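(For reference, the standard's answer for overlapping objects is memmove(), which is specified to behave as if the bytes were first copied to a temporary buffer. A minimal C sketch; drop_prefix is a made-up helper name.)

```c
#include <string.h>

/* Hypothetical helper: discard the first n bytes of buf (len bytes in
 * use) by sliding the remainder down to the start of the buffer.
 * Source and destination overlap whenever 0 < n < len, so memcpy()
 * would be undefined here; memmove() is the defined choice.
 * Returns the number of bytes that remain. */
size_t drop_prefix(char *buf, size_t len, size_t n)
{
    if (n >= len)
        return 0;

    memmove(buf, buf + n, len - n);
    return len - n;
}
```

Called on the six bytes "abcdef" with n = 2, this leaves "cdef" in the first four bytes and returns 4.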
Stephen Fuld wrote:
On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping to requirements such as vessel wall thickness,
geometry etc., while compiler writers are free in their search for
optimization tricks that let them shine at SPEC benchmarks.
A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service, due to stress
and strain acting on the base materials.
Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were, perhaps we would have better code today...
Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing away
now . . . :-)
IMNSHO, code written in asm is generally more safe than code written in
C, because the author knows exactly what each line of code is going to do.
The problem is of course that it is harder to get 10x lines of correct
asm than to get 1x lines of correct C.
BTW, I am also solidly in the grey hair group here, writing C code that
is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it really
obvious that they cannot alias any globals or input/output parameters.
Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)
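(A tiny C sketch of the "copy it into a local" habit described above. The function name scale_all is made up; the point is only that the loop invariant is read once, so stores through dst cannot change it even if the caller passes a pointer into the same array.)

```c
#include <stddef.h>

/* Multiply each of the n ints in dst by *factor.
 * *factor is copied into a local before the loop, so it is read
 * exactly once and cannot be affected by the writes to dst[],
 * even when factor points at an element of dst. */
void scale_all(int *dst, size_t n, const int *factor)
{
    const int f = *factor;          /* loop invariant held in a local */

    for (size_t i = 0; i < n; i++)
        dst[i] *= f;
}
```

If the loop used *factor directly and the caller passed &dst[0], the factor would change after the first iteration; with the local copy the result is the same either way, and the compiler does not have to reload it on every pass.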
On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:
[email protected] (MitchAlsup1) writes:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.
Was it always UB? Or should it be considered ID until it became
UB?
Can you give an example?
Memcopy() with overlapping pointers.
Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.
3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the world
are telling the same story. Maybe there exists an old popular book that
said that it was defined?
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
Can you give an example?
Memcopy() with overlapping pointers.
In some sense I am
agreeing that the problem here is caused by the C standard, not by
it changing in different versions but by it giving too much freedom
to implementors for so-called "undefined behavior". Sadly the
standardization process seems to have been taken over by compiler
writers, so the best advice I can offer is to join the ISO C
committee and start voting out the lunacy.
Alternatively I suppose
one could start up a competitive effort to gcc and clang, and offer
a compiler that doesn't engage in such shenanigans unless told to do
so (and told specifically), and then try to get developers to switch
to sane C in preference to the ever-increasingly insane C that is
most commonly used today.
Michael S <[email protected]> writes:
On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:
[email protected] (MitchAlsup1) writes:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.
Was it always UB? Or should it be considered ID until it became
UB?
Can you give an example?
Memcopy() with overlapping pointers.
Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.
3 years ago Terje Mathisen wrote that many years ago he read that
the behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the world
are telling the same story. Maybe there exists an old popular book that
said that it was defined?
My first answer is that the question asked was about standards, and
that is the question I was answering. There were no C standards
before 1989.
MitchAlsup1 <[email protected]> schrieb:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
Can you give an example?
Memcopy() with overlapping pointers.
Does anybody have the first edition of K&R around to check what is
explicitly stated there?
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
On 9/1/2024 6:32 PM, MitchAlsup1 wrote:
More modern machines have RND nobody will ever have REM.
Which is probably not a lot, as off-hand I am not aware of many ISAs
that have floor/ceil/round in the ISA itself, rather than doing it via
conversion to an integer type.
Thomas Koenig <[email protected]> schrieb:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
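To make the struct-member question concrete, here is a hedged sketch of what a run-time "observable error" might look like if written by hand. The struct packet layout and the packet_length() accessor are hypothetical. On typical implementations, reading p->length through a null p would touch the address at offsetof(struct packet, length) rather than address zero, which is part of why the member case is less clear-cut than a plain null dereference.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct: the member of interest sits at a non-zero offset. */
struct packet {
    char header[64];
    int length;
};

/* Hypothetical checked accessor: diagnoses a null pointer at run time
   instead of dereferencing it. Returns 0 on success, -1 on error. */
int packet_length(const struct packet *p, int *out)
{
    if (p == NULL) {
        fprintf(stderr, "observable error: null struct pointer\n");
        return -1;
    }
    *out = p->length;
    return 0;
}
```

Whether a compiler following such an annex would have to emit this check for every member access, or could rely on a hardware trap, is exactly the kind of detail a specification would need to settle.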
On 9/2/24 12:59 PM, Thomas Koenig wrote:
Memcopy() with overlapping pointers.
Does anybody have the first edition of K&R around to check what is
explicitly stated there?
memcpy() doesn't appear in the index.
On Mon, 2 Sep 2024 20:52:21 +0000, Thomas Koenig wrote:
Thomas Koenig <[email protected]> schrieb:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
It depends::
Let
Base = NULL;
Index = &array / sizeof( array[0] );
is::
x = [base+index<<scale+small_offset]
undefined ??
Tim Rentsch <[email protected]> schrieb:
In some sense I am
agreeing that the problem here is caused by the C standard, not by
it changing in different versions but by it giving too much freedom
to implementors for so-called "undefined behavior". Sadly the
standardization process seems to have been taken over by compiler
writers, so the best advice I can offer is to join the ISO C
committee and start voting out the lunacy.
Alternatively I suppose
one could start up a competitive effort to gcc and clang, and offer
a compiler that doesn't engage in such shenanigans unless told to do
so (and told specifically), and then try to get developers to switch
to sane C in preference to the ever-increasingly insane C that is
most commonly used today.
The specification needs to come first! Right now, compiler writers
have a specification, the standard, which they generally follow
(modulo bugs and extensions). You have to give them another,
supplemental specification to follow if you want any chance
of success.
But writing such a specification is a lot of work, very hard work,
and needs a lot of discussion.
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
If it gets accepted by a wide community, then a branch trying to
implement that particular version in either gcc or clang (or
both) could have a certain chance of being implemented by the
main compilers.
My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,
with
additional guarantees for what happens in cases that are
undefined behavior.
Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).
The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.
On 9/2/2024 1:23 AM, Terje Mathisen wrote:
Stephen Fuld wrote:
On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry
etc.,
while compiler writers are free in their search for optimization
tricks
that let them shine at SPEC benchmarks.
A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.
Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...
Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing away
now . . . :-)
IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is going
to do.
The problem is of course that it is harder to get 10x lines of correct
asm than to get 1x lines of correct C.
BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.
Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)
Can you talk about the advantages and disadvantages of Rust versus C?
Tim Rentsch <[email protected]> schrieb:
My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,
Sure, that was also what I was suggesting - define things that
are currently undefined behavior.
with
additional guarantees for what happens in cases that are
undefined behavior.
Guarantees or specifications - no difference there.
Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).
It's the other way around - you need to describe first what the
actual behavior in absence of any pragmas is, and this needs to be a
firm specification, so the programmer doesn't need to read your mind
(or the source code to the compiler) to find out what you meant.
"But it is clear that..." would not be a specification; what is
clear to you may absolutely not be clear to anybody else.
This is also the only chance you'll have of getting this implemented
in one of the current compilers (and let's face it, if you want
high-quality code, you would need that; both LLVM and GCC
have taken an enormous amount of effort up to now, and duplicating
that is probably not going to happen).
The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.
I understood that - defining things that are currently undefined.
But without a specification, that falls down.
So, let's try something that causes some grief - what should
be the default behavior (in the absence of pragmas) for integer
overflow? More specifically, can the compiler set the condition
to false in
int a;
...
if (a > a + 1) {
}
and how would you specify this in an unambiguous manner?
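For reference, under ISO C as it stands, signed overflow is undefined, so a compiler is permitted to assume a + 1 never overflows and fold the condition above to false. A sketch of how the same intent can be written without relying on overflow at all follows; checked_increment is a hypothetical helper name, not part of any standard library.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: computes a + 1 only when it cannot overflow.
   Returns false and leaves *result untouched if a == INT_MAX. */
bool checked_increment(int a, int *result)
{
    if (a == INT_MAX) {
        return false;   /* a + 1 would overflow: report it instead */
    }
    *result = a + 1;
    return true;
}
```

A specification for the default behaviour would have to say which of the candidate outcomes (wraparound, trap, or some unspecified value) the expression a + 1 produces when this helper would have returned false.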
Michael S wrote:
On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:
[email protected] (MitchAlsup1) writes:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.
Was it always UB? Or should it be considered ID until it became
UB?
Can you give an example?
Memcopy() with overlapping pointers.
Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.
3 years ago Terje Mathisen wrote that many years ago he read that
the behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
Code that depended on this was fine for decades, until the first
library/compiler implementation discovered that in some circumstances
it could be faster to go in reverse order.
Terje
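A minimal sketch of the behaviour described above, assuming a plain byte-wise loop. copy_forward is a hypothetical name, not any particular runtime's memcpy(). Copying strictly in increasing address order is safe when the destination lies below the source; in ISO C only memmove() promises correct results for overlapping objects.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical forward copy: always proceeds from the lowest address to
   the highest, so an overlapping destination BELOW the source is safe. */
void *copy_forward(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];
    }
    return dst;
}
```

For example, with char buf[] = "abcdef", both copy_forward(buf, buf + 2, 4) and memmove(buf, buf + 2, 4) leave "cdefef" in buf. Calling memcpy() with the same arguments is undefined, even though many implementations happen to copy forward.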
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:
[email protected] (MitchAlsup1) writes:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.
Was it always UB? Or should it be considered ID until it became
UB?
Can you give an example?
Memcopy() with overlapping pointers.
Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.
3 years ago Terje Mathisen wrote that many years ago he read that
the behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
What is "compact buffers" ?
On 02/09/2024 18:46, Stephen Fuld wrote:
On 9/2/2024 1:23 AM, Terje Mathisen wrote:
Stephen Fuld wrote:
On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry
etc.,
while compiler writers are free in their search for optimization
tricks
that let them shine at SPEC benchmarks.
A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.
Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...
Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing
away now . . . :-)
IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is
going to do.
The problem is of course that it is harder to get 10x lines of
correct asm than to get 1x lines of correct C.
BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.
Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)
Can you talk about the advantages and disadvantages of Rust versus C?
And also for Rust versus C++ ?
My impression - based on hearsay for Rust as I have no experience - is
that the key point of Rust is memory "safety". I use scare-quotes here,
since it is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C, but it is
also very easy to get it wrong - especially if the developer doesn't use
available tools for static and run-time checks. Modern C++, on the
other hand, makes it much easier to get right. You can cause yourself
extra work and risk by using more old-fashioned C++, but if you follow
modern design guides, using smart pointers and containers along with
easily available tools, you get a lot of the management of memory
handled automatically for very little cost.
C++ provides a huge amount more than Rust - when I have looked at Rust,
it is (still) too limited for some of what I want to do.
Of course,
"with great power comes great responsibility" - C++ provides many
exciting ways to write a complete mess :-)
To my mind, the important question is not "Should we move from C to
Rust?", but "Should we move from bad C to C++, Rust, or simply to good
C practices?".
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read that
the behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
What is "compact buffers" ?
(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid imposing
too much work on early compilers. Where possible, UB that can be caught
at compile time, could usefully be turned into constraint violations that
must be diagnosed.)
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read that
the behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of them marked
as deleted. Iterating over them while removing the gaps means that you
are always copying to a destination lower in memory, right?
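The compaction pattern described above can be sketched as follows, under some loud assumptions: struct record and compact_records() are hypothetical, and runs of live records are moved in one call each, so the source and destination ranges may overlap. The sketch uses memmove(), which is defined for overlap; the older code under discussion relied on memcpy() copying in increasing address order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout: a deletion flag plus some payload. */
struct record {
    int deleted;
    int payload;
};

/* Closes the gaps left by deleted records, iterating in increasing
   address order. Returns the number of live records kept at the front. */
size_t compact_records(struct record *buf, size_t count)
{
    size_t out = 0;   /* next free slot at the front */
    size_t in = 0;    /* next record to examine */

    while (in < count) {
        if (buf[in].deleted) {
            in++;
            continue;
        }
        /* Find the end of this run of live records. */
        size_t end = in;
        while (end < count && !buf[end].deleted) {
            end++;
        }
        size_t run = end - in;
        if (out != in) {
            /* Destination is always lower than the source here, and the
               two ranges overlap whenever the gap is shorter than the run. */
            memmove(&buf[out], &buf[in], run * sizeof buf[0]);
        }
        out += run;
        in = end;
    }
    return out;
}
```

Because the loop walks upward, every move goes to a lower address, which is the case where a strictly forward copy would also have produced the right answer.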
Undefined behaviour is something that is exercised at run-time.
That's why the "undefined behaviour sanitizers" insert run-time
checks. And of course they only detect the behaviour when it is
actually exercised.
On 2024-09-03 20:52, Terje Mathisen wrote:
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read that
the behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?
Only if you iterate in order of increasing memory address, which is not
the only possibility.
My impression - based on hearsay for Rust as I have no experience - is that the key point of Rust is memory "safety". I use scare-quotes here, since it is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
Specifications are an agreement between the supplier and the client. The
Niklas Holsti wrote:
On 2024-09-03 20:52, Terje Mathisen wrote:
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read that
the behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?
Only if you iterate in order of increasing memory address, which is
not the only possibility.
Obviously so, I really didn't think that needed to be stated. :-(
My impression - based on hearsay for Rust as I have no experience - is
that
the key point of Rust is memory "safety". I use scare-quotes here,
since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Stefan
Stefan Monnier <[email protected]> schrieb:
My impression - based on hearsay for Rust as I have no experience - is
that
the key point of Rust is memory "safety". I use scare-quotes here,
since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Really?
I thought Fortran was higher level than C, and you can do a lot
more things in Fortran than in C.
Or rather, Fortran allows you to do things which are possible,
but very cumbersome, in C. Both are Turing complete, after all.
Specifications are an agreement between the supplier and the client. The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
Stefan
On 9/2/2024 11:23 PM, David Brown wrote:
On 02/09/2024 18:46, Stephen Fuld wrote:
On 9/2/2024 1:23 AM, Terje Mathisen wrote:
Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)
Can you talk about the advantages and disadvantages of Rust versus C?
And also for Rust versus C++ ?
I asked about C versus Rust as Terje explicitly mentioned those two languages, but you make a good point in general.
My impression - based on hearsay for Rust as I have no experience - is
that the key point of Rust is memory "safety". I use scare-quotes
here, since it is simply about correct use of dynamic memory and buffers.
I agree that memory safety is the key point, although I gather that it
has other features that many programmers like.
It is entirely possible to have correct use of memory in C, but it is
also very easy to get it wrong - especially if the developer doesn't
use available tools for static and run-time checks. Modern C++, on
the other hand, makes it much easier to get right. You can cause
yourself extra work and risk by using more old-fashioned C++, but
following modern design guides using smart pointers and containers,
along with easily available tools, and you get a lot of the management
of memory handled automatically for very little cost.
Is it fair to say then that Rust makes it harder to get memory
management "wrong"?
C++ provides a huge amount more than Rust - when I have looked at
Rust, it is (still) too limited for some of what I want to do.
Can you give a few examples?
Of course, "with great power comes great responsibility" - C++
provides many exciting ways to write a complete mess :-)
Sure. I gather that templates are very powerful and potentially very useful. On the other hand, I gather that multiple inheritance is very powerful, but difficult to use and potentially very ugly, and has not
been carried forward in the same way into newer languages.
snip stuff about the inadequacy of existing Rust versus C++ comparisons.
To my mind, the important question is not "Should we move from C to
Rust?", but "Should we move from bad C to C++, Rust, or simply to good
C practices?".
I understand. This brings up an important issue, that of older versus
newer languages.
A newer language has several advantages. One is it can take advantage
of what we have learned about language design and usage since the older
language was designed. I can't overstate this. While many
new language features turn out to be not useful, many are.
Another is that it doesn't have to worry about support for "dusty
decks", i.e. the existing base which may conform to an older version of
the language, nor for "dusty brains", that is programmers who learned
the older (i.e. worse) ways and keep generating new code using those
ways. You mention this issue in your comments.
Of course, the counter to that is that new languages have to overcome
the huge "installed base" advantage of existing languages.
Let me be clear. I am not a Rust evangelist. I am just looking for a
way forward that will help us make programming easier and not make
some of the same mistakes we have made in the past. Is Rust that? Some
people think so. I just want to understand more.
On 03.09.24 10:10, David Brown wrote:
snip 8< - - - - - - - -
(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time could usefully be turned into constraint
violations that must be diagnosed.)
And exactly these are the situations that I'd like to be warned about,
rather than the compiler making up something without telling.
On 2024-09-03 11:10, David Brown wrote:
[snip]
(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time could usefully be turned into constraint
violations that must be diagnosed.)
The problem, as you of course know, is that the "can" in "can be caught
at compile time" depends on the amount and kind of analysis that is done
at compile time -- some cases of UB "can" be caught at compile time but
only by advanced and costly analysis. If the language standard requires
that such things /must/ be detected by the compiler, it can place quite
a burden on the developers of conforming compilers.
As I understand it, current C compilers detect UB mostly as a side
effect of the analyses they do for code optimization purposes, which
vary widely between compilers, and so the UB-detections also vary.
This issue (compile-time detection) has now and then been discussed in
the Ada standards group. Given the currently low market penetration of
Ada, the group has been reluctant to require too much of the compilers,
and so the more advanced UB-detecting tools are stand-alone, such as the SPARK tools.
Stephen Fuld wrote:
On 9/2/2024 1:23 AM, Terje Mathisen wrote:
Stephen Fuld wrote:
On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.
A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.
Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...
Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing
away now . . . :-)
IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is
going to do.
The problem is of course that it is harder to get 10x lines of
correct asm than to get 1x lines of correct C.
BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.
Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)
Can you talk about the advantages and disadvantages of Rust versus C?
Q&D programming is still far faster for me in C, but using Rust I don't
have to worry about how well the compiler will be able to optimize my
code, it is pretty much always close to speed of light since the entire aliasing issue goes away.
Rust also gets rid of the horrible external library/configure/cmake mess that kept me from successfully compiling the reference LAStools lidar
code for nearly 10 years.
Using the Rust port I just tell cargo to add it to my project and that's
it.
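For anyone who has not seen the "copy it into a local" habit mentioned above, here is a minimal C sketch. The function name scale_all and its parameters are made up for illustration and are not from anyone's post. The point is only that reading *factor once into a local makes it obvious, to both the reader and the compiler, that stores through data cannot change the loop invariant.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every element of data[0..n-1] by the value factor points to.
   scale_all is a hypothetical name used only for this sketch. */
void scale_all(int *data, size_t n, const int *factor)
{
    assert(data != NULL && factor != NULL);

    /* Loop invariant copied once: later stores through data
       cannot alias this local, even if factor points into data. */
    const int f = *factor;

    for (size_t i = 0; i < n; i++)
        data[i] *= f;
}
```

Note that if factor happens to point into data, this version keeps scaling by the original value, which is usually what was intended; a version that rereads *factor each iteration would not.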
On 03/09/2024 19:19, Bernd Linsel wrote:
On 03.09.24 10:10, David Brown wrote:
snip 8< - - - - - - - -
(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time could usefully be turned into constraint
violations that must be diagnosed.)
And exactly these are the situations that I'd like to be warned about,
rather than the compiler making up something without telling.
Some of those /are/ warned about by compilers (but I'd rather the
standards said that they were errors). But in general, many can be
handled by good development practice and compiler warnings. Still,
compilers could always get better!
One thing that could make a big difference, I think, is to drop the compilation model of each translation unit being compiled to a binary
object independently, with only a minimal amount of information for
linking. Link-time optimisation allows for many extra checks, not all
of which are currently implemented AFAIK. For example, it should be
possible to check that external declarations and definitions match up correctly across modules - that's currently UB and rarely checked.
On 8/31/24 4:56 PM, BGB wrote:
[snip]
I was mostly doing dual-issue with a 4R2W design.
Initially, 6R3W won out mostly because 4R2W disallows an indexed
store to be run in parallel with another op; but 6R3W did allow
this.
Stores and MADD allow one register read to be delayed by at least
one cycle. If the following cycle had a free read port, that could
be stolen to complete the store/MADD. This could be viewed as
cracking a three-source operation into a two-source operation and
a one-source operation that reads source operands in a following
cycle except that this operation never uses a result from the
previous cycle.
On 9/1/2024 4:02 PM, Paul A. Clayton wrote:
This wouldn't map well to my existing decoder/pipeline, which requires
all the ports (and all the registers) to be available at the time an instruction enters EX1, and currently has no support for "cracking" an instruction over multiple cycles, but may spread a single instruction
across multiple lanes.
But, yeah, if the restriction only applied to indexed store (in the
current implementation, it applies to all stores), it would still be
around 4% of the total instruction stream.
As-is, it is closer to 12%, and causing an extra penalty for 12% of the total-executed instructions was undesirable (but, IMHO, still better
than needing to use multiple instructions).
On Tue, 3 Sep 2024 23:19:36 +0000, David Brown wrote:
Some of those /are/ warned about by compilers (but I'd rather the
standards said that they were errors). But in general, many can be
handled by good development practice and compiler warnings. Still,
compilers could always get better!
Something that might be an error in a 32-bit machine may not be
an error in a 36-bit {48, 64, 72} machine.
One thing that could make a big difference, I think, is to drop the
compilation model of each translation unit being compiled to a binary
object independently, with only a minimal amount of information for
linking. Link-time optimisation allows for many extra checks, not all
of which are currently implemented AFAIK. For example, it should be
possible to check that external declarations and definitions match up
correctly across modules - that's currently UB and rarely checked.
How does one call fprintf() under those rules ??
On 9/2/2024 8:36 AM, MitchAlsup1 wrote:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
Can you give an example?
Memcopy() with overlapping pointers.
I had just recently discovered that newer versions of GCC will cause
code to break if it is missing a return value in C++ mode.
So:
int Foo() { }
Will (in theory) cause the program to crash when called (emitting a
'UD2' instruction), except in WSL it seems this doesn't quite work
correctly (the UD2 doesn't result in an immediate crash), and the
program seemingly instead "goes off the rails and crashes at a later
point" (GCC omits the epilog when it does this, and seemingly control
flow then goes into whatever function follows in the binary, crashing
when that function tries to return seemingly by branching to an invalid address or similar).
This was mostly affecting "init" functions in my Verilator test benches...
Well, that, and a more inconsistent variant, where if one declares
struct fields as 8 and 3 bytes and then strncpy's 11 bytes into the
combined field, it may also insert a UD2 and skip emitting the following code.
...
But, yeah, that was annoying...
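For the 8-and-3-byte case, one way to sidestep the issue is to copy into each field separately, so that no single call writes past the end of a struct member. This is only a sketch: the struct name83 and the function set_name83 are hypothetical, and the layout is an assumption based on the field sizes mentioned above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 8+3 layout, assumed from the description above. */
struct name83 {
    char base[8];
    char ext[3];
};

/* Copy 11 bytes from src into the two fields, one memcpy per member,
   so each call stays inside the bounds of the object it writes to. */
void set_name83(struct name83 *dst, const char src[11])
{
    assert(dst != NULL && src != NULL);
    memcpy(dst->base, src, sizeof dst->base);
    memcpy(dst->ext, src + sizeof dst->base, sizeof dst->ext);
}
```

Neither field is NUL-terminated here, same as with the original strncpy approach.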
On 03/09/2024 18:54, Stephen Fuld wrote:
On 9/2/2024 11:23 PM, David Brown wrote:
On 02/09/2024 18:46, Stephen Fuld wrote:
On 9/2/2024 1:23 AM, Terje Mathisen wrote:
Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)
Can you talk about the advantages and disadvantages of Rust versus C?
And also for Rust versus C++ ?
I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.
I want to know about both :-)
In my field, small-systems embedded development, C has been dominant for
a long time, but C++ use is increasing. Most of my new stuff in recent times has been C++. There are some in the field who are trying out
Rust, so I need to look into it myself - either because it is a better
choice than C++, or because customers might want it.
My impression - based on hearsay for Rust as I have no experience -
is that the key point of Rust is memory "safety". I use
scare-quotes here, since it is simply about correct use of dynamic
memory and buffers.
I agree that memory safety is the key point, although I gather that it
has other features that many programmers like.
Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many features. But is it overall up or down, for /my/ uses?
Examples of things that I think are good in Rust are making variables immutable by default and pattern matching. Steps down include lack of function overloading and limited object oriented support.
There are some things that some people really like about Rust, that I am
far from convinced about - such as package management. I could be misunderstanding (since I don't have the experience), but for /my/ work,
I am very much against anything that encourages an "always get the
latest version" attitude. Stability is much more important to me. (I dislike the rate at which Rust changes - every two weeks or so for small things, and every couple of years for breaking changes.)
And there are some things that Rust simply gets wrong - such as the
handling of signed integer overflows.
On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:
Specifications are an agreement between the supplier and the client. The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
# define int int64_t
..
makes it easier.
On 9/3/2024 8:46 AM, Terje Mathisen wrote:
Can you talk about the advantages and disadvantages of Rust versus C?
Q&D programming is still far faster for me in C, but using Rust I don't have to worry about how well the compiler will be able to optimize my code, it is pretty much always close to speed of light since the entire aliasing issue goes away.
Rust also gets rid of the horrible external library/configure/cmake mess that kept me from successfully compiling the reference LAStools lidar
code for nearly 10 years.
Using the Rust port I just tell cargo to add it to my project and that's it.
Thank you. I find it interesting that the main advantage of Rust as
touted by its evangelists, memory safety, didn't make your list.
Otherwise, annoying:
Despite configuring GCC to use RV64G, it builds its C library as RV64GC
and is like "well, close enough".
Which is annoying because seemingly nearly every instruction has its own
encoding scheme for the immediate fields.
David Brown wrote:
There are some things that some people really like about Rust, that I
am far from convinced about - such as package management. I could be
misunderstanding (since I don't have the experience), but for /my/
work, I am very much against anything that encourages an "always get
the latest version" attitude. Stability is much more important to
me. (I dislike the rate at which Rust changes - every two weeks or so
for small things, and every couple of years for breaking changes.)
That's yet another of the things cargo (the Rust package manager, as
well as lots of other stuff) gets right:
Yes, by default you'll pick up the latest of every package/module you
"cargo add foo" to your project, but then you can edit the resulting text-format configuration file, and lock down exact versions of some or
all of those packages.
This is similar to how we always freeze python packages: Any changes are something we decide to employ.
And there are some things that Rust simply gets wrong - such as the
handling of signed integer overflows.
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
You can argue both pro and con here, personally I like the Rust setup
much more than C(++) which will use code that could do so as an excuse
to elide that as well as all surrounding/dependent code.
On 9/4/2024 2:04 AM, David Brown wrote:
On 03/09/2024 20:39, BGB wrote:
On 9/2/2024 8:36 AM, MitchAlsup1 wrote:
On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:
George Neuner <[email protected]> schrieb:
I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.
Was it always UB? Or should it be considered ID until it became UB?
Can you give an example?
Memcopy() with overlapping pointers.
I had just recently discovered that newer versions of GCC will cause
code to break if it is missing a return value in C++ mode.
No, the error in the code caused the code to break. You don't get to
blame the compiler if you write rubbish. You get to /thank/ the
compiler if it has helpfully added an instruction to cause the program
to stop abruptly with a UD2 instruction.
Usually the role of the compiler is to make existing code work as it did before, not to cause it to break, even in the face of UB.
I would have more accepted if it turned it into a compiler error or
similar though (rather than turn it into a runtime crash).
Note that in C, falling off the end of Foo here is fine - it is only
if the caller attempts to use the non-existent return value that there
is UB. Thus in C mode, gcc implements Foo as "ret" (when optimised),
and will only warn you if you enable warnings.
In C++, it is the act of falling off the end of Foo that is UB, thus
the compiler will generate a UD2 (for -O0) or no code at all (when
optimised), and will warn you without requiring options.
It worked fine in the older instance of WSL running GCC 4.8.0 ("Ubuntu
14"), but sorta exploded when switching to a newer instance of WSL (with "Ubuntu 22")...
But, sometimes got lazy, and did:
int InitSomething()
{
...
}
Without a return, but was an issue when it was unexpectedly crashing
(and the cause was not immediately obvious, and I had not heard that
there had been a behavioral change here).
Well, also partly because it is traditional to always return 'int' even
when 'void' is technically more correct.
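A minimal sketch of the fixed form of such an init function, valid in both C and C++. The state variable g_ready is hypothetical and only there so the function does something. Building with GCC's -Wreturn-type (or -Werror=return-type to make it a hard error) should flag the version without the return statement.

```c
#include <assert.h>

/* Hypothetical state touched by the init function. */
static int g_ready = 0;

/* Same shape as the lazy version, but with an explicit return value,
   so control never falls off the end of a non-void function. */
int InitSomething(void)
{
    g_ready = 1;
    return 0;   /* 0 = success, by the usual convention */
}
```

If the return value is never used, declaring the function void is the other obvious fix.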
But, in general, coding practices in my Verilator testbenches tend to
be more lax (mostly code thrown together so the Verilog can do its thing
and display its output to a window, and accept user input as needed).
If your compiler tells you you are doing something stupid, and you
ignore it, I really don't think you can claim "the compiler broke my
code".
It would have been nicer if it crashed in a way where GDB could show me
the point at which the crash was triggered...
as opposed to just showing "??" followed by a random address (followed
by "can't read from address" or something to this effect).
(with the "-g" option). Where, "bt" and similar didn't work either.
I could tell it wasn't crashing immediately, because if it crashed immediately it would fail at the point the UD2 occurred.
However, in a lot of cases it was carrying on and triggering a storm of
debug prints for a while with often impossible values, before then
crashing (in a way that looked more like a possible stack corruption).
I suspect the latter being due to some weirdness in WSL (I figured about
the "UD2" mostly by trying to recreate the scenarios in "Compiler
Explorer" / "godbolt.org").
Luckily stuff mostly worked after this point, as the missing return
values were mostly limited to initialization functions.
Oddly though, "Compiler Explorer" was showing warnings for the missing
return values, but not in GCC in WSL.
Though, have noted that generally MSVC will warn about them, and in this
case I had usually fixed them, as granted it is still good practice to
return a value (more so if actually used, because "random garbage" isn't usually a particularly useful return value).
But, generally, MSVC will not unexpectedly break things.
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging), then
your code never makes use of the results of an overflow - thus why is it defined behaviour? It makes no sense. The only time when you would actually see wrapping in final code is if you hadn't tested it properly,
and then you can be pretty confident that the whole thing will end in
tears when signs change unexpectedly. It would be much more sensible to leave signed overflow undefined, and let the compiler optimise on that
basis.
Absolutely. There's things about newer languages, like Rust, Go, and
Swift that I like. For example, they are designed with concurrency and multi-threading from the start, rather than an add-on. C++, as we know
it today, has grown gradually, and a lot of its complexity is because of features added on rather than having been part of the original design.
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?
On 03/09/2024 22:22, MitchAlsup1 wrote:
On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:
Specifications are an agreement between the supplier and the client. The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
# define int int64_t
..
makes it easier.
That's UB, I believe :-) And it will certainly be confusing.
But good use of size-specific types is helpful to writing correct code.
If your calculations could conceivably overflow 32 bits, int64_t is a
good choice.
For smaller numbers and portable code, you might want int_fast32_t or int_fast16_t, which on most 64-bit systems will be faster than "int".
You can call it /ugly/, but it's not /hard/.
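As a small illustration of picking a size-specific type, here is a sketch where the individual values fit in 32 bits but the intermediate products and the running total may not. The function name sum_of_squares is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of squares of 32-bit values, accumulated in 64 bits.
   Each operand is widened before the multiply, so the product
   is computed in int64_t rather than in a 32-bit type. */
int64_t sum_of_squares(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);

    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (int64_t)v[i] * v[i];
    return total;
}
```

The total can of course still overflow 64 bits for a long enough array of large values; the wider type moves the limit, it does not remove it.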
On 9/4/24 06:57, David Brown wrote:
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.
You absolutely do want defined behavior on overflow. There are
algorithms that depend on that. Bakery algorithms for instance.
Unless you think a real life bakery with service tickets
numbering from 1 to 50 either never gets more than 50 customers
in a day or closes after their 50th customer. :)
Joe Seigh
Terje Mathisen <[email protected]> writes:
Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
I'm all in favour of temporarily having checks for overflow (and other
errors) during debugging, but I am sceptical of having distinct
debug/release builds. It encourages people to use debug builds during
development, bug hunting and testing, then when all looks good they
switch to release build and deploy it. I prefer a single build, and
enable run-time checks on parts of it if and when necessary.
On 04/09/2024 18:07, Tim Rentsch wrote:
Terje Mathisen <[email protected]> writes:
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?
Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed.  So you might as well
just call memmove() any time you are not sure memcpy() is safe and
appropriate.
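For illustration, here is a minimal sketch of the buffer compaction discussed above, written with memmove() so that the overlapping copies are well defined whatever the copy direction. The record layout and the name compact_records are hypothetical, made up for this example only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout, for illustration only. */
struct record {
    int deleted;   /* nonzero if this slot is a gap to be removed */
    int payload;
};

/* Compact recs[0..count) in place, dropping deleted records.
 * Live records end up at the front, in their original order.
 * Returns the number of live records. */
size_t compact_records(struct record *recs, size_t count)
{
    size_t dst = 0;
    size_t src = 0;

    while (src < count) {
        if (recs[src].deleted) {
            src++;
            continue;
        }
        /* Find the end of this run of live records. */
        size_t run_end = src;
        while (run_end < count && !recs[run_end].deleted)
            run_end++;

        /* The destination is always at or below the source, and the two
         * ranges may overlap, so memmove() is used rather than memcpy(). */
        if (dst != src)
            memmove(&recs[dst], &recs[src], (run_end - src) * sizeof recs[0]);

        dst += run_end - src;
        src = run_end;
    }
    return dst;
}
```

Because the destination is never above the source here, a memcpy() that happens to copy in increasing address order would also give the right answer, but the C standard does not promise that order.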
On 03/09/2024 21:28, Stefan Monnier wrote:
My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety".  I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Agreed.
I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.
On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:
David Brown <[email protected]> wrote:
On 03/09/2024 21:28, Stefan Monnier wrote:
My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety".  I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Agreed.
I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.
Wrong, the last version of Swift added all the garbage programming
concepts
that one should avoid.
You have to give people the tools to do anything.
It is impossible to create a computer programming language where
the programmer cannot shoot himself in the foot.
David Brown <[email protected]> wrote:
On 03/09/2024 21:28, Stefan Monnier wrote:
My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety".  I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Agreed.
I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.
Wrong, the last version of Swift added all the garbage programming
concepts
that one should avoid.
You have to give people the tools to do anything.
David Brown <[email protected]> schrieb:
I'm all in favour of temporarily having checks for overflow (and
other errors) during debugging, but I am sceptical to having
distinct debug/release builds. It encourages people to use debug
builds during development, bug hunting and testing, then when all
looks good they switch to release build and deploy it. I prefer a
single build, and enable run-time checks on parts of it if and when
necessary.
Wise man once said...
# It is absurd to make elaborate security checks on debugging runs,
# when no trust is put in the results, and then remove them in
# production runs, when an erroneous result could be expensive or
# disastrous. What would we think of a sailing enthusiast who wears
# his lifejacket when training on dry land, but takes it off as soon
# as he goes to sea?
(C.A.R. Hoare, in "Hints on Programming Language Design")
On 04/09/2024 14:53, jseigh wrote:
On 9/4/24 06:57, David Brown wrote:
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.
You absolutely do want defined behavior on overflow.
No, you absolutely do /not/ want that - for the vast majority of
use-cases.
There are times when you want wrapping behaviour, yes.  More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic.  But those cases are rare, and in C they are easily handled
using unsigned integers.
You can't use signed integers for them in C (except of course if you use
explicit modulo and none of your intermediary results overflow int),
because signed integer overflow is UB.  You can't use signed integers
for the purpose in Rust either, even though it is defined behaviour in
release mode, because it is a run-time error in debug mode.  (That's why
Rust's attitude here is completely daft to me.)
There are
algorithms that depend on that. Bakery algorithms for instance.
Unless you think a real life bakery with service tickets
numbering from 1 to 50 either never gets more than 50 customers
in a day or closes after their 50th customer. :)
Joe Seigh
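As an illustration of the bakery-ticket case handled with unsigned integers in C, here is a minimal sketch. The struct and function names are hypothetical; the only language rule relied on is that unsigned arithmetic is defined to wrap modulo 2^N.

```c
#include <stdint.h>

/* Hypothetical ticket dispenser state; names are illustrative only. */
struct ticket_counter {
    uint32_t next_ticket;  /* next number to hand out */
    uint32_t now_serving;  /* number currently being served */
};

/* Hand out a ticket.  The stored counter wraps from UINT32_MAX to 0,
 * which is defined behaviour for unsigned types. */
uint32_t take_ticket(struct ticket_counter *c)
{
    return c->next_ticket++;
}

/* Number of tickets handed out but not yet served.  The subtraction is
 * reduced modulo 2^32, so the result stays correct after next_ticket
 * has wrapped past now_serving. */
uint32_t customers_waiting(const struct ticket_counter *c)
{
    return c->next_ticket - c->now_serving;
}
```

The same code written with int32_t counters would have undefined behaviour at the wrap point in C.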
David Brown <[email protected]> wrote:
On 04/09/2024 14:53, jseigh wrote:
On 9/4/24 06:57, David Brown wrote:
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour?  It makes no sense.  The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly.  It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.
You absolutely do want defined behavior on overflow.
No, you absolutely do /not/ want that - for the vast majority of use-cases.
There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.
I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.
Complete and udder disaster, went back to plain sized ints.
We use it exclusively for datatypes in the domain [0, 2**n).  It's always
compared against other unsigned variables or constants.  Works quite well.
Safer and cleaner than willy-nilly using int.
David Brown <[email protected]> wrote:
On 04/09/2024 14:53, jseigh wrote:
On 9/4/24 06:57, David Brown wrote:
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour?  It makes no sense.  The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly.  It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.
You absolutely do want defined behavior on overflow.
No, you absolutely do /not/ want that - for the vast majority of
use-cases.
There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.
I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.
On 9/4/2024 3:18 PM, MitchAlsup1 wrote:
On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:
David Brown <[email protected]> wrote:
On 03/09/2024 21:28, Stefan Monnier wrote:
My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety".  I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff".  On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Agreed.
I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.
Wrong, the last version of Swift added all the garbage programming
concepts
that one should avoid.
You have to give people the tools to do anything.
It is impossible to create a computer programming language where
the programmer cannot shoot himself in the foot.
A language could alternatively try to go in a direction like HolyC:
Take C:
Remove most advanced features;
Add some weird syntax tweaks;
Make all the types explicit sized.
Some of it is almost half tempting, except that I would probably make
the type-names lower-case to match with my existing usage (and save
needing to hit SHIFT as often).
Say:
u0: void
u1: _Bool
u8: unsigned char
u16: unsigned short
...
i16/s16: signed short
i32/s32: signed int
i64/s64: signed long long
f32: float
f64: double
m32: opaque 32-bit type
m64: opaque 64-bit type
m128: opaque 128-bit type
....
Then, say:
u0 foo(args...)
{
...
}
Where, args is exposed as an array of u32 or u64 depending on the target
architecture.
....
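For what it is worth, most of the explicit-sized names above can already be approximated in plain C with typedefs. This is only a sketch: it maps the names onto the <stdint.h> fixed-width types rather than onto unsigned char, short and so on as in the list, it assumes a platform where float and double are 32 and 64 bits wide, and widen_mul is a made-up helper just to show the types in use.

```c
#include <stdint.h>

/* Lower-case explicit-sized names, sketched as typedefs. */
typedef void     u0;
typedef _Bool    u1;
typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int16_t  i16;
typedef int32_t  i32;
typedef int64_t  i64;
typedef float    f32;   /* assumes a 32-bit float */
typedef double   f64;   /* assumes a 64-bit double */

/* Fail the build early if the floating-point assumptions do not hold. */
_Static_assert(sizeof(f32) == 4, "f32 is expected to be 4 bytes");
_Static_assert(sizeof(f64) == 8, "f64 is expected to be 8 bytes");

/* Hypothetical helper: widen before multiplying, so the product of two
 * 32-bit values is computed in 64 bits. */
i64 widen_mul(i32 a, i32 b)
{
    return (i64)a * b;
}
```

The opaque m32/m64/m128 types and the args array have no direct equivalent in standard C and are left out here.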
On 9/4/2024 3:59 PM, Scott Lurndal wrote:
Say:
long z;
int x, y;
...
z=x*y;
Would auto-promote to long before the multiply.
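For comparison, here is a small sketch of what standard C does with this pattern today, using fixed-width types so that the widths are explicit. The function names are made up for the example. In mul_narrow the wrap is produced with unsigned arithmetic, because overflowing a signed multiply would be undefined behaviour; the conversion back to int32_t is implementation-defined, and the sketch assumes a common two's complement compiler with 32-bit int.

```c
#include <stdint.h>

/* What "z = x * y" amounts to when the operands are 32 bits wide: the
 * multiply is done at operand width and only the result is widened. */
int64_t mul_narrow(int32_t x, int32_t y)
{
    uint32_t wrapped = (uint32_t)x * (uint32_t)y;  /* wraps modulo 2^32 */
    return (int64_t)(int32_t)wrapped;              /* widened afterwards */
}

/* Converting one operand first makes the multiply itself 64 bits wide. */
int64_t mul_wide(int32_t x, int32_t y)
{
    return (int64_t)x * y;
}
```

With x = y = 100000 the two functions disagree, since the true product 10000000000 does not fit in 32 bits.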
On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:
On 9/4/2024 3:59 PM, Scott Lurndal wrote:
Say:
long z;
int x, y;
...
z=x*y;
Would auto-promote to long before the multiply.
I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.
David Brown <[email protected]> wrote:
On 04/09/2024 14:53, jseigh wrote:
On 9/4/24 06:57, David Brown wrote:
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour?  It makes no sense.  The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly.  It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.
You absolutely do want defined behavior on overflow.
No, you absolutely do /not/ want that - for the vast majority of use-cases.
There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.
I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.
Complete and udder disaster, went back to plain sized ints.
On 05/09/2024 02:56, MitchAlsup1 wrote:
On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:
On 9/4/2024 3:59 PM, Scott Lurndal wrote:
Say:
long z;
int x, y;
...
z=x*y;
Would auto-promote to long before the multiply.
I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.
You snipped rather unfortunately here - it makes it look like this was
code that Scott wrote, and you've removed essential context by BGB.
While I agree it is an example of the kind of code that people
sometimes write when they don't understand C arithmetic, I don't think
it is C-specific.  I can't think of any language off-hand where
expressions are evaluated differently depending on types used further
out in the expression.  Can you give any examples of languages where
the equivalent code would either do the multiplication as "long", or
give an error so that the programmer would be informed of their error?
(I don't count personal one-person languages here.
On 04/09/2024 22:31, Brett wrote:
David Brown <[email protected]> wrote:
On 04/09/2024 14:53, jseigh wrote:
On 9/4/24 06:57, David Brown wrote:
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour?  It makes no sense.  The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly.  It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.
You absolutely do want defined behavior on overflow.
No, you absolutely do /not/ want that - for the vast majority of
use-cases.
There are times when you want wrapping behaviour, yes.  More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic.  But those cases are rare, and in C they are easily handled
using unsigned integers.
I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int
somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.
Complete and udder disaster, went back to plain sized ints.
That's a matter of choice in the warnings you pick and the style you use
- these should match.
However, I don't think C's integer promotion rules are ideal in regard
to mixing signed and unsigned arithmetic - converting both to "unsigned"
can easily lead to trouble.
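A small sketch of the kind of trouble meant here, with hypothetical function names. In less_mixed the int operand is implicitly converted to unsigned int by the usual arithmetic conversions, so a negative value turns into a very large one before the comparison; compilers typically flag this with a sign-compare warning when that warning is enabled.

```c
#include <stdbool.h>

/* Mixed comparison: a is converted to unsigned int before comparing. */
bool less_mixed(int a, unsigned int b)
{
    return a < b;
}

/* One way to get the mathematically expected answer: handle negative
 * values first, then compare with both operands unsigned. */
bool less_safe(int a, unsigned int b)
{
    return a < 0 || (unsigned int)a < b;
}
```

For a = -1 and b = 1 the mixed version reports that -1 is not less than 1.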
Some people recommend using unsigned int everywhere you can, because the
overflow behaviour is defined - I think that is simply wrong.  Use
unsigned int where it is appropriate, but it is very rare (though it
happens sometimes) that you want any arithmetic to overflow in any way.
So the justification is wrong.
Some people like to use unsigned int when the values will not be
negative.  I don't think that is a good idea either.  In general, for
any given use you only need a limited range of values.  0 to 10000 is
just as much a subset of "int" as "unsigned int", and using "unsigned
int" does not give any advantages.  On the contrary, using "int" can
give more efficient code in many places, and lets you enable warnings
about mixed unsigned / signed operations for when you actually want them.
Unsigned types are ideal for "raw" memory access or external data, for
anything involving bit manipulation (use of &, |, ^, << and >> on signed
types is usually wrong, IMHO), as building blocks in extended arithmetic
types, for the few occasions when you want two's complement wrapping,
and for the even fewer occasions when you actually need that last bit of
range.
It would be nice if C had subrange types like Pascal or Ada, but it does
not.  Usually int - or sized ints - are the practical choice.
I suspect that My 66000 is the only current ISA that efficiently
supports::
u7:
u11:
u15:
u21:
s47:
s19:
On Wed, 4 Sep 2024 17:25:44 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
David Brown <[email protected]> schrieb:
I'm all in favour of temporarily having checks for overflow (and
other errors) during debugging, but I am sceptical to having
distinct debug/release builds. It encourages people to use debug
builds during development, bug hunting and testing, then when all
looks good they switch to release build and deploy it. I prefer a
single build, and enable run-time checks on parts of it if and when
necessary.
Wise man once said...
# It is absurd to make elaborate security checks on debugging runs,
# when no trust is put in the results, and then remove them in
# production runs, when an erroneous result could be expensive or
# disastrous. What would we think of a sailing enthusiast who wears
# his lifejacket when training on dry land, but takes it off as soon
# as he goes to sea?
(C.A.R. Hoare, in "Hints on Programming Language Design")
Wise man was wrong.
Range checks are not similar to life jackets. They do not turn an
incorrect program into a correct one.
Anton writes code that seriously pushes the boundary of what can be
achieved. For at least some of the things he does (such as GForth) he
is trying to squeeze every last drop of speed out of the target. And he
is /really/ good at it.  But that means he is forever relying on nuances
about code generation.  His code, at least for efficiency if not for
correctness, is dependent on details far beyond what is specified and
documented for C and for the gcc compiler.  He might spend a long time
working with his code and a version of gcc, fine-tuning the details of
his source code to get out exactly the assembly he wants from the
compiler.
Of course it is frustrating for him when the next version of
gcc generates very different assembly from that same source, but he is
not really programming at the level of C, and he should not expect
consistency from C compilers like he does.
On 2024-09-05 10:54, David Brown wrote:
On 05/09/2024 02:56, MitchAlsup1 wrote:
On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:
On 9/4/2024 3:59 PM, Scott Lurndal wrote:
Say:
long z;
int x, y;
...
z=x*y;
Would auto-promote to long before the multiply.
I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.
You snipped rather unfortunately here - it makes it look like this was
code that Scott wrote, and you've removed essential context by BGB.
While I agree it is an example of the kind of code that people
sometimes write when they don't understand C arithmetic, I don't think
it is C-specific. I can't think of any language off-hand where
expressions are evaluated differently depending on types used further
out in the expression. Can you give any examples of languages where
the equivalent code would either do the multiplication as "long", or
give an error so that the programmer would be informed of their error?
The Ada language can work in both ways. If you just have:
z : Long_Integer; -- Not a standard Ada type, but often provided.
x, y : Integer;
...
z := x * y;
the compiler will inform you that the types in the assignment do not
match: using the standard (predefined) operator "*", the product of two
Integers gives an Integer, not a Long_Integer.
If you add this definition to the code:
function "*" (Left, Right : Integer) return Long_Integer
is (Long_Integer(Left) * Long_Integer(Right));
the compiler sees that there is now /also/ an Integer * Integer =>
Long_Integer multiplication operator, and uses that. Function
overloading in Ada can depend on the type expected of the result.
Perhaps you asked for a language that worked like this "out of the box",
without the programmer having to add things like the "*" function above,
and then Ada would not qualify on the second alternative (automatic
lengthening before multiplication, depending on the result type desired).
(I don't count personal one-person languages here.
While Ada has low market penetration, I don't think it quite qualifies
as a one-person language -- yet :-)
David Brown wrote:
On 04/09/2024 22:31, Brett wrote:
David Brown <[email protected]> wrote:
On 04/09/2024 14:53, jseigh wrote:
On 9/4/24 06:57, David Brown wrote:
On 04/09/2024 09:15, Terje Mathisen wrote:
David Brown wrote:
Maybe?
Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.
But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour?  It makes no sense.  The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly.  It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.
You absolutely do want defined behavior on overflow.
No, you absolutely do /not/ want that - for the vast majority of
use-cases.
There are times when you want wrapping behaviour, yes.  More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic.  But those cases are rare, and in C they are easily handled
using unsigned integers.
I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.
Complete and udder disaster, went back to plain sized ints.
That's a matter of choice in the warnings you pick and the style you
use - these should match.
However, I don't think C's integer promotion rules are ideal in regard
to mixing signed and unsigned arithmetic - converting both to
"unsigned" can easily lead to trouble.
Some people recommend using unsigned int everywhere you can, because
the overflow behaviour is defined - I think that is simply wrong. Use
unsigned int where it is appropriate, but it is very rare (though it
happens sometimes) that you want any arithmetic to overflow in any
way. So the justification is wrong.
Some people like to use unsigned int when the values will not be
negative. I don't think that is a good idea either. In general, for
any given use you only need a limited range of values. 0 to 10000 is
just as much a subset of "int" as "unsigned int", and using "unsigned
int" does not give any advantages. On the contrary, using "int" can
give more efficient code in many places, and lets you enable warnings
about mixed unsigned / signed operations for when you actually want them.
Unsigned types are ideal for "raw" memory access or external data, for
anything involving bit manipulation (use of &, |, ^, << and >> on
signed types is usually wrong, IMHO), as building blocks in extended
arithmetic types, for the few occasions when you want two's complement
wrapping, and for the even fewer occasions when you actually need that
last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.
It would be nice if C had subrange types like Pascal or Ada, but it
does not.  Usually int - or sized ints - are the practical choice.
Agreed 100%
I wrote enough Pascal with ranged types that I got used to it, and found
that I was missing the feature when I used C.
Terje
David Brown <[email protected]> wrote:
On 03/09/2024 21:28, Stefan Monnier wrote:
My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Agreed.
I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.
Wrong, the last version of Swift added all the garbage programming concepts that one should avoid.
You have to give people the tools to do anything.
Specifications are an agreement between the supplier and the client. The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:
On 04/09/2024 18:07, Tim Rentsch wrote:
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed.  So you might as well
just call memmove() any time you are not sure memcpy() is safe and
appropriate.
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs it
saves everyone long term grief.
When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.
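As a sketch of what such a hand-written loop might look like: the hypothetical copy_forward below copies strictly front to back, one byte at a time, so its result with overlapping buffers is fully determined by the source code rather than by the library. One use that depends on exactly this order is replicating a short pattern through a buffer, as LZ77-style decompressors do.

```c
#include <stddef.h>

/* Copy n bytes strictly in increasing address order.
 *
 * Deliberately NOT memcpy() or memmove(): when dst overlaps src from
 * above, bytes written early in the loop are read again later in the
 * same loop, which is the intended effect. */
void copy_forward(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Starting from a buffer that begins with "ab", calling copy_forward(buf + 2, buf, 6) fills the first eight bytes with "abababab"; memmove() would instead copy the original six bytes unchanged.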
On Wed, 4 Sep 2024 7:29:01 +0000, David Brown wrote:
On 03/09/2024 22:22, MitchAlsup1 wrote:
On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:
Specifications are an agreement between the supplier and the
client. The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
# define int int64_t
..
makes it easier.
That's UB, I believe :-) And it will certainly be confusing.
On 64-bit machines it re-establishes the dusty-deck old K&R C where
int was the fastest integer type.
On 05/09/2024 11:12, Terje Mathisen wrote:
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data, for
anything involving bit manipulation (use of &, |, ^, << and >> on signed
types is usually wrong, IMHO), as building blocks in extended arithmetic
types, for the few occasions when you want two's complement wrapping,
and for the even fewer occasions when you actually need that last bit of
range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.
It would be nice if C had subrange types like Pascal or Ada, but it does
not.  Usually int - or sized ints - are the practical choice.
Agreed 100%
I wrote enough Pascal with ranged types that I got used to it, and found
that I was missing the feature when I used C.
In either case, treating the C standard as agreement is nonsense.
David Brown <[email protected]> writes:
On 05/09/2024 11:12, Terje Mathisen wrote:
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
We do. There is no issue using unsigned loop counters,
David Brown <[email protected]> writes:
Anton writes code that seriously pushes the boundary of what can be
achieved. For at least some of the things he does (such as GForth) he
is trying to squeeze every last drop of speed out of the target. And he
is /really/ good at it. But that means he is forever relying on nuances
about code generation. His code, at least for efficiency if not for
correctness, is dependent on details far beyond what is specified and
documented for C and for the gcc compiler. He might spend a long time
working with his code and a version of gcc, fine-tuning the details of
his source code to get out exactly the assembly he wants from the
compiler.
No. We distribute Gforth as source code. It works for a wide variety
of architectures and compilers. So unlike what you suggest and what
some people have suggested earlier to avoid problems with new
"optimizations" in newer releases of gcc, we don't concentrate on a
specific version of gcc.
Of course it is frustrating for him when the next version of
gcc generates very different assembly from that same source, but he is
not really programming at the level of C, and he should not expect
consistency from C compilers like he does.
It's normal and no problem when the next version of gcc generates
different assembly language. There are some basic assumptions that
our code relies on, and that mostly does not change between gcc
versions.
An essential assumption is that, when we have:
A:
C code
B:
... that when we do &&A and &&B (which is documented in the GNU C
manual), we get the addresses pointing to the start and end of the
machine code corresponding to the C code.
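For readers who have not used the feature, here is a minimal sketch of GNU C labels-as-values (&&label) and computed goto (goto *), in the form of a tiny direct-threaded dispatch loop. This is not Gforth code; the opcodes and the function run are made up for the example. It shows only the documented language extension, not the copying of machine code between labels that the assumption above is about, and it needs gcc or a compatible compiler.

```c
/* GNU C only: uses labels as values (&&label) and computed goto. */

/* Hypothetical opcodes for a toy accumulator machine. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Run a program of one-byte opcodes, starting from acc. */
int run(const unsigned char *program, int acc)
{
    /* Table of label addresses, indexed by opcode. */
    static void *const dispatch[] = { &&do_inc, &&do_double, &&do_halt };
    const unsigned char *ip = program;

    goto *dispatch[*ip++];      /* jump to the first instruction */

do_inc:
    acc += 1;
    goto *dispatch[*ip++];      /* each handler dispatches the next one */
do_double:
    acc *= 2;
    goto *dispatch[*ip++];
do_halt:
    return acc;
}
```

Each handler ends in its own goto *, which is the pattern the later remarks about gcc merging or redirecting such jumps refer to.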
In the days starting with
gcc-3.0, we found that gcc started reordering the basic blocks within
loops, so replaced loops in the part of the code that needs such
assumptions into separate functions. Around gcc-7, gcc started to
compile
A: C-code1
B: C-code2
C: goto *...
to the same code as
A: C-code1; C-code2; goto *...;
B: C-code2; goto *...;
C: goto *...;
I found a workaround that avoids this kind of code generation.
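For readers who have not used the labels-as-values extension, here is a minimal sketch of threaded-code dispatch with `&&label` and `goto *`. It is not Gforth's code: the opcode names, the `run_threaded` function and the 16-entry stack are made up for illustration, and it needs GNU C (gcc or clang).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not Gforth's). */
enum { OP_LIT, OP_ADD, OP_MUL, OP_HALT };

/* Runs a program made of opcodes and inline operands and returns the
   top of stack. Opcodes are trusted; there is no range check on them. */
long run_threaded(const long *prog)
{
    /* GNU C extension: &&label yields the address of a label. */
    static void *const dispatch[] = { &&do_lit, &&do_add, &&do_mul, &&do_halt };
    long stack[16];
    size_t sp = 0;
    const long *ip = prog;

/* Fetch the next opcode and jump straight to its code. */
#define NEXT() goto *dispatch[*ip++]

    NEXT();

do_lit:
    assert(sp < 16);
    stack[sp++] = *ip++;          /* push the inline operand */
    NEXT();
do_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    NEXT();
do_mul:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] *= stack[sp];
    NEXT();
do_halt:
    assert(sp >= 1);
    return stack[sp - 1];
#undef NEXT
}
```

Running it on the program `{ OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_LIT, 4, OP_MUL, OP_HALT }` computes (2+3)*4. The sketch only relies on the documented property that `goto *` can jump to a label whose address was taken; it does not rely on the code between two labels being contiguous, which is the extra assumption discussed above.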
Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
that gcc compiled
goto *ca;
into the equivalent of
goto gotoca;
/* and elsewhere */
gotoca: goto *ca;
We reported that repeatedly. At one point a gcc maintainer gave us
some bullshit about a possible performance advantage from this transformation, of course without presenting any empirical support,
while we saw a big slowdown on our code. We developed workarounds for
that, and they are in Gforth to this day, even though we have not
encountered a new gcc version with this problem for over a decade, but
new Gforth should also work on old gcc.
Another assumption is that when we concatenate the code snippet
between label A and B (which contains C-code1) and the code snippet
between label X and Y (which contains C-code3), executing the result
will behave like the concatenation of C-code1 and C-code3 in source
code. This assumption has two aspects:
1) Do the register assignments at the labels fit together. It turns
out that we never had a problem with that, and I think that the
reason for that is that the "goto *" can jump to any of those
labels (all their addresses are taken), and so the register
assignment must be the same right after each label.
What guarantees that the assignments are the same right before each
label? Probably that after the label, there is not much between
the label and the next goto*, and that makes all registers at
potential targets live.
2) If we have two pieces of machine code produced in this fashion,
does the architecture guarantee that such a concatenation works?
It turns out that in general-purpose architectures, all-but-one do.
That includes IA-64. The exception is MIPS with its architectural
load-delay slot (and there are also scheduling restrictions having
to do with the hilo register that may be problematic): the first
code snippet may end in a load, and the next code snippet may start
with an instruction that reads the result of the load. So we just
disabled this concatenation on MIPS.
We do a number of things to achieve stability: We do sanity-checking
on the resulting machine code snippets and fall back to plain threaded
code if the snippets turn out not to be relocatable.
Also, we enable all the flags for defining behaviour in gcc that we
find (unfortunately, in the documentation they are intermixed with
other options). For good measure, this includes -fno-delete-null-pointer-checks, although I doubt that it makes a
difference for our code either way.
One thing that came up about a year ago was that gcc auto-vectorizes
adjacent memory accesses on AMD64 (apparently the AMD64 port
maintainers are unhappy because AMD64 does not have instructions like
ARM A64's ldp and stp:-), which did not impact correctness, but led to
worse performance (not just in Gforth; I have also seen it in the
bubble benchmark from John Hennessy's Stanford small integer
benchmarks; I'm sure there is some SPEC benchmark that benefits). A
quick addition of -fno-tree-vectorize fixed that.
We have been thinking about moving from C to a better-defined
language, namely assembly language, but have not yet taken the plunge,
and it has not been necessary yet. Gcc has not been as crazy in our
experience as the UB rhetoric might make one think. Why is that? I
think the reasons are:
1) Gforth and a lot of other "irrelevant" (to the gcc maintainers)
projects sail in the slipstream of "relevant" code like SPEC and
the Linux kernel that are all full of undefined behaviour (Linux
defines many of them with flags, like Gforth does), so gcc does not
"optimize" as crazily as a UB fan might wish.
2) The code snippets are very short, with many in-edges on the
preceding and following label, which tends to destroy any
"knowledge" that the compiler might have derived from the
assumption that the program does not exercise undefined behaviour.
This severely limits the distance over which such "optimizations"
can be performed.
Nevertheless, the last time I tried what happens if I compile without
the behaviour-defining options, the result did not work; I did not investigate this further.
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >> on
signed types is usually wrong, IMHO), as building blocks in extended
arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Scott Lurndal <[email protected]> schrieb:
David Brown <[email protected]> writes:
On 05/09/2024 11:12, Terje Mathisen wrote:
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
We do. There is no issue using unsigned loop counters,
I find counting down from n to 0 using unsigned variables
unintuitive. Or do you always count up and then calculate
what you actually use? Induction variable optimization
should take care of that, but it would be more complicated
to use.
On 05/09/2024 13:31, Anton Ertl wrote:...
It's normal and no problem when the next version of gcc generates
different assembly language. There are some basic assumptions that
our code relies on, and that mostly does not change between gcc
versions.
In the days starting with
gcc-3.0, we found that gcc started reordering the basic blocks within
loops, so replaced loops in the part of the code that needs such
assumptions into separate functions. Around gcc-7, gcc started to
compile
A: C-code1
B: C-code2
C: goto *...
to the same code as
A: C-code1; C-code2; goto *...;
B: C-code2; goto *...;
C: goto *...;
I found a workaround that avoids this kind of code generation.
This is all the kind of thing you can expect when you make assumptions
about code generation that are not supported by the documentation.
I too have written code that relies on being able to identify the start
and end of certain bits of code - typically for microcontrollers where
you want some bits of code (like flash programming routines or very
timing critical interrupt code) put in ram rather than flash. Sometimes
that can be done with compiler extensions, sometimes it takes extra
flags, linker file magic, or other messing around. But it's not
something I would expect to be portable, and it needs confirmed for
every compiler version and selection of flags used. (I realise that
this is a vastly simpler task for the kind of work I do than for an open
source project!)
Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
that gcc compiled
goto *ca;
into the equivalent of
goto gotoca;
/* and elsewhere */
gotoca: goto *ca;
We reported that repeatedly. At one point a gcc maintainer gave us
some bullshit about a possible performance advantage from this
transformation, of course without presenting any empirical support,
while we saw a big slowdown on our code. We developed workarounds for
that, and they are in Gforth to this day, even though we have not
encountered a new gcc version with this problem for over a decade, but
new Gforth should also work on old gcc.
Again, the compiler is not doing anything outside its specifications.
You are looking for more than C and the gcc documented extensions give
you. That is always going to be hard.
Ideally, you need a new gcc flag or two with documented and guaranteed
effects to give you the assurance you need for your code. That's going
to take a lot of effort, I would expect, and I can see it being hard for
a relatively niche project like Gforth to push for that.
David Brown <[email protected]> writes:
On 05/09/2024 13:31, Anton Ertl wrote:
Terje Mathisen <[email protected]> writes:
David Brown wrote:
It would be nice if C had subrange types like Pascal or Ada, but it does
not. Usually int - or sized ints - are the practical choice.
Agreed 100%
Although absent architecture support, how does one ensure that the
value remains within the subrange?
On 04/09/2024 20:13, MitchAlsup1 wrote:
On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:
On 04/09/2024 18:07, Tim Rentsch wrote:
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed. So you might as well
just call memmove() any time you are not sure memcpy() is safe and
appropriate.
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs it
saves everyone long term grief.
Or just use memmove, and not memcpy, whenever you are moving stuff
around in the same buffer.
When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.
memcpy is not nefarious. It's quite simple, and does what it says on
the tin. Use it when you want to copy non-overlapping memory areas.
Don't use it if you want to do something other than that. I have never understood why anyone would find this difficult.
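The "simple test" mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `ranges_overlap` and `copy_bytes` are hypothetical names, and comparing addresses through `uintptr_t` assumes a flat address space, which the C standard does not promise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if [a, a+n) and [b, b+n) share at least one byte.
   Assumes uintptr_t values order the same way as addresses do. */
int ranges_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t dist = pa > pb ? pa - pb : pb - pa;
    return n != 0 && dist < n;
}

/* Copies n bytes, using memcpy() only when the areas are disjoint. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    if (n == 0)
        return dst;                    /* nothing to do, pointers unused */
    assert(dst != NULL && src != NULL);
    if (ranges_overlap(dst, src, n))
        return memmove(dst, src, n);   /* overlapping: direction matters */
    return memcpy(dst, src, n);        /* disjoint: memcpy is well defined */
}
```

In practice the check buys little, since memmove() typically does the same test internally. That is why simply calling memmove() whenever overlap is possible is the usual advice.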
array indices are always positive
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array index type is usize, i.e the largest supported unsigned type in the
system, typically the same as u64.
unsigned arithmetic is easier than signed integer arithmetic, including comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was
negative.
I.e I cannot easily replicate a downward loop that exits when the
counter become negative:
for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}
One of my alternatives are
unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
while (u);
}
This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above), but my version is far more verbose.
Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:
for (unsigned u = start; (u & TOPBIT) == 0; u--)
Terje
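A compact way to write the downward loop with an unsigned index is to test before decrementing. The sketch below is one common idiom, not code from this thread; `reverse_copy` is a hypothetical helper used only to make the loop testable.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src[0..n-1] into dst in reverse order using an unsigned index. */
void reverse_copy(int *dst, const int *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    size_t out = 0;
    /* "i-- > 0" compares first and decrements afterwards, so the body
       sees n-1, n-2, ..., 0 and the loop runs zero times when n == 0. */
    for (size_t i = n; i-- > 0; ) {
        dst[out++] = src[i];
    }
}
```

The index wraps to SIZE_MAX after the final test, but it is never used in that state, and unsigned wraparound is well defined in C.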
Nobody said that gcc did anything wrong here. We were, however,
surprised that -fno-reorder-blocks did not suppress the reordering; we reported this as bug, but were told that this option does something
different from what it says. Anyway, we developed a workaround. And
we also developed a workaround for the code duplication problem that
showed up in gcc-7.
On Thu, 5 Sep 2024 14:05:08 +0000, Scott Lurndal wrote:
Terje Mathisen <[email protected]> writes:
David Brown wrote:
It would be nice if C had subrange types like Pascal or Ada, but it does
not. Usually int - or sized ints - are the practical choice.
Agreed 100%
Although absent architecture support, how does one ensure that the
value remains within the subrange?
result = min(max(min_range,x),max_range);
or for 2^n values
result = ( ( x << (64-width) ) >> (64-width) );
The top is 2 instructions, the bottom 1 (both signed and unsigned).
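In portable C the two approaches might look like the sketch below. The function names are hypothetical. The signed variant avoids left-shifting a negative value (undefined in C) by working on the unsigned representation, and its final conversion back to int64_t assumes the usual two's complement behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating subrange: clamp x into [lo, hi]. */
int64_t clamp_range(int64_t x, int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Wrapping 2^width subrange, unsigned: keep the low `width` bits. */
uint64_t wrap_unsigned(uint64_t x, unsigned width)
{
    assert(width >= 1 && width <= 64);
    return (x << (64 - width)) >> (64 - width);
}

/* Wrapping 2^width subrange, signed: keep the low `width` bits and
   sign-extend from bit width-1 using the xor/subtract trick. */
int64_t wrap_signed(int64_t x, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t sign = UINT64_C(1) << (width - 1);
    uint64_t low = wrap_unsigned((uint64_t)x, width);
    return (int64_t)((low ^ sign) - sign);
}
```

Note that the two forms enforce different things: the min/max form saturates at the bounds, while the shift form wraps modulo 2^width. Neither reports a range violation the way an Ada or Pascal subrange check would.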
On Thu, 5 Sep 2024 14:06:45 +0000, Scott Lurndal wrote:
array indices are always positive
Not in ada or fortran.
On 04/09/2024 22:15, Brett wrote:
David Brown <[email protected]> wrote:
On 03/09/2024 21:28, Stefan Monnier wrote:
My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.
It is entirely possible to have correct use of memory in C,
If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.
Agreed.
I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.
Wrong, the last version of Swift added all the garbage programming concepts
that one should avoid.
That does not show that I was wrong - perhaps Swift is not a powerful programming language!
Of course, it all depends on what you mean by "powerful".
(I don't know Swift at all.)
You have to give people the tools to do anything.
You don't /have/ to do that. But it's often easier to market a language
that can do anything.
On 05.09.24 19:04, Terje Mathisen wrote:
One of my alternatives are
unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
while (u);
}
This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above), but my version is far more verbose.
Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:
for (unsigned u = start; (u & TOPBIT) == 0; u--)
Terje
What about:
for (unsigned u = start; u != ~0u; --u)
...
or even
for (unsigned u = start; (int)u >= 0; --u)
...
?
I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
on godbolt.org:
- 32 bit version: https://godbolt.org/z/TMhhx3nch
- 64 bit version: https://godbolt.org/z/8oxzTf5Gf
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.
unsigned arithmetic is easier than signed integer arithmetic,
including comparisons that would result in a negative value, you just
have to make the test before subtracting, instead of checking if the
result was negative.
I.e I cannot easily replicate a downward loop that exits when the
counter become negative:
for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}
One of my alternatives are
unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
} while (u); /* presumably the } was intended */
}
If START is signed (presumably of type int), so the loop might run
zero times, but never more than INT_MAX times, then
for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
// Do something with data[u]
}
If START is unsigned, so in all cases the loop must run at
least once, then
unsigned u = START;
do {
// Do something with data[u]
} while( u > 0 && u-- );
(Yes I know the 'u > 0' expressions can be replaced by just 'u'.)
The optimizer should be smart enough to realize that if 'u > 0'
is true then the test 'u--' will also be true. The same should
hold if 'u > 0' is replaced by just 'u'.
(Disclaimer: code not compiled.)
On 9/5/2024 8:27 AM, David Brown wrote:
On 05/09/2024 10:51, Niklas Holsti wrote:
On 2024-09-05 10:54, David Brown wrote:
On 05/09/2024 02:56, MitchAlsup1 wrote:
On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:
On 9/4/2024 3:59 PM, Scott Lurndal wrote:
Say:
long z;
int x, y;
...
z=x*y;
Would auto-promote to long before the multiply.
I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.
You snipped rather unfortunately here - it makes it look like this
was code that Scott wrote, and you've removed essential context by BGB.
While I agree it is an example of the kind of code that people
sometimes write when they don't understand C arithmetic, I don't
think it is C-specific. I can't think of any language off-hand
where expressions are evaluated differently depending on types used
further out in the expression. Can you give any examples of
languages where the equivalent code would either do the
multiplication as "long", or give an error so that the programmer
would be informed of their error?
The Ada language can work in both ways. If you just have:
z : Long_Integer; -- Not a standard Ada type, but often provided.
x, y : Integer;
...
z := x * y;
the compiler will inform you that the types in the assignment do not
match: using the standard (predefined) operator "*", the product of
two Integers gives an Integer, not a Long_Integer.
That seems like a safe choice. C's implicit promotion of int to long
int can be convenient, but convenience is sometimes at odds with safety.
A lot of time, implicit promotion will be the "safer" option than first
doing an operation that overflows and then promoting.
Annoyingly, one can't really do the implicit promotion first and then
promote afterwards, as there may be programs that actually rely on this particular bit of overflow behavior.
In effect, in my case, the promotion behavior ends up needing to depend
on the language-mode (it is either this or maybe internally split the operators into widening or non-widening variants, which are selected
when translating the AST into the IR stage).
Well, as opposed to dealing with the widening cases by emitting IR with
implicit casts added.
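To make the three behaviours under discussion concrete, here is a small C sketch. The function names are hypothetical, int32_t/int64_t stand in for "int" and "long", and the wrapping variant assumes the usual two's complement conversion back to a signed type.

```c
#include <assert.h>
#include <stdint.h>

/* Widen first, then multiply: the product of two 32-bit values always
   fits in 64 bits, so this cannot overflow. */
int64_t mul_widen_first(int32_t x, int32_t y)
{
    return (int64_t)x * y;
}

/* Multiply at 32 bits with wraparound, then widen. The arithmetic is
   done on unsigned values so the wraparound itself is well defined. */
int64_t mul_wrap_then_widen(int32_t x, int32_t y)
{
    uint32_t p = (uint32_t)((uint64_t)(uint32_t)x * (uint32_t)y);
    return (int64_t)(int32_t)p;
}

/* Ada-like strictness: the result must fit the narrow type, and a
   violation is reported (here via assert) instead of being hidden. */
int32_t mul_checked(int32_t x, int32_t y)
{
    int64_t p = (int64_t)x * y;
    assert(p >= INT32_MIN && p <= INT32_MAX);
    return (int32_t)p;
}
```

With x = y = 100000 the first variant gives 10000000000 and the second gives 1410065408, which is why a compiler cannot silently switch between them for programs that depend on the narrow result.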
On Thu, 5 Sep 2024 13:48:37 +0000, David Brown wrote:
On 04/09/2024 20:13, MitchAlsup1 wrote:
On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:
On 04/09/2024 18:07, Tim Rentsch wrote:
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed. So you might as
well just call memmove() any time you are not sure memcpy() is safe and
appropriate.
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs it
saves everyone long term grief.
Or just use memmove, and not memcpy, whenever you are moving stuff
around in the same buffer.
When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.
memcpy is not nefarious. It's quite simple, and does what it says on
the tin. Use it when you want to copy non-overlapping memory areas.
Don't use it if you want to do something other than that. I have never
understood why anyone would find this difficult.
There are compilers that do: s/memcpy/memmove/g
On 05.09.24 17:49, Anton Ertl wrote:
Nobody said that gcc did anything wrong here. We were, however,
surprised that -fno-reorder-blocks did not suppress the reordering; we
reported this as bug, but were told that this option does something
different from what it says. Anyway, we developed a workaround. And
we also developed a workaround for the code duplication problem that
showed up in gcc-7.
Have you tried interspersing `asm volatile("")` statements?
It is very often an effective means to prevent gcc from reordering code
from before and after the asm statement.
Bernd Linsel <[email protected]> writes:
On 05.09.24 19:04, Terje Mathisen wrote:
One of my alternatives are
unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
while (u);
}
This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above), but my version is far more verbose.
Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:
for (unsigned u = start; (u & TOPBIT) == 0; u--)
Terje
What about:
for (unsigned u = start; u != ~0u; --u)
This is the form we use most when we need
to work in reverse.
On 2024-09-05 18:49, Anton Ertl wrote:
David Brown <[email protected]> writes:
On 05/09/2024 13:31, Anton Ertl wrote:
[ discussion of the implementation of Gforth as a code-copying
and code-pasting interpreter, and the maintenance problems
this leads to when changing gcc versions ]
It seems to me that this discussion (of Gforth) has very little to do
with the ability of C compilers to optimize away or do something else
with C code that the compiler detects invokes Undefined Behavior
I don't doubt that Anton has experienced bad effects of the
"optimization" of Undefined Behavior, in other contexts
The main bad effect is that I replaced more efficient and shorter code
with less efficient and longer code. In theory the compiler can
generate the same code for both, but in practice that does not happen.
As an example, the test for the smallest signed integer can be written
with -fwrapv as:
if (x<=x-1)
and gcc -fwrapv compiles this to shorter code on AMD64 than
if (x==CELL_MIN)
What gcc produces for both formulations is longer than
dec %rdi
jno ...
Maybe instead of pursuing "optimizations" against the intentions of
the programmer, they should concentrate on implementing real
optimizations like optimizing either variant into the small code shown
last.
Interestingly, the first idiom is a case where gcc recognizes what the intention of the programmer is, and warns that it is going to
miscompile it. The warning is good, the miscompilation not (but it
would be worse without the warning).
In any case, while the actual experience is that I have not been hit
by "optimizations" that ATUBDNH in production code, possibly because I minimize these assumptions with flags like -fwrapv, the possibility
that my code might be hit by such an "optimization" (e.g., a new one
in a new compiler version, if I am lucky with a new flag for disabling
the assumption, but my source code does not know about it yet) and the attitude of people who implement such "optimizations" is what I
resent.
- anton
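For comparison, here are three ways to write the "is x the smallest cell" test that do not depend on -fwrapv. This is a sketch: `cell` and `CELL_MIN` are stand-ins chosen here (a 64-bit cell is assumed), and `__builtin_sub_overflow` is a GCC/Clang builtin rather than standard C. No claim is made about which instructions a given compiler version emits for any of them.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t cell;          /* assumed 64-bit cell, for illustration */
#define CELL_MIN INT64_MIN

/* Explicit comparison against the constant. */
int is_cell_min_cmp(cell x)
{
    return x == CELL_MIN;
}

/* The x <= x-1 idiom with the wraparound done on the unsigned
   representation, so it is defined without -fwrapv (the conversion
   back to cell assumes two's complement). */
int is_cell_min_wrap(cell x)
{
    return x <= (cell)((uint64_t)x - 1u);
}

/* Overflow-flag formulation: x - 1 overflows exactly when x is the
   minimum, which is what the dec/jno sequence tests. */
int is_cell_min_ovf(cell x)
{
    cell ignored;
    int overflowed = __builtin_sub_overflow(x, (cell)1, &ignored);
    assert(overflowed == is_cell_min_cmp(x));   /* debug cross-check */
    return overflowed;
}
```

The builtin states the intent ("did the decrement overflow?") directly, so it is the formulation most likely to map onto a flag test, but that remains up to the compiler.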
On 05.09.24 17:49, Anton Ertl wrote:
Nobody said that gcc did anything wrong here. We were, however,
surprised that -fno-reorder-blocks did not suppress the reordering; we
reported this as bug, but were told that this option does something
different from what it says. Anyway, we developed a workaround. And
we also developed a workaround for the code duplication problem that
showed up in gcc-7.
Have you tried interspersing `asm volatile("")` statements?
It is very often an effective means to prevent gcc from reordering code
from before and after the asm statement.
If you additionally specify inputs, e.g. `asm volatile("" :: "r" (foo))`,
you can force gcc to keep `foo` alive up to this point.
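A minimal sketch of both forms, wrapped in macros so they can be dropped into a loop. The macro names and `sum_with_barriers` are hypothetical; the asm syntax is GNU C. One caveat: the gcc documentation notes that even volatile asm statements can be moved relative to other code, so this is a strong hint rather than a guarantee. The "memory" clobber at least prevents memory accesses from being cached or reordered across the statement.

```c
#include <assert.h>
#include <stddef.h>

/* Empty asm that the compiler must assume reads and writes memory. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* Empty asm with an input operand: val has to be computed and held in
   a register at this point in the code. */
#define KEEP_ALIVE(val) __asm__ volatile("" :: "r"(val))

/* Sums an array while forcing the running total to be materialized in
   every iteration (which also blocks auto-vectorization of the loop). */
long sum_with_barriers(const long *data, size_t n)
{
    assert(n == 0 || data != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
        KEEP_ALIVE(sum);
    }
    COMPILER_BARRIER();
    return sum;
}
```

The asm bodies are empty, so no instructions are added; the effect is purely on what the optimizer is allowed to assume.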
Thomas Koenig <[email protected]> schrieb:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
On Fri, 06 Sep 2024 07:25:35 GMT
[email protected] (Anton Ertl) wrote:
What gcc produces for both formulations is longer than
dec %rdi
jno ...
Good trick.
The same trick in non-destructive form would be 1 byte longer.
cmp $1, %rdi
jno ...
But I was not able to force any of compilers currently installed on my
home desktop (gcc 13.2, clang 18.1, MSVC 19.30.30706 == VS2022) to
produce such code.
The closest was MSVC that sometimes (not in all circumstances) produces
2 bytes longer version:
49 8d 49 ff lea -0x1(%r9),%rcx
4c 3b c9 cmp %rcx,%r9
Of course, it's still good deal shorter than
48 ba 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rdx
4c 3b ca cmp %rdx,%r9
Both gcc and clang [under -fwrapv] insisted on turning x<=x-1 into
x==LLONG_MIN.
However even if we were able to force compiler to produce desired code,
the space saving is architecture-specific.
E.g. I expect no saving on ARM64 where both variants occupy 8 bytes.
Interestingly, the first idiom is a case where gcc recognizes what the
intention of the programmer is, and warns that it is going to
miscompile it. The warning is good, the miscompilation not (but it
would be worse without the warning).
You had more luck with warnings than I did.
In all my test cases both gcc and clang [in absence of -fwrapv]
silently dropped the check and the dependent code.
Thomas Koenig <[email protected]> writes:
Thomas Koenig <[email protected]> schrieb:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?
If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.
If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.
If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.
Let me be clear that I have not thought through all the details
about exactly what the rules are or how they might be put into
effect. Hopefully though my comments here give you a better
sense of the direction meant to be suggested.
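Until rules like these exist, the only way to get one of the behaviours described above is to ask for it explicitly in the source. The sketch below shows one way to spell each of the three outcomes for the 'a + 1' case; the function names are hypothetical, and the wrapping variant assumes the usual two's complement conversion from unsigned to int.

```c
#include <assert.h>
#include <limits.h>

/* The intent of "if (a > a + 1)" without evaluating an overflowing
   addition: the increment overflows exactly when a == INT_MAX. */
int increment_would_overflow(int a)
{
    return a == INT_MAX;
}

/* "Evaluate it and see what the hardware does": a wrapping increment
   done on the unsigned representation, defined for every int value. */
int increment_wrapping(int a)
{
    return (int)((unsigned)a + 1u);
}

/* "Abort the program": the overflow case is reported via assert, so
   this only traps when NDEBUG is not defined. */
int increment_checked(int a)
{
    assert(!increment_would_overflow(a));
    return a + 1;
}
```

Each variant pins down one of the outcomes, so the compiler has no undefined case left to reason from, at the cost of the programmer having to choose up front.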
On 9/5/2024 10:04 AM, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e the largest supported unsigned type in the
system, typically the same as u64.
unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was
negative.
I.e I cannot easily replicate a downward loop that exits when the
counter become negative:
for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}
for (int i = START; i > -1; i-- ) {
// Do something with data[i]
}
;^)
One of my alternatives are
unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
while (u);
}
any unsigned integer cannot be less than zero?
This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above), but my version is far more verbose.
Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:
for (unsigned u = start; (u & TOPBIT) == 0; u--)
Terje
On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:
On 9/5/2024 10:04 AM, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e. the largest supported unsigned type in the
system, typically the same as u64.
Unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value: you just have to make
the test before subtracting, instead of checking if the result was
negative.
I.e., I cannot easily replicate a downward loop that exits when the
counter becomes negative:
for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}
for (int i = START; i > -1; i-- ) {
// Do something with data[i]
}
;^)
# define START 0x80000001
Thomas Koenig <[email protected]> writes:
Thomas Koenig <[email protected]> wrote:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior.
(It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?
If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.
If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.
If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.
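As a point of reference for the `a > a + 1` example: in C as it stands today, the only way to get a dependable answer is to make the test before the addition, so that no overflow ever happens. The following is a minimal sketch of that idea; the function names are hypothetical and it illustrates current practice, not the proposed "limited undefined behavior" rule.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True when a + 1 would not fit in int. The test happens before any
   addition, so the function itself never overflows. */
static bool increment_would_overflow(int a)
{
    return a == INT_MAX;
}

/* Checked increment: stores a + 1 in *out and returns true, or returns
   false and leaves *out untouched when the result would overflow. */
static bool checked_increment(int a, int *out)
{
    if (increment_would_overflow(a))
        return false;
    *out = a + 1;
    return true;
}
```

Because the comparison is against INT_MAX rather than against `a + 1`, a compiler has nothing to fold away under the assumption that signed overflow cannot occur.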
Scott Lurndal <[email protected]> wrote:
David Brown <[email protected]> writes:
On 05/09/2024 11:12, Terje Mathisen wrote:
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
We do. There is no issue using unsigned loop counters,
I find counting down from n to 0 using unsigned variables
unintuitive. Or do you always count up and then calculate
what you actually use? Induction variable optimization
should take care of that, but it would be more complicated
to use.
On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:
Thomas Koenig <[email protected]> writes:
Thomas Koenig <[email protected]> wrote:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?
If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.
If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.
If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.
It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.
I tried using unsigned for a bunch of my data types that should
never go negative, but every time I would have to compare them
with an int somewhere and that would cause a compiler warning,
because the goal was to also remove unsafe code.
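One way to keep both the unsigned data types and a warning-free build is to route every mixed comparison through a small helper that handles the negative case explicitly. This is a sketch only; the helper name is an assumption and nothing like it appears in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the mathematical value of a is less than that of b.
   A plain "a < b" would convert a to unsigned first, so -1 would
   compare as a huge positive number. */
static bool int_less_than_uint(int a, unsigned b)
{
    if (a < 0)
        return true;            /* every negative int is below every unsigned */
    return (unsigned)a < b;     /* a is non-negative, so the cast keeps its value */
}
```

The explicit cast documents that the conversion is intended, which is usually what silences the sign-compare diagnostic.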
[email protected] (MitchAlsup1) writes:
On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:
Thomas Koenig <[email protected]> writes:
Thomas Koenig <[email protected]> wrote:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?
If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.
If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.
If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.
It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.
Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.
On 06.09.24 00:04, Tim Rentsch wrote:
If START is signed (presumably of type int), so the loop might run
zero times, but never more than INT_MAX times, then
for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
  // Do something with data[u]
}
If START is unsigned, so in all cases the loop must run at
least once, then
unsigned u = START;
do {
  // Do something with data[u]
} while( u > 0 && u-- );
(Yes I know the 'u > 0' expressions can be replaced by just 'u'.)
The optimizer should be smart enough to realize that if 'u > 0'
is true then the test 'u--' will also be true. The same should
hold if 'u > 0' is replaced by just 'u'.
(Disclaimer: code not compiled.)
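Since the code above is marked as not compiled, here is a sketch that wraps both forms in functions so they can be checked. The function names and the out[] buffer are assumptions added for testing; the loop headers are the ones from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Signed start: may run zero times. Records each visited index in out[]
   and returns how many indices were visited. */
static size_t visit_from_signed(int start, unsigned *out)
{
    size_t n = 0;
    for (unsigned u = start < 0 ? 0 : start + 1u; u > 0 && u--; )
        out[n++] = u;
    return n;
}

/* Unsigned start: always runs at least once. */
static size_t visit_from_unsigned(unsigned start, unsigned *out)
{
    size_t n = 0;
    unsigned u = start;
    do {
        out[n++] = u;
    } while (u > 0 && u--);
    return n;
}
```

With start equal to 2 both functions record 2, 1, 0; with a negative start the first one records nothing.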
Both yield not very elegant code:
https://godbolt.org/z/M4Y5PYP3v
On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:
[email protected] (MitchAlsup1) writes:
On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:
Thomas Koenig <[email protected]> writes:
Thomas Koenig <[email protected]> wrote:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?
If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.
If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.
If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.
It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.
Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.
With it "being difficult" to determine when an integer overflow
has occurred in many architectures, it is unlikely that integer
overflow could ever be put into the C standard.
ARM A64:
mov x2, #0x8000000000000000
cmp x1, x2
b.le 20 <bar2+0x10>
Stefan Monnier <[email protected]> writes:
Specifications are an agreement between the supplier and the client. The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
For programs there is no conformance level "free from UB" in the C
standard.
There are two conformance levels for programs:
1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]
[email protected] (Anton Ertl) writes:
Stefan Monnier <[email protected]> writes:
Specifications are an agreement between the supplier and the client.
The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
For programs there is no conformance level "free from UB" in the C
standard.
The C standard doesn't define any conformance "levels": it defines
the term "strictly conforming program", for its own convenience in
defining the language; it also defines the term "conforming
program", for no apparent purpose at all. In both cases however
what is given are simply definitions; there is no reason an
interested party couldn't give a definition of some other term, for
the purpose of identifying a class of C programs that have some
particular property -- such as being free from undefined behavior --
where membership in the class is completely determined by statements
in the C standard, being used as a reference document.
There are two conformance levels for programs:
1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]
I don't know why you say this. Which aspects of the definition for
"strictly conforming program" do you think are violated by a typical
'Hello, World' program? I'm confident the people who wrote the C
standard would say such a program is strictly conforming.
On 07/09/2024 01:10, MitchAlsup1 wrote:
On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:
On 9/5/2024 10:04 AM, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e. the largest supported unsigned type in the
system, typically the same as u64.
Unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value: you just have to make
the test before subtracting, instead of checking if the result was
negative.
I.e., I cannot easily replicate a downward loop that exits when the
counter becomes negative:
for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}
for (int i = START; i > -1; i-- ) {
// Do something with data[i]
}
;^)
# define START 0x80000001
No.
The great thing about 32 bit integers is that your numbers are never
anywhere close to being too big - or you /know/ you are dealing with
very big numbers and you can take that into account such as by using
64-bit integer types.
A number that is the start or end of a normal count range is /never/
0x80000001. Write code that is clear, simple and correct for what you
are actually doing. And if you think such big numbers are realistic,
write the same clear, simple and correct code with "int64_t" instead.
Here we have the three variants:
#include <limits.h>
extern long foo1(long);
extern long foo2(long);
long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}
long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}
long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}
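Of the three variants, bar2 relies on `b < b-1`, which is undefined when b is LONG_MIN, so only the other two can be compared safely. The sketch below checks that the explicit test from bar3 agrees with the builtin from bar. The function names are made up for illustration, and `__builtin_sub_overflow` is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Explicit pre-check, as in bar3: b - 1 overflows only for LONG_MIN. */
static bool dec_overflows_precheck(long b)
{
    return b == LONG_MIN;
}

/* Builtin check, as in bar: returns true when b - 1 does not fit in long.
   GCC/Clang extension, not part of standard C. */
static bool dec_overflows_builtin(long b)
{
    long c;
    return __builtin_sub_overflow(b, 1L, &c);
}
```

Both functions should give the same answer for every long value, which is why a compiler is free to turn either form into a single compare against LONG_MIN.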
Tim Rentsch <[email protected]> wrote:
Thomas Koenig <[email protected]> writes:
Thomas Koenig <[email protected]> wrote:
"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.
Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.
So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.
If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?
Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior.
That sentence makes no sense to me.
Behavior is defined by the standard, by the compiler documentation,
by other standards (such as OpenMP) or it is undefined.
"Giving less freedom" has no difference from defining.
(It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?
If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.
If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.
If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.
After thinking about this for a time, what you want looks a lot
like volatile.
Is there any requirement that you can think of that would not
be fulfilled with "volatile int a"?
Is there anything with "volatile int a" that you do not want?
If volatile is close to what you want, then this would be
straightforward to incorporate into an existing compiler such as
gcc, just add an option which declares every variable in the C
front end volatile, weed out the resulting bugs (yes, that is a
mixed metaphor) and be done.
On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:
[email protected] (Anton Ertl) writes:
Stefan Monnier <[email protected]> writes:
Specifications are an agreement between the supplier and the client.
The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
For programs there is no conformance level "free from UB" in the C
standard.
The C standard doesn't define any conformance "levels": it defines
the term "strictly conforming program", for its own convenience in
defining the language; it also defines the term "conforming
program", for no apparent purpose at all. In both cases however
what is given are simply definitions; there is no reason an
interested party couldn't give a definition of some other term, for
the purpose of identifying a class of C programs that have some
particular property -- such as being free from undefined behavior --
where membership in the class is completely determined by statements
in the C standard, being used as a reference document.
There are two conformance levels for programs:
1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]
I don't know why you say this. Which aspects of the definition for
"strictly conforming program" do you think are violated by a typical
'Hello, World' program? I'm confident the people who wrote the C
standard would say such a program is strictly conforming.
The standard "Hello World !" program does not return a value to
<effectively> crt0.
Secondarily while one is supposed to return 0 for success and
something else for failure, there is no standard C defined way
that this is related back to the invoker of the program.
Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.
On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:[...]
[email protected] (MitchAlsup1) writes:
On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here? [...] If a == INT_MAX, it also should be
possible for the addition to abort the program. [...]
It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.
Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.
With it "being difficult" to determine when an integer overflow
has occurred in many architectures, it is unlikely that integer
overflow could ever be put into the C standard.
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:
On 07/09/2024 01:10, MitchAlsup1 wrote:
On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:
On 9/5/2024 10:04 AM, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e. the largest supported unsigned type in the
system, typically the same as u64.
Unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value: you just have to make
the test before subtracting, instead of checking if the result was
negative.
I.e., I cannot easily replicate a downward loop that exits when the
counter becomes negative:
for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}
for (int i = START; i > -1; i-- ) {
// Do something with data[i]
}
;^)
# define START 0x80000001
No.
The great thing about 32 bit integers is that your numbers are never
anywhere close to being too big - or you /know/ you are dealing with
very big numbers and you can take that into account such as by using
64-bit integer types.
A number that is the start or end of a normal count range is /never/
0x80000001. Write code that is clear, simple and correct for what you
are actually doing. And if you think such big numbers are realistic,
write the same clear, simple and correct code with "int64_t" instead.
static uint64_t array[1024*1024*512+1];
static int SIZE = sizeof(array)/sizeof(uint64_t);
[email protected] (MitchAlsup1) writes:
On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:[...]
[email protected] (MitchAlsup1) writes:
On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:
The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:
int a;
...
if (a > a + 1) {
...
}
Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here? [...] If a == INT_MAX, it also should be
possible for the addition to abort the program. [...]
It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.
Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.
With it "being difficult" to determine when an integer overflow
has occurred in many architectures, it is unlikely that integer
overflow could ever be put into the C standard.
It could easily be added to the C standard just by making the
signal-raise option be conditional: give each implementation
the choice of either (a) stipulating that overflow causes an
implementation-defined signal to be raised, or (b) letting the
operation be limited undefined behavior. Limited undefined
behavior can be provided simply by naively compiling the code
in question, so that can be accommodated regardless of how
unsophisticated the processor is.
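To make option (a) concrete, here is a rough library-level sketch of what "overflow raises a signal" could look like when emulated in today's C. Everything in it is an assumption for illustration: the names, the choice of SIGFPE, and the decision to return the first operand after the handler returns. A real implementation would do this in the compiler or the hardware, not in a helper function.

```c
#include <assert.h>
#include <limits.h>
#include <signal.h>

/* Set by the handler. volatile sig_atomic_t is the object type the C
   standard allows a signal handler to write. */
static volatile sig_atomic_t overflow_seen = 0;

static void on_overflow(int sig)
{
    (void)sig;
    overflow_seen = 1;
}

/* Adds a and b. If the sum would not fit in int, raises SIGFPE (an
   arbitrary choice for this sketch) and returns a unchanged. The range
   test is done before the addition, so no overflow actually occurs. */
static int add_or_raise(int a, int b)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        raise(SIGFPE);
        return a;
    }
    return a + b;
}
```

A caller would install the handler once with `signal(SIGFPE, on_overflow);` and inspect `overflow_seen` afterwards. Option (b) needs no code at all: it is simply the plain `a + b`.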
Thomas Koenig <[email protected]> writes:
After thinking about this for a time, what you want looks a lot
like volatile.
That's a good insight. Certainly there are aspects of what I
have proposed that are similar to how volatile works.
Is there any requirement that you can think of that would not
be fulfilled with "volatile int a"?
Is there anything with "volatile int a" that you do not want?
Something being volatile has consequences only in reference to
objects, and only when a memory access (either read or write) is
requested. There are no such things as volatile values. What
we're looking for here is constraints on operations, not on
memory accesses. In a sense one might say what we want is
"volatile operators": similar in concept to how volatile works,
but in a different area of language semantics.
Stefan Monnier <[email protected]> writes:
Specifications are an agreement between the supplier and the client. The
The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.
For programs there is no conformance level "free from UB" in the C
standard.
The C standard doesn't define any conformance "levels": it defines
the term "strictly conforming program", for its own convenience in
defining the language; it also defines the term "conforming
program", for no apparent purpose at all.
In both cases however
what is given are simply definitions; there is no reason an
interested party couldn't give a definition of some other term, for
the purpose of identifying a class of C programs that have some
particular property -- such as being free from undefined behavior --
where membership in the class is completely determined by statements
in the C standard, being used as a reference document.
There are two conformance levels for programs:
1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]
I don't know why you say this. Which aspects of the definition for
"strictly conforming program" do you think are violated by a typical
'Hello, World' program?
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
And just for fun::
On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:
Here we have the three variants:
#include <limits.h>
extern long foo1(long);
extern long foo2(long);
long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}
long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}
long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}
My 66000:
add   r3,R1,#-1       add   r3,r1,#-1       bepm  r1,.L4
bge   R3,.L4          bge   r3,.L4
8-bytes               8-bytes               4-bytes
I have a direct test for POSMAX in ISA that does not use a constant.
On 2024-09-03 11:10, David Brown wrote:
[snip]
(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time, could usefully be turned into
constraint violations that must be diagnosed.)
The problem, as you of course know, is that the "can" in "can be
caught at compile time" depends on the amount and kind of analysis
that is done at compile time -- some cases of UB "can" be caught at
compile time but only by advanced and costly analysis. If the language
standard requires that such things /must/ be detected by the compiler,
it can place quite a burden on the developers of conforming compilers.
As I understand it, current C compilers detect UB mostly as a side
effect of the analyses they do for code optimization purposes, which
vary widely between compilers, and so the UB-detections also vary.
This issue (compile-time detection) has now and then been discussed in
the Ada standards group. Given the currently low market penetration of
Ada, the group has been reluctant to require too much of the
compilers, and so the more advanced UB-detecting tools are
stand-alone, such as the SPARK tools.
[email protected] (MitchAlsup1) writes:
And just for fun::
On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:
Here we have the three variants:
#include <limits.h>
extern long foo1(long);
extern long foo2(long);
long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}
long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}
long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}
My 66000:
add   r3,R1,#-1       add   r3,r1,#-1       bepm  r1,.L4
bge   R3,.L4          bge   r3,.L4
8-bytes               8-bytes               4-bytes
I have a direct test for POSMAX in ISA that does not use a constant.
How does bge work in the first and second column? My impression was
that you are using an 88k-style flags-in-GPR architecture.
Concerning the last column, the gcc developer who added the
transformation of bar2() into bar3() apparently had My66000 in mind.
- anton
On 08/09/2024 02:17, MitchAlsup1 wrote:
On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:
static uint64_t array[1024*1024*512+1]
static int SIZE = sizeof(array)/sizeof(uint65_t);
Surely you mean :
static const size_t array_size = sizeof(array) / sizeof(uint64_t);
[email protected] (MitchAlsup1) writes:
On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:
Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.
The usual "Hello, World" program defines main() either with no
arguments
int
main(){
...
}
or with two arguments
int
main( int argc, char *argv[] ){
...
}
and in both cases main() has defined behavior and does not
violate the strictures of strictly conforming programs.
If the surrounding OS or whatever cannot support these, that
doesn't change whether the program is strictly conforming. The
condition of being strictly conforming is a predicate on
programs, not on implementations.
Tim Rentsch <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
After thinking about this for a time, what you want looks a lot
like volatile.
That's a good insight. Certainly there are aspects of what I
have proposed that are similar to how volatile works.
The way I understand you is the following: You want the
compiler to be forbidden to remove codepaths on the assumption
that undefined behavior cannot happen, and you want a
"best effort" in that case, which includes throwing an error
or just ignoring everything and proceeding.
The observable behavior includes (n2596)
"Volatile accesses to objects are evaluated strictly according to
the rules of the abstract machine."
So, assuming that variables are objects (if there's a definition
of an object in n2596, I missed it) the compiler cannot remove
accessing a in
volatile int a;
if (a > a + 1)
so it cannot remove any code path leading to the if statement, which
is what you want. An interesting point is what "volatile access"
actually means, especially for automatic variables; it seems that
all compilers treat this as a memory access (which makes limited
sense in my opinion - is there an explanation for this?)
Is there any requirement that you can think of that would not
be fulfilled with "volatile int a"?
Is there anything with "volatile int a" that you do not want?
Something being volatile has consequences only in reference to
objects, and only when a memory access (either read or write) is
requested. There are no such things as volatile values. What
we're looking for here is constraints on operations, not on
memory accesses. In a sense one might say what we want is
"volatile operators": similar in concept to how volatile works,
but in a different area of language semantics.
Hmm.. OK. The nice thing about SSA is that it transforms
complicated expressions like "a + b + c" into
tmp1 = a + b
tmp2 = tmp1 + c
so it would be possible to write a pass which would declare those
variables as volatile that you want (not needed for unsigned, for
example).
Alternatively, you could write a pass which translates
int a, b;
tmp1 = a + b;
into
tmp1 = (int) ((unsigned) a + (unsigned) b)
or just use -fwrapv in the first place.
So, SSA offers you the possibility of working on operators, like
you want to.
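A minimal sketch of the cast-based rewrite described above. The
helper name wrapping_add is made up for illustration and does not
come from the thread. The addition is carried out in unsigned
arithmetic, which wraps by definition, and the result is converted
back to int. That conversion is implementation-defined (not
undefined) before C23 when the value does not fit; gcc documents it
as reduction modulo 2^N.

```c
#include <limits.h>

/* Hypothetical helper: add two ints with two's complement wrapping
 * and without signed-overflow undefined behaviour. The unsigned
 * addition wraps modulo UINT_MAX+1; the conversion back to int is
 * implementation-defined for out-of-range values (assumed here to
 * reduce modulo 2^N, as gcc documents). */
static int wrapping_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}
```

With -fwrapv a plain "a + b" on ints is specified to behave the same
way, so the helper is only of interest when that flag cannot be
assumed.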
On Sun, 8 Sep 2024 2:47:38 +0000, Tim Rentsch wrote:
[email protected] (MitchAlsup1) writes:
On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:
Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.
The usual "Hello, World" program defines main() either with no
arguments
int
main(){
...
}
or with two arguments
int
main( int argc, char *argv[] ){
...
}
and in both cases main() has defined behavior and does not
violate the strictures of strictly conforming programs.
The Linux environment (crt0) calls main with 3 arguments.
Are you arguing that a program can be strictly conforming and
not be type safe at its call/return interfaces ??
On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:
On 04/09/2024 18:07, Tim Rentsch wrote:
Terje Mathisen <[email protected]> writes:
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two people of
different age living in different parts of the world are telling
the same story. Maybe there exists an old popular book that said
that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?
Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs
it saves everyone long term grief.
When you need the nefarious activities of memcpy, write it as a
for loop yourself and comment the nefariousness of the use.
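For illustration, the buffer compaction discussed above could be
sketched as follows. The record layout and the name compact_records
are assumptions made up for this example; nothing in the thread
specifies them. Surviving records are moved towards the start of the
array, and memmove() is used so that the code stays correct even if it
is later changed to move whole runs of records whose source and
destination overlap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type for the example. */
struct record {
    int  key;
    bool deleted;
};

/* Remove records marked as deleted, preserving the order of the
 * rest. Returns the number of surviving records. The destination is
 * never above the source, so a forward copy would also work, but
 * memmove() makes no assumption about overlap. */
static size_t compact_records(struct record *recs, size_t n)
{
    size_t out = 0;
    for (size_t in = 0; in < n; in++) {
        if (recs[in].deleted)
            continue;
        if (out != in)
            memmove(&recs[out], &recs[in], sizeof recs[in]);
        out++;
    }
    return out;
}
```

When records are moved one at a time, as here, source and destination
never overlap, so memcpy() would also be defined; the overlap question
only arises when several adjacent records are moved in one call.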
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there
is an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general
case) exercising undefined behaviour;
and then the compiler could eliminate the check or do something
else that's unexpected. Do you have such a check in mind that
does not exercise undefined behaviour in the general case?
2) Even if there is such a check, you have to be aware that there
is a potential problem with memcpy(). In that case the way to go
is to just use memmove().
But that does not help you with the next "clever" idea that some
compiler or library maintainer has.
On Sun, 8 Sep 2024 15:32:02 +0000, Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
And just for fun::
On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:
Here we have the three variants:
#include <limits.h>
extern long foo1(long);
extern long foo2(long);
long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}
long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}
long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}
My 66000:
add   r3,R1,#-1       add   r3,r1,#-1       bepm  r1,.L4
bge   R3,.L4          bge   r3,.L4
8-bytes               8-bytes               4-bytes
I have a direct test for POSMAX in ISA that does not use a constant.
How does bge work in the first and second column? My impression was
that you are using an 88k-style flags-in-GPR architecture.
I just copied the RISC-V code
...Concerning the last column, the gcc developer who added the
transformation of bar2() into bar3() apparently had My66000 in mind.
BTW I had the comparisons to int-MAX/MIN in since about 2016.
On 05/09/2024 19:04, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want
two's complement wrapping, and for the even fewer occasions when
you actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1)
as an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.
Loop counters can usually be signed or unsigned, and it usually makes
no difference. Array indices are also usually much the same signed or
unsigned, and it can feel more natural to use size_t here (an unsigned
type). It can make a difference to efficiency, however. On x86-64,
this code is 3 instructions with T as "unsigned long int" or "long int",
4 with "int", and 5 with "unsigned int".
int foo(int * p, T x) {
int a = p[x++];
int b = p[x++];
return a + b;
}
Anyway, I count loop counters and array indices as "use of integer-type
variables", whether you prefer signed or unsigned.
unsigned arithmetic is easier than signed integer arithmetic,
including comparisons that would result in a negative value, you just
have to make the test before subtracting, instead of checking if the
result was negative.
I can't follow that at all. Unsigned and signed arithmetic and
comparisons both work simply and as you'd expect. /Mixing/ signed and
unsigned types can get things wrong.
I.e I cannot easily replicate a downward loop that exits when the
counter becomes negative:
 for (int i = START; i >= 0; i-- ) {
   // Do something with data[i]
 }
One of my alternatives is
 unsigned u = start; // Cannot be less than zero
 if (u) {
   u++;
   do {
     u--;
     data[u]...
   } while (u);
 }
This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal)
instead of JA (Jump Above or Equal), but my version is far more verbose.
A more important thing is that the first version, with signed i, is
/vastly/ simpler and clearer in the source code.
Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:
 for (unsigned u = start; (u & TOPBIT) == 0; u--)
Terje
Or you could just write sane code that matches what you want to say.
On 05.09.24 19:04, Terje Mathisen wrote:
One of my alternatives is
 unsigned u = start; // Cannot be less than zero
 if (u) {
   u++;
   do {
     u--;
     data[u]...
   } while (u);
 }
This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal)
instead of JA (Jump Above or Equal), but my version is far more verbose.
Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:
 for (unsigned u = start; (u & TOPBIT) == 0; u--)
Terje
What about:
for (unsigned u = start; u != ~0u; --u)
...
or even
for (unsigned u = start; (int)u >= 0; --u)
...
?
I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
on godbolt.org:
- 32 bit version: https://godbolt.org/z/TMhhx3nch
- 64 bit version: https://godbolt.org/z/8oxzTf5Gf
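For comparison, here is a sketch of one more way to write the
downward loop with an unsigned counter. It is not one of the variants
in the godbolt links above, and the function name reverse_copy is
made up for the example. The counter starts at the element count
rather than at the last index, and the test-then-decrement condition
"i-- > 0" stops before the counter would wrap.

```c
#include <stddef.h>

/* Copy data[n-1], data[n-2], ..., data[0] into out[0..n-1] and
 * return the number of elements visited. The condition compares i
 * with 0 first and decrements afterwards, so the body sees
 * i = n-1 down to 0 and never runs when n == 0. */
static size_t reverse_copy(const int *data, size_t n, int *out)
{
    size_t k = 0;
    for (size_t i = n; i-- > 0; )
        out[k++] = data[i];
    return k;
}
```

Note that i does wrap to SIZE_MAX on the final, failing test; that is
well defined for unsigned types and the wrapped value is never used as
an index.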
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there
is an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general
case) exercising undefined behaviour;
Yes, I can.
and then the compiler could eliminate the check or do something
else that's unexpected. Do you have such a check in mind that
does not exercise undefined behaviour in the general case?
Sure. I wouldn't have made my earlier statement otherwise.
2) Even if there is such a check, you have to be aware that there
is a potential problem with memcpy(). In that case the way to go
is to just use memmove().
The point of my previous comment was only to address the question
of whether any existing memcpy() calls are problematic. If all
of the checks return "no overlap" then memcpy() is not the problem.
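The thread does not show the check itself. The following is a sketch
of one way such a check can be written, not necessarily the one meant
above; the name regions_overlap is an assumption made for this
example. It relies only on pointer equality (==), which is defined
for any two object pointers, and avoids the relational operators
(<, >), which are only defined for pointers into the same array. The
price is a loop that is linear in the size of the copy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if [dst, dst+n) and [src, src+n) share at least one
 * byte. Two regions of equal length overlap exactly when the start
 * of one lies inside the other, so it is enough to compare each
 * start against every byte address of the other region. Every
 * pointer formed here points inside its own region. */
static bool regions_overlap(const void *dst, const void *src, size_t n)
{
    const unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++) {
        if (d + i == s || s + i == d)
            return true;
    }
    return false;
}
```

A debugging wrapper around memcpy() could call this and report any
call for which it returns true.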
But that does not help you with the next "clever" idea that some
compiler or library maintainer has.
I have the impression that this is an editorial comment having
nothing to do with memcpy() or memmove(). If that impression
is wrong then I'm at a loss to understand what you are talking
about, and would you please elaborate.
[email protected] (MitchAlsup1) writes:
So:
# define memcpy memmove
Incidentally, if one wants to do this, it's advisable to write
#undef memcpy
before the #define of memcpy.
and move forward with life--for the 2 extra cycles memmove costs
it saves everyone long term grief.
Simply replacing memcpy() by memmove() of course will always
work, but there might be negative consequences beyond a cost
of 2 extra cycles -- for example, if a negative stride is
better performing than a positive stride, but the nature
of the compaction forces memmove() to always take the slower
choice.
[email protected] (MitchAlsup1) writes:
On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:
On 04/09/2024 18:07, Tim Rentsch wrote:
Terje Mathisen <[email protected]> writes:
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two people of
different age living in different parts of the world are telling
the same story. Maybe there exists an old popular book that said
that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?
Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.
The ever-shallow David Brown first misses the point, then makes a
slightly incorrect statement, and finally makes a recommendation
that surely is familiar to every reader in the newsgroup.
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
Incidentally, if one wants to do this, it's advisable to write
#undef memcpy
before the #define of memcpy.
float invsqrt(float x)[...]
int32_t ix;
memcpy(&ix, &x, sizeof(ix));
and the compiler will see that x and ix can share the same register.
I don't suppose memmove() can be depended upon to do the same?
Tim Rentsch wrote:
[email protected] (MitchAlsup1) writes:
On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:
On 04/09/2024 18:07, Tim Rentsch wrote:
Terje Mathisen <[email protected]> writes:
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was
defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two
people of different age living in different parts of the
world are telling the same story. Maybe there exists an old
popular book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?
Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.
The ever-shallow David Brown first misses the point, then makes a
slightly incorrect statement, and finally makes a recommendation
that surely is familiar to every reader in the newsgroup.
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
Incidentally, if one wants to do this, it's advisable to write
#undef memcpy
before the #define of memcpy.
What really worries me is that I've been told (and shown in godbolt)
that memcpy() can be magic, i.e. the compiler is allowed to make it
NOP when I use it to move data between an integer and float variable:
float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;
is deprecated, instead do something like this:
int32_t ix;
memcpy(&ix, &x, sizeof(ix));
and the compiler will see that x and ix can share the same register.
I don't suppose memmove() can be depended upon to do the same?
Terje
On Mon, 9 Sep 2024 10:20:00 +0200[...]
Terje Mathisen <[email protected]> wrote:
float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;
...
int32_t ix;
memcpy(&ix, &x, sizeof(ix));
I don't know if it is always true in more complex cases, where absence
of aliasing is less obvious to compiler.
However, I'd expect that as
long as a copied item fits in register, the magic will work equally
with both memcpy and memmove.
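A self-contained sketch of the idiom under discussion, in both
spellings. The function names float_bits_cpy and float_bits_move are
made up for this example, and the code assumes that float is a 32-bit
IEEE 754 type, which the C standard does not guarantee. Whether a
given compiler reduces either call to a plain register move is a
quality-of-implementation matter and has to be checked in the
generated code.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: float and uint32_t have the same size. */
_Static_assert(sizeof(float) == sizeof(uint32_t),
               "float is assumed to be 32 bits wide");

/* Reinterpret the bits of a float via memcpy(). */
static uint32_t float_bits_cpy(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Same operation via memmove(); u and x are distinct objects, so
 * the result is identical. */
static uint32_t float_bits_move(float x)
{
    uint32_t u;
    memmove(&u, &x, sizeof u);
    return u;
}
```

Both functions have fully defined behaviour, unlike the pointer-cast
version "*(int32_t *) &x", which violates the aliasing rules.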
David Brown wrote:
On 05/09/2024 19:04, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want
two's complement wrapping, and for the even fewer occasions when
you actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1)
as an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.
Loop counters can usually be signed or unsigned, and it usually makes
no difference. Array indices are also usually much the same signed or
unsigned, and it can feel more natural to use size_t here (an unsigned
type). It can make a difference to efficiency, however. On x86-64,
this code is 3 instructions with T as "unsigned long int" or "long
int", 4 with "int", and 5 with "unsigned int".
int foo(int * p, T x) {
int a = p[x++];
int b = p[x++];
return a + b;
}
;; assume *p in rdi, x in rsi
mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret
:-)
Or you could just write sane code that matches what you want to say.
On Sun, 8 Sep 2024 6:25:10 +0000, David Brown wrote:
On 08/09/2024 02:17, MitchAlsup1 wrote:
On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:
static uint64_t array[1024*1024*512+1]
static int SIZE = sizeof(array)/sizeof(uint65_t);
Surely you mean :
static const size_t array_size = sizeof(array) / sizeof(uint64_t);
I wanted SIZE to have the same type as i.
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
1) A strictly conforming program shall use only those features
of the language and library specified in this International
Standard. This excludes all programs that terminate,
including the "Hello, World" program. [...]
I don't know why you say this. Which aspects of the definition
for "strictly conforming program" do you think are violated by a
typical 'Hello, World' program?
A typical "Hello, World" program terminates, and as mentioned,
no terminating program can be strictly conforming, because it
exercises at least implementation-defined behaviour (e.g., look
at section 7.22.4.4 of C11).
Michael S <[email protected]> writes:
On Mon, 9 Sep 2024 10:20:00 +0200[...]
Terje Mathisen <[email protected]> wrote:
float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;
...
int32_t ix;
memcpy(&ix, &x, sizeof(ix));
I don't know if it is always true in more complex cases, where
absence of aliasing is less obvious to compiler.
Something like
memmove(p, q, 8)
can be translated to something like
0: 48 8b 06 mov (%rsi),%rax
3: 48 89 07 mov %rax,(%rdi)
without any aliasing worries, and indeed, gcc-9, gcc-10, and gcc-12
do that.
However, I'd expect that as
long as a copied item fits in register, the magic will work equally
with both memcpy and memmove.
One would hope so, but here's what happens with gcc-12:
#include <string.h>
void foo1(char *p, char* q)
{
memcpy(p,q,32);
}
void foo2(char *p, char* q)
{
memmove(p,q,32);
}
gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:
0000000000000000 <foo1>:
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
1a: 00 00 00 00
1e: 66 90 xchg %ax,%ax
0000000000000020 <foo2>:
20: ba 20 00 00 00 mov $0x20,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>
The jmp in line 25 is probably a tail-call to memmove().
My guess is that xmm registers and unrolling are used here rather than
ymm registers because waking up the second 128 bits takes time. But
even with that, the code uses two different registers, and if
scheduled differently, could be used for implementing foo2():
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
- anton
On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:
Does hardware on which negative stride is faster really exist?
On Mon, 9 Sep 2024 10:20:00 +0200
Terje Mathisen <[email protected]> wrote:
Tim Rentsch wrote:
[email protected] (MitchAlsup1) writes:
On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:
On 04/09/2024 18:07, Tim Rentsch wrote:
Terje Mathisen <[email protected]> writes:
Michael S wrote:
On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was
defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two
people of different age living in different parts of the
world are telling the same story. Maybe there exists an old
popular book that said that it was defined?
It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?
Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.
What is "compact buffers" ?
Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?
If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.
The ever-shallow David Brown first misses the point, then makes a
slightly incorrect statement, and finally makes a recommendation
that surely is familiar to every reader in the newsgroup.
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
Incidentally, if one wants to do this, it's advisable to write
#undef memcpy
before the #define of memcpy.
What really worries me is that I've been told (and shown in godbolt)
that memcpy() can be magic, i.e. the compiler is allowed to make it
NOP when I use it to move data between an integer and float variable:
float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;
is deprecated, instead do something like this:
int32_t ix;
memcpy(&ix, &x, sizeof(ix));
and the compiler will see that x and ix can share the same register.
I don't suppose memmove() can be depended upon to do the same?
Terje
In simple situations like shown above, memmove is as dependable as
memcpy.
I don't know if it is always true in more complex cases, where absence
of aliasing is less obvious to compiler. However, I'd expect that as
long as a copied item fits in register, the magic will work equally
with both memcpy and memmove.
It depends on compiler, too.
MSVC from VS2019 produces the same code for both variants d_to_u below.
But MSVC from VS2017 does not.
#include <stdint.h>
#include <string.h>
void d_to_u_cpy(uint64_t* u, const double* d) {
memcpy(u, d, sizeof(*u));
}
#define memcpy memmove
void d_to_u_move(uint64_t* u, const double* d) {
memcpy(u, d, sizeof(*u));
}
On Mon, 09 Sep 2024 10:30:34 GMT
[email protected] (Anton Ertl) wrote:
One would hope so, but here's what happens with gcc-12:
#include <string.h>
void foo1(char *p, char* q)
{
memcpy(p,q,32);
}
void foo2(char *p, char* q)
{
memmove(p,q,32);
}
gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:
0000000000000000 <foo1>:
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
1a: 00 00 00 00
1e: 66 90 xchg %ax,%ax
0000000000000020 <foo2>:
20: ba 20 00 00 00 mov $0x20,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>
The jmp in line 25 is probably a tail-call to memmove().
My guess is that xmm registers and unrolling are used here rather than
ymm registers because waking up the second 128 bits takes time. But
even with that, the code uses two different registers, and if
scheduled differently, could be used for implementing foo2():
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
- anton
Try -march instead of -mavx2. E.g. -march=haswell
Sometimes gcc is beyond logic.
Michael S <[email protected]> writes:
Try -march instead of -mavx2. E.g. -march=haswell
Sometimes gcc is beyond logic.
For gcc -O3 -march=haswell I got the same result (with gcc-12). I
also tried -march=x86-64-v3 with the same result.
But gcc -O3 -march=x86-64-v4 produced:
0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 f8 77 vzeroupper
b: c3 ret
c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000000010 <foo2>:
10: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
14: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
18: c5 f8 77 vzeroupper
1b: c3 ret
And when changing the length to 64:
0000000000000000 <foo1>:
0: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
6: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
c: c5 f8 77 vzeroupper
f: c3 ret
0000000000000010 <foo2>:
10: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
16: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
1c: c5 f8 77 vzeroupper
1f: c3 ret
But when changing the length to 63:
0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
12: c5 f8 77 vzeroupper
15: c3 ret
16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
1d: 00 00 00
0000000000000020 <foo2>:
20: ba 3f 00 00 00 mov $0x3f,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>
- anton
David Brown wrote:
On 05/09/2024 19:04, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external
data, for anything involving bit manipulation (use of &, |, ^,
<< and >> on signed types is usually wrong, IMHO), as building
blocks in extended arithmetic types, for the few occasions when
you want two's complement wrapping, and for the even fewer
occasions when you actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have
for integer-type variables, with (like Mitch) a few apis that
use (-1) as an error signal that has to be handled with special
code.
You don't have loop counters, array indices, or integer
arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.
Loop counters can usually be signed or unsigned, and it usually
makes no difference. Array indices are also usually much the same
signed or unsigned, and it can feel more natural to use size_t here
(an unsigned type). It can make a difference to efficiency,
however. On x86-64, this code is 3 instructions with T as
"unsigned long int" or "long int", 4 with "int", and 5 with
"unsigned int".
int foo(int * p, T x) {
int a = p[x++];
int b = p[x++];
return a + b;
}
;; assume *p in rdi, x in rsi
mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret
Bernd Linsel wrote:
On 05.09.24 19:04, Terje Mathisen wrote:
One of my alternatives is
unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
} while (u);
}
This typically results in effectively the same asm code as the
signed version, except for a bottom JGE (Jump (signed) Greater or
Equal) instead of JA (Jump Above or Equal), but my version is far
more verbose.
Alternatively, if you don't need all N bits of the unsigned type,
then you can subtract and check if the top bit is set in the
result:
for (unsigned u = start; (u & TOPBIT) == 0; u--)
What about:
for (unsigned u = start; u != ~0u; --u)
I like that one!
...
or even
for (unsigned u = start; (int)u >= 0; --u)
That is the one that I've actually been using, i.e. casting to the corresponding signed type.
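For comparison, here is a small sketch of the two shorter loop forms from this exchange, wrapped in functions that walk an array from index start down to 0. The function names are hypothetical. The wrap-around form relies only on well-defined unsigned arithmetic; the cast form relies on the implementation-defined conversion of an out-of-range unsigned value to int, and is only usable when start does not exceed INT_MAX.

```c
#include <assert.h>
#include <limits.h>

/* Sum data[start] down to data[0]; stops when u wraps to UINT_MAX.
   Note: the loop body never runs if start itself is UINT_MAX. */
static long sum_down_wrap(const int *data, unsigned start)
{
    long sum = 0;
    for (unsigned u = start; u != ~0u; --u)
        sum += data[u];
    return sum;
}

/* Same walk using the cast-to-signed test. */
static long sum_down_cast(const int *data, unsigned start)
{
    long sum = 0;
    assert(start <= (unsigned)INT_MAX);   /* precondition of this form */
    for (unsigned u = start; (int)u >= 0; --u)
        sum += data[u];
    return sum;
}
```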
Tim Rentsch <[email protected]> writes:
[email protected] (MitchAlsup1) writes:
So:
# define memcpy memmove
Incidentally, if one wants to do this, it's advisable to write
#undef memcpy
before the #define of memcpy.
and move forward with life--for the 2 extra cycles memmove costs
it saves everyone long term grief.
Is it two extra cycles? Here are some data points from <[email protected]>:
Haswell (Core i7-4790K), glibc 2.19
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K  block size
   14   14   15   15   17   30   48   85  150  281  570 1370  memmove
   15   16   13   16   19   32   48   86  161  327  631 1420  memcpy
Skylake (Core i5-6600K), glibc 2.19
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K  block size
   14   14   14   14   15   27   43   77  147  305  573 1417  memmove
   13   14   10   12   14   27   46   85  165  313  607 1350  memcpy
Zen (Ryzen 5 1600X), glibc 2.24
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K  block size
   16   16   16   17   32   43   66  107  177  328  601 1225  memmove
   13   13   14   13   38   49   73  116  188  336  610 1233  memcpy
I don't see a consistent speedup of memcpy over memmove here.
However, when one uses memcpy(&var,ptr,8) or the like to perform an
unaligned access, gcc transforms this into a load (or store) without
the redefinition of memcpy, but into much slower code with the
redefinition (i.e., when using memmove instead of memcpy).
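The unaligned-access idiom mentioned here can be sketched as follows. The helper name load_u64 is hypothetical; var and ptr follow the naming in the paragraph above. With an unmodified memcpy, gcc typically compiles this to a single load on x86-64, though that is an optimization and not something the C standard promises.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes from a possibly unaligned address.
   The result is in host byte order. */
static uint64_t load_u64(const void *ptr)
{
    uint64_t var;
    assert(ptr != NULL);
    memcpy(&var, ptr, sizeof var);
    return var;
}
```

Compiling the same function with memcpy redefined to memmove is an easy way to check what a given compiler does in that case.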
Simply replacing memcpy() by memmove() of course will always
work, but there might be negative consequences beyond a cost
of 2 extra cycles -- for example, if a negative stride is
better performing than a positive stride, but the nature
of the compaction forces memmove() to always take the slower
choice.
If the two memory blocks don't overlap, memmove() can use the
fastest stride.
- anton
Tim Rentsch wrote:
[email protected] (MitchAlsup1) writes:
Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
Incidentally, if one wants to do this, it's advisable to write
#undef memcpy
before the #define of memcpy.
What really worries me is that I've been told (and shown in
godbolt) that memcpy() can be magic, i.e. the compiler is allowed
to make it NOP when I use it to move data between an integer and
float variable:
float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;
is deprecated, instead do something like this:
int32_t ix;
memcpy(&ix, &x, sizeof(ix));
and the compiler will see that x and ix can share the same
register.
I don't suppose memmove() can be depended upon to do the same?
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there
is an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general
case) exercising undefined behaviour;
Yes, I can.
and then the compiler could eliminate the check or do something
else that's unexpected. Do you have such a check in mind that
does not exercise undefined behaviour in the general case?
Sure. I wouldn't have made my earlier statement otherwise.
You also stated "I'm confident the people who wrote the C standard
would say such a program is strictly conforming." about a program with implementation-defined behaviour, so I lack confidence in your claim.
2) Even if there is such a check, you have to be aware that there
is a potential problem with memcpy(). In that case the way to go
is to just use memmove().
The point of my previous comment was only to address the question
of whether any existing memcpy() calls are problematic. If all
of the checks return "no overlap" then memcpy() is not the problem.
At least for the test runs.
But that does not help you with the next "clever" idea that some
compiler or library maintainer has.
I have the impression that this is an editorial comment having
nothing to do with memcpy() or memmove(). If that impression
is wrong then I'm at a loss to understand what you are talking
about, and would you please elaborate.
There are at least 200 undefined behaviours in the C standard, and
according to some people, C programmers should avoid all of them. So
the possible breakage of memcpy() is just one of many problems that
the programmers should be aware of and that they should test for.
Just because we discussed memcpy() as one of the problems with this
approach does not mean that having a way to deal with memcpy() solves
the larger problem.
On Mon, 09 Sep 2024 12:28:13 GMT...
[email protected] (Anton Ertl) wrote:
But when changing the length to 63:
An interesting question is which code I want in this case.
In the absence of -march options and with -O1|2|3 I want something like
this:
foo2:
movups (%rsi), %xmm0
movups 16(%rsi), %xmm1
movups 32(%rsi), %xmm2
movups 47(%rsi), %xmm3
movups %xmm0, (%rdi)
movups %xmm1, 16(%rdi)
movups %xmm2, 32(%rdi)
movups %xmm3, 47(%rdi)
ret
Without deep thinking I don't see why I would want anything
different for foo1().
On 09/09/2024 08:56, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 19:04, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want
two's complement wrapping, and for the even fewer occasions when
you actually need that last bit of range.
That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1)
as an error signal that has to be handled with special code.
You don't have loop counters, array indices, or integer arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.
Loop counters can usually be signed or unsigned, and it usually makes
no difference. Array indices are also usually much the same signed or
unsigned, and it can feel more natural to use size_t here (an unsigned
type). It can make a difference to efficiency, however. On x86-64,
this code is 3 instructions with T as "unsigned long int" or "long
int", 4 with "int", and 5 with "unsigned int".
int foo(int * p, T x) {
int a = p[x++];
int b = p[x++];
return a + b;
}
;; assume *p in rdi, x in rsi
mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret
Yes - that's three instructions for 64-bit type T. (To be clear, I had
counted the "ret" here.)
With 32-bit int for T, you need a "movsx rsi, esi" first to sign-extend
the 32-bit int parameter "x" to 64 bits. (That could be different for
different ABIs.) With 32-bit unsigned int for T you need an additional
instruction to make sure the result of the first "x++" is wrapped as
32-bit unsigned.
:-)
Or you could just write sane code that matches what you want to say.
Of course the fine line between "smart code" and "smart-arse code" is somewhat subjective!
It also varies over time, and depends on the needs of the code.
Sometimes it makes sense to prioritise efficiency over readability - but
that is rare, and has been getting steadily rarer over the decades as processors have been getting faster (disproportionally so for
inefficient code) and compilers have been getting better.
Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good object
code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.
This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.
Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).
So it's all up to the programmer, who often doesn't know either.
There is no foolproof or complete method for C. There are other languages
for which formal methods can come closer to proving the correctness of
the code, but for most practical cases this is infeasible.
Other than using CompCert, I don't know of any reliable way for
a programmer to make sure his C code does not suffer from UB.
On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:
Does hardware on which negative stride is faster really exist?
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
2) Even if there is such a check, you have to be aware that there is a
potential problem with memcpy(). In that case the way to go is to
just use memmove(). But that does not help you with the next "clever"
idea that some compiler or library maintainer has.
- anton
On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:
Does hardware on which negative stride is faster really exist?
On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:
On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:
Does hardware on which negative stride is faster really exist?
When the negative stride can be compared to zero, yes; otherwise no.
But the performance gain is often zero and sometimes negative.
Michael S <[email protected]> wrote:
On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:
Does hardware on which negative stride is faster really exist?
Depends on what the alternative is.
For a Fortran assignment
a(n1:n2) = a(n3:n4)
the semantics of the language demand that the RHS is evaluated
completely before the assignment. In the case of the wrong
kind of overlap, a negative stride can be used instead of
using an array temporary.
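The same idea can be sketched in C (the Fortran slice assignment itself is handled by the compiler). This is an illustration of the direction choice, not how any particular compiler or library implements it. The function name copy_within is hypothetical, and it works on indices into one array so that no pointers into different objects are ever compared.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n elements from a[src..src+n) to a[dst..dst+n) in the same array.
   When the destination lies above the source, a forward copy would
   overwrite source elements before they are read, so the copy runs
   backwards (negative stride) instead of using a temporary. */
static void copy_within(int *a, size_t dst, size_t src, size_t n)
{
    assert(a != NULL || n == 0);
    if (dst <= src) {
        for (size_t i = 0; i < n; i++)      /* positive stride */
            a[dst + i] = a[src + i];
    } else {
        for (size_t i = n; i-- > 0; )       /* negative stride */
            a[dst + i] = a[src + i];
    }
}
```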
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.
Adding size_t to a pointer yields another pointer of the same type.
All of gcc, clang and MSVC seem happy with this.
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.
All of gcc, clang and MSVC seem happy with this.
On Mon, 9 Sep 2024 08:56:45 +0200
Terje Mathisen <[email protected]> wrote:
David Brown wrote:
On 05/09/2024 19:04, Terje Mathisen wrote:
David Brown wrote:
On 05/09/2024 11:12, Terje Mathisen wrote:
David Brown wrote:
Unsigned types are ideal for "raw" memory access or external
data, for anything involving bit manipulation (use of &, |, ^,
<< and >> on signed types is usually wrong, IMHO), as building
blocks in extended arithmetic types, for the few occasions
when you want two's complement wrapping, and for the even
fewer occasions when you actually need that last bit of
range.
That last paragraph enumerates pretty much all the uses I have
for integer-type variables, with (like Mitch) a few apis that
use (-1) as an error signal that has to be handled with special
code.
You don't have loop counters, array indices, or integer
arithmetic?
Loop counters of the for (i= 0; i < LIMIT; i++) type are of
course fine with unsigned i, arrays always use a zero base so in
Rust the only array index type is usize, i.e the largest
supported unsigned type in the system, typically the same as
u64.
Loop counters can usually be signed or unsigned, and it usually
makes no difference. Array indices are also usually much the same signed or unsigned, and it can feel more natural to use size_t
here (an unsigned type). It can make a difference to efficiency, however. On x86-64, this code is 3 instructions with T as
"unsigned long int" or "long int", 4 with "int", and 5 with
"unsigned int".
int foo(int * p, T x) {
int a = p[x++];
int b = p[x++];
return a + b;
}
;; assume *p in rdi, x in rsi
mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret
more like
mov rax,[rdi+rsi*4]
add rax,[rdi+rsi*4+8]
ret
But that's not the point (==trap).
The point (==trap), I'd guess, is that for T=uint32_t the code generator
has to account for the possibility of x==2**32-1.
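A tiny sketch of that corner case, using hypothetical function names: it computes the index that the second p[x++] in foo() would use. With a 32-bit unsigned T the increment must wrap to 0, which is why the compiler cannot simply fold the +1 into a 64-bit addressing mode; with a 64-bit T no such wrap happens for any realistic index.

```c
#include <stdint.h>

/* Index used by the second access when T is a 32-bit unsigned type. */
static uint64_t second_index_u32(uint32_t x)
{
    x++;                 /* wraps to 0 when x == UINT32_MAX */
    return x;
}

/* Index used by the second access when T is a 64-bit unsigned type. */
static uint64_t second_index_u64(uint64_t x)
{
    x++;
    return x;
}
```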
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
Michael S <[email protected]> wrote:
[on memcpy() where glibc used negative stride on some hardware and
Does hardware on which negative stride is faster really exist?
Depends on what the alternative is.
For a Fortran assignment
a(n1:n2) = a(n3:n4)
the semantics of the language demand that the RHS is evaluated
completely before the assignment. In the case of the wrong
kind of overlap, a negative stride can be used instead of
using an array temporary.
David Brown <[email protected]> wrote:
Of course the fine line between "smart code" and "smart-arse code" is
somewhat subjective!
It also varies over time, and depends on the needs of the code.
Sometimes it makes sense to prioritise efficiency over readability - but
that is rare, and has been getting steadily rarer over the decades as
processors have been getting faster (disproportionally so for
inefficient code) and compilers have been getting better.
Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good object
code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.
I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule it even better.
It’s older lesser CPU’s where your hand optimized code might fail hard, and
I know of few examples of that. None actually.
This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.
Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).
On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.
According to my understanding, your 'can be considered' part is not
codified in the C Standard.
Adding size_t to a pointer yields another pointer of the same type.
All of gcc, clang and MSVC seem happy with this.
It works. But is it guaranteed to work in the future by some sort of
document? I am pretty sure that no such guarantee exists in gcc and
MSVC docs. I did not look in clang docs. Trying to find anything in
LLVM/clang docs makes me sad.
David Brown <[email protected]> wrote:
Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good object
code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.
I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule it even better.
It’s older lesser CPU’s where your hand optimized code might fail hard, and
I know of few examples of that. None actually.
This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.
Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).
On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)
In theory, on unusual hardware platform it can give false positives.
Maybe, for the task at hand, that's o.k.
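One way to write such a check so that it covers both directions of overlap is to compare integer values of the addresses. This is only a sketch under stated assumptions: uintptr_t is an optional type, the pointer-to-integer conversion is implementation-defined, and the code assumes a flat address space in which those integers order the same way as the addresses. The function name ranges_overlap is hypothetical. It sidesteps relational comparison of unrelated pointers but is not guaranteed portable by the C standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if [dst, dst+len) and [src, src+len) share any byte.
   Uses differences of addresses, so no sum can wrap around. */
static int ranges_overlap(const void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t)dst;   /* implementation-defined conversion */
    uintptr_t s = (uintptr_t)src;
    if (len == 0)
        return 0;
    return d < s ? (s - d < len) : (d - s < len);
}
```

In a debug build, a wrapper around memcpy() could call this and abort when it returns nonzero.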
On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)
In theory, on unusual hardware platform it can give false positives.
On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.
According to my understanding, your 'can be considered' part is not
codified in the C Standard.
Adding size_t to a pointer yields another pointer of the same type.
All of gcc, clang and MSVC seem happy with this.
It works. But is it guaranteed to work in the future by some sort of
document? I am pretty sure that no such guarantee exists in gcc and
MSVC docs. I did not look in clang docs. Trying to find anything in
LLVM/clang docs makes me sad.
However, my point was that "hand-optimised" source code can lead to
poorer results on newer /compilers/ compared to simpler source code. If
you've googled for "bit twiddling hacks" for cool tricks, or written
something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the
results will be slower with a modern compiler and modern cpu, even
though the "hand-optimised" version might have been faster two decades
ago. You can expect the modern tool to convert the multiplication into
shifts and adds if that is more efficient on the target, or a
multiplication if that is best on the target. But you can't expect the
compiler to turn the shifts and adds into a multiplication.
(Sometimes it can, but you can't expect it to.)
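The two spellings from the paragraph above can be put side by side as a small experiment; compiling both at -O2 for a given target and comparing the output shows what that compiler does with each. The function names are hypothetical, and unsigned is used so that the shifts cannot overflow into undefined behaviour.

```c
/* "Hand-optimised" spelling: 16*x + 4*x + x. */
static unsigned times21_shifts(unsigned x)
{
    return (x << 4) + (x << 2) + x;
}

/* Plain spelling; the compiler picks shifts/adds, lea, or a multiply. */
static unsigned times21_plain(unsigned x)
{
    return x * 21u;
}
```

Both functions compute the same value modulo 2^N for an N-bit unsigned, so any difference is purely in the generated code.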
Brett wrote:
David Brown <[email protected]> wrote:
Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same source to be
used on different targets or different variants of a target - few people
can track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.
I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and
schedule it even better.
Not true:
My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium I got
it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).
When the PentiumPro came out, it did a 10-20 cycle stall for every pair
of characters, so about an order of magnitude slower in cycle count.
(But only about 3X clock time due to being 200 instead of 60 MHz.)
It’s older lesser CPUs where your hand optimized code might fail hard, and
I know of few examples of that. None actually.
Right.
This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.
Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).
Terje
On Mon, 9 Sep 2024 20:52:29 +0000
[email protected] (MitchAlsup1) wrote:
On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:
On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:
Does hardware on which negative stride is faster really exist?
When the negative stride can be compared to zero, yes. else no.
But the performance gain is often zero and sometimes negative.
Direction of the count is not related to the sign of pointer's
stride.
David Brown <[email protected]> writes:
However, my point was that "hand-optimised" source code can lead to
poorer results on newer /compilers/ compared to simpler source code. If
you've googled for "bit twiddling hacks" for cool tricks, or written
something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the
results will be slower with a modern compiler and modern cpu, even
though the "hand-optimised" version might have been faster two decades
ago. You can expect the modern tool to convert the multiplication into
shifts and adds if that is more efficient on the target, or a
multiplication if that is best on the target. But you can't expect the
compiler to turn the shifts and adds into a multiplication.
Why not? Let's see:
[b3:~/tmp:109062] gcc -Os -c xxx-mul.c && objdump -d xxx-mul.o
xxx-mul.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
0: 48 6b c7 15 imul $0x15,%rdi,%rax
4: c3 ret
[b3:~/tmp:109063] gcc -O3 -c xxx-mul.c && objdump -d xxx-mul.o
xxx-mul.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
0: 48 8d 04 bf lea (%rdi,%rdi,4),%rax
4: 48 8d 04 87 lea (%rdi,%rax,4),%rax
8: c3 ret
So gcc-12 obviously understands that your "hand-optimized" version is
equivalent to the multiplication, and with -O3 then decides that the
leas are faster.
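The contents of xxx-mul.c are not shown in the post. Assuming the file holds the shift-and-add form quoted above, a minimal sketch could look like this; the function name foo matches the disassembly, while foo_mul is a made-up name for the plain version:

```c
/* Hypothetical contents of xxx-mul.c: the file itself is not shown in
   the post, so this is only a guess consistent with the disassembly. */
long foo(long x)
{
    /* "hand-optimised" form of x * 21: 16x + 4x + x */
    return (x << 4) + (x << 2) + x;
}

/* Plain form, for comparison (name is an assumption). */
long foo_mul(long x)
{
    return x * 21;
}
```

Note that the shift form is only well defined in C for non-negative x that does not overflow, which is one more reason to prefer the plain multiplication in source code.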
(Sometimes it can, but you can't expect it to.)
That also works the other way.
But it becomes really annoying when I intend it not to perform a
transformation, and it performs the transformation, like when writing
"-(x>0)" and the compiler turns that into a conditional branch. These
days gcc does not do that, but I have just seen another twist:
long bar(long x)
{
return -(x>0);
}
gcc-12 -O3 turns this into:
10: 31 c0 xor %eax,%eax
12: 48 85 ff test %rdi,%rdi
15: 0f 9f c0 setg %al
18: f7 d8 neg %eax
1a: 48 98 cltq
1c: c3 ret
So sign-extension optimization is apparently still lacking.
Clang-14 handles this fine:
10: 31 c0 xor %eax,%eax
12: 48 85 ff test %rdi,%rdi
15: 0f 9f c0 setg %al
18: 48 f7 d8 neg %rax
1b: c3 ret
I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.
Again - sometimes a compiler will recognise a particular hand-optimised
pattern, turn it back to something logically simpler, then optimise from
there. But you cannot /expect/ that.
Terje Mathisen <[email protected]> wrote:
Brett wrote:
David Brown <[email protected]> wrote:
Often you get the most efficient results by writing code clearly
and simply so that the compiler can understand it better and generate
good object code. This is particularly true if you want the same
source to be used on different targets or different variants of a
target - few people can track the instruction scheduling and
timings on multiple processors better than a good compiler. (And
the few people who /can/ do that spend their time chatting in
comp.arch instead of writing code...) When you do hand-made
micro-optimisations, these can work against the compiler and give
poorer results overall.
I know of no example where hand optimized code does worse on a
newer CPU. A newer CPU with bigger OoOe will effectively unroll
your code and schedule it even better.
Not true:
My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium
I got it to run at 1.5 clock cycles per character (40 MB/s on a 60
MHz Pentium).
When the PentiumPro came out, it did a 10-20 cycle stall for every
pair of characters, so about an order of magnitude slower in cycle
count. (But only about 3X clock time due to being 200 instead of 60
MHz.)
But how big a slowdown did the unoptimized code get?
Are you describing a glass jaw handling unpredictable branches on a
CPU with a much longer pipeline?
A shorter pipeline with better worst case handling is going to do
better, even if older. Intel was going for high clock benchmark
speed, not performance.
On Tue, 10 Sep 2024 7:35:31 +0000, Michael S wrote:
On Mon, 9 Sep 2024 20:52:29 +0000
[email protected] (MitchAlsup1) wrote:
On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:
On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:
Does hardware on which negative stride is faster really exist?
When the negative stride can be compared to zero, yes. else no.
But the performance gain is often zero and sometimes negative.
Direction of the count is not related to the sign of pointer's
stride.
For the record; I was responding to an array index stride not a
pointer stride.
George Neuner <[email protected]> writes:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.
Yes, that would be reasonable. Unfortunately, "optimizations" that
assume that undefined behaviour does not happen are not justified by
assigning reasonable meaning to language constructs, but by giving
only the little meaning to language constructs that the standard
requires, and in case of ordered comparisons between pointers to
different objects, the C standard does not define a meaning for that.
All of gcc, clang and MSVC seem happy with this.
But the next version of gcc or clang might see such a check and decide
to bite you.
One can cast the pointers to uintptr_t, and try to do the check
there. AFAIK the result would be implementation-defined, but on an
architecture with a flat address space it's unlikely that they will
find a way to compile the code in a different way than the programmer
intended without making "relevant" programs slower.
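A minimal sketch of such a uintptr_t-based check follows. The function name ranges_overlap is made up for illustration, and the code assumes a flat address space in which the converted integers order the same way as the addresses; the C standard does not promise that.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if [src, src+len) and [dst, dst+len) share at least one
   byte.  The pointer-to-integer conversion is implementation-defined,
   so this relies on a flat address space where the integer values
   order the same way as the addresses (an assumption, not a guarantee
   of the C standard). */
bool ranges_overlap(const void *src, const void *dst, size_t len)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;

    /* Unsigned distance between the two start addresses. */
    uintptr_t dist = (s > d) ? s - d : d - s;
    return dist < len;
}
```

No relational comparison of pointers takes place here, only of integers, so the check itself does not exercise undefined behaviour; what remains unspecified is whether the integers mean what one hopes.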
On Tue, 10 Sep 2024 07:37:59 -0700
Tim Rentsch <[email protected]> wrote:
I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.
Unfortunately, my solution is wrong and mistake is not even subtle.
On Tue, 10 Sep 2024 18:03:01 -0000 (UTC)
Brett <[email protected]> wrote:
Terje Mathisen <[email protected]> wrote:
Brett wrote:
David Brown <[email protected]> wrote:
Often you get the most efficient results by writing code clearly
and simply so that the compiler can understand it better and generate
good object code. This is particularly true if you want the same
source to be used on different targets or different variants of a
target - few people can track the instruction scheduling and
timings on multiple processors better than a good compiler. (And
the few people who /can/ do that spend their time chatting in
comp.arch instead of writing code...) When you do hand-made
micro-optimisations, these can work against the compiler and give
poorer results overall.
I know of no example where hand optimized code does worse on a
newer CPU. A newer CPU with bigger OoOe will effectively unroll
your code and schedule it even better.
Not true:
My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium
I got it to run at 1.5 clock cycles per character (40 MB/s on a 60
MHz Pentium).
When the PentiumPro came out, it did a 10-20 cycle stall for every
pair of characters, so about an order of magnitude slower in cycle
count. (But only about 3X clock time due to being 200 instead of 60
MHz.)
But how big a slowdown did the unoptimized code get?
Are you describing a glass jaw handling unpredictable branches on a
CPU with a much longer pipeline?
No, the glass jaw of PPro described by Terje is known as partial
register stall.
A shorter pipeline with better worst case handling is going to do
better, even if older. Intel was going for high clock benchmark
speed, not performance.
Typically, PPro was much faster than Pentium clock-for-clock,
especially so when running 32-bit software.
But it had a few weak points.
Michael S <[email protected]> writes:
On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)
In theory, on unusual hardware platform it can give false positives.
May be, for task in hand that's o.k.
The challenge is to find portable C that doesn't enter the arena
of undefined behavior (and also detects exactly those cases where
overlap occurs), and that is quite a stringent criterion.
The comparison shown works if src and dst both point to elements
of the same array. But if they don't, comparing pointers to see
if one is less than another (or any of <, <=, >, >=) is undefined
behavior. At the bit level it wouldn't surprise me to learn that
the test shown always returns accurate information. However the
C standard doesn't promise that a bit-level comparison will be
done, and implementations are allowed to do anything at all for
this test in cases where src and dst point to (somewhere within)
different top-level objects. What the hardware does doesn't
matter - what needs to be satisfied are the rules of the C
standard, and they are less forgiving.
I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.
Terje Mathisen <[email protected]> wrote:
Brett wrote:
David Brown <[email protected]> wrote:
Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same source to be
used on different targets or different variants of a target - few people
can track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.
I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and
schedule it even better.
Not true:
My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium I got
it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).
When the PentiumPro came out, it did a 10-20 cycle stall for every pair
of characters, so about an order of magnitude slower in cycle count.
(But only about 3X clock time due to being 200 instead of 60 MHz.)
But how big a slowdown did the unoptimized code get?
Are you describing a glass jaw handling unpredictable branches on a CPU
with a much longer pipeline?
Tim Rentsch wrote:
Michael S <[email protected]> writes:
On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?
The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)
Does that work for dst < src? What if dst+len < src?
[email protected] (Anton Ertl) writes:...
George Neuner <[email protected]> writes:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
It is legal to test for equality between pointers to different objects
so you could test for overlap by testing against every element in the
array. It seems like it should be possible for the compiler to figure
out what's happening and optimize those tests away, but unfortunately
no compiler I tested did it.
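As a rough illustration of the equality-only approach, here is a minimal sketch. The name overlap_by_equality is hypothetical, and the code assumes both ranges are valid for len bytes so that every pointer it forms points at an actual element.

```c
#include <stdbool.h>
#include <stddef.h>

/* Overlap test using only == on pointers, which is defined even for
   pointers into different objects.  Two ranges of equal length overlap
   exactly when one start address lies inside the other range, so each
   start is compared against every element of the other range.
   Cost is O(len) unless a compiler manages to simplify the loop. */
bool overlap_by_equality(const char *src, const char *dst, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (src + i == dst || dst + i == src)
            return true;
    }
    return false;
}
```

The loop stops at i < len, so it never compares a one-past-the-end pointer, which could otherwise compare equal to the start of an adjacent object.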
I do believe though that in reality it could be faster to use the
branchy version, and let the branch predictors do their job instead
of having to wait to evaluate all three terms:
bool is_overlap(char *src, char *dst, size_t len)
{
if (src < dst) {
return (src+len > dst);
}
return (dst+len > src);
}
Terje
Michael S <[email protected]> writes:
On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)
In theory, on unusual hardware platform it can give false positives.
May be, for task in hand that's o.k.
The challenge is to find portable C that doesn't enter the arena
of undefined behavior (and also detects exactly those cases where
overlap occurs), and that is quite a stringent criterion.
The comparison shown works if src and dst both point to elements
of the same array. [...]
On Tue, 10 Sep 2024 22:27:02 +0300
Michael S <[email protected]> wrote:
On Tue, 10 Sep 2024 07:37:59 -0700
Tim Rentsch <[email protected]> wrote:
I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.
Unfortunately, my solution is wrong and mistake is not even subtle.
This one appears to work: (src < dst+len) == (dst < src+len)
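One way to gain some confidence in that expression is to compare it with the textbook interval test for every pair of offsets inside a single array, where all pointer comparisons are defined. The sketch below does that; all function names are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* The test from the post, written for char pointers. */
static bool overlap_eq_form(const char *src, const char *dst, size_t len)
{
    return (src < dst + len) == (dst < src + len);
}

/* Reference: two ranges overlap iff each starts before the other ends. */
static bool overlap_reference(const char *src, const char *dst, size_t len)
{
    return src < dst + len && dst < src + len;
}

/* Compare both forms for every pair of offsets inside one array, so all
   pointer comparisons stay within a single object.  Returns the number
   of mismatches found. */
int count_mismatches(size_t len)
{
    enum { N = 128 };
    static char buf[N];
    int mismatches = 0;

    for (size_t i = 0; i + len <= N; i++)
        for (size_t j = 0; j + len <= N; j++)
            if (overlap_eq_form(buf + i, buf + j, len) !=
                overlap_reference(buf + i, buf + j, len))
                mismatches++;
    return mismatches;
}
```

For len > 0 the two forms agree, because both sides of the == cannot be false at once. For len == 0 and src == dst the == form reports an overlap while the reference does not, which may or may not matter for the task at hand.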
It is legal to test for equality between pointers to different objects
so you could test for overlap by testing against every element in the
array.
Josh Vanderhoof <[email protected]> writes:
[how to write a portable, UB-free check if memcpy() intervals overlap]
It is legal to test for equality between pointers to different
objects
Right. This observation is the key insight.
On Wed, 11 Sep 2024 09:29:04 -0700
Tim Rentsch <[email protected]> wrote:
Josh Vanderhoof <[email protected]> writes:
[how to write a portable, UB-free check if memcpy() intervals overlap]
It is legal to test for equality between pointers to different
objects
Right. This observation is the key insight.
Real mode x86 C compilers operating in Large and Compact Models that
were popular on IBM-compatible PCs 30-40 years ago could have more than
one representation for the pointer to the same memory location. If my
memory serves me, the rules of pointers comparison for equality were
the same as rules of comparison for <>. In both cases for reliable
result pointers had to be explicitly normalized (i.e. converted from
'far' to 'huge' or something like that).
It was long time ago and even back then I didn't use Large model very
often, so it's possible that I misremember. But if I remember
correctly, does it mean that those C compilers now would be considered
non-compliant?
Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
<[email protected]> wrote:
On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will
work on platforms one doesn't have, but there is a relatively
simple and portable way to tell if some memcpy() call crosses
over into the realm of undefined behavior.
1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?
The result of comparing pointers to two elements of the same array
is defined. Cast to (char*), both src and dst can be considered
to point to elements of the [address space sized] char array at
address zero.
According to my understanding, your 'can be considered' part is not
codified in the C Standard.
Adding size_t to a pointer yields another pointer of the same
type.
All of gcc, clang and MSVC seem happy with this.
It works. But is it guaranteed to work in the future by some sort
of document? I am pretty sure that no such guarantee exists in gcc
and MSVC docs. I did not look in clang docs. Trying to find
anythings in LLVM/clang docs makes me sad.
I know that it has worked as expected with every version of gcc
and Microsoft I've used since 1988. [clang I don't use, but I
tried it on godbolt.org with the most recent version]
Will it continue to work ... who knows?
I definitely am NOT an expert on the C standard, but thinking
about it, it occurred to me that if an array is explicitly defined
that *might* cover all memory (or at least all heap), then the
compiler would have to honor any apparent pointers into it.
E.g., char (*all_memory)[] = 0;
None of the compilers at godbolt seem to need this to compare
arbitrary addresses as char*, but all accept it.
Obviously speculation, but it's the best I have.
Michael S <[email protected]> writes:
On Wed, 11 Sep 2024 09:29:04 -0700
Tim Rentsch <[email protected]> wrote:
Josh Vanderhoof <[email protected]> writes:
[how to write a portable, UB-free check if memcpy() intervals
overlap]
It is legal to test for equality between pointers to different
objects
Right. This observation is the key insight.
Real mode x86 C compilers operating in Large and Compact Models that
were popular on IBM-compatible PCs 30-40 years ago could have more
than one representation for the pointer to the same memory
location. If my memory serves me, the rules of pointers comparison
for equality were the same as rules of comparison for <>. In both
cases for reliable result pointers had to be explicitly normalized
(i.e. converted from 'far' to 'huge' or something like that).
It was long time ago and even back then I didn't use Large model
very often, so it's possible that I misremember. But if I remember
correctly, does it mean that those C compilers now would be
considered non-compliant?
The C standard was first ratified (by ANSI) in 1989. The rules
for pointer comparison were clarified in the C99 standard, but it
has always been true that pointers to the same object have to
compare equal.
C environments that have things like 'far' or 'huge' pointers,
etc, are not standard C but must have extensions so that they can
deal with the different kinds of pointers. Depending on how the
non-standard kinds of pointer worked, the implementation might or
might not be conforming. Most likely though it's a moot point
because once a program starts using an extension all the rules
can change, and the C standard allows that. It's only programs
that look like really standard C that have to do what the C
standard says (for the implementation to be conforming); any
code that declares a 'far' pointer or 'huge' pointer certainly
isn't standard C.
BGB <[email protected]> writes:
[...]
Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.
If the two memory blocks don't overlap, memmove() can use the
fastest stride. [...]
The way to go for memmove() is:
On hardware where positive stride is faster:
if (((uintptr_t)(dest-src)) >= len)
return memcpy_posstride(dest,src,len);
else
return memcpy_negstride(dest,src,len);
On hardware where the negative stride is faster:
if (((uintptr_t)(src-dest)) >= len)
return memcpy_negstride(dest,src,len);
else
return memcpy_posstride(dest,src,len);
And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.
The benefit of this comparison over just comparing the addresses
is that the branch will have a much lower miss rate.
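For concreteness, here is a compilable sketch of the positive-stride variant. It swaps the pointer subtraction for a subtraction of uintptr_t values, which is implementation-defined rather than undefined but still assumes a flat address space. The helper names follow the post; their byte-loop bodies and the name memmove_sketch are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Forward (ascending address) byte copy. */
static void *memcpy_posstride(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    for (size_t i = 0; i < len; i++)
        d[i] = s[i];
    return dest;
}

/* Backward (descending address) byte copy. */
static void *memcpy_negstride(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    for (size_t i = len; i-- > 0; )
        d[i] = s[i];
    return dest;
}

/* Variant for hardware where the positive stride is faster. */
void *memmove_sketch(void *dest, const void *src, size_t len)
{
    /* Unsigned wrap-around does the work: if dest is below src the
       difference is huge, and if dest is at or beyond src+len it is at
       least len.  In both cases a forward copy is safe.  Only when
       dest lies inside [src, src+len) is the backward copy needed. */
    if ((uintptr_t)dest - (uintptr_t)src >= len)
        return memcpy_posstride(dest, src, len);
    else
        return memcpy_negstride(dest, src, len);
}
```

With a single unsigned comparison, the backward path is taken only for the overlapping case that needs it, which is where the lower branch miss rate comes from.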
Josh Vanderhoof <[email protected]> writes:
[email protected] (Anton Ertl) writes:
George Neuner <[email protected]> writes:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?
...
It is legal to test for equality between pointers to different
objects so you could test for overlap by testing against every
element in the array. It seems like it should be possible for the
compiler to figure out what's happening and optimize those tests
away, but unfortunately no compiler I tested did it.
That would be an interesting result of the ATUBDNH lunacy:
programmers would see themselves forced to write workarounds such
as the one you suggest (with terrible performance when not
optimized), and then C compiler maintainers would see themselves
forced to optimize this kind of code. The end result would be
that both parties have to put in more effort to eventually get the
same result as if ordered comparison between different objects had
been defined from the start.
For now, the ATUBDNH advocates tell programmers that they have to
work around the lack of definition, but there is usually no
optimization for that.
On 9/11/2024 5:38 AM, Anton Ertl wrote:
Josh Vanderhoof <[email protected]> writes:
[email protected] (Anton Ertl) writes:...
George Neuner <[email protected]> writes:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?
It is legal to test for equality between pointers to different objects
so you could test for overlap by testing against every element in the
array. It seems like it should be possible for the compiler to figure
out what's happening and optimize those tests away, but unfortunately
no compiler I tested did it.
That would be an interesting result of the ATUBDNH lunacy: programmers
would see themselves forced to write workarounds such as the one you
suggest (with terrible performance when not optimized), and then C
compiler maintainers would see themselves forced to optimize this kind
of code. The end result would be that both parties have to put in
more effort to eventually get the same result as if ordered comparison
between different objects had been defined from the start.
For now, the ATUBDNH advocates tell programmers that they have to work
around the lack of definition, but there is usually no optimization
for that.
One case where things work somewhat along the lines you suggest is
unaligned accesses. Traditionally, if knowing that the hardware
supports unaligned accesses, for a 16-bit load one would write:
int16_t foo1(int16_t *p)
{
return *p;
}
If one does not know that the hardware supports unaligned accesses,
the traditional way to perform such an access (little-endian) is
something like:
int16_t foo2(int16_t *p)
{
unsigned char *q = (unsigned char *)p;
return (int16_t)(q[0] + (q[1]<<8));
}
Now, several years ago, somebody told me that the proper way is as
follows:
int16_t foo3(int16_t *p)
{
int16_t v;
memcpy(&v,p,2);
return v;
}
That way looked horribly inefficient to me, with v having to reside in
memory instead of in a register and then the expensive function call,
and all the decisions that memcpy() has to take depending on the
length argument. But gcc optimizes this idiom into an unaligned load
rather than taking all the steps that I expected (however, I have seen
cases where the code produced on hardware that supports unaligned
accesses is worse than that for foo1()). Of course, if you also want
to support less sophisticated compilers, this idiom may be really slow
on those, although not quite as expensive as your containment check.
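To make the comparison concrete, here is a small self-contained sketch of the byte-wise and memcpy()-based loads applied to a misaligned address. The function names are made up for illustration, and the parameters are const void * rather than int16_t * so that no misaligned int16_t pointer is ever formed.

```c
#include <stdint.h>
#include <string.h>

/* Byte-wise little-endian load; works for any alignment and gives the
   same value regardless of the host's byte order.  The final
   conversion to int16_t is implementation-defined for values above
   INT16_MAX (it wraps on the usual two's-complement targets). */
int16_t load16_le_bytes(const void *p)
{
    const unsigned char *q = p;
    return (int16_t)(uint16_t)(q[0] | (q[1] << 8));
}

/* memcpy-based load; works for any alignment and yields the value in
   the host's byte order.  Compilers such as gcc typically turn this
   into a single load where the hardware allows it. */
int16_t load16_memcpy(const void *p)
{
    int16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a little-endian host both functions return the same value; on a big-endian host the memcpy() version returns the byte-swapped one, so the two are only interchangeable when the data is in host byte order.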
Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
Clang:
__LITTLE_ENDIAN__, __BIG_ENDIAN__
One or the other is defined based on endian.
GCC:
__BYTE_ORDER__ which may equal one of:
__ORDER_LITTLE_ENDIAN__
__ORDER_BIG_ENDIAN__
__ORDER_PDP_ENDIAN__
MSVC:
REG_DWORD is one of:
REG_DWORD_LITTLE_ENDIAN
REG_DWORD_BIG_ENDIAN
GCC:
__SIZEOF_type__ //gives sizeof various types
Possible:
__MINALIGN_type__ //minimum allowed alignment for type
Maybe also alias pointer control:
__POINTER_ALIAS__
__POINTER_ALIAS_CONSERVATIVE__
__POINTER_ALIAS_STRICT__
Where, pointer alias can be declared, and:
If conservative, then conservative semantics are being used.
Pointers may be freely cast without concern for pointer aliasing.
Compiler will assume that "non restrict" pointer stores may alias.
If strict, the compiler is using TBAA semantics.
Compiler may assume that aliasing is based on pointer types.
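For the endianness part, the GCC and Clang macros listed above can be folded into one project-local macro today. The sketch below shows one way to do it; TARGET_ENDIAN, the ENDIAN_* constants and runtime_endian are made-up names, and the MSVC and proposed macros are left out because their behaviour is not confirmed here.

```c
/* Compile-time byte-order detection using the compiler macros listed
   above.  All names defined here are hypothetical. */
#define ENDIAN_UNKNOWN 0
#define ENDIAN_LITTLE  1
#define ENDIAN_BIG     2

#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#  define TARGET_ENDIAN ENDIAN_LITTLE
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#  define TARGET_ENDIAN ENDIAN_BIG
#elif defined(__LITTLE_ENDIAN__)
#  define TARGET_ENDIAN ENDIAN_LITTLE
#elif defined(__BIG_ENDIAN__)
#  define TARGET_ENDIAN ENDIAN_BIG
#else
#  define TARGET_ENDIAN ENDIAN_UNKNOWN
#endif

/* Run-time check, usable as a fallback or as a sanity test.  It only
   distinguishes little-endian from everything else. */
int runtime_endian(void)
{
    unsigned int one = 1;
    unsigned char first = *(unsigned char *)&one;
    return first == 1 ? ENDIAN_LITTLE : ENDIAN_BIG;
}
```

A startup assertion that TARGET_ENDIAN is either unknown or equal to runtime_endian() is a cheap way to catch a misdetected target.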
What about:
max(src,dst) < (min(src,dst)+len)
If you have a min/max circuit, i.e. a two-element sorter, then it
could be quite efficient; otherwise run the min first, then the
max and the add during the second cycle, before the less-than test
in the third cycle.
I do believe though that in reality it could be faster to use the
branchy version, and let the branch predictors do their job
instead of having to wait to evaluate all three terms:
bool is_overlap(char *src, char *dst, size_t len)
{
    if (src < dst) {
        return (src+len > dst);
    }
    return (dst+len > src);
}
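For comparison, the max/min form of the test might be written as below. This is only a sketch: the name is_overlap_minmax is made up here, and comparing addresses as uintptr_t values assumes a flat address space in which the integer value reflects the address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Branch-free form of the test: the higher start address must lie below
   the end of the lower block. Addresses are compared as integers, which
   assumes a flat address space where uintptr_t reflects the address. */
static bool is_overlap_minmax(const char *src, const char *dst, size_t len)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    uintptr_t lo = s < d ? s : d;   /* min(src, dst) */
    uintptr_t hi = s < d ? d : s;   /* max(src, dst) */
    return hi - lo < len;           /* same as hi < lo + len, without overflow */
}
```

A compiler is free to turn the two selects into conditional moves or back into a branch, so which form is faster would have to be measured on the hardware in question.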
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
[...]
Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to
use. Whether or not the target/compiler allows misaligned memory
access; If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
[considering which way to copy with memmove()]...
If the two memory blocks don't overlap, memmove() can use the
fastest stride. [...]
The way to go for memmove() is:
On hardware where positive stride is faster:
  if (((uintptr)(dest-src)) >= len)
    return memcpy_posstride(dest,src,len)
  else
    return memcpy_negstride(dest,src,len)
On hardware where the negative stride is faster:
  if (((uintptr)(src-dest)) >= len)
    return memcpy_negstride(dest,src,len)
  else
    return memcpy_posstride(dest,src,len)
And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.
Last but not least, having two different code blocks for the
different preferences is clunky. The two blocks can be
combined by fusing the two test expressions into a single
expression, as for example
#ifndef PREFER_UPWARDS
#define PREFER_UPWARDS 1
#endif/*PREFER_UPWARDS*/
extern void* ascending_copy( void*, const void*, size_t );
extern void* descending_copy( void*, const void*, size_t );
void *
good_memmove( void *vd, const void *vs, size_t n ){
    const char *d = vd;
    const char *s = vs;
    _Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;
    return
        upwards
        ? ascending_copy( vd, vs, n )
        : descending_copy( vd, vs, n );
}
Using the preprocessor symbol PREFER_UPWARDS to select between
the two preferences (ascending or descending) allows the choice
to be made by a -D compiler option, and we can expect the compiler
to optimize away the part of the test that is never used.
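The two copy routines are only declared in the post. A minimal byte-wise sketch of what they might look like is given here, together with a wrapper (sketch_memmove, a hypothetical name) that makes the upward-preferring decision on integer addresses instead of on pointer differences. It assumes a flat address space where uintptr_t reflects the address; it is not tuned and not what any real library does.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy low addresses first; safe when dest is below src or the blocks
   are disjoint. */
static void *ascending_copy(void *vd, const void *vs, size_t n)
{
    unsigned char *d = vd;
    const unsigned char *s = vs;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return vd;
}

/* Copy high addresses first; safe when dest is above src. */
static void *descending_copy(void *vd, const void *vs, size_t n)
{
    unsigned char *d = vd;
    const unsigned char *s = vs;
    while (n-- > 0)
        d[n] = s[n];
    return vd;
}

/* Upward-preferring choice: the unsigned difference d - s wraps to a
   large value when dest is below src, so ascending is chosen then. */
static void *sketch_memmove(void *vd, const void *vs, size_t n)
{
    uintptr_t d = (uintptr_t)vd, s = (uintptr_t)vs;
    return (d - s >= n) ? ascending_copy(vd, vs, n)
                        : descending_copy(vd, vs, n);
}
```

With dest two bytes above src and n = 4 the difference is 2, which is less than n, so the descending copy is used; with dest below src the wrapped difference is huge and the ascending copy is used.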
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
[considering which way to copy with memmove()]
If the two memory blocks don't overlap, memmove() can use the
fastest stride. [...]
The way to go for memmove() is:
On hardware where positive stride is faster:
if (((uintptr)(dest-src)) >= len)
return memcpy_posstride(dest,src,len)
else
return memcpy_negstride(dest,src,len)
On hardware where the negative stride is faster:
if (((uintptr)(src-dest)) >= len)
return memcpy_negstride(dest,src,len)
else
return memcpy_posstride(dest,src,len)
And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.
...
Last but not least, having two different code blocks for the
different preferences is clunky. The two blocks can be
combined by fusing the two test expressions into a single
expression, as for example
#ifndef PREFER_UPWARDS
#define PREFER_UPWARDS 1
#endif/*PREFER_UPWARDS*/
extern void* ascending_copy( void*, const void*, size_t );
extern void* descending_copy( void*, const void*, size_t );
void *
good_memmove( void *vd, const void *vs, size_t n ){
const char *d = vd;
const char *s = vs;
_Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;
return
upwards
? ascending_copy( vd, vs, n )
: descending_copy( vd, vs, n );
}
Using the preprocessor symbol PREFER_UPWARDS to select between
the two preferences (ascending or descending) allows the choice
to be made by a -D compiler option, and we can expect the compiler
to optimize away the part of the test that is never used.
That's clever, but for usage in glibc or the like the clunky version
is the preferred one: [elaboration]
Michael S <[email protected]> writes:
On Wed, 11 Sep 2024 17:34:38 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
Real mode x86 C compilers operating in Large and Compact Models
that were popular on IBM-compatible PCs 30-40 years ago could
have more than one representation for the pointer to the same
memory location. If my memory serves me, the rules of pointers
comparison for equality were the same as rules of comparison for
<>. In both cases for reliable result pointers had to be
explicitly normalized (i.e. converted from 'far' to 'huge' or
something like that).
It was long time ago and even back then I didn't use Large model
very often, so it's possible that I misremember. But if I
remember correctly, does it mean that those C compilers now would
be considered non-compliant?
The C standard was first ratified (by ANSI) in 1989. The rules
for pointer comparison were clarified in the C99 standard, but it
has always been true that pointers to the same object have to
compare equal.
C environments that have things like 'far' or 'huge' pointers,
etc, are not standard C but must have extensions so that they can
deal with the different kinds of pointers. Depending on how the
non-standard kinds of pointer worked, the implementation might or
might not be conforming. Most likely though it's a moot point
because once a program starts using an extension all the rules
can change, and the C standard allows that. It's only programs
that look like really standard C that have to do what the C
standard says (for the implementation to be conforming); any
code that declares a 'far' pointer or 'huge' pointer certainly
isn't standard C.
In Compact and Large models data pointers are 'far' by default. So,
the source doesn't have to use non-standard declarations.
In that case, if the defaulted 'far' pointers don't follow the
rules given in the C standard for regular pointers, then the
implementation is not conforming. Extensions are allowed only if
they don't change the behavior of any strictly conforming
program. If undecorated pointer declarations don't observe this
condition then it's not a valid extension, which in turn causes
the implementation to be non-conforming.
Code inside the implementation is allowed to exploit internal
knowledge.
On Wed, 11 Sep 2024 17:34:38 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Wed, 11 Sep 2024 09:29:04 -0700
Tim Rentsch <[email protected]> wrote:
Josh Vanderhoof <[email protected]> writes:
[how to write a portable, UB-free check if memcpy() intervals
overlap]
It is legal to test for equality between pointers to different
objects
Right. This observation is the key insight.
Real mode x86 C compilers operating in Large and Compact Models that
were popular on IBM-compatible PCs 30-40 years ago could have more
than one representation for the pointer to the same memory
location. If my memory serves me, the rules of pointers comparison
for equality were the same as rules of comparison for <>. In both
cases for reliable result pointers had to be explicitly normalized
(i.e. converted from 'far' to 'huge' or something like that).
It was long time ago and even back then I didn't use Large model
very often, so it's possible that I misremember. But if I remember
correctly, does it mean that those C compilers now would be
considered non-compliant?
The C standard was first ratified (by ANSI) in 1989. The rules
for pointer comparison were clarified in the C99 standard, but it
has always been true that pointers to the same object have to
compare equal.
C environments that have things like 'far' or 'huge' pointers,
etc, are not standard C but must have extensions so that they can
deal with the different kinds of pointers. Depending on how the
non-standard kinds of pointer worked, the implementation might or
might not be conforming. Most likely though it's a moot point
because once a program starts using an extension all the rules
can change, and the C standard allows that. It's only programs
that look like really standard C that have to do what the C
standard says (for the implementation to be conforming); any
code that declares a 'far' pointer or 'huge' pointer certainly
isn't standard C.
In Compact and Large models data pointers are 'far' by default. So,
the source doesn't have to use non-standard declarations.
That's clever, but for usage in glibc or the like the clunky version
is the preferred one: memmove() is usually called through the dynamic
linking mechanism, and which implementation is actually called is
selected based on the hardware that it runs on (what does it do when
the program is linked statically?). There seem to be quite a few
memmove() (and __memmove_chk()) implementations in glibc-2.36 on
AMD64:
__memmove_chk
__memmove_sse2_unaligned_erms
__memmove_chk
__memmove_chk_erms
__memmove_chk_evex_unaligned
__memmove_chk_avx_unaligned
__memmove_chk_ssse3
__memmove_chk_sse2_unaligned
__memmove_erms
__memmove_avx512_unaligned
__memmove_evex_unaligned
__memmove_evex_unaligned_erms
__memmove_avx_unaligned
__memmove_avx_unaligned_erms
__memmove_avx_unaligned_rtm
__memmove_ssse3
__memmove_sse2_unaligned
__memmove_chk_sse2_unaligned_erms
__memmove_chk_avx512_no_vzeroupper
__memmove_chk_avx512_unaligned
__memmove_chk_avx512_unaligned_erms
__memmove_chk_evex_unaligned_erms
__memmove_chk_avx_unaligned_erms
__memmove_chk_avx_unaligned_rtm
__memmove_chk_avx_unaligned_erms_rtm
__memmove_avx512_no_vzeroupper
__memmove_avx512_unaligned_erms
__memmove_avx_unaligned_erms_rtm
On Thu, 12 Sep 2024 14:20:42 +0000, Anton Ertl wrote:
That's clever, but for usage in glibc or the like the clunky version
is the preferred one: memmove() is usually called through the dynamic
linking mechanism, and which implementation is actually called is
selected based on the hardware that it runs on (what does it do when
the program is linked statically?). There seem to be quite a few
memmove() (and __memmove_chk()) implementations in glibc-2.36 on
AMD64:
__memmove_chk
__memmove_sse2_unaligned_erms
__memmove_chk
__memmove_chk_erms
__memmove_chk_evex_unaligned
__memmove_chk_avx_unaligned
__memmove_chk_ssse3
__memmove_chk_sse2_unaligned
__memmove_erms
__memmove_avx512_unaligned
__memmove_evex_unaligned
__memmove_evex_unaligned_erms
__memmove_avx_unaligned
__memmove_avx_unaligned_erms
__memmove_avx_unaligned_rtm
__memmove_ssse3
__memmove_sse2_unaligned
__memmove_chk_sse2_unaligned_erms
__memmove_chk_avx512_no_vzeroupper
__memmove_chk_avx512_unaligned
__memmove_chk_avx512_unaligned_erms
__memmove_chk_evex_unaligned_erms
__memmove_chk_avx_unaligned_erms
__memmove_chk_avx_unaligned_rtm
__memmove_chk_avx_unaligned_erms_rtm
__memmove_avx512_no_vzeroupper
__memmove_avx512_unaligned_erms
__memmove_avx_unaligned_erms_rtm
All of these compile to the MM instruction in My 66000,
including the memcpy() variants.
Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.
Rust also gets rid of the horrible external library/configure/cmake
mess that kept me from successfully compiling the reference LAStools
lidar code for nearly 10 years.
Using the Rust port I just tell cargo to add it to my project and
that's it.
Terje
This is because in some cases, the performance overhead of copying the
last (sz&31) bytes is significant, say:
rsz=cte-ct;
if(rsz)
{
    if(rsz&16)
    {
        v0=((u64 *)cs)[0]; v1=((u64 *)cs)[1];
        ((u64 *)ct)[0]=v0; ((u64 *)ct)[1]=v1;
        cs+=16; ct+=16;
    }
    if(rsz&8)
    {
        v0=((u64 *)cs)[0];
        ((u64 *)ct)[0]=v0;
        cs+=8; ct+=8;
    }
    if(rsz&4)
    {
        v0=((u32 *)cs)[0];
        ((u32 *)ct)[0]=v0;
        cs+=4; ct+=4;
    }
    if(rsz&2)
    {
        v0=((u16 *)cs)[0];
        ((u16 *)ct)[0]=v0;
        cs+=2; ct+=2;
    }
    if(rsz&1)
    {
        v0=((byte *)cs)[0];
        ((byte *)ct)[0]=v0;
        cs++; ct++;
    }
}
For small copies with awkward sizes, this tailing part can cost more
than the whole rest of the copy.
On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:
Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.
I am trying to compare the speed of a few compiled languages in one benchmark
that I find interesting.
In order to make the comparison I have to port a test bench first, because
while most of these languages are able, with various levels of
difficulty, to call C routines, none of them can be called from C,
at least at my level of knowledge.
Porting test bench from C to Go was quite easy, the only part that I
didn't grasp immediately was related to time measurements.
Today I started Rust port and it is VERY much harder. After several
hours of reading of various tutorials, examples and Stack Overflow
articles I still don't know how to write
switch (argv[1][0]) {
case 't':
case 'T':
x = 42;
break;
}
At this rate, I am not sure that my motivation will last long enough to finish the porting.
Michael S <[email protected]> schrieb:
In order to make the comparison I have to port a test bench first,
because while most of these languages are able, with various levels
of difficulty, to call C routines, none of them can be called from
C, at least at my level of knowledge.
If you declare a Fortran procedure BIND(C), you can call it from C.
gfortran will give you the C prototype with -fc-prototypes.
Or, if you don't declare it BIND(C) and it uses old-style code,
you can use -fc-prototypes-external.
Michael S <[email protected]> writes:
On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:
Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.
I am trying to compare speed of few compiled languages in one
benchmark that I find interesting.
In order to make the comparison I have to port a test bench first,
because while most of these languages are able, with various levels
of difficulty, to call C routines, none of them can be called from
C, at least at my level of knowledge.
Porting test bench from C to Go was quite easy, the only part that I
didn't grasp immediately was related to time measurements.
Today I started Rust port and it is VERY much harder. After several
hours of reading of various tutorials, examples and Stack Overflow
articles I still don't know how to write
switch (argv[1][0]) {
case 't':
case 'T':
x = 42;
break;
}
At this rate, I am not sure that my motivation will last long
enough to finish the porting.
Disclaimer: I have very little experience with Rust. The
example shown below looks like Rust but may very well have
syntax errors (or worse).
match argv[1][0] {
't' | 'T' => { x = 42; }
_ => { }
}
The _ pattern matches anything that hasn't been matched (and
may be necessary, I'm not sure about that).
On Thu, 12 Sep 2024 18:33:18 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:
Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.
I am trying to compare speed of few compiled languages in one
benchmark that I find interesting.
In order to make comparison I have to port a test bench first,
because while most of this languages are able, with various level of
difficulties, to call C routines, none of them can be called from
'C', at least at my level of knowledge.
Porting test bench from C to Go was quite easy, the only part that I
didn't grasp immediately was related to time measurements.
Today I started Rust port and it is VERY much harder. After several
hours of reading of various tutorials, examples and Stack Overflow
articles I still don't know how to write
switch (argv[1][0]) {
case 't':
case 'T':
x = 42;
break;
}
At this rate, I am not sure that my motivation will last long
enough to finish the porting.
Disclaimer: I have very little experience with Rust. The
example shown below looks like Rust but may very well have
syntax errors (or worse).
match argv[1][0] {
't' | 'T' => { x = 42; }
_ => { }
}
The _ pattern matches anything that hasn't been matched (and
may be necessary, I'm not sure about that).
My hurdle is related to the [0] part rather than to the switch/case part.
Accessing the nth character of a String (or of str? Or &str? I am still
trying to figure out the difference.) is not as simple as in C or Go.
One person on Stack Overflow said that he was able to figure it out
after he learned the difference between std::string and
std::string_view in C++. Maybe I should follow the same process. But
I don't want to. I don't plan to become an expert Rust programmer,
but rather want to do a simple benchmark.
Michael S <[email protected]> writes:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
[...]
Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to
use. Whether or not the target/compiler allows misaligned memory
access; If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.
Why not?
Because it's not needed, and would make things worse rather
than better. The result would be a bigger language but not
a better language.
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain
limited classes of buffer overflows.
Eliminating undefined behavior is not what's being asked for.
These two questions are not the same.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
On 9/12/2024 9:18 AM, David Brown wrote:
On 11/09/2024 20:51, BGB wrote:
On 9/11/2024 5:38 AM, Anton Ertl wrote:
Josh Vanderhoof <[email protected]> writes:
[email protected] (Anton Ertl) writes:
Would be nice, say, if there were semi-standard compiler macros for
various things:
Ask, and you shall receive! (Well, sometimes you might receive.)
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
...
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
...
#else
...
#endif
Works in gcc, clang and MSVC.
Technically now also in BGBCC, since I have just recently added it.
And C23 has the <stdbit.h> header with many convenient little "bit and
byte" utilities, including endian detection:
#include <stdbit.h>
#if __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_LITTLE__
...
#elif __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_BIG__
...
#else
...
#endif
This is good at least.
Though, generally takes a few years before new features become usable.
Like, it is only in recent years that it has become "safe" to use most
parts of C99.
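Where none of these macros can be relied on yet, a runtime probe is a common fallback. The following is a minimal sketch under that assumption; host_is_little_endian is a hypothetical name, and the probe only distinguishes little-endian from everything else.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 otherwise (big-endian or some
   other byte order). Looks at the lowest-addressed byte of the value 1. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

Unlike the preprocessor tests, this cannot select code at compile time, although a compiler will usually fold the call to a constant.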
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Why would you need that? Any decent compiler will know what is
allowed for the target (perhaps partly on the basis of compiler
flags), and will generate the best allowed code for accesses like
foo3() above.
Imagine you have compilers that are smart enough to turn "memcpy()" into
a load and store, but not smart enough to optimize away the memory
accesses, or fully optimize away the wrapper functions...
So, for best results, the best case option is to use a pointer cast and dereference.
For some cases, one may also need to know whether or not they can access
the pointers in a misaligned way (and whether doing so would be better
or worse than something like "memcpy()").
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
Pointer comparisons are always allowed for equality tests if they are
pointers to objects of compatible types. (Function pointers cannot be
compared at all.)
For other relational tests, the pointers must point to sub-objects of
the same aggregate object. (That means they can't be null pointers,
misaligned pointers, invalid pointers or pointers going nowhere.)
This is independent of how the address space(s) are organised on the
target machine.
What you /can/ do, on pretty much any implementation with a single
linear address space, is convert pointers to uintptr_t and then
compare them. There may be some targets for which there is no
uintptr_t, or where the mapping from pointer to integer does not match
with the address, but that would be very unusual.
I can't think when you would need to do such comparisons, however,
other than to implement memmove - and library functions can use any
kind of implementation-specific feature they like.
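The uintptr_t approach described above might look like this in practice. It is a sketch only: addr_less is a hypothetical helper, and the result is meaningful only on implementations where the pointer-to-integer mapping follows the linear address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Orders two unrelated pointers by converting them to integers first,
   so no relational operator is applied to the pointers themselves. */
static bool addr_less(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;
}
```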
Yeah.
My "_memlzcpy()" functions do a lot of relative comparisons (more than
needed for memmove):
dst<=src: memmove
(dst-src)>=sz: memcpy
(dst-src)>=32: can copy with 32B blocks
(dst-src)>=16: can copy with 16B blocks
(dst-src)>= 8: can copy with 8B blocks
1/2/4: Generate a full-block fill pattern
3/5/6/7: partial fill pattern (16B block with irregular step)
There is a difference here between "_memlzcpy()" and "_memlzcpyf()" in
that:
the former will always copy an exact number of bytes;
the latter may write 16-32 bytes over the limit.
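To illustrate why the small-distance cases behave as a pattern fill, here is a byte-wise reference for an LZ77-style match copy. It is not the "_memlzcpy()" implementation described above, just a hypothetical baseline (lz_match_copy) that the block-wise cases would have to agree with.

```c
#include <stddef.h>

/* Hypothetical byte-wise reference for an LZ77-style match copy:
   copy len bytes from dst - dist to dst. When dist < len the source
   overlaps the output, so bytes written earlier are read again and the
   last dist bytes repeat as a fill pattern. */
static unsigned char *lz_match_copy(unsigned char *dst, size_t dist, size_t len)
{
    const unsigned char *src = dst - dist;
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    return dst + len;
}
```

With dist = 1 this degenerates into a byte fill, and with dist = 2 into a repeating two-byte pattern, which corresponds to the 1/2/4 fill-pattern case in the list. This version always writes exactly len bytes, like the exact variant rather than the one that may overshoot.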
Possible:
__MINALIGN_type__ //minimum allowed alignment for type
_Alignof(type) has been around since C11.
_Alignof tells the native alignment, not the minimum.
Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would give
1 if the target supports misaligned pointers.
Maybe also alias pointer control:
__POINTER_ALIAS__
__POINTER_ALIAS_CONSERVATIVE__
__POINTER_ALIAS_STRICT__
Where, pointer alias can be declared, and:
If conservative, then conservative semantics are being used.
Pointers may be freely cast without concern for pointer aliasing.
Compiler will assume that "non restrict" pointer stores may alias.
If strict, the compiler is using TBAA semantics.
Compiler may assume that aliasing is based on pointer types.
Faffing around with pointer types - breaking the "effective type"
rules - has been a bad idea and risky behaviour since C was
standardised. You never need to do it. (I accept, however, that on
some weaker or older compilers "doing the right thing" can be
noticeably less efficient than writing bad code.) Just get a
half-decent compiler and use memcpy(). For any situation where you
might think casting pointer types would be a good idea, your sizes are
small and known at compile time, so they are easy for the compiler to
optimise.
It depends.
In some things, like my ELF and PE/COFF program loaders, the code can
get particularly nasty in these areas...
And as a general rule, if you feel you really want to break the rules
of C and still get something useful out at the end, use "volatile"
liberally.
I have used "volatile" here to good effect.
Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.
There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.
George Neuner <[email protected]> writes:
On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
<[email protected]> wrote:
On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:
On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:
Tim Rentsch <[email protected]> writes:
[email protected] (Anton Ertl) writes:
There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,
There may not be a way to tell if memcpy()-calling code will
work on platforms one doesn't have, but there is a relatively
simple and portable way to tell if some memcpy() call crosses
over into the realm of undefined behavior.
1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?
The result of comparing pointers to two elements of the same array
is defined. Cast to (char*), both src and dst can be considered
to point to elements of the [address space sized] char array at
address zero.
According to my understanding, your 'can be considered' part is not
codified in the C Standard.
Adding size_t to a pointer yields another pointer of the same
type.
In terms of types, that is right, but the addition works only if
the pointer points into an array large enough to include the
result of the addition (the result is also allowed to be just one
past the end of the array).
All of gcc, clang and MSVC seem happy with this.
It works. But is it guaranteed to work in the future by some sort
of document? I am pretty sure that no such guarantee exists in gcc
and MSVC docs. I did not look in clang docs. Trying to find
anything in LLVM/clang docs makes me sad.
I know that it has worked as expected with every version of gcc
and Microsoft I've used since 1988. [clang I don't use, but I
tried it on godbolt.org with the most recent version]
Will it continue to work ... who knows?
I definitely am NOT an expert on the C standard, but thinking
about it, it occurred to me that if an array is explicitly defined
that *might* cover all memory (or at least all heap), then the
compiler would have to honor any apparent pointers into it.
E.g., char (*all_memory)[] = 0;
This declaration introduces a pointer, not an array. Similarly
the declaration
char (*great_white_array)[ 999999999999999999 ] = 0;
does not introduce an array but just a pointer (and initializes
the pointer to be a null pointer). There is no humongous array.
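The distinction can be made visible with sizeof. In this sketch (all names hypothetical, and the array size shrunk to 16 for convenience) only a pointer object exists; the array type is merely what the pointer would point to.

```c
#include <stddef.h>

/* Declares a pointer to an array of 16 chars, initialized to null.
   No array object is created by this declaration. */
static char (*ptr_to_array)[16] = 0;

/* Size of the array type the pointer could point to. The operand of
   sizeof is not evaluated here, so the null pointer is never
   dereferenced. */
static size_t pointee_size(void)
{
    return sizeof *ptr_to_array;
}

/* Size of the pointer object itself. */
static size_t pointer_size(void)
{
    return sizeof ptr_to_array;
}
```

pointee_size() reports 16 even though no 16-byte object exists anywhere, which is the sense in which the declaration introduces a type for the pointee but no storage.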
None of the compilers at godbolt seem to need this to compare
arbitrary addresses as char*, but all accept it.
The given declaration of 'all_memory' is strictly conforming.
It must be accepted by any conforming C implementation (which
all of gcc, clang, and MSVC purport to be, IIUC).
Obviously speculation, but it's the best I have.
It's important to realize that there are two distinct questions.
One, does the code work (in a given implementation)? Two, does
the code satisfy the rules given in the C standard?
Unfortunately having an answer to the first question does not by
itself give enough information to answer the second question.
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
[...]
Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
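For what it's worth, the first two wishes can already be approximated today. The sketch below is only one way to do it: __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ are macros predefined by GCC and Clang, while the helper names is_little_endian and load_le32 are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian target, 0 otherwise. */
static int is_little_endian(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    /* GCC and Clang predefine these macros; other compilers may not. */
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
    /* Portable fallback: inspect the first byte of a known value. */
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
#endif
}

/* Loads a 32-bit little-endian value from possibly misaligned memory.
   memcpy sidesteps the alignment question, and the shifts make the
   result independent of the host byte order. */
static uint32_t load_le32(const void *p)
{
    unsigned char b[4];
    memcpy(b, p, sizeof b);
    return (uint32_t)b[0] | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}
```

The single-address-space question has no comparable workaround inside the language, which is presumably why it keeps coming up.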
I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.
Why not?
Because it's not needed, and would make things worse rather
than better. The result would be a bigger language but not
a better language.
I beg to differ.
Yes, the standard would be bigger. And yes, a few unimportant benchmarks
would run a little slower. But the job of compiler writers would be
simpler and less exciting (good thing!). Most importantly,
programming in the resulting language would feel more predictable.
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain
limited classes of buffer overflows.
Eliminating undefined behavior is not what's being asked for.
These two questions are not the same.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
On Thu, 12 Sep 2024 04:04:06 -0700, Tim Rentsch
<[email protected]> wrote:
George Neuner <[email protected]> writes:
I definitely am NOT an expert on the C standard, but thinking
about it, it occurred to me that if an array is explicitly defined
that *might* cover all memory (or at least all heap), then the
compiler would have to honor any apparent pointers into it.
E.g., char (*all_memory)[] = 0;
This declaration introduces a pointer, not an array. Similarly
the declaration
char (*great_white_array)[ 999999999999999999 ] = 0;
does not introduce an array but just a pointer (and initializes
the pointer to be a null pointer). There is no humongous array.
Of course there is no actual array ... the point was to (try to)
define *something* such that the compiler would think there was an
array and consider any char* as possibly pointing to an element of
that array.
[And yes! it might end up pessimizing character manipulating code.]
The C standard guarantees that pointers to 2 elements of the same
array are comparable, and current (and past) compilers do allow
comparing arbitrary pointers when cast to char* without needing an
actual char array that covers the addresses.
But a guarantee wrt the standard requires the compiler to at least
*think* there is such an array. The question is how to do that.
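One route that avoids the relational-comparison rule altogether is to compare integers instead of pointers. This is only a sketch: uintptr_t is an optional type, the pointer-to-integer mapping is implementation-defined (not undefined), and the helper names ptr_before and ranges_overlap are made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: orders two possibly unrelated pointers by their
   integer representation. On flat-address-space targets this usually
   matches the address order, but the standard does not promise that. */
static int ptr_before(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;
}

/* Overlap test of the kind memmove needs, for two buffers of n bytes. */
static int ranges_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t ua = (uintptr_t)a, ub = (uintptr_t)b;
    return ua < ub ? ub - ua < n : ua - ub < n;
}
```

This does not make the compiler "think" there is an array; it simply moves the question from the standard to the implementation's documented pointer conversion.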
On 03/09/2024 18:54, Stephen Fuld wrote:
On 9/2/2024 11:23 PM, David Brown wrote:
On 02/09/2024 18:46, Stephen Fuld wrote:
On 9/2/2024 1:23 AM, Terje Mathisen wrote:
Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)
Can you talk about the advantages and disadvantages of Rust versus C?
And also for Rust versus C++ ?
I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.
I want to know about both :-)
In my field, small-systems embedded development, C has been dominant for
a long time, but C++ use is increasing. Most of my new stuff in recent
times has been C++. There are some in the field who are trying out
Rust, so I need to look into it myself - either because it is a better
choice than C++, or because customers might want it.
My impression - based on hearsay for Rust as I have no experience -
is that the key point of Rust is memory "safety". I use scare-quotes
here, since it is simply about correct use of dynamic memory and
buffers.
I agree that memory safety is the key point, although I gather that it
has other features that many programmers like.
Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up
compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many
features. But is it overall up or down, for /my/ uses?
Examples of things that I think are good in Rust are making variables
immutable by default and pattern matching. Steps down include lack of
function overloading
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety.
Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?
SCNR
On Fri, 13 Sep 2024 21:39:39 +0000, Thomas Koenig wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety.
FORTRAN allowed:
subroutine1:
COMMON /ALPHA/i,j,k,l,m,n
subroutine2:
COMMON /ALPHA/x,y,z
expecting {i,j}, which are INT*4, to overlap with x, which is REAL*8; ...
{Completely neglecting the BE/LE problems,...}
On 9/13/2024 10:55 AM, Thomas Koenig wrote:
David Brown <[email protected]> schrieb:
Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.
What about VLAs?
IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.
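Code that has to build on compilers without VLAs can guard the feature. The sketch below assumes the C11 feature-test macro __STDC_NO_VLA__ and, as an extra precaution, the MSVC-specific _MSC_VER; the function reverse_copy is a made-up example, not something from the thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: copies n bytes of src into dst in reverse order,
   going through scratch space. Returns 0 on success, -1 on failure. */
static int reverse_copy(char *dst, const char *src, size_t n)
{
    if (n == 0)
        return 0;
#if defined(__STDC_NO_VLA__) || defined(_MSC_VER)
    /* No VLAs available: fall back to the heap. */
    char *scratch = malloc(n);
    if (scratch == NULL)
        return -1;
#else
    char scratch[n];            /* C99 variable-length array */
#endif
    memcpy(scratch, src, n);
    for (size_t i = 0; i < n; i++)
        dst[i] = scratch[n - 1 - i];
#if defined(__STDC_NO_VLA__) || defined(_MSC_VER)
    free(scratch);
#endif
    return 0;
}
```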
Bernd Linsel <[email protected]> writes:
On 05.09.24 19:04, Terje Mathisen wrote:
One of my alternatives is
unsigned u = start; // Cannot be less than zero
if (u) {
  u++;
  do {
    u--;
    data[u]...
  } while (u);
}
This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above or Equal), but my version is far more verbose.
Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:
for (unsigned u = start; (u & TOPBIT) == 0; u--)
Terje
What about:
for (unsigned u = start; u != ~0u; --u)
This is the form we use most when we need
to work in reverse.
...
or even
for (unsigned u = start; (int)u >= 0; --u)
...
?
I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
on godbolt.org:
- 32 bit version: https://godbolt.org/z/TMhhx3nch
- 64 bit version: https://godbolt.org/z/8oxzTf5Gf
No significant differences in code generation for unsigned vs. signed.
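For readers who want to try the variants locally, here is a self-contained sketch of two of them. The function names are invented, and summing the elements is just a stand-in for the elided loop body.

```c
#include <assert.h>

/* Sums data[start] down to data[0] using the "u != ~0u" idiom:
   decrementing 0u wraps around to UINT_MAX, which ends the loop. */
static long sum_down_wrap(const int *data, unsigned start)
{
    long sum = 0;
    for (unsigned u = start; u != ~0u; --u)
        sum += data[u];
    return sum;
}

/* Same traversal in the more verbose do/while form. */
static long sum_down_dowhile(const int *data, unsigned start)
{
    long sum = 0;
    assert(start != ~0u);       /* start + 1 must not wrap to 0 */
    unsigned u = start + 1;
    do {
        --u;
        sum += data[u];
    } while (u != 0);
    return sum;
}
```

Both rely only on unsigned wraparound, which is well defined in C.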
Bringing it back to "architecture": like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).
On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?
SCNR
What I wrote is how all production C compilers work today. So it
will add no new bugs. What I propose is to formally codify 50 y.o.
existing practice.
And no, it's both much easier to follow than old FORTRAN common blocks
and has wider scope (applies to all storage classes, rather than just
to global).
When you write code working on signed numbers and do something like:
(a < 0) || (a >= max)
Then the compiler realizes if you treat 'a' as unsigned, this is just:
(unsigned)a >= max
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
[email protected] (Kent Dickey) writes:
Bringing it back to "architecture": like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).
We now have had more than 30 years of catering for this mistake by
everyone involved. Given their goals, I think that RISC-V made the
right choice for int in their ABI, even if the original choice by the
MIPS and Alpha people, which they follow like everyone else, was
wrong.
That being said, one option would be to introduce another ABI and API
with 64-bit int (and maybe 32-bit long short int), and programmers
could choose whether to program for the ILP API, or the int=int32_t
API. Would the ILP API/ABI fare better than x32? I doubt it, even
though I would support it. This ship probably has sailed.
- anton
Kent Dickey <[email protected]> schrieb:
When you write code working on signed numbers and do something like:
(a < 0) || (a >= max)
Then the compiler realizes if you treat 'a' as unsigned, this is just:
(unsigned)a >= max
For which definition of a and max exactly?
It certainly does not do so for
_Bool foo(int a, int max)
{
return (a < 0) || (a >= max);
}
In article <vc4mgj$1khmk$[email protected]>,
Thomas Koenig <[email protected]> wrote:
Kent Dickey <[email protected]> schrieb:
When you write code working on signed numbers and do something like:
(a < 0) || (a >= max)
Then the compiler realizes if you treat 'a' as unsigned, this is just:
(unsigned)a >= max
For which definition of a and max exactly?
It certainly does not do so for
_Bool foo(int a, int max)
{
return (a < 0) || (a >= max);
}
Sorry, I should have made it clear that this is for max >= 0 (but not
necessarily an unsigned variable), and in my code max is a constant,
which is how the compiler knows it's positive. I have this in my code
all the time to validate function inputs--a negative number is bad, and
a number beyond a certain reasonable value is bad. And I let the
compiler optimize the check to (unsigned)a >= (unsigned)max.
Kent
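The equivalence, and the max >= 0 precondition it depends on, can be checked with a short sketch. The function names are made up for illustration.

```c
#include <assert.h>

/* Two-comparison form, as one would naturally write it. */
static int out_of_range_signed(int a, int max)
{
    return (a < 0) || (a >= max);
}

/* Single-comparison form. Valid only when max >= 0: a negative a converts
   to an unsigned value above INT_MAX, hence above any non-negative max. */
static int out_of_range_unsigned(int a, int max)
{
    assert(max >= 0);
    return (unsigned)a >= (unsigned)max;
}
```

With a negative max the two forms can disagree. For a = -10 and max = -5 the signed form reports out of range because a < 0, while the unsigned comparison alone would not. That is why a compiler can only make the substitution when it can prove max is non-negative, for instance when max is a constant.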
BGB <[email protected]> schrieb:
On 9/13/2024 10:55 AM, Thomas Koenig wrote:
David Brown <[email protected]> schrieb:
Most of the commonly used parts of C99 have been "safe" to use
for 20 years. There were a few bits that MSVC did not implement
until relatively recently, but I think even they have caught up now.
What about VLAs?
IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.
It's only been 25 years. You have to give Microsoft a bit of
time to catch up. I'm sure they will get there by 2099.
On 12/09/2024 13:29, Michael S wrote:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
[...]
Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.
I fully agree that C is not, and should not be seen as, a "high-level
assembly language". But it is a language that is very useful to
"hardware-type folks", and there are a few things that could make it
easier to write more portable code if they were standardised. As it
is, we just have to accept that some things are not portable.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
And how should that be defined?
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would
be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?
SCNR
What I wrote is how all production C compilers work today. So it
will add no new bugs.
Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.
struct {
char x[8];
int y;
} bar;
Assume
bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y
The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.
So, either bar.y is treated as if it was volatile, or hard-to-detect
bugs would appear because, with optimization, the assignment would
sometimes change the value of bar.y and sometimes not.
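Spelled out as a compilable sketch (the struct tag bar_t and the function name are made up), the situation looks like this:

```c
#include <assert.h>

struct bar_t {
    char x[8];
    int y;
};

/* Under the current rules a compiler may assume that a store through
   bar->x[i] cannot modify bar->y, so it is free to return the constant
   1234 here without reloading bar->y from memory. */
static int store_then_read(struct bar_t *bar, unsigned i)
{
    assert(i < sizeof bar->x);  /* in-bounds index: behavior is defined */
    bar->y = 1234;
    bar->x[i] = 42;
    return bar->y;
}
```

If out-of-bounds stores into x were given defined semantics, the final load could no longer be folded to a constant whenever i is not known at compile time.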
On 9/12/2024 5:12 AM, Tim Rentsch wrote:
BGB <[email protected]> writes:
[...]
Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.
There are a few ways things can go:
Define rules, have one of N permutations for how those rules can go;
How it often worked in practice.
Throw up hands and say it is unknowable.
What a lot of "portability" people assert.
Do whatever gives the fastest results in standardized benchmarks.
What many compiler maintainers want.
On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would
be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?
SCNR
What I wrote is how all production C compilers work today. So it
will add no new bugs.
Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.
struct {
char x[8];
int y;
} bar;
Assume
bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y
The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.
Yes, exactly.
Michael S <[email protected]> schrieb:
On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by
implementation. And in practice it is. Just not in
theory.
Do you mean union rather than struct? And do you mean
bar.x[7] rather than bar.x[8]? Surely no one would expect
that storing into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think
should be defined by the C standard but is not? And the
same question for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior
would be bar.y==42 on LE machines and bar.y==42*2**24 on BE
machines. As it actually happens in reality with all
production compilers.
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers
(TM) have to have something that pays their salaries, right?
SCNR
What I wrote is how all production C compilers work today. So it
will add no new bugs.
Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.
struct {
char x[8];
int y;
} bar;
Assume
bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y
The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.
Yes, exactly.
So, volatile for all structs,
plus prescribed behavior on array overruns.
At the risk of repeating myself: This is an extremely bad idea.
I rest my case.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
That has two drawbacks: a minor one, that you need to know that
there is no padding between 'x' and 'y'.
The major drawback
is that it would forbid bounds checking for array accesses.
In code like above it is easy to spot out of bound access at
compile time.
You seem to think that C should be as optimizable and as full of UBs as
Fortran.
Many compiler authors agree with you.
I have a different idea.
IMHO, your party exploits the letter of the C
standard in violation of its spirit.
Michael S <[email protected]> wrote:
On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:
On 12/09/2024 13:29, Michael S wrote:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
And how should that be defined?
bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));
That has two drawbacks: a minor one, that you need to know that
there is no padding between 'x' and 'y'.
The major drawback
is that it would forbid bounds checking for array accesses.
In code like above it is easy to spot out of bound access at
compile time. Even with variable index compiler knows size
of 'x' so can insert bounds checking code (and AFAIK if you
insist leading compilers will do this).
More generally, assuming a cooperating compiler, modern C has enough
features to eliminate out of bounds array indexing.
More precisely,
I mean a compiler which inserts bounds checks where they are needed
and warns about or rejects constructs that can not be checked. I claim
that it is possible to write nontrivial programs in "checked C".
With a change as above, a very important language construct would be
uncheckable.
BTW: If you need such behaviour you can get what you want by
using unions, so there is no need to break language for folks
that do not need this.
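A sketch of the union route, under the assumption that the bytes of y are meant to be reachable by index. The type names pair and overlay and the function name are made up, and offsetof is used so that any padding is accounted for.

```c
#include <assert.h>
#include <stddef.h>

struct pair {
    char x[8];
    int y;
};

/* The union makes the overlap explicit: raw[] covers the whole struct. */
union overlay {
    struct pair s;
    unsigned char raw[sizeof(struct pair)];
};

/* Writes one byte at the offset of y through raw[] and returns s.y. */
static int poke_first_byte_of_y(unsigned char value)
{
    union overlay u;
    u.s.y = 0;
    u.raw[offsetof(struct pair, y)] = value;   /* within the bounds of raw[] */
    return u.s.y;
}
```

With a 4-byte int, poke_first_byte_of_y(42) should give 42 on a little-endian machine and 42*2**24 on a big-endian one, which is the behavior asked for, and a bounds checker can still verify every index against the declared size of raw[].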
On 2024-09-15, Michael S <[email protected]> wrote:
You seem to think that C should be as optimizable and as full of
UBs as Fortran.
The only place where "undefined behavior" is mentioned in the Fortran
standards is with reference to C.
Many compiler authors agree with you.
I have different idea.
You don't appear to believe in specifications.
IMHO, your party exploits the letter of C
standard in violation to its spirit.
If you meet the spirit of the C standard, say hello to him for me.
On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:
On 12/09/2024 13:29, Michael S wrote:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
I fully agree that C is not, and should not be seen as, a "high-level
assembly language". But it is a language that is very useful to
"hardware-type folks", and there are a few things that could make it
easier to write more portable code if they were standardised. As it
is, we just have to accept that some things are not portable.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
And how should that be defined?
bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));
Padding is another thing that should be Implementation Defined.
I.e. the compiler should provide complete documentation of its padding
algorithms.
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed
by another field of integer type with the same or narrower width then
there should be no padding in-between.
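Until something like that is standardised, the layout assumption can at least be checked at compile time. A minimal sketch, assuming a C11 compiler for _Static_assert (the struct tag bar_t and the helper name are made up):

```c
#include <assert.h>
#include <stddef.h>

struct bar_t {
    char x[8];
    int y;
};

/* Fails to compile if the implementation pads between x and y.
   The standard allows such padding but does not require it. */
_Static_assert(offsetof(struct bar_t, y) == sizeof(((struct bar_t *)0)->x),
               "unexpected padding between x and y");

/* Number of padding bytes between the end of x and the start of y. */
static size_t padding_before_y(void)
{
    return offsetof(struct bar_t, y) - sizeof(((struct bar_t *)0)->x);
}
```

The sizeof operand is not evaluated, so the null pointer in the expression is never dereferenced.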
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast as (or possibly faster than)
64-bit instructions, as AMD64 and ARM have shown.
And having a smaller memory footprint is also beneficial, especially
for caches.
(Plus, there are FORTRAN's storage association rules, but these should
be less used by now. But for a 64-bit integer, they pretty much would require a 64-bit REAL and a 128-bit DOUBLE PRECISION).
On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:
That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'.
Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding algorithms.
On 9/3/2024 4:14 PM, David Brown wrote:
On 03/09/2024 18:54, Stephen Fuld wrote:
On 9/2/2024 11:23 PM, David Brown wrote:
On 02/09/2024 18:46, Stephen Fuld wrote:
On 9/2/2024 1:23 AM, Terje Mathisen wrote:
Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)
Can you talk about the advantages and disadvantages of Rust versus C?
And also for Rust versus C++ ?
I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.
I want to know about both :-)
In my field, small-systems embedded development, C has been dominant
for a long time, but C++ use is increasing. Most of my new stuff in
recent times has been C++. There are some in the field who are trying
out Rust, so I need to look into it myself - either because it is a
better choice than C++, or because customers might want it.
My impression - based on hearsay for Rust as I have no experience -
is that the key point of Rust is memory "safety". I use
scare-quotes here, since it is simply about correct use of dynamic
memory and buffers.
I agree that memory safety is the key point, although I gather that
it has other features that many programmers like.
Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up
compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many
features. But is it overall up or down, for /my/ uses?
Examples of things that I think are good in Rust are making variables
immutable by default and pattern matching. Steps down include lack of
function overloading
Rust's generic functions are not sufficient?
On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:
On 12/09/2024 13:29, Michael S wrote:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
[...]
Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy
to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.
[elaborations on the above]
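As an illustration of the endianness item: GCC and Clang predefine __BYTE_ORDER__ together with __ORDER_LITTLE_ENDIAN__ and __ORDER_BIG_ENDIAN__, and a runtime probe works where those are missing. A minimal sketch; the names HOST_LITTLE_ENDIAN and is_little_endian_runtime() are invented here and are not a standard or proposed interface:

```c
#include <stdint.h>
#include <string.h>

/* 1 = little-endian, 0 = big-endian, -1 = not known at compile time. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#  define HOST_LITTLE_ENDIAN 1
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#  define HOST_LITTLE_ENDIAN 0
#else
#  define HOST_LITTLE_ENDIAN (-1)
#endif

/* Runtime fallback: look at the first byte in memory of the value 1. */
static inline int is_little_endian_runtime(void) {
    const uint32_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}
```

Compilers that lack these macros need their own branch in the #if chain, which is exactly the portability gap being complained about.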
I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.
I fully agree that C is not, and should not be seen as, a "high-level
assembly language". But it is a language that is very useful to
"hardware-type folks", and there are a few things that could make it
easier to write more portable code if they were standardised. As it
is, we just have to accept that some things are not portable.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.
And how should that be defined?
bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));
On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:
Michael S <[email protected]> wrote:
On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:
On 12/09/2024 13:29, Michael S wrote:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
And how should that be defined?
bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));
That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'.
Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding algorithms.
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by another field of integer type with the same or narrower width then
there should be no padding in-between.
In article <[email protected]>, [email protected] (Michael S) wrote:
Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding
algorithms.
It is, and they do. I've used a lot of different compilers over the last
29 years, needing to know about padding for a DIY varargs, and I've never
had problems with finding out what the padding was.
It can usually be described quite briefly, by saying that all data types
are naturally aligned. The only variant of that I've encountered is on
32-bit x86 Linux and 32-bit POWER AIX where in both cases 8-byte doubles
were 4-byte aligned.
The C standard specifies that struct members shall be stored in memory in
the same order as they appear in the declaration. It does not specify
padding because the standard committee feel they need to allow C to work
on machines that are not byte-addressed or are otherwise weird.
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed
by another field of integer type with the same or narrower width then
there should be no padding in-between.
That would be fine if you were willing to confine yourself to
byte-addressed machines.
On Sun, 15 Sep 2024 08:05:47 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by
implementation. And in practice it is. Just not in
theory.
Do you mean union rather than struct? And do you mean
bar.x[7] rather than bar.x[8]? Surely no one would expect
that storing into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think
should be defined by the C standard but is not? And the
same question for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior
would be bar.y==42 on LE machines and bar.y==42*2**24 on BE
machines. As it actually happens in reality with all
production compilers.
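The LE/BE outcome described here can be reproduced without indexing past the end of x, by going through an unsigned char pointer to the whole struct object. A minimal sketch, assuming there is no padding before y; bar_t, poke_byte_8() and host_is_little_endian() are illustrative names, not anything from the posts:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <string.h>   /* memcpy */

struct bar_t {
    char x[8];
    int  y;
};

/* The sketch only makes sense if y starts right after x. */
static_assert(offsetof(struct bar_t, y) == 8, "sketch assumes y at offset 8");

/* Zero y, then store one byte at offset 8 of the whole object. */
static int poke_byte_8(struct bar_t *b, unsigned char value) {
    unsigned char *raw = (unsigned char *)b;  /* byte view of the object */
    b->y = 0;
    raw[8] = value;                           /* first byte of y in memory */
    return b->y;
}

/* Runtime probe of the host byte order. */
static int host_is_little_endian(void) {
    unsigned probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}
```

With a 4-byte int, poke_byte_8(&b, 42) should return 42 on a little-endian host and 42*2**24 on a big-endian one, matching the values given above.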
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers
(TM) have to have something that pays their salaries, right?
SCNR
What I wrote is how all production C compilers work today. So it
will add no new bugs.
Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.
struct {
char x[8];
int y;
} bar;
Assume
bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y
The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.
Yes, exactly.
So, volatile for all structs,
No.
Accesses to fields of a struct should be ordered only relative to
accesses to other fields *of the same instance* of the struct. And,
of course, usual 'as if' applies, so optimizing compiler can figure out
that bar.x[7] and bar.y do not overlap and thus generate code knowing
that write to one does not clobber the other.
That's pretty far from semantics of volatile.
plus prescribed behavior on array overruns.
Only within the bounds of the struct. bar.x[12] remains UB.
At the risk of repeating myself: This is an extremely bad idea.
I rest my case.
You seem to think that C should be as optimizable and as full of UBs
as Fortran. Many compiler authors agree with you.
I have a different idea. IMHO, your party exploits the letter of the C
standard in violation of its spirit.
On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:
Michael S <[email protected]> wrote:
On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:
On 12/09/2024 13:29, Michael S wrote:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
And how should that be defined?
bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));
That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'.
Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding algorithms.
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by another field of integer type with the same or narrower width then
there should be no padding in-between.
Major drawback
is that it would forbid bounds checking for array accesses.
In code like above it is easy to spot out of bound access at
compile time. Even with variable index compiler knows size
of 'x' so can insert bounds checking code (and AFAIK if you
insist leading compilers will do this).
More generally, assuming cooperating compiler modern C has enough
features to eliminate out of bounds array indexing.
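For the struct in question the array size is known statically, so the check a compiler could insert amounts to something like the following. A minimal hand-written sketch; bar_t and bar_set_x() are hypothetical names, and real compiler instrumentation would typically trap rather than return an error code:

```c
#include <stddef.h>

struct bar_t {
    char x[8];
    int  y;
};

/* Store v into b->x[i] only if i is inside x.
 * Returns 0 on success, -1 if the index is out of bounds. */
static int bar_set_x(struct bar_t *b, size_t i, char v) {
    if (i >= sizeof b->x) {
        return -1;      /* would have touched y or beyond */
    }
    b->x[i] = v;
    return 0;
}
```

Under the proposed semantics, where bar.x[8] is defined to alias bar.y, a check like this could no longer reject index 8.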
In general, only by means of fat pointers.
Fat pointers break existing ABIs.
Also if fat pointers is what I want then I already have them in few mainstream languages where they are integrated much better than they
will ever be in "checked C".
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
Padding is another thing that should be Implementation Defined.
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM
ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM
ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.
If the code were this
union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;
and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.
No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would
be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?
SCNR
What I wrote is how all production C compilers work today. So it
will add no new bugs.
Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.
struct {
char x[8];
int y;
} bar;
Assume
bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y
The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.
Yes, exactly.
So, either bar.y is treated as if it was volatile, or hard-to-detect
bugs would appear because, with optimization, the assignment would
sometimes change the value of bar.y and sometimes not.
No, semantics is that compiler has to reload bar.y if it keeps it in register. Optimizer that does anything else is buggy.
On 2024-09-15 12:09 p.m., David Brown wrote:
On 15/09/2024 14:40, Michael S wrote:
On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:
Michael S <[email protected]> wrote:
On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:
On 12/09/2024 13:29, Michael S wrote:
On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:
BGB <[email protected]> writes:
I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.
Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.
struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;
IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.
And how should that be defined?
bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));
That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'.
Padding is another thing that should be Implementation Defined.
It is.
I.e. compiler should provide complete documentation of its padding
algorithms.
They do. Or, they should. Often they are lazy and say "defined by
the platform ABI". Really, it is only the alignments that are needed.
C defines the minimum padding between members in a struct - you get
the padding needed to ensure that members are correctly aligned. I
don't think the C standards disallow additional padding, but it would
be an extraordinarily strange implementation if there were anything
more than this minimum padding.
But I certainly wouldn't mind if the standards dictated this minimum
padding, and then there would be nothing left to the implementation
other than alignments.
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM
ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM
ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Which brings to mind a slightly different but related bit-field issue.
If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??
So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.
David Brown <[email protected]> schrieb:
Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.
What about VLAs?
There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.
It is almost impossible to gather statistics on compiler use,
especially with free compilers, but what about MSVC and icc?
On 9/13/2024 10:55 AM, Thomas Koenig wrote:
David Brown <[email protected]> schrieb:
Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.
What about VLAs?
IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.
There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.
It is almost impossible to gather statistics on compiler use,
especially with free compilers, but what about MSVC and icc?
From what I gather:
GCC and Clang are popular for most mainline targets;
GCC is the dominant C compiler on Linux.
MSVC is popular on Windows;
Has been essentially freeware/freemium for over a decade;
Visual Studio has a fairly good debugger;
Targets limited to things you can run Windows on (x86, X64, ARM)
TinyCC, popular for niche use, but limited range of targets;
x86, ARM, experimental RISC-V.
SDCC, popular for 8/16 bit targets;
CC65, popular for 6502 and 65C816;
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
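One way to sidestep the byte-order dependence of the bit-field layout is to read the register as a plain uint64_t and use shifts and masks. A minimal sketch, taking the field positions from the names in the struct above (sbe in bits 0..8, dbe in bits 32..40); the macro and function names are invented for illustration:

```c
#include <stdint.h>

/* Field positions inferred from the reserved_9_31 / reserved_41_63 names. */
#define GIC_ECC_SBE_SHIFT   0
#define GIC_ECC_DBE_SHIFT   32
#define GIC_ECC_FIELD_MASK  UINT64_C(0x1FF)   /* both fields are 9 bits wide */

/* Extract the single-bit-error field from the raw register value. */
static inline unsigned gic_ecc_sbe(uint64_t reg) {
    return (unsigned)((reg >> GIC_ECC_SBE_SHIFT) & GIC_ECC_FIELD_MASK);
}

/* Extract the double-bit-error field from the raw register value. */
static inline unsigned gic_ecc_dbe(uint64_t reg) {
    return (unsigned)((reg >> GIC_ECC_DBE_SHIFT) & GIC_ECC_FIELD_MASK);
}
```

The same source then works for either byte order, provided the 64-bit register read itself returns the value in host order.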
On 15/09/2024 19:21, MitchAlsup1 wrote:
On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM
ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM
ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Which brings to mind a slightly different but related bit-field issue.
If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??
So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.
You do so inconveniently, perhaps with access inline functions rather
than a bit-field struct.
Fortunately, not many hardware designers are that sadistic. (Or perhaps
they /are/ that sadistic, but lack the imagination for that particular trick.)
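The "access inline functions" approach might look like the following for the 17-bit field at bits 53..69. A minimal sketch, assuming the 128-bit object is held as two 64-bit words with word[0] carrying bits 0..63; reg128_t, field_get() and field_set() are hypothetical names:

```c
#include <stdint.h>

/* Hypothetical 128-bit register image; word[0] holds bits 0..63. */
typedef struct {
    uint64_t word[2];
} reg128_t;

enum { FIELD_LSB = 53, FIELD_WIDTH = 17 };
#define FIELD_MASK ((UINT64_C(1) << FIELD_WIDTH) - 1)

/* Read bits 53..69: 11 bits from word[0], 6 bits from word[1]. */
static inline uint32_t field_get(const reg128_t *r) {
    uint64_t lo = r->word[0] >> FIELD_LSB;          /* bits 53..63 */
    uint64_t hi = r->word[1] << (64 - FIELD_LSB);   /* bits 64..69 */
    return (uint32_t)((lo | hi) & FIELD_MASK);
}

/* Write bits 53..69, leaving every other bit unchanged. */
static inline void field_set(reg128_t *r, uint32_t v) {
    uint64_t val = (uint64_t)v & FIELD_MASK;
    r->word[0] = (r->word[0] & ~(FIELD_MASK << FIELD_LSB))
               | (val << FIELD_LSB);
    r->word[1] = (r->word[1] & ~(FIELD_MASK >> (64 - FIELD_LSB)))
               | (val >> (64 - FIELD_LSB));
}
```

The shifts rely on unsigned 64-bit arithmetic discarding the bits shifted out, which is well defined in C.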
On 9/13/2024 10:30 AM, David Brown wrote:
On 12/09/2024 23:14, BGB wrote:
On 9/12/2024 9:18 AM, David Brown wrote:
On 11/09/2024 20:51, BGB wrote:
On 9/11/2024 5:38 AM, Anton Ertl wrote:
Josh Vanderhoof <[email protected]> writes:
[email protected] (Anton Ertl) writes:
<snip lots>
Though, generally takes a few years before new features become usable.
Like, it is only in recent years that it has become "safe" to use
most parts of C99.
Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.
Until VS2013, the most one could really use was:
// comments
long long
Otherwise, it was basically C90.
'stdint.h'? Nope.
Ability to declare variables wherever? Nope.
...
After this, it was piecewise.
Though, IIRC, still no VLAs or similar.
There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.
<stdbit.h> is, however, in the standard library rather than the
compiler, and they can be a bit slow to catch up.
FWIW:
I had been adding parts of newer standards in my case, but it is more hit/miss (more adding parts as they seem relevant).
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Why would you need that? Any decent compiler will know what is
allowed for the target (perhaps partly on the basis of compiler
flags), and will generate the best allowed code for accesses like
foo3() above.
Imagine you have compilers that are smart enough to turn "memcpy()"
into a load and store, but not smart enough to optimize away the
memory accesses, or fully optimize away the wrapper functions...
Why would I do that? If I want to have efficient object code, I use a
good compiler. Under what realistic circumstances would you need to
have highly efficient results but be unable to use a good optimising
compiler? Compilers have been inlining code for 30 years at least
(that's when I first saw it) - this is not something new and rare.
Say, you are using a target where you can't use GCC or similar.
Say:
BJX2, haven't ported GCC as it looks like a pain;
Also GCC is big and slow to recompile.
6502 and 65C816, because these are old and probably not worth the effort
from GCC's POV.
Various other obscure/niche targets.
Say, SH-5, which never saw a production run (it was a 64-bit successor
to SH-4), but seemingly around the time Hitachi spun-out Renesas, the
SH-5 essentially got canned. And, it apparently wasn't worth it for GCC
to maintain a target for which there were no actual chips (comparably
the SH-2 and SH-4 lived on a lot longer due to having niche uses).
So, for best results, the best case option is to use a pointer cast
and dereference.
For some cases, one may also need to know whether or not they can
access the pointers in a misaligned way (and whether doing so would
be better or worse than something like "memcpy()").
Again, I cannot see a /real/ situation where that would be relevant.
I can think of a few.
Most often though it is in things like data compression/decompression
code, where there is often a lot of priority on "gotta go fast".
There is a difference here between "_memlzcpy()" and "_memlzcpyf()"
in that:
the former will always copy an exact number of bytes;
the latter may write 16-32 bytes over the limit.
It may do /what/ ? That is a scary function!
This is why the latter have an 'f' extension (for "fast").
There are cases where it may be desirable to have the function write
past the end in the name of speed, and others where this would not be acceptable.
Hence why there are 2 functions.
The main intended use-case for _memlzcpyf() being use for match-copying
in something like my LZ4 decoder, where one may pad the decode buffer by
an extra 32 bytes.
Also my RP2 decoder works in a similar way.
Possible:
__MINALIGN_type__ //minimum allowed alignment for type
_Alignof(type) has been around since C11.
_Alignof tells the native alignment, not the minimum.
It is the same thing.
Not necessarily, it wouldn't make sense for _Alignof to return 1 for all
the basic integer types.
But, for "minimum alignment" it may make sense
to return 1 for anything that can be accessed unaligned.
Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would
give 1 if the target supports misaligned pointers.
The alignment of types in C is given by _Alignof. Hardware may
support unaligned accesses - C does not. (By that, I mean that
unaligned accesses are UB.)
The point of __MINALIGN_type__ would be:
If the compiler defines it, and it is defined as 1, then the compiler is
telling the program that it is safe to use this type
in an unaligned way.
This also applies to targets where some types are unaligned but others
are not:
Say, if all integer types 64 bits or less are unaligned, but 128-bit
types are not.
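For comparison, the memcpy() idiom mentioned earlier in the thread gives unaligned access without relying on any such macro. A minimal sketch; the function names are invented here, and whether it compiles to a single load or store depends on the compiler and target, which is the very point under dispute:

```c
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value from an address of any alignment. */
static inline uint32_t load_u32_unaligned(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* defined behavior for any alignment */
    return v;
}

/* Store a 32-bit value to an address of any alignment. */
static inline void store_u32_unaligned(void *p, uint32_t v) {
    memcpy(p, &v, sizeof v);
}
```

A pointer cast and dereference may well be faster on a weaker compiler, but it is only safe where the implementation documents misaligned access as allowed.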
Most of this is being compiled by BGBCC for a 50 MHz CPU.
So, the CPU is slow and the compiler doesn't generate particularly
efficient code unless one writes it in a way it can use effectively.
Which often means trying to write C like it was assembler and manually organizing statements to try to minimize value dependencies (often
caching any values in variables, and using lots of variables).
In this case, the equivalent of "-fwrapv -fno-strict-aliasing" is the
default semantics.
Generally, MSVC also responds well to a similar coding style as used for BGBCC (or, as it more happened, the coding styles that gave good results
in MSVC also tended to work well in BGBCC).
On 14/09/2024 23:19, Michael S wrote:
Yes, exactly.
Contrary to your imagination - compilers have /never/ followed your
proposed semantics. The oldest gcc version I found on godbolt.org is
3.4.6 from 2006, and given:
struct Bar {
char x[8];
int y;
} bar;
int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}
It generates:
foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret
That is, y is /not/ reloaded after bar.x[i] is set.
On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Which brings to mind a slightly different but related bit-field issue.
If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??
So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.
My motivation is eliminating as many UBs as is practically
possible.
[email protected] (MitchAlsup1) writes:
On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed, in
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Which brings to mind a slightly different but related bit-field issue.
If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??
So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.
The 17-bit bitfield can be specified in the usual way. Example:
struct bitfield_example {
unsigned one : 32;
unsigned two : 20;
unsigned hmm : 17;
};
An implementation is allowed to use up the last 12 bits of the
first 64-bit unit and the first 5 bits of the next 64-bit unit.
But, whether that happens or not is up to the implementation.
The bitfield for member 'hmm' could instead be put entirely in
the second 64-bit unit, with the last 12 bits of the first 64-bit
unit simply left as padding. There is no standard way to force
it.
[email protected] (Scott Lurndal) writes:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Probably many people know that this code depends on an
implementation-defined extension (allowing uint64_t as
the type of a bitfield) and is not guaranteed to be
portable. Using 'unsigned' instead would be portable
(assuming typical 32-bit ints, etc).
[examples of descending loops with unsigned loop variables]
This discussion wandered into many subthreads, but I only want to make
one post and chose here.
When you write code working on signed numbers and do something like:
(a < 0) || (a >= max)
Then the compiler realizes if you treat 'a' as unsigned, this is just:
(unsigned)a >= max
since any negative number, treated as unsigned, will be larger than the largest positive signed number. So, to do loops which count down and
have any stride using an unsigned loop count:
for(u = start; u <= start; u -= step)
With the usual caveats (start must be a valid signed number, and step
cannot be so large that start + step crosses the signed boundary).
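A minimal sketch of that countdown idiom, for anyone who wants to try it. The helper name and the trip counter are made up for illustration, and it assumes step >= 1 plus the caveats just mentioned:

```c
#include <stdint.h>

/* Visits start, start - step, ... down to the last value that is still >= 0.
   The loop ends when u wraps past zero and becomes larger than start.
   Hypothetical helper: assumes step >= 1 and that start + step does not
   overflow, as per the caveats in the text. */
static unsigned count_down_trips(uint32_t start, uint32_t step)
{
    unsigned trips = 0;
    for (uint32_t u = start; u <= start; u -= step)
        trips++;                /* a real loop body would use u here */
    return trips;
}
```

With start = 10 and step = 3 the loop sees 10, 7, 4, 1 and then stops, because 1 - 3 wraps to a huge unsigned value.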
But: unsigned numbers in C have some dangers, which no one here has mentioned. Some code presented comes CLOSE to being wrong, but gets
lucky. With "int" being 32-bits, C promotion rules around unsigned
ints, signed ints, and unsigned 64-bit can create trouble.
uint64_t dval; uint32_t val32; int a;
val32 = 1; dval = 1; a = 1;
dval = val32 - 2 + dval;
C will do (val32 - 2) first, which is (1U - 2), i.e. 0xffff_ffff, and
then add dval, and the result is 0x1_0000_0000.
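Spelled out as a compilable sketch (function names are mine, and this assumes the usual 32-bit int; with a 64-bit int the first function would behave like the second):

```c
#include <stdint.h>

/* Surprising version: val32 - 2 is evaluated in 32-bit unsigned
   arithmetic and wraps before it is widened to 64 bits. */
static uint64_t add_wrapping(uint32_t val32, uint64_t dval)
{
    return val32 - 2 + dval;
}

/* Widening first keeps the whole computation in 64-bit arithmetic. */
static uint64_t add_widened(uint32_t val32, uint64_t dval)
{
    return (uint64_t)val32 - 2 + dval;
}
```

Calling both with (1, 1) gives 0x1_0000_0000 for the first and 0 for the second.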
Signed numbers don't have this risk, so if you're doing known small loops, you can just use ints. If you're doing possibly large loops, just use int64_t.
Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense would go away. Any new architecture should make C ILP64 (looking at you RISC-V, missing yet another opportunity to not make the same mistakes as everyone else).
On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:
struct Bar {
char x[8];
int y;
} bar;
int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}
It generates:
foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret
That is, y is /not/ reloaded after bar.x[i] is set.
No other compiler on godbolt is doing it, except possibly gcc clones.
Not even clang, whose former leader wrote "Nasal Manifest".
Tim Rentsch <[email protected]> writes:
[email protected] (Scott Lurndal) writes:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by
Standard itself. Not in this particular case, but, for
example, it could be defined that when field of one integer
type is immediately followed by another field of integer type
with the same or narrower width then there should be no padding
in-between.
What about bit-fields in a struct? I believe they are usually
packed. In case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Probably many people know that this code depends on an
implementation-defined extension (allowing uint64_t as
the type of a bitfield) and is not guaranteed to be
portable. Using 'unsigned' instead would be portable
(assuming typical 32-bit ints, etc).
Portability in this case was not necessary. In any case,
it's portable to clang and gcc, which is good enough.
On 9/14/2024 8:26 AM, Anton Ertl wrote:
[email protected] (Kent Dickey) writes:
Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).
We now have had more than 30 years of catering for this mistake by
everyone involved. Given their goals, I think that RISC-V made the
right choice for int in their ABI, even if it was the original choice
by the MIPS and Alpha people that they follow, like everyone else, was
wrong.
That being said, one option would be to introduce another ABI and API
with 64-bit int (and maybe 32-bit long short int), and programmers
could choose whether to program for the ILP API, or the int=int32_t
API. Would the ILP API/ABI fare better than x32? I doubt it, even
though I would support it. This ship probably has sailed.
Changing the size of 'int' would likely be a massive pain from a
software compatibility POV (possibly affecting things much more than
changing the size of pointers, or the size of 'long'; which was a major
source of pain during the 32 to 64 bit migration).
When my project got started, I was originally going with 32-bit 'long',
like MSVC, but then switched over to keeping 'long' matched with the
pointer size, as code that assumed sizeof(long)==sizeof(void *) was more common than code that assumed sizeof(long)==4 (it was more common for
code to use 'int' as the de-facto 32-bit type), as well as this being a
more useful assumption (though this assumption breaks with 128 bit
pointers).
Changing sizeof(int) to be anything other than 4 is likely to break significant amounts of code, and pretty much anything that reads/writes structs to files or similar for data storage.
But, yes, this is even with the whole thing that on a 64-bit machine,
32-bit integers are typically handled in a way where they are sign or
zero extended to 64 bits.
Granted, a better alternative might be to rework code to generally use
the "stdint.h" types, and to use "intptr_t" for integer types matched to
the size of a pointer, ...
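As a rough sketch of the "stdint.h" approach for structs that get written to files (the struct and its field names are hypothetical, and this only pins down sizes and offsets, not byte order):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk record header using fixed-width types, so the
   layout does not change if the ABI picks a different size for 'int'. */
struct record_header {
    uint32_t magic;
    uint16_t version;
    uint16_t flags;
    int32_t  count;
};

/* Fail the build, rather than corrupt files, if the layout ever differs. */
_Static_assert(sizeof(struct record_header) == 12,
               "unexpected padding in record_header");
_Static_assert(offsetof(struct record_header, count) == 8,
               "unexpected offset of count");
```

The static assertions need C11 or later; on older compilers one would have to check the layout some other way.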
Michael S <[email protected]> writes:
On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:
struct Bar {
char x[8];
int y;
} bar;
int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}
It generates:
foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret
That is, y is /not/ reloaded after bar.x[i] is set.
No other compiler on godbolt is doing it, except possibly gcc clones.
Not even clang, whose former leader wrote "Nasal Manifest".
Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).
A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have a
load with post-increment instruction in the loop. You can also do that
with x as a 32-bit int, unless you are of the opinion that enough apples added to a pile should give a negative number of apples.
But with a
wrapping type for x - such as unsigned int in C or modulo types in Ada,
you have little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo operation. I
really can't see this all being handled by a single instruction.
If the loop variable
represents degrees C or F, or some other naturally signed measure it
should be signed (or maybe floating point).
What kind of loop it
is, whether ascending or descending, or what the increment is, etc,
is secondary; a more important factor is what sort of value is
being represented, and in almost all cases that is what should
determine the type used.
Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).
I believe this view is shortsighted. The big mistake is developers hardcoding types everywhere - especially int, but also long, and
their unsigned variants. It's almost never a good idea to hardcode
a specific width (eg, uint32_t) in a type name used for parameters
or local variables, but that is by far a very common practice.
Michael S <[email protected]> writes:
On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:
struct Bar {
char x[8];
int y;
} bar;
int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}
It generates:
foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret
That is, y is /not/ reloaded after bar.x[i] is set.
No other compiler on godbolt is doing it, except possibly gcc
clones. Not even clang, whose former leader wrote "Nasal Manifest".
Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having assembly instructions with undefined behaviour (imagine the terror that would bring to some people!).
A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have a load with post-increment instruction in the loop. You can also do that with x as a 32-bit int, unless you are of the opinion that enough apples added to a pile should give a negative number of apples. But with a wrapping type for x - such as unsigned int in C or modulo types in Ada,
you have little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo operation. I really can't see this all being handled by a single instruction.
On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:
On 15/09/2024 19:21, MitchAlsup1 wrote:
On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Which brings to mind a slightly different but related bit-field issue.
If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??
So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.
You do so inconveniently, perhaps with access inline functions rather
than a bit-field struct.
Fortunately, not many hardware designers are that sadistic. (Or perhaps
they /are/ that sadistic, but lack the imagination for that particular
trick.)
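A sketch of what such accessor functions might look like for the 17-bit field at bits 53..69. The struct, the lo/hi split and all the names are assumptions for illustration; a real device would also need volatile accesses and a defined word order:

```c
#include <stdint.h>

/* Hypothetical 128-bit container: lo holds bits 0..63, hi holds bits 64..127. */
typedef struct { uint64_t lo, hi; } reg128_t;

#define FIELD_LSB   53u                                  /* first bit of the field */
#define FIELD_WIDTH 17u                                  /* field covers bits 53..69 */
#define FIELD_MASK  ((UINT64_C(1) << FIELD_WIDTH) - 1)

/* Read the field: 11 bits come from the top of lo, 6 bits from the bottom of hi. */
static inline uint32_t field_get(const reg128_t *r)
{
    uint64_t low_part  = r->lo >> FIELD_LSB;
    uint64_t high_part = r->hi << (64u - FIELD_LSB);
    return (uint32_t)((low_part | high_part) & FIELD_MASK);
}

/* Write the field without disturbing the neighbouring bits. */
static inline void field_set(reg128_t *r, uint32_t v)
{
    uint64_t val = (uint64_t)v & FIELD_MASK;
    r->lo = (r->lo & ~(FIELD_MASK << FIELD_LSB)) | (val << FIELD_LSB);
    r->hi = (r->hi & ~(FIELD_MASK >> (64u - FIELD_LSB)))
          | (val >> (64u - FIELD_LSB));
}
```

Everything is done with unsigned 64-bit shifts and masks, so there is no dependence on how a particular compiler lays out bit-fields.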
In My 66000 ISA it is both efficient and straightforward::
i = struct.field;
..
struct.field = j;
CARRY Rsf1,{I}
SRA Ri,Rsf0,<17,53>
and
CARRY Rsf1,{O}
INS Rsf0,Rj,<52,17>
Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
need for these registers to be sequential.
As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}
If the ISA has any realistically efficient grasp on multi-precision
integer operations, these fall out almost for free.
[email protected] (MitchAlsup1) writes:
On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:
Robert Finch <[email protected]> writes:
On 2024-09-15 12:09 p.m., David Brown wrote:
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.
That's a bit more complicated as it depends on the target byte-order.
e.g.
struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
Which brings to mind a slightly different but related bit-field issue.
If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??
So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.
The 17-bit bitfield can be specified in the usual way. Example:
struct bitfield_example {
unsigned one : 32;
unsigned two : 20;
unsigned hmm : 17;
};
An implementation is allowed to use up the last 12 bits of the
first 64-bit unit and the first 5 bits of the next 64-bit unit.
But, whether that happens or not is up to the implementation.
The bitfield for member 'hmm' could instead be put entirely in
the second 64-bit unit, with the last 12 bits of the first 64-bit
unit simply left as padding. There is no standard way to force
it.
On 9/15/2024 12:46 PM, Anton Ertl wrote:
Michael S <[email protected]> writes:
Padding is another thing that should be Implementation Defined.
It is. It's defined in the ABI, so when the compiler documents to
follow some ABI, you automatically get that ABI's structure layout.
And if a compiler does not follow an ABI, it is practically useless.
Though, there also isn't a whole lot of freedom of choice here regarding layout.
If member ordering or padding differs from typical expectations, then
any code which serializes structures to files is liable to break, and
this practice isn't particularly uncommon.
Say, typical pattern:
Members are organized in the same order they appear in the source code;
If the current position is not a multiple of the member's alignment, it
is padded to an offset that is a multiple of the member's alignment;
For primitive types, the alignment is equal to the size, which is also a power of 2;
If needed, the total size of the struct is padded to a multiple of the largest alignment of the struct members.
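To make the typical pattern concrete, a small sketch (the struct is invented for illustration; the offsets in the comments are what mainstream ABIs give, not something the C standard promises):

```c
#include <stddef.h>
#include <stdint.h>

/* Members stay in source order; each one is placed at the next offset
   that is a multiple of its alignment. */
struct layout_demo {
    uint8_t  a;   /* offset 0 */
    uint32_t b;   /* offset 4, after 3 bytes of padding */
    uint8_t  c;   /* offset 8 */
    uint16_t d;   /* offset 10, after 1 byte of padding */
};                /* size 12, a multiple of the largest member alignment (4) */
```

Checking offsetof() and sizeof against these values is a cheap way to catch a compiler or ABI that deviates from the pattern.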
For C++ classes, it is more chaotic (and more compiler dependent), but:
Tim Rentsch <[email protected]> schrieb:
If the loop variable
represents degrees C or F, or some other naturally signed measure it
should be signed (or maybe floating point).
The first one is a bad idea because temperature is a continuous
physical quantity.
The second has bad implications for constructs like
DO R = 0.0, 1.0, 0.1
where it will depend on details of floating point arithmetic if the
number of loop trips is 10 or 11.
You can argue that people can write
DO R=0.0, 1.05, 0.1
but this construct was error-prone enough that it was deleted
from the Fortran standards.
What kind of loop it
is, whether ascending or descending, or what the increment is, etc,
is secondary; a more important factor is what sort of value is
being represented, and in almost all cases that is what should
determine the type used.
Not for floating point numbers. For that, you should simply do
DO I=0,10
R = I * 0.1
or
R = 0.0
DO I=0,10
...
R = R + 0.1
END DO
whichever rounding error you prefer.
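For the C readers, the first variant could be sketched like this (helper name and signature are mine): the trip count comes from an integer, and the floating-point value is derived from it.

```c
/* Fills r[0..n] with i * step and returns the number of loop trips.
   The trip count depends only on the integer n, never on rounding.
   Hypothetical helper: the caller must provide room for n + 1 values. */
static int fill_grid(double *r, int n, double step)
{
    int trips = 0;
    for (int i = 0; i <= n; i++) {
        r[i] = i * step;
        trips++;
    }
    return trips;
}
```

With n = 10 and step = 0.1 this always makes 11 trips, whatever the rounding of 0.1 does to the individual values.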
Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).
I believe this view is shortsighted. The big mistake is developers
hardcoding types everywhere - especially int, but also long, and
their unsigned variants. It's almost never a good idea to hardcode
a specific width (eg, uint32_t) in a type name used for parameters
or local variables, but that is by far a very common practice.
Hence Fortran's SELECTED_REAL_KIND and SELECTED_INT_KIND...
On 9/15/2024 2:09 PM, David Brown wrote:
On 14/09/2024 04:39, BGB wrote:
On 9/13/2024 10:55 AM, Thomas Koenig wrote:
David Brown <[email protected]> schrieb:
Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.
What about VLAs?
IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.
Thanks - you know it far better than I do.
I use it fairly often.
Mostly VS2022 at present.
There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.
It is almost impossible to gather statistics on compiler use,
especially with free compilers, but what about MSVC and icc?
From what I gather:
GCC and Clang are popular for most mainline targets;
GCC is the dominant C compiler on Linux.
It is also far and away the dominant compiler for embedded systems -
both embedded Linux and small embedded systems.
Albeit, ones with semi-popular CPU architectures.
Though, GCC and Linux kinda go together here.
Say, one isn't going to find Linux ported to targets outside the scope
of GCC,
and GCC isn't too interested outside the scope of targets that
could potentially run Linux and see at least semi-widespread use.
.
TinyCC, popular for niche use, but limited range of targets;
x86, ARM, experimental RISC-V.
SDCC, popular for 8/16 bit targets;
SDCC has never been very popular. For the targets SDCC support, Keil
(8051) and IAR (many small CISC targets) are far more common. But for
these kinds of devices, you are never working in anything close to
standard C anyway.
OK.
I had mostly heard of people using SDCC here.
CC65, popular for 6502 and 65C816;
That's getting /really/ obscure now. There are thousands of C
compilers that are used, or have been used, for various
microcontrollers. But if you sum all their uses over the last decade,
it will not be close to 1% of the total use of C compilers.
This is mostly for the crowd still messing around with a few older systems:
Commodore 64/128
Apple II / II/C / II/E
Apple IIGS
NES and SNES
...
Also, some newer projects, like the "Commander X16" are also using CC65
(it was based around a 65C816 being used in a 6502 compatibility mode).
Where, AFAIK, GCC proper has little interest in these targets.
On Sun, 15 Sep 2024 18:47:06 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:
struct Bar {
char x[8];
int y;
} bar;
int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}
It generates:
foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret
That is, y is /not/ reloaded after bar.x[i] is set.
No other compiler on godbolt is doing it, except possibly gcc
clones. Not even clang, whose former leader wrote "Nasal Manifest".
Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.
I didn't mean to say that gcc3 is the only gcc version that returns non-overwritten value.
I meant to say that all gcc versions are in one camp and the rest of the compilers represented on Godbolt are in the other camp.
On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:
On 14/09/2024 23:19, Michael S wrote:
Yes, exactly.
Contrary to your imagination - compilers have /never/ followed your
proposed semantics. The oldest gcc version I found on godbolt.org is
3.4.6 from 2006, and given:
struct Bar {
char x[8];
int y;
} bar;
int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}
It generates:
foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret
That is, y is /not/ reloaded after bar.x[i] is set.
No other compiler on godbolt is doing it, except possibly gcc clones.
Not even clang, whose former leader wrote "Nasal Manifest".
On 9/15/2024 2:40 PM, David Brown wrote:
On 14/09/2024 08:34, BGB wrote:
On 9/13/2024 10:30 AM, David Brown wrote:
On 12/09/2024 23:14, BGB wrote:
On 9/12/2024 9:18 AM, David Brown wrote:
On 11/09/2024 20:51, BGB wrote:
On 9/11/2024 5:38 AM, Anton Ertl wrote:
Josh Vanderhoof <[email protected]> writes:
[email protected] (Anton Ertl) writes:
<snip lots>
Though, generally takes a few years before new features become usable.
Like, it is only in recent years that it has become "safe" to use
most parts of C99.
Most of the commonly used parts of C99 have been "safe" to use for
20 years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.
Until VS2013, the most one could really use was:
// comments
long long
Otherwise, it was basically C90.
'stdint.h'? Nope.
Ability to declare variables wherever? Nope.
...
Nonsense.
MS basically gave up on C and concentrated on C++ (then later C# and
other languages). Their C compiler gained the parts of C99 that were
in common with C++ - and anyway, most people (that I have heard of)
using MSVC for C programming actually use the C++ compiler but stick
approximately to a C subset. And this has been the case for a /long/
time - long before 2013.
Go and try to write C with variables not declared at the start of a
block in VS2008 or similar and see how far you get...
While, it may work in C++ mode, it did not work in C mode.
IIRC, the ability to declare variables wherever got added in VS2013.
Looks like 'stdint.h' got added in VS2010.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Why would you need that? Any decent compiler will know what is
allowed for the target (perhaps partly on the basis of compiler
flags), and will generate the best allowed code for accesses like
foo3() above.
Imagine you have compilers that are smart enough to turn "memcpy()"
into a load and store, but not smart enough to optimize away the
memory accesses, or fully optimize away the wrapper functions...
Why would I do that? If I want to have efficient object code, I use
a good compiler. Under what realistic circumstances would you need
to have highly efficient results but be unable to use a good
optimising compiler? Compilers have been inlining code for 30 years
at least (that's when I first saw it) - this is not something new
and rare.
Say, you are using a target where you can't use GCC or similar.
Which target would that be? Excluding personal projects, some very
niche devices, and long-outdated small CISC chips, there really aren't
many devices that don't have a GCC and clang port. Of course there /
are/ processors that gcc does not support, but almost nobody writes
code that has to be portable to such devices.
And as for optimising compilers, I used at least two different
optimising compilers in the mid nineties that inlined code
automatically, before using gcc. (I can't remember if they inlined
memcpy - it was a long time ago!). Optimising compilers are not a new
concept, and are not limited to gcc and clang.
It also depends on what one considers optimizing.
But, like:
Allocates variables into registers;
Evaluates expressions involving constants;
Turns "memcpy()" into inlined loads/stores in some cases;
Essentially treating it like a builtin function.
...
Well, at least BGBCC does this much.
Things it doesn't do though:
Loop unrolling;
Inline functions;
...
There is a partial feature to cache member loads and array loads within
a basic-block, but will flush any such cached values whenever a memory
store happens.
Say:
i=foo->bar->x + foo->bar->y;
Will cache and reuse the first foo->bar.
But, if you do:
*ptr=0;
Or:
foo->z=3;
It will flush any memory of the cached values (unless the pointers are 'restrict').
There is an option to disable this caching though (at which point it
will always do each member load). But, unlike TBAA, this optimization is
less prone to break stuff.
It also has a special feature that small leaf functions which can fit entirely in scratch registers may skip creation of a stack frame.
But, I can note that even with these limitations, BGBCC+BJX2 still seems
to be beating RV64G + "GCC -O3" in terms of performance in my tests
(well, mostly because clever compiler can't beat ISA limitations).
Say:
BJX2, haven't ported GCC as it looks like a pain;
Also GCC is big and slow to recompile.
6502 and 65C816, because these are old and probably not worth the
effort from GCC's POV.
Various other obscure/niche targets.
Say, SH-5, which never saw a production run (it was a 64-bit
successor to SH-4), but seemingly around the time Hitachi spun-out
Renesas, the SH-5 essentially got canned. And, it apparently wasn't
worth it for GCC to maintain a target for which there were no actual
chips (comparably the SH-2 and SH-4 lived on a lot longer due to
having niche uses).
It would be quite ridiculous to limit the way you write code because
of possible limitations for non-existent compilers for target devices
that have never been made.
Hitachi did release an ISA spec for SH-5 at least (and it might have
worked OK, if Renesas had pushed "upwards" rather than focusing almost exclusively on the small embedded / microcontroller space).
But, at present, people trying to worry about portability to things with non-power-of-2 integers, non-8-bit bytes, non-twos-complement
arithmetic, etc, has a similar level of validity (or non-validity) to
writing code for ISA's which never saw a release in "actual silicon".
If the compiler is naive (wrt inline memcpy):
memcpy(&v, cs, 8);
rl=(v>>4)&15;
Needs 5 instructions, but:
v=*(uint64_t *)cs;
rl=(v>>4)&15;
Uses 3 instructions.
Having the compiler turn the former into the latter is possible, but
would require more complex pattern matching, and would likely need to be
handled in the frontend (rather than in the function-call operation in
the backend).
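For reference, the memcpy form wrapped up as a helper (names are mine; whether it collapses to a single load depends entirely on the compiler, which is the point being argued here):

```c
#include <stdint.h>
#include <string.h>

/* Alignment-safe 64-bit load. A compiler that treats memcpy as a builtin
   can turn this into one load on targets that allow misaligned access. */
static inline uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Same field extraction as in the snippet above: bits 4..7 of the loaded
   value (which byte those bits come from depends on the target's byte order). */
static inline unsigned extract_rl(const unsigned char *cs)
{
    return (unsigned)((load_u64(cs) >> 4) & 15);
}
```

Unlike the pointer-cast version, this stays defined even when cs is misaligned, at the cost of relying on the compiler to make it cheap.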
Not necessarily, it wouldn't make sense for _Alignof to return 1 for
all the basic integer types.
Of course it makes sense to do that, on targets where an alignment of
1 is safe and efficient.
Tradition dictates that struct members are padded and aligned to their
native alignment (usually equal to the size of the base type), unless
the struct is 'packed'.
An implementation where all structs are packed by default could have unforeseen consequences...
Presumably, _Alignof would give the same alignment as would appear in
structs or similar.
But, for "minimum alignment" it may make sense to return 1 for
anything that can be accessed unaligned.
Again, I see no use for this.
The main alternatives:
Detect target architecture and "know" whether the architecture is unaligned-safe (ye olde mess of ifdef's);
Have a global PP define that applies to all types, but this doesn't
allow for cases where some types are unaligned safe but others are not.
One possibility could be __minalign__(type), but (unlike doing it with preprocessor defines), one could not likely use it in preprocessor expressions.
#if __MINALIGN_LONG__==1
...
#else
...
#endif
Works, but:
#if _Alignof(long)==1
...
Poses problems, as generally the preprocessor is not able to evaluate
things like this.
Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would
give 1 if the target supports misaligned pointers.
The alignment of types in C is given by _Alignof. Hardware may
support unaligned accesses - C does not. (By that, I mean that
unaligned accesses are UB.)
The point of __MINALIGN_type__ would be:
If the compiler defines it, and it is defined as 1, then this allows
the compiler to be able to tell the program that it is safe to use
this type in an unaligned way.
For what purpose?
Probably for unaligned deref's on targets where "memcpy()" is a less desirable option (say, if it takes several additional CPU instructions).
This also applies to targets where some types are unaligned but
others are not:
Say, if all integer types 64 bits or less are unaligned, but 128-bit
types are not.
For what purpose? And why do you want to worry about totally
hypothetical systems?
Note that a lot of what I am describing here is true of BJX2.
It is also true of __m128 and similar in MSVC.
__m128 v;
v=*(__m128 *)someptr;
May explode if someptr is not 16-byte aligned, as it may emit a "MOVDQA"
or similar (rather than MOVDQU).
But, in both cases, if "int *" or "long *" is misaligned, both are fine
with it.
There may be other compilers in a similar camp.
But, then again, it is kinda hypothetical to claim that one can't cast
and deref a pointer, since on most existing targets it works without
issue (except that on GCC one may also need to use 'volatile').
Most of this is being compiled by BGBCC for a 50 MHz CPU.
So, the CPU is slow and the compiler doesn't generate particularly
efficient code unless one writes it in a way it can use effectively.
Which often means trying to write C like it was assembler and
manually organizing statements to try to minimize value dependencies
(often caching any values in variables, and using lots of variables).
In this case, the equivalent of "-fwrapv -fno-strict-aliasing" is the
default semantics.
Generally, MSVC also responds well to a similar coding style as used
for BGBCC (or, more as it happened, the coding styles that gave good
results in MSVC also tended to work well in BGBCC).
Note that MSVC most certainly does /not/ work like "gcc -fwrapv" -
signed integer overflow is UB in MSVC, and it generates code that
assumes it never happens. There is an obscure officially undocumented
(or documented unofficially, if you prefer) flag to turn off such
optimisations.
Last I read about it, they had no plans to do any type-based alias
analysis, but nor did they rule out the possibility in the future.
I haven't seen any issues with MSVC and this sort of code usually works
as expected...
But, a lot of times, one has to supply these options to GCC otherwise
the code will break. So, it almost makes sense to assume these semantics
as a default.
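For code that has to run on compilers without "-fwrapv", a wrapping signed add can be spelled out explicitly. This is only a sketch; `wrap_add_i32` is an illustrative name, not something from BGBCC, GCC, or MSVC.

```c
#include <stdint.h>

/* Signed 32-bit add with two's-complement wraparound, without relying
 * on -fwrapv. The addition is done in unsigned arithmetic, where
 * wraparound is defined. Converting the result back to int32_t is
 * implementation-defined before C23, but mainstream compilers wrap. */
static inline int32_t wrap_add_i32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
```

With "-fwrapv" as the default semantics, plain `a + b` would behave the same way and the helper would be unnecessary.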
In the case of BGBCC, I decided to make these semantics the default as a matter of a policy decision.
There is some talk about pointer provenance semantics for C (apparently
semi controversial), but admittedly thus far I don't fully understand
the idea.
David Brown <[email protected]> schrieb:
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).
It has happened, see the illegal (but sometimes useful)
6502 instructions, or the recent RISC-V implementation snafu
(GhostWrite).
A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have a
load with post-increment instruction in the loop. You can also do that
with x as a 32-bit int, unless you are of the opinion that enough apples
added to a pile should give a negative number of apples.
But of course it should!
But wait, no, the number of apples should become zero if you add
enough of them.
But wait... maybe if the pile becomes too large, then the apples
will no longer be individual apples, but will be crushed under
their weight, a bit like https://what-if.xkcd.com/4/ .
But with a
wrapping type for x - such as unsigned int in C or modulo types in Ada,
you have little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo operation. I
really can't see this all being handled by a single instruction.
One reason not to use such a wrapping type.
Although, if you have (R1+R2) addressing and a 32-bit addition, this
could actually work, but not with a post-increment instruction.
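The two cases under discussion can be sketched in C as follows. The function names are illustrative only, and what a given compiler actually emits for either loop is not claimed here.

```c
#include <stddef.h>

/* Signed index: overflow of x is undefined, so on a 64-bit target the
 * compiler is free to keep a single pointer (p + x) and step it with a
 * load-with-post-increment. */
long sum_signed(const int *p, int x, size_t n)
{
    long sum = 0;
    while (n--)
        sum += p[x++];
    return sum;
}

/* Unsigned index: x is required to wrap modulo 2^N (N = width of
 * unsigned), so unless the compiler can prove no wrap occurs it has to
 * keep p and x separate and form p + x on each iteration. */
long sum_wrapping(const int *p, unsigned x, size_t n)
{
    long sum = 0;
    while (n--)
        sum += p[x++];
    return sum;
}
```

For indices that never get near the limits, both functions return the same value; the difference is only in what the compiler is allowed to assume.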
David Brown wrote:
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror
that would bring to some people!).
A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have
a load with post-increment instruction in the loop. You can also do
that with x as a 32-bit int, unless you are of the opinion that enough
apples added to a pile should give a negative number of apples. But
with a wrapping type for x - such as unsigned int in C or modulo types
in Ada, you have little choice but to hold "p" and "x" separately in
registers, add them for every load, and do the increment and modulo
operation. I really can't see this all being handled by a single
instruction.
This becomes much simpler in Rust where usize is the only legal index type:
Yeah, you have to actually write it as
y = p[x];
x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
On 16/09/2024 10:37, Terje Mathisen wrote:
David Brown wrote:
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the
notion of int from K&R days.
That's a designer's choice, I think. It is possible to add
32-bit instructions which should be as fast (or possibly faster)
than 64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not
so easy without either making rather complicated instructions or
having assembly instructions with undefined behaviour (imagine the
terror that would bring to some people!).
A classic example would be for "y = p[x++];" in a loop. For a
64-bit type x, you would set up one register once with "p + x",
and then have a load with post-increment instruction in the loop.
You can also do that with x as a 32-bit int, unless you are of the
opinion that enough apples added to a pile should give a negative
number of apples. But with a wrapping type for x - such as
unsigned int in C or modulo types in Ada, you have little choice
but to hold "p" and "x" separately in registers, add them for
every load, and do the increment and modulo operation. I really
can't see this all being handled by a single instruction.
This becomes much simpler in Rust where usize is the only legal
index type:
Yeah, you have to actually write it as
y = p[x];
x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is an improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)
And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.
On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:
On 16/09/2024 10:37, Terje Mathisen wrote:
David Brown wrote:
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the
notion of int from K&R days.
That's a designer's choice, I think. It is possible to add
32-bit instructions which should be as fast (or possibly faster)
than 64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not
so easy without either making rather complicated instructions or
having assembly instructions with undefined behaviour (imagine the
terror that would bring to some people!).
A classic example would be for "y = p[x++];" in a loop. For a
64-bit type x, you would set up one register once with "p + x",
and then have a load with post-increment instruction in the loop.
You can also do that with x as a 32-bit int, unless you are of the
opinion that enough apples added to a pile should give a negative
number of apples. But with a wrapping type for x - such as
unsigned int in C or modulo types in Ada, you have little choice
but to hold "p" and "x" separately in registers, add them for
every load, and do the increment and modulo operation. I really
can't see this all being handled by a single instruction.
This becomes much simpler in Rust where usize is the only legal
index type:
Yeah, you have to actually write it as
y = p[x];
x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)
And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.
It's not less efficient. usize in Rust is approximately the same as
size_t in C.
With one exception that usize overflow panics under debug
build.
On 15/09/2024 21:13, MitchAlsup1 wrote:
On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:
On 15/09/2024 19:21, MitchAlsup1 wrote:
Which brings to mind a slight different but related bit-field issue.
If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??
So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.
You do so inconveniently, perhaps with access inline functions rather
than a bit-field struct.
Fortunately, not many hardware designers are that sadistic. (Or perhaps
they /are/ that sadistic, but lack the imagination for that particular
trick.)
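A minimal sketch of the "access inline functions" approach for the 17-bit field at bits 53..69, assuming the 128-bit container is held as two 64-bit halves. The type and function names (`reg128`, `field_get`, `field_set`) are made up for illustration.

```c
#include <stdint.h>

/* 128-bit container as two 64-bit halves: lo = bits 0..63, hi = bits 64..127. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} reg128;

#define FIELD_POS   53                      /* first bit of the field  */
#define FIELD_BITS  17                      /* field covers bits 53..69 */
#define FIELD_MASK  ((uint64_t)0x1FFFF)     /* 17 one-bits             */
#define LO_BITS     (64 - FIELD_POS)        /* 11 bits live in lo      */

/* Read the field: low 11 bits come from lo, upper 6 bits from hi. */
static inline uint32_t field_get(const reg128 *r)
{
    uint64_t v = (r->lo >> FIELD_POS) | (r->hi << LO_BITS);
    return (uint32_t)(v & FIELD_MASK);
}

/* Write the field, leaving all other bits of both halves untouched. */
static inline void field_set(reg128 *r, uint32_t val)
{
    uint64_t v = val & FIELD_MASK;
    r->lo = (r->lo & ~(FIELD_MASK << FIELD_POS)) | (v << FIELD_POS);
    r->hi = (r->hi & ~(FIELD_MASK >> LO_BITS))   | (v >> LO_BITS);
}
```

It works, but compared with "i = s.field;" it is exactly the inconvenience being described.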
In My 66000 ISA it is both efficient and straightforward::
That does not change that it is inconvenient in C, which is what you
asked about. For any ISA, there will always be things that can easily be
written in C that are awkward in assembly, and vice versa.
i = struct.field;
..
struct.field = j;
CARRY Rsf1,{I}
SRA Ri,Rsf0,<17,53>
and
CARRY Rsf1,{O}
INS Rsf0,Rj,<52,17>
Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
need for these registers to be sequential.
As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISA from the rather distant
past could do these rather efficiently {360 SRDL,...}
Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even beneficial from the hardware viewpoint, but really inconvenient on the software side (whether you use bit-fields or not).
Some hardware designers seem to have no understanding of or
consideration for the software folks that will use their designs. "HW Sadism" is no doubt too strong a term - ignorance and a lack of
consideration is more realistic.
If the ISA has any realistically efficient grasp on multi-precision
integer operations, these fall out almost for free.
I can't see that. I am not saying you are wrong, but I don't see the connection.
On 16/09/2024 15:04, Michael S wrote:
On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:
On 16/09/2024 10:37, Terje Mathisen wrote:
David Brown wrote:
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the
notion of int from K&R days.
That's a designer's choice, I think. It is possible to add
32-bit instructions which should be as fast (or possibly faster)
than 64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's
not so easy without either making rather complicated
instructions or having assembly instructions with undefined
behaviour (imagine the terror that would bring to some people!).
A classic example would be for "y = p[x++];" in a loop. For a
64-bit type x, you would set up one register once with "p + x",
and then have a load with post-increment instruction in the loop.
You can also do that with x as a 32-bit int, unless you are of
the opinion that enough apples added to a pile should give a
negative number of apples. But with a wrapping type for x -
such as unsigned int in C or modulo types in Ada, you have
little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo
operation. I really can't see this all being handled by a
single instruction.
This becomes much simpler in Rust where usize is the only legal
index type:
Yeah, you have to actually write it as
y = p[x];
x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to
do too much in a single expression or statement, but some C
constructs are common enough that I am happy with them. It would
be hard to formulate concrete rules here.)
And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.
It's not less efficient. usize in Rust is approximately the same as
size_t in C.
Ah, okay - I was thinking of it as a C unsigned int.
With one exception that usize overflow panics under debug
build.
I'm quite happy with unsigned types that are not allowed to overflow,
as long as there is some other way to get efficient wrapping on the
rare occasions when you need it.
But I am completely against the idea that you have different defined semantics for different builds. Run-time errors in a debug/test
build and undefined behaviour in release mode is fine - defining the behaviour of overflow in release mode (other than possibly to the
same run-time checking) is wrong.
The GCC community would be quite happy to support such targets, but
someone would need to make the port. And the architecture of the gcc compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
getting good results for a cpu like the 6502 from gcc would be an extraordinary level of effort. It makes a lot more sense to look at
tools like SDCC with an architecture that fits better.
David Brown wrote:
On 16/09/2024 15:04, Michael S wrote:
With one exception that usize overflow panics under debug
build.
I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient
wrapping on the rare occasions when you need it.
But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a
debug/test build and undefined behaviour in release mode is fine -
defining the behaviour of overflow in release mode (other than
possibly to the same run-time checking) is wrong.
In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stay in
the production code, AssertDbg are only in the debug builds.
Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks
enabled.
But debug, optimize, and checking are separate controls.
In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1
instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.
And on those compilers checks can be controlled with quite fine
resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.
This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.
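In C, without an add-and-fault instruction, the same check has to be written out by hand. The following is only a software stand-in for the idea; `add_u32_checked` is a hypothetical helper that reports overflow to the caller instead of trapping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an "add, fault on overflow" operation:
 * stores the (wrapped) sum in *out and returns false if the true sum
 * did not fit in 32 bits. Unsigned addition wraps by definition, so a
 * wrapped result is detectable as sum < a. */
static inline bool add_u32_checked(uint32_t a, uint32_t b, uint32_t *out)
{
    uint32_t sum = a + b;
    *out = sum;
    return sum >= a;
}
```

A caller that aborts when the function returns false gets the property described above: past the check, the value is known not to have overflowed.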
With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.
David Brown <[email protected]> schrieb:
The GCC community would be quite happy to support such targets, but
someone would need to make the port. And the architecture of the gcc
compiler suite is best suited to processors with reasonably regular and
orthogonal ISAs with plenty of registers and at least 16-bit width -
getting good results for a cpu like the 6502 from gcc would be an
extraordinary level of effort. It makes a lot more sense to look at
tools like SDCC with an architecture that fits better.
Native compilation of gcc on a 6502 would be... interesting.
But I think an adaption of gcc to a 6502 could actually work if
the zero page was treated as 128 16-bit registers. Not going
there, though :-)
On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:
David Brown wrote:
On 16/09/2024 15:04, Michael S wrote:
With one exception that usize overflow panics under debug
build.
I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient
wrapping on the rare occasions when you need it.
But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a
debug/test build and undefined behaviour in release mode is fine -
defining the behaviour of overflow in release mode (other than
possibly to the same run-time checking) is wrong.
In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stay in
the production code, AssertDbg are only in the debug builds.
Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks
enabled.
But debug, optimize, and checking are separate controls.
In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1
instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.
And on those compilers checks can be controlled with quite fine
resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.
This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.
If the ability to control compiler checks was standard on DEC Ada then it
made DEC Ada non-standard.
On Mon, 16 Sep 2024 16:09:38 +0200
David Brown <[email protected]> wrote:
On 16/09/2024 15:04, Michael S wrote:
On Mon, 16 Sep 2024 14:48:50 +0200
With one exception that usize overflow panics under debug
build.
I'm quite happy with unsigned types that are not allowed to overflow,
as long as there is some other way to get efficient wrapping on the
rare occasions when you need it.
Rust has it in form of builtin functions wrapping_*()
But I am completely against the idea that you have different defined
semantics for different builds. Run-time errors in a debug/test
build and undefined behaviour in release mode is fine - defining the
behaviour of overflow in release mode (other than possibly to the
same run-time checking) is wrong.
On the one hand, the Rust manual says that integer overflow in release
mode wraps. On the other hand, it says that "Relying on integer
overflow’s wrapping behavior is considered an error."
It does not sound particularly consistent and is rather close to the
worst of both worlds.
However on more important issue of out-of-bound array access Rust is consistent,
On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:
On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:
It's not less efficient. usize in Rust is approximately the same as
size_t in C. With one exception that usize overflow panics under debug
build.
One can and should argue that::
#p++;
should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.
On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:
On 15/09/2024 21:13, MitchAlsup1 wrote:
As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISA from the rather distant
past could do these rather efficiently {360 SRDL,...}
Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).
Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..
On 16/09/2024 02:00, BGB wrote:
On 9/15/2024 2:09 PM, David Brown wrote:
This is mostly for the crowd still messing around with a few older
systems:
Commodore 64/128
Apple II / II/C / II/E
Apple IIGS
NES and SNES
...
It is not a "crowd" - it's a small group of oddballs and enthusiasts. I fully support them, and playing with these things is a great hobby. I
would maybe be doing that too, if I had twice as many hours in the week.
But talking about "popular compilers like gcc and CC65" is like
talking about "popular sports like football and Inuit ear pulling
contests".
Also, some newer projects, like the "Commander X16" are also using
CC65 (it was based around a 65C816 being used in a 6502 compatibility
mode).
Where, AFAIK, GCC proper has little interest in these targets.
The GCC community would be quite happy to support such targets, but
someone would need to make the port. And the architecture of the gcc compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
getting good results for a cpu like the 6502 from gcc would be an extraordinary level of effort.
The recommended approach today (for applications where it matters) is to
use static analysis of the Ada code (e.g. SPARK or other tools) to prove
that run-time errors cannot happen, which then makes it possible to omit
the corresponding run-time checks while staying compliant.
I don't know if Rust code can be analysed as easily and completely as
Ada code can. But Ada compilers usually allow fine-grained control over
which checks are applied where, not just a single choice between "debug"
and "production" builds.
On 16/09/2024 09:17, Thomas Koenig wrote:
David Brown <[email protected]> schrieb:
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).
It has happened, see the illegal (but sometimes useful)
6502 instructions, or the recent RISC-V implementation snafu
(GhostWrite).
I have seen plenty of undefined behaviour in ISA's over the years. (A
very common case is that instruction encodings that are not specified
are left as UB so that later extensions to the ISA can use them.)
I was
just thinking of the reactions you'd get if you made an ISA where
attempting to overflow signed integer arithmetic was UB at the hardware level, so that you could get faster and simpler instructions.
On 16 Sep 2024, Niklas Holsti wrote
(in article <[email protected]>):
....
The recommended approach today (for applications where it matters) is to
use static analysis of the Ada code (e.g. SPARK or other tools) to prove
that run-time errors cannot happen, which then makes it possible to omit
the corresponding run-time checks while staying compliant.
I don't know if Rust code can be analysed as easily and completely as
Ada code can. But Ada compilers usually allow fine-grained control over
which checks are applied where, not just a single choice between "debug"
and "production" builds.
I find, without using SPARK or any analysis (other than that done
by the compiler) that going from all Ada language-defined checks
ON to all OFF gains < 5% in speed.
So all checks are left ON in "production" builds.
On 2024-09-16 18:58, Michael S wrote:
On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:
David Brown wrote:
On 16/09/2024 15:04, Michael S wrote:
With one exception that usize overflow panics under debug
build.
I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient
wrapping on the rare occasions when you need it.
But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a
debug/test build and undefined behaviour in release mode is fine -
defining the behaviour of overflow in release mode (other than
possibly to the same run-time checking) is wrong.
In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stay in
the production code, AssertDbg are only in the debug builds.
Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks
enabled.
But debug, optimize, and checking are separate controls.
In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1
instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.
And on those compilers checks can be controlled with quite fine
resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.
This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.
If the ability to control compiler checks was standard on DEC Ada then it
made DEC Ada non-standard.
No, it means that DEC Ada could be used as a standard-conforming Ada
compiler or as a non-conforming compiler, to a user-chosen extent.
The recommended approach today (for applications where it matters) is to
use static analysis of the Ada code (e.g. SPARK or other tools) to prove
that run-time errors cannot happen, which then makes it possible to omit
the corresponding run-time checks while staying compliant.
I don't know if Rust code can be analysed as easily and completely as
Ada code can. But Ada compilers usually allow fine-grained control over
which checks are applied where, not just a single choice between "debug"
and "production" builds.
On 16/09/2024 19:51, MitchAlsup1 wrote:
On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:
On 15/09/2024 21:13, MitchAlsup1 wrote:
As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}
Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).
Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..
The folks designing those register setups had a choice, and made a bad
choice from the viewpoint of software (whether it be C, assembly, or any other language).
It's conceivable that it was the right choice on balance, considering
many factors. And it's certainly more believable that it was an
appropriate choice when sizes were smaller. It is less believable that
there is an overwhelming need to cross a 64-bit boundary.
Bill Findlay wrote:
I found the same 5% performance cost in my tests with DEC Ada85.
Most code was pretty optimal too.
The one thing I found DEC's compiler made a complete pigs breakfast
of the generated code was scanning a character string backwards:
Tim Rentsch <[email protected]> schrieb:
If the loop variable
represents degrees C or F, or some other naturally signed measure it
should be signed (or maybe floating point).
The first one is a bad idea because temperature is a continuous
physical quantity.
The second has bad implications for constructs like
DO R = 0.0, 1.0, 0.1
where it will depend on details of floating point arithmetic whether the
number of loop trips is 10 or 11.
You can argue that people can write
DO R=0.0, 1.05, 0.1
but this construct was error-prone enough that it was deleted
from the Fortran standards.
What kind of loop it
is, whether ascending or descending, or what the increment is, etc,
is secondary; a more important factor is what sort of value is
being represented, and in almost all cases that is what should
determine the type used.
Not for floating point numbers. For that, you should simply do
DO I=0,10
R = I * 0.1
or
R = 0.0
DO I=0,10
...
R = R + 0.1
END DO
whichever rounding error you prefer.
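The same point can be sketched in C (the function names are invented for this illustration). The integer-controlled loop has a trip count fixed by the integer bounds, while the accumulating loop's trip count depends on how the repeated additions round.

```c
/* Integer-controlled loop: the trip count is exactly steps + 1,
   and the floating-point value is derived from the counter. */
static int trips_integer(int steps, double step, double *last)
{
    int trips = 0;
    for (int i = 0; i <= steps; i++) {
        *last = i * step;   /* one rounding per value, no accumulation */
        trips++;
    }
    return trips;
}

/* Floating-point-controlled loop: the trip count depends on how the
   accumulated rounding error compares with the limit. */
static int trips_accumulated(double start, double limit, double step,
                             double *last)
{
    int trips = 0;
    for (double r = start; r <= limit; r += step) {
        *last = r;
        trips++;
    }
    return trips;
}
```

Calling trips_integer(10, 0.1, &last) always gives 11 trips. For trips_accumulated(0.0, 1.0, 0.1, &last) the answer hinges on whether the tenth addition lands just below or just above 1.0, which is the error-prone case; padding the limit to 1.05 removes the ambiguity at the cost of a magic number.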
On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:
On Sun, 15 Sep 2024 18:47:06 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:
struct Bar {
char x[8];
int y;
} bar;
int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}
It generates:
foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret
That is, y is /not/ reloaded after bar.x[i] is set.
No other compiler on godbolt is doing it, except possibly gcc
clones. Not even clang, whose former leader wrote "Nasal Manifest".
Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.
I didn't mean to say that gcc3 is the only gcc version that returns the
non-overwritten value.
I meant to say that all gcc versions are in one camp and the rest of the
compilers represented on Godbolt are in the other camp.
On 16/09/2024 10:37, Terje Mathisen wrote:
This becomes much simpler in Rust where usize is the only legal index
type:
Yeah, you have to actually write it as
 y = p[x];
 x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is an improvement for the programmer. (In general, I dislike trying to do too much in a single expression or statement, but some C constructs are
common enough that I am happy with them. It would be hard to formulate concrete rules here.)
And the resulting object code is less efficient than you get with signed
int and "y = p[x++];" (or "y = p[x]; x++;") in C.
On 9/16/2024 4:12 AM, David Brown wrote:
snip
With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.
I resemble that remark! :-)
On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:
On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:
It's not less efficient. usize in Rust is approximately the same as
size_t in C. With one exception that usize overflow panics under debug
build.
One can and should argue that::
*p++;
should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.
These double-width bit-field straddle operations show up at 32-bits.
Various FP64 formats (DEC's middle-endian FP being the worst example),
Intel page table entries and segment/gate descriptors, come to mind.
It's just going to take a while for double-width things to show up
at the 64-bit level. But if FP128 becomes a reality...
Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.
I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an extra source register and two dest registers, which has a number of consequences. But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.
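To make the "two destination registers" idea concrete, here is a portable C sketch of what such instructions would compute. The struct and function names are assumptions for illustration and say nothing about the actual ISA encoding.

```c
#include <stdint.h>

/* Two-register result, standing in for an {r_high, r_low} pair. */
typedef struct { uint64_t hi, lo; } pair64;

/* Add with carry-out and no flags: hi is the carry (0 or 1). */
static pair64 add_carry(uint64_t a, uint64_t b)
{
    pair64 r;
    r.lo = a + b;            /* wraps modulo 2^64 */
    r.hi = r.lo < a;         /* wrapped sum is smaller iff a carry occurred */
    return r;
}

/* Full-width 64x64 -> 128 multiply built from 32-bit partial products. */
static pair64 mul_wide(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    /* Middle column: at most 3 * (2^32 - 1), so it cannot overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    pair64 r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}

/* Divide producing both results: hi = remainder, lo = quotient.
   The divisor must be nonzero. */
static pair64 div_rem(uint64_t n, uint64_t d)
{
    pair64 r = { n % d, n / d };
    return r;
}
```

With a single two-result instruction each of these would be one operation instead of the sequences shown here.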
On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:
Bill Findlay wrote:
I found the same 5% performance cost in my tests with DEC Ada85.
Most code was pretty optimal too.
The one thing I found DEC's compiler made a complete pigs breakfast
of the generated code was scanning a character string backwards:
Bacon, sausage, and ham.
Sounds yummy. Code not so much.
EricP wrote:
These double-width bit-field straddle operations show up at 32-bits. Various FP64 formats (DEC's middle-endian FP being the worst
example), Intel page table entries and segment/gate descriptors,
come to mind.
Lots of them in 32-bit code!
On 2024-09-16 10:25, Thomas Koenig wrote:
Tim Rentsch <[email protected]> schrieb:
[attribution lost]
Bringing it back to "architecture" Like Anton Ertl has said, LP64
for C/C++ is a mistake. It should always have been ILP64, and
this nonsense would go away. Any new architecture should make C
ILP64 (looking at you RISC-V, missing yet another opportunity to
not make the same mistakes as everyone else).
I believe this view is shortsighted. The big mistake is
developers hardcoding types everywhere - especially int, but
also long, and their unsigned variants. It's almost never a
good idea to hardcode a specific width (eg, uint32_t) in a type
name used for parameters or local variables, but that is by far
a very common practice.
I agree. This issue guided the design of the scalar type system
in Ada.
C programmers can use typedef to get part way there, but not all
the way because typedefs are still weakly typed.
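A small C sketch of that weakness (all type and function names here are invented for the example): two typedefs of int mix freely, whereas single-member structs give distinct types at the cost of more verbose arithmetic.

```c
/* Plain typedefs: both names are just int, so mixing them compiles. */
typedef int celsius_t;
typedef int fahrenheit_t;

static int weak_mix(celsius_t c, fahrenheit_t f)
{
    return c + f;   /* meaningless sum, but no diagnostic is required */
}

/* Wrapper structs: distinct, incompatible types. */
typedef struct { int value; } celsius;
typedef struct { int value; } fahrenheit;

static celsius celsius_add(celsius a, celsius b)
{
    celsius r = { a.value + b.value };
    return r;
}

/* Explicit conversion is the only way across (integer arithmetic,
   truncating toward zero). */
static celsius to_celsius(fahrenheit f)
{
    celsius r = { (f.value - 32) * 5 / 9 };
    return r;
}
```

Passing a fahrenheit to celsius_add() is rejected by the compiler, which is roughly the part of Ada's derived types that C can imitate; range constraints and the rest are not covered by this trick.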
David Brown <[email protected]> schrieb:
On 16/09/2024 09:17, Thomas Koenig wrote:
David Brown <[email protected]> schrieb:
On 14/09/2024 21:26, Thomas Koenig wrote:
MitchAlsup1 <[email protected]> schrieb:
In many cases int is slower now than long -- which violates the notion
of int from K&R days.
That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.
For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).
It has happened, see the illegal (but sometimes useful)
6502 instructions, or the recent RISC-V implementation snafu
(GhostWrite).
I have seen plenty of undefined behaviour in ISA's over the years. (A
very common case is that instruction encodings that are not specified
are left as UB so that later extensions to the ISA can use them.)
A much better idea is to raise an exception, that way you can
be sure that nobody uses it for nefarious purposes.
I was
just thinking of the reactions you'd get if you made an ISA where
attempting to overflow signed integer arithmetic was UB at the hardware
level, so that you could get faster and simpler instructions.
Hard to see how this would be possible... but I realize this
is a hypothetical example.
David Brown wrote:
On 16/09/2024 10:37, Terje Mathisen wrote:
This becomes much simpler in Rust where usize is the only legal index
type:
Yeah, you have to actually write it as
 y = p[x];
 x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)
And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.
Is that true? I'll have to check godbolt myself if that is really the case!
David Brown <[email protected]> wrote:
On 16/09/2024 19:51, MitchAlsup1 wrote:
On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:
On 15/09/2024 21:13, MitchAlsup1 wrote:
As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}
Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).
Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..
The folks designing those register setups had a choice, and made a bad
choice from the viewpoint of software (whether it be C, assembly, or any
other language).
It's conceivable that it was the right choice on balance, considering
many factors. And it's certainly more believable that it was an
appropriate choice when sizes were smaller. It is less believable that
there is an overwhelming need to cross a 64-bit boundary.
Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data structures
lead to better utilization of caches and buses, and the effect due to
this was larger than the cost of extra instructions. So the need to cross
a 64-bit boundary may be rare, but there will be cases when it is the best
choice.
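The "extra instructions" in question look roughly like this in C. This is a sketch with an invented helper name, assuming the packed data is held as two 64-bit words with bit 0 of words[0] as the lowest bit.

```c
#include <stdint.h>

/* Extract `width` bits starting at bit `start` of a 128-bit value stored
   as two 64-bit words. Requires 1 <= width <= 64 and start + width <= 128. */
static uint64_t extract_bits(const uint64_t words[2],
                             unsigned start, unsigned width)
{
    unsigned idx = start / 64;
    unsigned off = start % 64;
    uint64_t v = words[idx] >> off;

    /* Straddle case: the field continues in the next word. This can only
       happen with idx == 0 and off > 0, so the shift count is 1..63. */
    if (off + width > 64)
        v |= words[1] << (64 - off);

    if (width < 64)
        v &= (UINT64_C(1) << width) - 1;
    return v;
}
```

A field that stays inside one word needs only the shift and mask; the straddling field adds a second load, a shift and an OR, which is the cost being traded against the smaller footprint.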
On 9/16/2024 4:27 AM, David Brown wrote:
On 16/09/2024 09:18, BGB wrote:
On 9/15/2024 12:46 PM, Anton Ertl wrote:
Michael S <[email protected]> writes:
Padding is another thing that should be Implementation Defined.
It is. It's defined in the ABI, so when the compiler documents to
follow some ABI, you automatically get that ABI's structure layout.
And if a compiler does not follow an ABI, it is practically useless.
Though, there also isn't a whole lot of freedom of choice here
regarding layout.
If member ordering or padding differs from typical expectations, then
any code which serializes structures to files is liable to break, and
this practice isn't particularly uncommon.
Your expectations here should match up with the ABI - otherwise things
are going to go wrong pretty quickly. But I think most ABIs will have
fairly sensible choices for padding and alignments.
Yeah. It is "almost fixed", as there are a lot of programs that are
liable to break if these assumptions differ.
Say, typical pattern:
Members are organized in the same order they appear in the source code;
That is required by the C standards. (A compiler can re-arrange the
order if that does not affect any observable behaviour. gcc used to
have an optimisation option that allowed it to re-arrange struct
ordering when it was safe to do so, but it was removed as it was
rarely used and a serious PITA to support with LTO.)
OK.
If the current position is not a multiple of the member's alignment,
it is padded to an offset that is a multiple of the member's alignment;
That is a requirement in the C standards.
The only implementation-defined option is whether or not there is
/extra/ padding - and I have never seen that in practice. (And there
are more implementation-defined options for bit-fields.)
Extra padding seems like it wouldn't have much benefit.
Albeit, types like _Bool in my implementation are padded to a full byte
(it is treated as an "unsigned char" that is assumed to always hold
either 0 or 1).
For primitive types, the alignment is equal to the size, which is
also a power of 2;
That is the norm, up to the maximum appropriate alignment for the
architecture. A 16-bit cpu has nothing to gain by making 32-bit types
32-bit aligned.
This comes up as an issue in some Windows file formats, where one can't
just naively use a struct with 32-bit fields because some 32-bit members
only have 16-bit alignment.
If needed, the total size of the struct is padded to a multiple of
the largest alignment of the struct members.
That is required by the C standards.
For C++ classes, it is more chaotic (and more compiler dependent), but:
Not really, no. Apart from a few hidden bits such as pointers to
handle virtual methods and virtual inheritance, the data fields are
ordered, padded and aligned just like in C structs. And these hidden
pointers follow the same rules as any other pointer.
The only other special bit is empty base class optimisation, and
that's pretty simple too.
For simple cases, they may match up, like a POD class may look just like
an equivalent struct, or single-inheritance classes with virtual methods
like a struct with a vtable, etc... But in more complex cases there may
be compiler differences (along with differences in things like name
mangling, etc).
Though, unlike with structs, programs seem less inclined to rely on the memory layout specifics of class instances.
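The C struct layout rules discussed earlier in this post can be checked directly with offsetof and _Alignof (C11). The struct below is an arbitrary example, and the checks are written so that they should hold on any conforming ABI rather than asserting specific offsets.

```c
#include <stddef.h>

struct sample {
    char   tag;     /* first member, always at offset 0 */
    int    count;   /* padded up to a multiple of _Alignof(int) */
    char   flag;
    double weight;  /* padded up to a multiple of _Alignof(double) */
};

/* Returns 1 if the layout follows the rules listed in the thread. */
static int layout_follows_rules(void)
{
    return offsetof(struct sample, tag) == 0
        /* members appear in declaration order */
        && offsetof(struct sample, tag)   < offsetof(struct sample, count)
        && offsetof(struct sample, count) < offsetof(struct sample, flag)
        && offsetof(struct sample, flag)  < offsetof(struct sample, weight)
        /* each member sits at a multiple of its own alignment */
        && offsetof(struct sample, count)  % _Alignof(int) == 0
        && offsetof(struct sample, weight) % _Alignof(double) == 0
        /* total size is padded to the struct's alignment */
        && sizeof(struct sample) % _Alignof(struct sample) == 0;
}
```

Code that serializes structs to files depends on the exact offsets, not just these invariants, which is why it is tied to one ABI.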
On Tue, 17 Sep 2024 08:20:15 +0200
Terje Mathisen <[email protected]> wrote:
EricP wrote:
These double-width bit-field straddle operations show up at 32-bits.
Various FP64 formats (DEC's middle-endian FP being the worst
example), Intel page table entries and segment/gate descriptors,
come to mind.
Lots of them in 32-bit code!
Lots of what in 32-bit code?
On 17/09/2024 08:07, Terje Mathisen wrote:
David Brown wrote:
On 16/09/2024 10:37, Terje Mathisen wrote:
This becomes much simpler in Rust where usize is the only legal
index type:
Yeah, you have to actually write it as
 y = p[x];
 x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)
And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.
Is that true? I'll have to check godbolt myself if that is really the
case!
It is not true - or at least, it shouldn't be true. I had thought the
Rust code was using the equivalent of a C "unsigned int" here, which
would require extra code for wrapping semantics. But that was just my misunderstanding of Rust and its types - with a 64-bit unsigned type, it should give the same results as C. However, there's no harm in checking
it and letting us know.
(I've previously shown how "y = p[x++];" in C is less efficient on
x86-64 if x is "unsigned int", compared to "int" or 64-bit types for x.)
Michael S wrote:
On Tue, 17 Sep 2024 08:20:15 +0200
Terje Mathisen <[email protected]> wrote:
EricP wrote:
These double-width bit-field straddle operations show up at
32-bits. Various FP64 formats (DEC's middle-endian FP being the
worst example), Intel page table entries and segment/gate
descriptors, come to mind.
Lots of them in 32-bit code!
Lots of what in 32-bit code?
Pretty much any 64-bit container with non-regular contents, with the
suggested double / fp64 as the classic example?
Terje
On 17/09/2024 03:36, Waldek Hebisch wrote:
David Brown <[email protected]> wrote:
On 16/09/2024 19:51, MitchAlsup1 wrote:
On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:
On 15/09/2024 21:13, MitchAlsup1 wrote:
As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the
rather distant past could do these rather efficiently {360
SRDL,...}
Anyone who designs a data structure with a bit-field that spans
two 64-bit parts of a struct is probably ignorant of C
bit-fields and software in general. It is highly unlikely to be
necessary or even beneficial from the hardware viewpoint, but
really inconvenient on the software side (whether you use
bit-fields or not).
Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..
The folks designing those register setups had a choice, and made a
bad choice from the viewpoint of software (whether it be C,
assembly, or any other language).
It's conceivable that it was the right choice on balance,
considering many factors. And it's certainly more believable that
it was an appropriate choice when sizes were smaller. It is less
believable that there is an overwhelming need to cross a 64-bit
boundary.
Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data
structures lead to better utilization of caches and buses, and the
effect due to this was larger than the cost of extra instructions. So
the need to cross a 64-bit boundary may be rare, but there will be cases
when it is the best choice.
It is possible, but I think it is rare.
Perhaps my perception is biased from working with microcontrollers,
where you often don't have caches and instruction speeds are not
nearly as much faster than ram access speeds as you see in modern x86
systems.
The other thing I don't like about split bit-fields is that there is
typically no way to do atomic updates, which can mean you need extra
care to keep things correct.
David Brown wrote:
On 17/09/2024 08:07, Terje Mathisen wrote:
David Brown wrote:
On 16/09/2024 10:37, Terje Mathisen wrote:
This becomes much simpler in Rust where usize is the only legal
index type:
Yeah, you have to actually write it as
 y = p[x];
 x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is
an improvement for the programmer. (In general, I dislike
trying to do too much in a single expression or statement, but
some C constructs are common enough that I am happy with them.
It would be hard to formulate concrete rules here.)
And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.
Is that true? I'll have to check godbolt myself if that is really
the case!
It is not true - or at least, it shouldn't be true. I had thought
the Rust code was using the equivalent of a C "unsigned int" here,
which would require extra code for wrapping semantics. But that
was just my misunderstanding of Rust and its types - with a 64-bit
unsigned type, it should give the same results as C. However,
there's no harm in checking it and letting us know.
No need to check this particular point, Rust's usize was obviously
designed to be an unsigned type large enough to index into the entire
addressable memory range, so on a 64-bit platform it has to be 64
bits.
(I've previously shown how "y = p[x++];" in C is less efficient on
x86-64 if x is "unsigned int", compared to "int" or 64-bit types
for x.)
That's actually surprising to me, I would have guessed any 32-bit
index would be less efficient than a full-width type, but if the
idiom is very, very common in C code, then it makes sense to make it
fast.
Doing so would typically require either sign- or zero-extending all
32-bit variables when loaded into a 64-bit register, right?
Terje
Niklas Holsti wrote:
Not just that, many language forms actually preclude the need for checks,
On 2024-09-16 18:58, Michael S wrote:
On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:
David Brown wrote:
On 16/09/2024 15:04, Michael S wrote:
With one exception that usize overflow panics under debug
build.
I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient wrapping on the rare occasions when you need it.
But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a debug/test build and undefined behaviour in release mode is fine - defining the behaviour of overflow in release mode (other than possibly to the same run-time checking) is wrong.
In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stays in
the production code; AssertDbg is only in the debug builds.
Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks enabled.
But debug, optimize, and checking are separate controls.
In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1 instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.
And on those compilers checks can be controlled with quite fine resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.
This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.
If the ability to control the compiler's checks was standard on DEC Ada then it
made DEC Ada non-standard.
No, it means that DEC Ada could be used as a standard-conforming Ada compiler or as a non-conforming compiler, to a user-chosen extent.
The recommended approach today (for applications where it matters) is to use static analysis of the Ada code (e.g. SPARK or other tools) to prove that run-time errors cannot happen, which then makes it possible to omit the corresponding run-time checks while staying compliant.
DEC Ada did that too. This optimization seems to me to be a relatively
straightforward "propagation of constants" type of problem.
Stephen Fuld wrote:
On 9/16/2024 4:12 AM, David Brown wrote:
snip
With all respect to the regulars here, most people in technical
Usenet groups are either old, unusually nerdy, or both.
I resemble that remark! :-)
Ditto, probably...
I'm 67 (but not yet retired), I taught myself the Trachtenberg
algorithms for mental arithmetic when I was around 12 (was reminded of
this last night when I watched Gifted on Netflix), I mail ordered what
was probably the first Rubik's cube to get to Norway. (And developed
three different algorithms to solve it, but I only remember the last one
now which I had optimized for simplicity, not speed.)
Those, along with high school chess and orienteering mapping should
count as nerdy pursuits, right?
Winning the County Yo-Yo championship would be less so?
Regards to all the regulars here, I do consider many of you friends that
I just haven't met yet.
On Tue, 17 Sep 2024 11:48:12 +0200
Terje Mathisen <[email protected]> wrote:
David Brown wrote:
On 17/09/2024 08:07, Terje Mathisen wrote:
David Brown wrote:
On 16/09/2024 10:37, Terje Mathisen wrote:
This becomes much simpler in Rust where usize is the only legal
index type:
Yeah, you have to actually write it as
 y = p[x];
 x += 1;
instead of a single line, but this makes zero difference to the
compiler, right?
I don't care much about the compiler - but I don't think this is
an improvement for the programmer. (In general, I dislike
trying to do too much in a single expression or statement, but
some C constructs are common enough that I am happy with them.
It would be hard to formulate concrete rules here.)
And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.
Is that true? I'll have to check godbolt myself if that is really
the case!
It is not true - or at least, it shouldn't be true. I had thought
the Rust code was using the equivalent of a C "unsigned int" here,
which would require extra code for wrapping semantics. But that
was just my misunderstanding of Rust and its types - with a 64-bit
unsigned type, it should give the same results as C. However,
there's no harm in checking it and letting us know.
No need to check this particular point, Rust's usize was obviously
designed to be an unsigned type large enough to index into the entire
addressable memory range, so on a 64-bit platform it has to be 64
bits.
(I've previously shown how "y = p[x++];" in C is less efficient on
x86-64 if x is "unsigned int", compared to "int" or 64-bit types
for x.)
That's actually surprising to me, I would have guessed any 32-bit
index would be less efficient than a full-width type, but if the
idiom is very, very common in C code, then it makes sense to make it
fast.
Doing so would typically require either sign- or zero-extending all
32-bit variables when loaded into a 64-bit register, right?
Terje
Taken in isolation, on something like x86-64 or aarch64, where the result
of a 32-bit addition is by default zero-extended, there is no difference
between 32-bit and 64-bit unsigned x.
However, when the statement shown above is part of a sequence, even a short
one, 64-bit x allows compiler optimizations that are impossible with
32-bit.
E.g.
y1 = p[x++]
y2 = p[x++]
On x86-64 with 64-bit x the second load can be implemented as
mov dstreg, [rcx+rdx*4+4]
On aarch64 with 64-bit x both loads can be folded into single 'load
pair' instruction.
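As I read it, the obstacle for the 32-bit case is wrap-around: the second index is not always the first plus one when viewed as a 64-bit address offset. A small C sketch (function names invented here) shows the two index sequences diverging at the boundary.

```c
#include <stdint.h>

/* Indices used by "y1 = p[x++]; y2 = p[x++];" with a 32-bit unsigned x. */
static void two_indices_u32(uint32_t x, uint64_t out[2])
{
    out[0] = x++;
    out[1] = x++;   /* wraps to 0 if the first index was UINT32_MAX */
}

/* The same sequence with a 64-bit unsigned x. */
static void two_indices_u64(uint64_t x, uint64_t out[2])
{
    out[0] = x++;
    out[1] = x++;   /* always out[0] + 1 for any realistic array index */
}
```

With the 64-bit index the second element is always adjacent to the first, so one base address plus a constant displacement (or a load pair) covers both loads. With the 32-bit index the compiler has to preserve the case where the second index wraps to zero.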
On Tue, 17 Sep 2024 11:29:15 +0200
David Brown <[email protected]> wrote:
On 17/09/2024 03:36, Waldek Hebisch wrote:
David Brown <[email protected]> wrote:
On 16/09/2024 19:51, MitchAlsup1 wrote:
On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:
On 15/09/2024 21:13, MitchAlsup1 wrote:
As to HW sadism:: this not not <realistically> any harder than
mis- aligned DW accesses from the cache. Many ISA from the
rather distant past could do these rather efficiently {360
SRDL,...}
Anyone who designs a data structure with a bit-field that spans
two 64-bit parts of a struct is probably ignorant of C
bit-fields and software in general. It is highly unlikely to be
necessary or even beneficial from the hardware viewpoint, but
really inconvenient on the software side (whether you use
bit-fields or not).
Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..
The folks designing those register setups had a choice, and made a
bad choice from the viewpoint of software (whether it be C,
assembly, or any other language).
It's conceivable that it was the right choice on balance,
considering many factors. And it's certainly more believable that
it was an appropriate choice when sizes were smaller. It is less
believable that there is an overwhelming need to cross a 64-bit
boundary.
Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data
structures lead to better utilization of caches and buses, and the
effect due to this was larger than the cost of extra instructions. So
the need to cross a 64-bit boundary may be rare, but there will be cases
when it is the best choice.
It is possible, but I think it is rare.
Perhaps my perception is biased from working with microcontrollers,
where you often don't have caches and instruction speeds are not
nearly as much faster than ram access speeds as you see in modern x86
systems.
On the other hand, with MCUs it's quite common to be limited by size of
data storage (SRAM), while size of program storage (flash) is bigger
than one will ever want. Plus, quite often, speed is of less concern.
In such [common] situation densely packed [arrays of] structures could
be desirable.
The other thing I don't like about split bit-fields is that there is
typically no way to do atomic updates, which can mean you need extra
care to keep things correct.
In the common case, on common ISAs an atomic RMW update of a bit field is
impossible even when the field does not cross a word boundary.
In case you mean a write-only update (i.e. values of adjacent fields are
known in advance and not expected to change), what you say can be
correct or not, depending on the availability of unaligned stores and on
what exactly one considers 'atomic'.
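One form the "extra care" can take, assuming C11 atomics and a field that fits inside a single 32-bit word, is a compare-and-swap retry loop. This is a sketch with an invented function name, not a claim about what any particular ISA does in one instruction.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically replace the `width`-bit field at bit `shift` of *word.
   Requires 1 <= width <= 31 and shift + width <= 32. */
static void set_field(_Atomic uint32_t *word, unsigned shift,
                      unsigned width, uint32_t value)
{
    uint32_t mask = ((UINT32_C(1) << width) - 1) << shift;
    uint32_t old = atomic_load(word);
    uint32_t desired;

    do {
        /* Keep the neighbouring fields from `old`, replace only ours. */
        desired = (old & ~mask) | ((value << shift) & mask);
        /* On failure `old` is refreshed with the current contents
           and the loop recomputes `desired`. */
    } while (!atomic_compare_exchange_weak(word, &old, desired));
}
```

A field that straddles two words cannot be handled this way unless the platform offers a compare-and-swap wide enough to cover both words.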
[email protected] (MitchAlsup1) writes:
On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:
I didn't see any content from you in this last posting
of yours.
With all respect to the regulars here, most people in technical Usenet groups are either old, unusually nerdy, or both.
I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).
Stefan
On Tue, 17 Sep 2024 01:35:17 +0000
[email protected] (MitchAlsup1) wrote:
On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:
Bill Findlay wrote:
I found the same 5% performance cost in my tests with DEC Ada85.
Most code was pretty optimal too.
The one thing I found DEC's compiler made a complete pigs breakfast
of the generated code was scanning a character string backwards:
Bacon, sausage, and ham.
Sounds yummy. Code not so much.
It seems that you and EricP give different (not to say an opposite)
meaning to the phrase "complete pigs breakfast".
EricP wrote:
I added a bunch of instructions for dealing with double-width
operations.
The main ISA design decision is whether to have register pair
specifiers, R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an
extra source register and two dest registers, which has a number of
consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.
Very nice!
This means that you can do integer IMAC(), right?
(hi, lo) = imac(a, b, c); // == a*b+c
The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.
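The no-overflow claim can be checked with a short calculation, assuming unsigned 64-bit operands: the largest value of a*b+c+d is (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, which exactly fills the double register. A C sketch using the GCC/Clang extension type unsigned __int128 (the name imaa() and the result struct are made up for illustration):

```c
#include <stdint.h>

/* Two-register result, standing in for the (hi, lo) destination pair. */
typedef struct { uint64_t hi, lo; } pair64;

/* (hi, lo) = a*b + c + d, computed in 128 bits. For unsigned 64-bit
   inputs the sum can never exceed 2^128 - 1, so nothing is lost. */
static pair64 imaa(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    unsigned __int128 t = (unsigned __int128)a * b + c + d;
    pair64 r = { (uint64_t)(t >> 64), (uint64_t)t };
    return r;
}
```

In a multi-word multiply, c would be the carry word from the previous column and d the existing partial-sum digit, so each inner-loop step becomes one such operation.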
Terje
On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):
With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.
I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).
Stefan
Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)
On Tue, 17 Sep 2024 2:34:44 +0000, Tim Rentsch wrote:
[email protected] (MitchAlsup1) writes:
On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:
I didn't see any content from you in this last posting
of yours.
I had started to make a comment after hitting quote, and
while re-reading what you wrote I had nothing to add and
nothing to modify or complain about. While thinking it all
over I ended hitting the Post Article button without any
text.
There was no way to retrieve the post, so I let it lie.
On 17/09/2024 03:36, Waldek Hebisch wrote:
David Brown <[email protected]> wrote:
On 16/09/2024 19:51, MitchAlsup1 wrote:
On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:
On 15/09/2024 21:13, MitchAlsup1 wrote:
As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}
Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).
Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..
The folks designing those register setups had a choice, and made a bad
choice from the viewpoint of software (whether it be C, assembly, or any
other language).
It's conceivable that it was the right choice on balance, considering
many factors. And it's certainly more believable that it was an
appropriate choice when sizes were smaller. It is less believable that
there is an overwhelming need to cross a 64-bit boundary.
Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply put, smaller data structures
lead to better utilization of caches and buses, and the effect of
this was larger than the cost of the extra instructions. So the need to
cross a 64-bit boundary may be rare, but there will be cases when it is
the best choice.
It is possible, but I think it is rare.
Perhaps my perception is biased from working with microcontrollers,
where you often don't have caches and instruction speeds are not nearly
as much faster than ram access speeds as you see in modern x86 systems.
The other thing I don't like about split bit-fields is that there is typically no way to do atomic updates, which can mean you need extra
care to keep things correct.
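To make the atomicity point concrete, here is a small Python model of a field that straddles two 64-bit words. It is a sketch under assumptions: the helper names, the little-endian bit numbering and the list-of-words layout are all made up for illustration.

```python
# Sketch: a bit-field that crosses a 64-bit boundary needs two word
# accesses to read and two separate word writes to update.

M64 = (1 << 64) - 1

def read_field(words, bit_off, width):
    """Read `width` bits starting at absolute bit offset `bit_off`."""
    idx, sh = divmod(bit_off, 64)
    value = words[idx] >> sh
    if sh + width > 64:                       # field spills into next word
        value |= words[idx + 1] << (64 - sh)
    return value & ((1 << width) - 1)

def write_field(words, bit_off, width, value):
    """Update the field in place; note the two independent stores."""
    idx, sh = divmod(bit_off, 64)
    mask = (1 << width) - 1
    value &= mask
    words[idx] = ((words[idx] & ~(mask << sh)) | (value << sh)) & M64
    if sh + width > 64:
        spill = 64 - sh                       # bits that stayed in word idx
        words[idx + 1] = ((words[idx + 1] & ~(mask >> spill))
                          | (value >> spill)) & M64

regs = [0, 0]
write_field(regs, 60, 8, 0xAB)   # 4 bits land in word 0, 4 bits in word 1
```

Between the two stores in `write_field` another observer can see a half-updated field, which is the extra care mentioned above.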
On 9/17/2024 4:39 AM, David Brown wrote:
On 16/09/2024 21:46, BGB wrote:
On 9/16/2024 4:27 AM, David Brown wrote:
Albeit, types like _Bool in my implementation are padded to a full
byte (it is treated as an "unsigned char" that is assumed to always
hold either 0 or 1).
That's the usual way to handle them.
Another option would be for adjacent _Bool values to merge similar to bitfields...
Though, seems that simply turning it into a byte is the typical option.
This comes up as an issue in some Windows file formats, where one
can't just naively use a struct with 32-bit fields because some 32-bit
members only have 16-bit alignment.
Ah, the joys of using ancient formats with new systems!
I was around when this stuff was still newish.
Some are essentially frozen in time with their misaligned members.
Still better than:
"Well, initial field wasn't big enough";
"Repurpose those bytes from over there, and glue them on".
There would need to be a mechanism in the ISA to select between these
modes though (probably a "magic branch" scheme different from the one
used for Inter-ISA branches).
This would likely include an RV64 encoding for "Branch to/from CoEx",
and an encoding within this ISA to jump between CoEx and "Native" mode.
Magic branches make sense mostly as any such mode switch is going to
require a pipeline flush.
This is assuming an implementation that would want to be able to support
both this ISA and also RV64GC.
One possibility could be (in native RV notation):
RV64 (Branches if supported, NOP if not):
LBU X0, Xs, Disp12s //Dest=RV64GC
LWU X0, Xs, Disp12s //Dest=CoEx
LHU X0, Xs, Disp12s //Dest=Native
New ISA:
LBU X0, Xs, Disp10s //Dest=RV64GC
LWU X0, Xs, Disp10s //Dest=CoEx
LHU X0, Xs, Disp10s //Dest=Native
On 9/17/2024 3:11 PM, MitchAlsup1 wrote:
Modes make testing significantly harder. Each mode adds 1 to the
exponent of how many test cases it takes to adequately test a part.
Possibly.
But, modes are kinda unavoidable here:
CPU only runs RV64GC or similar:
Doomed to relative slowness;
CPU only does CoEx:
Closes off the ability to run binaries that assume RV64GC.
CPU only does new ISA:
Well, then it can't run RISC-V code, making all this kinda moot.
This is assuming an implementation that would want to be able to support
both this ISA and also RV64GC.
One possibility could be (in native RV notation):
RV64 (Branches if supported, NOP if not):
LBU X0, Xs, Disp12s //Dest=RV64GC
LWU X0, Xs, Disp12s //Dest=CoEx
LHU X0, Xs, Disp12s //Dest=Native
New ISA:
LBU X0, Xs, Disp10s //Dest=RV64GC
LWU X0, Xs, Disp10s //Dest=CoEx
LHU X0, Xs, Disp10s //Dest=Native
This only gives 36-bits (top) or 30-bits (bottom) of range. What you are
going to want is 64-bits of range -- especially when switching modes --
you PROBABLY want to use an entirely different sub-tree of the
translation table trees.
Idea here is that 'Xs' will give the base address for the target.
On the RISC-V side, this would mean, say:
AUIPC X7, disp
LWU X0, X7, disp
Similar to a normal JALR.
I could almost interpret X0 as PC, except that on a "standard" RISC-V
CPU, the non-supported case would be, likely: "program crashes trying to access a NULL pointer", which is less useful.
Branches in the new ISA would likely be encoded using jumbo prefixes.
Well, partly because the new ISA lacks AUIPC, but the new ISA can encode
it more directly as, essentially:
LWU X0, PC, Disp33s
On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:
On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):
With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.
I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).
Stefan
Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)
At 71 real years old I still operate as if I were <let's say> 21.
On 17/09/2024 20:18, MitchAlsup1 wrote:
On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:
On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):
With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.
I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).
Stefan
Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)
At 71 real years old I still operate as if I were <let's say> 21.
You are not 71, you are merely 0x47 :-)
EricP wrote:
Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.
Nothing likely about it: LZ4 is pretty much the only compression algorithm/lossless codec that never straddles, all the rest tend to
treat the source data as single bitstream of arbitrary length, except
for some built-in chunking mechanism which simplifies faster scanning.
The core of the algorithm always starts with knowing the endianness,
then picking up 32 or 64-bit chunks of input data (byte-flipping if
needed) and then extracting the next N bits either from the top or bottom
of the buffer register.
Almost by definition, this is not code that a compiler is set up to help
you get correct.
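As a rough illustration of that buffer-register pattern, here is a minimal MSB-first bit reader in Python. It is a sketch only: the class name is made up, it refills one byte at a time instead of grabbing 32/64-bit chunks with a byte swap, and reads are limited to 57 bits so the modelled buffer never exceeds 64 bits.

```python
# Sketch: pull N-bit fields off the top of a buffer register, treating
# the input as one continuous big-endian bitstream.

class BitReader:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0       # next byte to load
        self.buf = 0       # buffered bits, newest at the bottom
        self.nbits = 0     # number of valid bits in buf

    def _refill(self):
        # Top the buffer up while a whole byte still fits in 64 bits.
        while self.nbits <= 56 and self.pos < len(self.data):
            self.buf = (self.buf << 8) | self.data[self.pos]
            self.pos += 1
            self.nbits += 8

    def read(self, n: int) -> int:
        """Return the next n bits (1 <= n <= 57) as an unsigned int."""
        self._refill()
        if n > self.nbits:
            raise EOFError("not enough bits left")
        self.nbits -= n
        value = (self.buf >> self.nbits) & ((1 << n) - 1)
        self.buf &= (1 << self.nbits) - 1     # drop the consumed bits
        return value

reader = BitReader(bytes([0b10110011, 0b01010101]))
first = reader.read(3)    # 101
second = reader.read(7)   # 1001101, straddles the byte boundary
third = reader.read(6)    # 010101
```

The second read is the interesting one: the field crosses a storage boundary, and the reader hides that by working on the buffer rather than on the underlying bytes.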
I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an
extra
source register and two dest registers, which has a number of
consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.
Very nice!
This means that you can do integer IMAC(), right?
(hi, lo) = imac(a, b, c); // == a*b+c
The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.
Terje
On 9/17/2024 6:04 PM, MitchAlsup1 wrote:
Still limited to 32-bit displacement from IP.
How would you perform the following call::
current IP = 0x0000000000001234
target IP = 0x7FFFFFFF00001234
This is a single (2-word) instruction in my ISA, assuming GOT is
32-bit displaceable and 64-bit entries.
Granted, but in plain RISC-V, there is no real better option.
If one wants to generate 64-bit displacement, and doesn't want to load a constant from memory:
LUI X6, Disp20Hi //20 bits
ADDI X6, X6, Disp12Hi //12 bits
AUIPC X7, Disp20Lo
ADDI X7, X7, Disp12Lo
SLLI X6, X6, 32
ADD X7, X7, X6
Which is sort of the whole reason I am considering hacking around it
with an alternate encoding scheme.
New encoding scheme can in theory do:
LEA X7, PC, Disp64
In a single 96-bit instruction.
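For anyone wanting to check the six-instruction RV64 sequence, here is a small Python model of it. This is a sketch under stated assumptions: the function names are invented, registers are modelled as 64-bit unsigned values, and the split helper applies the usual +0x800 rounding because ADDI sign-extends its 12-bit immediate and LUI/AUIPC sign-extend their 32-bit result on RV64.

```python
# Sketch: model LUI/ADDI/AUIPC/ADDI/SLLI/ADD building PC + Disp64.

M64 = (1 << 64) - 1

def sext(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def split32(v: int):
    """Split a 32-bit value into (hi20, lo12) for a LUI/AUIPC + ADDI pair."""
    lo12 = v & 0xFFF
    hi20 = ((v + 0x800) >> 12) & 0xFFFFF   # compensate for signed lo12
    return hi20, lo12

def pc_plus_disp64(pc: int, disp: int) -> int:
    disp &= M64
    lo_hi20, lo_lo12 = split32(disp & 0xFFFFFFFF)
    # What AUIPC + ADDI really add to PC (may differ from the low 32 bits
    # by a multiple of 2^32 because of sign extension).
    low = sext(lo_hi20 << 12, 32) + sext(lo_lo12, 12)
    upper = ((disp - low) & M64) >> 32      # what the high half must supply
    hi_hi20, hi_lo12 = split32(upper)

    x6 = sext(hi_hi20 << 12, 32) & M64                 # LUI  X6, Disp20Hi
    x6 = (x6 + sext(hi_lo12, 12)) & M64                # ADDI X6, X6, Disp12Hi
    x7 = (pc + sext(lo_hi20 << 12, 32)) & M64          # AUIPC X7, Disp20Lo
    x7 = (x7 + sext(lo_lo12, 12)) & M64                # ADDI X7, X7, Disp12Lo
    x6 = (x6 << 32) & M64                              # SLLI X6, X6, 32
    return (x7 + x6) & M64                             # ADD  X7, X7, X6

target = pc_plus_disp64(0x0000000000001234, 0x7FFFFFFF00000000)
```

With the current IP and target IP from the question above, the model lands on 0x7FFFFFFF00001234, but only because the assembler or linker does the sign-extension fixups when choosing the four immediates.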
AUIPC is (and remains) a crutch (like LUI from MIPS)
a) it consumes an instruction (space and time)
b) it consumes a register unnecessarily
c) it consumes power that direct delivery of the constant would not
Yeah, pretty much.
LUI + AUIPC + JAL, eat nearly 27 bits of encoding space.
On 9/18/2024 9:27 AM, MitchAlsup1 wrote:
On Wed, 18 Sep 2024 4:00:43 +0000, BGB wrote:
On 9/17/2024 6:04 PM, MitchAlsup1 wrote:
Still limited to 32-bit displacement from IP.
How would you perform the following call::
current IP = 0x0000000000001234
target IP = 0x7FFFFFFF00001234
This is a single (2-word) instruction in my ISA, assuming GOT is
32-bit displaceable and 64-bit entries.
Granted, but in plain RISC-V, there is no real better option.
If one wants to generate 64-bit displacement, and doesn't want to load a
constant from memory:
LUI X6, Disp20Hi //20 bits
ADDI X6, X6, Disp12Hi //12 bits
AUIPC X7, Disp20Lo
ADDI X7, X7, Disp12Lo
SLLI X6, X6, 32
ADD X7, X7, X6
How very much simpler is::
MEM Rd,[IP,Ri<<s,DISP64]
1 instruction, 3 words, 1 decode cycle, no forwarding, shorter latency.
It is simpler, but N/E in RV64G...
This is the whole issue of the idea:
Remain backwards compatible with RV64G / RV64GC (in a binary sense).
*and* try to allow extending it in a way such that performance can be
less poor...
Which is sort of the whole reason I am considering hacking around it
with an alternate encoding scheme.
Just put in real constants.
New encoding scheme can in theory do:
LEA X7, PC, Disp64
In a single 96-bit instruction.
Where is the indexing register?
Generally the use of a displacement and index register are mutually
exclusive (and, cases that can make use of Disp AND Index are much less common than Disp OR Index).
I may still consider defining an encoding for this, but not yet. It is
in a similar boat as auto-increment. Both add resource cost with
relatively little benefit in terms of overall performance.
Auto-increment because if one has superscalar, the increment can usually
be co-executed. And, full [Rb+Ri*Sc+Disp], because it is just too
infrequent to really justify the extra cost of a 3-way adder even if
limited mostly to the low-order bits...
Terje Mathisen wrote:
EricP wrote:
Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.
Nothing likely about it: LZ4 is pretty much the only compression
algorithm/lossless codec that never straddles, all the rest tend to
treat the source data as single bitstream of arbitrary length, except
for some built-in chunking mechanism which simplifies faster scanning.
The core of the algorithm always starts with knowing the endianness,
then picking up 32 or 64-bit chunks of input data (byte-flipping if
needed) and then extracting the next N bits either from the top or bottom
of the buffer register.
Almost by definition, this is not code that a compiler is set up to help
you get correct.
I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an
extra
source register and two dest registers, which has a number of
consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.
Very nice!
This means that you can do integer IMAC(), right?
(hi, lo) = imac(a, b, c); // == a*b+c
The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.
Terje
I thought about IMAC but it was a bit too much.
And unlike FMA there is no precision gain in IMAC, just convenience.
IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
care about overflow/carry on the accumulate.
2-wide = 2-wide + narrow * narrow
It needs 7 registers, 3 dest and 4 source if you want overflow/carry
on the accumulate.
3-wide = 2-wide + narrow * narrow
I wanted to support checked arithmetic which means full width multiplies.
And I was always bothered by the risc approach of MULL (low part) and
MULH (high part) where they do most of the multiply then toss half away
just because they won't have 2 dest registers.
So what else I can do with 2 dest registers? Wide add and sub.
Various wide Add,Sub solves the missing carry/overflow flags problems.
FMA already requires 3 source registers.
Beside Add,Sub,Mul what else can one do with 3 source and 2 dest registers? Wide shifts and wide bit-field extract and insert.
I went with two (r_hi,r_lo) register specifiers because it gave programmers
more flexibility. I played a bit with even register pairs (R0, R2, R4...)
and found one had to do extra MOVs just to form a pair.
(r_hi,r_lo) costs a longer instruction format, but I have a variable-length
instruction set, so it's mostly wider fetch and decode pathways to handle
the worst-case instruction size.
W = Wide = (hi,lo) register pair, N = Narrow = one register.
Add forms:
Add N = N + N // No carry out
Add3 N = N + N + N // No carry out
Addw2 W = N + N // Generate carry
Addw3 W = N + N + N // Generate + propagate carry
Addw1 W = W + N // Propagate carry
Same for subtract wide.
The three Add forms are chosen to make multi-precision integer
multiply easier. See below.
Muluw W = N * N
Mulsw W = N * N
Divuw (quo,rem) = N / N
Divsw (quo,rem) = N / N
Shllw W = W << size // Shift left logical
Shlaw W = W << size // Shift left arithmetic, fault on signed overflow
Shrlw W = W >> size // Shift right logical
Shraw W = W >> size // Shift right arithmetic, sign extend
Shrnw W = W >> size // Shift right numeric, round -1 to zero
Bfextu N = extract (W, size, position) // Bit-field extract, zero extend
Bfexts N = extract (W, size, position) // Bit-field extract, sign extend
Bfins W = insert (W, N, size, position) // Bit-field insert
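One possible reading of the three wide bit-field forms, as a Python model. Treat this as a sketch: the post does not say how position is counted, so counting from the least significant bit of the (hi,lo) pair is an assumption, as are the 64-bit register width and the lower-case function names.

```python
# Sketch: Bfextu / Bfexts / Bfins on a (hi, lo) pair of 64-bit registers.
# Assumption: `pos` is the bit index of the field's least significant bit.

M64 = (1 << 64) - 1

def bfextu(hi: int, lo: int, size: int, pos: int) -> int:
    """Extract `size` bits at `pos` from the wide value, zero extended."""
    wide = (hi << 64) | lo
    return (wide >> pos) & ((1 << size) - 1)

def bfexts(hi: int, lo: int, size: int, pos: int) -> int:
    """Same field, sign extended into a 64-bit register pattern."""
    field = bfextu(hi, lo, size, pos)
    sign = 1 << (size - 1)
    return ((field ^ sign) - sign) & M64

def bfins(hi: int, lo: int, n: int, size: int, pos: int):
    """Insert the low `size` bits of n at `pos`; returns the new (hi, lo)."""
    mask = ((1 << size) - 1) << pos
    wide = (hi << 64) | lo
    wide = (wide & ~mask) | ((n << pos) & mask)
    return (wide >> 64) & M64, wide & M64

# A field that straddles the two registers of the pair:
new_hi, new_lo = bfins(0, 0, 0xAB, 8, 60)
```

The point of taking a wide source is visible in the last line: a field that straddles the boundary between two registers is handled by one operation instead of a shift/mask/or sequence.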
=====================================
Example unsigned 128 * 128 => 256 multiply:
// Unsigned Multiply 128*128 => 256
// (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
// Uses r4,r5,r6,r7,r8 as temp registers
//
muluw r5,r4 = r3*r0
muluw r6,r0 = r2*r0
muluw r8,r7 = r2*r1
muluw r3,r2 = r3*r1
addw3 r4,r1 = r4+r6+r7
addw3 r5,r2 = r5+r8+r2
addw2 r4,r2 = r2+r4
add3 r3 = r3+r5+r4
The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
than the even-numbered register pairs R0,R2,R4... is because the above
sequence would require extra moves to form the even-numbered pairs.
With separate pairs one can select registers so that everything lands
in the right dest at the right time.
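The eight-instruction sequence above can be checked with a direct Python transcription. This is a sketch: the helper functions model my reading of Muluw/Addw3/Addw2/Add3 with 64-bit registers (high part first in each result pair, as in the listing), and the names are simply the mnemonics in lower case.

```python
# Sketch: model the 128*128 => 256 multiply sequence with 64-bit registers.

M64 = (1 << 64) - 1

def muluw(a, b):            # W = N * N, returns (hi, lo)
    p = a * b
    return p >> 64, p & M64

def addw3(a, b, c):         # W = N + N + N, hi holds the carries (0..2)
    s = a + b + c
    return s >> 64, s & M64

def addw2(a, b):            # W = N + N, hi holds the carry (0..1)
    s = a + b
    return s >> 64, s & M64

def add3(a, b, c):          # N = N + N + N, no carry out
    return (a + b + c) & M64

def mul128(r3, r2, r1, r0):
    """(r3,r2) * (r1,r0) => (r3,r2,r1,r0), same register use as the listing."""
    r5, r4 = muluw(r3, r0)
    r6, r0 = muluw(r2, r0)
    r8, r7 = muluw(r2, r1)
    r3, r2 = muluw(r3, r1)
    r4, r1 = addw3(r4, r6, r7)
    r5, r2 = addw3(r5, r8, r2)
    r4, r2 = addw2(r2, r4)
    r3 = add3(r3, r5, r4)
    return r3, r2, r1, r0

def check(a: int, b: int) -> bool:
    """Compare the register sequence against Python's big-int product."""
    r3, r2, r1, r0 = mul128(a >> 64, a & M64, b >> 64, b & M64)
    got = (r3 << 192) | (r2 << 128) | (r1 << 64) | r0
    return got == a * b

all_ones_ok = check((1 << 128) - 1, (1 << 128) - 1)
```

Note how each result lands in the register it is needed in next, which is the argument for separate (r_hi,r_lo) specifiers made above.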
EricP <[email protected]> wrote:
Terje Mathisen wrote:
EricP wrote:
I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:
https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit
They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write port, and does not use the result. You can get a throughput of 5 cycles
with
smart coding, but that rarely happens without effort.
Terje Mathisen wrote:
Very nice!
This means that you can do integer IMAC(), right?
(hi, lo) = imac(a, b, c); // == a*b+c
The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.
I thought about IMAC but it was a bit too much.
And unlike FMA there is no precision gain in IMAC, just convenience.
IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
care about overflow/carry on the accumulate.
2-wide = 2-wide + narrow * narrow
It needs 7 registers, 3 dest and 4 source if you want overflow/carry
on the accumulate.
3-wide = 2-wide + narrow * narrow
EricP <[email protected]> wrote:
I wanted to support checked arithmetic which means full width multiplies.
And I was always bothered by the risc approach of MULL (low part) and
MULH (high part) where they do most of the multiply then toss half away
just because they won't have 2 dest registers.
I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:
https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit
They claim 5 cycles, should be six, five for the multiply and one more for the second result, unless the next instruction does not need a write port, and does not use the result. You can get a throughput of 5 cycles with
smart coding, but that rarely happens without effort.
Brett wrote:
EricP <[email protected]> wrote:
They claim 5 cycles, should be six, five for the multiply and one more
for
the second result, unless the next instruction does not need a write
port,
and does not use the result. You can get a throughput of 5 cycles with
smart coding, but that rarely happens without effort.
That article is ignoring multiplier pipelining.
If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.
But those two multiplies still are tossing away 50% of their work.
And if it does fuse them then the internal uArch cost is the same as if
you had designed it optimally from the start, except now you have
to pay for a fuser.
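The latency arithmetic here is easy to write down. A trivial Python sketch, assuming a fully pipelined multiplier and back-to-back issue; the function name and parameters are made up for illustration.

```python
# Sketch: completion cycle of each op issued back-to-back into a
# pipelined unit with the given latency and issue interval.

def completion_cycles(n_ops: int, latency: int = 5, interval: int = 1):
    return [i * interval + latency for i in range(n_ops)]

mull_only = completion_cycles(1)        # MULL alone
mull_then_mulh = completion_cycles(2)   # MULL issued at 0, MULH at 1
```

So the pair finishes on cycle 6 rather than 10, but the multiplier array was still exercised twice for one product.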
<sound of soap box being dragged out>
This idea that macro-op fusion is some magic solution is bullshit.
Agreed
1) It's not free.
Far from it.
2) It only works where Decode can see *all* the required lookahead
instructions, which means you have to pay for an N-lane decoder
but only get 1 lane.
I think it is but a crutch for a misdesigned ISA
3) It's probabilistic as it depends on how the fetch buffers get loaded.
Eg if the fetch buffer contains a valid instruction but does not have
a next instruction, do you stall Decode to see if a fuser might arrive
or dispatch it anyway.
It can be worse than that
4) It gets exponentially expensive if you start doing multiple instruction
lanes because decode has to deal with all the permutations of
fusion possibilities.
All the more reason to have a better ISA
5) Any fused instructions leave (multiple) bubbles that should be
compacted out or there wasn't much point to doing the fusion.
In my opinion it is better to have an ISA that is optimal by design
rather than being patched up by fusion later.
Some of this inefficiency is caused by clinging to now 40 year old
risc design *guidelines* (ie not even rules) that:
- instructions have at most 1 dest and 2 source registers
Makes FMAC hard
- register specifier fields are either source or dest, never both
I happen to be wishywashy on this
- instructions should take at most 1 clock (they never did)
This never worked for floating point anyway...and many consider
These self imposed design restrictions cause ISA designers to miss
some possible more optimal solutions. The result is things like
RISC-V's memory reference linkage structures taking 6 instructions
to build a 64-bit PC-relative address. And I'm pretty sure we won't
see any 6 instruction fusers for quite some time.
<sound of soap box being dragged back to cupboard>
On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:
EricP <[email protected]> wrote:
Terje Mathisen wrote:
EricP wrote:
I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:
https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit
They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write
port, and does not use the result. You can get a throughput of 5 cycles
with
smart coding, but that rarely happens without effort.
It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}
And I'm pretty sure we won't
see any 6 instruction fusers for quite some time.
On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:
Brett wrote:
EricP <[email protected]> wrote:
They claim 5 cycles, should be six, five for the multiply and one more
for
the second result, unless the next instruction does not need a write
port,
and does not use the result. You can get a throughput of 5 cycles with
smart coding, but that rarely happens without effort.
That article is ignoring multiplier pipelining.
If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.
But those two multiplies still are tossing away 50% of their work.
And if it does fuse them then the internal uArch cost is the same as if
you had designed it optimally from the start, except now you have
to pay for a fuser.
You failed to recognize the critical part of my comment on this::
When the IMUL function unit sees MULL and MULH back to back AND
when both operands are the same for both instructions; it KNOWS
that the second multiply has the same result as the first and
thereby that the second multiply can be suppressed and the first
multiply used twice. {{In pure CMOS, if you drop the same operands
twice into the multiplier tree, the multiplier tree burns no power
in any event, just the operand delivery power.}}
You may call this fusion, but it is the very lowest level of it
and was not called such when first used.
<sound of soap box being dragged out>
- register specifier fields are either source or dest, never both
I happen to be wishywashy on this
MitchAlsup1 wrote:
On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:
EricP <[email protected]> wrote:
Terje Mathisen wrote:
EricP wrote:
I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:
https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit
They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write
port, and does not use the result. You can get a throughput of 5 cycles
with
smart coding, but that rarely happens without effort.
It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}
Yes, but then you *require* a macro-op fuser to function efficiently.
Probably... assuming it works.
OR one can give up the cherished 1-dest,2-source self imposed ISA design limitation and have a 32-bit instruction with four 5-bit registers,
2 source, 2 dest, leaving 12 bits for opcode and function code
that you know will calculate multiply once, and can write back
the result in 1 clock if it has two write ports (which it needs
anyway if it wants any hope of catching up after a stall bubble).
Also in the case of Alpha they only had unsigned MUL,MULH and
for signed multiply it had to use branchy code (pre-CMOV) to
do the signed correction subtracts, so fusion would be too complex.
That design decision is as baffling as HP-PA originally leaving
a MUL instruction out entirely because "it violated the 1-clock per instruction design philosophy". (HP quickly fixed it, but still...)
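For reference, the signed correction on top of an unsigned high multiply looks like this in Python. It is a sketch of the standard identity, not of any particular Alpha code sequence; function names are made up, and operands are 64-bit register bit patterns.

```python
# Sketch: derive the signed high half of a 64x64 multiply from the
# unsigned high half plus two conditional subtracts.

M64 = (1 << 64) - 1

def umulh(a: int, b: int) -> int:
    """High 64 bits of the unsigned 128-bit product."""
    return ((a & M64) * (b & M64)) >> 64

def to_signed(x: int) -> int:
    """Interpret a 64-bit pattern as a two's complement value."""
    x &= M64
    return x - (1 << 64) if x >> 63 else x

def smulh_from_umulh(a: int, b: int) -> int:
    """Signed high half, using only umulh and conditional subtracts."""
    hi = umulh(a, b)
    if a >> 63:            # a negative: subtract b from the high half
        hi -= b
    if b >> 63:            # b negative: subtract a from the high half
        hi -= a
    return hi & M64

def smulh_reference(a: int, b: int) -> int:
    """Reference result computed with Python's signed big integers."""
    return ((to_signed(a) * to_signed(b)) >> 64) & M64

minus_one = M64
example = smulh_from_umulh(minus_one, 5)   # (-1) * 5 = -5, high half all ones
```

The two tests on the sign bits are what turn into branches when there is no conditional move, which is why fusing that pattern would be unattractive.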
MitchAlsup1 <[email protected]> wrote:
On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:
- register specifier fields are either source or dest, never both
I happen to be wishywashy on this
This is deeply interesting, can you expound on why it is fine a register field can be shared by loads and stores, and sometimes both like x86.
Classic RISC says the loads are critical, but no one is one wide today,
so
stores matter for deconfliction…. And does stuff just fall out right to allow both?
EricP <[email protected]> wrote:
MitchAlsup1 wrote:
It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}
Yes, but then you *require* a macro-op fuser to function efficiently.
Probably... assuming it works.
OR one can give up the cherished 1-dest,2-source self imposed ISA design
limitation and have a 32-bit instruction with four 5-bit registers,
2 source, 2 dest, leaving 12 bits for opcode and function code
that you know will calculate multiply once, and can write back
the result in 1 clock if it has two write ports (which it needs
anyway if it wants any hope of catching up after a stall bubble).
You already have 2 source, 2 dest if you have load with address update.
A low end CPU is going to have a shared INT/FPU pipeline so you have the hardware to do three sources for MAC. You might as well do 3 source 2
dest on the int side as well. And ARM does Add with Shift which is 3
sources, though one is a constant if you want one cycle uncracked
throughput in most designs.
On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:
MitchAlsup1 <[email protected]> wrote:
On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:
- register specifier fields are either source or dest, never both
I happen to be wishywashy on this
This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.
My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.
I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.
Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::
ADD SP,SP,#0x40
by specifying SP only once in the instruction.
So,
+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+
Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}
I disagree with things like::
+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+
Where Rds means the specifier is used as both a source and destination.
Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.
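A tiny decode model shows what "wire the fields directly" means. Everything about the layout here is hypothetical: the 6/5/5/16 field widths, the bit positions and the opcode numbers are assumptions picked for illustration, not the actual My 66000 encoding.

```python
# Sketch: with fixed field positions, the register specifiers can be
# sliced out of the instruction word before the opcode is decoded.
# Hypothetical layout: [31:26] major, [25:21] Rd/cnd/Rs0, [20:16] Rs1/Rb,
# [15:0] immediate/displacement.

def fields(insn: int):
    major = (insn >> 26) & 0x3F
    reg_a = (insn >> 21) & 0x1F    # Rd for LD, store data (Rs0) for ST
    reg_b = (insn >> 16) & 0x1F    # Rs1 or base register Rb
    imm16 = insn & 0xFFFF
    return major, reg_a, reg_b, imm16

def encode(major: int, reg_a: int, reg_b: int, imm16: int) -> int:
    return (major << 26) | (reg_a << 21) | (reg_b << 16) | (imm16 & 0xFFFF)

LD, ST = 0x20, 0x21                # made-up major opcodes
load = encode(LD, 3, 7, 0x40)      # LD R3,[R7+0x40]
store = encode(ST, 3, 7, 0x40)     # ST R3,[R7+0x40]
```

The same two slices feed the register file or renamer for both instructions; only the major opcode decides later whether `reg_a` is read (store data) or written (load result).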
Classic RISC says the loads are critical, but no one is one wide today,
SiFive disagrees with you.
so
stores matter for deconfliction…. And does stuff just fall out right to
allow both?
Can you restate what you wanted to say using different words or perhaps
give an example ??
MitchAlsup1 <[email protected]> wrote:
On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:
MitchAlsup1 <[email protected]> wrote:
On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:
- register specifier fields are either source or dest, never both
I happen to be wishywashy on this
This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.
My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.
I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.
Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::
ADD SP,SP,#0x40
by specifying SP only once in the instruction.
So,
+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+
Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}
I disagree with things like::
+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+
Where Rds means the specifier is used as both a source and destination.
Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.
Classic RISC says the loads are critical, but no one is one wide today,
SiFive disagrees with you.
so
stores matter for deconfliction…. And does stuff just fall out right to
allow both?
Can you restate what you wanted to say using different words or perhaps
give an example ??
A series of adds to the same register in a four wide design.
A = A + 1
A = A + B
A = A + C
A = A + D
On Fri, 20 Sep 2024 0:12:48 +0000, Brett wrote:
MitchAlsup1 <[email protected]> wrote:
On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:
MitchAlsup1 <[email protected]> wrote:
On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:
- register specifier fields are either source or dest, never both
I happen to be wishywashy on this
This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.
My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.
I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.
Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::
ADD SP,SP,#0x40
by specifying SP only once in the instruction.
So,
+------+-----+-----+----------------+
| major| Rd  | Rs1 |    whatever    |
+------+-----+-----+----------------+
| BC   | cnd | Rs1 |  label offset  |
+------+-----+-----+----------------+
| LD   | Rd  | Rb  |  displacement  |
+------+-----+-----+----------------+
| ST   | Rs0 | Rb  |  displacement  |
+------+-----+-----+----------------+
Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}
I disagree with things like::
+------+-----+-----+----------------+
| big OpCode | Rds |    whatever    |
+------+-----+-----+----------------+
Where Rds means the specifier is used as both a source and destination.
Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.
Classic RISC says the loads are critical, but no one is one wide today,
SiFive disagrees with you.
so stores matter for deconfliction…. And does stuff just fall out right to
allow both?
Can you restate what you wanted to say using different words or perhaps
give an example ??
A series of adds to the same register in a four wide design.
A = A + 1
A = A + B
A = A + C
A = A + D
Which any good compiler should emit as::
T1 = A + B
T2 = C + D
A = LEA( T1, T2, #1 )
With a 2 cycle latency instead of 4.
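A quick sketch of why the reassociated form is shorter: it counts dependency levels for the serial chain and for the tree, and checks that both give the same value. It assumes wrapping 64-bit integer adds (where reassociation is exact) and models LEA( T1, T2, #1 ) as a single three-input add, as the post implies; the function names are just for illustration.

```python
MASK64 = (1 << 64) - 1


def serial_chain(a, b, c, d):
    """Four dependent adds; each one needs the previous value of A."""
    depth = 0
    for addend in (1, b, c, d):
        a = (a + addend) & MASK64
        depth += 1
    return a, depth


def reassociated(a, b, c, d):
    """Two independent adds, then one three-input add (the LEA)."""
    t1 = (a + b) & MASK64        # level 1
    t2 = (c + d) & MASK64        # level 1, independent of t1
    a = (t1 + t2 + 1) & MASK64   # level 2
    return a, 2
```

Floating-point adds would not allow this rewrite without changing results, which is why the sketch sticks to integers.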
On 9/19/2024 12:15 PM, BGB wrote:
On 9/19/2024 2:04 AM, Robert Finch wrote:
On 2024-09-18 10:30 p.m., BGB wrote:
On 9/18/2024 2:29 PM, Chris M. Thomasson wrote:
I think I am early GenX. 59 and still learning loads of stuff.
On 9/18/2024 1:13 AM, David Brown wrote:
On 17/09/2024 20:18, MitchAlsup1 wrote:
On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:
You are not 71, you are merely 0x47 :-)
On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):
With all respect to the regulars here, most people in
technical Usenet
groups are either old, unusually nerdy, or both.
I plead guilty to nerdy, but as for old, I'm still 27 (and
that's been
true for more than 20 years).
Stefan
Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)
At 71 real years old I still operate as if I were <let's say> 21.
LOL! :^)
Not going to say my exact age, but if I wrote my age in hex I could
almost try to pass myself off as an early Zoomer (rather than as a
millennial...).
...
Old enough to remember tube TVs and radios. Transistorized pocket
radio were a big thing.
In my case, my childhood was mostly in the era of Win 3.x and Win 9x
PCs, and early dial-up internet (unlike most Zoomers, I remember a
time before YouTube).
[...]
I remember way back wrt compuserve. :^)
MitchAlsup1 wrote:
On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:
MitchAlsup1 <[email protected]> wrote:
On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:
- register specifier fields are either source or dest, never both
I happen to be wishywashy on this
This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.
My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.
I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.
Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::
ADD SP,SP,#0x40
by specifying SP only once in the instruction.
So,
+------+-----+-----+----------------+
| major| Rd  | Rs1 |    whatever    |
+------+-----+-----+----------------+
| BC   | cnd | Rs1 |  label offset  |
+------+-----+-----+----------------+
| LD   | Rd  | Rb  |  displacement  |
+------+-----+-----+----------------+
| ST   | Rs0 | Rb  |  displacement  |
+------+-----+-----+----------------+
Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}
I disagree with things like::
+------+-----+-----+----------------+
| big OpCode | Rds |    whatever    |
+------+-----+-----+----------------+
Where Rds means the specifier is used as both a source and destination.
Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.
I assume in your examples that you want to start your register file
read access and or rename register lookup access in the decode stage,
and not wait to start at the end of the decode stage.
Effectively pipelining those accesses.
That's fine.
But that's my point - it doesn't make a difference because in both
cases you can wire the reg fields to the reg file or rename directly
and start the access ASAP.
In both cases the enable signal determining what to do shows up
later after decode has done its thing. And the critical path for
that decode enable signal is the same both ways.
And if you are not doing this early access start but the traditional
approach of latching the decode output and THEN starting your RegRd or
Rename access, it makes no timing difference at all.
By allowing the opcode-Rds style instructions to be *CONSIDERED*
it opens an avenue to potential instructions that cost little or
nothing extra in terms of logic or performance.
And this is particularly useful with fixed width 32-bit instructions
where one is trying to pack as much function into a fixed size space as
possible. Even more so with 16-bit compact instructions.
For example, a 32-bit fixed format instruction with four 5-bit registers
could do a full width integer multiply wide-accumulate
  IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2
with little more logic than the existing MULL,MULH approach.
It still only needs 2 read ports because Rs1,Rs2 are read first to start
the multiply, then (Rsd_hi,Rsd_lo) second as they aren't needed until
late in the multiply-accumulate.
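For reference, the arithmetic the proposed IMAC would perform can be sketched as below. This assumes 64-bit registers and unsigned operands with the 128-bit accumulator wrapping on overflow; the post does not specify signedness or width, so treat both as assumptions of the sketch.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def imac(rsd_hi, rsd_lo, rs1, rs2):
    """(Rsd_hi, Rsd_lo) = (Rsd_hi, Rsd_lo) + Rs1 * Rs2, unsigned.

    Rs1 and Rs2 are consumed first (to form the product); the accumulator
    pair is only needed for the final add, which is what lets the hardware
    read it late.
    """
    product = (rs1 & MASK64) * (rs2 & MASK64)   # full 128-bit product
    acc = ((rsd_hi & MASK64) << 64) | (rsd_lo & MASK64)
    acc = (acc + product) & MASK128
    return acc >> 64, acc & MASK64
```

A dot product over 64-bit values would then be one IMAC per element, with the register pair carried as the running sum.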
On 9/18/2024 1:42 PM, MitchAlsup1 wrote:
One simple option would be to assume an instruction looks like:
[Prefix Bytes]
[REX byte]
OP_Byte | 0F+OP_Byte
Mod/RM + SIB + ...
And then use a heuristic to try to guess how to interpret the
instruction stream based on "looks better" (more likely to be aligned
with the instruction stream vs random unaligned garbage).
Though, such a "looks good" heuristic could itself risk skewing the
results.
I may still consider defining an encoding for this, but not yet. It is
in a similar boat as auto-increment. Both add resource cost with
relatively little benefit in terms of overall performance.
Auto-increment because if one has superscalar, the increment can usually
be co-executed. And, full [Rb+Ri*Sc+Disp], because it is just too
infrequent to really justify the extra cost of a 3-way adder even if
limited mostly to the low-order bits...
Myopathy--look it up.
OK.
Not sure how that is related (a medical condition involving muscle defects...).
Can also note that a worthwhile design goal is to not add significant
cost over what would be needed for a plain RV64GC implementation, but,
could define a [Rb+Ri*Sc+Disp] encoding or similar if it would likely be beneficial enough to justify its existence.
Myopathy is NEAR SIGHTEDNESS.
On 9/19/24 11:07 AM, EricP wrote:
[snip]
If the multiplier is pipelined with a latency of 5 and throughput
of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.
But those two multiplies still are tossing away 50% of their work.
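The cycle counts quoted above follow from the usual pipelined-unit arithmetic. A tiny sketch, assuming back-to-back issue with no stalls (the function name and defaults are just for illustration):

```python
def finish_cycle(n_ops, latency=5, issue_interval=1):
    """Cycle at which the last of n back-to-back operations completes.

    The first result appears after `latency` cycles; each further
    operation adds one issue interval.
    """
    return latency + (n_ops - 1) * issue_interval
```

With the defaults, one multiply finishes at cycle 5 and a MULL,MULH pair at cycle 6, matching the figures in the quoted text.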
I do not remember how multipliers are actually implemented — and
am not motivated to refresh my memory at the moment — but I
thought a multiply low would not need to generate the upper bits,
so I do not understand where your "50% of their work" is coming
from.
The high result needs the low result carry-out but not the rest of
the result. (An approximate multiply high for multiply by
reciprocal might be useful, avoiding the low result work. There
might also be ways that a multiplier could be configured to also
provide bit mixing similar to middle result for generating a
hash?)
I seem to recall a PowerPC implementation did semi-pipelined 32-bit
multiplication 16 bits at a time. This presumably saved area and power
while also facilitating early out for small multiplicands, at the cost
of some latency and substantial throughput compared to a fully pipelined
multiplication. If I remember correctly, this produced a result for
16-bit by 32-bit multiplication, which is different from generating a
low or high result.
And if it does fuse them then the internal uArch cost is the same as if
you had designed it optimally from the start, except now you have
to pay for a fuser.
<sound of soap box being dragged out>
This idea that macro-op fusion is some magic solution is bullshit.
1) It's not free.
Neither is increasing the number of opcodes or providing extender
prefixes. If one wants binary compatibility, non-fusing
implementations would work.
(I tend to favor providing a translation layer between software
distribution format and instruction cache format, which reduces
the binary compatibility constraint.)
2) It only works where Decode can see *all* the required lookahead
instructions, which means you have to pay for an N-lane decoder
but only get 1 lane.
Most fusion is for two adjacent instructions, which significantly
limits the complexity.
The fusable patterns are also a subset of
all pairs of two instructions, so complete two-way decoding may
not be needed.
There may also be optimization opportunities from looking ahead.
Mitch Alsup proposed such for branch handling in a scalar
implementation.
Apart from fusion, there might be advantages for
avoiding bank conflicts in a banked register file. I.e., the cost
of lookahead might be shared by multiple techniques/optimizations.
I tend to agree that fusion tends to be a workaround for sub-
optimal instruction encoding, but it seems that encoding involves
a lot of tradeoffs.
3) It's probabilistic as it depends on how the fetch buffers get loaded.
Eg if the fetch buffer contains a valid instruction but does not have
a next instruction, do you stall Decode to see if a fuser might arrive
or dispatch it anyway.
This is also somewhat true for variable length encodings that
cross fetch boundaries. In general a boundary-crossing instruction
would probably stall even if such was not strictly necessary
(e.g., if the missing information is opcode refinement — not
related to instruction routing — or an immediate or even a
register source identifier specifying a value that can have
delayed use (e.g., value of a store, addend of a FMADD)).
This does seem a weakness, but fusion is not entirely negative
factors.
4) It gets exponentially expensive if you start doing multiple
instruction lanes because decode has to deal with all the permutations
of fusion possibilities.
This is also a factor in mere superscalar decode/execute.
Detecting that an instruction is dependent on another would
normally stall the execution of that instruction.
(I feel that encoding some of the dependency information could
be useful to avoid some of this work. In theory, common
dependency detection could also be more broadly useful; e.g.,
operand availability detection and execution/operand routing.)
5) Any fused instructions leave (multiple) bubbles that should be
compacted out or there wasn't much point to doing the fusion.
Even with reduced operations per cycle, fusion could still provide
a net energy benefit.
In my opinion it is better to have an ISA that is optimal by design
rather than being patched up by fusion later.
Fusion is mostly presented for "patching up", but there are also
considerations of diverse microarchitectures. With pre-fused
instructions, an implementation might need to crack some of those
instructions. Software optimized for such an implementation might
also prefer more flexible compile-time scheduling of pre-cracked
operations.
A load-op instruction is perhaps particularly difficult because
one needs frequent stalls, a skewed (or second chance) pipeline to
hide the load latency, out-of-order execution, or some other stall
avoidance mechanism.
There are also constraints in encoding granularity.
Some of this inefficiency is caused by clinging to now 40 year old
risc design *guidelines* (ie not even rules) that:
- instructions have at most 1 dest and 2 source registers
FMADD seems to have mostly killed the 2-source limit. AArch64's
paired load removes the 2 destination limit. (Paired destinations
were common for early double precision implementations.)
- register specifier fields are either source or dest, never both
This seems mostly a code density consideration. I think using a
single name for both a source and a destination is not so
horrible, but I am not a hardware guy.
- instructions should take at most 1 clock (they never did)
That was clearly overconstraining.
These self imposed design restrictions cause ISA designers to miss
some possible more optimal solutions. The result is things like
RISC-V's memory reference linkage structures taking 6 instructions
to build a 64-bit PC-relative address. And I'm pretty sure we won't
see any 6 instruction fusers for quite some time.
I very much doubt a compiler would generate such outside of some
real-time application where the time constancy might justify the
code bloat.
<sound of soap box being dragged back to cupboard>
I do not mean my response to be heckling. Your points are very
true. However, I think fusion is a technique — like cracking —
that is a natural part of an architect's toolbox.
On 9/22/2024 3:43 PM, Paul A. Clayton wrote:
On 9/19/24 11:07 AM, EricP wrote:
I tend to agree that fusion tends to be a workaround for sub-
optimal instruction encoding, but it seems that encoding involves
a lot of tradeoffs.
Yeah...
However, the cost of doing fusion is higher than having longer-form variable-length instructions via prefixes...
If one wants a cheapish way to do prefixes on a 1-wide machine, they
could transpose the instruction words during fetch, and then only need a single decoder.
So:
WordA
PrefixA WordB
PrefixA PrefixB WordC
Is presented to the decoder as:
WordA
WordB PrefixA
WordC PrefixB PrefixA
So, the decoder doesn't move...
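The reordering above can be sketched in a few lines. This assumes some predicate that recognises a prefix word (here a hypothetical `is_prefix` callback) and simply groups each base word with its prefixes, base word first and prefixes in reverse order, as in the example.

```python
def transpose_for_decoder(words, is_prefix):
    """Group each base word with its preceding prefixes, base word first.

    Prefixes are emitted nearest-first, so [PrefixA, PrefixB, WordC] is
    presented as [WordC, PrefixB, PrefixA]. Trailing prefixes with no base
    word yet are left out of the result; a real fetch unit would wait for
    more input.
    """
    groups, pending = [], []
    for word in words:
        if is_prefix(word):
            pending.append(word)
        else:
            groups.append([word] + pending[::-1])
            pending = []
    return groups
```

Because the base word always lands in slot 0 of its group, a single decoder can look at a fixed position regardless of how many prefixes were present.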
Possibly, a similar trick could be used for 2-wide with limited
variable-length, but would get more complicated.
<snip>
FMADD seems to have mostly killed the 2-source limit. AArch64's
paired load removes the 2 destination limit. (Paired destinations
were common for early double precision implementations.)
IMHO:
RISC-V not having register-index load/store, while having things like
FMADD, is kinda stupid. Having advanced features while taking a big hit
on the lack of cheap features is not ideal.
I had recently been working on getting BGBCC to target RISC-V (generated
code still not fully working, but the compiler is now able to do the
compiler thing at least).
However, with all of the limits that RISC-V imposes, BGBCC is currently generating output that is around 43% bigger in RISC-V mode than BJX2-XG2
mode (or around 56% bigger than baseline mode).
This is kinda terrible...
So, say, 6 instructions for a 64-bit constant load, or around 4
instructions to load/store a global variable (relative to GP), 4
instructions whenever the 12-bit displacement fails, ...
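As background for the displacement case: RISC-V loads, stores, and ADDI carry a 12-bit signed immediate, and LUI supplies an upper 20 bits, so an out-of-range offset has to be split. Below is a minimal sketch of the usual hi/lo split with the rounding adjustment for the sign-extended low part. The function names are made up, and the instruction counts quoted above are the poster's measurements, not something this sketch derives.

```python
def fits_simm12(offset):
    """True if the offset fits the 12-bit signed immediate directly."""
    return -2048 <= offset <= 2047


def split_hi_lo(offset):
    """Split an offset into (hi20, lo12) with offset == (hi20 << 12) + lo12.

    lo12 is kept in the signed 12-bit range, so hi20 is rounded up by one
    whenever bit 11 of the offset is set.
    """
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12
```

An offset that passes `fits_simm12` needs one instruction; otherwise the split costs at least an extra LUI plus an add to form the address before the access.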
On 9/22/2024 3:43 PM, Paul A. Clayton wrote:
On 9/19/24 11:07 AM, EricP wrote:
[snip]
If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.
But those two multiplies still are tossing away 50% of their work.
I do not remember how multipliers are actually implemented — and
am not motivated to refresh my memory at the moment — but I
thought a multiply low would not need to generate the upper bits,
so I do not understand where your "50% of their work" is coming
from.
The high result needs the low result carry-out but not the rest of
the result. (An approximate multiply high for multiply by
reciprocal might be useful, avoiding the low result work. There
might also be ways that a multiplier could be configured to also
provide bit mixing similar to middle result for generating a
hash?)
I guess it might be interesting if one made a bigger multiplier out of
4-bit multipliers, in a way similar to a 4-bit shift-add.
On 9/22/24 6:19 PM, MitchAlsup1 wrote:
On 9/19/24 11:07 AM, EricP wrote:
<sound of soap box being dragged out>
This idea that macro-op fusion is some magic solution is bullshit.
The argument is, at best, of Academic Quality, made by a student
at the time as a way to justify RISC-V not having certain easy
for HW to perform calculations.
The RISC-V published argument for fusion is not great, but fusion
(and cracking/fission) seem natural architectural mechanisms *if*
one is stuck with binary compatibility.
On 9/22/24 6:19 PM, MitchAlsup1 wrote:
On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:
On 9/19/24 11:07 AM, EricP wrote:
[snip]
If the multiplier is pipelined with a latency of 5 and throughput
of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.
But those two multiplies still are tossing away 50% of their work.
I do not remember how multipliers are actually implemented — and
am not motivated to refresh my memory at the moment — but I
thought a multiply low would not need to generate the upper bits,
so I do not understand where your "50% of their work" is coming
from.
+-----------+   +------------+
 \  mplier /     \   mcand  /        Big input mux
  +--------+      +--------+
       |               |
       |      +--------------+
       |     /              /
       |    /              /
       +-- /              /
          /     Tree     /
         /              /--+
        /              /   |
       /              /    |
      +---------------+-----------+
            hi             low       Products
two n-bit operands are multiplied into a 2×n-bit result.
{{All the rest is HOW not what}}
So are you saying the high bits come for free? This seems
contrary to the conception of sums of partial products, where
some of the partial products are only needed for the upper bits
and so could (it seems to me) be uncalculated if one only wanted
the lower bits.
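The partial-product view can be checked with a small model. This sketch assumes a plain AND-array multiplier (no Booth recoding) and counts how many single-bit partial products feed the low half; it does not say anything about what a real reduction tree would save in area or power.

```python
def mul_low_only(a, b, n):
    """Low n bits of a*b using only partial-product bits of weight < 2**n.

    Returns (low_result, terms_used). Bit i of a ANDed with bit j of b has
    weight 2**(i+j), so terms with i + j >= n cannot affect the low half.
    """
    total, used = 0, 0
    for i in range(n):
        for j in range(n - i):
            total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
            used += 1
    return total & ((1 << n) - 1), used
```

For n-bit operands the low half uses n(n+1)/2 of the n*n terms, a bit more than half, while the high half needs the remaining terms plus the carries out of the low half.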
The high result needs the low result carry-out but not the rest of
the result. (An approximate multiply high for multiply by
reciprocal might be useful, avoiding the low result work. There
might also be ways that a multiplier could be configured to also
provide bit mixing similar to middle result for generating a
hash?)
I seem to recall a PowerPC implementation did semi-pipelined 32-
bit multiplication 16-bits at a time. This presumably saved area
and power
You save 1/2 of the tree area, but ultimately consume more power.
The power consumption would seem to depend on how frequently both
multiplier and multiplicand are larger than 16 bits. (However, I
seem to recall that the mentioned implementation only checked one
operand.) I suspect that for a lot of code, small values are
common.
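A rough model of the 16-bits-at-a-time scheme with early out, checking only one operand as recalled above. This is a sketch of the general idea under those assumptions, not a description of any particular PowerPC implementation.

```python
def mul32_by_halves(mplier, mcand):
    """Unsigned 32x32 -> 64 multiply, 16 multiplier bits per pass.

    Returns (product, passes). The second pass is skipped when the upper
    16 bits of the multiplier are zero.
    """
    mplier &= 0xFFFFFFFF
    mcand &= 0xFFFFFFFF
    product = (mplier & 0xFFFF) * mcand      # 16-bit by 32-bit pass
    passes = 1
    upper = mplier >> 16
    if upper:                                # early out for small values
        product += (upper * mcand) << 16
        passes += 1
    return product, passes
```

How often the second pass is needed depends on the value distribution of the checked operand, which is the point about small values being common.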
My 66000's CARRY and PRED are "extender prefixes", admittedly
included in the original architecture so compensating for encoding constraints (e.g., not having 36-bit instruction parcels) rather
than microarchitectural or architectural variation.
[snip]
(I feel that encoding some of the dependency information could
be useful to avoid some of this work. In theory, common
dependency detection could also be more broadly useful; e.g.,
operand availability detection and execution/operand routing.)
So useful that it is encoded directly in My 66000 ISA.
How so? My 66000 does not provide any explicit declaration what
operation will be using a result (or where an operand is being
sourced from). Register names express the dependencies so the
dataflow graph is implicit.
I was speculating that _knowing_ when an operand will be available
and where a result should be sent (rather than broadcasting) could
be useful information.
Even with reduced operations per cycle, fusion could still provide
a net energy benefit.
Here I disagree:: but for a different reason::
In order for RISC-V to use a 64-bit constant as an operand, it has
to execute either:: AUIPC-LD to an area of memory containing the
64-bit constant, or a 6-7 instruction stream to build the constant
inline. While an ISA that directly supports 64-bit constants in ISA
does not execute any of those.
Thus, while fusion may save power when seen at the "it's my ISA" level,
when seen from the perspective of "it is directly supported in my ISA"
it wastes power.
Yes, but "computing" large immediates is obviously less efficient
(except for compression); the computation part is known to be
unnecessary. Fusing a comparison and a branch may be a consequence
of bad ISA design in not properly estimating how much work an
instruction can do (and be encoded in available space) and there
is excess decode overhead with separate instructions, but the
individual operations seem to be doing actual work.
I suspect there can be cases where different microarchitectures
would benefit from different amounts of instruction/operation
complexity such that cracking and/or fusion may be useful even in
an optimally designed generic ISA.
[snip]
- register specifier fields are either source or dest, never both
This seems mostly a code density consideration. I think using a
single name for both a source and a destination is not so
horrible, but I am not a hardware guy.
All we HW guys want is that wherever the field is specified,
it is specified in exactly 1 field in the instruction. So, if
field<a..b> is used to specify Rd in one instruction, there is
no other field<!a..!b> that specifies the Rd register. RISC-V blew
this "requirement".
Only with the Compressed extension, I think. The Compressed
extension was somewhat rushed and, in my opinion, philosophically
flawed by being redundant (i.e., every C instruction can be
expanded to a non-C instruction). Things like My 66000's ENTER
provide code density benefits but are contrary to the simplicity
emphasis. Perhaps a Rho (density) extension would have been
better.☺ (The extension letter idea was interesting for an
academic ISA but has been clearly shown to be seriously flawed.)
16-bit instructions could have kept the same register field
placements with masking/truncation for two-register-field
instructions.
Even a non-destructive form might be provided by
different masking or bit inversion for the destination. However,
providing three register fields seems to require significant
irregularity in extracting register names. (Another technique
would be using opcode bits for specifying part or all of a
register name. Some special purpose registers or groups of
registers may not be horrible for compiler register allocation,
but such seems rather funky/clunky.)
It is interesting that RISC-V chose to split the immediate field
for store instructions so that source register names would be in
the same place for all (non-C) instructions.
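The store-immediate split can be seen in a short sketch of the base S-type format: rs1 sits in bits 19:15 and rs2 in bits 24:20, as in the other 32-bit formats, and the 12-bit immediate is split across bits 31:25 and 11:7. The helper names are made up for illustration.

```python
def encode_s(imm12, rs2, rs1, funct3, opcode=0x23):
    """Encode an S-type (store) instruction; 0x23 is the STORE opcode."""
    imm = imm12 & 0xFFF
    return (((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | ((imm & 0x1F) << 7) | opcode)


def source_regs(word):
    """rs1 and rs2 come from the same bit positions for every format."""
    return (word >> 15) & 0x1F, (word >> 20) & 0x1F


def s_imm(word):
    """Reassemble and sign-extend the split store immediate."""
    imm = (((word >> 25) & 0x7F) << 5) | ((word >> 7) & 0x1F)
    return imm - 0x1000 if imm & 0x800 else imm
```

The cost of keeping the register fields fixed is paid in `s_imm`, which has to stitch the immediate back together from two places.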
Comparing an ISA design to RISC-V is not exactly the same as
comparing to "best in class".
On 9/3/2024 3:40 AM, Michael S wrote:
On Tue, 3 Sep 2024 05:55:14 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Tim Rentsch <[email protected]> schrieb:
My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,
Sure, that was also what I was suggesting - define things that
are currently undefined behavior.
with
additional guarantees for what happens in cases that are
undefined behavior.
Guarantees or specifications - no difference there.
Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).
It's the other way around - you need to describe first what the
actual behavior in absence of any pragmas is, and this needs to be
a firm specification, so the programmer doesn't need to read your
mind (or the source code to the compiler) to find out what you
meant. "But it is clear that..." would not be a specification;
what is clear to you may absolutely not be clear to anybody else.
This is also the only chance you'll have of getting this
implemented in one of the current compilers (and let's face it, if
you want high-quality code, you would need that; both LLVM and GCC
have taken an enormous amount of effort up to now, and duplicating
that is probably not going to happen).
The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.
I understood that - defining things that are currently undefined.
But without a specification, that falls down.
So, let's try something that causes some grief - what should
be the default behavior (in the absence of pragmas) for integer
overflow? More specifically, can the compiler set the condition
to false in
int a;
...
if (a > a + 1) {
}
and how would you specify this in an unambiguous manner?
I'd start much earlier, by declaration of "Homogeneity and
Exclusion". It would state that "more defined C" does not pretend
to cover all targets covered by existing C language.
Specifically, following target characteristics are required:
- byte-addressable machine with 8-bit bytes
- two's-complement integer types
- if float type is supported it has to be IEEE-754 binary32
- if double type is supported it has to be IEEE-754 binary64
- if long double type is supported it has to be IEEE-754 binary128
- storage order for multibyte types should be either LE or BE,
consistently for all built-in types
- flat address space
That part should be specified in a more formal manner.
I might add a few things.
ALU:
If integer types overflow, they wrap, with any internal sign or zero extension consistent with the declared type;
If a multiply overflows, the result will contain the low-order bits
of the product, sign or zero extended according to the declared types;
If a variable is shifted left, it will behave as-if it were sign or
zero extended in a way consistent with the type;
If a signed value is shifted right, its high order bits will remain consistent with the original sign bit.
So, in the above example, one could see:
if (a > a + 1) { }
As a hypothetical:
if (a > SignExtend32(a + 1)) { }
Where SignExtend32 returns the input value sign-extended from 32 bits
(a+1 always incrementing the value, but may conceptually either wrap
or go outside the allowed range for 'int', with the sign extension
always returning it to its canonical form, seen as twos complement).
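A small model of the hypothetical SignExtend32 reading, assuming a 32-bit two's-complement 'int'. Python integers are unbounded, so a + 1 really does leave the 32-bit range before being folded back; the function names are just for illustration.

```python
def sign_extend32(x):
    """Keep the low 32 bits of x and reinterpret them as signed."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def a_gt_a_plus_1(a):
    """Evaluate `a > a + 1` with the sum wrapped to a 32-bit int."""
    return a > sign_extend32(a + 1)
```

Under this wrapping rule the comparison is true only when a is the largest representable value, so the compiler could not fold the condition to false.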
I will not define the behavior of shifts greater than or equal to the
width of the integer type, or of negative shifts, as there isn't a
consistent behavior here across targets.
However, I will note that for shifts in a constant expression, it does
seem to be the case that the shift behaves as-if the width were
unbounded, with negative shifts treated as a shift in the opposite
direction, and the result then sign or zero extended in accordance
with the type.
Say, for example, zigzag sign folding:
int32_t i, j, k;
i=somevalue;
j=(i<<1)^(i>>31); //fold sign into LSB
k=(int32_t)((uint32_t)j>>1)^((j<<31)>>31); //unfold; j>>1 must be a logical shift
assert(k==i);
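The same fold can be sketched in portable ISO C by doing the shifts on unsigned values. The function names zigzag_encode32 and zigzag_decode32 are assumptions made for this example. The decode step uses a logical (unsigned) shift of j, since an arithmetic j>>1 would not round-trip values with i >= 2^30 or i < -2^30.

```c
#include <assert.h>
#include <stdint.h>

/* Fold the sign into the LSB: 0, -1, 1, -2, ... map to 0, 1, 2, 3, ... */
uint32_t zigzag_encode32(int32_t i)
{
    uint32_t u = (uint32_t)i;                    /* modular conversion */
    uint32_t sign = (i < 0) ? 0xFFFFFFFFu : 0u;  /* what i>>31 yields with
                                                    an arithmetic shift */
    return (u << 1) ^ sign;
}

/* Inverse of zigzag_encode32. */
int32_t zigzag_decode32(uint32_t j)
{
    /* Logical shift, then flip all bits if the folded sign bit was set. */
    uint32_t u = (j >> 1) ^ (0u - (j & 1u));
    if (u & 0x80000000u)
        return (int32_t)(u - 0x80000000u) + INT32_MIN;
    return (int32_t)u;
}
```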
Memory:
One may freely cast pointers to different types and dereference them,
regardless of types or alignment of said pointers;
Pointers will behave as-if the memory space were a linear array of
bytes, with each value as one or more contiguous bytes in memory;
Structs are normally packed with each member stored sequentially in
memory, with each member padded to its natural alignment, and the
overall struct, if needed, padded to a multiple of the largest member
alignment;
The natural alignment for primitive types is equal to the size of
said primitive type;
The address taken of any variable will have an in-memory layout
consistent with the declared type;
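To make the struct layout rule concrete, here is a small sketch that computes the size the rule predicts and compares it with what offsetof and sizeof report. The struct sample and the helpers align_up and predicted_sample_size are invented for this example, and the expected offsets are those of common ABIs, not something ISO C guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sample {          /* hypothetical example struct */
    uint8_t  tag;        /* offset 0 */
    uint32_t value;      /* padded up to offset 4 */
    uint16_t flags;      /* offset 8 */
};                       /* tail padded to 12, a multiple of 4 */

/* Round offset up to the next multiple of align (a power of two). */
size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Size predicted by the rule: each member aligned to its own size,
   whole struct padded to a multiple of the largest member alignment. */
size_t predicted_sample_size(void)
{
    const size_t sizes[] = { sizeof(uint8_t), sizeof(uint32_t),
                             sizeof(uint16_t) };
    size_t off = 0, max_align = 1;
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        off = align_up(off, sizes[i]) + sizes[i];
        if (sizes[i] > max_align)
            max_align = sizes[i];
    }
    return align_up(off, max_align);
}
```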
...
Implicitly:
Any memory store may potentially alias with any other memory access,
unless:
One or both pointers have the restrict keyword;
It can be reasonably proven that the pointed-to memory locations do
not alias;
A compiler may assume an access is aligned if it can be verified that
no operation has caused the address to become misaligned (though, as
a reservation, if a variable is declared restrict, the compiler may
also assume it is properly aligned for its type).
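As an illustration of the aliasing rule stated above, here is a minimal sketch with two otherwise identical loops. The function names are made up for this example. In the first, the store to dst[i] may change *limit, so the bound has to be re-read on every iteration; in the second, restrict lets the compiler load it once.

```c
#include <assert.h>

/* No restrict: dst[i] may overwrite *limit, so the loop bound must be
   reloaded after every store. Returns the number of elements cleared. */
int clear_may_alias(int *dst, const int *limit)
{
    int i;
    for (i = 0; i < *limit; i++)
        dst[i] = 0;
    return i;
}

/* restrict: the caller promises dst and limit do not overlap, so the
   compiler is free to keep *limit in a register. */
int clear_no_alias(int *restrict dst, const int *restrict limit)
{
    int i;
    for (i = 0; i < *limit; i++)
        dst[i] = 0;
    return i;
}
```

Calling clear_may_alias with limit pointing into the array being cleared is well-defined and stops early once the bound itself is zeroed; doing the same with clear_no_alias would break the restrict promise.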
Granted, there are targets where pointers are assumed aligned by
default and unaligned ones must be declared as such, but there is no
standard way in C to declare an unaligned pointer, and there is code
that assumes the ability to freely de-reference pointers regardless
of alignment.
Though, a less conservative option would be to assume that any normal
pointer variable is aligned by default, but may become unaligned if
it accepts a value created by casting from a type of smaller
alignment (or is assigned a value from a pointer holding such a
value).
char *cs;
int *pi, *pj;
...
pi=(int *)cs; //taints pi with unaligned status.
...
pj=pi; //taints pj with unaligned status via pi
This would still leave it as UB to pass or return a misaligned
pointer across function boundaries (if the pointer is then
de-referenced), or similar for putting them in struct members.
May leave a partial exception for "void *", which may be cast to
another type without causing the result to become unaligned.
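For comparison, the way to express a possibly misaligned access in current ISO C is to go through memcpy, which compilers for targets with unaligned loads typically reduce to a single instruction. A minimal sketch, with helper names made up for this example:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read an int32_t from an address of unknown alignment. */
int32_t load_i32_unaligned(const void *p)
{
    int32_t v;
    memcpy(&v, p, sizeof v);   /* byte-wise copy, no alignment required */
    return v;
}

/* Write an int32_t to an address of unknown alignment. */
void store_i32_unaligned(void *p, int32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Under the "taint" scheme described above, a dereference of a tainted int pointer would presumably have to be compiled as if it were written this way.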
...
Misc:
A function with a missing return value is still required to return as
normal; however, the nature and contents of the value returned will be
undefined (it will be "probably random garbage").
But, would make some reservations:
The relative location and alignment of global variables remains
undefined;
The relative location and alignment of automatic variables remains
undefined;
The nature or the storage of any global or automatic variable whose
address has not been taken, remains undefined;
The nature or identity of any temporary variables created within an
expression, remains undefined;
Calling a function with a missing prototype will remain undefined,
except if both the argument and return types are all primitive types,
the argument types are an exact match and either pointer or integer
types, and the return type is a small integer;
...
Similarly, one likely can't (yet) require that targets be little
endian, but one can make a working assumption that the target is
probably little endian.
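Code that does not want to lean on that working assumption can either probe the byte order at run time or assemble values from bytes explicitly. A minimal sketch of both, with function names invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the lowest-addressed byte of a uint32_t holds its least
   significant bits, 0 otherwise. */
int is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Reads a 32-bit little-endian value from 4 bytes; gives the same
   result on little-endian and big-endian hosts. */
uint32_t load_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```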
...