Forum: >>> Magnum BBS <<<

Computer architects leaving Intel...

From Thomas Koenig@21:1/5 to All on Tue Aug 27 05:29:22 2024

Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.

https://www.tomshardware.com/tech-industry/senior-intel-cpu-architects-splinter-to-develop-risc-v-processors-veterans-establish-aheadcomputing

Maybe a good time to get some developers on board for development.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Tue Aug 27 12:02:40 2024

On Tue, 27 Aug 2024 05:29:22 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.

https://www.tomshardware.com/tech-industry/senior-intel-cpu-architects-splinter-to-develop-risc-v-processors-veterans-establish-aheadcomputing

Maybe a good time to get some developers on board for development.

It looks like exodus from Intel Hillsboro. Hillsboro was #1 and then
#2 (after Haifa) Intel x86 development center in relatively recent
past, but it seems that by now this role firmly belongs to Austin.
It's believable that more ambitious among Intel Hillsboro people are
not happy with that.


From Stephen Fuld@21:1/5 to Thomas Koenig on Tue Aug 27 12:06:16 2024

On 8/26/2024 10:29 PM, Thomas Koenig wrote:

Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.

https://www.tomshardware.com/tech-industry/senior-intel-cpu-architects-splinter-to-develop-risc-v-processors-veterans-establish-aheadcomputing

Maybe a good time to get some developers on board for development.

Or suggest to them that, instead of RISC-V, they should look at My 66000.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From John Dallman@21:1/5 to Koenig on Tue Aug 27 20:59:00 2024

In article <vajo7i$2s028$[email protected]>, [email protected] (Thomas Koenig) wrote:

Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.

They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.

Android is apparently waiting for a new RISC-V instruction set extension;
you can run various Linuxes, but I have not heard about anyone wanting to
do so on a large scale.

John


From Scott Lurndal@21:1/5 to John Dallman on Tue Aug 27 21:04:53 2024

[email protected] (John Dallman) writes:

In article <vajo7i$2s028$[email protected]>, [email protected] (Thomas Koenig) wrote:

Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.

They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.

Ask Si-Five about demand for high-performance risc-v cores.


From MitchAlsup1@21:1/5 to BGB on Tue Aug 27 23:50:56 2024

On Tue, 27 Aug 2024 22:39:02 +0000, BGB wrote:

On 8/27/2024 2:59 PM, John Dallman wrote:

In article <vajo7i$2s028$[email protected]>, [email protected] (Thomas
Koenig) wrote:

Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.

They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.

Making RISC-V "not suck" in terms of performance will probably at least
be easier than making x86-64 "not suck".

Yet, these people have decades of experience building complex things
that made x86 (also) not suck. They should have the "drawing power" to get
more people with similar experiences.

The drawback is that they are competing with "everyone else in
RISC-V-land", and starting several years late.

Android is apparently waiting for a new RISC-V instruction set
extension; you can run various Linuxes, but I have not heard
about anyone wanting to do so on a large scale.

My thoughts for "major missing features" are still:
Needs register-indexed load;
Needs an intermediate size constant load (such as 17-bit sign extended)
in a 32-bit op.

Full access to constants.

Where, there is a sizeable chunk of constants between 12 and 17 bits,
but not quite as many between 17 and 32 (and 32-64 bits is comparably infrequent).

Except in "math codes".

But 64-bit memory reference displacements means one does not have to
even bother to have a strategy of what to do when you need a single
FORTRAN common block to be 74GB in size in order to run 5-decade old
FEM codes.
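To make the constant-size classes above concrete, here is a minimal Python sketch (not from the thread). It computes the narrowest two's-complement field that holds a constant and sorts it into a 12-, 17-, 32- or 64-bit class. The function names and bucket labels are made up for illustration, and the 74GB figure is taken as 74 * 10**9 bytes, which is an assumption.

```python
def signed_bits_needed(value: int) -> int:
    """Width in bits of the narrowest two's-complement field that holds value."""
    if value >= 0:
        return value.bit_length() + 1   # magnitude bits plus one sign bit
    return (~value).bit_length() + 1    # ~value == -value - 1 for negatives


def immediate_bucket(value: int) -> str:
    """Name the smallest of the discussed immediate classes that fits value."""
    bits = signed_bits_needed(value)
    for width in (12, 17, 32, 64):
        if bits <= width:
            return f"imm{width}"
    raise ValueError("constant does not fit in 64 bits")


if __name__ == "__main__":
    # Boundary cases around each class, plus a 74 GB displacement.
    for constant in (2047, 2048, -2049, 65535, 65536, 74 * 10**9):
        print(constant, signed_bits_needed(constant), immediate_bucket(constant))
```

Under these assumptions a 74GB displacement needs 38 signed bits, so it falls in the 64-bit class, beyond what a 32-bit displacement can reach.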

I could also make a case for an instruction to load a Binary16 value and convert to Binary32 or Binary64 in an FPR, but this is arguably a bit
niche (but, would still beat out using a memory load).

Most of these are covered by something like::

CVTSD Rd,#1 // 32-bit instruction
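For readers unfamiliar with the Binary16 case, a small Python sketch (not from the thread) of what "load a Binary16 value and widen it" means in software terms. It uses the standard `struct` module's `e` format for IEEE 754 half precision; the helper name `load_binary16` and the little-endian byte order are assumptions for illustration.

```python
import struct


def load_binary16(memory: bytes, offset: int = 0) -> float:
    """Read a little-endian IEEE 754 Binary16 value and widen it.

    Python floats are Binary64, so the result is the exact Binary64
    equivalent of the 16-bit value stored in memory.
    """
    (value,) = struct.unpack_from("<e", memory, offset)
    return value


if __name__ == "__main__":
    # 0x3E00 is 1.5 in Binary16: sign 0, biased exponent 15, top mantissa bit set.
    print(load_binary16(bytes.fromhex("003e")))
```

Every Binary16 value is exactly representable in Binary32 and Binary64, so the widening step never rounds.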

Big annoying thing with it, is that to have any hope of adoption, one
needs an "actually involved" party to add it. There doesn't seem to be
any sort of aggregated list of "known in-use" opcodes, or any real
mechanism for "informal" extensions.

With the OpCode space already 98% filled there does not need to
be such a list.

The closest we have on the latter point is the "Composable Extensions" extension by Jan Gray, which seems to be mostly that part of the ISA's encoding space can be banked out based on a CSR or similar.

Though, bigger immediate values and register-indexed loads do arguably
better belong in the base ISA encoding space.

Agreed, but there is so much more.

FCMP Rt,#14,R19 // 32-bit instruction
ENTER R16,R0,#400 // 32-bit instruction
..

At present, I am still on the fence about whether or not to support the
C extension in RISC-V mode in the BJX2 Core, mostly because the encoding scheme just sucks bad enough that I don't really want to deal with it.

Realistically, can't likely expect anyone else to adopt BJX2 though.

Captain Obvious strikes again.

Though, bigger issue might be how to make it able to access hardware
devices (seems like part of the physical address space is used as a
PCI Config space, and would need to figure out what sorts of devices the
Linux kernel expects to be there in such a scenario).

It is reasons like this that cause My 66000 to have four 64-bit address
spaces {DRAM, MMI/O, configuration, ROM}. PCIe MMI/O space can easily
exceed 42-bits before one throws MR-IOV at the problem. Configuration
headers in My 66000 contain all the information CPUID has in x86-land.


From MitchAlsup1@21:1/5 to BGB on Wed Aug 28 16:40:24 2024

On Wed, 28 Aug 2024 3:33:40 +0000, BGB wrote:

On 8/27/2024 6:50 PM, MitchAlsup1 wrote:

On Tue, 27 Aug 2024 22:39:02 +0000, BGB wrote:

On 8/27/2024 2:59 PM, John Dallman wrote:

In article <vajo7i$2s028$[email protected]>, [email protected] (Thomas
Koenig) wrote:

Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.

They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.

Making RISC-V "not suck" in terms of performance will probably at least
be easier than making x86-64 "not suck".

Yet, these people have decades of experience building complex things
that made x86 (also) not suck. They should have the "drawing power" to get
more people with similar experiences.

The drawback is that they are competing with "everyone else in
RISC-V-land", and starting several years late.

Though, if anything, they probably have the experience to know how to
make things like the fabled "opcode fusion" work without burning too
many resources.

Android is apparently waiting for a new RISC-V instruction set
extension; you can run various Linuxes, but I have not heard
about anyone wanting to do so on a large scale.

My thoughts for "major missing features" are still:
Needs register-indexed load;
Needs an intermediate size constant load (such as 17-bit sign extended)
in a 32-bit op.

Full access to constants.

That would be better, but is unlikely within the existing encoding constraints.

But, say, if one burned one of the remaining unused "OP Rd, Rs, Imm12s" encodings as an Imm17s, well then...

Dropping compressed instructions gives enough OpCode room to put the
entire My 66000 ISA in what remains.

With the OpCode space already 98% filled there does not need to
be such a list.

One would still need it if multiple parties want to be able to define an extension independently of each other and not step on the same
encodings.

And what kind of code compatibility would you have between different
designs...

The closest we have on the latter point is the "Composable Extensions"
extension by Jan Gray, which seems to be mostly that part of the ISA's
encoding space can be banked out based on a CSR or similar.

Though, bigger immediate values and register-indexed loads do arguably
better belong in the base ISA encoding space.

Agreed, but there is so much more.

FCMP Rt,#14,R19 // 32-bit instruction
ENTER R16,R0,#400 // 32-bit instruction
..

These are likely a bit further down the priority list.

Prolog/Epilog happens once per function, and often may be skipped for
small leaf functions, so seems like a lower priority. More so, if one
lacks a good way to optimize it much beyond the sequence of load/store
ops which it would be replacing (and maybe not a way to do it much
faster than however many can be moved in a single clock cycle with the
available register ports).

My 1-wide machines do ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
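The arithmetic behind that comparison can be sketched in a few lines of Python (not from the thread). This is a toy model that counts only the cycles spent moving registers and ignores address generation, cache misses and everything else; the function name and the 16-register example are assumptions.

```python
def transfer_cycles(register_count: int, registers_per_cycle: int) -> int:
    """Cycles needed to move register_count registers to or from memory."""
    if register_count < 0 or registers_per_cycle < 1:
        raise ValueError("need a non-negative count and a positive rate")
    # Ceiling division without floating point.
    return -(-register_count // registers_per_cycle)


if __name__ == "__main__":
    saved = 16  # hypothetical number of registers saved in a prologue
    print("one store per cycle: ", transfer_cycles(saved, 1))
    print("four registers/cycle:", transfer_cycles(saved, 4))
```

In this model, saving 16 registers takes 16 cycles as individual stores on a 1-wide machine and 4 cycles at four registers per cycle.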

At present, I am still on the fence about whether or not to support the
C extension in RISC-V mode in the BJX2 Core, mostly because the encoding
scheme just sucks bad enough that I don't really want to deal with it.

Realistically, can't likely expect anyone else to adopt BJX2 though.

Captain Obvious strikes again.

This is likely the fate of nearly every hobby class ISA.

Time to up your game to an industrial quality ISA.


From Scott Lurndal@21:1/5 to John Dallman on Wed Aug 28 18:28:44 2024

[email protected] (John Dallman) writes:

In article <VbrzO.74199$[email protected]>, [email protected] (Scott Lurndal) wrote:

[email protected] (John Dallman) writes:

They're presumably intending to develop high-performance cores,
since they have substantial experience in doing that for x86-64.
The question is if demand for those will develop.

Ask Si-Five about demand for high-performance risc-v cores.

SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.
Ahead Computing are presumably not expecting to deliver IP cores for a
year or two, so /maybe/ they have reasons to expect demand then.

But it's also possible they just want to carry on being chip architects
while being in charge of their own company. If so, adopting RISC-V is
more credible in the short term than starting to design a new ISA as a
commercial project. Intel won't sell them an x86 license at any
reasonable price.

Thinking a bit more, they may be trying to go the Nuvia route: design
original cores for an existing ISA and get bought out. Nuvia were bought
by Qualcomm for their ARMv9-A core IP well before they released anything.
If Ahead were to successfully design a fast RISC-V core with
power:performance that was competitive with ARM, /Intel/ might well buy
them.

Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for something to
compete with ARM after having accepted you can't get power:performance to
match ARM out of x86-64. Then it all went quiet, and Intel didn't
manufacture the SiFive SoC ("Horse Creek") that was supposed to blaze the
trail for RISC-V as a consumer and/or enterprise architecture.

The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.

If you were a discontented Intel senior engineer, demonstrating that you
could produce what Intel needed, getting your company bought and you
brought back to Intel in a more senior position might seem worth trying.

Perhaps, but the last few decades are littered with failed similar attempts.

(the exceptions, starting with Amdahl, _are_ notable for not being
re-absorbed, but rather for striking out solo successfully).


From John Dallman@21:1/5 to Lurndal on Wed Aug 28 19:17:00 2024

In article <VbrzO.74199$[email protected]>, [email protected] (Scott Lurndal) wrote:

[email protected] (John Dallman) writes:

They're presumably intending to develop high-performance cores,
since they have substantial experience in doing that for x86-64.
The question is if demand for those will develop.

Ask Si-Five about demand for high-performance risc-v cores.

SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.
Ahead Computing are presumably not expecting to deliver IP cores for a
year or two, so /maybe/ they have reasons to expect demand then.

But it's also possible they just want to carry on being chip architects
while being in charge of their own company. If so, adopting RISC-V is
more credible in the short term than starting to design a new ISA as a commercial project. Intel won't sell them an x86 license at any
reasonable price.

Thinking a bit more, they may be trying to go the Nuvia route: design
original cores for an existing ISA and get bought out. Nuvia were bought
by Qualcomm for their ARMv9-A core IP well before they released anything.
If Ahead were to successfully design a fast RISC-V core with
power:performance that was competitive with ARM, /Intel/ might well buy
them.

Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for something to compete with ARM after having accepted you can't get power:performance to
match ARM out of x86-64. Then it all went quiet, and Intel didn't
manufacture the SiFive SoC ("Horse Creek") that was supposed to blaze the
trail for RISC-V as a consumer and/or enterprise architecture.

If you were a discontented Intel senior engineer, demonstrating that you
could produce what Intel needed, getting your company bought and you
brought back to Intel in a more senior position might seem worth trying.

John


From John Dallman@21:1/5 to Lurndal on Wed Aug 28 19:49:00 2024

In article <w%JzO.33560$[email protected]>, [email protected] (Scott Lurndal) wrote:

Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for
something to compete with ARM after having accepted you can't
get power:performance to match ARM out of x86-64. Then it all
went quiet, and Intel didn't manufacture the SiFive SoC
("Horse Creek") that was supposed to blaze the trail for
RISC-V as a consumer and/or enterprise architecture.

The problem with this is that RISC-V isn't currently comparable, feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which
doesn't exist in the RISC-V design space yet.

Open-source design of the ISA has delivered an architecture suitable for teaching, its original purpose, but has failed to promptly deliver the dull-but-necessary features for large-scale systems? I'm shocked!

Surely SiFive should have done this work, if they'd known what they were
doing in competing with ARM?

John


From Thomas Koenig@21:1/5 to Scott Lurndal on Wed Aug 28 18:55:14 2024

Scott Lurndal <[email protected]> schrieb:

The problem with this is that RISC-V isn't currently comparable, feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.

What is missing (in broad terms)?


From Scott Lurndal@21:1/5 to Thomas Koenig on Wed Aug 28 20:46:08 2024

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.

What is missing (in broad terms)?

NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive. Many of them are related to supporting server-grade
RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
or accelerator interfaces (ST64B, LD64B).

Moreover, they have a mature SoC ecosystem including a well-
defined and highly capable interrupt controller, an I/O
MMU, a high-speed processor interconnect (CHI), a standard debug
infrastructure (coresight), embedded logic analyzer (ELA),
network on chip (NIC-700), et alia.

https://developer.arm.com/documentation/107997/latest


From Anton Ertl@21:1/5 to John Dallman on Thu Aug 29 11:51:24 2024

[email protected] (John Dallman) writes:

In article <VbrzO.74199$[email protected]>, [email protected] (Scott Lurndal) wrote:

[email protected] (John Dallman) writes:

They're presumably intending to develop high-performance cores,
since they have substantial experience in doing that for x86-64.
The question is if demand for those will develop.

Ask Si-Five about demand for high-performance risc-v cores.

SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.

Or maybe there was some other reason that the investor money did not
flow as plentifully as it used to, and so SiFive put the most far-out
projects on the back-burner.

Concerning the demand, RISC-V has the advantage of no ARM tax (and
legal costs like those between ARM and Qualcomm over the developments
started at NUVIA) or the question of AMD64 licensing to third parties.

Another RISC-V advantage is that the government of the USA puts
restrictions on ARM that should not apply to the free RISC-V
architecture.

It would apply to implementations designed in the USA (such as those
by Ahead), but the point is that on the ISA level, and thus the buy-in
into the ecosystem (e.g., from ISVs), RISC-V has an advantage.

RISC-V also has a technical advantage over ARM: It has Ztso (total
store order) as an optional extension, which helps porting of
multi-threaded software from AMD64 (and emulation of AMD64 software).
No such thing on ARMv8 or ARMv9 yet, although implementations like the
Apple M1 and Fujitsu A64FX provide this feature.

Ahead Computing are presumably not expecting to deliver IP cores for a
year or two

Three years sounds overly optimistic. Nuvia was founded in 2019,
acquired in 2021, and hardware has been delivered in 2024, very much
in line with the often-read number of 5 years for CPU design projects.

But it's also possible they just want to carry on being chip architects
while being in charge of their own company.

Sure. But what are the investors seeing in the company?

If so, adopting RISC-V is
more credible in the short term than starting to design a new ISA as a
commercial project.

Certainly. Establishing another ISA is hard, because it requires
buy-in from many forces for lasting success. Even if an architecture
has a long track record, like MIPS, that's not enough, as the switch
from the MIPS ISA to RISC-V shows.

RISC-V has quite a bit of mindshare, it lacks the ARM tax, and with
the government of the USA hampering ARM, the RISC-V future looks even
brighter. They still have quite a way to go.

Thinking a bit more, they may be trying to go the Nuvia route: design
original cores for an existing ISA and get bought out.

Probably. Getting bought is a common outcome of a successful startup.

Nuvia were bought
by Qualcomm for their ARMv9-A core IP well before they released anything.

What I read is that the Snapdragon X implements ARM v8.7.

If Ahead were to successfully design a fast RISC-V core with
power:performance that was competitive with ARM, /Intel/ might well buy
them.

Yes, or somebody else, as happened with Nuvia.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to John Dallman on Thu Aug 29 13:17:55 2024

[email protected] (John Dallman) writes:

Android is apparently waiting for a new RISC-V instruction set extension;

Which one?

you can run various Linuxes, but I have not heard about anyone wanting to
do so on a large scale.

You may not consider it large-scale, but we wanted to have two RISC-V
servers for teaching (in particular, for the compiler course). Some
years earlier we had written that into a "future plans" document, and
in 2022 we got the request to buy them now, because the period that
was covered in that document was coming to an end. Of course at the
time the best RISC-V thing to be had was the Visionfive V1, which was
cheap, but too weak for our purposes (cross-compiling would have been
possible, but we did not want to go there).

So we eventually settled on two servers based on the Rocket Lake,
which at least gave us AVX-512 (the deadline was too early for Zen4).

Now it's two years later, and the RISC-V servers are still not showing
up. We'll see how things look when it's time to retire the Rocket
Lakes (their predecessors were good for a decade).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Scott Lurndal on Thu Aug 29 13:47:58 2024

[email protected] (Scott Lurndal) writes:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.

What is missing (in broad terms)?

NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive.

I think the lack of "extensive" features is a feature of RISC-V. Last
I heard, the ARM manual was >10000 pages.

The RISC-V user manual has put on a lot of weight since Volume I
(unprivileged) Version 2.2 (145 pages) and Volume II (privileged)
20211203 (155 pages). The 20240411 draft of Volume I weighs in at 670
pages, and the 20240411 draft of Volume II at 172 pages, but that's
still quite a long way from 10000.

One interesting case here is that the 236-page version
20190608-Base-Ratified of Volume I spends 12 pages on Chapter 14
"RVWMO Memory Consistency Model, Version 0.1" plus 30 pages for
"Appendix A RVWMO Explanatory Material, Version 0.1" plus 27 pages on
"Appendix B Formal Memory Model Specifications, Version 0.1"
(apparently not grown further in 20240411; the number of pages is a
little smaller for each of the parts).

If the goal of RISC-V was a really simple ISA (as in "simple to
specify"), they would have gone for sequential consistency, but
obviously the lure of implementation simplicity won out here.

Many of them are related to supporting server-grade
RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
or accelerator interfaces (ST64B, LD64B).

Can't say I ever missed such instructions.

Are RAS instructions like memory-ordering instructions? The hardware
does not provide the feature, but it provides instructions for
throwing the problem over to software, which is then supposed to use
those instructions (but not too often) to provide the feature that
hardware does not provide?

Moreover, they have a mature SoC ecosystem

ARM certainly has that. However, a lot of the SoC ecosystem is only
accessed through drivers that are specific to one kernel and that
nobody maintains, and that's why many smartphones don't get any
updates after a few years. Let's hope it's better for servers.

One hope is that the openness of RISC-V will also create a more open
ecosystem that will result in drivers in mainline Linux. But my guess
is that for smartphones, the economic incentives are in the other
direction. For servers things may be better, though (even on ARM).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 29 15:06:45 2024

[email protected] (Anton Ertl) writes:

[email protected] (Scott Lurndal) writes:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.

What is missing (in broad terms)?

NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive.

I think the lack of "extensive" features is a feature of RISC-V. Last
I heard, the ARM manual was >10000 pages.

Actually considerably more if you consider all the related IP
such as the GIC, SMMU and others.

The RISC-V user manual has put on a lot of weight since Volume I
(unprivileged) Version 2.2 (145 pages) and Volume II (privileged)
20211203 (155 pages). The 20240411 draft of Volume I weighs in at 670
pages, and the 20240411 draft of Volume II at 172 pages, but that's
still quite a long way from 10000.

I think comparing manual pages is somewhat pointless.

<snip>

Many of them are related to supporting server-grade
RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
or accelerator interfaces (ST64B, LD64B).

Can't say I ever missed such instructions.

They are architectural features. They may, or may not, require
additional instructions.

The RAS feature is a framework that software can rely on for
any implementation of an ARM SoC regardless of vendor.

Are RAS instructions like memory-ordering instructions?

There is one instruction specific to RAS. ESB, which is a
barrier instruction synchronizing error events.

Moreover, they have a mature SoC ecosystem

ARM certainly has that. However, a lot of the SoC ecosystem is only
accessed through drivers that are specific to one kernel and that
nobody maintains, and that's why many smartphones don't get any
updates after a few years. Let's hope it's better for servers.

It is far better for servers. The SBSA specification, for example,
is designed specifically to support standard software interfaces to
the hardware/firmware. Microsoft, Ubuntu, Redhat et alia are
all involved in the creation and maintenance of that and related
specifications along with the ARM processor vendors.


From MitchAlsup1@21:1/5 to BGB on Thu Aug 29 16:23:19 2024

On Thu, 29 Aug 2024 3:36:44 +0000, BGB wrote:

On 8/28/2024 11:40 AM, MitchAlsup1 wrote:

On Wed, 28 Aug 2024 3:33:40 +0000, BGB wrote:

And what kind of code compatibility would you have between different
designs...

If people can agree as to the encodings, then implementations are more
free to pick which extensions they want or don't want.

If the encodings conflict with each other, no such free choice is
possible.
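What "conflicting encodings" means can be checked mechanically. The Python sketch below (not from the thread) describes each instruction by a mask/match pair, the convention used by RISC-V opcode tables: a 32-bit word decodes as the instruction when `word & mask == match`. Two definitions collide if some word satisfies both. The ADD and SUB values are the standard RV32I ones; the two custom-0 entries are hypothetical extensions invented for illustration.

```python
def encodings_overlap(mask_a: int, match_a: int, mask_b: int, match_b: int) -> bool:
    """True if some instruction word decodes as both A and B.

    Assumes each match value only has bits set inside its own mask.
    The two collide exactly when they agree on every bit that both fix.
    """
    common = mask_a & mask_b
    return (match_a & common) == (match_b & common)


# Standard RV32I encodings: same opcode and funct3, different funct7.
ADD = (0xFE00707F, 0x00000033)
SUB = (0xFE00707F, 0x40000033)

# Hypothetical vendor extensions, both placed in the custom-0 opcode (0x0B).
VENDOR_X = (0x0000707F, 0x0000000B)  # fixes opcode and funct3 = 0
VENDOR_Y = (0xFE00007F, 0x0200000B)  # fixes opcode and funct7 = 1

if __name__ == "__main__":
    print("ADD vs SUB collide:", encodings_overlap(*ADD, *SUB))
    print("X vs Y collide:    ", encodings_overlap(*VENDOR_X, *VENDOR_Y))
```

ADD and SUB differ in a bit that both fix, so they never collide. The two hypothetical vendor encodings fix different fields, so a word with funct3 = 0 and funct7 = 1 matches both, which is the clash a shared list of in-use opcodes would be meant to prevent.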

With differing instructions, how does a software vendor write software
such that it can run near optimally on any implementation ??

Prolog/Epilog happens once per function, and often may be skipped for
small leaf functions, so seems like a lower priority. More so, if one
lacks a good way to optimize it much beyond the sequence of load/store
ops which it would be replacing (and maybe not a way to do it much
faster than however many can be moved in a single clock cycle with the
available register ports).

My 1-wide machines do ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.

It likely isn't going to happen because a 1-wide machine isn't going to
have the needed register ports.

3R1W most of the time converts to 4R or 4W for the *logues.

But, if one doesn't have the register ports, there is likely no viable
way to move 4 registers/cycle to/from memory (and it wouldn't make sense
for the register file to have a path to memory that is wider than what
the pipeline has).

---------------

This is likely the fate of nearly every hobby class ISA.

Time to up your game to an industrial quality ISA.

Open question of what an "industrial quality" ISA has that BJX2 lacks...
Limiting the scope to things that RISC-V and ARM have.

Proper handling of exceptions (ignoring them is not proper)
Proper IEEE 754-2018 handling of FMAC (compute all the bits)
Floating Point Transcendentals
HyperVisors/Secure Monitors
Write Interrupt service routines entirely in HLL
proper Privileges and Priorities
Multi-location ATOMIC events
..


From Thomas Koenig@21:1/5 to Scott Lurndal on Fri Aug 30 06:12:02 2024

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.

What is missing (in broad terms)?

NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive.

Is there any way to get that list? I've looked, but I only got rough
overview articles and links to the full documentation, which is fairly overwhelming.


From John Dallman@21:1/5 to BGB on Fri Aug 30 09:05:00 2024

In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:

On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

With differing instructions, how does a software vendor write
software such that it can run near optimally on any implementation?

They presumably target whatever is common, or the least common
denominator (such as RV64G or RV64GC), and settle with "probably
good enough"...

ISVs can be proactive or passive about adopting a new ISA. Anyone
promoting a new ISA wants to motivate them to be proactive, but faces
problems with prerequisites:

* Who can work with simulators, and who needs hardware?
* Different kinds of software need more or less powerful hardware.
* Application people need an OS and development tools at minimum.
* Quite often they need other software: math libraries, databases, etc.

But, probably not too much different from other ISAs, just with a
lot more parties involved.

Variant ISAs create fear, uncertainty and doubt, and that means delay.
ISA promoters fear delay, because their investors will run out of
patience.

The alternative is that one expects that all the software be
rebuilt for the specific configuration being used,

ISVs /really/ don't like that. It multiplies their testing and QA and
those are expensive. It rarely shows up problems, but convincing
themselves to do without it is hard for them.

or recompiled from source or some other distribution format on
the local machine which it is to be run (with binaries distributed
as some form of "portable IR").

ISVs get sceptical about that, because it's generating code they have not tested.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Dallman on Fri Aug 30 09:38:02 2024

John Dallman <[email protected]> schrieb:

In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:

On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

With differing instructions, how does a software vendor write
software such that it can run near optimally on any implementation?

They presumably target whatever is common, or the least common
denominator (such as RV64G or RV64GC), and settle with "probably
good enough"...

ISVs can be proactive or passive about adopting a new ISA.

What is an ISV? I assume "SV" is for "software vendor", but what
does the I stand for?

[...]

Variant ISAs create fear, uncertainty and doubt, and that means delay.
ISA promotors fear delay, because their investors will run out of
patience.

Which makes me wonder why companies such as Intel introduce new
instructions all the time. For people who compile their own code
(scientists and engineers) that can be OK, they can just use
-march=native (or equivalent), and it can even make sense to have
architecture-optimized core libraries such as BLAS, or switch on
availability of features such as AVX512 (but that again has many
sub-features and highly different performance characteristics,
depending on the micro-arch).

But standard software (office applications, browsers...) should
just run everywhere, and there it gets hard to justify.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Fri Aug 30 13:48:25 2024

On Fri, 30 Aug 2024 09:38:02 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

John Dallman <[email protected]> schrieb:

In article <vaqgtl$3526$[email protected]>, [email protected] (BGB)
wrote:

On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

ISVs can be proactive or passive about adopting a new ISA.

What is an ISV? I assume "SV" is for "software vendor", but what
does the I stand for?

https://en.wikipedia.org/wiki/Independent_software_vendor

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Thomas Koenig on Fri Aug 30 10:26:38 2024

Thomas Koenig <[email protected]> writes:

John Dallman <[email protected]> schrieb:

[...]

What is an ISV? I assume "SV" is for "software vendor", but what
does the I stand for?

<https://en.wikipedia.org/wiki/Independent_software_vendor>

Variant ISAs create fear, uncertainty and doubt, and that means delay.
ISA promotors fear delay, because their investors will run out of
patience.

Which makes me wonder why companies such as Intel introduce new
instructions all the time.

AMD64 already has the buy-in of application vendors for desktops and
servers, so it does not have the problem that extensions create
uncertainty among application vendors.

My guess is that there are the following motivations:

1) The new instructions make technical sense (for certain
applications).

2) Even if the applications that the users use don't benefit from the
extensions, the users think (thanks also to Intel's marketing) that
they might (because of 1); maybe not today, but maybe the next version
or maybe the application that the user will run in a year or two. And
I certainly have seen reports that this or that game does not work on
K10 or whatever because the game uses some SSE4.2 instruction that the
K10 does not have. Intel could have increased this kind of
obsolescence (and the resulting new sales) through instruction set
extensions by supporting AVX across the board early on (as AMD did),
and later by supporting AVX512 across the board, but Intel marketing
apparently thinks it's better to get people to buy Core-branded rather
than Pentium-branded CPUs by disabling AVX for a long time on the
latter.

3) I expect that Intel patents the extensions. So these days
everybody could build an AMD64 CPU, because the patent has expired,
but nobody wants to buy such a CPU without the extensions (because of
2), and the extensions are patented.

and it can even make sense to have
architecture-optimized core libraries such as BLAS, or switch on
availability of features such as AVX512

Yes. And given that a lot of software uses some library or other, a
lot of software may benefit from the extensions. Of course, the
question is how big the benefit is.

E.g., glibc has many different versions of memcpy() and memmove() and
selects among them based on the actual CPU used in the run, thanks to

But standard software (office applications, browsers...) should
just run everywhere, and there it gets hard to justify.

That will also benefit from libraries.
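The run-time selection described above for glibc's memcpy() can be sketched in a few lines of C. This is only an illustration of the general idea, not glibc's actual mechanism: the names copy_generic, copy_wide, cpu_has_avx2 and select_copy are made up for this example, copy_wide is a plain-C stand-in for a vectorized routine, and the feature test assumes GCC or Clang on x86 (on other targets the sketch simply falls back to the generic variant).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Common signature for all copy variants (hypothetical name). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline variant: one byte at a time, runs on any CPU. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(d != NULL && s != NULL);
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Stand-in for a vectorized variant: moves 32-byte blocks, the width
   of an AVX2 register. A real one would use intrinsics or assembly. */
static void *copy_wide(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;
    assert(d != NULL && s != NULL);
    for (; i + 32 <= n; i += 32)
        memcpy(d + i, s + i, 32);
    for (; i < n; i++)          /* tail shorter than one block */
        d[i] = s[i];
    return dst;
}

/* Ask the CPU what it supports. The builtins are GCC/Clang x86
   extensions; everywhere else we report "no" and stay generic. */
static int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Pick a variant once; callers keep the pointer and call through it. */
copy_fn select_copy(void)
{
    return cpu_has_avx2() ? copy_wide : copy_generic;
}
```

The application calls select_copy() once at startup and then uses the returned pointer, so the same binary runs on CPUs with and without the extension. The cost is the one the ISV people in this thread worry about: each variant is a separate code path that wants testing.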

For browsers the JavaScript and WASM JIT compiler can generate code
specific to the extensions present in the hardware; however, no ISA
extension comes to my mind that a JavaScript or current WASM JIT
compiler will benefit from; IIRC there is discussion about explicit
vector stuff in WASM, and there the extensions may make a difference.

Also, a friend who works on a JavaVM JIT told me he is working on
auto-vectorization, but I don't know if they really went for that;
auto-vectorization is not just the wrong approach, it also seems
particularly inappropriate for JIT compilers, because it requires a
lot of analysis, i.e., compile time.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Fri Aug 30 14:52:46 2024

On Fri, 30 Aug 2024 10:26:38 GMT
[email protected] (Anton Ertl) wrote:

Thomas Koenig <[email protected]> writes:

John Dallman <[email protected]> schrieb:

[...]

What is an ISV? I assume "SV" is for "software vendor", but what
does the I stand for?

<https://en.wikipedia.org/wiki/Independent_software_vendor>

Variant ISAs create fear, uncertainty and doubt, and that means
delay. ISA promotors fear delay, because their investors will run
out of patience.

Which makes me wonder why companies such as Intel introduce new
instructions all the time.

AMD64 already has the buy-in of application vendors for desktops and
servers, so it does not have the problem that extensions create
uncertainty among application vendors.

My guess is that there are the following motivations:

1) The new instructions make technical sense (for certain
applications).

2) Even if the applications that the users use don't benefit from the
extensions, the users think (thanks also to Intel's marketing) that
they might (because of 1); maybe not today, but maybe the next version
or maybe the application that the user will run in a year or two. And
I certainly have seen reports that this or that game does not work on
K10 or whatever because the game uses some SSE4.2 instruction that the
K10 does not have. Intel could have increased this kind of
obsolescence (and the resulting new sales) through instruction set
extensions by supporting AVX across the board early on (as AMD did),
and later by supporting AVX512 across the board, but Intel marketing
apparently thinks it's better to get people to buy Core-branded rather
than Pentium-branded CPUs by disabling AVX for a long time on the
latter.

I wish it were only marketing, i.e. only fuses in big-core-derived
Pentiums and Celerons.
Unfortunately, the bigger problem was poor work (laziness) by Intel's
engineering, which didn't have AVX, or any form of VEX decoding, in their
Atom line until Gracemont.
It's not marketing, it's the engineers, who produced a quite capable core
like Tremont with a level of ISA support 10 years behind its time.

3) I expect that Intel patents the extensions. So these days
everybody could build an AMD64 CPU, because the patent has expired,
but nobody wants to buy such a CPU without the extensions (because of
2), and the extensions are patented.

and it can even make sense to have
architecture-optimized core libraries such as BLAS, or switch on
availability of features such as AVX512

Yes. And given that a lot of software uses some library or other, a
lot of software may benefit from the extensions. Of course, the
question is how big the benefit is.

E.g., glibc has many different versions of memcpy() and memmove() and
selects among them based on the actual CPU used in the run, thanks to

But standard software (office applications, browsers...) should
just run everywhere, and there it gets hard to justify.

That will also benefit from libraries.

For browsers the JavaScript and WASM JIT compiler can generate code
specific to the extensions present in the hardware; however, no ISA
extension comes to my mind that a JavaScript or current WASM JIT
compiler will benefit from;

More convenient FP->Int conversion than what is available in SSE3.
Also, I'd guess, due to non-destructive ops scalar DPFP code could be
sometimes more compact with AVX encoding than with SSE2 encoding.

IIRC there is discussion about explicit
vector stuff in WASM, and there the extensions may make a difference.

Also, a friend who works on a JavaVM JIT told me he is working on
auto-vectorization, but I don't know if they really went for that;
auto-vectorization is not just the wrong approach, it also seems
particularly inappropriate for JIT compilers, because it requires a
lot of analysis, i.e., compile time.

- anton

I agree in the case of JS. Not so much in the case of Enterprise Java.
OTOH, personally I care about the performance of JS and don't care at all
about Enterprise Java. I would think that the great majority of the world
is like me in that regard, but maybe the majority is not so great among
those who sign checks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Fri Aug 30 12:07:04 2024

Michael S <[email protected]> writes:

On Fri, 30 Aug 2024 10:26:38 GMT
[email protected] (Anton Ertl) wrote:

Intel could have increased this kind of
obsolescence (and the resulting new sales) through instruction set
extensions by supporting AVX across the board early on (as AMD did),
and later by supporting AVX512 across the board, but Intel marketing
apparently thinks it's better to get people to buy Core-branded rather
than Pentium-branded CPUs by disabling AVX for a long time on the
latter.

I wish it were only marketing, i.e. only fuses in big-core-derived
Pentiums and Celerons.
Unfortunately, the bigger problem was poor work (laziness) by Intel's
engineering, which didn't have AVX, or any form of VEX decoding, in their
Atom line until Gracemont.

Intel has certainly disabled AVX in Pentiums and Celerons that used
the P-cores (e.g., Skylake-based Pentiums). That's purely marketing.

Concerning the "Atom"-based processors, it seems to me that they were
not lazy, they did what they were told, and they were told not to
implement AVX. Admittedly, this saves a little area and maybe a
little power, but the AMD Jaguar (2013) included AVX and went for the
same market segment as the Intel Silvermont (2013). And not just
Silvermont excluded AVX, so did Goldmont (2016), Goldmont+ (2017), and
Tremont (2020), and also the contemporaneous P-core-based Pentiums and
Celerons. Apparently the idea was that AVX/AVX2 and AVX-512 are
premium features.

One interesting case is the Xeon E-2400 line. On these CPUs only the
P-Cores are enabled, they are server processors, and yet Intel
disabled AVX-512 (which the Xeon E-2300 line has). I wonder what the
reasoning behind that decision was.

It's not marketing, it's the engineers, who produced a quite capable core
like Tremont with a level of ISA support 10 years behind its time.

If their bosses tell them to create a core without AVX, what should
they do? (Answer: Found Ahead! :-) If their bosses had asked them to
create a core with AVX, would they have rebelled out of laziness? I
doubt it.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Thomas Koenig on Fri Aug 30 14:30:22 2024

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

The problem with this is that RISC-V isn't currently comparable,
feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
they'll need to support a similar feature set - most of which doesn't
exist in the RISC-V design space yet.

What is missing (in broad terms)?

NeoverseN3 is ARMv9.2. The list of ISA features from V8.0 to v9.2 is
quite extensive.

Is there any way to get that list? I've looked, but I only got rough
overview articles and links to the full documentation, which is fairly
overwhelming.

Chapter A2 (A-Profile Extensions) of DDI0487 (ARM ARM) gives a nice list
for each architecture version.

https://developer.arm.com/documentation/ddi0487/latest/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Fri Aug 30 15:48:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Concerning the demand, RISC-V has the advantage of no ARM tax (and
legal costs like those between ARM and Qualcomm over the
developments started at NUVIA)

True, although the market for high-performance application cores is
less price-sensitive than the market for low-performance embedded ones.

Another RISC-V advantage is that the government of the USA puts
restrictions on ARM that should not apply to the free RISC-V
architecture.

It would apply to implementations designed in the USA (such as those
by Ahead), but the point is that on the ISA level, and thus the
buy-in into the ecosystem (e.g., from ISVs), RISC-V has an advantage.

As someone who does porting and platforms for an ISV, I'm seeing no
customer demand whatsoever. I'm pretty sure that's because of the lack
of high-performance implementations. I'd like to do RISC-V, because new
architectures are fun, but I can't get hardware at present that's up to
the job, and so I can't justify spending time on it.

RISC-V also has a technical advantage over ARM: It has Ztso (total
store order) as an optional extension, which helps porting of
multi-threaded software from AMD64 (and emulation of AMD64
software). No such thing on ARMv8 or ARMv9 yet, although
implementations like the Apple M1 and Fujitsu A64FX provide
this feature.

Yup, that's an advantage. I have not had trouble with the lack of it on
multi-threaded ARM Linux or ARM Windows, but the threading framework I
use was originally developed on SPARC and does its mutexes properly.

But it's also possible they just want to carry on being chip
architects while being in charge of their own company.

Sure. But what are the investors seeing in the company?

Hard to say, given the things venture capitalists are prepared to throw
money at these days.

Even if an architecture has a long track record, like MIPS, that's
not enough, as the switch from the MIPS ISA to RISC-V shows.

In my market sector, so far, that's "the death of MIPS." That happened in
2008, simply because it wasn't remotely performance-competitive.

What I read is that the Snapdragon X implements ARM v8.7.

You're right, I mis-remembered.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Dallman on Fri Aug 30 14:12:04 2024

[email protected] (John Dallman) writes:

In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:

The alternative is that one expects that all the software be
rebuilt for the specific configuration being used,

ISVs /really/ don't like that. It multiplies their testing and QA and
those are expensive. It rarely shows up problems, but convincing
themselves to do without it is hard for them.

You actually don't need different extensions for such problems, if you
have library providers like the glibc people which use different
implementations with different behaviours (in ways that resulted in
breakage) depending on the processor (not architectural extensions).

In particular, apparently around 2010 or shortly earlier, glibc
started to implement memcpy() with backwards stride on some (not all)
AMD64 hardware, and on some software this led to breakage. The cool
feature is that you could test the software on your hardware and it
would behave as expected, while on some other, hardware-level 100%
compatible hardware it would misbehave. And if the user on that
system reported the problem, you would be unable to reproduce it. I
am not sure if static linking protects against this. Containerization
does not.

Anyway, Ulrich Drepper (glibc maintainer at the time) made the usual C
undefined behaviour argument and blamed the application, which
resulted in a huge flame war. The resolution was that glibc was
modified to behave as expected for binaries linked against older
versions of glibc, but would still misbehave for binaries that are
linked against more recent glibc versions. The idea was apparently
that this avoids breakage of the existing binaries, and that new
binaries would be built from source code that avoids the problem
(probably by using memmove() instead of memcpy()).

There was still no easy way to determine whether your software that
calls memcpy() actually works as expected on all hardware, but there
is a way to avoid this particular problem if you are aware of it:

#define memcpy(dest,src,n) memmove(dest,src,n)
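
For readers who have not run into the overlap case, here is a small sketch of the kind of code that is affected. shift_right is a hypothetical helper, not taken from any of the programs that broke. Source and destination overlap whenever shift is smaller than n, so the C standard requires memmove() here; with memcpy() the behaviour is undefined and the result can depend on the copy direction the library happens to choose.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Move the first n bytes of buf to buf + shift, in place.
   [buf, buf+n) and [buf+shift, buf+shift+n) overlap when shift < n,
   which memmove() handles and memcpy() does not promise to.
   The caller must make sure buf has room for n + shift bytes. */
void shift_right(char *buf, size_t n, size_t shift)
{
    assert(buf != NULL);
    memmove(buf + shift, buf, n);
}
```

With buf holding "abcdef", shift_right(buf, 6, 2) leaves "ababcdef" in the buffer regardless of which direction the library copies in.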

or recompiled from source or some other distribution format on
the local machine which it is to be run (with binaries distributed
as some form of "portable IR").

ISVs get sceptical about that, because it's generating code they have
not tested.

Yes, that thinking seems to be a result of C/C++ compiler shenanigans.
People advocating "optimization" based on the assumption that
undefined behaviour does not happen have suggested that I should keep
compiler versions around that compile my source code as I expect it.
Of course that does not help, because I distribute (GNU) software in
source code. And, as the glibc issue discussed earlier shows, even
testing code with a specific compiler and library version does not
necessarily help.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Fri Aug 30 16:42:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

ISVs get sceptical about that, because it's generating code they
have not tested.

Yes, that thinking seems to be a result of C/C++ compiler
shenanigans. People advocating "optimization" based on the
assumption that undefined behaviour does not happen have
suggested that I should keep compiler versions around that
compile my source code as I expect it.

Plain old compiler bugs, introduced while fixing other ones, are quite
enough to make me assume that I'll find problems on each change of
compiler. I have had a manager in a very large software company assure me
that it was impossible for them to add bugs while making fixes. His
technical people corrected him immediately, because I'd just laughed.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to John Dallman on Fri Aug 30 15:44:56 2024

[email protected] (John Dallman) writes:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

[email protected] (John Dallman) writes:

Android is apparently waiting for a new RISC-V instruction set
extension;

Which one?

I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:

https://lists.riscv.org/g/tech-unprivileged/topic/92916241

Searching with various terms suggests it might well be the Zabha
extension, ratified in April this year, but that is deduction.

You may not consider it large-scale, but we wanted to have two
RISC-V servers for teaching (in particular, for the compiler
course).

Makes sense. It is not in itself "large-scale," but suitable hardware is
only going to be available if someone wants a lot of it, enough to make
building it worthwhile.

Now it's two years later, and the RISC-V servers are still not
showing up.

Yup. RISC-V established a lot of awareness, and some expectations, but
there hasn't been the equipment to let people start using it.

I expect RISC-V to gradually encroach on the embedded market and as
microcontroller IP that can be included in SoC accelerators (primarily
to avoid license fees for the alternatives such as Cortex-M7).

I don't see it replacing ARM64, X86_64/AMD64 or other server-grade
processors.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Dallman on Fri Aug 30 15:10:02 2024

[email protected] (John Dallman) writes:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

[email protected] (John Dallman) writes:

Android is apparently waiting for a new RISC-V instruction set
extension;

Which one?

I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:

https://lists.riscv.org/g/tech-unprivileged/topic/92916241

Thanks.

Searching with various terms suggests it might well be the Zabha
extension, ratified in April this year, but that is deduction.

Yes.

Now it's two years later, and the RISC-V servers are still not
showing up.

Yup. RISC-V established a lot of awareness, and some expectations, but
there hasn't been the equipment to let people start using it.

There is equipment, but only at the small-system end for now, with
Raspi-like SBCs being the top of the line for now.

The Visionfive V2 is one of them, and is roughly comparable to a Raspi
3 (1.5GHz in-order core). We have the V1, and it runs Fedora just
fine, albeit slowly.

The BeagleV-Ahead has 4 Xuantie C910 cores (2GHz out-of-order multiple
issue), but only 4GB RAM. It's harder to find, but there seems to be
an Ubuntu image for it:
<https://community.element14.com/products/devtools/single-board-computers/next-genbeaglebone/b/blog/posts/beaglev-ahead-getting-started-1>

I find it funny to find this on an Element14 page (the company
formerly known as Acorn, the original A in ARM); Element14 has long
since been bought by Broadcom, but apparently some web presence still
exists.

But making the jump from embedded systems and SBCs to servers has not
happened for RISC-V yet, and looking how long it took to establish ARM
in servers, I expect that RISC-V will take quite a while. I guess
that high-performance cores like those that Ahead is probably working
on are one component along the way.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to John Dallman on Fri Aug 30 12:04:22 2024

On 8/30/24 10:48, John Dallman wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

[email protected] (John Dallman) writes:

Android is apparently waiting for a new RISC-V instruction set
extension;

Which one?

I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:

https://lists.riscv.org/g/tech-unprivileged/topic/92916241

The RV64A stuff? I don't know about android but I would find
it limiting. Kind of like having to work with C/C++17 concurrency
support without having to resort to inline assembly on x64. I
know risc-v thinks they solved the ABA problem with lr/sc but
they haven't in all cases.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Anton Ertl on Fri Aug 30 16:15:14 2024

[email protected] (Anton Ertl) writes:

[email protected] (John Dallman) writes:

But making the jump from embedded systems and SBCs to servers has not
happened for RISC-V yet, and looking how long it took to establish ARM
in servers, I expect that RISC-V will take quite a while. I guess
that high-performance cores like those that Ahead is probably working
on are one component along the way.

It takes a whole ecosystem, from the OS vendors to the Lauterbachs et alia
to support a new architecture. RISC-V may get there eventually, but
I don't see it happening quickly.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to John Dallman on Fri Aug 30 18:28:08 2024

On 30/08/2024 17:42, John Dallman wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

ISVs get sceptical about that, because it's generating code they
have not tested.

Yes, that thinking seems to be a result of C/C++ compiler
shenanigans. People advocating "optimization" based on the
assumption that undefined behaviour does not happen have
suggested that I should keep compiler versions around that
compile my source code as I expect it.

Plain old compiler bugs, introduced while fixing other ones, are quite
enough to make me assume that I'll find problems on each change of
compiler. I have had a manager in a very large software company assure
me that it was impossible for them to add bugs while making fixes. His
technical people corrected him immediately, because I'd just laughed.

I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project. Since I work with
embedded systems, there are significantly fewer users compared to, say,
x86 target compilers. Thus there is a higher risk of bugs being missed
in beta testing and going unreported for longer. (IME bugs are far more
likely in vendor SDK's than in gcc or newlib, but I keep everything
archived just in case.) I also like to have reproducible builds -
something that many Linux distributions are aiming for these days -
which requires archiving the toolchain.

If you want to write reliable code that can be distributed as source and
compiled by any conforming C/C++ compiler, you need to be very sure that
you avoid relying on behaviour that is not specified and documented.
You need to write correct code. That means if you want to copy some
memory with overlapping source and destination arrays, you use "memmove"
- the function for that purpose. You don't use "memcpy", since it is
specified explicitly as requiring non-overlapping arrays.

If you want to write software that is "correct because it passed its
tests", you can only expect it to be reliable when it is run exactly as
tested. That means it must be compiled as it was during tests (same
compiler, same options, same library), and arguably even run only on the
same hardware (if you only test on one particular cpu, OS, etc., you can
only be sure it works on that cpu, OS, etc.).

It is, of course, a lot easier to write software that appears roughly
correct in the source code and passes its tests, than software that is
rigidly accurate.

That's why a lot of pre-compiled commercial software gives particular
versions of particular OS's or Linux distributions in their lists of
requirements - even though the software would probably work fine on a
much wider range.

I see nothing wrong in blaming programmers for using "memcpy" when they
should have used "memmove" - it was those programmers that made the
error. And there is nothing wrong with toolchain developers wanting to
give the most efficient results possible to those that code correctly,
rather than punishing accurate programmers for the mistakes of less
accurate programmers. But it is also important for toolchain developers
to remember that programmers are all fallible humans, and sometimes they
could do a better job of minimising the consequences of other people's
errors, or at least informing about these issues - especially for errors
that might be fairly common.
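
As one way of "informing about these issues" on the application side, a debug-build wrapper can check the non-overlap requirement before calling memcpy(). This is only a sketch with hypothetical names (regions_overlap, checked_memcpy), not an existing library facility. It assumes a flat address space in which pointers converted to uintptr_t compare in address order, which the C standard does not guarantee but which holds on the mainstream platforms discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if [a, a+n) and [b, b+n) share at least one byte.
   Assumes uintptr_t values order the same way as addresses. */
int regions_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t lo = pa < pb ? pa : pb;
    uintptr_t hi = pa < pb ? pb : pa;
    return n != 0 && hi - lo < n;
}

/* memcpy() that aborts in debug builds when its precondition is
   violated; with NDEBUG defined it is just memcpy(). */
void *checked_memcpy(void *dst, const void *src, size_t n)
{
    assert(!regions_overlap(dst, src, n));
    return memcpy(dst, src, n);
}
```

A failing assertion then shows up on the developer's own machine, independent of which memcpy() variant the library selects for that CPU.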

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Scott Lurndal on Fri Aug 30 18:33:40 2024

On 30/08/2024 17:44, Scott Lurndal wrote:

[email protected] (John Dallman) writes:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

[email protected] (John Dallman) writes:

Android is apparently waiting for a new RISC-V instruction set
extension;

Which one?

I don't know what its name is. It was proposed by Hans Boehm, and the
Android team pointed me to this discussion on a RISC-V mailing list:

https://lists.riscv.org/g/tech-unprivileged/topic/92916241

Searching with various terms suggests it might well be the Zabha
extension, ratified in April this year, but that is deduction.

You may not consider it large-scale, but we wanted to have two
RISC-V servers for teaching (in particular, for the compiler
course).

Makes sense. It is not in itself "large-scale," but suitable hardware is
only going to be available if someone wants a lot of it, enough to make
building it worthwhile.

Now it's two years later, and the RISC-V servers are still not
showing up.

Yup. RISC-V established a lot of awareness, and some expectations, but
there hasn't been the equipment to let people start using it.

I expect RISC-V to gradually encroach on the embedded market and as
microcontroller IP that can be included in SoC accelerators (primarily
to avoid license fees for the alternatives such as Cortex-M7).

That's where I expect to see it, and I hope to see more of it. At the
very least, decent competition will help push ARM forward.

What I personally would like to see is RISC-V extensions aimed at
real-time and deterministic systems - RTOS acceleration, hardware
semaphores, and the like.

I don't see it replacing ARM64, X86_64/AMD64 or other server-grade
processors.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Fri Aug 30 18:52:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

I find it funny to find this on an Element14 page (the company
formerly known as Acorn, the original A in ARM); Element14 has long
since been bought by Broadcom, but apparently some web presence
still exists.

A bit of exploration of the website reveals it's a promotional website
for Farnells, an electronics distributor, and doesn't seem to have
anything to do with ex-Acorn or Broadcom.

But making the jump from embedded systems and SBCs to servers has
not happened for RISC-V yet, and looking how long it took to
establish ARM in servers, I expect that RISC-V will take quite a
while. I guess that high-performance cores like those that Ahead
is probably working on are one component along the way.

A necessary step, but there are many more.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Brett@21:1/5 to John Dallman on Fri Aug 30 17:59:42 2024

John Dallman <[email protected]> wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

AMD64 already has the buy-in of application vendors for desktops and
servers, so it does not have the problem that extensions create
uncertainty among application vendors.

My guess is that there are the following motivations:

1) The new instructions make technical sense (for certain
applications).

This is sometimes true, but manufacturers tend to over-promote them,
claiming wider applicability and bigger effects than show up in real
application code. After a few disappointments, ISVs tend to become less
keen on doing work on marketing advice.

Some manufacturers pay bonuses to their technical marketing people for getting ISVs to adopt new ISA extensions. This is counterproductive,
because it means the ISVs are sure that the marketing advice will take no account of their interests.

They prefer to wait until an extension has been out for several years
before supporting it, so that it's available in pretty well all the
end-user hardware that hasn't finished its depreciation yet. That's
driven by a facet of the application software industry that most hardware manufacturers don't seem to understand. They appear to assume that
computers are set up with an initial software load and carry on running
that for their entire lives.

In fact, organisations replace about a quarter of their machines each
year, always buying up-to-date ones, and want to run the /same/ version
of software on all of them. They want common software versions for data compatibility, ease of training and so on. That means that a new release
of an application has to run on all the machines sold in the last four
years, sometimes longer.

I assume you work in the high end, as the average desktop PC is replaced
every 8 years on a “use it until it breaks” policy.

Dell will tell you 5 years, and Google is paid to say the same.
And that actually might be true for laptops, but not desktops.

The bulk of the PC’s and servers where I work are a dozen years old.
A smattering of new PC’s bring the average down to 9 years.

Some manufacturers expect ISVs to produce multiple versions of software
for different sets of ISA extensions. They'll do that if the gains are
large enough, but they have to be quite large: for my employer, 25% is enough, but 10% isn't. We haven't had to make a decision in between those numbers yet. We've had one 25% case, for Intel SSE2, and many of 10% or
less.

2) Even if the applications that the users use don't benefit from
the extensions, the users think (thanks also to Intel's marketing)

The sheer flood of extensions from Intel means most end-user
organisations have stopped trying to keep track these days.

John

From MitchAlsup1@21:1/5 to BGB on Fri Aug 30 18:11:48 2024

On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:

On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

Time to up your game to an industrial quality ISA.

Open question of what an "industrial quality" ISA has that BJX2 lacks...
Limiting the scope to things that RISC-V and ARM have.

Proper handling of exceptions (ignoring them is not proper)

If you mean FPU exceptions, maybe.

As far as general interrupt handling, mechanism isn't too far off from
what SH-4 had used, and apparently also RISC-V's CLINT and MIPS work in
a similar way.

Though, with differences as to how they divide up exceptions.
In my case:
Reset;
General Fault;
External Interrupt;
TLB/MMU;
Syscall.

Integer Overflow
Bad Instruction encoding--OpCode exists but not as this
instruction uses it. Random code generation can use
every instruction without privilege.
Bad address--address exists but you are not allowed to touch it
with LD or ST instruction or to attempt to execute it.

Proper IEEE 754-2018 handling of FMAC (compute all the bits)

Possibly true.
My FPU can more-or-less pass the 1985 spec, but not the 2018 spec.

As I understand it, you don't even get FMUL correctly rounded.
To get it properly rounded you have to compute the full 53*53
product.

Floating Point Transcendentals

Not present in many/most ISA's I have looked at.

Its time has come.

HyperVisors/Secure Monitors

Possible. I had considered doing it essentially with emulators, but
granted, this is not quite the same thing.

How can something of lesser privilege emulate something of greater
privilege ??

Seems many of the extant RV implementations don't have this either.

Then not of Industrial quality !!

Write Interrupt service routines entirely in HLL

If you mean C... I do have this.

#ifdef TK_REGSAVE_TBR
__interrupt_tbrsave void __isr_syscall(void)
#else
__interrupt void __isr_syscall(void)
#endif
{
....
}

So there are NO (nada == 0) ASM instructions between "Core takes
interrupt" and control arriving at __isr_syscall() ??

AKA: What exactly is the '__interrupt' for?...

However, the ISR's can't access virtual memory apart from manually translating the pointers.

The various architectural CR's can be accessed from C as well, such as "__arch_tbr" to access TBR, etc.

proper Privileges and Priorities

?...

OS cannot access Hypervisor data/code
Hypervisor cannot access Secure Monitor data/code

Every thread runs at its proper priority at all cycles that it has
control.
Thus, you cannot receive interrupt control and then set priority;
priority needs to be part of delivering control.

Threads are always re-entrant, even the instant they receive control.

Application can call OS
OS can call Hypervisor
Hypervisor can call secure Monitor
as easily as thread can call itself.

Interrupts need no maintenance when Hypervisor changes OS[k] to OS[j]
Interrupts need no maintenance when Secure monitor changes
Hypervisor[k] to Hypervisor[j]

System has a means to detect DRAM failures and map-out affected
pages.

System has a means to detect Device failure and restart device
or change mapping to device.

Multi-location ATOMIC events

Possibly true.
Maybe the "volatile" mechanism is weak.

From Anton Ertl@21:1/5 to David Brown on Fri Aug 30 17:58:31 2024

David Brown <[email protected]> writes:

If you want to write reliable code that can be distributed as source and
compiled by any conforming C/C++ compiler, you need to be very sure that
you avoid relying on behaviour that is not specified and documented.

GCC and Clang/LLVM are distributed in source code, and given that
their maintainers find it ok to compile programs to arbitrary code if
they do not meet your expectations, one should expect that they do not
rely on behaviour that is not specified and documented, and never have
(at least not since adopting this attitude). But even they are not up
to the task. As John Regehr writes
<https://blog.regehr.org/archives/761>:

|LLVM/Clang 3.1 and GCC (SVN head from July 14 2012) [...] execute
|undefined behaviors even when compiling an empty C or C++ program with
|optimizations turned off.

I am not surprised that nobody has risen to my challenge <[email protected]>:

|Write a proof-of-concept Forth interpreter in the language you
|advocate that runs at least one of bubble-sort, matrix-mult or sieve
|from bench/forth in
|<http://www.complang.tuwien.ac.at/forth/bench.zip>

in the last 7 years.

It is, of course, a lot easier to write software that appears roughly
correct in the source code and passes its tests, than software that is
rigidly accurate.

I never heard about "rigidly accurate" as a property of software
(except maybe numeric software).

The practice is that software is either tested (the usual case) or
formally proved correct. For a C program to be formally proved
correct would, first and foremost, require a formal specification of C.

I see nothing wrong in blaming programmers for using "memcpy" when they
should have used "memmove" - it was those programmers that made the
error.

I did not expect *you* to see what's wrong. But I hope that I never
have anything to do with anything that you programmed.

What's wrong with blaming the application programmers is that it does
not help the users of the binary that misbehaved after glibc was
"up"graded. It also does not help users who have a no-longer
maintained piece of source code that used to work with earlier
versions of glibc, but now acts up on some hardware. Sure, there are workarounds, but first the user would have to understand the problem.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to David Brown on Fri Aug 30 18:20:54 2024

On Fri, 30 Aug 2024 16:28:08 +0000, David Brown wrote:

On 30/08/2024 17:42, John Dallman wrote:

I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project. Since I work with embedded systems, there are significantly fewer users compared to, say,
x86 target compilers. Thus there is a higher risk of bugs being missed
in beta testing and going unreported for longer. (IME bugs are far more likely in vendor SDK's than in gcc or newlib, but I keep everything
archived just in case.) I also like to have reproducible builds -
something that many Linux distributions are aiming for these days -
which requires archiving the toolchain.

There was once a software CAD vendor that made the transition from
SUNos to SOLARIS and we as a major purchaser could not follow due
to several OS differences: SUNos had a license server that counted
licenses while SOLARIS had a license server that counted the cross
product of licenses*core. We as a small company could not afford to
upgrade to Solaris. Then their new product simply had different bugs.
We chose to stay with the old SW because we knew where all the bugs
were and how not to stimulate them into nasal demons. Ultimately
they got bought out and disappeared...

From Stefan Monnier@21:1/5 to All on Fri Aug 30 14:34:23 2024

If you want to write reliable code that can be distributed as source and compiled by any conforming C/C++ compiler, you need to be very sure that you avoid relying on behaviour that is not specified and documented. You need to write correct code. That means if you want to copy some memory with overlapping source and destination arrays, you use "memmove" - the function for that purpose. You don't use "memcpy", since it is specified explicitly as requiring non-overlapping arrays.
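
As an illustration of the memmove/memcpy distinction described above, here is a minimal sketch. The helper name drop_prefix is hypothetical and not from the thread; it slides the tail of a buffer to the front, so source and destination can overlap and memmove is the function specified for the job.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: drop the first `shift` bytes of buf[0..len) by
 * sliding the remaining bytes to the front. The source range
 * [buf+shift, buf+len) and the destination range [buf, buf+len-shift)
 * may overlap, so memmove is used; memcpy is only specified for
 * non-overlapping objects. */
static void drop_prefix(char *buf, size_t len, size_t shift)
{
    assert(buf != NULL && shift <= len);
    memmove(buf, buf + shift, len - shift);
}
```

Calling drop_prefix(s, 6, 2) on the six bytes "abcdef" leaves "cdefef" in the buffer: the first four bytes are the moved tail, the last two are untouched.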

The difficulty here is that the tools provide very little help for that, because all too often it's virtually impossible for the tools to
understand that this particular code can/will hit UB.

So it's all up to the programmer, who often doesn't know either.
Other than using CompCert, I don't know of any reliable way for
a programmer to make sure his C code does not suffer from UB.

Stefan

From Bernd Linsel@21:1/5 to Anton Ertl on Fri Aug 30 21:08:09 2024

The clang/gcc maintainers' POV violates the first part of Postel's Law:

Be liberal in what you accept, and conservative in what you send.

Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.

(The various already existing options, e.g. -Wnull-dereference etc., and
the most deviant outgrowth, -fsanitize=... are *not* reliable; the
compiler happily optimizes whole execution paths away and does not tell
about it with any syllable).

On 30.08.24 19:58, Anton Ertl wrote:

David Brown <[email protected]> writes:

If you want to write reliable code that can be distributed as source and
compiled by any conforming C/C++ compiler, you need to be very sure that
you avoid relying on behaviour that is not specified and documented.

GCC and Clang/LLVM are distributed in source code, and given that
their maintainers find it ok to compile programs to arbitrary code if
they do not meet your expectations, one should expect that they do not
rely on behaviour that is not specified and documented, and never have
(at least not since adopting this attitude). But even they are not up
to the task. As John Regehr writes
<https://blog.regehr.org/archives/761>:

|LLVM/Clang 3.1 and GCC (SVN head from July 14 2012) [...] execute
|undefined behaviors even when compiling an empty C or C++ program with
|optimizations turned off.

I am not surprised that nobody has risen to my challenge <[email protected]>:

|Write a proof-of-concept Forth interpreter in the language you
|advocate that runs at least one of bubble-sort, matrix-mult or sieve
|from bench/forth in
|<http://www.complang.tuwien.ac.at/forth/bench.zip>

in the last 7 years.

It is, of course, a lot easier to write software that appears roughly
correct in the source code and passes its tests, than software that is
rigidly accurate.

I never heard about "rigidly accurate" as a property of software
(except maybe numeric software).

The practice is that software is either tested (the usual case) or
formally proved correct. For a C program to be formally proved
correct would, first and foremost, require a formal specification of C.

I see nothing wrong in blaming programmers for using "memcpy" when they
should have used "memmove" - it was those programmers that made the
error.

I did not expect *you* to see what's wrong. But I hope that I never
have anything to do with anything that you programmed.

What's wrong with blaming the application programmers is that it does
not help the users of the binary that misbehaved after glibc was
"up"graded. It also does not help users who have a no-longer
maintained piece of source code that used to work with earlier
versions of glibc, but now acts up on some hardware. Sure, there are workarounds, but first the user would have to understand the problem.

- anton

--
Bernd Linsel

From John Dallman@21:1/5 to Stefan Monnier on Fri Aug 30 20:38:00 2024

In article <[email protected]>, [email protected] (Stefan Monnier) wrote:

Other than using CompCert, I don't know of any reliable way for
a programmer to make sure his C code does not suffer from UB.

That looked very interesting for a few minutes. If CompCert could warn
about undefined behaviour reasonably reliably, I'd be very interested in
using it as a specialised lint program.

As far as I can see from the documentation, the C interpreter that comes
with it can do that, but that's not very practical with millions of lines
of source.

because all too often it's virtually impossible for the tools to
understand that this particular code can/will hit UB.

Presumably this is often impractical for a compiler, and run-time
checking is required? I gave Clang's Undefined Behaviour Sanitizer a try
a few weeks ago, and must get back to it.

John

From John Dallman@21:1/5 to Brown on Fri Aug 30 21:03:00 2024

In article <vasruo$id3b$[email protected]>, [email protected] (David Brown) wrote:

On 30/08/2024 17:42, John Dallman wrote:

Plain old compiler bugs, introduced while fixing other ones, are
quite enough to make me assume that I'll find problems on each
change of compiler.

I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project.

I always have at least a couple of machines at the previous build
standard of any platform, often more machines and/or older build
standards.

Changing compilers or libraries is done at new major releases.

If you want to write software that is "correct because it passed
its tests", you can only expect it to be reliable when it is run
exactly as tested. That means it must be compiled as it was during
tests (same compiler, same options, same library), and arguably
even run only on the same hardware (if you only test on one
particular cpu, OS, etc., you can only be sure it works on that
cpu, OS, etc.).

This is simpler when you produce closed-source binary software. We only
ship builds we've tested. That means the /same binaries/ as we tested,
not rebuilt or modified. This requires a separate test harness, rather
than testing code compiled into the binaries.

We test on a wide variety of hardware for the most-used platforms, by
putting it into the distributed testing pools and always knowing which
machine an individual test case ran on, because it's recorded in the test results.

That's why a lot of pre-compiled commercial software gives
particular versions of particular OS's or Linux distributions in
their lists of requirements - even though the software would
probably work fine on a much wider range.

We specify what we specifically support, because we've tested that, plus
the much broader requirements that it should work on. For Linux those are
a GCC runtimes version (currently 8.x) or later and a glibc version
(currently 2.28) or later. We don't seem to have problems with
compatibility since we understood how the compatibility works with those libraries, and started doing it that way.

If there's a problem on a specifically supported Linux, we'll fix it
unless that's impossible. If there's a problem on one where it should
work, we'll investigate it, and fix it if we can, which may cause a distribution to be added to the specifically supported list. If we can't
fix a problem, we'll explain why not, and normally add the problem to the documentation. We can't do miracles, but we do pretty well.

Yes, doing good support is expensive, but it pays off in customer loyalty, which means money.

John

From John Dallman@21:1/5 to Brett on Fri Aug 30 20:38:00 2024

In article <vat1ad$jeb4$[email protected]>, [email protected] (Brett) wrote:

I assume you work in the high end, as the average desktop PC is
replaced every 8 years on a _use it until it breaks_ policy.

Yes: we supply software components for stuff where end-users understand
they need powerful machines, and generally have them.

John

From Scott Lurndal@21:1/5 to [email protected] on Fri Aug 30 21:24:10 2024

[email protected] (MitchAlsup1) writes:

On Fri, 30 Aug 2024 16:28:08 +0000, David Brown wrote:

On 30/08/2024 17:42, John Dallman wrote:

I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project. Since I work with
embedded systems, there are significantly fewer users compared to, say,
x86 target compilers. Thus there is a higher risk of bugs being missed
in beta testing and going unreported for longer. (IME bugs are far more
likely in vendor SDK's than in gcc or newlib, but I keep everything
archived just in case.) I also like to have reproducible builds -
something that many Linux distributions are aiming for these days -
which requires archiving the toolchain.

There was once a software CAD vendor that made the transition from
SUNos to SOLARIS and we as a major purchaser could not follow due

Solbourne?

https://en.wikipedia.org/wiki/Solbourne_Computer

to several OS differences: SUNos had a license server that counted
licenses while SOLARIS had a license server that counted the cross
product of licenses*core. We as a small company could not afford to
upgrade to Solaris. Then their new product simply had different bugs.
We chose to stay with the old SW because we knew where all the bugs
were and how not to stimulate them into nasal demons. Ultimately
they got bought out and disappeared...

From MitchAlsup1@21:1/5 to BGB on Sat Aug 31 02:04:23 2024

On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:

On 8/30/2024 1:11 PM, MitchAlsup1 wrote:

On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:
Integer Overflow

Not usually a thing. Pretty much everything seems to treat integer
overflow as silently wrapping.

ADA wants these.

Bad Instruction encoding--OpCode exists but not as this
   instruction uses it. Random code generation can use
   every instruction without privilege.

Hit or miss.

Will usually fault on invalid instructions.

Must be 100% to guarantee upwards compatibility.

There is logic in place to reject privileged instructions in user-mode,
if the CPU is actually run in user-mode. Some of this is still TODO (currently, TestKern is still running everything in Supervisor Mode).

Yes, it is a pain--but a pain that is absolutely worth it.

The alternative is to treat them as UB, so they may be one of:
Trap;
Do something else (like, if an instruction was added);
Do something wonky / unintended.

In practice, this seems to be more how it works.

Bad practice == not industrial quality.

Bad address--address exists but you are not allowed to touch it
with LD or ST instruction or to attempt to execute it.

If the MMU is enabled, it should fault on bad memory accesses.

In physical addressing mode, it does not trap.

YOU FAIL TO UNDERSTAND--there is an area in memory where the
preserved registers are stored--stored in a way that only 3
instructions can access--and the PTE is marked RWE=000
This prevents damaging the contract between callee and caller.
3 instructions can access these pages ENTER, EXIT and RET
nothing else.

IIRC, there was a mechanism on the bus to deal with accesses to bad
physical addresses (returning all zeroes). Otherwise, trying to access
an invalid address would cause the CPU to deadlock.

It is NOT a BAD address--it is a good but inaccessible address
outside those 3 instructions.

As I understand it, you don't even get FMUL correctly rounded.
To get it properly rounded you have to compute the full 53*53
product.

AFAICT, this wasn't required for the 1985 spec...

You Cannot get rounding correct unless you "compute as if to
infinite precision" and then follow the rules of rounding
(all modes).
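
A small sketch of why the full product matters, using binary32 as a stand-in for the 53*53 case (the function names are hypothetical, and this is an illustration, not anyone's FPU design). Two 24-bit significands give at most 48 product bits, which fit exactly in a double, so the product can be formed in full and rounded once. The self-check picks operands whose product sits just above a rounding tie; only the lowest product bit decides the result.

```c
#include <assert.h>

/* binary32 has a 24-bit significand, so the product of two floats needs
 * at most 48 bits and is represented exactly in binary64 (53 bits). */
static double full_product(float a, float b)
{
    return (double)a * (double)b;
}

/* Round the exact product once to binary32, in the current rounding mode. */
static float rounded_product(float a, float b)
{
    return (float)full_product(a, b);
}

/* Hypothetical self-check, assuming the default round-to-nearest-even mode.
 * (1 + 2^-12) * (1 + 2^-12 + 2^-22)
 *   = 1 + 2^-11 + 2^-22 + 2^-24 + 2^-34.
 * The 2^-24 term alone would be an exact tie (rounding down to the even
 * neighbour); the 2^-34 term pushes the result up by one ulp (2^-23). */
static void sticky_bit_demo(void)
{
    float a = 1.0f + 0x1p-12f;
    float b = 1.0f + 0x1p-12f + 0x1p-22f;
    assert(full_product(a, b) ==
           1.0 + 0x1p-11 + 0x1p-22 + 0x1p-24 + 0x1p-34);
    assert(rounded_product(a, b) ==
           1.0f + 0x1p-11f + 0x1p-22f + 0x1p-23f);
}
```

A multiplier that discarded the bits below 2^-24 would see a tie here and round the other way, which is the kind of error being discussed for a truncated 53*53 product.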

Things like "optional trap on denormal" seems like it should be OK (this
is what MIPS and friends did at the time).

I am talking about FMUL and getting the proper result--no
denorms needed.

For the most part, seems like the '85 spec was more "uses these formats
and gets more or less the same values, good enough". A lot of the
pedantic rounding stuff, etc, seemed to be more something for the 2008
spec.

Then you fail to grasp the spirit of the spec.

The lack of single-rounded FMA shouldn't matter, since this wasn't added until later.

It was in the 1985 spec.

Support for Binary16 is a bonus feature (since 85 spec only gave Single
/ Double / Extended), but Binary16 is useful...

So is a dildo for some people. Irrelevant to the issues at hand.

Floating Point Transcendentals

Not present in many/most ISA's I have looked at.

Its time has come.

Then who has done it, besides x87 and similar?...

I am talking about transcendentals that take FDIV number of cycles
Not FADD taking 200 cycles.

Not going to put much weight in something if:
The only real known example is the legacy x87 ISA;
Pretty much everyone else (including on x86-64) is using unrolled Taylor-series expansion and similar.

At least spell it Chebychev.

HyperVisors/Secure Monitors

Possible. I had considered doing it essentially with emulators, but
granted, this is not quite the same thing.

How can something of lesser privilege emulate something of greater
privilege ??

Top level OS (or hypervisor layer) runs an emulator, which runs any VMs holding guest OS instances.

But if the most you have is Supervisor how do you emulate something
of higher privilege efficiently ??

Granted, running the main OS in an emulator wouldn't be great for performance. But, in most contexts, this isn't really a thing.

Quit acting stupid. You are better than that.

Like, pretty sure Windows and Linux still tend to run bare-metal on most systems, ... (or, if a VM layer exists, it is unclear what if-any
purpose it would serve).

You can run both windows and linux at the same time.
Windows for games and documents, linux for CAD.

But, in any case, one doesn't need any special ISA level support to make things like QEMU and DOSBox work.

Quit acting stupid. You are better than that.

And, if a person wants to essentially use something like QEMU to run the whole OS, nothing really is stopping them.

Quit acting stupid. You are better than that.

Well, except maybe how slow that QEMU and DOSBox tend to be on something
like a RasPi (on a 50MHz CPU, one would likely be hard-pressed to even
run something like SimCity at acceptable speeds).

Quit acting stupid. You are better than that.

Not yet tried porting something like DOSBox to my stuff though...

But, a more clever emulator could likely leverage things like hardware address translation and maybe only JIT parts of the target system (vs,
say, fully emulating the memory access and using JIT compilation or interpretation for "pretty much everything").

You need efficient 2-level (or more) translation.

Say, for example, if the host system and guest OS are running the same
ISA (vs, say, the guest OS running x86 or x86-64; on a host running a different ISA).

What if one thread wants 386, another wants 486, another x86-64
AND all three get the proper undefined instruction trapping.

From Thomas Koenig@21:1/5 to John Dallman on Sat Aug 31 08:59:16 2024

John Dallman <[email protected]> schrieb:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Concerning the demand, RISC-V has the advantage of no ARM tax (and
legal costs like those between ARM and Qualcomm over the
developments started at NUVIA)

True, although the market for high-performance application cores is less price-sensitive than the market for low-performance embedded ones.

Definitely - if you have 512 GB DDR5 memory in your workstation, the
cost of the CPU itself is a relatively small fraction.

From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 08:45:16 2024

Bernd Linsel <[email protected]> schrieb:

The clang/gcc maintainers' POV violates the first part of Postel's Law:

Be liberal in what you accept, and conservative in what you send.

Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.

Patches welcome.

From Thomas Koenig@21:1/5 to Thomas Koenig on Sat Aug 31 09:29:46 2024

Thomas Koenig <[email protected]> schrieb:

Bernd Linsel <[email protected]> schrieb:

The clang/gcc maintainers' POV violates the first part of Postel's Law:

Be liberal in what you accept, and conservative in what you send.

Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.

Maybe a bit more elaborate:

#include <stdio.h>

int main()
{
int i;
sscanf("%d", &i);

Should be "scanf", of course.

From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 09:24:59 2024

Bernd Linsel <[email protected]> schrieb:

The clang/gcc maintainers' POV violates the first part of Postel's Law:

Be liberal in what you accept, and conservative in what you send.

Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.

Maybe a bit more elaborate:

#include <stdio.h>

int main()
{
int i;
sscanf("%d", &i);
return 0;
}

Should this be warned about?

Or what about

void foo(int *a)
{
*a ++;
}

Two possible cases of undefined behavior here: a could be an
invalid pointer, and the arithmetic operation could overflow.

From Bernd Linsel@21:1/5 to Thomas Koenig on Sat Aug 31 13:10:22 2024

On 31.08.24 11:24, Thomas Koenig wrote:

Bernd Linsel <[email protected]> schrieb:

The clang/gcc maintainers' POV violates the first part of Postel's Law:

Be liberal in what you accept, and conservative in what you send.

Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.

Maybe a bit more elaborate:

#include <stdio.h>

int main()
{
int i;
scanf("%d", &i);
return 0;
}

Should this be warned about?

[corrected sscanf -> scanf]
Why? This "program" has the purpose to read one line, presumably
containing an integer number, from stdin and ignore it. No UB anywhere.

It does accept an empty line as well as 3432 MB of garbage, and even an
integer without leading space, but always returns true.
Scanf's man page does not state anything about warn-unused-result, and
its input parsing is clearly described.

I would not complain if the compiler would deliver something that's
roughly equivalent to

int main(void)
{
(void)scanf("%*s");
return 0;
}

while

int main(void)
{
return 0;
}

would be unacceptable.

Or what about

void foo(int *a)
{
*a ++;
}

Two possible cases of undefined behavior here: a could be an
invalid pointer, and the arithmetic operation could overflow.

The result of the pointer increment is never used, so the compiler will
warn and not compile any increment instruction nonetheless.

Furthermore, as *a is not declared volatile, the read operation is
superfluous. A call to foo() may thus legally result in:

<nothing>,

but a still better result would be:

foo(x) -> assert(__builtin_expect(x != NULL, 1)).

Additionally, I'd expect at least 2 warnings:
- result of pointer increment `a++` never used
- result of variable access `*a` never used.
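
To make the two readings of `*a ++` concrete, here is a sketch with hypothetical function names. The expression parses as *(a++); the version that increments the pointed-to value needs parentheses, and presumably that is what the remark about arithmetic overflow had in mind.

```c
#include <assert.h>
#include <stddef.h>

/* Presumed intent of foo(): increment the pointed-to int.
 * This can still overflow (undefined behaviour) when *a == INT_MAX. */
static void increment_pointee(int *a)
{
    assert(a != NULL);
    (*a)++;
}

/* What `*a++` actually parses as, *(a++): read the element, then advance
 * the pointer. The advanced pointer is handed back through `cursor` so
 * the side effect is visible to the caller instead of being discarded. */
static int read_and_advance(const int **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    return *(*cursor)++;
}
```

In the original foo() both the loaded value and the advanced local pointer are thrown away, which is why a compiler may emit nothing at all for it.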

GCC provides means like e.g. the nonnull() attribute, and even if that
were not available, it is good practice to assert() pointer arguments --
or check and return an error code -- at the beginning of the function
body, if you expect to be called from arbitrary (library user) code.

Furthermore, to provide hints to the compiler, you can always write
something like:

if (a == NULL) __builtin_unreachable();

Commonly, one instruments that as:

#define ASSUME(cond) \
do { \
if (__builtin_expect(!(cond), 0)) \
__builtin_unreachable(); \
} while (0)
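
A hedged variant of such a macro, as a sketch: in debug builds the assumption is checked with assert(), and only NDEBUG builds turn it into an optimizer hint. __builtin_unreachable() is a GCC/Clang builtin; the function average() is a hypothetical example, not from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Debug builds verify the assumption; NDEBUG builds promise it to the
 * optimizer. If the promise is broken in an NDEBUG build, the behaviour
 * is undefined, so the hint is only as good as the caller's discipline. */
#ifdef NDEBUG
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define ASSUME(cond) assert(cond)
#endif

/* Hypothetical example: integer average of n > 0 values. */
static int average(const int *v, int n)
{
    ASSUME(v != NULL);
    ASSUME(n > 0);
    long long sum = 0;           /* wide accumulator avoids int overflow */
    for (int i = 0; i < n; i++)
        sum += v[i];
    return (int)(sum / n);
}
```

The design choice is that the same source line documents the precondition, tests it during development, and feeds the optimizer in release builds.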

Maybe my previous post was not clear enough: It's not a general UB
detector that I'd like to have integrated into the compiler (there are
static checker tools available that can nearly perfectly do that);
instead, I'd like to get a warning when the compiler does something
other than you would expect when reading the code in a "do what I mean"
manner.

--
Bernd Linsel

From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 11:18:15 2024

Bernd Linsel <[email protected]> schrieb:

On 31.08.24 11:24, Thomas Koenig wrote:

Bernd Linsel <[email protected]> schrieb:

The clang/gcc maintainers' POV violates the first part of Postel's Law:

Be liberal in what you accept, and conservative in what you send.

Life would be a lot easier if they just provided a -WUB option that
warns and explains *any* construct that the compiler may regard as UB.

Maybe a bit more elaborate:

#include <stdio.h>

int main()
{
int i;
scanf("%d", &i);
return 0;
}

Should this be warned about?

[corrected sscanf -> scanf]
Why? This "program" has the purpose to read one line, presumably
containing an integer number, from stdin and ignore it. No UB anywhere.

What happens on overflow on input? That's undefined behavior, IIRC.
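
For reference, one common way to sidestep the question is to read the text separately and convert it with strtol, which reports out-of-range input through errno instead of leaving the behaviour undefined. The helper name parse_int is hypothetical; this is a sketch, not a drop-in replacement for every scanf use.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal int from s. Rejects empty input,
 * trailing characters, and values that do not fit in int. */
static bool parse_int(const char *s, int *out)
{
    assert(s != NULL && out != NULL);
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')        /* no digits, or trailing junk */
        return false;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return false;                    /* would not fit in int */
    *out = (int)v;
    return true;
}
```

When the string comes from fgets(), the trailing newline has to be stripped first, since this sketch treats any leftover character as an error.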

From Bernd Linsel@21:1/5 to Thomas Koenig on Sat Aug 31 13:58:46 2024

On 31.08.24 13:26, Thomas Koenig wrote:

So, sorry for the too-quick examples earlier...

What about

int foo (int a)
{
return a + 1;
}

or

int foo(int *a)
{
return *a;
}

Both can exhibit undefined behavior, and for both it
is impossible for the compiler to tell at compile-time.

So the compiler should just compile both functions (gcc 12.2.0 with -O3
does):

$ gcc -Wall -Wextra -Wpedantic -O3 -xc -std=gnu11 -c - -o foo.o
int foo(int a)
{
return a + 1;
}

int bar(int *a)
{
return *a;
}
^D

$ objdump -d foo.o

foo.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
0: 8d 47 01 lea 0x1(%rdi),%eax
3: c3 ret
4: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
b: 00 00 00 00
f: 90 nop

0000000000000010 <bar>:
10: 8b 07 mov (%rdi),%eax
12: c3 ret

All as expected.

What I don't want is that the compiler makes assumptions, concludes UB,
feels entitled to compile whatever it wants and deliver rubbish without
telling about it.

--
Bernd Linsel

From Thomas Koenig@21:1/5 to All on Sat Aug 31 11:26:58 2024

So, sorry for the too-quick examples earlier...

What about

int foo (int a)
{
return a + 1;
}

or

int foo(int *a)
{
return *a;
}

Both can exhibit undefined behavior, and for both it
is impossible for the compiler to tell at compile-time.
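To make the two cases concrete, here is a sketch of variants that move the precondition into the code. The names foo_checked and bar_checked are hypothetical, and the null check covers only one of the ways a pointer can be invalid:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical variant of foo(int): reports failure instead of
   evaluating a + 1 when that would overflow. */
bool foo_checked(int a, int *result)
{
    if (a == INT_MAX)
        return false;           /* a + 1 would be undefined behavior */
    *result = a + 1;
    return true;
}

/* Hypothetical variant of foo(int *): rejects a null pointer.
   It cannot detect dangling or otherwise invalid pointers. */
bool bar_checked(const int *a, int *result)
{
    if (a == NULL)
        return false;
    *result = *a;
    return true;
}
```

The first check is complete; the second is not, which illustrates why the pointer case is the harder one for any tool.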

From David Brown@21:1/5 to Stefan Monnier on Sat Aug 31 15:55:56 2024

On 30/08/2024 20:34, Stefan Monnier wrote:

If you want to write reliable code that can be distributed as source and
compiled by any conforming C/C++ compiler, you need to be very sure that you
avoid relying on behaviour that is not specified and documented. You need to
write correct code. That means if you want to copy some memory with
overlapping source and destination arrays, you use "memmove" - the function
for that purpose. You don't use "memcpy", since it is specified explicitly
as requiring non-overlapping arrays.
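A small illustration of the memmove case, as a sketch only. The function name shift_right_by_one is made up for this example:

```c
#include <string.h>

/* Shift the first n bytes of buf one position to the right.
   The source [buf, buf+n) and the destination [buf+1, buf+n+1) overlap,
   so memmove is the function to use; memcpy would be undefined
   behavior here. The caller must guarantee room for n + 1 bytes. */
void shift_right_by_one(char *buf, size_t n)
{
    memmove(buf + 1, buf, n);
}
```

Calling it on a buffer holding "abcdef" with n = 6 leaves "aabcdef" in the first seven bytes.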

The difficulty here is that the tools provide very little help for that, because all too often it's virtually impossible for the tools to
understand that this particular code can/will hit UB.

Yes, that is true. And in such cases there is no way for a compiler to
"optimise on the assumption of no UB", since it does not know that there
will be, or could be, UB. So Anton has nothing to fear there. Bernd,
on the other hand, might be disappointed - there is also no way for the
compiler to warn that the code might have errors or UB.

So it's all up to the programmer, who often doesn't know either.
Other than using CompCert, I don't know of any reliable way for
a programmer to make sure his C code does not suffer from UB.

There is no foolproof or complete method for C. There are other
languages for which formal methods can come closer to proving the
correctness of the code, but for most practical cases this is infeasible.

The best you can do, as a programmer, is to learn the language as well
as you can, write code carefully, and use whatever help you can get that
is within budget - including linter tools, code reviews, test setups,
and so on. You can come a long way using good free tools such as gcc
and clang, including their extensive compiler warnings and their
sanitizers for run-time checking and testing.

No one claims that writing good, working code is easy.

From David Brown@21:1/5 to John Dallman on Sat Aug 31 16:00:35 2024

On 30/08/2024 22:03, John Dallman wrote:

In article <vasruo$id3b$[email protected]>, [email protected] (David Brown) wrote:

On 30/08/2024 17:42, John Dallman wrote:

Plain old compiler bugs, introduced while fixing other ones, are
quite enough to make me assume that I'll find problems on each
change of compiler.

I always keep old versions of compilers around, and don't change
compilers (or libraries) in the middle of a project.

I always have at least a couple of machines at the previous build
standard of any platform, often more machines and/or older build
standards.

Changing compilers or libraries is done at new major releases.

If you want to write software that is "correct because it passed
its tests", you can only expect it to be reliable when it is run
exactly as tested. That means it must be compiled as it was during
tests (same compiler, same options, same library), and arguably
even run only on the same hardware (if you only test on one
particular cpu, OS, etc., you can only be sure it works on that
cpu, OS, etc.).

This is simpler when you produce closed-source binary software. We only
ship builds we've tested. That means the /same binaries/ as we tested,
not rebuilt or modified. This requires a separate test harness, rather
than testing code compiled into the binaries.

It is indeed simpler when you produce binaries. (We make embedded
systems - for many products, we have full control of the software and
the hardware, which makes it a lot easier to have a consistent test
environment.)

We test on a wide variety of hardware for the most-used platforms, by
putting it into the distributed testing pools and always knowing which machine an individual test case ran on, because it's recorded in the test results.

That's why a lot of pre-compiled commercial software gives
particular versions of particular OS's or Linux distributions in
their lists of requirements - even though the software would
probably work fine on a much wider range.

We specify what we specifically support, because we've tested that, plus
the much broader requirements that it should work on. For Linux those are
a GCC runtimes version (currently 8.x) or later and a glibc version (currently 2.28) or later. We don't seem to have problems with
compatibility since we understood how the compatibility works with those libraries, and started doing it that way.

That is a good compromise.

If there's a problem on a specifically supported Linux, we'll fix it
unless that's impossible. If there's a problem on one where it should
work, we'll investigate it, and fix it if we can, which may cause a distribution to be added to the specifically supported list. If we can't
fix a problem, we'll explain why not, and normally add the problem to the documentation. We can't do miracles, but we do pretty well.

Yes, doing good support is expensive, but it pays off in customer loyalty, which means money.

Agreed. For a lot of businesses, customer loyalty comes not from making working products (lots of people can do that), but how you handle things
when something goes wrong.

From MitchAlsup1@21:1/5 to All on Sat Aug 31 14:33:47 2024

On Sat, 31 Aug 2024 2:04:23 +0000, MitchAlsup1 wrote:

On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:

For example::

You CAN buy an Ultima GTR, an engine, and a transmission, and assemble
a sports car you can register as a street car in your state.

You CANNOT form a company to buy and assemble 1,000 of those and
sell them to the general public.

The former is hobby level, the latter is industrial grade.

The difference is standards and regulations and expectations::
emission regulations
crash structure regulations
pedestrian impact regulations
lighting standards
licensing criteria
infotainment system
air conditioning
..

ALL of which can be ignored for a hobby, none of which can
be ignored for industrial grade.

From Anton Ertl@21:1/5 to Bernd Linsel on Sat Aug 31 15:03:47 2024

Bernd Linsel <[email protected]> writes:

Maybe my previous post was not clear enough: It's not a general UB
detector that I'd like to have integrated into the compiler (there are
static checker tools available that can nearly perfectly do that);

Undefined behaviour is something that is exercised at run-time.
That's why the "undefined behaviour sanitizers" insert run-time
checks. And of course they only detect the behaviour when it is
actually exercised. I.e., they usually will not detect overflowable
buffers, because your usual test inputs don't exercise those.

What do you mean with the static checker tools you mention?

instead, I'd like to get a warning when the compiler does something
other than you would expect when reading the code in a "do what I mean"
manner.

Of course the fans of compilers that do what nobody means found a counterargument long ago: They claim that compilers would need psychic
powers to know what you mean. So one way to specify what I guess you
mean with 'read the code in a "do what I mean" manner' is the
behaviour that the compiler exhibits without "knowledge" coming
from the assumption that there is no undefined behaviour in the
program. For a longer discussion read <https://www.complang.tuwien.ac.at/papers/ertl17kps.pdf>.

And yes, compilers could actually produce information about
differences between such a compilation and a compilation where the
compiler assumes that undefined behaviour does not happen.

One way to use such information is if you then intend to run the
compiler in "Assume That Undefined Behaviour Does Not Happen" mode for production code: check *every* case where the resulting code behaves differently. If the behaviour of the ATUBDNH compiler is not
according to your intentions, change the source code to avoid
undefined behaviour in such cases, forcing the ATUBDNH compiler to
behave as you intend. If the behaviour of the ATUBDNH compiler is as
you intended, you can keep the source code as-is (but then you get the
same warning the next time 'round). Or you can change the source code
in a way that results in the compiler not needing to ATUBDNH in order
to produce the code you would like (see below for examples).

Another way to use such information is if you intend to run the
compiler in don't-ATUBDNH mode for production code. In that case you
only need to look at a few cases: those occurring in the
most-frequently executed code. Again, for each difference there are
two cases: If your intention is only reflected in the don't-ATUBDNH
code, you don't have to do anything, or change the code such that the
warning goes away in the future (without changing the code). If your
intention is also covered by the ATUBDNH case, you can change the code
to actually perform the optimization also in the don't-ATUBDNH
compiler.

Here are examples: Wang et al. [Section 3.3 of wang+12], found that in
all of SPECint 2006 there were only two places where the ATUBDNH made
a measurable difference to performance. These were two inner loops.

In one case the code is

int k;
int *ic, *is;
...
for (k = 1; k <= M; k++) {
...
ic[k] += is[k];
...
}

and the don't-ATUBDNH variant has a sign extension after the "k++"
that the ATUBDNH does not have. Wang et al. suggest changing the type
of k to size_t to avoid this sign-extension operation. After that
change ATUBDNH makes no difference to this loop.
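A sketch of the two variants side by side, with the elided loop body reduced to the one statement shown. The function names are hypothetical; the arrays must have at least M + 1 elements:

```c
#include <stddef.h>

/* Index of type int, as in the original loop. On a 64-bit target a
   compiler that does not assume "no signed overflow" may have to
   sign-extend k before each address computation. */
void add_int_index(int *ic, const int *is, int M)
{
    int k;
    for (k = 1; k <= M; k++)
        ic[k] += is[k];
}

/* Index of type size_t, as Wang et al. suggest. The index already has
   pointer width, so no sign extension is involved either way. */
void add_size_t_index(int *ic, const int *is, size_t M)
{
    size_t k;
    for (k = 1; k <= M; k++)
        ic[k] += is[k];
}
```

Both functions compute the same result for in-range arguments; the difference is only in what the compiler has to prove or assume about k.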

The other loop is

quantum_reg *reg;
...
// reg->size: int
// reg->node[i].state: unsigned long long
for (i = 0; i < reg->size; i++)
reg->node[i].state = ...;

Here ATUBDNH pulls the load of reg->size out of the loop (it assumes
that reg->size does not alias with reg->node[i].state). Wang et
al. solved that by assigning reg->size to a variable outside the loop,
i.e., something like:

quantum_reg *reg;
...
long reg_size = reg->size;
for (i = 0; i < reg_size; i++)
reg->node[i].state = ...;

But once we are at that, why stop at optimization suggestions coming
from ATUBDNH? E.g., consider a loop similar to the second loop:

quantum_reg *reg;
...
// reg->size: int
// reg->node[i].state: int <==== HERE'S THE DIFFERENCE
for (i = 0; i < reg->size; i++)
reg->node[i].state = ...;

In this case ATUBDNH would not allow pulling reg->size out of the
loop, yet you don't intend to ever alias reg->size with
reg->node[i].state. A compiler could actually guess your intention,
and suggest that you may want to pull reg->size out (plus also mention
the caveats about possible aliasing).
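A compilable sketch of that situation. The struct definitions are hypothetical stand-ins for the real quantum_reg, reduced to the two fields that matter, and the stored value 0 replaces the elided right-hand side:

```c
/* Hypothetical stand-ins for the types in the loop above. */
struct node { int state; };
struct reg  { int size; struct node *node; };

/* The bound may be reloaded from reg->size on every iteration: state
   and size are both int, so type-based alias analysis cannot rule out
   that a store to node[i].state modifies reg->size. */
void clear_states_reload(struct reg *reg)
{
    int i;
    for (i = 0; i < reg->size; i++)
        reg->node[i].state = 0;
}

/* Manual hoist: the programmer asserts that the stores never alias
   reg->size. If they do alias, this function behaves differently from
   the one above. */
void clear_states_hoisted(struct reg *reg)
{
    int i;
    int reg_size = reg->size;
    for (i = 0; i < reg_size; i++)
        reg->node[i].state = 0;
}
```

The hoisted form is what a suggestion-producing compiler could propose, leaving the aliasing question to the person who knows the data structure.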

So once we are there, we no longer need ATUBDNH, we just need
don't-ATUBDNH and a compiler option that produces manual-optimization suggestions, ordered by the expected payoff (probably it's a good idea
to use profile data for this ordering).

I personally try to turn GCC into don't-ATUBDNH as far as possible
with options like "-fno-delete-null-pointer-checks
-fno-strict-aliasing -fno-strict-overflow".

@InProceedings{wang+12,
author = {Xi Wang and Haogang Chen and Alvin Cheung and Zhihao Jia and Nickolai Zeldovich and M. Frans Kaashoek},
title = {Undefined Behavior: What Happened to My Code?},
booktitle = {Asia-Pacific Workshop on Systems (APSYS'12)},
OPTpages = {},
year = {2012},
url1 = {http://homes.cs.washington.edu/~akcheung/getFile.php?file=apsys12.pdf},
url2 = {http://people.csail.mit.edu/nickolai/papers/wang-undef-2012-08-21.pdf},
OPTannote = {}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to Thomas Koenig on Sat Aug 31 17:10:29 2024

Thomas Koenig <[email protected]> writes:

Definitely - if you have 512 GB DDR5 memory in your workstation, the
cost of the CPU itself is a relatively small fraction.

Reality check:
EUR
2400 =8*300 8*64GB MTC40F2046S1RC48BA1R Micron RDIMM 64GB, DDR5-4800
9300 AMD Ryzen Threadripper PRO 7995WX 96C boxed

The Intel side is a little cheaper, but also offers fewer cores:

4100 Intel Xeon w9-3475X, 36C boxed
6800 Intel Xeon w9-3495X, 56C tray

In any case, all three CPUs are significantly more expensive than
512GB of RAM.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Thomas Koenig@21:1/5 to Anton Ertl on Sat Aug 31 18:55:25 2024

Anton Ertl <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Definitely - if you have 512 GB DDR5 memory in your workstation, the
cost of the CPU itself is a relatively small fraction.

Reality check:
EUR
2400 =8*300 8*64GB MTC40F2046S1RC48BA1R Micron RDIMM 64GB, DDR5-4800
9300 AMD Ryzen Threadripper PRO 7995WX 96C boxed

The Intel side is a little cheaper, but also offers fewer cores:

4100 Intel Xeon w9-3475X, 36C boxed
6800 Intel Xeon w9-3495X, 56C tray

In any case, all three CPUs are significantly more expensive than
512GB of RAM.

Let's just say those prices are not representative of what I have
in my workstation. First, the CPUs are different, and second,
the deals that a large corporation gets on hardware can be quite
surprising to somebody who is not familiar with them.

From Thomas Koenig@21:1/5 to Anton Ertl on Sat Aug 31 19:08:33 2024

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

From Bernd Linsel@21:1/5 to Thomas Koenig on Sat Aug 31 23:01:54 2024

On 31.08.24 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but for other reasons.

So the things that Anton mentioned -- using size_t (or suitable other
unsigned types) for iteration variables, pulling invariants out of
loops, and many more common optimizations -- can still be found in my
source codes.

PS: I find -fno-strict-overflow and -fno-strict-aliasing of value, too,
while I found that -fdelete-null-pointer-checks together with -Wnull-pointer-dereference has some utility.

--
Bernd Linsel

From MitchAlsup1@21:1/5 to Bernd Linsel on Sat Aug 31 21:14:53 2024

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

On 31.08.24 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc., while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but for other reasons.

I do too.

So the things that Anton mentioned -- using size_t (or suitable other unsigned types) for iteration variables, pulling invariants out of
loops, and many more common optimizations -- can still be found in my
source codes.

The modern change is that "int" is no longer the fastest integral
type {which it was guaranteed to be in the days I learned C}.

PS: I find -fno-strict-overflow and -fno-strict-aliasing of value, too,
while I found that -fdelete-null-pointer-checks together with -Wnull-pointer-dereference has some utility.

From MitchAlsup1@21:1/5 to BGB on Sat Aug 31 21:25:16 2024

On Sat, 31 Aug 2024 20:56:56 +0000, BGB wrote:

On 8/30/2024 7:11 PM, Paul A. Clayton wrote:

On 8/28/24 11:36 PM, BGB wrote:

On 8/28/2024 11:40 AM, MitchAlsup1 wrote:

[snip]

My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.

It likely isn't going to happen because a 1-wide machine isn't going
to have the needed register ports.

For an in-order implementation, banking could be used for saving
a contiguous range of registers with no bank conflicts.

Mitch Alsup chose to provide four read/write ports with the
typical use being three read, one write instructions. This not
only facilitates faster register save/restore for function calls
(and context switches/interrupts) but presents the opportunity of
limited dual issue ("CoIssue").

I was mostly doing dual-issue with a 4R2W design.

Initially, 6R3W won out mostly because 4R2W disallows an indexed store
to be run in parallel with another op; but 6R3W did allow this. This
scenario made enough of a difference to seemingly justify the added cost
of a 3-wide design with a 3rd lane that goes mostly unused (and is
mostly limited to register MOV's and basic ALU ops and similar).

But, then this leads to an annoyance:
As is, I will need to generate different code for 1W, 2W, and 3W configurations;
It is starting to become tempting to generate code resembling that for
the 1W case (albeit still using the shuffling that would be used when
bundling), and then using superscalar (since, it turns out, it is not
quite as expensive as I had thought).

You are falling for the VLIW thought train trap...

With superscalar, I wouldn't have the issue of 2W and 3W cores having
issues running code built for the other.

Such is the advantage of configurable register file ports.

Also, on both 2W and 3W configurations, I can have a 128-bit MOV.X (load/store pair) instruction, so if one assumes 2-wide as the minimum,
this instruction can be safely assumed to exist.

VLIW trap again.

ENTER and EXIT have no such trap as they are not tied to the number of
file ports in any given implementation. They work even when the file
is not configurable and especially when it is. Different timing, though;
because RF configuration determines throughput (as it does OH SO often).

I can mostly ignore 1-wide scenarios (2R1W and 3R1W), mostly as I have
ended up mostly deciding to relegate these to RISC-V.

Tisc..

By the time I have stripped down BJX2 enough to fit into a small FPGA,
it essentially has almost nothing to offer that RV wouldn't offer
already (and it makes more practical sense to use something like RV32IM
or similar).

I am not sure how one would efficiently pull off a 4W write operation.

Can note that generally, the GPR part of the register file can be built
with LUTRAMs, which on Xilinx chips have the property:
1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.

This means the number of LUTRAMs needed for an N-read, M-write (NRMW) register file with G registers can be calculated:
2R1W, 32, Cost=44
3R1W, 32, Cost=66
4R2W, 32, Cost=176
6R3W, 32, Cost=396
4R4W, 32, Cost=352
6R4W, 32, Cost=528

2R1W, 64, Cost=64
3R1W, 64, Cost=96
4R2W, 64, Cost=256
6R3W, 64, Cost=576
4R4W, 64, Cost=512
6R4W, 64, Cost=768

10R5W, 64, cost=1600.
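The table can be reproduced with a simple formula, sketched below. Two things are assumptions inferred from the posted numbers rather than stated in the post: that the registers are 64 bits wide, and that one bank of 1R1W LUTRAMs is needed per (read port, write port) pair. The function names are hypothetical.

```c
/* Data bits per LUTRAM for a given register count, following the two
   configurations listed above: 5-bit address -> 3 data bits,
   6-bit address -> 2 data bits. Returns 0 for other register counts. */
static unsigned lutram_data_bits(unsigned regs)
{
    if (regs == 32) return 3;
    if (regs == 64) return 2;
    return 0;
}

/* Estimated LUTRAM count for an NRMW register file.
   Assumption: 64-bit registers, one bank per read/write port pair. */
unsigned lutram_cost(unsigned read_ports, unsigned write_ports,
                     unsigned regs)
{
    const unsigned width = 64;              /* assumed register width */
    unsigned bits = lutram_data_bits(regs);
    unsigned per_bank;

    if (bits == 0)
        return 0;
    per_bank = (width + bits - 1) / bits;   /* ceil(width / bits) */
    return read_ports * write_ports * per_bank;
}
```

With these assumptions a bank is 22 LUTRAMs for 32 registers and 32 LUTRAMs for 64 registers, so 2R1W with 32 registers gives 2 * 1 * 22 = 44 and 10R5W with 64 registers gives 10 * 5 * 32 = 1600, matching the figures listed. Any logic for tracking which write bank holds the live copy of a register is not counted.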

An accurate but slight underestimate.

I am not sure about ASIC.

Depends on who implemented the SRAM and RF technology.

For FPGA, pretty sure that bidirectional ports would gain little or
nothing over fixed-direction ports (since bidirectional IO is not a
thing, and the internal logic is almost entirely different between a
read and write port).

It is even easier when you have access to individual transistors
and wires...

From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 21:42:31 2024

Bernd Linsel <[email protected]> schrieb:

On 31.08.24 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc., while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A specification is a specification, but it seems you do not grasp
the concept. It seems a curious mental gap in some people who
think that it means fundamentally different things in different fields.

But if you insist on putting some extra constraints on compiler
writers, apart from the official standards, feel free to write them
down (but please in a concise manner) and try to get them accepted,
preferably by the relevant standards committees. But you should know
that writing a specification that is unambiguous and clear is
hard work, and needs a lot of discussion and reviews.

Or fork either gcc or LLVM (or both) and implement whatever
restrictions you want, and if you can convince the maintainers
of these compilers that it is a good idea to fold in your changes,
they may do so.

If you can make your case to enough people (or companies),
then you will find enough volunteers and/or funding to do so.
Snide remarks about compiler writers on comp.arch aren't going
to have any meaningful impact, I'm afraid; if anything, they will
lower your chance of success.

But of course that depends on your definition of success - do
you want to achieve anything, or do you want to aggravate people?
If it is the latter, then your chance of success might be a
bit higher.

I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but for other reasons.

So you learned programming by ignoring the specifications that
were available. Well, sometimes making progress means unlearning
something.

From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Aug 31 23:37:28 2024

On Sat, 31 Aug 2024 21:42:31 +0000, Thomas Koenig wrote:

Bernd Linsel <[email protected]> schrieb:

On 31.08.24 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A specification is a specification, but it seems you do not grasp
the concept. It seems a curious mental gap in some people who
think that it means fundamentally different things in different fields.

But if you insist on putting some extra constraints on compiler
writers, apart from the official standards, feel free to write them
down (but please in a concise manner) and try to get them accepted, preferably by the relevant standards committees. But you should know
that writing a specification that is unambiguous and clear is
hard work, and needs a lot of discussion and reviews.

Convincing the random code exercisers not to try the ATOMIC
parts of the ISA is even harder.

Or fork either gcc or LLVM (or both) and implement whatever
restrictions you want, and if you can convince the maintainers
of these compilers that it is a good idea to fold in your changes,
they may do so.

If you can make your case to enough people (or companies),
then you will find enough volunteers and/or funding to do so.
Snide remarks about compiler writers on comp.arch aren't going
to have any meaningful impact, I'm afraid; if anything, they will
lower your chance of success.

But of course that depends on your definition of success - do
you want to achieve anything, or do you want to aggravate people?
If it is the latter, then your chance of success might be a
bit higher.

I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but for other reasons.

So you learned programming by ignoring the specifications that
were available. Well, sometimes making progress means unlearning
something.

From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Aug 31 23:35:33 2024

On Sat, 31 Aug 2024 21:42:31 +0000, Thomas Koenig wrote:

Bernd Linsel <[email protected]> schrieb:

On 31.08.24 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements such as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A specification is a specification, but it seems you do not grasp
the concept. It seems a curious mental gap in some people who
think that it means fundamentally different things in different fields.

But if you insist on putting some extra constraints on compiler
writers, apart from the official standards, feel free to write them
down (but please in a concise manner) and try to get them accepted,
preferably by the relevant standards committees. But you should know
that writing a specification that is unambiguous and clear is
hard work, and needs a lot of discussion and reviews.

Convincing the random code exercisers to obey the "that is not
an instruction" part of the specification is vastly harder.

Or fork either gcc or LLVM (or both) and implement whatever
restrictions you want, and if you can convince the maintainers
of these compilers that it is a good idea to fold in your changes,
they may do so.

If you can make your case to enough people (or companies),
then you will find enough volunteers and/or funding to do so.
Snide remarks about compiler writers on comp.arch aren't going
to have any meaningful impact, I'm afraid; if anything, they will
lower your chance of success.

But of course that depends on your definition of success - do
you want to achieve anything, or do you want to aggravate people?
If it is the latter, then your chance of success might be a
bit higher.

I personally write most code as in the days I learned C, where compilers
were literally too dumb to remember what they did 2 source lines ago,
so you could not rely on the compiler doing the "right thing" -- same as
nowadays, but because of other reasons.

So you learned programming by ignoring the specifications that
were available. Well, sometimes making progress means unlearning
something.

From Stephen Fuld@21:1/5 to All on Sat Aug 31 19:45:54 2024

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

On 31.08.24 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements such as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written in
C versus code written in something like Ada or Rust. Backing away
now . . . :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Terje Mathisen@21:1/5 to All on Sun Sep 1 08:34:11 2024

MitchAlsup1 wrote:

On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:

On 8/30/2024 1:11 PM, MitchAlsup1 wrote:

On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:
Integer Overflow

Not usually a thing. Pretty much everything seems to treat integer
overflow as silently wrapping.

ADA wants these.

Bad Instruction encoding--OpCode exists but not as this
  instruction uses it. Random code generation can use
  every instruction without privilege.

Hit or miss.

Will usually fault on invalid instructions.

Must be 100% to guarantee upwards compatibility.

There is logic in place to reject privileged instructions in user-mode,
if the CPU is actually run in user-mode. Some of this is still TODO
(currently, TestKern is still running everything in Supervisor Mode).

Yes, it is a pain--but a pain that is absolutely worth it.

The alternative is to treat them as UB, so they may be one of:
   Trap;
   Do something else (like, if an instruction was added);
   Do something wonky / unintended.

In practice, this seems to be more how it works.

Bad practice == not industrial quality.

Bad address--address exists but you are not allowed to touch it
with LD or ST instruction or to attempt to execute it.

If the MMU is enabled, it should fault on bad memory accesses.

In physical addressing mode, it does not trap.

YOU FAIL TO UNDERSTAND--there is an area in memory where the
preserved registers are stored--stored in a way that only 3
instructions can access--and the PTE is marked RWE=000
This prevents damaging the contract between callee and caller.
3 instructions can access these pages ENTER, EXIT and RET
nothing else.

IIRC, there was a mechanism on the bus to deal with accesses to bad
physical addresses (returning all zeroes). Otherwise, trying to access
an invalid address would cause the CPU to deadlock.

It is NOT a BAD address--it is a good but inaccessible address
outside those 3 instructions.

As I understand it, you don't even get FMUL correctly rounded.
To get it properly rounded you have to compute the full 53*53
product.

AFAICT, this wasn't required for the 1985 spec...

You Cannot get rounding correct unless you "compute as if to
infinite precision" and then follow the rules of rounding
(all modes).

This rule is in fact really simple:

In all versions of the standard, from the very first up to the upcoming
2029, the core instructions (FADD/FSUB/FMUL/FDIV/FSQRT) MUST result in
the correctly rounded result, according to whatever the current rounding
mode is/was.

This does mean that you have to act as if you did the calculation to
arbitrary/infinite precision, which really means "enough bits so that
any following bits do not matter for the rounding result".

It was a revelation to me when I wrote my first fp emulation code and
grok'ed how having a single guard bit followed by a sticky bit was
sufficient to do this for all rounding modes.

At that point I only needed to maintain enough intermediate bits to
guarantee I would still have those rounding bits after normalization.

This doesn't mean that I could skip calculating all the bits of the full
NxN->2N mantissa product, only that I didn't need to keep them all
around after normalization.
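
The guard-plus-sticky idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anyone's actual emulation code: the function names are made up for this example, only round-to-nearest-even is shown (the other rounding modes use the same two bits with a different increment rule), and the inputs are assumed to be positive, normal doubles whose product neither overflows nor underflows.

```python
import math

def round_nearest_even(value: int, drop: int) -> int:
    """Discard the low `drop` bits (drop >= 1) of a non-negative integer,
    rounding to nearest with ties going to the even result."""
    kept = value >> drop
    guard = (value >> (drop - 1)) & 1                  # first discarded bit
    sticky = (value & ((1 << (drop - 1)) - 1)) != 0    # OR of all lower bits
    if guard and (sticky or (kept & 1)):
        kept += 1
    return kept

def mul_via_guard_sticky(a: float, b: float) -> float:
    """Multiply two positive, normal doubles via the exact integer product."""
    ma, ea = math.frexp(a)             # a == ma * 2**ea with 0.5 <= ma < 1
    mb, eb = math.frexp(b)
    ia = int(math.ldexp(ma, 53))       # exact 53-bit integer mantissa
    ib = int(math.ldexp(mb, 53))
    product = ia * ib                  # full NxN -> 2N product (105 or 106 bits)
    drop = product.bit_length() - 53   # bits to discard when normalising
    kept = round_nearest_even(product, drop)
    return math.ldexp(float(kept), ea + eb - 106 + drop)
```

Note that the full product is still computed; the guard and sticky bits only summarise the discarded half. On a platform whose float multiply is IEEE 754 binary64 with round-to-nearest-even, `mul_via_guard_sticky(x, y)` should agree with `x * y` for inputs in the assumed range.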

For FMAC (with single rounding, which is the interesting one) you can of
course get catastrophic cancellation, so you need all the 2N mantissa
bits of the multiplication plus the N bits from the addend, then you
either need a normalizer wide enough to take in any possible alignments
of the two parts, or you must have separate logic for each of the major
cases.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From John Dallman@21:1/5 to Anton Ertl on Sun Sep 1 11:21:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Undefined behaviour is something that is exercised at run-time.
That's why the "undefined behaviour sanitizers" insert run-time
checks. And of course they only detect the behaviour when it is
actually exercised. I.e., they usually will not detect overflowable
buffers, because your usual test inputs don't exercise those.

That's among the many reasons why there is no single way "to make code
secure." For string buffers, you turn on the compiler run-time checks,
and use the length-checking versions of string handling functions. Then
you write tests to check both of those are actually working.

Then you discover that the C++ string[] operator is not bounds-checked,
as per the C++ standard, but string.at() is bounds-checked, and curse a
bit.

John

From David Brown@21:1/5 to John Dallman on Sun Sep 1 22:12:34 2024

On 01/09/2024 12:21, John Dallman wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Undefined behaviour is something that is exercised at run-time.
That's why the "undefined behaviour sanitizers" insert run-time
checks. And of course they only detect the behaviour when it is
actually exercised. I.e., they usually will not detect overflowable
buffers, because your usual test inputs don't exercise those.

That's among the many reasons why there is no single way "to make code
secure." For string buffers, you turn on the compiler run-time checks,
and use the length-checking versions of string handling functions. Then
you write tests to check both of those are actually working.

Then you discover that the C++ string[] operator is not bounds-checked,
as per the C++ standard, but string.at() is bounds-checked, and curse a
bit.

But surely you would discover that before using the std::string type? I
might do some quick test code using "stuff copied off the internet", but
for any serious programming I would want to read the specifications of a
type or function before using it. That's the only way to be sure you
are writing correct code.

From David Brown@21:1/5 to Thomas Koenig on Sun Sep 1 22:07:53 2024

On 31/08/2024 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

That is very well put.

Specifications are an agreement between the supplier and the client.
The supplier promises particular functionality if the client stays
within those specifications. It is how things work in a huge range of
aspects of life. Sometimes there are agreements in place for what
happens if the specifications are broken (fine if you fail to deliver as
promised, jail sentence if you break the law, etc.), but these are
really just extensions of the agreement and specification.

If we think about computing, we can start with mathematics for examples.
A mathematical function maps one set onto another - it specifies what
value in the output set is produced from each value in the input set.
It does not specify the result for values that are not in the input set,
even if they are in a "reasonable" superset. So the real square root
function specifies an output for all non-negative real numbers - it does
not specify the result for negative real numbers. Attempting to find
the square root of a negative number is undefined behaviour.

Functions in computing are the same. You have a specification - a
pre-condition, and a post-condition. The inputs (including the
environment, if that is relevant) have to satisfy the pre-condition, and
then the function guarantees that the post-condition will hold after the
function call. Try to put anything else into the function without
satisfying the pre-condition, and it's garbage in, garbage out. If you
don't understand "garbage in, garbage out", you really don't understand
the first thing about software development. This has been understood
since the beginning of the programmable computer:

"""
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into
the machine wrong figures, will the right answers come out?' I am not
able rightly to apprehend the kind of confusion of ideas that could
provoke such a question.
"""

In the context of compilers, the specification is the language standard
in use at the time, combined with the specifications of any library
functions or other code being used. If you don't follow those
specifications - your input code does not meet the pre-conditions, or
the pre-conditions are not met when your code is run - you get undefined
behaviour. There is no rational way to expect any particular result
when the input is in essence meaningless.

So if there is a function (or operator, or other feature) specified by
the language or by library or function documentation, and you pass it
something that is not documented as fulfilling the pre-conditions, it's
garbage in, garbage out - your code is wrong. If your code makes
assumptions about the workings of a function that are not specified in
its post-condition, the code is wrong. It might work during testing,
but it is not guaranteed to work. If you try to use a function outside
its specifications, then your code is wrong.
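
A small Python sketch may make the pre-condition/post-condition point concrete. The helper `contains` is hypothetical and written only for this illustration; it relies on the standard library's `bisect.bisect_left`, which is documented to work on sorted sequences.

```python
from bisect import bisect_left

def contains(sorted_items, target):
    """Pre-condition: sorted_items is in ascending order.
    Post-condition: returns True exactly when target is in sorted_items."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

# Within the specification: the post-condition holds.
print(contains([1, 3, 5, 7], 5))   # True
print(contains([1, 3, 5, 7], 4))   # False

# Outside the specification: no error is raised, but the answer is meaningless.
print(contains([7, 1, 5, 3], 7))   # False, although 7 is in the list
```

The last call is the "garbage in, garbage out" case: nothing traps, the function simply returns a result that its specification never promised anything about.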

Of course it is not always easy to make sure everything is correct
within specifications. Programming languages and libraries are
complicated, and people make mistakes. And where practical, it can be
good to take that into consideration - if it is possible to give error
messages or help in the case of bad inputs, then that can be very
helpful to people. But it doesn't make sense to try to give the "right"
output for wrong input. And it doesn't make sense to do this to the
significant detriment of efficiency with correct inputs.

To compare this to specifications in other walks of life, imagine an
electricity company. The specification they provide to you, the
customer, has the pre-condition that you pay your bills. The
post-condition is that you get electricity. If you break the
specification - you stop paying your bills - it's perfectly reasonable
that they cut off your electricity. But it is /nice/ if they first send
you warning letters, and offer to re-arrange your debt. But if you are
following the specifications and paying your bills, you would not want
the electricity company to keep providing electricity to those who don't
pay, because that would mean /you/ would have to pay more.

In the same way, I want my compiler to warn about potential problems or
undefined behaviour when it reasonably can, rather than jumping straight
to nasal daemons. But I don't want it to generate slower code than it
otherwise could, just because some people might write incorrect code. I
should not have to pay (in run-time efficiency losses) for other
people's potential failure to follow specifications.

But I am quite happy to have compiler options to control the balance and
behaviour. Compilers generally do little optimisation without flags
explicitly enabling them. And some compilers have flags to change the
language specifications (such as making signed integer arithmetic wrap).
There's not a lot they could do better to satisfy people who want the
tools to conform to their imagined specification rather than the actual
specifications.

I suppose one thing they could do is that when a new compiler version
comes out with new optimisations, they could have a flag that turns
these off even if you have enabled others. Maybe you could have
"-olimit=8" to say "limit optimisations to those in gcc 8". That might
give fewer surprises to people who have got their code wrong.

From John Dallman@21:1/5 to David Brown on Sun Sep 1 21:43:00 2024

In article <vb2hri$1jub9$[email protected]>, [email protected]
(David Brown) wrote:

On 01/09/2024 12:21, John Dallman wrote:

Then you discover that the C++ string[] operator is not
bounds-checked, as per the C++ standard, but string.at()
is bounds-checked, and curse a bit.

But surely you would discover that before using the std::string
type? I might do some quick test code using "stuff copied off the
internet", but for any serious programming I would want to read the
specifications of a type or function before using it. That's the
only way to be sure you are writing correct code.

I didn't write that code, and I don't have the power to demand it be
re-written. My group is somewhat pickier about correctness and security
than the group who created it.

John

From EricP@21:1/5 to EricP on Sun Sep 1 17:47:06 2024

EricP wrote:

BGB wrote:

I am not sure how one would efficiently pull off a 4W write operation.

Can note that generally, the GPR part of the register file can be
built with LUTRAMs, which on Xilinx chips have the property:
1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.

This means, the number of LUTRAMs needed for NxM with G registers can
be calculated:
2R1W, 32, Cost=44
3R1W, 32, Cost=66
4R2W, 32, Cost=176
6R3W, 32, Cost=396
4R4W, 32, Cost=352
6R4W, 32, Cost=528

2R1W, 64, Cost=64
3R1W, 64, Cost=96
4R2W, 64, Cost=256
6R3W, 64, Cost=576
4R4W, 64, Cost=512
6R4W, 64, Cost=768

10R5W, 64, cost=1600.

There is also the MUX logic and similar, but should follow the same
pattern.

There is a bit-array (2b per register) to indicate which of the arrays
holds each register. This ends up turning into FFs, but doesn't matter
as much.

In the Verilog, one can write it as-if there were only 1 array per
write port, with the duplication (for the read ports) handled
transparently by the synthesis stage (convenient), although it still
has a steep resource cost.

Since you are targeting 50 MHz, 20 ns per stage, and those LUTRAMs
possibly run at 500 MHz, and assuming the read port numbers are
ready at the start of the cycle, one might multi-pump the register
file read port access and save a pile on read banks and muxes.

For example, you could 4-pump the read port at 5 ns per read,
the LUTRAM read access taking 2 ns and 3 ns for muxing and routing.
That should divide your numbers above by more than 4 because some
muxing becomes simpler too (fewer sources).

You can't multi-pump the write access as the write port data usually
isn't ready until the end of the cycle.

Oh wait, the write-back data output from the MEM-LD stage is ready
at the start of the WB cycle so you could multi-pump the write too.
The normal forwarding logic would pick off if a read register number
matches a write register number so you shouldn't have to worry about
the order of reads and writes to the same register.

That would cut the cost of multiple write ports way down.

From EricP@21:1/5 to BGB on Sun Sep 1 17:17:04 2024

BGB wrote:

I am not sure how one would efficiently pull off a 4W write operation.

Can note that generally, the GPR part of the register file can be built
with LUTRAMs, which on Xilinx chips have the property:
1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.

This means, the number of LUTRAMs needed for NxM with G registers can be
calculated:
2R1W, 32, Cost=44
3R1W, 32, Cost=66
4R2W, 32, Cost=176
6R3W, 32, Cost=396
4R4W, 32, Cost=352
6R4W, 32, Cost=528

2R1W, 64, Cost=64
3R1W, 64, Cost=96
4R2W, 64, Cost=256
6R3W, 64, Cost=576
4R4W, 64, Cost=512
6R4W, 64, Cost=768

10R5W, 64, cost=1600.
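
The figures in the table above are consistent with a simple product rule: one LUTRAM array per (read port, write port) pair, each array wide enough to hold one register. A minimal sketch, assuming 64-bit registers (the width is not stated in the post; it is inferred from the numbers) and using made-up names:

```python
from math import ceil

# (address bits, data bits) of the two LUTRAM shapes quoted above,
# keyed by the number of registers each shape can address.
LUTRAM_SHAPES = {32: (5, 3), 64: (6, 2)}
REG_WIDTH = 64  # assumption: register width in bits, inferred from the table

def lutram_cost(read_ports: int, write_ports: int, num_regs: int) -> int:
    """LUTRAM count for an nRmW register file built from 1R1W LUTRAMs."""
    _addr_bits, data_bits = LUTRAM_SHAPES[num_regs]
    per_array = ceil(REG_WIDTH / data_bits)   # 22 for 3-bit data, 32 for 2-bit
    return read_ports * write_ports * per_array

for r, w, g in [(2, 1, 32), (6, 3, 32), (6, 4, 64), (10, 5, 64)]:
    print(f"{r}R{w}W, {g}, Cost={lutram_cost(r, w, g)}")
```

The rule reproduces every row listed, which is why the cost grows so quickly once both the read-port and write-port counts go up.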

There is also the MUX logic and similar, but should follow the same
pattern.

There is a bit-array (2b per register) to indicate which of the arrays
holds each register. This ends up turning into FFs, but doesn't matter
as much.

In the Verilog, one can write it as-if there were only 1 array per write
port, with the duplication (for the read ports) handled transparently by
the synthesis stage (convenient), although it still has a steep resource
cost.

Since you are targeting 50 MHz, 20 ns per stage, and those LUTRAMs
possibly run at 500 MHz, and assuming the read port numbers are
ready at the start of the cycle, one might multi-pump the register
file read port access and save a pile on read banks and muxes.

For example, you could 4-pump the read port at 5 ns per read,
the LUTRAM read access taking 2 ns and 3 ns for muxing and routing.
That should divide your numbers above by more than 4 because some
muxing becomes simpler too (fewer sources).
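
As a rough back-of-the-envelope model of the multi-pumping suggestion, the sketch below counts LUTRAMs only. It assumes each physical read port can serve `pumps` logical reads per pipeline cycle, and it ignores the mux savings mentioned above, so it may understate the benefit. The function names and the 32-LUTRAMs-per-array figure (64 registers of 64 bits built from 2-bit-wide LUTRAMs) are assumptions made for this illustration.

```python
from math import ceil

def read_slot_ns(clock_mhz: float, pumps: int) -> float:
    """Time available per pumped read within one pipeline cycle."""
    return 1000.0 / clock_mhz / pumps

def pumped_cost(read_ports: int, write_ports: int,
                luts_per_array: int, pumps: int) -> int:
    """LUTRAM count when each physical read port serves `pumps` logical reads."""
    physical_reads = ceil(read_ports / pumps)
    return physical_reads * write_ports * luts_per_array

print(read_slot_ns(50, 4))          # 5.0 ns per read at 50 MHz, 4-pumped
print(pumped_cost(6, 4, 32, 1))     # 768: the unpumped 6R4W, 64-register case
print(pumped_cost(6, 4, 32, 4))     # 256: two physical read ports instead of six
```

Because the number of physical ports is rounded up, the LUTRAM saving in this model is not always a clean factor of 4; whether the real design reaches the quoted timing depends on the actual LUTRAM and routing delays.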

You can't multi-pump the write access as the write port data usually
isn't ready until the end of the cycle.

From MitchAlsup1@21:1/5 to BGB on Sun Sep 1 23:32:47 2024

On Sun, 1 Sep 2024 21:21:38 +0000, BGB wrote:

On 9/1/2024 1:34 AM, Terje Mathisen wrote:

MitchAlsup1 wrote:

It was a revelation to me when I wrote my first fp emulation code and
grok'ed how having a single guard bit followed by a sticky bit was
sufficient to do this for all rounding modes.

At that point I only needed to maintain enough intermediate bits to
guarantee I would still have those rounding bits after normalization.

This doesn't mean that I could skip calculating all the bits of the full
NxN->2N mantissa product, only that I didn't need to keep them all
around after normalization.

OK.

It seemed like when I looked over the 1985 spec initially, it only
required that the result be larger than that of the destination
(seemingly missed the point of it also requiring infinite precision).

Say, 54*54 => 68 bits, where 68 > 52, under this interpretation, it
would have worked. Granted, this does turn it into a probability game
whether the result is correct or off by 1.

It is 53×53->106 to get correct rounding in 1 step.

But, have now since noticed that it did specify computing to infinite
precision (in this version of the standard), which, my FPU does not do.

My point exactly,

There was mention of some operations that I have generally not seen in
the ISA in real-world FPUs:
An FP remainder operator;

Something IEEE specifies but would require an intermediate of 2045
bits to get correct in all circumstances. This is easier to do in
SW! The MC68881 did it in nearly 2300 cycles!!

Converters to/from ASCII strings;

Easier and better in SW.

An FP->Int truncate operator with the result still in FP format;

RND (round) instruction.

Usually, one goes round-trip FP->Int->FP;

Has underflow and overflow problems 2^1022 -> int=>overflow, ...

...

Seems like pretty much everyone offloaded these tasks to the C library.

More modern machines have RND; nobody will ever have REM.

I had ended up with coverage of most of the rest, albeit still lacking a
"trap on denormal" handler (seemingly worked for MIPS and friends, *).

So, it seemed like it was getting pretty close to "could maybe pass the
1985 spec if one lawyers it...". Maybe not so much it seems, unless I
fix the FMUL issue (TBD if it can be done without significantly
increasing adder-chain latency).

You could check for "inability to correctly round" and trap on that
{I have a patent on doing this in transcendental instructions}.

It is possible I could also add a check to detect and trap multiplies
for cases where both values have non-zero low-order bits (allowing these
to also be emulated in software).

So, went and added a flag for "Trap as needed to emulate full IEEE
semantics" to FPSCR, where the idea is that enabling this will cause it
to trap in cases where the FPU detects that the results would likely not
match the IEEE standard (if using FADDG/FSUBG/FMULG/..., generally if
fenv_access is enabled).

Might make sense to have a compiler option to assume fenv_access is
always enabled.

*: Though, from what I can gather, most of the N64 games and similar had
operated with this disabled (giving DAZ/FTZ semantics) which apparently
posed an annoyance for later emulators (things like moving platforms in
games like SMB64 would apparently slowly drift upwards or away from the
origin if the map was left running for long enough, etc; due to SSE and
similar tending to operate with denormals enabled).

GPUs started out without even IEEE 754 formats and over many generations
did more and more of 754, then 2008, and are closing in on 2019.

FMAC (with single rounding, which is the interesting one) you can of
course get catastrophic cancellation, so you need all the 2N mantissa
bits of the multiplication plus the N bits from the addend, then you
either need a normalizer wide enough to take in any possible alignments
of the two parts, or you must have separate logic for each of the major
cases.

Yeah, for the 2008 spec onward, would also need this...

It is possible to provide it as a library call, but granted this makes
it slower.

There are FMAC instructions, but they are currently both slow and
double-rounded (so, not so useful). Well, except for Binary16 and
Binary32 which appear single-rounded mostly because they happen to be
performed internally as Binary64 (but are still slow).

From George Neuner@21:1/5 to [email protected] on Mon Sep 2 00:08:21 2024

On Sun, 1 Sep 2024 22:07:53 +0200, David Brown
<[email protected]> wrote:

On 31/08/2024 21:08, Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

Of course the fans of compilers that do what nobody means found a
counterargument long ago: They claim that compilers would need psychic
powers to know what you mean.

Of course, different compiler writers have different opinions, but
what you write is very close to a straw man argument.

What compiler writers generally agree upon is that specifications
matter (either in the language standard or in documented behavior
of the compiler). However, the concept of a specification is
something that you do not appear to understand, and maybe never
will.

An example: I work in the chemical industry. If a pressure vessel
is rated for 16 bar overpressure, we are not allowed to run it at
32 bar. If the supplier happens to have sold vessels which can
actually withstand 32 bar, and then makes modifications which
lower the actual pressure the vessel can withstand to only 16 bar,
the customer has no cause for complaint.

As usual, the specification goes both ways: The supplier
guarantees the pressure rating, and the customer is obliged
(by law, in this case) to never operate the vessel above its
pressure rating. Hence, safety valves and rupture discs.

That is very well put.

Specifications are an agreement between the supplier and the client.
The supplier promises particular functionality if the client stays
within those specifications. It is how things work in a huge range of
aspects of life. Sometimes there are agreements in place for what
happens if the specifications are broken (fine if you fail to deliver as
promised, jail sentence if you break the law, etc.), but these are
really just extensions of the agreement and specification.

If we think about computing, we can start with mathematics for examples.
A mathematical function maps one set onto another - it specifies what
value in the output set is produced from each value in the input set.
It does not specify the result for values that are not in the input set,
even if they are in a "reasonable" superset. So the real square root
function specifies an output for all non-negative real numbers - it does
not specify the result for negative real numbers. Attempting to find
the square root of a negative number is undefined behaviour.

Functions in computing are the same. You have a specification - a
pre-condition, and a post-condition. The inputs (including the
environment, if that is relevant) have to satisfy the pre-condition, and
then the function guarantees that the post-condition will hold after the
function call. Try to put anything else into the function without
satisfying the pre-condition, and it's garbage in, garbage out. If you
don't understand "garbage in, garbage out", you really don't understand
the first thing about software development. This has been understood
since the beginning of the programmable computer:

"""
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into
the machine wrong figures, will the right answers come out?' I am not
able rightly to apprehend the kind of confusion of ideas that could
provoke such a question.
"""

In the context of compilers, the specification is the language standard
in use at the time, combined with the specifications of any library
functions or other code being used. If you don't follow those
specifications - your input code does not meet the pre-conditions, or
the pre-conditions are not met when your code is run - you get undefined
behaviour. There is no rational way to expect any particular result
when the input is in essence meaningless.

So if there is a function (or operator, or other feature) specified by
the language or by library or function documentation, and you pass it
something that is not documented as fulfilling the pre-conditions, it's
garbage in, garbage out - your code is wrong. If your code makes
assumptions about the workings of a function that are not specified in
its post-condition, the code is wrong. It might work during testing,
but it is not guaranteed to work. If you try to use a function outside
its specifications, then your code is wrong.

Of course it is not always easy to make sure everything is correct
within specifications. Programming languages and libraries are
complicated, and people make mistakes. And where practical, it can be
good to take that into consideration - if it is possible to give error
messages or help in the case of bad inputs, then that can be very
helpful to people. But it doesn't make sense to try to give the "right"
output for wrong input. And it doesn't make sense to do this to the
significant detriment of efficiency with correct inputs.

To compare this to specifications in other walks of life, imagine an
electricity company. The specification they provide to you, the
customer, has the pre-condition that you pay your bills. The
post-condition is that you get electricity. If you break the
specification - you stop paying your bills - it's perfectly reasonable
that they cut off your electricity. But it is /nice/ if they first send
you warning letters, and offers to re-arrange your debt. But if you are
following the specifications and paying your bills, you would not want
the electricity company to keep providing electricity to those who don't
pay, because that would mean /you/ would have to pay more.

In the same way, I want my compiler to warn about potential problems or
undefined behaviour when it reasonably can, rather than jumping straight
to nasal daemons. But I don't want it to generate slower code than it
otherwise could, just because some people might write incorrect code. I
should not have to pay (in run-time efficiency losses) for other
people's potential failure to follow specifications.

But I am quite happy to have compiler options to control the balance and
behaviour. Compilers generally do little optimisation without flags
explicitly enabling them. And some compilers have flags to change the
language specifications (such as making signed integer arithmetic wrap).
There's not a lot they could do better to satisfy people who want the
tools to conform to their imagined specification rather than the actual
specifications.

I suppose one thing they could do is that when a new compiler version
comes out with new optimisations, they could have a flag that turns
these off even if you have enabled others. Maybe you could have
"-olimit=8" to say "limit optimisations to those in gcc 8". That might
give fewer surprises to people who have got their code wrong.

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous standards.

Was it always UB? Or should it be considered ID until it became UB?

It does seem to me that as the C standard evolved, and as more things
have *explicitly* become documented as UB, compiler developers have
responded largely by dropping whatever the compiler did previously -
sometimes breaking code that relied on it.

I have moved on from C (mostly), and I learned long ago to archive
toolchains and to expect that any new version of a tool might break
something that worked previously. I don't like it, but it generally
doesn't annoy me that much.

MMV. Certainly Anton's does. ;-)

Similar to you (David), I came from a - not embedded per se - but
kiosk background: HRT industrial QA/QC systems. I know well the
attraction of a new compiler yielding better performing code. I also
know a large amount of my code was hardware and OS specific, that
those are the things beyond the scope of the compiler, but they also
are things that I don't want to have to revisit every time a new
version of the compiler is released.

13 of one, baker's dozen of the other.


From Thomas Koenig@21:1/5 to George Neuner on Mon Sep 2 05:55:34 2024

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is mentioned as UB in some standard N, but was not addressed in previous standards.

Was it always UB? Or should it be considered ID until it became UB?

Can you give an example?

I would say this really depends on the circumstances. If it is
something left unspecified by earlier standards, and put into the
list of undefined behavior as a clarification, that is one thing.

If it is something that was previously well-defined and then made
into undefined behavior, that is another thing; I would then
likely consider it a bug in the standard (but again, depending
on the circumstances).

It does seem to me that as the C standard evolved, and as more things
have *explicitly* become documented as UB, compiler developers have
responded largely by dropping whatever the compiler did previously - sometimes breaking code that relied on it.

There's a reason that there is a "porting to" file for each release
of gcc; in a way, each release can be considered a new compiler.

As an example, here's an entry from
https://gcc.gnu.org/gcc-13/porting_to.html :

# Fortran language issues

# Behavior on integer overflow

# GCC 13 includes new optimizations which may change behavior on
# integer overflow. Traditional code, like linear congruential
# pseudo-random number generators in old programs and relying on
# a specific, non-standard behavior may now generate unexpected
# results. The option -fsanitize=undefined can be used to detect
# such code at runtime.
#
# It is recommended to use the intrinsic subroutine RANDOM_NUMBER for
# random number generators or, if the old behavior is desired, to use
# the -fwrapv option. Note that this option can impact performance.

Integer overflow on multiplication had always been illegal in
Fortran (prohibited by "shall not"), but it had widely been used
anyway. That was a tough one...
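
The porting note is about Fortran, but the same trap exists in C. As a hedged illustration (the function name is made up; the multiplier and increment are the commonly published 32-bit LCG constants), a generator that depends on wrap-around can be written with unsigned arithmetic, where wrapping is defined by the language rather than left to luck or to -fwrapv:

```c
#include <stdint.h>

/* One step of a 32-bit linear congruential generator.
 * Unsigned arithmetic is defined to wrap modulo 2^N in C, so the
 * multiplication below never has undefined behaviour, unlike the
 * same expression written with signed int. */
uint32_t lcg_next(uint32_t state)
{
    return (uint32_t)(1664525u * state + 1013904223u);
}
```

Starting from a seed of 0, the first call returns the increment itself, 1013904223.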


From Terje Mathisen@21:1/5 to Brett on Mon Sep 2 10:01:43 2024

Brett wrote:

John Dallman <[email protected]> wrote:

In fact, organisations replace about a quarter of their machines each
year, always buying up-to-date ones, and want to run the /same/ version
of software on all of them. They want common software versions for data
compatibility, ease of training and so on. That means that a new release
of an application has to run on all the machines sold in the last four
years, sometimes longer.

I assume you work in the high end, as the average desktop PC is replaced
every 8 years on a "use it until it breaks" policy.

Dell will tell you 5 years, and Google is paid to say the same.
And that actually might be true for laptops, but not desktops.

The bulk of the PCs and servers where I work are a dozen years old.
A smattering of new PCs bring the average down to 9 years.

Organizations that rely on commercial licenced software have a much
easier calculation to make:

"I pay 10-100K dollars every year per CPU for my 3D
CAD/modelling/whatever software, if I can buy a new system in 2-4 years
time which is 50% faster (more cores/faster threads), then it could make
sense to upgrade every year, except for the hassle of installing
everything."

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Tim Rentsch@21:1/5 to George Neuner on Mon Sep 2 01:40:46 2024

George Neuner <[email protected]> writes:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly
is mentioned as UB in some standard N, but was not addressed in
previous standards.

Was it always UB? Or should it be considered ID until it became
UB?

It does seem to me that as the C standard evolved, and as more
things have *explicitly* become documented as UB, compiler
developers have responded largely by dropping whatever the
compiler did previously - sometimes breaking code that relied on
it.

For the most part the circumstances you describe simply don't
occur. I know of one case where a rule introduced in C11
identified a specific situation as undefined behavior whereas
in C99 and before it was arguably not undefined behavior (but
never behavior that should be relied on). I don't remember
any others; if you have any specific examples please mention
them.

Something that does happen is a rule is given fuzzily in one
version of the C standard and then made more precise in a later
version. A good example of that is evaluation sequencing.
Before C11 the rules about what evaluations must be done before
other evaluations were not as clear as they should be. C11 fixed
that. However in that case I don't think anything went from
certainly defined (or certainly unspecified) to undefined, but
rather changed in the other direction, from possibly undefined
to certainly defined. Offhand I don't remember any other
examples, although surely there must be some.

Sometimes it happens that there is a change in the C language not
because wording in the Standard changes but because how the
wording in the Standard is interpreted, usually through a
response to a Defect Report. A good example of this kind of
change is "wobbly bits" - the idea that when a variable has not
been initialized then the bits of the variable are allowed to
change at any time. (By the way, IMO this idea is completely
stupid.) As far as I am aware this principle is not stated
anywhere in the C standard itself, but has crept into how the C
standard is interpreted by way of responses to Defect Reports.
It could be that changes of this kind is what you are thinking
about.

Overall though, I think the greatest changes in compiler behavior
are a result not of changes in the C standard but of optimization
techniques becoming more aggressive. To make things worse, it
isn't always clear whether a changed behavior is the result of a
more aggressive advantage-taking of a true UB situation, or if
the optimizer is buggy. I encountered an interesting situation
recently where a given piece of code worked just fine under both
gcc and clang, *except* under gcc at level O3 (clang at O3 had no
problems). It's been more than a decade since C11 was ratified
(and nearly a quarter of a century since C99). Compilations
should always be done with an explicit -std=c99 or -std=c11. If
you have been compiling with -std=c99 all this time, or even
using -std=c11 over the shorter time frame, and you see changes
between different versions of the compiler, it's not the C
standard changing that's causing the problem, but how the
compiler is choosing to act on what should be a fixed set of
rules.
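
To see which language version a given build is actually using, one can ask the compiler itself. A small sketch (the function name is made up for illustration), relying only on the standard __STDC_VERSION__ macro and the values published with C99 and C11:

```c
/* Report which C standard the translation unit was compiled as.
 * __STDC_VERSION__ is defined by the standard from C95 onwards;
 * a strict C90 compiler leaves it undefined. */
const char *c_standard_name(void)
{
#if !defined(__STDC_VERSION__)
    return "C90";
#elif __STDC_VERSION__ >= 201112L
    return "C11 or later";
#elif __STDC_VERSION__ >= 199901L
    return "C99";
#else
    return "C90 with Amendment 1";
#endif
}
```

Building the same file once with -std=c99 and once with -std=c11 should change the returned string, which makes it easy to confirm that the flag in the build script is the one taking effect.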

Completely coincidentally, I happened to see a couple of
videos recently

https://www.youtube.com/watch?v=si9iqF5uTFk Grace M Hopper I
https://www.youtube.com/watch?v=AW7ZHpKuqZg Grace M Hopper II

that I think folks in comp.arch might be interested to watch.
The second one deals with language versions and compiler
verification (among other topics). A bit on the long side
but I enjoyed watching them.


From Terje Mathisen@21:1/5 to Stephen Fuld on Mon Sep 2 10:23:43 2024

Stephen Fuld wrote:

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written in
C versus code written in something like Ada or Rust. Backing away now
. . . :-)

IMNSHO, code written in asm is generally more safe than code written in
C, because the author knows exactly what each line of code is going to do.

The problem is of course that it is harder to get 10x lines of correct
asm than to get 1x lines of correct C.

BTW, I am also solidly in the grey hair group here, writing C code that
is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it really obvious that they cannot alias any globals or input/output parameters.
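
A hedged sketch of that style (the function and parameter names are invented for illustration, not taken from anyone's actual code). Copying the value behind a pointer into a local makes it plain, to the reader and to the compiler, that stores inside the loop cannot change it:

```c
#include <stddef.h>

/* Hypothetical example: scale n elements of src into dst.
 * 'factor' is passed by pointer, so without the local copy the
 * compiler must assume that a store to dst[i] might change *factor
 * and reload it on every iteration. */
void scale(int *dst, const int *src, const int *factor, size_t n)
{
    const int f = *factor;        /* loop invariant held in a local */

    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * f;
    }
}
```

Note that the local copy also pins down the meaning: if a caller passes a factor that lives inside dst, every element is still scaled by the value the factor had on entry.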

Anyway, that is all mostly moot since I'm using Rust for this kind of programming now. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From David Brown@21:1/5 to George Neuner on Mon Sep 2 14:22:51 2024

On 02/09/2024 06:08, George Neuner wrote:

On Sun, 1 Sep 2024 22:07:53 +0200, David Brown

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is mentioned as UB in some standard N, but was not addressed in previous standards.

Was it always UB? Or should it be considered ID until it became UB?

I can't answer for languages other than C and C++ (others might be able
to compare usefully to, for example, Ada or Fortran). But the C
standards explicitly state that behaviours that are not defined in the standards are undefined behaviour in exactly the same way as cases that
are labelled as undefined behaviour, and also cases where the program
violates a "shall" or "shall not" requirement.

To be clear - the meaning of "undefined behaviour" is simply that no
behaviour has been defined. The C standards can say that something is "undefined behaviour" (or just fail to give a definition of the
behaviour) and then the implementation can give a definition of it. An
example here would be that the C standards say that signed integer
arithmetic overflow is undefined behaviour - if you have a signed
integer operation and the mathematically correct results can't be
represented in the type, then there is no possible way for the generated
code to give the correct result. The C standards therefore leave this
as "undefined behaviour". However, if you use "gcc -fwrapv" then the
behaviour /is/ defined - it is defined as two's complement wrapping.

So if you write C code that overflows signed integer arithmetic and
relies on given behaviour and results, the code is wrong because it has undefined behaviour - you are, at best, relying on luck. But if you
write C code with such demands and specify that it is only suitable for
use with the gcc "-fwrapv" flag, then it is not wrong and there is no
undefined behaviour because the compiler implementation has given a
definition of the behaviour. However, if you use the same code with,
say, old versions of MSVC then you are back to luck and UB even if that compiler does not have optimisations based on knowing that signed
integer arithmetic overflow is UB. And it is /your/ fault when the code
fails on newer versions of MSVC that /do/ have such optimisations.
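
For code that has to stay within standard C rather than depend on "-fwrapv", one common approach is to test before adding, so that the overflowing operation is never executed. A minimal sketch, with a made-up function name (gcc and clang also offer __builtin_add_overflow, which is a compiler extension and not used here):

```c
#include <limits.h>
#include <stdbool.h>

/* Add a and b without ever executing an overflowing signed addition.
 * Returns true and stores the sum in *result if it is representable,
 * otherwise returns false and leaves *result untouched. */
bool checked_add(int a, int b, int *result)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return false;             /* the mathematical sum does not fit */
    }
    *result = a + b;
    return true;
}
```

The comparisons themselves cannot overflow: INT_MAX - b is only evaluated for positive b, and INT_MIN - b only for negative b.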

This is all very different from what the C standards call
"implementation-defined behaviour". Such things as how out-of-range
values are converted to signed integer types are explicitly IB in the
C standards - implementations must define and document the behaviour.
(Conversion to unsigned integer types, by contrast, is fully defined
as reduction modulo 2^N.)
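
As a small illustration of where the line falls for integer conversions (helper names are invented for this sketch): converting to an unsigned type is fully defined by the standard, while converting a value that does not fit to a signed type is the implementation-defined direction, so portable code checks the range first.

```c
#include <limits.h>

/* Conversion to an unsigned type is fully defined by the standard:
 * the value is reduced modulo UINT_MAX + 1, so -1 becomes UINT_MAX
 * on every conforming implementation. */
unsigned int to_unsigned(int x)
{
    return (unsigned int)x;
}

/* Conversion of an out-of-range value to a signed type is the
 * implementation-defined direction, so this helper refuses anything
 * that does not fit instead of relying on what a particular
 * compiler documents. Returns 1 on success, 0 otherwise. */
int to_signed_checked(unsigned int u, int *out)
{
    if (u > (unsigned int)INT_MAX) {
        return 0;                 /* not representable as int */
    }
    *out = (int)u;
    return 1;
}
```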

It does seem to me that as the C standard evolved, and as more things
have *explicitly* become documented as UB, compiler developers have
responded largely by dropping whatever the compiler did previously - sometimes breaking code that relied on it.

I think that is perhaps partly true, partly a myth, and partly simply a side-effect of compilers gaining more optimisations as they are able to
analyse more code at a time and do more advanced transforms. The C
standards have clarified some of the text over time (most people would
agree there is still plenty of scope for improvement there!). That can
include changing some things that were previously undefined by omission
to being explicitly labelled UB. I can't think of any examples
off-hand. But note that this would not in any way change the meaning of
the code - UB by omission is the same as explicit UB as far as the C
language is concerned. There are very few cases where code was correct
for original standard C90 (i.e., independent of any IB and independent
of particular compilers) and is not correct C23 with identical defined behaviour. There were a few things changed between C90 and C99, but I
don't know of any since then other than a few added keywords that could conflict with user identifiers.

It is an unfortunate truth that older C compilers did not do as good a
job at optimisation as newer ones. And this meant that many tricks were
used in order to get efficient results, even though some of these relied
on UB. Such code can have different results on different compilers, or different sets of options, because there is no definition of what the
"correct" result should be. The programmer will have a clear idea of
what they think is "correct", but it is not defined or specified
anywhere. Usually the programmer feels it is "obvious" what the
intended behaviour is - but "obvious" to a programmer does not mean
"obvious" to a compiler. Thus you end up with code that works (as
intended by the programmer) by testing and good luck with some compilers
and options, and fails by bad luck on other compilers or options. The
compiler didn't "break" the code - the code was broken to start with.
But it is entirely reasonable and understandable why the programmer
wrote the "broken" code in the first place, and why it did a useful job
despite having UB.

So I appreciate when people get frustrated that changes to a tool change
the apparent behaviour of their code. But it is important to understand
that the compiler is not wrong here - it is doing the best job it can for
people writing correct code. A development tool should emphasise people
using it /now/ - and while there is C code in use today that was written
many decades ago, the majority of C code (and even more so for C++) is
much more recent. It would be wrong to limit modern programmers because
of code written long ago - even more so when there is no clear
specification of how that old code was supposed to work.

I have moved on from C (mostly), and I learned long ago to archive
toolchains and to expect that any new version of a tool might break
something that worked previously. I don't like it, but it generally
doesn't annoy me that much.

This all depends on the kind of code you write, and the kind of system
you target. On my embedded targets, most of my code can be written in
standard C. But a lot of it also uses at least some gcc extensions to
improve the code - enhancing static error checking, making it more
efficient, or making it easier and clearer to write. I am quite clear
there that the code is dependent on gcc (it would probably also be fine
for clang, but I have not checked that). For all such code, I do my
utmost to make sure it is correct and safe, with no UB and no IB beyond
what is obvious and necessary. Most programs will also contain code
that is more specifically toolchain-dependent, perhaps with snippets of
inline assembly, or target-specific features that are needed. This was
more of an issue before, when I was using a wider range of compilers.

But for any given project, I stick to a single compiler version and
usually one set of compiler flags. For my work, code without C-level UB
is not enough - I sometimes also need to test for things like run-time
speed and code size, or interaction with external tools of various
sorts, or stack usage limits - all things that are outside the scope of C.

However, I don't remember when I last found that portable code that I
wrote and was working on one compiler failed to have correct C-level functionality when compiled with a newer compiler (or flags) due to
undefined behaviour, new optimisations, or changes in the C standard.
I've had portability issues with older code due to IB such as writing
code for a microcontroller with a different size of "int". I've seen
issues with third-party code - I've had to compile such code with
"-fwrapv -fno-strict-aliasing" on occasion. I've made other mistakes in
my code. And I've got UB things wrong in my early days when new to C programming. But truly, I am at a loss to understand why some people
are so worried about UB in C - you simply need to know the rules and specifications for the language features you use, and follow those rules.

MMV. Certainly Anton's does. ;-)

Anton writes code that seriously pushes the boundary of what can be
achieved. For at least some of the things he does (such as GForth) he
is trying to squeeze every last drop of speed out of the target. And he
is /really/ good at it. But that means he is forever relying on nuances
about code generation. His code, at least for efficiency if not for correctness, is dependent on details far beyond what is specified and documented for C and for the gcc compiler. He might spend a long time
working with his code and a version of gcc, fine-tuning the details of
his source code to get out exactly the assembly he wants from the
compiler. Of course it is frustrating for him when the next version of
gcc generates very different assembly from that same source, but he is
not really programming at the level of C, and he should not expect
consistency from C compilers like he does.

Similar to you (David), I came from a - not embedded per se - but
kiosk background: HRT industrial QA/QC systems. I know well the
attraction of a new compiler yielding better performing code. I also
know a large amount of my code was hardware and OS specific, that
those are the things beyond the scope of the compiler, but they also
are things that I don't want to have to revisit every time a new
version of the compiler is released.

Yes. For this kind of work, you want to keep your build environment
consistent - no matter how careful you are to write correct code without UB.

13 of one, baker's dozen of the other.


From Thomas Koenig@21:1/5 to Terje Mathisen on Mon Sep 2 13:13:20 2024

Terje Mathisen <[email protected]> schrieb:

Brett wrote:

John Dallman <[email protected]> wrote:

In fact, organisations replace about a quarter of their machines each
year, always buying up-to-date ones, and want to run the /same/ version
of software on all of them. They want common software versions for data
compatibility, ease of training and so on. That means that a new release
of an application has to run on all the machines sold in the last four
years, sometimes longer.

I assume you work in the high end, as the average desktop PC is replaced
every 8 years on a "use it until it breaks" policy.

Dell will tell you 5 years, and Google is paid to say the same.
And that actually might be true for laptops, but not desktops.

The bulk of the PCs and servers where I work are a dozen years old.
A smattering of new PCs bring the average down to 9 years.

Organizations that rely on commercial licenced software have a much
easier calculation to make:

"I pay 10-100K dollars every year per CPU for my 3D
CAD/modelling/whatever software, if I can buy a new system in 2-4 years
time which is 50% faster (more cores/faster threads), then it could make
sense to upgrade every year, except for the hassle of installing
everything."

Made more complicated by wildly different license schemes.
Some vendors give the victim^H^H^H^H^H^Hcustomer a number of
licenses for interactive use (up to four cores, for example),
and you have to purchase extra for "HPC" use (which is ridiculous
today). With others, you need a "network license" to even connect
remotely, but you can run a single calculation on as many parallel
cores and CPUs, on a cluster, as you want.


From MitchAlsup1@21:1/5 to Robert Finch on Mon Sep 2 13:33:23 2024

On Mon, 2 Sep 2024 5:06:34 +0000, Robert Finch wrote:

ENTER, LEAVE, and RET as the only instructions capable of accessing the
safe stack is fascinating me. I would like to try implementing this sort
of thing in my design. Pondering why the PTE is specially marked
RWE=000? One would think that some other OS available bits could be
used. Does it make the MMU software easier to implement? Assuming that
faults processed during ENTER, LEAVE, and RET are processed at a higher privilege level, could it not just check some other internal tables?

a) I did not want to consume another bit in PTE
b) I did not want to compare CSP with another base register
So, RWE=000 was the ticket.
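
A toy model in C of the check being described, purely as a sketch. All names and the encoding are hypothetical; in particular, refusing ENTER/LEAVE/RET-style accesses to ordinary pages is an assumption of this sketch, not something stated in the thread:

```c
#include <stdbool.h>

/* Toy model only; names and encoding are hypothetical. */
enum access_kind { ACC_LOAD, ACC_STORE, ACC_FETCH, ACC_SAFE_STACK };

struct pte_perms { bool r, w, e; };

bool access_allowed(struct pte_perms p, enum access_kind kind)
{
    /* RWE = 000 marks a safe-stack page without spending another
     * PTE bit and without comparing against a base register. */
    bool safe_stack_page = !p.r && !p.w && !p.e;

    if (safe_stack_page) {
        /* Only ENTER, LEAVE and RET (modelled as ACC_SAFE_STACK) get in. */
        return kind == ACC_SAFE_STACK;
    }
    switch (kind) {
    case ACC_LOAD:  return p.r;
    case ACC_STORE: return p.w;
    case ACC_FETCH: return p.e;
    default:        return false;  /* assumption: safe-stack ops need a 000 page */
    }
}
```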

This ends up very similar to MILL in the Safe-Stack stuff. I tried to
do it without a separate stack and failed.

Decided to try implementing a capabilities machine in the current
design. Modeled it after the RISC-V capabilities instructions in the
CHERI document. It was either that or a segmentation system. Got to keep
the ole brain working.

Going with an OoO design for Bigfoot.

The rf386 takes an average of about 8 clocks per instruction. Helped out
by the presence of a data cache. IPC of 0.125 is nothing to write home
about. About 5 MIPS at 50 MHz. Stores are fast (2-3 cycles), but loads are
another story (14 ish cycles).


From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Sep 2 13:36:49 2024

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.

Was it always UB? Or should it be considered ID until it became UB?

Can you give an example?

Memcopy() with overlapping pointers.


From Tim Rentsch@21:1/5 to [email protected] on Mon Sep 2 06:59:32 2024

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.

Was it always UB? Or should it be considered ID until it became UB?

Can you give an example?

Memcopy() with overlapping pointers.

Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.
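
For completeness, the standard library's answer to overlapping copies is memmove(). A minimal sketch, with a made-up helper name:

```c
#include <string.h>

/* Open a gap of 'gap' bytes at the front of buf by shifting the first
 * 'len' bytes towards higher addresses. Source and destination
 * overlap whenever gap < len, so memmove() is required; memcpy()
 * here would be undefined behaviour.
 * The caller must guarantee buf has room for len + gap bytes. */
void shift_right(char *buf, size_t len, size_t gap)
{
    memmove(buf + gap, buf, len);
}
```

With an 8-byte buffer holding "abcdef", shift_right(buf, 4, 2) leaves "ababcd": the first four bytes arrive intact two positions later.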


From Michael S@21:1/5 to Tim Rentsch on Mon Sep 2 18:09:03 2024

On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.

Was it always UB? Or should it be considered ID until it became
UB?

Can you give an example?

Memcopy() with overlapping pointers.

Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the world
are telling the same story. Maybe there exists an old popular book that
said that it was defined?


From Stephen Fuld@21:1/5 to Terje Mathisen on Mon Sep 2 09:46:47 2024

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry
etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing away
now . . . :-)

IMNSHO, code written in asm is generally more safe than code written in
C, because the author knows exactly what each line of code is going to do.

The problem is of course that it is harder to get 10x lines of correct
asm than to get 1x lines of correct C.

BTW, I am also solidly in the grey hair group here, writing C code that
is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it really obvious that they cannot alias any globals or input/output parameters.

Anyway, that is all mostly moot since I'm using Rust for this kind of programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Tim Rentsch@21:1/5 to Michael S on Mon Sep 2 10:21:17 2024

Michael S <[email protected]> writes:

On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.

Was it always UB? Or should it be considered ID until it became
UB?

Can you give an example?

Memcopy() with overlapping pointers.

Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the world
are telling the same story. Maybe there exists an old popular book that
said that it was defined?

My first answer is that the question asked was about standards, and
that is the question I was answering. There were no C standards
before 1989.

My second answer is, if I wanted to research the issue for the time
before there were any C standards, I would start with these
references, in more or less this order:

K&R original edition (1978)
PJ Plauger's book on implementing the C standard library
Harbison and Steele
K&R 2nd edition (1988?)

Probably there are others but these are what I thought of off the
top of my head.

My third answer is, it wouldn't surprise me if there were a book or
some sort of reference document that makes such a claim about how
memcpy behaves, but I'm not aware of any (which doesn't mean
anything), and nothing comes to mind in the general domain of near-authoritative books on C (other than the four listed above).
So, assuming there is such a book or document, I expect it would
be one of two things:

Reference documentation for some specific C implementation (as
for example from Sun Microsystems); or

A book (or document) that purports to be authoritative (or maybe
appears to be authoritative) but in reality is not.

Obviously I can't disprove the existence of something that Terje
said he read many years ago (perhaps with more information this
could be done, but for sure I don't have such information). For the
sake of discussion I'm willing to stipulate that Terje did read
something and that what he read did say something about memcpy
working for overlapping arguments. The question then becomes, What
is it that he read, and what exactly did it say? I'm not in a
position to answer those questions but maybe Terje or someone else
remembers and can fill us in.

(My aversion to using google groups stops me from following the
reference you nicely provided.)

To all this I should add that it certainly is feasible to implement
memcpy so that it works with overlapping arguments, and I have no
doubt (strictly speaking, less than epsilon doubt) that some library
implementer somewhere (and probably more than one) has done this.
Also it goes without saying that the C standard allows such a choice
even today, and an implementation could choose to document that
memcpy is well-behaved in that implementation. Undefined behavior
doesn't mean that what will happen must be bad, only that what does
happen is completely up to the implementation. Unfortunately more
and more compiler writers are taking the attitude that any tiny bit
of freedom in the direction of undefined behavior should be taken
advantage of in pursuit of even the most trivial possible gain in
performance, at the cost of ripping the code to shreds and making C
less reliable than it could be (and should be). In some sense I am
agreeing that the problem here is caused by the C standard, not by
it changing in different versions but by it giving too much freedom
to implementors for so-called "undefined behavior". Sadly the
standardization process seems to have been taken over by compiler
writers, so the best advice I can offer is to join the ISO C
committee and start voting out the lunacy. Alternatively I suppose
one could start up a competitive effort to gcc and clang, and offer
a compiler that doesn't engage in such shenanigans unless told to do
so (and told specifically), and then try to get developers to switch
to sane C in preference to the ever-increasingly insane C that is
most commonly used today.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to [email protected] on Mon Sep 2 17:59:16 2024

MitchAlsup1 <[email protected]> schrieb:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.

Was it always UB? Or should it be considered ID until it became UB?

Can you give an example?

Memcopy() with overlapping pointers.

Does anybody have the first edition of K&R around to check what is
explicitly stated there?

If both were intended to have the same functionality, it would have
been strange to define both.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Sep 2 18:32:54 2024

Tim Rentsch <[email protected]> schrieb:

In some sense I am
agreeing that the problem here is caused by the C standard, not by
it changing in different versions but by it giving too much freedom
to implementors for so-called "undefined behavior". Sadly the
standardization process seems to have been taken over by compiler
writers, so the best advice I can offer is to join the ISO C
committee and start voting out the lunacy.

The standard could always define previously undefined behavior in
subsequent versions. (Adding a new feature is mostly that).

However, the main problem I see is that of defining that subset
or version or whatever you want to call it of C that you (generic
you) want implemented. It could be defined as an extension (or
restriction, if you will) of the C standard, with additional
rules.

Alternatively I suppose
one could start up a competitive effort to gcc and clang, and offer
a compiler that doesn't engage in such shenanigans unless told to do
so (and told specifically), and then try to get developers to switch
to sane C in preference to the ever-increasingly insane C that is
most commonly used today.

The specification needs to come first! Right now, compiler writers
have a specification, the standard, which they generally follow
(modulo bugs and extensions). You have to give them another,
supplemental specification to follow if you want any chance
of success.

But writing such a specification is a lot of work, very hard work,
and needs a lot of discussion.

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

If it gets accepted by a wide community, then a branch trying to
implement that particular version in either gcc or clang (or
both) could have a certain chance of being implemented by the
main compilers.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Tim Rentsch on Mon Sep 2 19:31:18 2024

Tim Rentsch <[email protected]> writes:

Michael S <[email protected]> writes:

On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.

Was it always UB? Or should it be considered ID until it became
UB?

Can you give an example?

Memcopy() with overlapping pointers.

Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different ages living in different parts of the world
are telling the same story. Maybe there exists an old popular book that
said that it was defined?

My first answer is that the question asked was about standards, and
that is the question I was answering. There were no C standards
before 1989.

Third edition of the SVID (8/89) has on pg. 7-83:

USAGE:
Character movement is performed differently in different
implementations. Thus overlapping moves may be unpredictable.
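
The SVID remark can be illustrated with two hand-written byte copies. This is only a sketch: copy_ascending and copy_descending are hypothetical names, not functions from any real library, and real memcpy() implementations are far more elaborate. Both loops are well defined in C even for overlapping buffers, yet they disagree once the regions overlap, which is why a caller cannot predict the result without knowing the copy direction.

```c
#include <stddef.h>

/* Copy n bytes one at a time, walking from the lowest address upward. */
void copy_ascending(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copy n bytes one at a time, walking from the highest address downward. */
void copy_descending(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n-- > 0)
        dst[n] = src[n];
}
```

With a six-byte buffer holding "abcdef", copying 4 bytes from offset 0 to offset 2 gives "ababab" in the ascending version (the already overwritten bytes are read back) and "ababcd" in the descending version.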

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Sep 2 19:32:56 2024

Thomas Koenig <[email protected]> writes:

MitchAlsup1 <[email protected]> schrieb:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.

Was it always UB? Or should it be considered ID until it became UB?

Can you give an example?

Memcopy() with overlapping pointers.

Does anybody have the first edition of K&R around to check what is
explicitly stated there?

The system V interface definition, third edition, August 1989 states
that overlapping moves are unpredictable specifically due to differences
in implementation.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Thomas Koenig on Mon Sep 2 20:52:21 2024

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Schultz@21:1/5 to Thomas Koenig on Mon Sep 2 16:42:07 2024

On 9/2/24 12:59 PM, Thomas Koenig wrote:

Memcopy() with overlapping pointers.

Does anybody have the first edition of K&R around to check what is
explicitly stated there?

memcpy() doesn't appear in the index.

--
http://davesrocketworks.com
David Schultz

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Tue Sep 3 01:47:22 2024

On Mon, 2 Sep 2024 19:32:23 +0000, BGB wrote:

On 9/1/2024 6:32 PM, MitchAlsup1 wrote:

More modern machines have RND nobody will ever have REM.

Which is probably not a lot, as off-hand I am not aware of many ISA's
that have floor/ceil/round in the ISA itself, rather than doing it via
conversion to an integer type.

VAX has round float* to float* 1978.....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Sep 3 01:50:52 2024

On Mon, 2 Sep 2024 20:52:21 +0000, Thomas Koenig wrote:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

It depends::

Let
Base = NULL;
Index = &array / sizeof( array[0] );

is::

x = [base+index<<scale+small_offset]

undefined ??

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to David Schultz on Tue Sep 3 01:52:06 2024

On Mon, 2 Sep 2024 21:42:07 +0000, David Schultz wrote:

On 9/2/24 12:59 PM, Thomas Koenig wrote:

Memcopy() with overlapping pointers.

Does anybody have the first edition of K&R around to check what is
explicitly stated there?

memcpy() doesn't appear in the index.

Was in the library I used in 1980 BSD Unix PDP-11-70.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to [email protected] on Mon Sep 2 20:40:39 2024

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 20:52:21 +0000, Thomas Koenig wrote:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

It depends::

Let
Base = NULL;
Index = &array / sizeof( array[0] );

is::

x = [base+index<<scale+small_offset]

undefined ??

These lines aren't even close to being meaningful C source.
What question are you trying to ask?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to All on Mon Sep 2 20:37:31 2024

Thomas Koenig <[email protected]> writes:

I'm responding here to one part of your posting. I
may respond to the other part at a later time.

Tim Rentsch <[email protected]> schrieb:

In some sense I am
agreeing that the problem here is caused by the C standard, not by
it changing in different versions but by it giving too much freedom
to implementors for so-called "undefined behavior". Sadly the
standardization process seems to have been taken over by compiler
writers, so the best advice I can offer is to join the ISO C
committee and start voting out the lunacy.

Alternatively I suppose
one could start up a competitive effort to gcc and clang, and offer
a compiler that doesn't engage in such shenanigans unless told to do
so (and told specifically), and then try to get developers to switch
to sane C in preference to the ever-increasingly insane C that is
most commonly used today.

The specification needs to come first! Right now, compiler writers
have a specification, the standard, which they generally follow
(modulo bugs and extensions). You have to give them another,
supplemental specification to follow if you want any chance
of success.

But writing such a specification is a lot of work, very hard work,
and needs a lot of discussion.

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

If it gets accepted by a wide community, then a branch trying to
implement that particular version in either gcc or clang (or
both) could have a certain chance of being implemented by the
main compilers.

My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now, with
additional guarantees for what happens in cases that are
undefined behavior. Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).

The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Tim Rentsch on Tue Sep 3 05:55:14 2024

Tim Rentsch <[email protected]> schrieb:

My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,

Sure, that was also what I was suggesting - define things that
are currently undefined behavior.

with
additional guarantees for what happens in cases that are
undefined behavior.

Guarantees or specifications - no difference there.

Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).

It's the other way around - you need to describe first what the
actual behavior in absence of any pragmas is, and this needs to be a
firm specification, so the programmer doesn't need to read your mind
(or the source code to the compiler) to find out what you meant.
"But it is clear that..." would not be a specification; what is
clear to you may absolutely not be clear to anybody else.

This is also the only chance you'll have of getting this implemented
in one of the current compilers (and let's face it, if you want
high-quality code, you would need that; both LLVM and GCC
have taken an enormous amount of effort up to now, and duplicating
that is probably not going to happen).

The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.

I understood that - defining things that are currently undefined.
But without a specification, that falls down.

So, let's try something that causes some grief - what should
be the default behavior (in the absence of pragmas) for integer
overflow? More specifically, can the compiler set the condition
to false in

int a;

...

if (a > a + 1) {
}

and how would you specify this in an unambiguous manner?
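
For reference, under current ISO C signed overflow is undefined, so a compiler is permitted to assume that a + 1 never overflows and to treat the condition above as always false. A minimal sketch of how the same question can be asked without ever performing the overflowing addition follows; the function names are hypothetical and not part of any standard library.

```c
#include <limits.h>
#include <stdbool.h>

/* True when a + 1 would not fit in an int.
 * No arithmetic that could overflow is performed. */
bool increment_would_overflow(int a)
{
    return a == INT_MAX;
}

/* General form: true when a + b would not fit in an int.
 * INT_MAX - b (for b > 0) and INT_MIN - b (for b < 0) cannot overflow. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    if (b < 0)
        return a < INT_MIN - b;
    return false;
}
```

Because every comparison here is defined for all inputs, the result does not depend on what a given compiler decides to do with overflow itself.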

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Stephen Fuld on Tue Sep 3 08:23:33 2024

On 02/09/2024 18:46, Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry
etc.,
while compiler writers are free in their search for optimization
tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain 20 after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing away
now . . . :-)

IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is going
to do.

The problem is of course that it is harder to get 10x lines of correct
asm than to get 1x lines of correct C.

BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.

Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

And also for Rust versus C++ ?

My impression - based on hearsay for Rust as I have no experience - is
that the key point of Rust is memory "safety". I use scare-quotes here,
since it is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C, but it is
also very easy to get it wrong - especially if the developer doesn't use
available tools for static and run-time checks. Modern C++, on the
other hand, makes it much easier to get right. You can cause yourself
extra work and risk by using more old-fashioned C++, but following
modern design guides using smart pointers and containers, along with
easily available tools, and you get a lot of the management of memory
handled automatically for very little cost.

C++ provides a huge amount more than Rust - when I have looked at Rust,
it is (still) too limited for some of what I want to do. Of course,
"with great power comes great responsibility" - C++ provides many
exciting ways to write a complete mess :-)

Most of the "Rust vs C++" comparisons I see are complete rubbish in
regards to C++ - they tend to see it as "C with a couple of OOP bits
added", and are usually strongly biased towards the Rust fad. For example:

<https://www.geeksforgeeks.org/rust-vs-c/>

This says Rust is "Multi-paradigm (functional, imperative)" while C++ is
"Object-oriented". C++ is as "multi-paradigm" as you can get in a
programming language - object-oriented /and/ functional /and/ imperative
/and/ generic /and/ lots of other "paradigms". And it says C++ has
"manual memory management", while omitting that it /also/ has extensive
automatic memory management.

To my mind, the important question is not "Should we move from C to
Rust?", but "Should we move from bad C to C++, Rust, or simply to good C
practices?".

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Tue Sep 3 10:44:21 2024

On Tue, 3 Sep 2024 05:55:14 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Tim Rentsch <[email protected]> schrieb:

My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,

Sure, that was also what I was suggesting - define things that
are currently undefined behavior.

with
additional guarantees for what happens in cases that are
undefined behavior.

Guarantees or specifications - no difference there.

Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).

It's the other way around - you need to describe first what the
actual behavior in absence of any pragmas is, and this needs to be a
firm specification, so the programmer doesn't need to read your mind
(or the source code to the compiler) to find out what you meant.
"But it is clear that..." would not be a specification; what is
clear to you may absolutely not be clear to anybody else.

This is also the only chance you'll have of getting this implemented
in one of the current compilers (and let's face it, if you want
high-quality code, you would need that; both LLVM and GCC
have taken an enormous amount of effort up to now, and duplicating
that is probably not going to happen).

The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.

I understood that - defining things that are currently undefined.
But without a specification, that falls down.

So, let's try something that causes some grief - what should
be the default behavior (in the absence of pragmas) for integer
overflow? More specifically, can the compiler set the condition
to false in

int a;

...

if (a > a + 1) {
}

and how would you specify this in an unambiguous manner?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Thomas Koenig on Tue Sep 3 09:29:22 2024

On 02/09/2024 22:52, Thomas Koenig wrote:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

That sounds a lot like adding a new type of run-time error handling to
the language. That's not necessarily a bad idea, but it would likely be
a very big change with significant ramifications for existing code.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

The kind of null pointer checks that are thrown away by some compilers
are those that come /after/ a dereference :

int foo(int * p) {
    int x = *p;
    if (!p) {
        printf("I shouldn't have done that...\n");
    }
    return x;
}

If dereferencing a null pointer is an "observable error", it needs to be
observed at the "int x = *p;" line, and has no influence on the deletion
of the later pointer check.
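
A sketch of the ordering that keeps the check meaningful is to test the pointer before any dereference. The name foo_checked and the out-parameter interface are hypothetical, chosen only for illustration; this is one possible shape, not the only one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Variant of foo() where the null test comes first. No execution path
 * reaches *p with p null, so the compiler has no grounds to drop the test.
 * On success the value is stored in *out and true is returned;
 * on a null argument nothing is written and false is returned. */
bool foo_checked(const int *p, int *out)
{
    if (p == NULL || out == NULL)
        return false;
    *out = *p;
    return true;
}
```

The cost of the check is now explicit and paid only by callers who choose this interface, which is in keeping with leaving such checks under the programmer's control.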

Making dereferencing a null pointer an "observable error" would mean
requiring compilers to insert an explicit check in a large number of
cases, with a jump to some kind of run-time error-handling code when it
is zero. That is a very significant cost, to be paid by all users of
pointers in C - even those that are careful to ensure that their
pointers are not null before calling "foo". (There's also the
definition complications - a pointer that happens to contain the value
0, or point to address 0, is not necessarily a NULL pointer, and on some
targets there are lots of different values that are all null pointers.
And there are endless possibilities for invalid pointers that are not
null.)

C is a language where the programmer takes the responsibility to get the
code right - not the language or run-time. It insists on manual and
explicit control of this kind of thing, so that you don't have to pay
for checks you don't want.

Leaving the dereferencing of invalid pointers as undefined behaviour
means that code that does not have invalid pointers does not have extra
hidden checks and costs, along with hidden jumps to error handlers. It
also means that development tools can run in modes that add whatever
they like of extra checks and handling of invalid pointers -
"sanitizers" and other run-time checkers. And static error checkers can
warn if they see code paths with bad dereferences.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Thomas Koenig on Tue Sep 3 10:10:20 2024

On 03/09/2024 07:55, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,

Sure, that was also what I was suggesting - define things that
are currently undefined behavior.

with
additional guarantees for what happens in cases that are
undefined behavior.

Guarantees or specifications - no difference there.

I personally think that - for the most part - that would be a really bad
idea. I am not in favour of arbitrarily defining the behaviour of
something that has no sensible correct behaviour. If the code flow
reaches something that is run-time UB, the code is wrong or has been
used incorrectly (i.e., the calling code, or user, or something else has
made a mistake). No possible handling of the UB will result in correct
results.

It is sometimes possible to have damage limitation, such as exiting the
program quickly with an error message rather than corrupting files,
opening security breaches, etc. But that is always context specific -
stopping the program with an error message is fine for many PC programs,
but less ideal for a flight control system.

There are some languages that have integrated error handling, and can
sensibly have checks as a natural part of the language and the code. C
is not such a language. Let C remain a language where the programmer
has control, and where checks are done manually or they are not done at
all. People who don't want that, should use other languages that give
them what they want. UB in C is a /feature/, it is not a problem.
Trying to remove UB (by specifying more behaviour) reduces the power of
the language, and reduces the power of tools for the language, often for
downright silly results (like wrapping integer overflow).

But if people want a compiler that has extra guarantees and
specifications for behaviour in cases of UB, then those already exist -
"gcc -fsanitize=undefined" would be a good example. Of course such
tools could be improved in a variety of ways.

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid imposing
too much work on early compilers. Where possible, UB that can be caught
at compile time, could usefully be turned into constraint violations that
must be diagnosed.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Tue Sep 3 11:40:42 2024

On Tue, 3 Sep 2024 05:55:14 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Tim Rentsch <[email protected]> schrieb:

My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,

Sure, that was also what I was suggesting - define things that
are currently undefined behavior.

with
additional guarantees for what happens in cases that are
undefined behavior.

Guarantees or specifications - no difference there.

Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).

It's the other way around - you need to describe first what the
actual behavior in absence of any pragmas is, and this needs to be a
firm specification, so the programmer doesn't need to read your mind
(or the source code to the compiler) to find out what you meant.
"But it is clear that..." would not be a specification; what is
clear to you may absolutely not be clear to anybody else.

This is also the only chance you'll have of getting this implemented
in one of the current compilers (and let's face it, if you want
high-quality code, you would need that; both LLVM and GCC
have taken an enormous amount of effort up to now, and duplicating
that is probably not going to happen).

The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.

I understood that - defining things that are currently undefined.
But without a specification, that falls down.

So, let's try something that causes some grief - what should
be the default behavior (in the absence of pragmas) for integer
overflow? More specifically, can the compiler set the condition
to false in

int a;

...

if (a > a + 1) {
}

and how would you specify this in an unambiguous manner?

I'd start much earlier, by declaration of "Homogeneity and Exclusion".
It would state that "more defined C" does not pretend to cover all
targets covered by existing C language.
Specifically, the following target characteristics are required:
- byte-addressable machine with 8-bit bytes
- two's-complement integer types
- if float type is supported it has to be IEEE-754 binary32
- if double type is supported it has to be IEEE-754 binary64
- if long double type is supported it has to be IEEE-754 binary128
- storage order for multibyte types should be either LE or BE,
consistently for all built-in types
- flat address space

That part should be specified in a more formal manner.
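
As a rough sketch of how some of these requirements could be checked mechanically, a C11 translation unit can carry compile-time assertions. This is illustrative only: the messages and the helper is_little_endian are hypothetical, the floating-point checks only test the format parameters exposed by <float.h> rather than full IEEE-754 conformance, and properties such as a flat address space cannot be expressed this way at all.

```c
#include <assert.h>   /* static_assert (C11) */
#include <float.h>
#include <limits.h>
#include <stdint.h>

/* 8-bit bytes */
static_assert(CHAR_BIT == 8, "8-bit bytes required");

/* Two's complement: -1 has all bits set, so the low two bits are 11.
 * Ones' complement would give 2, sign-magnitude would give 1. */
static_assert((-1 & 3) == 3, "two's-complement integers required");

/* float and double have the binary32 / binary64 shape */
static_assert(FLT_RADIX == 2 && FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128,
              "float must have the binary32 format parameters");
static_assert(DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024,
              "double must have the binary64 format parameters");

/* long double is deliberately not asserted here: a binary128 requirement
 * (LDBL_MANT_DIG == 113) would reject common x86 ABIs, which use an
 * 80-bit extended format. */

/* Byte order cannot be tested in a portable constant expression,
 * so this is a run-time probe. Returns 1 for LE, 0 otherwise. */
int is_little_endian(void)
{
    const uint32_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```

A target that fails any of the static assertions simply does not compile the code, which matches the idea of explicitly excluding such targets.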

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Michael S on Tue Sep 3 17:41:40 2024

Michael S wrote:

On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.

Was it always UB? Or should it be considered ID until it became
UB?

Can you give an example?

Memcopy() with overlapping pointers.

Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the world
are telling the same story. May be, there exist old popular book that
said that it was defined?

It probably wasn't written in the official C standard, which I couldn't
have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the source buffer, in increasing address order meant that it was guaranteed safe
when used to compact buffers.

Code that depended on this was fine for decades, until the first library/compiler implementation discovered that in some circumstances it
could be faster to go in reverse order.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Terje Mathisen on Tue Sep 3 19:09:28 2024

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.

Was it always UB? Or should it be considered ID until it became
UB?

Can you give an example?

Memcopy() with overlapping pointers.

Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. May be, there exist old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Code that depended on this was fine for decades, until the first library/compiler implementation discovered that in some circumstances
it could be faster to go in reverse order.

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Sep 3 17:46:38 2024

Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry
etc.,
while compiler writers are free in their search for optimization
tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing
away now . . . :-)

IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is going
to do.

The problem is of course that it is harder to get 10x lines of correct
asm than to get 1x lines of correct C.

BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.
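
A small sketch of the habit described above, with made-up function names. In the first version the compiler must assume that `factor` may point into `dst`, so it has to reload `*factor` after every store; in the second the invariant is copied into a local first, which cannot alias anything.

```c
#include <stddef.h>

/* Hypothetical example: `factor` is only read, but the compiler must
 * assume it may point into `dst`, so *factor is reloaded every iteration. */
void scale_reload(float *dst, const float *src, const float *factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *factor;
}

/* Copying the invariant into a local makes it obvious, to the reader and
 * the compiler alike, that stores to dst[] cannot change the multiplier. */
void scale_local(float *dst, const float *src, const float *factor, size_t n)
{
    const float f = *factor;   /* read once, before any store */

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * f;
}
```

The two versions give different results when `factor` really does point at an element of `dst`, which is exactly why the compiler cannot do the hoisting on its own.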

Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

Q&D programming is still far faster for me in C, but using Rust I don't
have to worry about how well the compiler will be able to optimize my
code, it is pretty much always close to speed of light since the entire aliasing issue goes away.

Rust also gets rid of the horrible external library/configure/cmake mess
that kept me from successfully compiling the reference LAStools lidar
code for nearly 10 years.

Using the Rust port I just tell cargo to add it to my project and that's it.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Michael S on Tue Sep 3 16:17:49 2024

Michael S <[email protected]> writes:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Mon, 02 Sep 2024 06:59:32 -0700
Tim Rentsch <[email protected]> wrote:

[email protected] (MitchAlsup1) writes:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that
explicitly is mentioned as UB in some standard N, but was not
addressed in previous standards.

Was it always UB? Or should it be considered ID until it became
UB?

Can you give an example?

Memcopy() with overlapping pointers.

Calling memcpy() between objects that overlap has always been
explicitly and specifically undefined behavior, going back to
the original ANSI C standard.

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. May be, there exist old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

In this case, 'compact' was used as a verb. Perhaps by removing
extraneous whitespace.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to David Brown on Tue Sep 3 09:54:11 2024

On 9/2/2024 11:23 PM, David Brown wrote:

On 02/09/2024 18:46, Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry
etc.,
while compiler writers are free in their search for optimization
tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing
away now . . . :-)

IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is
going to do.

The problem is of course that it is harder to get 10x lines of
correct asm than to get 1x lines of correct C.

BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.

Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

And also for Rust versus C++ ?

I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.

My impression - based on hearsay for Rust as I have no experience - is
that the key point of Rust is memory "safety". I use scare-quotes here, since it is simply about correct use of dynamic memory and buffers.

I agree that memory safety is the key point, although I gather that it
has other features that many programmers like.

It is entirely possible to have correct use of memory in C, but it is
also very easy to get it wrong - especially if the developer doesn't use available tools for static and run-time checks. Modern C++, on the
other hand, makes it much easier to get right. You can cause yourself
extra work and risk by using more old-fashioned C++, but following
modern design guides using smart pointers and containers, along with
easily available tools, and you get a lot of the management of memory
handled automatically for very little cost.

Is it fair to say then that Rust makes it harder to get memory
management "wrong"?

C++ provides a huge amount more than Rust - when I have looked at Rust,
it is (still) too limited for some of what I want to do.

Can you give a few examples?

Of course,
"with great power comes great responsibility" - C++ provides many
exciting ways to write a complete mess :-)

Sure. I gather that templates are very powerful and potentially very
useful. On the other hand, I gather that multiple inheritance is very powerful, but difficult to use and potentially very ugly, and has not
been carried forward in the same way into newer languages.

snip stuff about the inadequacy of existing Rust versus C++ comparisons.

To my mind, the important question is not "Should we move from C to
Rust?", but "Should we move from bad C to C++, Rust, or simply to good C practices?".

I understand. This brings up an important issue, that of older versus
newer languages.

A newer language has several advantages. One is it can take advantage
of what we have learned about language design and usage since the older language was designed. I can't overstate this enough. While many
new language features turn out to be not useful, many are.

Another is that it doesn't have to worry about support for "dusty
decks", i.e. the existing base which may conform to an older version of
the language, nor for "dusty brains", that is programmers who learned
the older (i.e. worse) ways and keep generating new code using those
ways. You mention this issue in your comments.

Of course, the counter to that is that new languages have to overcome
the huge "installed base" advantage of existing languages.

Let me be clear. I am not a Rust evangelist. I am just looking for a
way forward that will help us make programming easier and not to make
some of the same mistakes we have made in the past. Is Rust that? Some
people think so. I just want to understand more.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bernd Linsel@21:1/5 to All on Tue Sep 3 19:19:31 2024

On 03.09.24 10:10, David Brown wrote:

[snip]

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid imposing
too much work on early compilers. Where possible, UB that can be caught
at compile time, could usefully be turned into constraint violations that
must be diagnosed.)

And exactly these are the situations that I'd like to be warned about,
rather than the compiler making up something without telling.

--
Bernd Linsel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Michael S on Tue Sep 3 19:52:49 2024

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. May be, there exist old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of them marked
as deleted. Iterating over them while removing the gaps means that you
are always copying to a destination lower in memory, right?
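
A minimal sketch of that kind of compaction, assuming a hypothetical record layout with a `deleted` flag. It moves whole runs of live records at once, so source and destination can overlap; the sketch therefore uses memmove(), which the C standard defines for overlapping regions, rather than relying on memcpy() copying in ascending order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout, for illustration only. */
struct record {
    bool deleted;
    int  payload;
};

/* Slides every run of live records down over the gaps left by deleted
 * ones and returns the number of live records. The destination index
 * never exceeds the source index, so each move goes to the same or a
 * lower address. */
size_t compact_records(struct record *buf, size_t count)
{
    size_t out = 0;
    size_t in = 0;

    while (in < count) {
        if (buf[in].deleted) {
            in++;
            continue;
        }
        /* Find the end of the current run of live records: [in, run). */
        size_t run = in;
        while (run < count && !buf[run].deleted)
            run++;
        /* Source and destination may overlap, hence memmove(). */
        if (out != in)
            memmove(&buf[out], &buf[in], (run - in) * sizeof buf[0]);
        out += run - in;
        in = run;
    }
    return out;
}
```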

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to David Brown on Tue Sep 3 21:08:31 2024

On 2024-09-03 11:10, David Brown wrote:

[snip]

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid imposing
too much work on early compilers. Where possible, UB that can be caught
at compile time, could usefully be turned into constraint violations that
must be diagnosed.)

The problem, as you of course know, is that the "can" in "can be caught
at compile time" depends on the amount and kind of analysis that is done
at compile time -- some cases of UB "can" be caught at compile time but
only by advanced and costly analysis. If the language standard requires
that such things /must/ be detected by the compiler, it can place quite
a burden on the developers of conforming compilers.

As I understand it, current C compilers detect UB mostly as a side
effect of the analyses they do for code optimization purposes, which
vary widely between compilers, and so the UB-detections also vary.

This issue (compile-time detection) has now and then been discussed in
the Ada standards group. Given the currently low market penetration of
Ada, the group has been reluctant to require too much of the compilers,
and so the more advanced UB-detecting tools are stand-alone, such as the
SPARK tools.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to Terje Mathisen on Tue Sep 3 21:10:00 2024

On 2024-09-03 20:52, Terje Mathisen wrote:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. May be, there exist old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of them marked
as deleted. Iterating over them while removing the gaps means that you
are always copying to a destination lower in memory, right?

Only if you iterate in order of increasing memory address, which is not
the only possibility.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Tue Sep 3 15:06:38 2024

Undefined behaviour is something that is exercised at run-time.
That's why the "undefined behaviour sanitizers" insert run-time
checks. And of course they only detect the behaviour when it is
actually exercised.

IIUC the run-time checks need to *prevent* undefined behavior
rather than merely detect it, because if you do

    if (would_UB_here())
        fprintf (stderr, ...);
    maybe_do_UB_here();

the compiler is allowed to skip the `fprintf` if `maybe_do_UB_here`
does UB. IOW the UB effect can be "retroactive".
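
One way to read this is that the check has to guard the operation, not just sit in front of it. Here is a minimal sketch using signed addition as the operation; `checked_add` is a hypothetical helper, not something from the thread or from a library.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: stores a + b in *sum and returns true when the
 * result fits in an int. On overflow it reports and returns false
 * WITHOUT performing the addition, so no undefined behaviour is ever
 * reached and the compiler has nothing to reason backwards from. */
bool checked_add(int a, int b, int *sum)
{
    /* Both tests are themselves overflow-free for every a and b. */
    bool overflows = (b > 0) ? (a > INT_MAX - b)
                             : (a < INT_MIN - b);

    if (overflows) {
        fprintf(stderr, "checked_add: %d + %d does not fit in an int\n", a, b);
        return false;
    }
    *sum = a + b;
    return true;
}
```

GCC and Clang also offer `__builtin_add_overflow` for the same purpose, at the cost of portability.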

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Niklas Holsti on Tue Sep 3 21:08:50 2024

Niklas Holsti wrote:

On 2024-09-03 20:52, Terje Mathisen wrote:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. May be, there exist old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?

Only if you iterate in order of increasing memory address, which is not
the only possibility.

Obviously so, I really didn't think that needed to be stated. :-(

uint8_t buffer[1000];

memcpy(buffer + 0, buffer + 10, 100);

OK?

This is the memcpy() version which the original 8086 REP MOVSB was
designed for, long before alternative code turned out to be faster in
some circumstances.
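
For comparison, a sketch of the two well-defined ways to get that behaviour in today's C. The helper names are hypothetical. The first copies strictly in increasing address order, as the post describes for a REP MOVSB style copy; the second simply uses memmove(), which is specified to work for any overlap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies n bytes strictly in increasing address
 * order. When dst is below src, as in the example above, overlapping
 * buffers are handled correctly. */
void copy_ascending(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Portable alternative for the same 100-byte shift by 10 positions. */
void shift_down(unsigned char *buffer)
{
    memmove(buffer + 0, buffer + 10, 100);
}
```

Calling memcpy() itself with these overlapping arguments remains undefined under the C standard, whatever a particular library happens to do.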

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Tue Sep 3 15:28:03 2024

My impression - based on hearsay for Rust as I have no experience - is that the key point of Rust is memory "safety". I use scare-quotes here, since it is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Tue Sep 3 15:30:21 2024

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to Terje Mathisen on Tue Sep 3 22:34:42 2024

On 2024-09-03 22:08, Terje Mathisen wrote:

Niklas Holsti wrote:

On 2024-09-03 20:52, Terje Mathisen wrote:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. May be, there exist old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?

Only if you iterate in order of increasing memory address, which is
not the only possibility.

Obviously so, I really didn't think that needed to be stated. :-(

I admit my comment was partly tongue-in-cheek, but if the issue is when
and whether a memcpy() that always copies in increasing address order is useful, it seems that a statement about "iterating over" an array should
also specify the iteration order. Ok, ;-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Stefan Monnier on Tue Sep 3 20:05:14 2024

Stefan Monnier <[email protected]> schrieb:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Really?

I thought Fortran was higher level than C, and you can do a lot
more things in Fortran than in C.

Or rather, Fortran allows you to do things which are possible,
but very cumbersome, in C. Both are Turing complete, after all.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Sep 3 20:20:41 2024

On Tue, 3 Sep 2024 19:28:03 +0000, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is
that
the key point of Rust is memory "safety". I use scare-quotes here,
since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

A higher level language simply makes it HARDER to shoot yourself in the
foot, not easier to express this-crap or that-crap.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Sep 3 20:25:22 2024

On Tue, 3 Sep 2024 20:05:14 +0000, Thomas Koenig wrote:

Stefan Monnier <[email protected]> schrieb:

My impression - based on hearsay for Rust as I have no experience - is
that
the key point of Rust is memory "safety". I use scare-quotes here,
since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Really?

I thought Fortran was higher level than C, and you can do a lot
more things in Fortran than in C.

Fortran has a memory model where if address aliasing occurs it is
the programmer's fault, C has the contrapositive.
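
C can opt in to a similar promise with the `restrict` qualifier (C99 and later). The sketch below is only an illustration with a made-up function; it roughly corresponds to what a Fortran compiler may assume about dummy arguments without being told.

```c
#include <stddef.h>

/* Hypothetical example: `restrict` is a promise by the caller that x
 * and y do not overlap during the call. The compiler may then keep
 * values in registers or vectorize without reloading. Breaking the
 * promise is undefined behaviour. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Without the qualifiers the function is still correct, but the compiler has to allow for `x` and `y` overlapping.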

Given the Fortran library, it is easy to write in C what could be
written in Fortran--mostly because Fortran programmers use their
library instead of trying to circumvent it at every step.

Or rather, Fortran allows you to do things which are possible,
but very cumbersome, in C. Both are Turing complete, after all.

Turing complete does not take memory order into account.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Sep 3 20:22:03 2024

On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

# define int int64_t
..

makes it easier.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Stefan Monnier on Wed Sep 4 01:15:14 2024

On 03/09/2024 21:28, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Agreed.

I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Stephen Fuld on Wed Sep 4 01:14:14 2024

On 03/09/2024 18:54, Stephen Fuld wrote:

On 9/2/2024 11:23 PM, David Brown wrote:

On 02/09/2024 18:46, Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

And also for Rust versus C++ ?

I asked about C versus Rust as Terje explicitly mentioned those two languages, but you make a good point in general.

I want to know about both :-)

In my field, small-systems embedded development, C has been dominant for
a long time, but C++ use is increasing. Most of my new stuff in recent
times has been C++. There are some in the field who are trying out
Rust, so I need to look into it myself - either because it is a better
choice than C++, or because customers might want it.

My impression - based on hearsay for Rust as I have no experience - is
that the key point of Rust is memory "safety". I use scare-quotes
here, since it is simply about correct use of dynamic memory and buffers.

I agree that memory safety is the key point, although I gather that it
has other features that many programmers like.

Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up
compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many
features. But is it overall up or down, for /my/ uses?

Examples of things that I think are good in Rust are making variables
immutable by default and pattern matching. Steps down include lack of
function overloading and limited object oriented support.

There are some things that some people really like about Rust, that I am
far from convinced about - such as package management. I could be misunderstanding (since I don't have the experience), but for /my/ work,
I am very much against anything that encourages an "always get the
latest version" attitude. Stability is much more important to me. (I
dislike the rate at which Rust changes - every two weeks or so for small things, and every couple of years for breaking changes.)

And there are some things that Rust simply gets wrong - such as the
handling of signed integer overflows.

It is entirely possible to have correct use of memory in C, but it is
also very easy to get it wrong - especially if the developer doesn't
use available tools for static and run-time checks. Modern C++, on
the other hand, makes it much easier to get right. You can cause
yourself extra work and risk by using more old-fashioned C++, but
following modern design guides using smart pointers and containers,
along with easily available tools, and you get a lot of the management
of memory handled automatically for very little cost.

Is it fair to say then that Rust makes it harder to get memory
management "wrong"?

I don't know about reality, but that's what the salesmen say.

In modern C++ it's not hard to write code that doesn't leak and doesn't
have out of bounds accesses, but you need to put a bit more effort into
coding to track ownership properly, and not everything is as well
diagnosed at compile time as it could be. There has been progress
towards the equivalent to the Rust borrow checker for C++, I hear.

C++ provides a huge amount more than Rust - when I have looked at
Rust, it is (still) too limited for some of what I want to do.

Can you give a few examples?

As an example, in C++, you can make your own types that are, as far as
I can see, much more expressive and flexible than in Rust, while also
being safe to use. This requires object syntax with support for
multiple constructors, operator overload, function overloads,
public/private separation, and multiple inheritance (at least of methods).

Of course, "with great power comes great responsibility" - C++
provides many exciting ways to write a complete mess :-)

Sure. I gather that templates are very powerful and potentially very useful. On the other hand, I gather that multiple inheritance is very powerful, but difficult to use and potentially very ugly, and has not
been carried forward in the same way into newer languages.

Multiple inheritance can easily get really messy, especially with
polymorphic types where data fields come from different ancestors, and
it gets even more messy with virtual inheritance. I don't think that is
a good solution in more than a very few niche situations.

But multiple inheritance from bases with no data (just methods, types,
static data, constexpr data, etc.) is fine and can be very handy. Non-polymorphic inheritance with data fields is also fine.
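
To make the "bases with no data" case concrete, here is a rough sketch using two method-only base templates. The names `Comparable`, `Describable` and `Version` are invented for the example, and the CRTP style is just one way of doing it.

```cpp
#include <cassert>
#include <string>

// Data-less base: supplies comparison operators in terms of Derived::compare().
template <typename Derived>
struct Comparable {
    friend bool operator==(const Derived& a, const Derived& b) { return a.compare(b) == 0; }
    friend bool operator<(const Derived& a, const Derived& b)  { return a.compare(b) < 0; }
    friend bool operator>(const Derived& a, const Derived& b)  { return a.compare(b) > 0; }
};

// Another data-less base: supplies a formatting helper.
template <typename Derived>
struct Describable {
    std::string describe() const {
        const Derived& self = static_cast<const Derived&>(*this);
        return "[" + self.to_string() + "]";
    }
};

// All data fields live in the final class, so there is no question about
// which ancestor owns what, and no need for virtual inheritance.
class Version : public Comparable<Version>, public Describable<Version> {
public:
    Version(int major_part, int minor_part)
        : major_part_(major_part), minor_part_(minor_part) {}

    int compare(const Version& other) const {
        if (major_part_ != other.major_part_) return major_part_ < other.major_part_ ? -1 : 1;
        if (minor_part_ != other.minor_part_) return minor_part_ < other.minor_part_ ? -1 : 1;
        return 0;
    }
    std::string to_string() const {
        return std::to_string(major_part_) + "." + std::to_string(minor_part_);
    }

private:
    int major_part_;
    int minor_part_;
};

// Small usage check: the operators and describe() all come from the bases.
void usage_example() {
    Version a(1, 2), b(1, 10);
    assert(a < b);
    assert(b.describe() == "[1.10]");
}
```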

snip stuff about the inadequacy of existing Rust versus C++ comparisons.

To my mind, the important question is not "Should we move from C to
Rust?", but "Should we move from bad C to C++, Rust, or simply to good
C practices?".

I understand. This brings up an important issue, that of older versus
newer languages.

A newer language has several advantages. One is that it can take advantage
of what we have learned about language design and usage since the older
language was designed. I can't overstate this. While many
new language features turn out not to be useful, many are.

Absolutely. There are things about newer languages, like Rust, Go, and
Swift, that I like. For example, they are designed with concurrency and
multi-threading from the start, rather than as an add-on. C++, as we know
it today, has grown gradually, and a lot of its complexity is because of
features added on rather than having been part of the original design.

But it seems to me that Rust could have taken more from C++ and been a
more complete rival. That is, it could have taken more of what can be
done in C++, and found more elegant ways to achieve the same effects from
the start.

Another is that it doesn't have to worry about support for "dusty
decks", i.e. the existing base which may conform to an older version of
the language, nor for "dusty brains", that is programmers who learned
the older (i.e. worse) ways and keep generating new code using those
ways. You mention this issue in your comments.

Of course, the counter to that is that new languages have to overcome
the huge "installed base" advantage of existing languages.

Let me be clear. I am not a Rust evangelist. I am just looking for a
way forward that will help us make programming easier and avoid making
some of the same mistakes we have made in the past. Is Rust that? Some
people think so. I just want to understand more.

I am in the same boat. While I like C++ and find it a lot better than
C, I'd be quite happy to drop it for Rust or anything else if I found
they were better.

From David Brown@21:1/5 to Bernd Linsel on Wed Sep 4 01:19:36 2024

On 03/09/2024 19:19, Bernd Linsel wrote:

On 03.09.24 10:10, David Brown wrote:

snip 8< - - - - - - - -

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time could usefully be turned into constraint
violations that must be diagnosed.)

And exactly these are the situations that I'd like to be warned about,
rather than the compiler making up something without telling.

Some of those /are/ warned about by compilers (but I'd rather the
standards said that they were errors). But in general, many can be
handled by good development practice and compiler warnings. Still,
compilers could always get better!

One thing that could make a big difference, I think, is to drop the
compilation model of each translation unit being compiled to a binary
object independently, with only a minimal amount of information for
linking. Link-time optimisation allows for many extra checks, not all
of which are currently implemented AFAIK. For example, it should be
possible to check that external declarations and definitions match up
correctly across modules - that's currently UB and rarely checked.

From David Brown@21:1/5 to Niklas Holsti on Wed Sep 4 01:22:27 2024

On 03/09/2024 20:08, Niklas Holsti wrote:

On 2024-09-03 11:10, David Brown wrote:

[snip]

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time could usefully be turned into constraint
violations that must be diagnosed.)

The problem, as you of course know, is that the "can" in "can be caught
at compile time" depends on the amount and kind of analysis that is done
at compile time -- some cases of UB "can" be caught at compile time but
only by advanced and costly analysis. If the language standard requires
that such things /must/ be detected by the compiler, it can place quite
a burden on the developers of conforming compilers.

Yes. But I am happy to place a bigger burden on compilers if it reduces
the risk of errors for developers.

Of course there must be some balance. But many of the rules are based
on the kind of compiler that could run on a PDP-11 - it's reasonable to
expect more these days.

As I understand it, current C compilers detect UB mostly as a side
effect of the analyses they do for code optimization purposes, which
vary widely between compilers, and so the UB-detections also vary.

This issue (compile-time detection) has now and then been discussed in
the Ada standards group. Given the currently low market penetration of
Ada, the group has been reluctant to require too much of the compilers,
and so the more advanced UB-detecting tools are stand-alone, such as the SPARK tools.

That makes sense for Ada. Given the high market penetration of C and
C++, the balance is different.

And of course if a future C26 (or whatever) standard required more UB
detection for conformity, that would not affect existing C23 or earlier compilers.

From Stephen Fuld@21:1/5 to Terje Mathisen on Tue Sep 3 16:27:32 2024

On 9/3/2024 8:46 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing
away now . . . :-)

IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is
going to do.

The problem is of course that it is harder to get 10x lines of
correct asm than to get 1x lines of correct C.

BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.
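
A small sketch of the "copy into locals" habit described above. The function names are hypothetical. The second version reads the count and the factor once, so later stores through `dst` cannot change them. The demo function shows why a compiler cannot always make that change itself: when the pointers really do alias, the two versions give different results.

```cpp
#include <cassert>

// The compiler must assume that a store to dst[i] may modify *scale or
// *count, so it may have to reload both on every iteration.
void scale_may_alias(int* dst, const int* src, const int* scale, const int* count) {
    for (int i = 0; i < *count; ++i) {
        dst[i] = src[i] * *scale;
    }
}

// Loop invariants copied into locals: it is now obvious, to the reader and
// to the optimiser, that stores to dst[] cannot change them.
void scale_with_locals(int* dst, const int* src, const int* scale, const int* count) {
    const int n = *count;
    const int factor = *scale;
    for (int i = 0; i < n; ++i) {
        dst[i] = src[i] * factor;
    }
}

// Deliberately pass a factor that lives inside the destination array.
void aliasing_demo() {
    int src[3] = {2, 2, 2};
    int count = 3;

    int a[3] = {3, 0, 0};
    scale_may_alias(a, src, &a[0], &count);    // the factor changes mid-loop
    assert(a[0] == 6 && a[1] == 12 && a[2] == 12);

    int b[3] = {3, 0, 0};
    scale_with_locals(b, src, &b[0], &count);  // the factor stays at 3
    assert(b[0] == 6 && b[1] == 6 && b[2] == 6);
}
```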

Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

Q&D programming is still far faster for me in C, but using Rust I don't
have to worry about how well the compiler will be able to optimize my
code, it is pretty much always close to speed of light since the entire aliasing issue goes away.

Rust also gets rid of the horrible external library/configure/cmake mess that kept me from successfully compiling the reference LAStools lidar
code for nearly 10 years.

Using the Rust port I just tell cargo to add it to my project and that's
it.

Thank you. I find it interesting that the main advantage of Rust as
touted by its evangelists, memory safety, didn't make your list.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From MitchAlsup1@21:1/5 to David Brown on Wed Sep 4 01:54:52 2024

On Tue, 3 Sep 2024 23:19:36 +0000, David Brown wrote:

On 03/09/2024 19:19, Bernd Linsel wrote:

On 03.09.24 10:10, David Brown wrote:

snip 8< - - - - - - - -

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time could usefully be turned into constraint
violations that must be diagnosed.)

And exactly these are the situations that I'd like to be warned about,
rather than the compiler making up something without telling.

Some of those /are/ warned about by compilers (but I'd rather the
standards said that they were errors). But in general, many can be
handled by good development practice and compiler warnings. Still,
compilers could always get better!

Something that might be an error in a 32-bit machine may not be
an error in a 36-bit {48, 64, 72} machine.

One thing that could make a big difference, I think, is to drop the compilation model of each translation unit being compiled to a binary
object independently, with only a minimal amount of information for
linking. Link-time optimisation allows for many extra checks, not all
of which are currently implemented AFAIK. For example, it should be
possible to check that external declarations and definitions match up correctly across modules - that's currently UB and rarely checked.

How does one call fprintf() under those rules ??

From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Sep 4 01:49:12 2024

On Sun, 1 Sep 2024 21:02:16 +0000, Paul A. Clayton wrote:

On 8/31/24 4:56 PM, BGB wrote:
[snip]

I was mostly doing dual-issue with a 4R2W design.

Initially, 6R3W won out mostly because 4R2W disallows an indexed
store to be run in parallel with another op; but 6R3W did allow
this.

Stores and MADD allow one register read to be delayed by at least
one cycle. If the following cycle had a free read port, that could
be stolen to complete the store/MADD. This could be viewed as
cracking a three-source operation into a two-source operation and
a one-source operation that reads source operands in a following
cycle except that this operation never uses a result from the
previous cycle.

Stores are allowed to delay the St.Data read until after retirement.
Thus, you are guaranteed that the cache line is present, that the
cache is in a hit state, and that the TLB has translated the address.
And finally, you need no forwarding on that read.

From MitchAlsup1@21:1/5 to BGB on Wed Sep 4 01:57:24 2024

On Tue, 3 Sep 2024 23:23:50 +0000, BGB wrote:

On 9/1/2024 4:02 PM, Paul A. Clayton wrote:

On 8/31/24 4:56 PM, BGB wrote:
[snip]

I was mostly doing dual-issue with a 4R2W design.

Initially, 6R3W won out mostly because 4R2W disallows an indexed store
to be run in parallel with another op; but 6R3W did allow this.

Stores and MADD allow one register read to be delayed by at least
one cycle. If the following cycle had a free read port, that could
be stolen to complete the store/MADD. This could be viewed as
cracking a three-source operation into a two-source operation and
a one-source operation that reads source operands in a following
cycle except that this operation never uses a result from the
previous cycle.

This wouldn't map well to my existing decoder/pipeline, which requires
all the ports (and all the registers) to be available at the time an instruction enters EX1, and currently has no support for "cracking" an instruction over multiple cycles, but may spread a single instruction
across multiple lanes.

Your pipeline is amateur at best.
--------------

But, yeah, if the restriction only applied to indexed store (in the
current implementation, it applies to all stores), it would still be
around 4% of the total instruction stream.

As-is, it is closer to 12%, and causing an extra penalty for 12% of the total-executed instructions was undesirable (but, IMHO, still better
than needing to use multiple instructions).

Delaying ST.data only delays LDs which alias that ST.

From David Brown@21:1/5 to All on Wed Sep 4 08:53:28 2024

On 04/09/2024 03:54, MitchAlsup1 wrote:

On Tue, 3 Sep 2024 23:19:36 +0000, David Brown wrote:

On 03/09/2024 19:19, Bernd Linsel wrote:

On 03.09.24 10:10, David Brown wrote:

snip 8< - - - - - - - -

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time could usefully be turned into constraint
violations that must be diagnosed.)

And exactly these are the situations that I'd like to be warned about,
rather than the compiler making up something without telling.

Some of those /are/ warned about by compilers (but I'd rather the
standards said that they were errors). But in general, many can be
handled by good development practice and compiler warnings. Still,
compilers could always get better!

Something that might be an error in a 32-bit machine may not be
an error in a 36-bit {48, 64, 72} machine.

One thing that could make a big difference, I think, is to drop the
compilation model of each translation unit being compiled to a binary
object independently, with only a minimal amount of information for
linking. Link-time optimisation allows for many extra checks, not all
of which are currently implemented AFAIK. For example, it should be
possible to check that external declarations and definitions match up
correctly across modules - that's currently UB and rarely checked.

How does one call fprintf() under those rules ??

Untyped vararg functions are a big risk factor for programming and are
always difficult for static (or run-time) checking. The best you can do
is limit them to the standard ones (the printf family is very useful),
make sure you are always using declarations from common headers rather
than "home-made" declarations, and use the tools you can (such as gcc
and clang's format attribute checks).
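
As a sketch of the format attribute mentioned above: a home-made printf-style wrapper can opt in to the same argument checking the compiler applies to printf itself. The `PRINTF_LIKE` macro and `format_message` function are hypothetical names; the attribute is the gcc/clang one, and other compilers simply see an empty macro.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Clang also defines __GNUC__, so one test covers both compilers.
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_index, first_arg) \
    __attribute__((format(printf, fmt_index, first_arg)))
#else
#define PRINTF_LIKE(fmt_index, first_arg)
#endif

// Argument 1 is the format string, the variadic arguments start at 2.
std::string format_message(const char* fmt, ...) PRINTF_LIKE(1, 2);

std::string format_message(const char* fmt, ...) {
    char buffer[256];
    va_list args;
    va_start(args, fmt);
    const int written = std::vsnprintf(buffer, sizeof buffer, fmt, args);
    va_end(args);
    if (written < 0) {
        return std::string();   // encoding error
    }
    // vsnprintf truncates to the buffer size and null-terminates the result.
    return std::string(buffer);
}

void usage_example() {
    assert(format_message("%d items", 3) == "3 items");
    // format_message("%d items", "three");  // flagged by -Wformat (part of -Wall)
}
```

This only helps when the format string is a literal the compiler can see, and it does nothing for vararg functions that do not follow printf or scanf conventions.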

There will never be a way to do full automatic checking of code
correctness. But the more mistakes that can be caught automatically,
the better. Modern tools can catch more than older tools, and there is
scope for them to catch even more. (Though it can sometimes be
surprising how difficult it can be to add seemingly obvious warnings to compilers - the way the different analysis and optimisation passes are
divided can mean critical information is lost or too inefficient to track.)

From David Brown@21:1/5 to BGB on Wed Sep 4 09:04:40 2024

On 03/09/2024 20:39, BGB wrote:

On 9/2/2024 8:36 AM, MitchAlsup1 wrote:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.

Was it always UB? Or should it be considered ID until it became UB?

Can you give an example?

memcpy() with overlapping pointers.
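
For reference, a small sketch of the overlapping-copy case. memcpy requires that source and destination do not overlap, while memmove is specified to handle overlap. The `erase_bytes` helper is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Removes `count` bytes at offset `pos` from a buffer holding `len` bytes by
// sliding the tail down. Source and destination overlap here, so memmove is
// the right call; memcpy on overlapping ranges is undefined behaviour.
// Returns the new length.
std::size_t erase_bytes(char* buf, std::size_t len, std::size_t pos, std::size_t count) {
    assert(pos <= len && count <= len - pos);
    std::memmove(buf + pos, buf + pos + count, len - pos - count);
    return len - count;
}
```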

I had just recently discovered that newer versions of GCC will cause
code to break if it is missing a return value in C++ mode.

No, the error in the code caused the code to break. You don't get to
blame the compiler if you write rubbish. You get to /thank/ the
compiler if it has helpfully added an instruction to cause the program
to stop abruptly with a UD2 instruction.

Note that in C, falling off the end of Foo here is fine - it is only if
the caller attempts to use the non-existent return value that there is
UB. Thus in C mode, gcc implements Foo as "ret" (when optimised), and
will only warn you if you enable warnings.

In C++, it is the act of falling off the end of Foo that is UB, thus the
compiler will generate a UD2 (for -O0) or no code at all (when
optimised), and will warn you without requiring options.

So:
int Foo() { }

Will (in theory) cause the program to crash when called (emitting a
'UD2' instruction), except in WSL it seems this doesn't quite work
correctly (the UD2 doesn't result in an immediate crash), and the
program seemingly instead "goes off the rails and crashes at a later
point" (GCC omits the epilog when it does this, and seemingly control
flow then goes into whatever function follows in the binary, crashing
when that function tries to return seemingly by branching to an invalid address or similar).

This was mostly affecting "init" functions in my Verilator test benches...

Well, that, and a more inconsistent variant, where if one declares
struct fields as 8 and 3 bytes and then strncpy's 11 bytes into the
combined field, it may also insert a UD2 and skip emitting the following code.

...

But, yeah, that was annoying...

If your compiler tells you you are doing something stupid, and you
ignore it, I really don't think you can claim "the compiler broke my code".
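
A sketch of the two obvious fixes for an "init" function like the Foo above. The `Bench` struct and function names are invented for the example. With gcc or clang, -Wreturn-type is the warning involved, and -Werror=return-type turns it into a hard error for anyone who would rather not be able to ignore it.

```cpp
#include <cassert>

// Hypothetical test-bench state, standing in for whatever the real
// init functions set up.
struct Bench {
    int cycles = -1;
    bool ready = false;
};

// Option 1: the function has nothing to report, so say so in the type.
void init_bench(Bench& b) {
    b.cycles = 0;
    b.ready = true;
}

// Option 2: keep the int return type and return a value on every path.
int init_bench_checked(Bench& b) {
    if (b.ready) {
        return 1;   // already initialised
    }
    init_bench(b);
    return 0;
}

void usage_example() {
    Bench b;
    assert(init_bench_checked(b) == 0);
    assert(b.ready && b.cycles == 0);
    assert(init_bench_checked(b) == 1);
}
```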

From Terje Mathisen@21:1/5 to David Brown on Wed Sep 4 09:15:19 2024

David Brown wrote:

On 03/09/2024 18:54, Stephen Fuld wrote:

On 9/2/2024 11:23 PM, David Brown wrote:

On 02/09/2024 18:46, Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

And also for Rust versus C++ ?

I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.

I want to know about both :-)

In my field, small-systems embedded development, C has been dominant for
a long time, but C++ use is increasing. Most of my new stuff in recent times has been C++. There are some in the field who are trying out
Rust, so I need to look into it myself - either because it is a better
choice than C++, or because customers might want it.

My impression - based on hearsay for Rust as I have no experience -
is that the key point of Rust is memory "safety". I use
scare-quotes here, since it is simply about correct use of dynamic
memory and buffers.

I agree that memory safety is the key point, although I gather that it
has other features that many programmers like.

Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many features. But is it overall up or down, for /my/ uses?

Examples of things that I think are good in Rust are making variables immutable by default and pattern matching. Steps down include lack of function overloading and limited object oriented support.

There are some things that some people really like about Rust, that I am
far from convinced about - such as package management. I could be misunderstanding (since I don't have the experience), but for /my/ work,
I am very much against anything that encourages an "always get the
latest version" attitude. Stability is much more important to me. (I dislike the rate at which Rust changes - every two weeks or so for small things, and every couple of years for breaking changes.)

That's yet another of the things cargo (the Rust package manager, as
well as lots of other stuff) gets right:

Yes, by default you'll pick up the latest of every package/module you
"cargo add foo" to your project, but then you can edit the resulting text-format configuration file, and lock down exact versions of some or
all of those packages.

This is similar to how we always freeze python packages: Any changes are something we decide to employ.

And there are some things that Rust simply gets wrong - such as the
handling of signed integer overflows.

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back
to standard CPU behavior, i.e. wrapping around with no panics.

You can argue both pro and con here, personally I like the Rust setup
much more than C(++), which will use code that could overflow as an excuse
to elide that code as well as all surrounding/dependent code.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From David Brown@21:1/5 to Stefan Monnier on Wed Sep 4 09:20:10 2024

On 03/09/2024 21:30, Stefan Monnier wrote:

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

That's what I do for a living. And I'm not exactly unique here. If we
charge money for a product with code, and there's a bug in the code,
that is covered by the product's guarantee, just like design faults in
the hardware.

Basically, hitting UB at run-time means a bug in the code because the
program does not do what you intended. And if you hit a bug in the
code, then the behaviour is not what you defined in the code
specifications - it is UB.

As I see it, the task of avoiding UB in general is simply the task of
writing bug-free code. That can definitely be hard, regardless of the language.

But if you are thinking specifically of "popular" UB in C, such as dereferencing null pointers, overflowing signed arithmetic, using
pointers after "free", or accessing arrays out of bounds, then no, I
don't think it is hard at all. Seriously, it is extremely rare that I
have bugs in my code from such UB, even during early development. Maybe
it is the type of code I write (it's a somewhat niche field), or the way
I do my development, but it just is not a problem. (I can have plenty
of other kinds of bugs, of course!)

What /definitely/ does not help is for a language to define incorrect
behaviour in order to say it doesn't have undefined behaviour. A
classic example is defining signed integer overflow as two's complement wrapping. That does not fix any errors in the code - it just guarantees
that the code will produce incorrect answers which can later lead to
nasal daemons, but that it won't launch the nasal daemons immediately.
So your tools can't do as much to help catch the errors (from static
error checking, debuggers or sanitizers), and the compiler can't
generate as efficient results for correct code.

From David Brown@21:1/5 to All on Wed Sep 4 09:29:01 2024

On 03/09/2024 22:22, MitchAlsup1 wrote:

On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

# define int int64_t
..

makes it easier.

That's UB, I believe :-) And it will certainly be confusing.

But good use of size-specific types is helpful to writing correct code.
If your calculations could conceivably overflow 32 bits, int64_t is a
good choice.

For smaller numbers and portable code, you might want int_fast32_t or int_fast16_t, which on most 64-bit systems will be faster than "int".

You can call it /ugly/, but it's not /hard/.
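
A short sketch of what that looks like in code. The function names and the millimetre example are made up; the types are the standard ones from <cstdint>.

```cpp
#include <cassert>
#include <cstdint>

// The product of two 32-bit values always fits in 64 bits, so widening one
// operand before the multiplication rules out signed overflow.
std::int64_t area_mm2(std::int32_t width_mm, std::int32_t height_mm) {
    return static_cast<std::int64_t>(width_mm) * height_mm;
}

// int_fast32_t is at least 32 bits wide; the implementation picks whatever
// width it considers fastest on the target.
std::int_fast32_t sum_digits(std::int_fast32_t value) {
    assert(value >= 0);
    std::int_fast32_t sum = 0;
    while (value > 0) {
        sum += value % 10;
        value /= 10;
    }
    return sum;
}
```

For instance, `area_mm2(100000, 100000)` gives 10000000000, which would not fit in a 32-bit int.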

From Terje Mathisen@21:1/5 to Stephen Fuld on Wed Sep 4 09:26:16 2024

Stephen Fuld wrote:

On 9/3/2024 8:46 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Stephen Fuld wrote:

On 8/31/2024 2:14 PM, MitchAlsup1 wrote:

On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

You compare apples and peaches. Technical specifications for your
pressure vessel result from the physical abilities of the chosen
material, by keeping requirements as vessel border width, geometry etc.,
while compiler writers are free in their search for optimization tricks
that let them shine at SPEC benchmarks.

A pressure vessel may actually be able to contain 2× the pressure it
will be able to contain after 20 years of service due to stress
and strain acting on the base materials.

Then there are 3 kinds of metals {grey, white, yellow} with different
responses to stress and induced strain. There is no analogy in code--
If there were perhaps we would have better code today...

Perhaps an analogy is code written in assembler, versus code written
in C versus code written in something like Ada or Rust. Backing
away now . . . :-)

IMNSHO, code written in asm is generally more safe than code written
in C, because the author knows exactly what each line of code is
going to do.

The problem is of course that it is harder to get 10x lines of
correct asm than to get 1x lines of correct C.

BTW, I am also solidly in the grey hair group here, writing C code
that is very low-level, using explicit local variables for any loop
invariant, copying other stuff into temp vars in order to make it
really obvious that they cannot alias any globals or input/output
parameters.

Anyway, that is all mostly moot since I'm using Rust for this kind of
programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

Q&D programming is still far faster for me in C, but using Rust I don't have to worry about how well the compiler will be able to optimize my code, it is pretty much always close to speed of light since the entire aliasing issue goes away.

Rust also gets rid of the horrible external library/configure/cmake mess that kept me from successfully compiling the reference LAStools lidar
code for nearly 10 years.

Using the Rust port I just tell cargo to add it to my project and that's it.

Thank you. I find it interesting that the main advantage of Rust as
touted by its evangelists, memory safety, didn't make your list.

Possibly because, due to the way I've been writing C(++) code for the
last 40 years, I have almost never been hit by those problems myself?

OTOH, in retrospect I know I have written a lot of code that would not
have survived an experienced attacker, i.e. strcpy()/memcpy()/etc
without explicit checks that the target buffer is large enough.
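
The kind of explicit check meant here can be sketched as follows. The helper name is hypothetical, and real code might prefer to truncate or report the required size rather than simply refuse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies a null-terminated string into dst only if it fits, terminator
// included. Returns false and leaves dst untouched otherwise.
bool copy_string_checked(char* dst, std::size_t dst_size, const char* src) {
    const std::size_t needed = std::strlen(src) + 1;
    if (needed > dst_size) {
        return false;
    }
    std::memcpy(dst, src, needed);
    return true;
}

void usage_example() {
    char name[8] = "";
    assert(copy_string_checked(name, sizeof name, "ntpd"));
    assert(!copy_string_checked(name, sizeof name, "far too long for this"));
    assert(std::strcmp(name, "ntpd") == 0);   // unchanged by the failed copy
}
```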

This is of course fine in the classic "everyone is a friend, all code is
open source, and nobody wants to actively attack us" environment, but
not so much for anything exposed to the Internet.

During my NTP Hackers time we never had memory overruns afair, but we
did get a lot of abuse when DoS attacks were using our by default open debug/monitoring interface to amplify attacks on other systems. This was similar to the classic DNS abuse for the same purpose.

Yes, I do like the Rust memory safety, but it does nothing to prevent
attacks of that type: We had to switch from UDP to TCP for all requests
that could produce outputs larger than the input size.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Anton Ertl@21:1/5 to BGB on Wed Sep 4 08:05:51 2024

BGB <[email protected]> writes:

Otherwise, annoying:
Despite configuring GCC to use RV64G, it builds its C library as RV64GC
and is like "well, close enough".

This may be an artifact of bootstrapping. At some point I built a new
version of gcc for our Alphas. We had machines without the BWX
extensions and machines with the BWX extension.

Of course I built gcc on the fastest machine we had, one with BWX.
And then I found out that the resulting compiler binary would not run
on the machines without BWX.

Ok, so build it again, taking care to configure it to not use BWX in bootstrapping itself. However, somehow libgcc got inherited from the
previous build, so the resulting compiler would run on machines
without BWX, but the binaries it produced would not. My guess is that something similar happened for libgcc in your case.

I did another round of rebuilding, making sure that libgcc was rebuilt
from scratch without BWX. I don't remember all that was involved;
maybe I just did this build on a machine that does not have BWX.

[Risc-V compressed instructions]

Which is annoying because seemingly nearly every instruction has its own
encoding scheme for the immediate fields.

It's designed for easy hardware decoding, so maybe you just need to
discover the ideas behind that and put them into your decoder.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From David Brown@21:1/5 to Terje Mathisen on Wed Sep 4 12:57:21 2024

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

On 03/09/2024 18:54, Stephen Fuld wrote:

On 9/2/2024 11:23 PM, David Brown wrote:

On 02/09/2024 18:46, Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

And also for Rust versus C++ ?

I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.

I want to know about both :-)

In my field, small-systems embedded development, C has been dominant
for a long time, but C++ use is increasing. Most of my new stuff in
recent times has been C++. There are some in the field who are trying
out Rust, so I need to look into it myself - either because it is a
better choice than C++, or because customers might want it.

My impression - based on hearsay for Rust as I have no experience -
is that the key point of Rust is memory "safety". I use
scare-quotes here, since it is simply about correct use of dynamic
memory and buffers.

I agree that memory safety is the key point, although I gather that
it has other features that many programmers like.

Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up
compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many
features. But is it overall up or down, for /my/ uses?

Examples of things that I think are good in Rust are making variables
immutable by default and pattern matching. Steps down include lack of
function overloading and limited object oriented support.

There are some things that some people really like about Rust, that I
am far from convinced about - such as package management. I could be
misunderstanding (since I don't have the experience), but for /my/
work, I am very much against anything that encourages an "always get
the latest version" attitude. Stability is much more important to
me. (I dislike the rate at which Rust changes - every two weeks or so
for small things, and every couple of years for breaking changes.)

That's yet another of the things cargo (the Rust package manager, as
well as lots of other stuff) gets right:

Yes, by default you'll pick up the latest of every package/module you
"cargo add foo" to your project, but then you can edit the resulting text-format configuration file, and lock down exact versions of some or
all of those packages.

OK, that's good. And I presume there is no problem keeping these
versions locally in your git (or other source code system), for when the
old versions are removed from their internet sources.

This is similar to how we always freeze python packages: Any changes are something we decide to employ.

And there are some things that Rust simply gets wrong - such as the
handling of signed integer overflows.

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging), then
your code never makes use of the results of an overflow - thus why is it defined behaviour? It makes no sense. The only time when you would
actually see wrapping in final code is if you hadn't tested it properly,
and then you can be pretty confident that the whole thing will end in
tears when signs change unexpectedly. It would be much more sensible to
leave signed overflow undefined, and let the compiler optimise on that
basis.
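
To make the two positions concrete, here is a minimal C sketch (not taken from any poster's code). It shows a signed addition that tests for overflow before it can happen, next to an unsigned addition whose wrap-around is defined by the language. The function names `checked_add_int` and `wrapping_add_uint` are made up for illustration.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores a + b in *out and returns true when the sum
   fits in an int; returns false and leaves *out untouched otherwise.
   The range test runs before the addition, so the overflowing addition
   (which would be undefined behaviour) is never executed. */
bool checked_add_int(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}

/* Unsigned arithmetic is defined to wrap modulo 2^N, so this needs no check. */
unsigned wrapping_add_uint(unsigned a, unsigned b)
{
    return a + b;
}
```

A caller that treats overflow as a bug can check the return value in every build, instead of relying on a check that only exists in debug builds.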

I'm all in favour of temporarily having checks for overflow (and other
errors) during debugging, but I am sceptical to having distinct
debug/release builds. It encourages people to use debug builds during development, bug hunting and testing, then when all looks good they
switch to release build and deploy it. I prefer a single build, and
enable run-time checks on parts of it if and when necessary.

You can argue both pro and con here, personally I like the Rust setup
much more than C(++) which will use code that could do so as an excuse
to elide that as well as all surrounding/dependent code.

If the compiler can see that code is never run, or that it will have all
gone horribly wrong before the code is reached, I am happy to see it
removed by the compiler. (Where possible - and there are unfortunately
limits to warning abilities - I like the compiler to tell me about it.)
I see no benefit in keeping code in place if it can't be run.

(But I agree that there are pros and cons to many of these things.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to BGB on Wed Sep 4 13:12:08 2024

On 04/09/2024 10:06, BGB wrote:

On 9/4/2024 2:04 AM, David Brown wrote:

On 03/09/2024 20:39, BGB wrote:

On 9/2/2024 8:36 AM, MitchAlsup1 wrote:

On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

George Neuner <[email protected]> schrieb:

I'm not going to argue about whether UB in code is wrong. The
question I have concerns what to do with something that explicitly is
mentioned as UB in some standard N, but was not addressed in previous
standards.

Was it always UB? Or should it be considered ID until it became UB?

Can you give an example?

Memcopy() with overlapping pointers.

I had just recently discovered that newer versions of GCC will cause
code to break if it is missing a return value in C++ mode.

No, the error in the code caused the code to break. You don't get to
blame the compiler if you write rubbish. You get to /thank/ the
compiler if it has helpfully added an instruction to cause the program
to stop abruptly with a UD2 instruction.

Usually the role of the compiler is to make existing code work as it did before, not to cause it to break, even in the face of UB.

No, it is not.

The role of a compiler is to take correct input code and generate
correct output code. And if the compiler is helpful, then as a bonus it
can tell you when you have made mistakes.

It is most certainly /not/ the aim of a compiler to generate exactly the
same garbage out as some other compiler did for some garbage in.

I would have more accepted if it turned it into a compiler error or
similar though (rather than turn it into a runtime crash).

The compiler /does/ generate an error - /if/ you use the compiler properly.

It is an unfortunate thing, IMHO, that gcc is far too lenient to random
crap input by default. The world of C and C++ programming would have
seen vastly fewer bugs if "gcc -Wpedantic -Wall -Wextra -Werror" had
been the default, and expert users turned off specific warnings if they
wanted.

If you choose to use your shiny new power saw without guards, holding
your wood by hand, without reading any instructions, and you lose a hand
- who would you blame? Do you blame the power saw for not being as slow
and weak as your old rusty handsaw that merely scratched you?

Your code was wrong. You should have known it was wrong when you wrote
it. You should have used standard, common free tools that would have
told you it was wrong. You should not have ignored the warnings these
tools gave you even when you didn't use them appropriately. It is not
the fault of the tool.

Note that in C, falling off the end of Foo here is fine - it is only
if the caller attempts to use the non-existent return value that there
is UB. Thus in C mode, gcc implements Foo as "ret" (when optimised),
and will only warn you if you enable warnings.

In C++, it is the act of falling off the end of Foo that is UB, thus
the compiler will generate a UD2 (for -O0) or no code at all (when
optimised), and will warn you without requiring options.
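
As a hedged illustration of the fix being discussed (not BGB's actual testbench code), an init function can either say in its type that it returns nothing, or return a status on every path. The names `init_something`, `init_something_checked` and the `initialised` flag are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical module state, only here so the functions do something. */
static bool initialised = false;

/* Option 1: there is nothing to report, so the return type says so.
   Falling off the end of a void function is fine in both C and C++. */
void init_something(void)
{
    initialised = true;
}

/* Option 2: the caller wants a status, so every path returns one.
   Returns 0 on success and -1 if initialisation was already done. */
int init_something_checked(void)
{
    if (initialised) {
        return -1;
    }
    initialised = true;
    return 0;
}
```

Either form compiles cleanly in both languages and leaves no path where control reaches the closing brace of a non-void function.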

It worked fine in the older instance of WSL running GCC 4.8.0 ("Ubuntu
14"), but sorta exploded when switching to a newer instance of WSL (with "Ubuntu 22")...

But, sometimes got lazy, and did:
int InitSomething()
{
...
}

Without a return, but was an issue when it was unexpectedly crashing
(and the cause was not immediately obvious, and I had not heard that
there had been a behavioral change here).

There has been no behaviour changes in the language or the compiler.
Your code had no defined behaviour before, it has no defined behaviour
now, and that is completely independent of the compiler or version.

Well, also partly because it is traditional to always return 'int' even
when 'void' is technically more correct.

Don't be absurd. That C tradition was outdated before you were born,
and has never existed in C++.

But, in general, coding practices in my Verilator testbenches tends to
be more lax (mostly code thrown together so the Verilog can do its thing
and display its output to a window, and accept user input as needed).

So:
int Foo() { }

Will (in theory) cause the program to crash when called (emitting a
'UD2' instruction), except in WSL it seems this doesn't quite work
correctly (the UD2 doesn't result in an immediate crash), and the
program seemingly instead "goes off the rails and crashes at a later
point" (GCC omits the epilog when it does this, and seemingly control
flow then goes into whatever function follows in the binary, crashing
when that function tries to return seemingly by branching to an
invalid address or similar).

This was mostly affecting "init" functions in my Verilator test
benches...

Well, that, and a more inconsistent variant, where if one declares
struct fields as 8 and 3 bytes and then strncpy's 11 bytes into the
combined field, it may also insert a UD2 and skip emitting the
following code.
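
For the 8-plus-3 byte case, one way to stay inside what the language defines is to copy into each field separately, so that no single call writes past the end of the array it was given. This is only a sketch; the struct `name83` and the function `set_name83` are assumed names, not anything from the code being described.

```c
#include <string.h>

/* Hypothetical 8.3-style name record: two adjacent fixed-size fields. */
struct name83 {
    char base[8];
    char ext[3];
};

/* Copies 11 bytes from src into the record, one bounded copy per field.
   Each memcpy() stays within the array it is given, so the compiler has
   no out-of-bounds write to reason about. */
void set_name83(struct name83 *n, const char *src)
{
    memcpy(n->base, src, sizeof n->base);
    memcpy(n->ext, src + sizeof n->base, sizeof n->ext);
}
```

An alternative with the same effect is to declare one 11-byte array and treat the 8/3 split as a convention when reading it back.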

...

But, yeah, that was annoying...

If your compiler tells you you are doing something stupid, and you
ignore it, I really don't think you can claim "the compiler broke my
code".

It would have been nicer if it crashed in a way where GDB could show me
the point at which the crash was triggered...

Finding bugs is always cheaper (in time, effort and money) the earlier
you find them. Get yourself an editor or IDE that will spot such mistakes
as you type them. Failing that, at least use a compiler with good
warnings, use those warnings, and pay attention to them. Then learn to
use the sanitizer options as the next step. It is a waste of time to
wait until debugging to find such obvious and simple mistakes.

as opposed to just showing "??" followed by a random address (followed
by "can't read from address" or something to this effect).
(with the "-g" option). Where, "bt" and similar didn't work either.

I could tell it wasn't crashing immediately, because if it crashed immediately it would fail at the point the UD2 occurred.

However, in a lot of cases it was carrying on and triggering a storm of
debug prints for a while with often impossible values, before then
crashing (in a way that looked more like a possible stack corruption).

I suspect the latter being due to some weirdness in WSL (I figured about
the "UD2" mostly by trying to recreate the scenarios in "Compiler
Explorer" / "godbolt.org").

Luckily stuff mostly worked after this point, as the missing return
values were mostly limited to initialization functions.

Oddly though, "Compiler Explorer" was showing warnings for the missing
return values, but not in GCC in WSL.

Though, have noted that generally MSVC will warn about them, and in this
case I had usually fixed them, as granted it is still good practice to
return a value (more so if actually used, because "random garbage" isn't usually a particularly useful return value).

But, generally, MSVC will not unexpectedly break things.

gcc did not break anything.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to David Brown on Wed Sep 4 08:53:08 2024

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging), then
your code never makes use of the results of an overflow - thus why is it defined behaviour? It makes no sense. The only time when you would actually see wrapping in final code is if you hadn't tested it properly,
and then you can be pretty confident that the whole thing will end in
tears when signs change unexpectedly. It would be much more sensible to leave signed overflow undefined, and let the compiler optimise on that
basis.

You absolutely do want defined behavior on overflow. There are
algorithms that depend on that. Bakery algorithms for instance.
Unless you think a real life bakery with service tickets
numbering from 1 to 50 either never gets more than 50 customers
in a day or closes after their 50th customer. :)

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to David Brown on Wed Sep 4 08:41:22 2024

On 9/3/24 19:14, David Brown wrote:

Absolutely. There's things about newer languages, like Rust, Go, and
Swift that I like. For example, they are designed with concurrency and multi-threading from the start, rather than an add-on. C++, as we know
it today, has grown gradually, and a lot of its complexity is because of features added on rather than having been part of the original design.

Rust and Go use C/C++ atomics and concurrency model. I think that's
maybe to do with using common compiler back ends. They do try to make/encourage programmers use language constructs that they think
are safe, fool proof, and generalizable (though that's up to debate).

I don't know about Swift. Apple is off in their own alternate
reality. I like some of their hardware. I would get the M4
mac mini but I've owned both an x86 and powerpc mini and dealing
with their tool chains and api's is an absolute nightmare.

Part of the problem with concurrency support is that it is limited
by the imagination and foibles of the language architects. It
sucks that even today you have to resort to assembler to implement
some very basic and fundamental lock-free algorithms that are 30
to 50 years old at least.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Terje Mathisen on Wed Sep 4 09:07:20 2024

Terje Mathisen <[email protected]> writes:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.
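
The post does not spell the test out, so the following is only one plausible reading, sketched in C with made-up names (`record`, `compact_records`, `copy_within_array`). When whole records are copied one at a time, source and destination are distinct array elements and cannot overlap. For a block copy inside one array, comparing the two ranges decides between memcpy() and memmove().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record type with a "deleted" marker. */
typedef struct {
    int deleted;
    int value;
} record;

/* Packs the live records at the front of the array and returns how many
   there are. recs[w] and recs[r] are different elements whenever w != r,
   so the two regions never overlap and memcpy() is safe. */
size_t compact_records(record *recs, size_t count)
{
    size_t w = 0;
    for (size_t r = 0; r < count; r++) {
        if (recs[r].deleted) {
            continue;
        }
        if (w != r) {
            memcpy(&recs[w], &recs[r], sizeof recs[r]);
        }
        w++;
    }
    return w;
}

/* Block copy where dst and src both point into the SAME array (that is
   what makes the pointer comparisons valid). Disjoint ranges use
   memcpy(); overlapping ranges fall back to memmove(). */
void copy_within_array(char *dst, const char *src, size_t n)
{
    if (dst + n <= src || src + n <= dst) {
        memcpy(dst, src, n);
    } else {
        memmove(dst, src, n);
    }
}
```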

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to David Brown on Wed Sep 4 16:32:37 2024

On Wed, 4 Sep 2024 7:29:01 +0000, David Brown wrote:

On 03/09/2024 22:22, MitchAlsup1 wrote:

On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

# define int int64_t
..

makes it easier.

That's UB, I believe :-) And it will certainly be confusing.

On 64-bit machines it re-establishes the dusty-deck old K&R C where
int was the fastest integer type.

But good use of size-specific types is helpful to writing correct code.
If your calculations could conceivably overflow 32 bits, int64_t is a
good choice.

For smaller numbers and portable code, you might want int_fast32_t or int_fast16_t, which on most 64-bit systems will be faster than "int".
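
A small sketch of what choosing size-specific types looks like in practice, instead of redefining `int`. The functions `dot_product` and `count_negative` are invented for this example. The first widens to `int64_t` before multiplying so that a product of two 32-bit values cannot overflow; the second uses `int_fast32_t` for a counter that needs at least 32 bits but has no exact-width requirement.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of element-wise products. Each operand is widened to 64 bits before
   the multiplication, so a single product always fits. The running sum can
   still overflow for very long inputs; that limit is not checked here. */
int64_t dot_product(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int64_t)a[i] * b[i];
    }
    return sum;
}

/* Counts negative entries. int_fast32_t is whatever type of at least
   32 bits the implementation considers fastest. */
int_fast32_t count_negative(const int32_t *v, size_t n)
{
    int_fast32_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0) {
            count++;
        }
    }
    return count;
}
```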

You can call it /ugly/, but it's not /hard/.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to jseigh on Wed Sep 4 18:34:29 2024

On 04/09/2024 14:53, jseigh wrote:

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.

You absolutely do want defined behavior on overflow.

No, you absolutely do /not/ want that - for the vast majority of use-cases.

There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.
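
As an illustration of handling the wrapping case with unsigned integers, here is a minimal ticket-counter sketch in C. It is not the full bakery algorithm and has no synchronisation; it only shows that counters which wrap modulo 256 still give the right queue length. The type `ticket_counter` and both function names are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical ticket dispenser with deliberately small 8-bit counters,
   so that wrap-around happens quickly. */
typedef struct {
    uint8_t next_ticket;  /* number handed to the next customer */
    uint8_t now_serving;  /* number currently being served */
} ticket_counter;

/* Hands out the current ticket number and advances the counter.
   The result is converted back to uint8_t, which wraps 255 to 0. */
uint8_t take_ticket(ticket_counter *c)
{
    return c->next_ticket++;
}

/* Number of customers holding tickets not yet served. The subtraction is
   reduced modulo 256 by the cast, so it stays correct across a wrap as
   long as fewer than 256 customers are waiting. */
uint8_t customers_waiting(const ticket_counter *c)
{
    return (uint8_t)(c->next_ticket - c->now_serving);
}
```

Starting both counters at 250 and taking ten tickets leaves `next_ticket` at 4, and `customers_waiting` still reports 10.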

You can't use signed integers for them in C (except of course if you use explicit modulo and none of your intermediary results overflow int),
because signed integer overflow is UB. You can't use signed integers
for the purpose in Rust either, even though it is defined behaviour in
release mode, because it is a run-time error in debug mode. (That's why
Rust's attitude here is completely daft to me.)

There are
algorithms that depend on that. Bakery algorithms for instance.
Unless you think a real life bakery with service tickets
numbering from 1 to 50 either never gets more than 50 customers
in a day or closes after their 50th customer. :)

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Tim Rentsch on Wed Sep 4 19:53:13 2024

On 04/09/2024 18:07, Tim Rentsch wrote:

Terje Mathisen <[email protected]> writes:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed. So you might as well
just call memmove() any time you are not sure memcpy() is safe and
appropriate.
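
A short C sketch of the kind of overlapping copy where memmove() is the right call. The helpers `erase_bytes` and `open_gap` are hypothetical names; both shift data within one buffer, once towards lower addresses and once towards higher ones.

```c
#include <stddef.h>
#include <string.h>

/* Removes `count` bytes starting at `pos` from a buffer holding `len`
   bytes and returns the new length. Source and destination overlap
   whenever the tail is longer than the removed span. */
size_t erase_bytes(char *buf, size_t len, size_t pos, size_t count)
{
    if (pos >= len) {
        return len;
    }
    if (count > len - pos) {
        count = len - pos;
    }
    memmove(buf + pos, buf + pos + count, len - pos - count);
    return len - count;
}

/* Opens a gap of `count` bytes at `pos` by shifting the tail upwards.
   The caller must ensure pos <= len and that the buffer has room for
   len + count bytes. A simple forward byte copy would corrupt the data
   here; memmove() handles the direction itself. */
void open_gap(char *buf, size_t len, size_t pos, size_t count)
{
    memmove(buf + pos + count, buf + pos, len - pos);
}
```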

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to David Brown on Wed Sep 4 17:25:44 2024

David Brown <[email protected]> schrieb:

I'm all in favour of temporarily having checks for overflow (and other errors) during debugging, but I am sceptical to having distinct
debug/release builds. It encourages people to use debug builds during development, bug hunting and testing, then when all looks good they
switch to release build and deploy it. I prefer a single build, and
enable run-time checks on parts of it if and when necessary.

Wise man once said...

# It is absurd to make elaborate security checks on debugging runs,
# when no trust is put in the results, and then remove them in
# production runs, when an erroneous result could be expensive or
# disastrous. What would we think of a sailing enthusiast who wears
# his lifejacket when training on dry land, but takes it off as soon
# as he goes to sea?

(C.A.R. Hoare, in "Hints on Programming Language Design")

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to David Brown on Wed Sep 4 18:13:17 2024

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

Terje Mathisen <[email protected]> writes:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read that
behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983".
So, two people of different age living in different parts of the
world are telling the same story. Maybe there exists an old popular
book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime doc?

Specifying that it would always copy from beginning to end of the
source buffer, in increasing address order meant that it was
guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of them
marked as deleted. Iterating over them while removing the gaps means
that you are always copying to a destination lower in memory, right?

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed. So you might as well
just call memmove() any time you are not sure memcpy() is safe and appropriate.

Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs it
saves everyone long term grief.

When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.
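
What such a hand-written loop might look like is sketched below. This is a guess at the intent, not code from the post: a strictly forward byte copy whose behaviour with overlapping buffers is part of its documented contract. The name `copy_forward` is made up.

```c
#include <stddef.h>

/* Strictly forward, byte-by-byte copy. Unlike memcpy(), overlap is well
   defined here because the copy order is fixed by the loop.
   NEFARIOUS USE: when dst lies a few bytes above src inside the same
   buffer, earlier writes are re-read by later iterations, so the bytes
   between src and dst are replicated as a repeating pattern. */
void copy_forward(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

Calling `copy_forward(buf + 2, buf, 4)` on a six-byte buffer that starts with "ab" fills it with "ababab", which neither memcpy() nor memmove() is specified to do.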

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Brett@21:1/5 to David Brown on Wed Sep 4 20:15:24 2024

David Brown <[email protected]> wrote:

On 03/09/2024 21:28, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Agreed.

I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.

Wrong, the last version of Swift added all the garbage programming concepts that one should avoid.

You have to give people the tools to do anything.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Brett@21:1/5 to [email protected] on Wed Sep 4 20:58:32 2024

MitchAlsup1 <[email protected]> wrote:

On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:

David Brown <[email protected]> wrote:

On 03/09/2024 21:28, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Agreed.

I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.

Wrong, the last version of Swift added all the garbage programming
concepts that one should avoid.

You have to give people the tools to do anything.

It is impossible to create a computer programming language where
the programmer cannot shoot himself in the foot.

https://www-users.york.ac.uk/~ss44/joke/foot.htm

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Brett on Wed Sep 4 20:18:55 2024

On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:

David Brown <[email protected]> wrote:

On 03/09/2024 21:28, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Agreed.

I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.

Wrong, the last version of Swift added all the garbage programming
concepts that one should avoid.

You have to give people the tools to do anything.

It is impossible to create a computer programming language where
the programmer cannot shoot himself in the foot.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Wed Sep 4 23:53:58 2024

On Wed, 4 Sep 2024 17:25:44 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

David Brown <[email protected]> schrieb:

I'm all in favour of temporarily having checks for overflow (and
other errors) during debugging, but I am sceptical to having
distinct debug/release builds. It encourages people to use debug
builds during development, bug hunting and testing, then when all
looks good they switch to release build and deploy it. I prefer a
single build, and enable run-time checks on parts of it if and when necessary.

Wise man once said...

# It is absurd to make elaborate security checks on debugging runs,
# when no trust is put in the results, and then remove them in
# production runs, when an erroneous result could be expensive or
# disastrous. What would we think of a sailing enthusiast who wears
# his lifejacket when training on dry land, but takes it off as soon
# as he goes to sea?

(C.A.R. Hoare, in "Hints on Programming Language Design")

Wise man was wrong.
Range checks are not similar to life jackets. They do not turn an incorrect
program into a correct one.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Brett@21:1/5 to David Brown on Wed Sep 4 20:31:24 2024

David Brown <[email protected]> wrote:

On 04/09/2024 14:53, jseigh wrote:

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.

You absolutely do want defined behavior on overflow.

No, you absolutely do /not/ want that - for the vast majority of use-cases.

There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.

I tried using unsigned for a bunch of my data types that should never go negative, but every time I would have to compare them with an int somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.

Complete and udder disaster, went back to plain sized ints.

You can't use signed integers for them in C (except of course if you use explicit modulo and none of your intermediary results overflow int),
because signed integer overflow is UB. You can't use signed integers
for the purpose in Rust either, even though it is defined behaviour in release mode, because it is a run-time error in debug mode. (That's why Rust's attitude here is completely daft to me.)

There are
algorithms that depend on that. Bakery algorithms for instance.
Unless you think a real life bakery with service tickets
numbering from 1 to 50 either never gets more than 50 customers
in a day or closes after their 50th customer. :)

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Brett on Wed Sep 4 20:59:07 2024

Brett <[email protected]> writes:

David Brown <[email protected]> wrote:

On 04/09/2024 14:53, jseigh wrote:

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.

You absolutely do want defined behavior on overflow.

No, you absolutely do /not/ want that - for the vast majority of use-cases.

There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.

I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.

We use it exclusively for datatypes in the domain [0, 2**n). It's always compared against other unsigned variables or constants. Works quite well. Safer and cleaner than willy-nilly using int.

This is in a multi-million line C++ application.
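
For readers wondering why the mixed comparison draws a warning at all, here is a minimal C sketch. The function names `naive_less` and `safe_less` are invented; the cast in `naive_less` writes out explicitly what the usual arithmetic conversions do when an `int` meets an `unsigned`.

```c
#include <stdbool.h>

/* What "a < b" effectively computes for int a and unsigned b: the signed
   operand is converted to unsigned first, so -1 becomes UINT_MAX and the
   comparison gives the mathematically wrong answer. */
bool naive_less(int a, unsigned b)
{
    return (unsigned)a < b;
}

/* A mathematically correct mixed comparison: any negative value is less
   than every unsigned value; otherwise the conversion is value-preserving. */
bool safe_less(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}
```

Keeping a value unsigned everywhere it is compared, as described above, avoids the need for a helper like `safe_less` in the first place.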

Complete and udder disaster, went back to plain sized ints.

s/udder/utter/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Scott Lurndal on Wed Sep 4 21:02:35 2024

Scott Lurndal <[email protected]> schrieb:

[unsigned]

We use it exclusively for datatypes in the domain [0, 2**n). It's always compared against other unsigned variables or constants. Works quite well. Safer and cleaner than willy-nilly using int.

The proposal for adding an unsigned data type to Fortran, which
I initiated and which I am currently implementing for gfortran,
does exactly that - no comparisons of signed vs. unsigned without
explicit conversion (and no arithmetic either).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Brett on Wed Sep 4 21:22:34 2024

On Wed, 4 Sep 2024 20:31:24 +0000, Brett wrote:

David Brown <[email protected]> wrote:

On 04/09/2024 14:53, jseigh wrote:

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.

You absolutely do want defined behavior on overflow.

No, you absolutely do /not/ want that - for the vast majority of
use-cases.

There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.

I tried using unsigned for a bunch of my data types that should never go negative, but every time I would have to compare them with an int
somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.

For the last 25 years I have used nothing but unsigned (other than
places
where the interface standard passes an int argument or returns an int
result or I explicitly expect a negative number.) It has worked
fabulously
well for me.

I would LIKE a compiler warning if it sees::

for( int i = positive; i < something_positive; i++ )

The warning being:: "signed loop variable should be unsigned."

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Wed Sep 4 22:50:18 2024

On Wed, 4 Sep 2024 22:11:47 +0000, BGB wrote:

On 9/4/2024 3:18 PM, MitchAlsup1 wrote:

On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:

David Brown <[email protected]> wrote:

On 03/09/2024 21:28, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Agreed.

I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.

Wrong, the last version of Swift added all the garbage programming
concepts
that one should avoid.

You have to give people the tools to do anything.

It is impossible to create a computer programming language where
the programmer cannot shoot himself in the foot.

A language could alternatively try to go in a direction like HolyC:
Take C:
Remove most advanced features;
Add some weird syntax tweaks;
Make all the types explicit sized.

Some of it is almost half tempting, except that I would probably make
the type-names lower-case to match with my existing usage (and save
needing to hit SHIFT as often).

Say:
u0: void
u1: _Bool
u8: unsigned char
u16: unsigned short
...
i16/s16: signed short
i32/s32: signed int
i64/s64: signed long long

I suspect that My 66000 is the only current ISA that efficiently
supports::
u7:
u11:
u15:
u21:
s47:
s19:
..

f32: float
f64: double
m32: opaque 32-bit type
m64: opaque 64-bit type
m128: opaque 128-bit type

....

Then, say:
u0 foo(args...)
{
...
}

Where, args is exposed as an array of u32 or u64 depending on the target architecture.

....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Thu Sep 5 00:56:22 2024

On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

On 9/4/2024 3:59 PM, Scott Lurndal wrote:

Say:
long z;
int x, y;
...
z=x*y;
Would auto-promote to long before the multiply.

I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to All on Thu Sep 5 09:54:56 2024

On 05/09/2024 02:56, MitchAlsup1 wrote:

On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

On 9/4/2024 3:59 PM, Scott Lurndal wrote:

Say:
   long z;
   int x, y;
   ...
   z=x*y;
Would auto-promote to long before the multiply.

I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.

You snipped rather unfortunately here - it makes it look like this was
code that Scott wrote, and you've removed essential context by BGB.

While I agree it is an example of the kind of code that people sometimes
write when they don't understand C arithmetic, I don't think it is
C-specific. I can't think of any language off-hand where expressions
are evaluated differently depending on types used further out in the expression. Can you give any examples of languages where the equivalent
code would either do the multiplication as "long", or give an error so
that the programmer would be informed of their error?
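
For concreteness, a minimal C sketch of the case being discussed. It uses fixed-width types so the outcome does not depend on how wide "long" is on a given platform, and it assumes the common case where "int" is 32 bits. The helper names mul_widened and product_fits_int32 are illustrative only and do not come from anyone's post. The overflowing 32-bit multiplication is deliberately never executed, since that would be undefined behaviour.

```c
#include <stdint.h>

/* Hypothetical helper: widen one operand *before* multiplying, so the
   multiplication itself is carried out in 64 bits. The other operand
   is then converted to int64_t as well. */
int64_t mul_widened(int32_t x, int32_t y)
{
    return (int64_t)x * y;
}

/* Hypothetical helper: report whether the true product of x and y is
   representable in 32 bits, i.e. whether a plain `x * y` with 32-bit
   int operands would have been safe to evaluate. */
int product_fits_int32(int32_t x, int32_t y)
{
    int64_t wide = mul_widened(x, y);
    return wide >= INT32_MIN && wide <= INT32_MAX;
}
```

With x = y = 100000 the widened product is 10000000000, which does not fit in 32 bits; writing `z = x * y` with int operands would not help, because the type of z plays no part in how the right-hand side is evaluated.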

(I don't count personal one-person languages here. They are very rarely formally or accurately specified.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Brett on Thu Sep 5 09:43:37 2024

On 04/09/2024 22:31, Brett wrote:

David Brown <[email protected]> wrote:

On 04/09/2024 14:53, jseigh wrote:

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.

You absolutely do want defined behavior on overflow.

No, you absolutely do /not/ want that - for the vast majority of use-cases.

There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.

I tried using unsigned for a bunch of my data types that should never go negative, but every time I would have to compare them with an int somewhere and that would cause a compiler warning, because the goal was to also
remove unsafe code.

Complete and udder disaster, went back to plain sized ints.

That's a matter of choice in the warnings you pick and the style you use
- these should match.

However, I don't think C's integer promotion rules are ideal in regard
to mixing signed and unsigned arithmetic - converting both to "unsigned"
can easily lead to trouble.
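
A small sketch of the kind of trouble meant here, assuming the usual case where "int" and "unsigned" have the same width. The function names are made up for illustration. In the naive version the int operand is converted to unsigned before the comparison, so -1 turns into a very large value; gcc can flag that line with -Wsign-compare.

```c
#include <stdbool.h>

/* Naive mixed comparison: `a` is converted to unsigned first, so a
   negative `a` compares as a huge positive number. */
bool less_naive(int a, unsigned b)
{
    return a < b;
}

/* One possible explicit alternative: deal with the negative case
   first, then compare as unsigned, where both values are exactly
   representable. */
bool less_checked(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}
```

less_naive(-1, 1u) yields false even though -1 is mathematically less than 1, while less_checked(-1, 1u) yields true.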

Some people recommend using unsigned int everywhere you can, because the overflow behaviour is defined - I think that is simply wrong. Use
unsigned int where it is appropriate, but it is very rare (though it
happens sometimes) that you want any arithmetic to overflow in any way.
So the justification is wrong.

Some people like to use unsigned int when the values will not be
negative. I don't think that is a good idea either. In general, for
any given use you only need a limited range of values. 0 to 10000 is
just as much a subset of "int" as "unsigned int", and using "unsigned
int" does not give any advantages. On the contrary, using "int" can
give more efficient code in many places, and lets you enable warnings
about mixed unsigned / signed operations for when you actually want them.

Unsigned types are ideal for "raw" memory access or external data, for
anything involving bit manipulation (use of &, |, ^, << and >> on signed
types is usually wrong, IMHO), as building blocks in extended arithmetic
types, for the few occasions when you want two's complement wrapping,
and for the even fewer occasions when you actually need that last bit of
range.
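
One of the few cases where wrapping is actually wanted, as a minimal sketch: the difference between two readings of a free-running 32-bit tick counter. The function name ticks_elapsed and the idea of a tick counter are assumptions chosen for illustration, not something from this thread.

```c
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 32-bit counter.
   The result is reduced modulo 2^32 when it is converted to uint32_t,
   so it is still right if the counter wrapped around once between
   `start` and `now`. */
uint32_t ticks_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}
```

For example, with start = 0xFFFFFFF0 and now = 0x10 the counter has wrapped, and the function returns 0x20.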

It would be nice if C had subrange types like Pascal or Ada, but it does
not. Usually int - or sized ints - are the practical choice.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to David Brown on Thu Sep 5 11:51:32 2024

On 2024-09-05 10:54, David Brown wrote:

On 05/09/2024 02:56, MitchAlsup1 wrote:

On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

On 9/4/2024 3:59 PM, Scott Lurndal wrote:

Say:
   long z;
   int x, y;
   ...
   z=x*y;
Would auto-promote to long before the multiply.

I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.

You snipped rather unfortunately here - it makes it look like this was
code that Scott wrote, and you've removed essential context by BGB.

While I agree it is an example of the kind of code that people sometimes write when they don't understand C arithmetic, I don't think it is C-specific. I can't think of any language off-hand where expressions
are evaluated differently depending on types used further out in the expression. Can you give any examples of languages where the equivalent code would either do the multiplication as "long", or give an error so
that the programmer would be informed of their error?

The Ada language can work in both ways. If you just have:

z : Long_Integer; -- Not a standard Ada type, but often provided.
x, y : Integer;
...
z := x * y;

the compiler will inform you that the types in the assignment do not
match: using the standard (predefined) operator "*", the product of two Integers gives an Integer, not a Long_Integer. If you add this
definition to the code:

function "*" (Left, Right : Integer) return Long_Integer
is (Long_Integer(Left) * Long_Integer(Right));

the compiler sees that there is now /also/ an Integer * Integer =>
Long_Integer multiplication operator, and uses that. Function
overloading in Ada can depend on the type expected of the result.

Perhaps you asked for a language that worked like this "out of the box", without the programmer having to add things like the "*" function above,
and then Ada would not qualify on the second alternative (automatic
lengthening before multiplication, depending on the result type desired).

(I don't count personal one-person languages here.

While Ada has low market penetration, I don't think it quite qualifies
as a one-person language -- yet :-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to David Brown on Thu Sep 5 11:12:04 2024

David Brown wrote:

On 04/09/2024 22:31, Brett wrote:

David Brown <[email protected]> wrote:

On 04/09/2024 14:53, jseigh wrote:

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.

You absolutely do want defined behavior on overflow.

No, you absolutely do /not/ want that - for the vast majority of
use-cases.

There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.

I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int
somewhere
and that would cause a compiler warning, because the goal was to also
remove unsafe code.

Complete and udder disaster, went back to plain sized ints.

That's a matter of choice in the warnings you pick and the style you use
- these should match.

However, I don't think C's integer promotion rules are ideal in regard
to mixing signed and unsigned arithmetic - converting both to "unsigned"
can easily lead to trouble.

Some people recommend using unsigned int everywhere you can, because the overflow behaviour is defined - I think that is simply wrong. Use
unsigned int where it is appropriate, but it is very rare (though it
happens sometimes) that you want any arithmetic to overflow in any way.
So the justification is wrong.

Some people like to use unsigned int when the values will not be
negative. I don't think that is a good idea either. In general, for
any given use you only need a limited range of values. 0 to 10000 is
just as much a subset of "int" as "unsigned int", and using "unsigned
int" does not give any advantages. On the contrary, using "int" can
give more efficient code in many places, and lets you enable warnings
about mixed unsigned / signed operations for when you actually want them.

Unsigned types are ideal for "raw" memory access or external data, for anything involving bit manipulation (use of &, |, ^, << and >> on signed types is usually wrong, IMHO), as building blocks in extended arithmetic types, for the few occasions when you want two's complement wrapping,
and for the even fewer occasions when you actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.

It would be nice if C had subrange types like Pascal or Ada, but it does not. Usually int - or sized ints - are the practical choice.

Agreed 100%

I wrote enough Pascal with ranged types that I got used to it, and found
that I was missing the feature when I used C.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to [email protected] on Thu Sep 5 13:15:00 2024

In article <[email protected]>, [email protected] (MitchAlsup1) wrote:

I suspect that My 66000 is the only current ISA that efficiently
supports::
u7:
u11:
u15:
u21:
s47:
s19:

Concertina II has them on the way...

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Thu Sep 5 15:08:44 2024

On 04/09/2024 22:53, Michael S wrote:

On Wed, 4 Sep 2024 17:25:44 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

David Brown <[email protected]> schrieb:

I'm all in favour of temporarily having checks for overflow (and
other errors) during debugging, but I am sceptical to having
distinct debug/release builds. It encourages people to use debug
builds during development, bug hunting and testing, then when all
looks good they switch to release build and deploy it. I prefer a
single build, and enable run-time checks on parts of it if and when
necessary.

Wise man once said...

# It is absurd to make elaborate security checks on debugging runs,
# when no trust is put in the results, and then remove them in
# production runs, when an erroneous result could be expensive or
# disastrous. What would we think of a sailing enthusiast who wears
# his lifejacket when training on dry land, but takes it off as soon
# as he goes to sea?

(C.A.R. Hoare, in "Hints on Programming Language Design")

Wise man was wrong.
Range checks are not similar to life jackets. They do not turn an incorrect program into a correct one.

Wise man was right. Range checks are not intended to turn incorrect
programs into correct ones - they are for damage mitigation. Life
jackets don't stop you falling overboard, they stop you drowning if you
/do/ fall overboard. The context of the quotation was "security
checks", which is different from debugging and fault-finding.
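
As an illustration of a check that stays in the deployed build and limits damage rather than fixing the bug, here is a minimal sketch of an overflow-refusing addition. The name checked_add and its interface are hypothetical; the test is done before the addition so that no signed overflow is ever evaluated.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: add two ints, refusing to overflow.
   Returns true and stores the sum in *out on success; returns false
   and leaves *out untouched if the mathematical sum does not fit. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return false;   /* would overflow: report it, don't compute it */
    }
    *out = a + b;
    return true;
}
```

The caller still has to decide what to do when the function returns false; the check only guarantees that a wrapped or otherwise bogus value is never silently used.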

For some kinds of software, you have to think about what can go wrong
outside the context of software bugs, and what can be done about it -
such as damage limitation. There can be external effects such as
malicious or accidental corruption of data, hardware failures, etc.
These are outside the scope of C, and need special treatment such as
using "volatile" to inform the compiler that something has observable behaviour, or using inline assembly or intrinsic functions for fine
control. And you have to accept that usually, there is no way to handle
these things entirely in software.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to David Brown on Thu Sep 5 11:31:02 2024

David Brown <[email protected]> writes:

Anton writes code that seriously pushes the boundary of what can be
achieved. For at least some of the things he does (such as GForth) he
is trying to squeeze every last drop of speed out of the target. And he
is /really/ good at it. But that means he is forever relying on nuances
about code generation. His code, at least for efficiency if not for
correctness, is dependent on details far beyond what is specified and
documented for C and for the gcc compiler. He might spend a long time
working with his code and a version of gcc, fine-tuning the details of
his source code to get out exactly the assembly he wants from the
compiler.

No. We distribute Gforth as source code. It works for a wide variety
of architectures and compilers. So unlike what you suggest and what
some people have suggested earlier to avoid problems with new
"optimizations" in newer releases of gcc, we don't concentrate on a
specific version of gcc.

Of course it is frustrating for him when the next version of
gcc generates very different assembly from that same source, but he is
not really programming at the level of C, and he should not expect
consistency from C compilers like he does.

It's normal and no problem when the next version of gcc generates
different assembly language. There are some basic assumptions that
our code relies on, and that mostly does not change between gcc
versions.

An essential assumption is that, when we have:

A:
C code
B:

... that when we do &&A and &&B (which is documented in the GNU C
manual), we get the addresses pointing to the start and end of the
machine code corresponding to the C code. In the days starting with
gcc-3.0, we found that gcc started reordering the basic blocks within
loops, so we moved loops in the part of the code that needs such
assumptions into separate functions. Around gcc-7, gcc started to
compile

A: C-code1
B: C-code2
C: goto *...

to the same code as

A: C-code1; C-code2; goto *...;
B: C-code2; goto *...;
C: goto *...;

I found a workaround that avoids this kind of code generation.
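
For readers who have not met the GNU C extension involved, here is a toy sketch of labels-as-values (&&label) and computed goto (goto *). It is not Gforth code and it does not copy machine code between labels; it only shows the dispatch style in which every label's address is taken and each snippet ends in its own "goto *". The opcode encoding and the name run_threaded are invented for this example, and it needs gcc or clang.

```c
/* Requires GNU C (gcc or clang): uses the labels-as-values extension. */

/* Toy interpreter. Opcodes: 0 = increment the accumulator,
   1 = double it, 2 = stop and return it. Each opcode indexes a table
   of label addresses, and every snippet ends with its own indirect
   jump to the next snippet. */
int run_threaded(const unsigned char *code)
{
    static void *dispatch[] = { &&op_inc, &&op_dbl, &&op_halt };
    int acc = 0;

    goto *dispatch[*code++];

op_inc:
    acc += 1;
    goto *dispatch[*code++];
op_dbl:
    acc *= 2;
    goto *dispatch[*code++];
op_halt:
    return acc;
}
```

Running the program {0, 0, 1, 2} increments twice, doubles, and stops, so it returns 4. The caller must make sure the program ends with opcode 2 and contains no other values, since the sketch does no bounds checking.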

Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
that gcc compiled

goto *ca;

into the equivalent of

goto gotoca;

/* and elsewhere */
gotoca: goto *ca;

We reported that repeatedly. At one point a gcc maintainer gave us
some bullshit about a possible performance advantage from this
transformation, of course without presenting any empirical support,
while we saw a big slowdown on our code. We developed workarounds for
that, and they are in Gforth to this day, even though we have not
encountered a new gcc version with this problem for over a decade, but
new Gforth should also work on old gcc.

Another assumption is that when we concatenate the code snippet
between label A and B (which contains C-code1) and the code snippet
between label X and Y (which contains C-code3), executing the result
will behave like the concatenation of C-code1 and C-code3 in source
code. This assumption has two aspects:

1) Do the register assignments at the labels fit together. It turns
out that we never had a problem with that, and I think that the
reason for that is that the "goto *" can jump to any of those
labels (all their addresses are taken), and so the register
assignment must be the same right after each label.

What guarantees that the assignments are the same right before each
label? Probably that after the label, there is not much between
the label and the next goto*, and that makes all registers at
potential targets live.

2) If we have two pieces of machine code produced in this fashion,
does the architecture guarantee that such a concatenation works?
It turns out that in general-purpose architectures, all-but-one do.
That includes IA-64. The exception is MIPS with its architectural
load-delay slot (and there are also scheduling restrictions having
to do with the hilo register that may be problematic): the first
code snippet may end in a load, and the next code snippet may start
with an instruction that reads the result of the load. So we just
disabled this concatenation on MIPS.

We do a number of things to achieve stability: We do sanity-checking
on the resulting machine code snippets and fall back to plain threaded
code if the snippets turn out not to be relocatable.

Also, we enable all the flags for defining behaviour in gcc that we
find (unfortunately, in the documentation they are intermixed with
other options). For good measure, this includes -fno-delete-null-pointer-checks, although I doubt that it makes a
difference for our code either way.

One thing that came up about a year ago was that gcc auto-vectorizes
adjacent memory accesses on AMD64 (apparently the AMD64 port
maintainers are unhappy because AMD64 does not have instructions like
ARM A64's ldp and stp:-), which did not impact correctness, but led to
worse performance (not just in Gforth; I have also seen it in the
bubble benchmark from John Hennessy's Stanford small integer
benchmarks; I'm sure there is some SPEC benchmark that benefits). A
quick addition of -fno-tree-vectorize fixed that.

We have been thinking about moving from C to a better-defined
language, namely assembly language, but have not yet taken the plunge,
and it has not been necessary yet. Gcc has not been as crazy in our
experience as the UB rhetoric might make one think. Why is that? I
think the reasons are:

1) Gforth and a lot of other "irrelevant" (to the gcc maintainers)
projects sail in the slipstream of "relevant" code like SPEC and
the Linux kernel that are all full of undefined behaviour (Linux
defines many of them with flags, like Gforth does), so gcc does not
"optimize" as crazily as a UB fan might wish.

2) The code snippets are very short, with many in-edges on the
preceding and following label, which tends to destroy any
"knowledge" that the compiler might have derived from the
assumption that the program does not exercise undefined behaviour.
This severely limits the distance over which such "optimizations"
can be performed.

Nevertheless, the last time I tried what happens if I compile without
the behaviour-defining options, the result did not work; I did not
investigate this further.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Niklas Holsti on Thu Sep 5 15:27:47 2024

On 05/09/2024 10:51, Niklas Holsti wrote:

On 2024-09-05 10:54, David Brown wrote:

On 05/09/2024 02:56, MitchAlsup1 wrote:

On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

On 9/4/2024 3:59 PM, Scott Lurndal wrote:

Say:
   long z;
   int x, y;
   ...
   z=x*y;
Would auto-promote to long before the multiply.

I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.

You snipped rather unfortunately here - it makes it look like this was
code that Scott wrote, and you've removed essential context by BGB.

While I agree it is an example of the kind of code that people
sometimes write when they don't understand C arithmetic, I don't think
it is C-specific. I can't think of any language off-hand where
expressions are evaluated differently depending on types used further
out in the expression. Can you give any examples of languages where
the equivalent code would either do the multiplication as "long", or
give an error so that the programmer would be informed of their error?

The Ada language can work in both ways. If you just have:

   z : Long_Integer; -- Not a standard Ada type, but often provided.
   x, y : Integer;
   ...
   z := x * y;

the compiler will inform you that the types in the assignment do not
match: using the standard (predefined) operator "*", the product of two Integers gives an Integer, not a Long_Integer.

That seems like a safe choice. C's implicit promotion of int to long
int can be convenient, but convenience is sometimes at odds with safety.

If you add this
definition to the code:

   function "*" (Left, Right : Integer) return Long_Integer
   is (Long_Integer(Left) * Long_Integer(Right));

the compiler sees that there is now /also/ an Integer * Integer => Long_Integer multiplication operator, and uses that. Function
overloading in Ada can depend on the type expected of the result.

You can make types in C++ that have this effect, but you have to make
them and use them consistently. You can't overload operators on
standard types like that.

Perhaps you asked for a language that worked like this "out of the box", without the programmer having to add things like the "*" function above,
and then Ada would not qualify on the second alternative (automatic lengthening before multiplication, depending on the result type desired).

I asked for either, and you gave me both :-)

(I don't count personal one-person languages here.

While Ada has low market penetration, I don't think it quite qualifies
as a one-person language -- yet :-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Terje Mathisen on Thu Sep 5 15:29:47 2024

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

On 04/09/2024 22:31, Brett wrote:

David Brown <[email protected]> wrote:

On 04/09/2024 14:53, jseigh wrote:

On 9/4/24 06:57, David Brown wrote:

On 04/09/2024 09:15, Terje Mathisen wrote:

David Brown wrote:

Maybe?

Rust will _always_ check for such overflow in debug builds, then when
you've determined that they don't occur, the release build falls back to
standard CPU behavior, i.e. wrapping around with no panics.

But if you've determined that they do not occur (during debugging),
then your code never makes use of the results of an overflow - thus
why is it defined behaviour? It makes no sense. The only time when
you would actually see wrapping in final code is if you hadn't tested
it properly, and then you can be pretty confident that the whole thing
will end in tears when signs change unexpectedly. It would be much
more sensible to leave signed overflow undefined, and let the compiler
optimise on that basis.

You absolutely do want defined behavior on overflow.

No, you absolutely do /not/ want that - for the vast majority of
use-cases.

There are times when you want wrapping behaviour, yes. More generally,
you want modulo arithmetic rather than a model of mathematical integer
arithmetic. But those cases are rare, and in C they are easily handled
using unsigned integers.

I tried using unsigned for a bunch of my data types that should never go
negative, but every time I would have to compare them with an int
somewhere and that would cause a compiler warning, because the goal was
to also remove unsafe code.

Complete and udder disaster, went back to plain sized ints.

That's a matter of choice in the warnings you pick and the style you
use - these should match.

However, I don't think C's integer promotion rules are ideal in regard
to mixing signed and unsigned arithmetic - converting both to
"unsigned" can easily lead to trouble.

Some people recommend using unsigned int everywhere you can, because
the overflow behaviour is defined - I think that is simply wrong. Use
unsigned int where it is appropriate, but it is very rare (though it
happens sometimes) that you want any arithmetic to overflow in any
way. So the justification is wrong.

Some people like to use unsigned int when the values will not be
negative. I don't think that is a good idea either. In general, for
any given use you only need a limited range of values. 0 to 10000 is
just as much a subset of "int" as "unsigned int", and using "unsigned
int" does not give any advantages. On the contrary, using "int" can
give more efficient code in many places, and lets you enable warnings
about mixed unsigned / signed operations for when you actually want them.

Unsigned types are ideal for "raw" memory access or external data, for
anything involving bit manipulation (use of &, |, ^, << and >> on
signed types is usually wrong, IMHO), as building blocks in extended
arithmetic types, for the few occasions when you want two's complement
wrapping, and for the even fewer occasions when you actually need that
last bit of range.

That last paragraph enumerates pretty much all the uses I have for integer-type variables, with (like Mitch) a few apis that use (-1) as an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

It would be nice if C had subrange types like Pascal or Ada, but it
does not. Usually int - or sized ints - are the practical choice.

Agreed 100%

I wrote enough Pascal with ranged types that I got used to it, and found
that I was missing the feature when I used C.

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Brett on Thu Sep 5 15:35:59 2024

On 04/09/2024 22:15, Brett wrote:

David Brown <[email protected]> wrote:

On 03/09/2024 21:28, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Agreed.

I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.

Wrong, the last version of Swift added all the garbage programming concepts that one should avoid.

That does not show that I was wrong - perhaps Swift is not a powerful programming language!

Of course, it all depends on what you mean by "powerful".

(I don't know Swift at all.)

You have to give people the tools to do anything.

You don't /have/ to do that. But it's often easier to market a language
that can do anything.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Stefan Monnier on Thu Sep 5 13:19:59 2024

Stefan Monnier <[email protected]> writes:

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

For programs there is no conformance level "free from UB" in the C
standard. There are two conformance levels for programs:

1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. And of course it also excludes pretty much all
non-terminating programs.

2) A conforming program is one that is acceptable to a conforming
implementation. So if, say, gcc-10 is a conforming implementation
(and I think that it claims so), and it accepts your program, your
program is a conforming program.

One first would have to agree on whether the program should be
conforming or strictly conforming. In the "strictly conforming" case,
it is indeed hard to write any useful code (I find it even hard to
think of a useful non-terminating program that uses only things
specified in the C standard features). OTOH, conforming programs
include many that exercise undefined, unspecified, or
implementation-defined behaviour, so in that case the C standard does
not serve as specification.

In either case, treating the C standard as agreement is nonsense.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From David Brown@21:1/5 to All on Thu Sep 5 15:48:37 2024

On 04/09/2024 20:13, MitchAlsup1 wrote:

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed. So you might as well
just call memmove() any time you are not sure memcpy() is safe and
appropriate.

Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs it
saves everyone long term grief.

Or just use memmove, and not memcpy, whenever you are moving stuff
around in the same buffer.

When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.

memcpy is not nefarious. It's quite simple, and does what it says on
the tin. Use it when you want to copy non-overlapping memory areas.
Don't use it if you want to do something other than that. I have never understood why anyone would find this difficult.
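
To make the distinction concrete, here is a minimal sketch (the helper names `delete_range` and `snapshot` are made up for illustration and come from no post in this thread). Deleting elements from the middle of an array shifts the tail down inside the same buffer, so source and destination overlap and memmove() is the right call; copying into a separate buffer can use memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove `count` ints starting at index `pos` from an
   array of `len` ints by shifting the tail down. Source and destination
   lie in the same buffer and may overlap, so memmove() is required.
   Returns the new length. */
size_t delete_range(int *a, size_t len, size_t pos, size_t count)
{
    assert(pos <= len && count <= len - pos);
    memmove(a + pos, a + pos + count, (len - pos - count) * sizeof a[0]);
    return len - count;
}

/* Hypothetical helper: copy into a distinct buffer. The caller guarantees
   that dst and src do not overlap, so memcpy() is fine here. */
void snapshot(int *dst, const int *src, size_t len)
{
    memcpy(dst, src, len * sizeof src[0]);
}
```

Calling `delete_range(a, 6, 1, 1)` on `{0, 1, 2, 3, 4, 5}` leaves `{0, 2, 3, 4, 5}` in the first five slots.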


From David Brown@21:1/5 to All on Thu Sep 5 16:01:26 2024

On 04/09/2024 18:32, MitchAlsup1 wrote:

On Wed, 4 Sep 2024 7:29:01 +0000, David Brown wrote:

On 03/09/2024 22:22, MitchAlsup1 wrote:

On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

Specifications are an agreement between the supplier and the
client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

# define int int64_t
..

makes it easier.

That's UB, I believe :-) And it will certainly be confusing.

On 64-bit machines it re-establishes the dusty-deck old K&R C where
int was the fastest integer type.

No, it does not. It gives you an inconsistent mess and opens up all
sorts of potential complications when interacting with code that uses
"int" properly. It does not help you avoid UB - it creates a lot more potential for mistakes.

Now, if you had suggested that we'd have been better off if the powers
that be had made int 64-bit on 64-bit targets, then it would be a very
different matter. It would reduce the risk of UB from signed integer
overflow quite considerably - few numbers are big enough to overflow 64
bits without being so big that you are using multi-precision numerics
libraries anyway.

It would also mean that a lot of existing code that incorrectly or
non-portably assumes "int" is 32-bit, would fail to work on the new systems.

We can't change existing non-portable code. We can't change existing
ABI's for 64-bit targets. Slapping a #define band-aid on the code will
not fix anything.

A better answer is to use int_fastNN_t types in your own code, picking a
size that matches what you actually need. (Perhaps limit it to 32 or
64, to be portable to most systems - or just 64 if you really are sure
the code will not be used on smaller targets.) int_fast32_t and
int_fast64_t are both 64-bit on x86-64.
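
As a rough sketch of what that looks like in practice (the function name `sum_samples` is invented for this example): the accumulator is at least 64 bits wide on every conforming implementation, and the loop counter is whatever the platform considers its fastest type of at least 32 bits.

```c
#include <stdint.h>

/* Hypothetical example: sums n 32-bit samples. int_fast64_t is at least
   64 bits wide, so the total cannot overflow for any realistic n, and
   int_fast32_t lets the implementation pick its preferred counter width. */
int_fast64_t sum_samples(const int32_t *v, int_fast32_t n)
{
    int_fast64_t total = 0;
    for (int_fast32_t i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}
```

The exact widths of the fast types are up to the implementation, so code like this should only rely on the guaranteed minimum.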


From Scott Lurndal@21:1/5 to David Brown on Thu Sep 5 14:06:45 2024

David Brown <[email protected]> writes:

On 05/09/2024 11:12, Terje Mathisen wrote:

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

We do. There is no issue using unsigned loop counters, array
indices are always positive and unsigned integer arithmetic works
just fine in our application.


From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Sep 5 14:05:08 2024

Terje Mathisen <[email protected]> writes:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data, for
anything involving bit manipulation (use of &, |, ^, << and >> on signed
types is usually wrong, IMHO), as building blocks in extended arithmetic
types, for the few occasions when you want two's complement wrapping,
and for the even fewer occasions when you actually need that last bit of
range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.

Same here.

It would be nice if C had subrange types like Pascal or Ada, but it does
not. Usually int - or sized ints - are the practical choice.

Agreed 100%

Although absent architecture support, how does one ensure that the
value remains within the subrange?

I wrote enough Pascal with ranged types that I got used to it, and found
that I was missing the feature when I used C.

On the Burroughs Medium Systems, which addressed to the digit/nibble,
ranged types (up to 100 digits/bytes) were de rigueur. Natural types
for COBOL.


From Thomas Koenig@21:1/5 to Anton Ertl on Thu Sep 5 14:49:42 2024

Anton Ertl <[email protected]> schrieb:

In either case, treating the C standard as agreement is nonsense.

That's a good summary of your attitude. Using this argument with
compiler writers will get you precisely nowhere, but you already
have experience with that.

Now for a challenge: Try to specify the behavior of any piece of
undefined behavior whose treatment by compilers you object to in
a way that a compiler writer can follow it. Think that it could
be published as an annex to the standard.

What could this be?


From Thomas Koenig@21:1/5 to Scott Lurndal on Thu Sep 5 14:58:54 2024

Scott Lurndal <[email protected]> schrieb:

David Brown <[email protected]> writes:

On 05/09/2024 11:12, Terje Mathisen wrote:

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

We do. There is no issue using unsigned loop counters,

I find counting down from n to 0 using unsigned variables
unintuitive. Or do you always count up and then calculate
what you actually use? Induction variable optimization
should take care of that, but it would be more complicated
to use.
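
For reference, one common idiom for counting down with an unsigned index is to test and decrement in the loop condition. The sketch below uses a made-up function `copy_reversed`; it is one way of writing such a loop, not something anyone in this thread has said they use.

```c
#include <stddef.h>

/* Hypothetical example: copies src into dst in reverse order, visiting
   indices n-1 down to 0. The condition compares i with 0 first and then
   decrements it, so the body only ever sees valid indices and the loop
   also does nothing when n == 0. */
void copy_reversed(int *dst, const int *src, size_t n)
{
    size_t out = 0;
    for (size_t i = n; i-- > 0; ) {
        dst[out++] = src[i];
    }
}
```

After the loop ends, i has wrapped around to SIZE_MAX, which is well defined for unsigned types but is also exactly the part that many people find unintuitive.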


From David Brown@21:1/5 to Anton Ertl on Thu Sep 5 17:09:19 2024

On 05/09/2024 13:31, Anton Ertl wrote:

David Brown <[email protected]> writes:

Anton writes code that seriously pushes the boundary of what can be
achieved. For at least some of the things he does (such as GForth) he
is trying to squeeze every last drop of speed out of the target. And he
is /really/ good at it. But that means he is forever relying on nuances
about code generation. His code, at least for efficiency if not for
correctness, is dependent on details far beyond what is specified and
documented for C and for the gcc compiler. He might spend a long time
working with his code and a version of gcc, fine-tuning the details of
his source code to get out exactly the assembly he wants from the
compiler.

No. We distribute Gforth as source code. It works for a wide variety
of architectures and compilers. So unlike what you suggest and what
some people have suggested earlier to avoid problems with new
"optimizations" in newer releases of gcc, we don't concentrate on a
specific version of gcc.

OK.

Of course it is frustrating for him when the next version of
gcc generates very different assembly from that same source, but he is
not really programming at the level of C, and he should not expect
consistency from C compilers like he does.

It's normal and no problem when the next version of gcc generates
different assembly language. There are some basic assumptions that
our code relies on, and that mostly does not change between gcc
versions.

As long as you are sticking to defined behaviour (defined by the C
standards, or by the gcc documentation), and use specified C standard
versions in the build, then there should not be any incorrect behaviour
in different versions. Performance might regress, and of course there's
always the risk of bugs.

An essential assumption is that, when we have:

A:
C code
B:

... that when we do &&A and &&B (which is documented in the GNU C
manual), we get the addresses pointing to the start and end of the
machine code corresponding to the C code.
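
For readers who have not met the GNU C "labels as values" extension, here is a minimal sketch of the documented part: taking label addresses with && and jumping through them with goto *. The opcodes and the function `run` are invented for illustration; this is a toy dispatcher and makes no claim about how Gforth is actually structured. It needs gcc or clang and is not ISO C.

```c
/* Hypothetical opcodes for a toy accumulator machine. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Runs a program of opcodes on an accumulator, dispatching through a
   table of label addresses (GNU C extension). Assumes the program is
   well formed: only valid opcodes, terminated by OP_HALT. */
long run(const unsigned char *program, long acc)
{
    /* Table of label addresses, indexed by opcode. */
    static void *const dispatch[] = { &&do_inc, &&do_double, &&do_halt };
    const unsigned char *ip = program;

    goto *dispatch[*ip++];

do_inc:
    acc += 1;
    goto *dispatch[*ip++];
do_double:
    acc *= 2;
    goto *dispatch[*ip++];
do_halt:
    return acc;
}
```

Note that this only relies on goto * reaching the right label. Anything about where the labels end up in the generated machine code is a separate assumption.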

I don't see anything in the gcc reference manual suggesting that &&B is
the end of the corresponding code. What you get - all you get - is that
"goto * &&A" gives the same effect as "goto A".

In the days starting with
gcc-3.0, we found that gcc started reordering the basic blocks within
loops, so replaced loops in the part of the code that needs such
assumptions into separate functions. Around gcc-7, gcc started to
compile

A: C-code1
B: C-code2
C: goto *...

to the same code as

A: C-code1; C-code2; goto *...;
B: C-code2; goto *...;
C: goto *...;

I found a workaround that avoids this kind of code generation.

This is all the kind of thing you can expect when you make assumptions
about code generation that are not supported by the documentation.
Compilers can, and do, move code around in various ways, duplicate it,
combine it, unroll it, compress it - whatever gives (or tries to give - optimisation is not an exact science) better results while giving the documented behaviour.

I too have written code that relies on being able to identify the start
and end of certain bits of code - typically for microcontrollers where
you want some bits of code (like flash programming routines or very
timing critical interrupt code) put in ram rather than flash. Sometimes
that can be done with compiler extensions, sometimes it takes extra
flags, linker file magic, or other messing around. But it's not
something I would expect to be portable, and it needs confirmed for
every compiler version and selection of flags used. (I realise that
this is a vastly simpler task for the kind of work I do than for an open
source project!)

Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
that gcc compiled

goto *ca;

into the equivalent of

goto gotoca;

/* and elsewhere */
gotoca: goto *ca;

We reported that repeatedly. At one point a gcc maintainer gave us
some bullshit about a possible performance advantage from this transformation, of course without presenting any empirical support,
while we saw a big slowdown on our code. We developed workarounds for
that, and they are in Gforth to this day, even though we have not
encountered a new gcc version with this problem for over a decade, but
new Gforth should also work on old gcc.

Again, the compiler is not doing anything outside its specifications.
What you want here is a guarantee of behaviour that is not defined
anywhere. You are not seeing a bug in the compiler, or an
incompatibility with previous versions - you are seeing the need for a
feature (and a controlling compiler flag) that gcc does not currently
have. It's a potential feature that might be useful to other people
too, while being an anti-feature to others.

Another assumption is that when we concatenate the code snippet
between label A and B (which contains C-code1) and the code snippet
between label X and Y (which contains C-code3), executing the result
will behave like the concatenation of C-code1 and C-code3 in source
code. This assumption has two aspects:

1) Do the register assignments at the labels fit together. It turns
out that we never had a problem with that, and I think that the
reason for that is that the "goto *" can jump to any of those
labels (all their addresses are taken), and so the register
assignment must be the same right after each label.

What guarantees that the assignments are the same right before each
label? Probably that after the label, there is not much between
the label and the next goto*, and that makes all registers at
potential targets live.

2) If we have two pieces of machine code produced in this fashion,
does the architecture guarantee that such a concatenation works?
It turns out that in general-purpose architectures, all-but-one do.
That includes IA-64. The exception is MIPS with its architectural
load-delay slot (and there are also scheduling restrictions having
to do with the hilo register that may be problematic): the first
code snippet may end in a load, and the next code snippet may start
with an instruction that reads the result of the load. So we just
disabled this concatenation on MIPS.

We do a number of things to achieve stability: We do sanity-checking
on the resulting machine code snippets and fall back to plain threaded
code if the snippets turn out not to be relocatable.

Also, we enable all the flags for defining behaviour in gcc that we
find (unfortunately, in the documentation they are intermixed with
other options). For good measure, this includes -fno-delete-null-pointer-checks, although I doubt that it makes a
difference for our code either way.

(-fno-delete-null-pointer-checks will make no difference to code that
doesn't accidentally use leap-before-you-look checking.)
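
A sketch of what "leap before you look" means here (both function names are invented for the example): in the first function the pointer is dereferenced before it is tested, so a compiler may assume it is non-null and drop the test, unless -fno-delete-null-pointer-checks is in effect. The second function checks first and is unaffected by the flag.

```c
#include <stddef.h>

/* "Leap before you look": p is dereferenced before it is checked.
   Dereferencing a null pointer is undefined behaviour, so the compiler
   may treat the later check as dead code and remove it. */
int read_leap_first(const int *p)
{
    int value = *p;
    if (p == NULL) {
        return -1;
    }
    return value;
}

/* "Look before you leap": the check comes before the dereference, so
   it cannot be removed on the grounds of undefined behaviour. */
int read_look_first(const int *p)
{
    if (p == NULL) {
        return -1;
    }
    return *p;
}
```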

There are certainly a few cases (-fno-strict-aliasing is a prime
example) where flags are documented as disabling optimisations, when
they are better viewed as adding definitions to the language and would
be better documented under "Options Controlling C Dialect" or "Options
for Code Generation Conventions".
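
To illustrate the -fno-strict-aliasing point with a small sketch (the function `float_bits` is a made-up name, and the code assumes a 32-bit float): reading a float's bits through a cast pointer is only defined in the dialect that flag selects, while the memcpy() form is defined in standard C either way.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical example: returns the object representation of a float as
   a 32-bit unsigned integer.
   The shortcut  *(uint32_t *)&f  accesses a float through an lvalue of
   an incompatible type; standard C leaves that undefined, and gcc only
   gives it the obvious meaning with -fno-strict-aliasing.
   Copying the bytes with memcpy() is defined regardless of the flag. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    _Static_assert(sizeof bits == sizeof f, "this sketch assumes 32-bit float");
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On an implementation with IEEE 754 single-precision floats, `float_bits(1.0f)` is 0x3f800000.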

One thing that came up about a year ago was that gcc auto-vectorizes
adjacent memory accesses on AMD64 (apparently the AMD64 port
maintainers are unhappy because AMD64 does not have instructions like
ARM A64's ldp and stp:-), which did not impact correctness, but led to
worse performance (not just in Gforth; I have also seen it in the
bubble benchmark from John Hennessy's Stanford small integer
benchmarks; I'm sure there is some SPEC benchmark that benefits). A
quick addition of -fno-tree-vectorize fixed that.

That happens sometimes. In my brief testing of clang, it often seems a
bit too keen on vectorising code that would be better kept short and
simple. I have no doubt gcc gets that wrong sometimes too.

We have been thinking about moving from C to a better-defined
language, namely assembly language, but have not yet taken the plunge,
and it has not been necessary yet. Gcc has not been as crazy in our
experience as the UB rhetoric might make one think. Why is that? I
think the reasons are:

1) Gforth and a lot of other "irrelevant" (to the gcc maintainers)
projects sail in the slipstream of "relevant" code like SPEC and
the Linux kernel that are all full of undefined behaviour (Linux
defines many of them with flags, like Gforth does), so gcc does not
"optimize" as crazily as a UB fan might wish.

2) The code snippets are very short, with many in-edges on the
preceding and following label, which tends to destroy any
"knowledge" that the compiler might have derived from the
assumption that the program does not exercise undefined behaviour.
This severely limits the distance over which such "optimizations"
can be performed.

Nevertheless, the last time I tried what happens if I compile without
the behaviour-defining options, the result did not work; I did not investigate this further.

You are looking for more than C and the gcc documented extensions give
you. That is always going to be hard.

Ideally, you need a new gcc flag or two with documented and guaranteed
effects to give you the assurance you need for your code. That's going
to take a lot of effort, I would expect, and I can see it being hard for
a relatively niche project like Gforth to push for that. Linux has the
backing here to push for changes - even if Linus Torvalds rants and
insults the gcc developers, IBM and friends can still pay gcc developers
to make the changes he wants.

Thank you for your explanation of your needs here, and information about
how your code works. I'm afraid I can't do anything to help, but it
helps me understand where you are coming from.

(I'm still a fan of the principle of UB, and of compilers using
knowledge of UB for optimisation - but that does not mean I can't
sympathise with people who find that frustrating and who see things differently.)


From Terje Mathisen@21:1/5 to David Brown on Thu Sep 5 19:04:41 2024

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >> on
signed types is usually wrong, IMHO), as building blocks in extended
arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e. the largest supported unsigned type in the
system, typically the same as u64.

unsigned arithmetic is easier than signed integer arithmetic, including comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was negative.
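
A minimal sketch of "test before subtracting" (the helper name `checked_sub` is invented here): with unsigned operands the comparison has to come first, because a - b with a < b wraps around to a large positive value instead of going negative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores a - b in *out and returns true when the
   difference is representable (a >= b). Otherwise returns false and
   leaves *out untouched. */
bool checked_sub(size_t a, size_t b, size_t *out)
{
    if (a < b) {
        return false;
    }
    *out = a - b;
    return true;
}
```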

I.e. I cannot easily replicate a downward loop that exits when the
counter becomes negative:

for (int i = START; i >= 0; i-- ) {
    // Do something with data[i]
}

One of my alternatives is

unsigned u = start; // Cannot be less than zero
if (u) {
    u++;
    do {
        u--;
        data[u]...
    } while (u);
}

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JAE (Jump Above or Equal), but my version is far more verbose.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Scott Lurndal@21:1/5 to Thomas Koenig on Thu Sep 5 17:15:42 2024

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

David Brown <[email protected]> writes:

On 05/09/2024 11:12, Terje Mathisen wrote:

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

We do. There is no issue using unsigned loop counters,

I find counting down from n to 0 using unsigned variables
unintuitive. Or do you always count up and then calculate
what you actually use? Induction variable optimization
should take care of that, but it would be more complicated
to use.

Just checked current project; out of some 5000 'for' loops, only two worked backwards, and terminating at zero worked algorithmically.
About 20% of loops were iterating using standard C++ iterators,
the rest were size_t or other unsigned integer types.


From Anton Ertl@21:1/5 to David Brown on Thu Sep 5 15:49:39 2024

David Brown <[email protected]> writes:

On 05/09/2024 13:31, Anton Ertl wrote:

It's normal and no problem when the next version of gcc generates
different assembly language. There are some basic assumptions that
our code relies on, and that mostly does not change between gcc
versions.

...

In the days starting with
gcc-3.0, we found that gcc started reordering the basic blocks within
loops, so replaced loops in the part of the code that needs such
assumptions into separate functions. Around gcc-7, gcc started to
compile

A: C-code1
B: C-code2
C: goto *...

to the same code as

A: C-code1; C-code2; goto *...;
B: C-code2; goto *...;
C: goto *...;

I found a workaround that avoids this kind of code generation.

This is all the kind of thing you can expect when you make assumptions
about code generation that are not supported by the documentation.

Nobody said that gcc did anything wrong here. We were, however,
surprised that -fno-reorder-blocks did not suppress the reordering; we
reported this as bug, but were told that this option does something
different from what it says. Anyway, we developed a workaround. And
we also developed a workaround for the code duplication problem that
showed up in gcc-7.

I too have written code that relies on being able to identify the start
and end of certain bits of code - typically for microcontrollers where
you want some bits of code (like flash programming routines or very
timing critical interrupt code) put in ram rather than flash. Sometimes
that can be done with compiler extensions, sometimes it takes extra
flags, linker file magic, or other messing around. But it's not
something I would expect to be portable, and it needs confirmed for
every compiler version and selection of flags used. (I realise that
this is a vastly simpler task for the kind of work I do than for an open
source project!)

Between what we developed for gcc-3.2 (released 2002) in 2003 and
today, the only new development in these 21 years was the code
duplication in gcc-7 and the workaround for that. IIRC Gforth also
worked without that workaround, but was slower.

Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
that gcc compiled

goto *ca;

into the equivalent of

goto gotoca;

/* and elsewhere */
gotoca: goto *ca;

We reported that repeatedly. At one point a gcc maintainer gave us
some bullshit about a possible performance advantage from this
transformation, of course without presenting any empirical support,
while we saw a big slowdown on our code. We developed workarounds for
that, and they are in Gforth to this day, even though we have not
encountered a new gcc version with this problem for over a decade, but
new Gforth should also work on old gcc.

Again, the compiler is not doing anything outside its specifications.

Nobody said it did. We did, however, report this as a pessimization repeatedly. And eventually the gcc people fixed it; we already saw
versions without this bug in gcc-4.0 or 4.1 IIRC, but in 4.4 it was
there again, but apparently they have since fixed it for good.

You are looking for more than C and the gcc documented extensions give
you. That is always going to be hard.

Really? It works.

Ideally, you need a new gcc flag or two with documented and guaranteed
effects to give you the assurance you need for your code. That's going
to take a lot of effort, I would expect, and I can see it being hard for
a relatively niche project like Gforth to push for that.

Our approach has been to find sanity-checks and workarounds based on
what gcc provided.

However, we were not the only ones working with code copying, and
Prokopski and Verbrugge have implemented changes to gcc that support
this technique, and presented it at the GCC Developers’ Summit 2007 <https://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=get&target=GCC2007-Proceedings.pdf>
and at CC'08:

@InProceedings{prokopski&verbrugge08,
author = {Gregory B. Prokopski and Clark Verbrugge},
title = {Compiler-Guaranteed Safety in Code-Copying Virtual
Machines},
booktitle = {Compiler Construction (CC'08)},
pages = {163--177},
year = {2008},
publisher = {Springer LNCS 4959},
url = {http://www.sable.mcgill.ca/publications/papers/2008-2/paper.pdf},
OPTannote = {}
}

The source code was available, but the gcc maintainers were apparently
not interested. So much for "patches welcome".

Looking back, while there was quite a bit of interest in code-copying
(both for interpreters and for partial evaluators) from about
1998-2008, AFAIK Gforth is the only project that stuck with this
technique.

When others consider relatively unsophisticated interpreters to be too
slow, they tend to go for JIT compilers that generate machine code
using target-specific code (including machine-code encoding code).

Maybe the constant advocacy that everything outside the standard is
considered to be broken and the next compiler will not compile it as
intended has had its effects. Or maybe if we had published a
code-copying howto, more people would have found out how to do it in a
way that works pretty reliably.

OTOH, we ourselves have been thinking about switching to the kind of
JIT compiler that others have gone for. So we fell for this advocacy ourselves. But looking at the stability of Gforth, this is not really justified. Still, a solid foundation like machine code provides more confidence than a foundation based on C where every new compiler
version may bring unpleasant surprises (and not just for projects such
as Gforth), even if the experience is that things work.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Niklas Holsti@21:1/5 to Anton Ertl on Thu Sep 5 21:38:07 2024

On 2024-09-05 18:49, Anton Ertl wrote:

David Brown <[email protected]> writes:

On 05/09/2024 13:31, Anton Ertl wrote:

[ discussion of the implementation of Gforth as a code-copying
and code-pasting interpreter, and the maintenance problems
this leads to when changing gcc versions ]

It seems to me that this discussion (of Gforth) has very little to do
with the ability of C compilers to optimize away or do something else
with C code that the compiler detects invokes Undefined Behavior, and
instead concerns how successive gcc versions break the assumptions that
Gforth developers make about the structure of the machine code that gcc
emits for legal C code that does not invoke Undefined Behavior if
executed without modification.

If you try to restructure or modify the machine code that Gcc produces
on the fly, during program execution, as Gforth tries to do, that is so
outside the C standard that it is only Undefined Behavior in the sense
of not being even considered in the standard.

I don't doubt that Anton has experienced bad effects of the
"optimization" of Undefined Behavior, in other contexts, but I tend to
agree with David on that issue.


From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Sep 5 19:29:36 2024

On Thu, 5 Sep 2024 14:05:08 +0000, Scott Lurndal wrote:

Terje Mathisen <[email protected]> writes:

David Brown wrote:

It would be nice if C had subrange types like Pascal or Ada, but it does
not. Usually int - or sized ints - are the practical choice.

Agreed 100%

Although absent architecture support, how does one ensure that the
value remains within the subrange?

result = min(max(min_range,x),max_range);

or for 2^n values

result = ( ( x << (64-width) ) >> (64-width) );

The top is 2 instructions, the bottom 1 (both signed and unsigned).
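
Spelled out as C, those two forms might look like the sketch below (the function names are invented for this example). Two caveats: the shift form wraps modulo 2^width rather than clamping, and for the signed case the left shift is done on the unsigned type to stay out of undefined behaviour; the conversion back to int64_t and the right shift of a negative value are implementation-defined, and the comments assume the usual two's complement, arithmetic-shift behaviour that gcc documents.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamps x into [lo, hi], i.e. min(max(lo, x), hi). */
int64_t clamp_range(int64_t x, int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Hypothetical helper: keeps the low `width` bits of x, zero-extended,
   giving a value in 0 .. 2^width - 1. */
uint64_t wrap_unsigned(uint64_t x, unsigned width)
{
    assert(width >= 1 && width <= 64);
    return (x << (64 - width)) >> (64 - width);
}

/* Hypothetical helper: keeps the low `width` bits of x, sign-extended.
   Assumes the implementation converts to int64_t by wrapping and shifts
   negative values right arithmetically. */
int64_t wrap_signed(int64_t x, unsigned width)
{
    assert(width >= 1 && width <= 64);
    return (int64_t)((uint64_t)x << (64 - width)) >> (64 - width);
}
```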


From MitchAlsup1@21:1/5 to David Brown on Thu Sep 5 19:24:19 2024

On Thu, 5 Sep 2024 13:48:37 +0000, David Brown wrote:

On 04/09/2024 20:13, MitchAlsup1 wrote:

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(), which
will choose to run forwards or backwards as needed. So you might as well
just call memmove() any time you are not sure memcpy() is safe and
appropriate.

Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs it
saves everyone long term grief.

Or just use memmove, and not memcpy, whenever you are moving stuff
around in the same buffer.

When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.

memcpy is not nefarious. It's quite simple, and does what it says on
the tin. Use it when you want to copy non-overlapping memory areas.
Don't use it if you want to do something other than that. I have never understood why anyone would find this difficult.

There are compilers that:: s/memcpy/memmove/g


From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Sep 5 19:31:23 2024

On Thu, 5 Sep 2024 14:06:45 +0000, Scott Lurndal wrote:

---------------------------------------------------------- array
indices are always positive

Not in ada or fortran.


From David Brown@21:1/5 to Terje Mathisen on Thu Sep 5 21:36:04 2024

On 05/09/2024 19:04, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e. the largest supported unsigned type in the
system, typically the same as u64.

Loop counters can usually be signed or unsigned, and it usually makes no difference. Array indices are also usually much the same signed or
unsigned, and it can feel more natural to use size_t here (an unsigned
type). It can make a difference to efficiency, however. On x86-64,
this code is 3 instructions with T as "unsigned long int" or "long int",
4 with "int", and 5 with "unsigned int".

int foo(int * p, T x) {
int a = p[x++];
int b = p[x++];
return a + b;
}

Anyway, I count loop counters and array indices as "use of integer-type variables", whether you prefer signed or unsigned.

unsigned arithmetic is easier than signed integer arithmetic, including comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was
negative.

I can't follow that at all. Unsigned and signed arithmetic and
comparisons both work simply and as you'd expect. /Mixing/ signed and
unsigned types can get things wrong.

I.e. I cannot easily replicate a downward loop that exits when the
counter becomes negative:

for (int i = START; i >= 0; i-- ) {
    // Do something with data[i]
}

One of my alternatives are

unsigned u = start; // Cannot be less than zero
if (u) {
    u++;
    do {
      u--;
      data[u]...
    while (u);
}

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above or Equal), but my version is far more verbose.

A more important thing is that the first version, with signed i, is
/vastly/ simpler and clearer in the source code.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

Or you could just write sane code that matches what you want to say.

From Bernd Linsel@21:1/5 to Anton Ertl on Thu Sep 5 22:05:17 2024

On 05.09.24 17:49, Anton Ertl wrote:

Nobody said that gcc did anything wrong here. We were, however,
surprised that -fno-reorder-blocks did not suppress the reordering; we reported this as bug, but were told that this option does something
different from what it says. Anyway, we developed a workaround. And
we also developed a workaround for the code duplication problem that
showed up in gcc-7.

Have you tried interspersing `asm volatile("")` statements?

It is very often an effective means to prevent gcc from reordering code
from before and after the asm statement.

If you additionally specify inputs, e.g. `asm volatile("" :: "r" (foo))`,
you can force gcc to keep `foo` alive up to this point.

--
Bernd Linsel

From Niklas Holsti@21:1/5 to All on Thu Sep 5 23:05:51 2024

On 2024-09-05 22:29, MitchAlsup1 wrote:

On Thu, 5 Sep 2024 14:05:08 +0000, Scott Lurndal wrote:

Terje Mathisen <[email protected]> writes:

David Brown wrote:

It would be nice if C had subrange types like Pascal or Ada, but it
does not. Usually int - or sized ints - are the practical choice.

Agreed 100%

Although absent architecture support, how does one ensure that the
value remains within the subrange?

result = min(max(min_range,x),max_range);

That would be a /saturating/ ranged type. Neither Pascal nor Ada
provides such types.

or for 2^n values

result = ( ( x << (64-width) ) >> (64-width) );

The top is 2 instructions, the bottom 1 (both signed and unsigned).

That would be a /wrap-around/ ranged type (if I understand the code
correctly). Pascal does not provide such; Ada does (modular integers)
and for any modulus, not only powers of two.

Pascal ranged types are expected to trap (abort) on exceeding the range,
IIRC, and Ada non-modular ranged types are expected to raise an
exception. Probably that, too, is only a couple of instructions for Mitch.
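
The three behaviours can be put side by side in C. This is only a sketch: the function names range_saturate, range_wrap_pow2 and range_check are hypothetical, and the checked variant returns a flag where Pascal would trap and Ada would raise Constraint_Error. The wrap-around version is written for unsigned values so that both shifts are well defined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Saturating: out-of-range values are clamped to the nearest bound. */
int64_t range_saturate(int64_t x, int64_t lo, int64_t hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Wrap-around for a power-of-two modulus: keep the low `width` bits.
   Assumes 1 <= width <= 64 (a shift by 64 would be undefined). */
uint64_t range_wrap_pow2(uint64_t x, unsigned width) {
    return (x << (64 - width)) >> (64 - width);
}

/* Checked: report failure instead of producing a value. */
bool range_check(int64_t x, int64_t lo, int64_t hi, int64_t *out) {
    if (x < lo || x > hi) return false;
    *out = x;
    return true;
}
```

For example, range_saturate(15, 0, 10) gives 10, range_wrap_pow2(257, 8) gives 1, and range_check(11, 0, 10, &v) fails without touching v.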

From Bill Findlay@21:1/5 to All on Thu Sep 5 22:23:51 2024

On 5 Sep 2024, MitchAlsup1 wrote
(in article<[email protected]>):

On Thu, 5 Sep 2024 14:06:45 +0000, Scott Lurndal wrote:

---------------------------------------------------------- array
indices are always positive

Not in ada or fortran.

Or C.

--
Bill Findlay

From Brett@21:1/5 to David Brown on Thu Sep 5 20:46:24 2024

David Brown <[email protected]> wrote:

On 04/09/2024 22:15, Brett wrote:

David Brown <[email protected]> wrote:

On 03/09/2024 21:28, Stefan Monnier wrote:

My impression - based on hearsay for Rust as I have no experience - is that
the key point of Rust is memory "safety". I use scare-quotes here, since it
is simply about correct use of dynamic memory and buffers.

It is entirely possible to have correct use of memory in C,

If you look at the evolution of programming languages, "higher-level"
doesn't mean "you can do more stuff". On the contrary, making
a language "higher-level" means deciding what it is we want to make
harder or even impossible.

Agreed.

I've heard it said that the power of a programming language comes not
from what you can do with the language, but from what you cannot do.

Wrong, the last version of Swift added all the garbage programming concepts
that one should avoid.

That does not show that I was wrong - perhaps Swift is not a powerful programming language!

Of course, it all depends on what you mean by "powerful".

(I don't know Swift at all.)

Clearly, you are not developing in the Apple ecosystem.
Swift has completely replaced Objective-C as the development language used
on Apple hardware. C++ was not used for OSX development.

You have to give people the tools to do anything.

You don't /have/ to do that. But it's often easier to market a language
that can do anything.

From Bernd Linsel@21:1/5 to All on Thu Sep 5 23:07:22 2024

On 05.09.24 19:04, Terje Mathisen wrote:

One of my alternatives are

unsigned u = start; // Cannot be less than zero
if (u) {
    u++;
    do {
      u--;
      data[u]...
    while (u);
}

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal instead
of JA (Jump Above or Equal, but my version is far more verbose.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

What about:

for (unsigned u = start; u != ~0u; --u)
    ...

or even

for (unsigned u = start; (int)u >= 0; --u)
    ...

?

I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
on godbolt.org:
- 32 bit version: https://godbolt.org/z/TMhhx3nch
- 64 bit version: https://godbolt.org/z/8oxzTf5Gf

--
Bernd Linsel

From Scott Lurndal@21:1/5 to Bernd Linsel on Thu Sep 5 21:39:00 2024

Bernd Linsel <[email protected]> writes:

On 05.09.24 19:04, Terje Mathisen wrote:

One of my alternatives are

unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
while (u);
}

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above or Equal), but my version is far more verbose.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

What about:

for (unsigned u = start; u != ~0u; --u)

This is the form we use most when we need
to work in reverse.

...

or even

for (unsigned u = start; (int)u >= 0; --u)
...

?

I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
on godbolt.org:
- 32 bit version: https://godbolt.org/z/TMhhx3nch
- 64 bit version: https://godbolt.org/z/8oxzTf5Gf

No significant differences in code generation for unsigned vs. signed.
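
For reference, a small sketch of the reverse-loop forms as complete functions. The function names are hypothetical. The third form (decrement inside the condition) was not posted in this thread; it is included as a commonly seen alternative, on the assumption that start is less than UINT_MAX so that start + 1u does not wrap to zero.

```c
#include <stddef.h>

/* Each function walks data[start] down to data[0] and returns the sum. */

long sum_down_signed(const int *data, int start) {
    long sum = 0;
    for (int i = start; i >= 0; i--)
        sum += data[i];
    return sum;
}

/* Stops when the counter wraps from 0 to UINT_MAX (~0u). */
long sum_down_wrap(const int *data, unsigned start) {
    long sum = 0;
    for (unsigned u = start; u != ~0u; --u)
        sum += data[u];
    return sum;
}

/* Decrement inside the condition: inside the body u runs start..0. */
long sum_down_postdec(const int *data, unsigned start) {
    long sum = 0;
    for (unsigned u = start + 1u; u-- > 0; )
        sum += data[u];
    return sum;
}
```

All three visit the same elements; which one reads best is the point under debate here.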

From Tim Rentsch@21:1/5 to Terje Mathisen on Thu Sep 5 15:04:14 2024

Terje Mathisen <[email protected]> writes:

[...]

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.

unsigned arithmetic is easier than signed integer arithmetic,
including comparisons that would result in a negative value, you just
have to make the test before subtracting, instead of checking if the
result was negative.

I.e. I cannot easily replicate a downward loop that exits when the
counter becomes negative:

for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}

See below.

One of my alternatives are

unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
} while (u); /* presumably the } was intended */
}

This code isn't the same as the for() loop above. If start is
0, the for() loop runs once, but the do..while loop runs zero times.
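
A quick way to see the difference is to count how often each loop body runs. In this sketch the names count_for_loop and count_do_while are hypothetical, and a counter increment stands in for the "data[u]..." body.

```c
/* Number of times the signed for() loop body runs. */
unsigned count_for_loop(int start) {
    unsigned runs = 0;
    for (int i = start; i >= 0; i--)
        runs++;
    return runs;
}

/* Number of times the guarded do..while body runs
   (with the missing closing brace restored). */
unsigned count_do_while(unsigned start) {
    unsigned runs = 0;
    unsigned u = start;
    if (u) {
        u++;
        do {
            u--;
            runs++;          /* stands in for "data[u]..." */
        } while (u);
    }
    return runs;
}
```

With start == 3 both return 4; with start == 0 the for() version returns 1 and the do..while version returns 0.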

Regarding the given for() loop, namely this:

for (int i = START; i >= 0; i-- ) {
// Do something with data[i]
}

If START is signed (presumably of type int), so the loop might run
zero times, but never more than INT_MAX times, then

for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
// Do something with data[i]
}

If START is unsigned, so in all cases the loop must run at
least once, then

unsigned u = START;
do {
// Do something with data[i]
} while( u > 0 && u-- );

(Yes I know the 'u > 0' expressions can be replaced by just 'u'.)

The optimizer should be smart enough to realize that if 'u > 0'
is true then the test 'u--' will also be true. The same should
hold if 'u > 0' is replaced by just 'u'.

(Disclaimer: code not compiled.)
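
The two forms above, wrapped into complete functions so they can be compiled and checked. This is a sketch only: the wrappers visit_for_form and visit_do_form and the out buffer are hypothetical, added to record which indices each loop visits.

```c
/* for() form, START signed: records visited indices, returns the count. */
unsigned visit_for_form(int start, unsigned *out) {
    unsigned n = 0;
    for (unsigned u = start < 0 ? 0 : start + 1u; u > 0 && u--; )
        out[n++] = u;
    return n;
}

/* do..while form, START unsigned: always runs at least once. */
unsigned visit_do_form(unsigned start, unsigned *out) {
    unsigned n = 0;
    unsigned u = start;
    do {
        out[n++] = u;
    } while (u > 0 && u--);
    return n;
}
```

Both visit start, start-1, ..., 0 in that order; the for() form visits nothing when start is negative.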

From Bernd Linsel@21:1/5 to Tim Rentsch on Fri Sep 6 00:19:52 2024

On 06.09.24 00:04, Tim Rentsch wrote:

If START is signed (presumably of type int), so the loop might run
zero times, but never more than INT_MAX times, then

for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
// Do something with data[i]
}

If START is unsigned, so in all cases the loop must run at
least once, then

unsigned u = START;
do {
// Do something with data[i]
} while( u > 0 && u-- );

(Yes I know the 'u > 0' expressions can be replaced by just 'u'.)

The optimizer should be smart enough to realize that if 'u > 0'
is true then the test 'u--' will also be true. The same should
hold if 'u > 0' is replaced by just 'u'.

(Disclaimer: code not compiled.)

Both yield not very elegant code:

https://godbolt.org/z/M4Y5PYP3v

--
Bernd Linsel

From David Brown@21:1/5 to BGB on Fri Sep 6 09:10:04 2024

On 06/09/2024 00:51, BGB wrote:

On 9/5/2024 8:27 AM, David Brown wrote:

On 05/09/2024 10:51, Niklas Holsti wrote:

On 2024-09-05 10:54, David Brown wrote:

On 05/09/2024 02:56, MitchAlsup1 wrote:

On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

On 9/4/2024 3:59 PM, Scott Lurndal wrote:

Say:
   long z;
   int x, y;
   ...
   z=x*y;
Would auto-promote to long before the multiply.

I may have to use this as an example of C allowing the programmer
to shoot himself in the foot; promotion or no promotion.

You snipped rather unfortunately here - it makes it look like this
was code that Scott wrote, and you've removed essential context by BGB.

While I agree it is an example of the kind of code that people
sometimes write when they don't understand C arithmetic, I don't
think it is C-specific. I can't think of any language off-hand
where expressions are evaluated differently depending on types used
further out in the expression. Can you give any examples of
languages where the equivalent code would either do the
multiplication as "long", or give an error so that the programmer
would be informed of their error?

The Ada language can work in both ways. If you just have:

    z : Long_Integer; -- Not a standard Ada type, but often provided.
    x, y : Integer;
    ...
    z := x * y;

the compiler will inform you that the types in the assignment do not
match: using the standard (predefined) operator "*", the product of
two Integers gives an Integer, not a Long_Integer.

That seems like a safe choice. C's implicit promotion of int to long
int can be convenient, but convenience is sometimes at odds with safety.

A lot of time, implicit promotion will be the "safer" option than first
doing an operation that overflows and then promoting.

Annoyingly, one can't really do the implicit promotion first and then
promote afterwards, as there may be programs that actually rely on this particular bit of overflow behavior.

A programming language has to work as it is defined. And people should
not be relying on code doing things that are /not/ defined.

So promoting arguments implicitly before the operation is only useful if
it is a clearly defined part of the language. (In C, that is the way it
works up to the size of "int".)

In C, if you have :

long int foo(int x, int y) {
    return x * y;
}

then the compiler is free to implement this as full 64-bit
multiplication and return that 64-bit value. This is because the result
of a 32 x 32 bit multiplication either gives the correct answer without overflow, and promoting it to 64 bit keeps that value, or there is an
overflow and the results are undefined, so the compiler can return
whatever it likes.

But unless the compiler documents this behaviour (in which case the code
would be correct but non-portable), the code is buggy.

Conversely, if unsigned types are used here, the results of the
multiplication must be truncated to 32 bits - keeping higher bits in the
return value would be a compiler bug.
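
To make the distinction concrete, here is a minimal sketch using fixed-width types. The function names are hypothetical, and the second function assumes int is at most 32 bits wide, so that uint32_t operands are not promoted to a wider signed int.

```c
#include <stdint.h>

/* Widen one operand first, so the multiplication itself is 64-bit. */
int64_t mul_widened(int32_t x, int32_t y) {
    return (int64_t)x * y;
}

/* Unsigned 32-bit multiply: the product wraps modulo 2^32 and only the
   wrapped value is then converted to 64 bits. */
uint64_t mul_unsigned_wrapped(uint32_t x, uint32_t y) {
    return x * y;
}
```

With x = y = 100000, the widened version returns 10000000000, while the unsigned 32-bit version returns 1410065408 (that is, 10000000000 mod 2^32).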

However a language wants to handle this, it needs to be specified by the language. Most languages (AFAIK) have no implicit promotion that is
dependent on what you are doing with the results. (Some, including C,
will have various degrees of implicit promotion dependent solely on the expression itself, but not on what is done with the evaluated result.)
Ada, AFAIK, does not have implicit promotions between types - "int" does
not automatically promote to "long int". This can be seen as an
inconvenience compared to many other languages, but it means that it is possible to have a consistent and safe way to overload by return type.

In effect, in my case, the promotion behavior ends up needing to depend
on the language-mode (it is either this or maybe internally split the operators into widening or non-widening variants, which are selected
when translating the AST into the IR stage).

Dependency on a "language mode" does not sound "safe" to me.

Well, as opposed to dealing with the widening cases by emitting IR with
an implicit casts added into the IR.

From David Brown@21:1/5 to All on Fri Sep 6 09:23:21 2024

On 05/09/2024 21:24, MitchAlsup1 wrote:

On Thu, 5 Sep 2024 13:48:37 +0000, David Brown wrote:

On 04/09/2024 20:13, MitchAlsup1 wrote:

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(), which
will chose to run forwards or backwards as needed. So you might as
well
just call memmove() any time you are not sure memcpy() is safe and
appropriate.

Memmove() is always appropriate unless you are doing something
nefarious.
So:
# define memcpy memmove
and move forward with life--for the 2 extra cycles memmove costs it
saves everyone long term grief.

Or just use memmove, and not memcpy, whenever you are moving stuff
around in the same buffer.

When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.

memcpy is not nefarious. It's quite simple, and does what it says on
the tin. Use it when you want to copy non-overlapping memory areas.
Don't use it if you want to do something other than that. I have never
understood why anyone would find this difficult.

There are compilers that:: s/memcpy/memmove/g

They can do that if they want - memcpy can be implemented using memmove,
but not vice versa.

That doesn't mean it is at all a good idea to use memcpy when you mean
memmove.
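
For copies inside a single buffer, the overlap test mentioned earlier in the thread can be written on offsets. This is a sketch under that assumption (both ranges lie in the same array); offsets_overlap and copy_within are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if [dst, dst+n) and [src, src+n) overlap. Both offsets index the
   same buffer, so comparing them is well defined. */
bool offsets_overlap(size_t dst, size_t src, size_t n) {
    return dst < src ? src - dst < n : dst - src < n;
}

/* Copies n bytes within one buffer, using memcpy only when it is safe. */
void copy_within(char *buf, size_t dst, size_t src, size_t n) {
    if (offsets_overlap(dst, src, n))
        memmove(buf + dst, buf + src, n);
    else
        memcpy(buf + dst, buf + src, n);
}
```

In practice calling memmove unconditionally does the same job, since implementations typically perform this kind of test internally.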

From Anton Ertl@21:1/5 to Bernd Linsel on Fri Sep 6 07:16:43 2024

Bernd Linsel <[email protected]> writes:

On 05.09.24 17:49, Anton Ertl wrote:

Nobody said that gcc did anything wrong here. We were, however,
surprised that -fno-reorder-blocks did not suppress the reordering; we
reported this as bug, but were told that this option does something
different from what it says. Anyway, we developed a workaround. And
we also developed a workaround for the code duplication problem that
showed up in gcc-7.

Have you tried interspersing `asm volatile("")` statements?

It is very often an effective means to prevent gcc from reordering code
from before and after the asm statement.

We are using asm statements that result in no machine code for various
purposes (including the workaround for the code duplication of gcc-7
ff.)

We have not tried it for suppressing the basic block reordering, and I
would not expect such a statement to suppress that, because asm
volatile("") acts as a data-flow barrier, and basic-block reordering
has nothing to do with data flow.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From David Brown@21:1/5 to Scott Lurndal on Fri Sep 6 09:17:45 2024

On 05/09/2024 23:39, Scott Lurndal wrote:

Bernd Linsel <[email protected]> writes:

On 05.09.24 19:04, Terje Mathisen wrote:

One of my alternatives are

unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
while (u);
}

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above or Equal), but my version is far more verbose.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

What about:

for (unsigned u = start; u != ~0u; --u)

This is the form we use most when we need
to work in reverse.

In a code review, I would reject that - and all the other nonsenses
suggested here as a way to force all loop indices to be unsigned types
as though that rule was the 11th commandment.

Just write code that makes sense - it's /not/ hard in this case!

for (int i = start; i >= 0; i--) ...

If you need the loop counter to be an unsigned type inside the loop
code, make an unsigned version:

for (int i = start; i >= 0; i--) {
const unsigned int u = i;
...
}

Sometimes it amazes me the kind of nonsense people write in code because
of obsession about particular rules. Code clarity trumps /all/
stylistic rules.

From Anton Ertl@21:1/5 to Niklas Holsti on Fri Sep 6 07:25:35 2024

Niklas Holsti <[email protected]d> writes:

On 2024-09-05 18:49, Anton Ertl wrote:

David Brown <[email protected]> writes:

On 05/09/2024 13:31, Anton Ertl wrote:

[ discussion of the implementation of Gforth as a code-copying
and code-pasting interpreter, and the maintenance problems
this leads to when changing gcc versions ]

It seems to me that this discussion (of Gforth) has very little to do
with the ability of C compilers to optimize away or do something else
with C code that the compiler detects invokes Undefined Behavior

Yes. What I wrote about was just to show what is happening in Gforth,
and that the techniques, even though they may seem totally outlandish
to some, are actually pretty usable across many releases of gcc
(despite the lack of guarantees from gcc); in the last 20 years we
have needed to deal with one new development, and our workaround for
that also works on older gcc releases.

What some C compilers tend to do is, however, better described as
"Assume That Undefined Behaviour Does Not Happen" (ATUBDNH), and
deriving "knowledge" from that (e.g., about the possible values of a
variable), and then using that "knowledge" in "optimizations".

I don't doubt that Anton has experienced bad effects of the
"optimization" of Undefined Behavior, in other contexts

The main bad effect is that I replaced more efficient and shorter code
with less efficient and longer code. In theory the compiler can
generate the same code for both, but in practice that does not happen.
As an example, the test for the smallest signed integer can be written
with -fwrapv as:

if (x<=x-1)

and gcc -fwrapv compiles this to shorter code on AMD64 than

if (x==CELL_MIN)

What gcc produces for both formulations is longer than

dec %rdi
jno ...

Maybe instead of pursuing "optimizations" against the intentions of
the programmer, they should concentrate on implementing real
optimizations like optimizing either variant into the small code shown
last.

Interestingly, the first idiom is a case where gcc recognizes what the intention of the programmer is, and warns that it is going to
miscompile it. The warning is good, the miscompilation not (but it
would be worse without the warning).

In any case, while the actual experience is that I have not been hit
by "optimizations" that ATUBDNH in production code, possibly because I
minimize these assumptions with flags like -fwrapv, the possibility
that my code might be hit by such an "optimization" (e.g., a new one
in a new compiler version, if I am lucky with a new flag for disabling
the assumption, but my source code does not know about it yet) and the
attitude of people who implement such "optimizations" is what I
resent.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Michael S@21:1/5 to Anton Ertl on Fri Sep 6 13:57:18 2024

On Fri, 06 Sep 2024 07:25:35 GMT
[email protected] (Anton Ertl) wrote:

The main bad effect is that I replaced more efficient and shorter code
with less efficient and longer code. In theory the compiler can
generate the same code for both, but in practice that does not happen.
As an example, the test for the smallest signed integer can be written
with -fwrapv as:

if (x<=x-1)

and gcc -fwrapv compiles this to shorter code on AMD64 than

if (x==CELL_MIN)

What gcc produces for both formulations is longer than

dec %rdi
jno ...

Good trick.
The same trick in non-destructive form would be 1 byte longer.
cmp $1, %rdi
jno ...

But I was not able to force any of compilers currently installed on my
home desktop (gcc 13.2, clang 18.1, MSVC 19.30.30706 == VS2022) to
produce such code.

The closest was MSVC that sometimes (not in all circumstances) produces
2 bytes longer version:
49 8d 49 ff lea -0x1(%r9),%rcx
4c 3b c9 cmp %rcx,%r9

Of course, it's still good deal shorter than
48 ba 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rdx
4c 3b ca cmp %rdx,%r9

Both gcc and clang [under -fwrapv] insisted on turning x<=x-1 into x==LLONG_MIN.

However even if we were able to force compiler to produce desired code,
the space saving is architecture-specific.
E.g. I expect no saving on ARM64 where both variants occupy 8 bytes.

Maybe instead of pursuing "optimizations" against the intentions of
the programmer, they should concentrate on implementing real
optimizations like optimizing either variant into the small code shown
last.

Interestingly, the first idiom is a case where gcc recognizes what the intention of the programmer is, and warns that it is going to
miscompile it. The warning is good, the miscompilation not (but it
would be worse without the warning).

You had more luck with warnings than I did.
In all my test cases both gcc and clang [in absence of -fwrapv]
silently dropped the check and dependent code.
MSVC didn't drop it, so, naturally, it also produced no warning.

In any case, while the actual experience is that I have not been hit
by "optimizations" that ATUBDNH in production code, possibly because I minimize these assumptions with flags like -fwrapv, the possibility
that my code might be hit by such an "optimization" (e.g., a new one
in a new compiler version, if I am lucky with a new flag for disabling
the assumption, but my source code does not know about it yet) and the attitude of people who implement such "optimizations" is what I
resent.

- anton

From David Brown@21:1/5 to Bernd Linsel on Fri Sep 6 13:58:15 2024

On 05/09/2024 22:05, Bernd Linsel wrote:

On 05.09.24 17:49, Anton Ertl wrote:

Nobody said that gcc did anything wrong here. We were, however,
surprised that -fno-reorder-blocks did not suppress the reordering; we
reported this as bug, but were told that this option does something
different from what it says. Anyway, we developed a workaround. And
we also developed a workaround for the code duplication problem that
showed up in gcc-7.

Have you tried interspersing `asm volatile("")` statements?

It is very often an effective means to prevent gcc from reordering code
from before and after the asm statement.

(I am quite confident that Anton has used asm volatile statements like
this.)

That only prevents movement of observable behaviour - basically volatile accesses, calls to externally defined functions, and other volatile asm statements. It does not prevent the movement of any other code.

A commonly used variant is `asm volatile("" ::: "memory")` which is a
local memory barrier, and blocks movements of loads and stores. But
that can often be costly in performance, and also does not block
movement of code that does not load or store memory.

The compiler is also free to duplicate and shuffle around these
"instructions", as long as they are "executed" as required. So it can
do the same kinds of movements as it did before, transforming freely
between:

A:
asm volatile("");
doThis();
asm volatile("");
B:
asm volatile("");
doThat();
asm volatile("");
C:

and

A:
asm volatile("");
doThis();
asm volatile("");
asm volatile("");
doThat();
asm volatile("");
goto C
B:
asm volatile("");
doThat();
asm volatile("");
C:

If you additionally specify inputs, e.g. `asm volatile("" :: "r" (foo))`,
you can force gcc to keep `foo` alive up to this point.

That is sometimes a useful form of code. I've used it in sequences like
this:

x = long_calculation();
asm volatile ("" :: "g" (x));
get_lock();
use_x(x);
release_lock();

Without that block, the compiler is free to move long_calculation()
inside the locked area (within limitations from its knowledge of
observable behaviour). In most practical cases, the get_lock() and release_lock() parts will have a memory barrier, and you don't actually
get much of long_calculation() that might be moved, but it is certainly
a possibility.

asm volatile("" : "+g" (x));

can also be useful. It not only forces "x" to be stable before the
statement is "executed", but it also tells the compiler to forget all it
knows about "x" after it.
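
Both forms can be wrapped in small helpers. This is a sketch for GCC and Clang only (it relies on their extended asm syntax); the helper names force_computed and forget_value are hypothetical, and the second helper uses the register constraint "+r" rather than "+g", which is an assumption of this example and not something taken from the posts above.

```c
/* Empty asm with x as an input: the value must have been computed by
   the time this statement is "executed". */
static inline void force_computed(long x) {
    __asm__ volatile("" :: "g"(x));
}

/* Empty asm with x as input and output: the compiler must treat x as
   possibly modified, so it cannot carry known facts about it across. */
static inline long forget_value(long x) {
    __asm__ volatile("" : "+r"(x));
    return x;
}

long barrier_demo(long a, long b) {
    long x = a * b;
    force_computed(x);            /* x is ready before anything that follows */
    return forget_value(x) + 1;   /* the compiler cannot fold this from a*b */
}
```

Neither helper emits any instructions of its own; they only constrain what the optimizer may assume and where the computation may be placed.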

From Tim Rentsch@21:1/5 to Thomas Koenig on Fri Sep 6 06:37:13 2024

Thomas Koenig <[email protected]> writes:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?

If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.

If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.

If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.

Let me be clear that I have not thought through all the details
about exactly what the rules are or how they might be put into
effect. Hopefully though my comments here give you a better
sense of the direction meant to be suggested.
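
For comparison, this is how the same intent can be expressed today without depending on what happens on overflow. It is a sketch of the well-defined alternative, not of the rules suggested above, and the function names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Well-defined replacement for the test "a > a + 1": true exactly when
   a + 1 would overflow. */
bool increment_overflows(int a) {
    return a == INT_MAX;
}

/* Adds 1 only when the result is representable; returns false otherwise
   and leaves *out untouched. */
bool checked_increment(int a, int *out) {
    if (increment_overflows(a)) return false;
    *out = a + 1;
    return true;
}
```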

From Anton Ertl@21:1/5 to Michael S on Fri Sep 6 13:26:42 2024

Michael S <[email protected]> writes:

On Fri, 06 Sep 2024 07:25:35 GMT
[email protected] (Anton Ertl) wrote:

What gcc produces for both formulations is longer than

dec %rdi
jno ...

Good trick.

Thanks. It's not from me. I published it in 2015 <https://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf>,
but unfortunately did not give a reference to where I have it from (I
read it elsewhere).

The same trick in non-destructive form would be 1 byte longer.
cmp $1, %rdi
jno ...

But I was not able to force any of compilers currently installed on my
home desktop (gcc 13.2, clang 18.1, MSVC 19.30.30706 == VS2022) to
produce such code.

The closest was MSVC that sometimes (not in all circumstances) produces
2 bytes longer version:
49 8d 49 ff lea -0x1(%r9),%rcx
4c 3b c9 cmp %rcx,%r9

Of course, it's still good deal shorter than
48 ba 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rdx
4c 3b ca cmp %rdx,%r9

Both gcc and clang [under -fwrapv] insisted on turning x<=x-1 into
x==LLONG_MIN.

However even if we were able to force compiler to produce desired code,
the space saving is architecture-specific.

With this gcc-specific code we can force it:

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}

gcc -O3 -c and gcc -Os -c (gcc-12.2) produce, on AMD64:

0: 48 83 c6 ff add $0xffffffffffffffff,%rsi
4: 70 05 jo b <bar+0xb>
6: e9 00 00 00 00 jmp b <bar+0xb>
b: e9 00 00 00 00 jmp 10 <bar+0x10>

So, even though %rsi is dead afterwards, it does not use dec, but it's certainly better than the other variants.

On ARM A64 both gcc invocations (gcc-10.2) produce:

0: f1000421 subs x1, x1, #0x1
4: 54000046 b.vs c <bar+0xc>
8: 14000000 b 0 <foo2>
c: 14000000 b 0 <foo1>

On RV64GC both gcc invocations (gcc-10.3) produce:

0000000000000000 <bar>:
0: fff58793 addi a5,a1,-1
4: 00f5c663 blt a1,a5,10 <.L6>
8: 00000317 auipc t1,0x0
c: 00030067 jr t1 # 8 <bar+0x8>

0000000000000010 <.L6>:
10: 00000317 auipc t1,0x0
14: 00030067 jr t1 # 10 <.L6>

So on RISC-V gcc manages to actually translate the if back into "if (b
< b-1)" without pessimising that (but gcc-10 does not pessimize this
code on AMD64, either).

E.g. I expect no saving on ARM64 where both variants occupy 8 bytes.

Here we have the three variants:

#include <limits.h>

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}

long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}

long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}

And here is what gcc-10 -Os -fwrapv -Wall -c produces:

ARM A64:
bar                  bar2                   bar3
subs x1, x1, #0x1    sub x2, x1, #0x1       mov x2, #0x8000000000000000
b.vs c <bar+0xc>     cmp x2, x1             cmp x1, x2
                     b.le 20 <bar2+0x10>    b.ne 34 <bar3+0x10>

RV64GC:
bar                  bar2                   bar3
addi a5,a1,-1        addi a5,a1,-1          li a5,-1
bge a1,a5,10 <.L4>   bge a1,a5,28 <.L6>     slli a5,a5,0x3f
                                            bne a1,a5,40 <.L8>
8 Bytes              8 Bytes                8 Bytes

AMD64:
bar                  bar2                   bar3
add $-1,%rsi         lea -0x1(%rsi),%rax    mov $0x1,%eax
jo b <bar+0xb>       cmp %rsi,%rax          shl $0x3f,%rax
                     jle 1e <bar2+0xe>      cmp %rax,%rsi
                                            jne 36 <bar3+0x13>
6 Bytes              9 Bytes                14 Bytes

With gcc-12 on AMD64:
bar                  bar2                   bar3
add $-1,%rsi         mov $0x1,%eax          mov $0x1,%eax
jo b <bar+0xb>       shl $0x3f,%rax         shl $0x3f,%rax
                     cmp %rax,%rsi          cmp %rax,%rsi
                     jne 23 <bar2+0x13>     jne 23 <bar2+0x13>
6 Bytes              14 Bytes               14 Bytes

(Actually in the latter case gcc recognizes that bar2 and bar3 are
equivalent and jumps from bar3 to bar2, but I am sure that without
bar2, bar3 would look the same as bar2 does now).

So when gcc does not pessimize "b < b-1" into "b == LONG_MIN", the straightforward code for the former has the same or smaller size, and
the same or smaller number of instructions on these architectures.
The "__builtin_sub_overflow(b,1,&c)" has the same or fewer bytes than
"b < b-1" and the same or fewer instructions. So, with
straightforward translations "__builtin_sub_overflow(b,1,&c)"
dominates "b < b-1", which dominates "b == LONG_MIN".
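
As a minimal sketch of the two ends of that comparison, the check for "b-1 overflows" can be written once with the builtin and once as an explicit comparison, neither of which depends on -fwrapv. The function names are hypothetical and not from the posts above, and `__builtin_sub_overflow` is a GCC/Clang extension, so this assumes one of those compilers:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: true if b - 1 overflows a long.
   Uses the GCC/Clang builtin, which is well defined for every input. */
bool dec_overflows_builtin(long b)
{
    long c;  /* receives the wrapped result; not used here */
    return __builtin_sub_overflow(b, 1L, &c);
}

/* Hypothetical helper: the same predicate in ISO C.
   b - 1 overflows exactly when b is the most negative long. */
bool dec_overflows_portable(long b)
{
    return b == LONG_MIN;
}
```

Both functions return the same answer for every input; which one yields the shorter machine code depends on the compiler version and target, as the listings above show.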

As a new feature, gcc-12 recognizes "b < b-1" and pessimizes it into
the same code as "b == LONG_MIN".

Interestingly, the first idiom is a case where gcc recognizes what the
intention of the programmer is, and warns that it is going to
miscompile it. The warning is good, the miscompilation not (but it
would be worse without the warning).

You had more luck with warnings than I did.
In all my test cases both gcc and clang [in absence of -fwrapv]
silently dropped the check and dependent code.

Interesting. I tried both "b < b-1" and "b >= b+1" and got no warning
(with gcc-10 and gcc-12), but I have seen a warning with one of those
idioms in the past. Maybe someone decided that warning about this
idiom is unnecessary, while "optimizing" it is.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to Tim Rentsch on Fri Sep 6 18:17:50 2024

On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

Thomas Koenig <[email protected]> writes:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?

If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.

If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.

If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.

It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.

Let me be clear that I have not thought through all the details
about exactly what the rules are or how they might be put into
effect. Hopefully though my comments here give you a better
sense of the direction meant to be suggested.

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Fri Sep 6 23:10:16 2024

On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

On 9/5/2024 10:04 AM, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e the largest supported unsigned type in the
system, typically the same as u64.

unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was
negative.

I.e I cannot easily replicate a downward loop that exits when the
counter becomes negative:

for (int i = START; i >= 0; i-- ) {
    // Do something with data[i]
}

for (int i = START; i > -1; i-- ) {
// Do something with data[i]
}

;^)

# define START 0x80000001

One of my alternatives is

unsigned u = start; // Cannot be less than zero
if (u) {
    u++;
    do {
      u--;
      data[u]...
    } while (u);
}

any unsigned integer cannot be less than zero?

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above or Equal), but my version is far more verbose.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)
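
A compilable sketch of the two unsigned alternatives, using a running sum as a stand-in for "do something with data[u]". The function names and the summing are illustrative assumptions. The first version assumes start < UINT_MAX, the second assumes start has its top bit clear:

```c
#include <limits.h>

/* Highest bit of unsigned; the name TOPBIT follows the post above. */
#define TOPBIT (~(UINT_MAX >> 1))

/* Hypothetical helper: sum data[start] down to data[0].
   Increment once up front, then decrement before each use. */
long sum_down_dowhile(const int *data, unsigned start)
{
    long sum = 0;
    unsigned u = start + 1u;   /* one past the first element to visit */
    if (u != 0) {              /* skipped only if start + 1 wrapped to 0 */
        do {
            u--;
            sum += data[u];
        } while (u != 0);
    }
    return sum;
}

/* Hypothetical helper: same traversal, exiting when the decrement of 0
   wraps around and sets the top bit. */
long sum_down_topbit(const int *data, unsigned start)
{
    long sum = 0;
    for (unsigned u = start; (u & TOPBIT) == 0; u--)
        sum += data[u];
    return sum;
}
```

Both visit indices start, start-1, ..., 0 in that order, including the case start == 0.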

Terje

From David Brown@21:1/5 to All on Sat Sep 7 09:15:11 2024

On 07/09/2024 01:10, MitchAlsup1 wrote:

On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

On 9/5/2024 10:04 AM, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e the largest supported unsigned type in the
system, typically the same as u64.

unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was
negative.

I.e I cannot easily replicate a downward loop that exits when the
counter becomes negative:

   for (int i = START; i >= 0; i-- ) {
     // Do something with data[i]
   }

for (int i = START; i > -1; i-- ) {
      // Do something with data[i]
}

;^)

# define START 0x80000001

No.

The great thing about 32 bit integers is that your numbers are never
anywhere close to being too big - or you /know/ you are dealing with
very big numbers and you can take that into account such as by using
64-bit integer types.

A number that is the start or end of a normal count range is /never/ 0x80000001. Write code that is clear, simple and correct for what you
are actually doing. And if you think such big numbers are realistic,
write the same clear, simple and correct code with "int64_t" instead.

From Thomas Koenig@21:1/5 to Tim Rentsch on Sat Sep 7 07:26:51 2024

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior.

That sentence makes no sense to me.

Behavior is defined by the standard, by the compiler documentation,
by other standards (such as OpenMP) or it is undefined.

"Giving less freedom" has no difference from defining.

(It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?

If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.

If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.

If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.

After thinking about this for a time, what you want looks a lot
like volatile.

Is there any requirement that you can think of that would not
be fulfilled with "volatile int a"?

Is there anything with "volatile int a" that you do not want?

If volatile is close to what you want, then this would be
straightforward to incorporate into an existing compiler such as
gcc, just add an option which declares every variable in the C
front end volatile, weed out the resulting bugs (yes, that is a
mixed metaphor) and be done.

From Tim Rentsch@21:1/5 to Thomas Koenig on Sat Sep 7 05:51:41 2024

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

David Brown <[email protected]> writes:

On 05/09/2024 11:12, Terje Mathisen wrote:

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as an
error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

We do. There is no issue using unsigned loop counters,

I find counting down from n to 0 using unsigned variables
unintuitive. Or do you always count up and then calculate
what you actually use? Induction variable optimization
should take care of that, but it would be more complicated
to use.

In most cases of counting down the upper bound is one more
than the value to be used, reflecting a half-open interval.
These ranges are analogous to pointers traversing arrays
downwards:

int stuff[20];

for( int *p = stuff+20; p > stuff; ){
p--;
.. do something with *p ..
}

For pointers it's important that the pointer not "fall off the
bottom" of the array. That needn't apply to unsigned index
variables, so the decrement can be absorbed into the test:

int stuff[20];

for( unsigned i = 20; i-- > 0; ){
.. do something with stuff[i] ..
}

If you adopt patterns similar to this one I think you will
get used to it quickly and it will start to seem quite
natural. Counting down is the mirror image of counting
up. When counting up we "point at" and increment after using.
When counting down we "point after" and decrement before using.
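
A compilable sketch of the "decrement in the test" pattern. The function name and the reverse-copy task are illustrative assumptions, chosen only to make the visiting order observable:

```c
/* Hypothetical helper: copy src[0..n-1] into dst in reverse order. */
void reverse_copy(int *dst, const int *src, unsigned n)
{
    unsigned out = 0;
    /* i starts one past the last element; the test decrements before
       the body runs, so the body sees n-1, n-2, ..., 0 and is skipped
       entirely when n == 0. */
    for (unsigned i = n; i-- > 0; )
        dst[out++] = src[i];
}
```

After the loop exits, i has wrapped around to UINT_MAX, which is harmless here because i is not used again.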

Using half-open intervals also comes up in binary search:

int stuff[N];

unsigned low = 0, limit = N;
while( low+1 != limit ){
unsigned m = low + (limit-low)/2;
.. test stuff[m] and pick one of ..
.. low = m .. (or)
.. limit = m ..
}
.. stuff[low] has the answer, if there is one ..

At each point in the search we are considering a half-open
interval. That makes writing (or reading) invariants for
the code very easy. When low+1 == limit then there is only
one element to consider.
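
One way to fill in the elided test, as a sketch: search an ascending array for a key, keeping the invariant that any match lies in the half-open interval [low, limit). The function name, the -1 "not found" convention, and the explicit empty-array check are assumptions added for the example:

```c
/* Hypothetical helper: find key in sorted[0..n-1] (ascending order).
   Returns the index of a matching element, or -1 if there is none. */
long find_sorted(const int *sorted, unsigned n, int key)
{
    if (n == 0)
        return -1;                  /* the loop needs a non-empty range */

    unsigned low = 0, limit = n;    /* candidates live in [low, limit) */
    while (low + 1 != limit) {
        unsigned m = low + (limit - low) / 2;   /* low < m < limit */
        if (sorted[m] <= key)
            low = m;                /* a match, if any, is in [m, limit) */
        else
            limit = m;              /* a match, if any, is in [low, m) */
    }
    return sorted[low] == key ? (long)low : -1;
}
```

Because m is always strictly between low and limit, the interval shrinks on every iteration and the loop terminates with exactly one candidate left.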

From Tim Rentsch@21:1/5 to [email protected] on Sat Sep 7 06:52:02 2024

[email protected] (MitchAlsup1) writes:

On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

Thomas Koenig <[email protected]> writes:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?

If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.

If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.

If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.

It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.

Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.

From Tim Rentsch@21:1/5 to Brett on Sat Sep 7 06:38:31 2024

Brett <[email protected]> writes:

I tried using unsigned for a bunch of my data types that should
never go negative, but every time I would have to compare them
with an int somewhere and that would cause a compiler warning,
because the goal was to also remove unsafe code.

What sort of ints? How many of those were constants? In which
cases were the int values negative, and which cases non-negative?
More generally, what are the circumstances that prompted you to
compare a can-never-be-negative value to a potentially-negative
value? Are most of the comparisons relational, or are there
lots of equality/inequality?

There are easy ways to compare (without getting warnings) signed
values and unsigned values, but how a particular case should be
addressed depends on the details. Can you supply more information?
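
For the general case, where the int may be negative, one possible sketch compares by mathematical value and avoids the implicit conversion that compilers warn about. The helper names are hypothetical, and whether this form fits depends on the details asked about above:

```c
#include <stdbool.h>

/* Hypothetical helper: true if i is mathematically less than u. */
bool int_less_than_unsigned(int i, unsigned u)
{
    /* A negative int is below every unsigned value; otherwise the
       explicit cast preserves the value and the comparison is exact. */
    return i < 0 || (unsigned)i < u;
}

/* Hypothetical helper: true if i and u hold the same mathematical value. */
bool int_equals_unsigned(int i, unsigned u)
{
    return i >= 0 && (unsigned)i == u;
}
```

With a plain `i < u`, an int of -1 would be converted to a large unsigned value first; the explicit sign test removes that surprise.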

From MitchAlsup1@21:1/5 to Tim Rentsch on Sat Sep 7 14:30:30 2024

On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

Thomas Koenig <[email protected]> writes:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?

If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.

If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.

If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.

It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.

Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.

With it "being difficult" to determine when an integer overflow
has occurred in many architectures, it is unlikely that integer
overflow could ever be put into the C standard.

From Tim Rentsch@21:1/5 to Bernd Linsel on Sat Sep 7 09:33:22 2024

Bernd Linsel <[email protected]> writes:

On 06.09.24 00:04, Tim Rentsch wrote:

If START is signed (presumably of type int), so the loop might run
zero times, but never more than INT_MAX times, then

for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
// Do something with data[u]
}

If START is unsigned, so in all cases the loop must run at
least once, then

unsigned u = START;
do {
// Do something with data[u]
} while( u > 0 && u-- );

(Yes I know the 'u > 0' expressions can be replaced by just 'u'.)

The optimizer should be smart enough to realize that if 'u > 0'
is true then the test 'u--' will also be true. The same should
hold if 'u > 0' is replaced by just 'u'.

(Disclaimer: code not compiled.)

Both yield not very elegant code:

https://godbolt.org/z/M4Y5PYP3v

The problem being solved is not typical. In most cases
downward-counting loops start at one past the end of the
values, not at the last value. I didn't choose the problem.

Any "inelegancy" might just as well come from how the
optimizer was written as from the code. Clearly optimizers
do better on some patterns than others. (For that matter,
the earlier code shown may have resulted in generated code
that is just as unappealing.)

The generated code being not very elegant doesn't necessarily
imply poor performance.

In almost all cases the performance implications don't matter.
Premature optimization is the root of all evil. The first
reaction should never be to look at what code is generated.

The purpose of the example (besides fixing a bug in the original,
which was removed) is, one, to illustrate an idea, and two, to
show an alternate example pattern that may help in unrelated
cases. It helps to be familiar with different approaches to
common situations. For this particular problem, probably it's
better to revise code outside the loop so the loop would be
done differently. The point here is not this code specifically
but a pattern and a principle that might be applicable in a
range of coding circumstances.

From Brett@21:1/5 to [email protected] on Sat Sep 7 16:59:39 2024

MitchAlsup1 <[email protected]> wrote:

On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

Thomas Koenig <[email protected]> writes:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?

If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.

If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.

If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.

It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.

Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.

With it "being difficult" to determine when an integer overflow
has occurred in many architectures, it is unlikely that integer
overflow could ever be put into the C standard.

Swift traps on all overflows:

https://docs.swift.org/swift-book/documentation/the-swift-programming-language/advancedoperators/#

Such branches are predicted perfectly so they only cost some code density.
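
A minimal C sketch of an addition that traps on overflow, in the spirit of the Swift behavior mentioned above. The function name is hypothetical, and `__builtin_add_overflow` is a GCC/Clang extension, so this assumes one of those compilers:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: add two ints, aborting the program on overflow. */
int add_or_trap(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum)) {
        /* Taken only on overflow; the normal path pays for one
           conditional branch after the addition. */
        fputs("integer overflow in add_or_trap\n", stderr);
        abort();
    }
    return sum;
}
```

GCC and Clang also offer the -ftrapv option, which asks the compiler to trap on signed overflow in ordinary arithmetic without source changes.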

From Michael S@21:1/5 to Anton Ertl on Sat Sep 7 22:17:36 2024

On Fri, 06 Sep 2024 13:26:42 GMT
[email protected] (Anton Ertl) wrote:

ARM A64:
mov x2, #0x8000000000000000
cmp x1, x2
b.le 20 <bar2+0x10>

I am hardly an expert in aarch64 code generation, but IMHO gcc is
missing the shortest code:
eor x1, x1, #0x8000000000000000
b.eq 20 <bar2+0x10>

From Tim Rentsch@21:1/5 to Anton Ertl on Sat Sep 7 16:45:45 2024

[email protected] (Anton Ertl) writes:

Stefan Monnier <[email protected]> writes:

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

For programs there is no conformance level "free from UB" in the C
standard.

The C standard doesn't define any conformance "levels": it defines
the term "strictly conforming program", for its own convenience in
defining the language; it also defines the term "conforming
program", for no apparent purpose at all. In both cases however
what is given are simply definitions; there is no reason an
interested party couldn't give a definition of some other term, for
the purpose of identifying a class of C programs that have some
particular property -- such as being free from undefined behavior --
where membership in the class is completely determined by statements
in the C standard, being used as a reference document.

There are two conformance levels for programs:

1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]

I don't know why you say this. Which aspects of the definition for
"strictly conforming program" do you think are violated by a typical
'Hello, World' program? I'm confident the people who wrote the C
standard would say such a program is strictly conforming.


From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Sep 8 00:12:40 2024

On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

[email protected] (Anton Ertl) writes:

Stefan Monnier <[email protected]> writes:

Specifications are an agreement between the supplier and the client.
The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

For programs there is no conformance level "free from UB" in the C
standard.

The C standard doesn't define any conformance "levels": it defines
the term "strictly conforming program", for its own convenience in
defining the language; it also defines the term "conforming
program", for no apparent purpose at all. In both cases however
what is given are simply definitions; there is no reason an
interested party couldn't give a definition of some other term, for
the purpose of identifying a class of C programs that have some
particular property -- such as being free from undefined behavior --
where membership in the class is completely determined by statements
in the C standard, being used as a reference document.

There are two conformance levels for programs:

1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]

I don't know why you say this. Which aspects of the definition for
"strictly conforming program" do you think are violated by a typical
'Hello, World' program? I'm confident the people who wrote the C
standard would say such a program is strictly conforming.

The standard "Hello World !" program does not return a value to
<effectively> crt0.

Secondarily while one is supposed to return 0 for success and
something else for failure, there is no standard C defined way
that this is related back to the invoker of the program.

Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.


From MitchAlsup1@21:1/5 to David Brown on Sun Sep 8 00:17:25 2024

On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

On 07/09/2024 01:10, MitchAlsup1 wrote:

On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

On 9/5/2024 10:04 AM, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e the largest supported unsigned type in the
system, typically the same as u64.

unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was
negative.

I.e I cannot easily replicate a downward loop that exits when the
counter becomes negative:

   for (int i = START; i >= 0; i-- ) {
     // Do something with data[i]
   }

for (int i = START; i > -1; i-- ) {
      // Do something with data[i]
}

;^)

# define START 0x80000001

No.

The great thing about 32 bit integers is that your numbers are never
anywhere close to being too big - or you /know/ you are dealing with
very big numbers and you can take that into account such as by using
64-bit integer types.

A number that is the start or end of a normal count range is /never/
0x80000001. Write code that is clear, simple and correct for what you
are actually doing. And if you think such big numbers are realistic,
write the same clear, simple and correct code with "int64_t" instead.

static uint64_t array[1024*1024*512+1]
static int SIZE = sizeof(array)/sizeof(uint65_t);


From MitchAlsup1@21:1/5 to Anton Ertl on Sun Sep 8 00:23:38 2024

And just for fun::

On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:

Here we have the three variants:

#include <limits.h>

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
    long c;
    if (__builtin_sub_overflow(b,1,&c))
        return foo1(a);
    else
        return foo2(a);
}

long bar2(long a, long b)
{
    if (b < b-1)
        return foo1(a);
    else
        return foo2(a);
}

long bar3(long a, long b)
{
    if (b == LONG_MIN)
        return foo1(a);
    else
        return foo2(a);
}

My 66000:
    bar                bar2               bar3
    add r3,R1,#-1      add r3,r1,#-1      bepm r1,.L4
    bge R3,.L4         bge r3,.L4
    8-bytes            8-bytes            4-bytes

I have a direct test for POSMAX in ISA that does not use a constant.


From Tim Rentsch@21:1/5 to Thomas Koenig on Sat Sep 7 18:46:20 2024

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Thomas Koenig <[email protected]> schrieb:

"Don't do this" or "don't do that" is not sufficient. Maybe you,
together with like-minded people, could try formulating some rules
as an extension to the C standard, and see where it gets you.
Maybe you can get it published as an annex.

Hm... putting some thought into it, it may be a good first step
to define cases for which a diagnostic is required; maybe
"observable error" would be a reasonable term.

So, put "dereferencing a NULL pointer shall be an observable
error" would make sure that no null pointer checks are thrown
away, and that this requires a run-time diagnostic.

If that is the case, should dereferencing a member of a struct
pointed to by a null pointer also be an observable error, and
be required to be caught at run-time?

Or is this completely the wrong track, and you would like to do
something entirely different? Any annex to the C standard would
still be constrained to the abstract machine (probably).

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior.

That sentence makes no sense to me.

Behavior is defined by the standard, by the compiler documentation,
by other standards (such as OpenMP) or it is undefined.

"Giving less freedom" has no difference from defining.

I use the term "undefined behavior" in the same sense that the C
standard does. For example, if a particular C implementation
supports the POSIX extensions to printf(), including documenting
them, using those extensions still falls under the heading of
undefined behavior, support and documentation notwithstanding.

The idea is to define a new classification, perhaps "limited
undefined behavior", that gives more freedom than "unspecified
behavior" but not nearly as much as "undefined behavior" does
now.

(It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here?

If a < INT_MAX, the behavior is the same as replacing the if()
test with 'if(0)'. If the compiler can accurately deduce that
the condition 'a < INT_MAX' will hold in all cases then the if()
can be optimized away accordingly.

If a == INT_MAX, one possibility is that code is generated to
evaluate the addition and the comparison, and the if-block is
either evaluated or it isn't, depending on the outcome of the
comparison. Important: the compiler is disallowed from drawing
any inferences based on "knowing" the result of either the
addition or the comparison; code must be generated under a "best
efforts" umbrella, and whatever the code does dictates whether
the if-block is evaluated or not, with the compiler being
forbidden to draw any conclusions based on what the result will
be.

If a == INT_MAX, it also should be possible for the addition to
abort the program. Here again the compiler is disallowed from
drawing any inferences based on knowing this will happen. To
make this work the rule allowing "UB to travel backwards in time"
must be revoked; unless a compiler can accurately deduce that a
given piece of code cannot transgress into UB then other code in
the program must not be moved (either forwards or backwards) past
the possibly-not-well-defined code segment.

After thinking about this for a time, what you want looks a lot
like volatile.

That's a good insight. Certainly there are aspects of what I
have proposed that are similar to how volatile works.

Is there any requirement that you can think of that would not
be fulfilled with "volatile int a"?

Is there anything with "volatile int a" that you do not want?

Something being volatile has consequences only in reference to
objects, and only when a memory access (either read or write) is
requested. There are no such things as volatile values. What
we're looking for here is constraints on operations, not on
memory accesses. In a sense one might say what we want is
"volatile operators": similar in concept to how volatile works,
but in a different area of language semantics.

Also there are aspects of how 'volatile' is defined now that are too
lax for what I think "volatile operators" need to do. However
that is a fine point, I mention it only for completeness.

If volatile is close to what you want, then this would be
straightforward to incorporate into an existing compiler such as
gcc, just add an option which declares every variable in the C
front end volatile, weed out the resulting bugs (yes, that is a
mixed metaphor) and be done.

Like I said, it isn't the variables, it's the operators. Maybe
though you have a good idea there, looking at how volatile is
handled in gcc or clang might give some useful ideas about how to
implement volatile operators.


From Tim Rentsch@21:1/5 to [email protected] on Sat Sep 7 19:47:38 2024

[email protected] (MitchAlsup1) writes:

On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

[email protected] (Anton Ertl) writes:

Stefan Monnier <[email protected]> writes:

Specifications are an agreement between the supplier and the client.
The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

For programs there is no conformance level "free from UB" in the C
standard.

The C standard doesn't define any conformance "levels": it defines
the term "strictly conforming program", for its own convenience in
defining the language; it also defines the term "conforming
program", for no apparent purpose at all. In both cases however
what is given are simply definitions; there is no reason an
interested party couldn't give a definition of some other term, for
the purpose of identifying a class of C programs that have some
particular property -- such as being free from undefined behavior --
where membership in the class is completely determined by statements
in the C standard, being used as a reference document.

There are two conformance levels for programs:

1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]

I don't know why you say this. Which aspects of the definition for
"strictly conforming program" do you think are violated by a typical
'Hello, World' program? I'm confident the people who wrote the C
standard would say such a program is strictly conforming.

The standard "Hello World !" program does not return a value to
<effectively> crt0.

That has no effect on whether the program is strictly conforming.

Secondarily while one is supposed to return 0 for success and
something else for failure, there is no standard C defined way
that this is related back to the invoker of the program.

That has no effect on whether the program is strictly conforming.

Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.

The usual "Hello, World" program defines main() either with no
arguments

int
main(){
...
}

or with two arguments

int
main( int argc, char *argv[] ){
...
}

and in both cases main() has defined behavior and does not
violate the strictures of strictly conforming programs.

If the surrounding OS or whatever cannot support these, that
doesn't change whether the program is strictly conforming. The
condition of being strictly conforming is a predicate on
programs, not on implementations.


From Tim Rentsch@21:1/5 to [email protected] on Sat Sep 7 19:32:43 2024

[email protected] (MitchAlsup1) writes:

On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

[...]

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here? [...] If a == INT_MAX, it also should be
possible for the addition to abort the program. [...]

It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.

Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.

With it "being difficult" to determine when an integer overflow
has occurred in many architectures, it is unlikely that integer
overflow could ever be put into the C standard.

It could easily be added to the C standard just by making the
signal-raise option be conditional: give each implementation
the choice of either (a) stipulating that overflow causes an
implementation-defined signal to be raised, or (b) letting the
operation be limited undefined behavior. Limited undefined
behavior can be provided simply by naively compiling the code
in question, so that can be accommodated regardless of how
unsophisticated the processor is.
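
As a rough illustration of option (a), here is a minimal sketch of an
addition that raises a signal on overflow. Everything in it is an
assumption made for the example: SIGFPE stands in for the
"implementation-defined signal", __builtin_add_overflow is a GCC/Clang
extension, and add_or_raise, on_overflow and overflow_seen are
hypothetical names.

```c
#include <limits.h>
#include <signal.h>

/* Set by the handler so the caller can see that overflow was reported. */
static volatile sig_atomic_t overflow_seen = 0;

static void on_overflow(int signo)
{
    (void)signo;
    overflow_seen = 1;
}

/* Option (a): overflow raises a signal. SIGFPE is only a stand-in here.
 * Example: add_or_raise(INT_MAX, 1) raises the signal. */
static int add_or_raise(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))  /* GCC/Clang extension */
        raise(SIGFPE);
    return sum;  /* the wrapped result, if the handler returned */
}
```

A caller would install the handler with signal(SIGFPE, on_overflow)
before doing arithmetic. Option (b) corresponds to leaving the check
out and compiling the plain a + b.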


From Tim Rentsch@21:1/5 to Anton Ertl on Sat Sep 7 21:17:02 2024

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.
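
One way such a check might be written is sketched below. This is an
assumption about the technique, not necessarily the check the poster
has in mind, and regions_overlap is a hypothetical name. It uses only
== on pointers, which is defined even for pointers into unrelated
objects, whereas < and > are not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if [dst, dst+n) and [src, src+n) share at least one byte.
 * Both pointers must refer to at least n accessible bytes, as memcpy()
 * itself requires. Only equality comparisons are used, at the price of
 * O(n) comparisons. */
static bool regions_overlap(const void *dst, const void *src, size_t n)
{
    const unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++) {
        /* If the regions overlap, one start lies inside the other region. */
        if (d + i == s || s + i == d)
            return true;
    }
    return false;
}
```

The linear cost makes this more suitable for debug builds or assertions
than for production paths.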


From David Brown@21:1/5 to All on Sun Sep 8 08:25:10 2024

On 08/09/2024 02:17, MitchAlsup1 wrote:

On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

On 07/09/2024 01:10, MitchAlsup1 wrote:

On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

On 9/5/2024 10:04 AM, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want two's
complement wrapping, and for the even fewer occasions when you
actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1) as
an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
with unsigned i, arrays always use a zero base so in Rust the only array
index type is usize, i.e the largest supported unsigned type in the
system, typically the same as u64.

unsigned arithmetic is easier than signed integer arithmetic, including
comparisons that would result in a negative value, you just have to make
the test before subtracting, instead of checking if the result was
negative.

I.e I cannot easily replicate a downward loop that exits when the
counter becomes negative:

   for (int i = START; i >= 0; i-- ) {
     // Do something with data[i]
   }

for (int i = START; i > -1; i-- ) {
      // Do something with data[i]
}

;^)

# define START 0x80000001

No.

The great thing about 32 bit integers is that your numbers are never
anywhere close to being too big - or you /know/ you are dealing with
very big numbers and you can take that into account such as by using
64-bit integer types.

A number that is the start or end of a normal count range is /never/
0x80000001. Write code that is clear, simple and correct for what you
are actually doing. And if you think such big numbers are realistic,
write the same clear, simple and correct code with "int64_t" instead.

static uint64_t array[1024*1024*512+1]
static int      SIZE = sizeof(array)/sizeof(uint65_t);

Surely you mean :

static const size_t array_size = sizeof(array) / sizeof(uint64_t);

?

Look, if you want to write such strange code, I certainly can't stop
you. But I can tell you that /I/ think it's very poor style, and that
/I/ would reject it in a code review.
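
For illustration, a minimal sketch of the size_t-based element count
together with a downward loop over an unsigned index, as discussed in
the quoted text. ARRAY_LEN and sum_downward are hypothetical names
introduced only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Element count derived from the array itself, so it follows the type. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Sum the elements from the last index down to 0 using an unsigned index.
 * The test "i-- > 0" compares before decrementing, so the body sees
 * n-1, ..., 1, 0 and the index never has to represent a negative value. */
static uint64_t sum_downward(const uint64_t *data, size_t n)
{
    uint64_t total = 0;
    for (size_t i = n; i-- > 0; )
        total += data[i];
    return total;
}
```

The same loop shape works for any element count that fits in size_t,
including counts above INT_MAX.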


From David Brown@21:1/5 to Tim Rentsch on Sun Sep 8 09:20:53 2024

On 08/09/2024 04:32, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

[...]

The idea is not to make more of the language defined but to give
less freedom to cases of undefined behavior. (It might make
sense to define certain cases that are undefined behavior now but
that is a separate discussion.) Let me take an example from
another of your postings:

int a;

...

if (a > a + 1) {
...
}

Stipulating that 'a' has a well-defined int value, what behaviors
are allowable here? [...] If a == INT_MAX, it also should be
possible for the addition to abort the program. [...]

It is also possible if a == INT_MAX that the exception will
transfer control to a signal handler to do some SW orchestrated
recovery.

Philosophically this reaction doesn't fit with the others. Assuming
for the sake of discussion that raising an implementation-defined
signal is an important behavior to support, it should go into the
C standard in a different way than making it part of the "limited
undefined behavior" idea outlined above.

With it "being difficult" to determine when an integer overflow
has occurred in many architectures, it is unlikely that integer
overflow could ever be put into the C standard.

The ckd_add, ckd_sub and ckd_mul functions from C23 make it easy to
check for integer overflow in C23. And of course C has had guaranteed
64-bit support since C99 - it's very rare to overflow these.
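
A minimal sketch of how ckd_add might be used. It assumes a toolchain
that ships <stdckdint.h> and otherwise falls back to the GCC/Clang
builtin; add_checked is a hypothetical wrapper name.

```c
#include <limits.h>
#include <stdbool.h>

#if defined(__has_include)
#  if __has_include(<stdckdint.h>)
#    include <stdckdint.h>   /* C23: ckd_add, ckd_sub, ckd_mul */
#  endif
#endif

/* Returns true and stores a + b in *out when the sum is representable
 * in int; returns false on overflow. */
static bool add_checked(int a, int b, int *out)
{
#ifdef ckd_add
    return !ckd_add(out, a, b);                /* ckd_add yields true on overflow */
#else
    return !__builtin_add_overflow(a, b, out); /* GCC/Clang fallback */
#endif
}
```

For example, add_checked(INT_MAX, 1, &r) reports failure instead of
invoking undefined behaviour.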

It could easily be added to the C standard just by making the
signal-raise option be conditional: give each implementation
the choice of either (a) stipulating that overflow causes an
implementation-defined signal to be raised, or (b) letting the
operation be limited undefined behavior. Limited undefined
behavior can be provided simply by naively compiling the code
in question, so that can be accommodated regardless of how
unsophisticated the processor is.

The C standard doesn't have anything where implementations have an
option between a particular behaviour or undefined behaviour - because
that would simply be the same as undefined behaviour. It sometimes has
footnotes with suggestions of possible results, and it could add such a
footnote for signed integer arithmetic overflow treatment. But it would
not have any greater blessing from the standard than wrapping,
saturating, or assuming it is impossible.


From Thomas Koenig@21:1/5 to Tim Rentsch on Sun Sep 8 08:26:25 2024

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

After thinking about this for a time, what you want looks a lot
like volatile.

That's a good insight. Certainly there are aspects of what I
have proposed that are similar to how volatile works.

The way I understand you is the following: You want the
compiler to be forbidden to remove codepaths on the assumption
that undefined behavior cannot happen, and you want a
"best effort" in that case, which includes throwing an error
or just ignoring everything and proceeding.

The observable behavior includes (n2596)

"Volatile accesses to objects are evaluated strictly according to
the rules of the abstract machine."

So, assuming that variables are objects (if there's a definition
of an object in n2596, I missed it) the compiler cannot remove
accessing a in

volatile int a;

if (a > a + 1)

so it cannot remove any code path leading to the if statement, which
is what you want. An interesting point is what "volatile access"
actually means, especially for automatic variables; it seems that
all compilers treat this as a memory access (which makes limited
sense in my opinion - is there an explanation for this?)

Is there any requirement that you can think of that would not
be fulfilled with "volatile int a"?

Is there anything with "volatile int a" that you do not want?

Something being volatile has consequences only in reference to
objects, and only when a memory access (either read or write) is
requested. There are no such things as volatile values. What
we're looking for here is constraints on operations, not on
memory accesses. In a sense one might say what we want is
"volatile operators": similar in concept to how volatile works,
but in a different area of language semantics.

Hmm.. OK. The nice thing about SSA is that it transforms
complicated expressions like "a + b + c" into

tmp1 = a + b
tmp2 = tmp1 + c

so it would be possible to write a pass which would declare those
variables as volatile that you want (not needed for unsigned, for
example).

Alternatively, you could write a pass which translates

int a, b;

tmp1 = a + b;

into

tmp1 = (int) ((unsigned) a + (unsigned) b)

or just use -fwrapv in the first place.

So, SSA offers you the possibility of working on operators, like
you want to.
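
The cast-based translation can be sketched in plain C as follows.
wrapping_add and increment_would_overflow are hypothetical names, and
the conversion back to int relies on the modulo-2^N behaviour that GCC
and Clang document; ISO C leaves that conversion
implementation-defined.

```c
#include <limits.h>

/* Wrapping signed addition carried out in unsigned arithmetic, where
 * overflow is defined to wrap. Converting an out-of-range value back to
 * int is implementation-defined in ISO C; this sketch assumes the
 * modulo-2^N result that GCC and Clang document. */
static int wrapping_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}

/* Well-defined replacement for the test "a > a + 1": true exactly when
 * a + 1 would not fit in int. */
static int increment_would_overflow(int a)
{
    return a == INT_MAX;
}
```

With this form, wrapping_add(a, 1) < a holds only for a == INT_MAX, so
the two functions agree on when the increment overflows.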


From Anton Ertl@21:1/5 to Tim Rentsch on Sun Sep 8 15:12:03 2024

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

Stefan Monnier <[email protected]> writes:

Specifications are an agreement between the supplier and the client. The

The problem here is that the C standard, seen as a contract, is unfair
to the programmer, because it's so excruciatingly hard to write code
that is guaranteed to be free from UB.

For programs there is no conformance level "free from UB" in the C
standard.

The C standard doesn't define any conformance "levels": it defines
the term "strictly conforming program", for its own convenience in
defining the language; it also defines the term "conforming
program", for no apparent purpose at all.

It defines both terms in the section on "Conformance", so I take it
that both are there for defining the conformance of programs; you may
not consider them to be levels, but given that all "strictly
conforming programs" are also "conforming programs", it has the
feeling of conformance levels to me.

In both cases however
what is given are simply definitions; there is no reason an
interested party couldn't give a definition of some other term, for
the purpose of identifying a class of C programs that have some
particular property -- such as being free from undefined behavior --
where membership in the class is completely determined by statements
in the C standard, being used as a reference document.

Sure, but the C standard does not give such a definition, so the
"interested party" would cherry-pick from the C standard.

There are two conformance levels for programs:

1) A strictly conforming program shall use only those features of the
language and library specified in this International Standard.
This excludes all programs that terminate, including the "Hello,
World" program. [...]

I don't know why you say this. Which aspects of the definition for
"strictly conforming program" do you think are violated by a typical
'Hello, World' program?

A typical "Hello, World" program terminates, and as mentioned, no
terminating program can be strictly conforming, because it exercises
at least implementation-defined behaviour (e.g., look at section
7.22.4.4 of C11).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Tim Rentsch on Sun Sep 8 15:36:39 2024

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

2) Even if there is such a check, you have to be aware that there is a
potential problem with memcpy(). In that case the way to go is to
just use memmove(). But that does not help you with the next "clever"
idea that some compiler or library maintainer has.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to [email protected] on Sun Sep 8 15:32:02 2024

[email protected] (MitchAlsup1) writes:

And just for fun::

On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:

Here we have the three variants:

#include <limits.h>

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
    long c;
    if (__builtin_sub_overflow(b,1,&c))
        return foo1(a);
    else
        return foo2(a);
}

long bar2(long a, long b)
{
    if (b < b-1)
        return foo1(a);
    else
        return foo2(a);
}

long bar3(long a, long b)
{
    if (b == LONG_MIN)
        return foo1(a);
    else
        return foo2(a);
}

My 66000:
    bar                bar2               bar3
    add r3,R1,#-1      add r3,r1,#-1      bepm r1,.L4
    bge R3,.L4         bge r3,.L4
    8-bytes            8-bytes            4-bytes

I have a direct test for POSMAX in ISA that does not use a constant.

How does bge work in the first and second column? My impression was
that you are using an 88k-style flags-in-GPR architecture.

Concerning the last column, the gcc developer who added the
transformation of bar2() into bar3() apparently had My66000 in mind.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Tim Rentsch@21:1/5 to Niklas Holsti on Sun Sep 8 09:19:13 2024

Niklas Holsti <[email protected]d> writes:

On 2024-09-03 11:10, David Brown wrote:

[snip]

(There are a few situations where UB in C could be diagnosed at
compile-time, which are probably historical decisions to avoid
imposing too much work on early compilers. Where possible, UB that
can be caught at compile time, could usefully be turned into
constraint violations that must be diagnosed.)

A thoughtless, knee-jerk reaction, ending in a wrongheaded
conclusion.

The problem, as you of course know, is that the "can" in "can be
caught at compile time" depends on the amount and kind of analysis
that is done at compile time -- some cases of UB "can" be caught at
compile time but only by advanced and costly analysis. If the language
standard requires that such things /must/ be detected by the compiler,
it can place quite a burden on the developers of conforming compilers.

That is one problem.

As I understand it, current C compilers detect UB mostly as a side
effect of the analyses they do for code optimization purposes, which
vary widely between compilers, and so the UB-detections also vary.

There are different kinds of undefined behavior; some are easy
to detect, others require more extensive analysis. In the second
category the analysis usually is approximate rather than exact;
false positive cases need to be weighed against false negative
cases, looking for the right balance, and very often it happens
that neither of those is zero. Obviously any requirement that a
mandatory diagnostic be issued should have no false positives,
which often means doing a different analysis. More work.

Another problem is that just the act of specifying the condition under
which a diagnostic is required means a lot of work and a non-trivial
amount of additional text needed in the C standard. If someone is
interested to investigate this a good place to start is the Java
standard, where there are specific rules for deciding if variables are
all initialized before any use. Alternatively look in the C standard
at the formal definition of 'restrict'. Besides being hard to write,
both of these are quite difficult to read and understand. Even more
of those? No thanks.

Let me add, it is not always a good idea to require a diagnostic in
cases even when it is 100% certain that there is undefined behavior.
Unfortunately it seems there are a fair number of people who don't
get this.

This issue (compile-time detection) has now and then been discussed in
the Ada standards group. Given the currently low market penetration of
Ada, the group has been reluctant to require too much of the
compilers, and so the more advanced UB-detecting tools are
stand-alone, such as the SPARK tools.

I'm all in favor of static analysis. And I don't mind if compilers do
it (selectively), instead of or in addition to stand-alone tools. But
there is a huge chasm between saying compilers /can/ do it and saying
compilers /must/ do it. Crossing that chasm is a bridge too far.


From MitchAlsup1@21:1/5 to Anton Ertl on Sun Sep 8 17:45:03 2024

On Sun, 8 Sep 2024 15:32:02 +0000, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

And just for fun::

On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:

Here we have the three variants:

#include <limits.h>

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}

long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}

long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}

My 66000:
add   r3,R1,#-1       add   r3,r1,#-1       bepm  r1,.L4
bge   R3,.L4          bge   r3,.L4
8-bytes               8-bytes               4-bytes

I have a direct test for POSMAX in ISA that does not use a constant.

How does bge work in the first and second column? My impression was
that you are using an 88k-style flags-in-GPR architecture.

I just copied the RISC-V code

Concerning the last column, the gcc developer who added the
transformation of bar2() into bar3() apparently had My66000 in mind.

My branch on comparison to zero (BC) instruction has 32 variants
with only ~20 being normal uses. This gave room for signed and
unsigned int-MAX and int-MIN.

BTW I had the comparisons to int-MAX/MIN in since about 2016.

- anton


From MitchAlsup1@21:1/5 to David Brown on Sun Sep 8 18:32:10 2024

On Sun, 8 Sep 2024 6:25:10 +0000, David Brown wrote:

On 08/09/2024 02:17, MitchAlsup1 wrote:

On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

static uint64_t array[1024*1024*512+1];
static int SIZE = sizeof(array)/sizeof(uint64_t);

Surely you mean :

static const size_t array_size = sizeof(array) / sizeof(uint64_t);

I wanted SIZE to have the same type as i.


From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Sep 8 18:34:54 2024

On Sun, 8 Sep 2024 2:47:38 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.

The usual "Hello, World" program defines main() either with no
arguments

int
main(){
...
}

or with two arguments

int
main( int argc, char *argv[] ){
...
}

and in both cases main() has defined behavior and does not
violate the strictures of strictly conforming programs.

The Linux environment (crt0) calls main with 3 arguments.

Are you arguing that a program can be strictly conforming and
not be type safe at its call/return interfaces ??

If the surrounding OS or whatever cannot support these, that
doesn't change whether the program is strictly conforming. The
condition of being strictly conforming is a predicate on
programs, not on implementations.


From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Sep 8 11:18:46 2024

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

After thinking about this for a time, what you want looks a lot
like volatile.

That's a good insight. Certainly there are aspects of what I
have proposed that are similar to how volatile works.

The way I understand you is the following: You want the
compiler to be forbidden to remove codepaths on the assumption
that undefined behavior cannot happen, and you want a
"best effort" in that case, which includes throwing an error
or just ignoring everything and proceeding.

The key point is not about removing (or forcing) code paths, but
about what inferences may be drawn. Consider this example:

int a = .. something ..;
if( a > a+1 ){ .. stuff not involving a .. }
if( a != INT_MAX ){ ... }

Relying on the premise that "undefined behavior doesn't happen",
a compiler might discard the dependent block of the first if().
But the compiler might also always execute the dependent block
of the second if(), because if a == INT_MAX then the first test
would have been undefined behavior, which violates our premise.
It is just as wrong to skip the test in the second if() as it
is to remove the controlled block in the first if().

Consider a related example:

int a = .. something ..;
if( a < a+1 ){ .. stuff not involving a .. }
if( a != INT_MAX ){ .. other stuff not involving a .. }

Again operating under the premise that the program has no
undefined behavior, both controlled blocks can be executed
unconditionally, because the assumption of there being no
undefined behavior leads to a bad inference for the second
if() test. Notice by the way that the same bad inference
can be drawn if the order of the if() statements is reversed,
because of the rule that undefined behavior "can travel
backwards in time".
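
As an aside on the two examples above: a programmer who wants the overflow question answered without ever evaluating a+1 can compare against INT_MAX directly. A minimal sketch; the helper names inc_would_overflow and checked_inc are made up for illustration and appear nowhere in the thread:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: true exactly when a + 1 is not representable in int.
   No signed addition is evaluated, so there is no undefined behavior for
   a compiler to draw inferences from. */
static bool inc_would_overflow(int a)
{
    return a == INT_MAX;
}

/* Hypothetical helper: an increment that states its precondition explicitly
   (checked only when assertions are enabled). */
static int checked_inc(int a)
{
    assert(!inc_would_overflow(a));
    return a + 1;
}
```

With this shape, a later test such as a != INT_MAX stays meaningful, because nothing earlier in the code has undefined behavior when a == INT_MAX.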

The observable behavior includes (n2596)

"Volatile accesses to objects are evaluated strictly according to
the rules of the abstract machine."

So, assuming that variables are objects (if there's a definition
of an object in n2596, I missed it) the compiler cannot remove
accessing a in

volatile int a;

if (a > a + 1)

so it cannot remove any code path leading to the if statement, which
is what you want. An interesting point is what "volatile access"
actually means, especially for automatic variables; it seems that
all compilers treat this as a memory access (which makes limited
sense in my opinion - is there an explanation for this?)

The original motivation for volatile is to ensure an actual memory
access occurs, in cases where what is happening is outside what
the C implementation knows about. Examples are reading or writing
by another process (perhaps not written in C) or a memory-mapped
I/O port. It may be unlikely that a function-local variable would
fall into such a category, but volatile is there in case someone
thinks it does.

Is there any requirement that you can think of that would not
be fulfilled with "volatile int a"?

Is there anything with "volatile int a" that you do not want?

Something being volatile has consequences only in reference to
objects, and only when a memory access (either read or write) is
requested. There are no such things as volatile values. What
we're looking for here is constraints on operations, not on
memory accesses. In a sense one might say what we want is
"volatile operators": similar in concept to how volatile works,
but in a different area of language semantics.

Hmm.. OK. The nice thing about SSA is that it transforms
complicated expressions like "a + b + c" into

tmp1 = a + b
tmp2 = tmp1 + c

so it would be possible to write a pass which would declare those
variables as volatile that you want (not needed for unsigned, for
example).

Alternatively, you could write a pass which translates

int a, b;

tmp1 = a + b;

into

tmp1 = (int) ((unsigned) a + (unsigned) b)

or just use -fwrapv in the first place.

So, SSA offers you the possibility of working on operators, like
you want to.
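
The unsigned-cast rewrite described above can be tried by hand. A minimal sketch, with wrapping_add as a hypothetical name; note that the conversion back to int is implementation-defined rather than undefined, and gcc documents it as reduction modulo 2^N:

```c
#include <assert.h>
#include <limits.h>

/* The sketch assumes a two's complement int. */
static_assert(INT_MIN == -INT_MAX - 1, "two's complement int assumed");

/* Hypothetical helper: wrapping addition without signed overflow.
   The addition happens in unsigned arithmetic, which is fully defined;
   only the final conversion to int is implementation-defined. */
static int wrapping_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}
```

This only changes how one operation is computed; it says nothing about how the abstract semantics of the language should be defined.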

We're talking about different things. What you are talking about is
(perhaps only partially) an implementation strategy. What I am
talking about is how to define the abstract semantics. Exactly what
the rules are has to come first; after the rules are known then we
can think about how they might be implemented.

In terms of defining the abstract semantics, volatile doesn't do the
job. There are several reasons for this, but the most important is
that undefined behavior takes precedence over volatile. If we have
a program

volatile int *p;
...
*p = 0;
... much further down ...
if( 1/0 ) ...

the assignment to *p doesn't have to have happened, regardless of
the volatile status of *p. There needs to be a meaning defined
for some more constrained form of undefined behavior, which I have
called "limited undefined behavior" in other postings, and a change
to the semantics of some constructs from "undefined behavior" to
"limited undefined behavior" (or some other suitable term), to get
the results desired.

I hope you can see what I'm trying to get at here. I admit that my
descriptions are more abstruse than I would like. It's not an easy
area to talk about.


From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 8 17:52:33 2024

[email protected] (MitchAlsup1) writes:

On Sun, 8 Sep 2024 2:47:38 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.

The usual "Hello, World" program defines main() either with no
arguments

int
main(){
...
}

or with two arguments

int
main( int argc, char *argv[] ){
...
}

and in both cases main() has defined behavior and does not
violate the strictures of strictly conforming programs.

The Linux environment (crt0) calls main with 3 arguments.

Are you arguing that a program can be strictly conforming and
not be type safe at its call/return interfaces ??

Note by the way that the C standard doesn't make any guarantees
about how a strictly conforming program will run under any given
implementation. All the standard does say is that a conforming
implementation shall accept any strictly conforming program (with
slightly different rules for conforming hosted implementations as
compared to conforming freestanding implementations).


From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 8 17:31:19 2024

[email protected] (MitchAlsup1) writes:

On Sun, 8 Sep 2024 2:47:38 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

Another issue is that main() may not have the 3 defined arguments
and the containing environment is not supposed to complain when
argc, argv, and envp are unused or even unnamed as arguments.

The usual "Hello, World" program defines main() either with no
arguments

int
main(){
...
}

or with two arguments

int
main( int argc, char *argv[] ){
...
}

and in both cases main() has defined behavior and does not
violate the strictures of strictly conforming programs.

The Linux environment (crt0) calls main with 3 arguments.

The C standard allows defining main() either with no parameters,
with two parameters (of types int and char **), or "in some other
implementation-defined manner". (Note: this rule applies only
to hosted implementations; freestanding implementations have a
different rule. Compilers on Linux are hosted implementations.)

On Ubuntu Linux, both gcc and clang accept (under -pedantic with
either -std=c99 or -std=c11) this input

#include <stdio.h>

int
main(){
printf( "Hello, world\n" );
return 0;
}

and this input

#include <stdio.h>

int
main( int argc, char *argv[] ){
printf( "Hello, world\n" );
return 0;
}

and this input

#include <stdio.h>

int
main( int argc, char *argv[], char *envp[] ){
printf( "Hello, world\n" );
return 0;
}

without giving any diagnostics. The executable produced in each
case runs fine. In fact using -S to look at generated code, all
three compile to the same code (different generated code under
gcc compared to clang, but the same code for all versions under
each compiler).

As a sanity check, I tried this input

#include <stdio.h>

int
main( int argc, char *argv[], double *envp[] ){
printf( "Hello, world\n" );
return 0;
}

which from gcc gives a warning diagnostic, and from clang gives
an error diagnostic. The generated code under gcc is the same as
that produced by gcc for the other inputs, and the produced
executable runs and does the same thing as the other versions (as
one would expect, since the generated code is the same).

Are you arguing that a program can be strictly conforming and
not be type safe at its call/return interfaces ??

Both of the first two versions (with a no-parameters main() and
with a two-parameter main()) satisfy all the criteria of strictly
conforming programs.

The third version (with a third parameter of type char **) does
not satisfy the definition of strictly conforming programs,
because it uses a feature not specified as part of the language
or library -- namely, the implementation-defined form of main().

The C standard requires every implementation to accept all
strictly conforming programs (or the implementation is not
conforming if it chooses not to accept a SC program for any
reason). We don't expect all C compilers to accept a main()
defined with three parameters, which is consistent with the
rule that they are required to accept all strictly conforming
programs.

Does this explanation help clear things up? Or is there still
some aspect I haven't explained adequately?


From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 8 19:20:06 2024

[email protected] (MitchAlsup1) writes:

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

Terje Mathisen <[email protected]> writes:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two people of
different age living in different parts of the world are telling
the same story. Maybe there exists an old popular book that said
that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?

Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.

The ever-shallow David Brown first misses the point, then makes a
slightly incorrect statement, and finally makes a recommendation
that surely is familiar to every reader in the newsgroup.

Memmove() is always appropriate unless you are doing something
nefarious.

So:
# define memcpy memmove

Incidentally, if one wants to do this, it's advisable to write

#undef memcpy

before the #define of memcpy.

and move forward with life--for the 2 extra cycles memmove costs
it saves everyone long term grief.

When you need the nefarious activities of memcpy write it as a
for loop by yourself and comment the nefariousness of the use.

The point of my comment is that there is extra information
available in the scenario described, and it might be useful to
take advantage of that information not to make a low-level change
(eg, substitute memmove() for memcpy()) but to switch to a
different higher level strategy, such as using a semi-space
compactor (or other possibilities).

Simply replacing memcpy() by memmove() of course will always
work, but there might be negative consequences beyond a cost
of 2 extra cycles -- for example, if a negative stride is
better performing than a positive stride, but the nature
of the compaction forces memmove() to always take the slower
choice.

It's always useful to have more options to choose from when there
is more information, even if ultimately what path is chosen
is the zero-information path.
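
To make the compaction scenario concrete, here is a rough sketch of sliding runs of live records toward the start of one array. The struct layout and the name compact are assumptions for illustration, not anything posted in the thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout, for illustration only. */
struct record {
    int  key;
    bool deleted;
};

/* Slide each run of live records down over the gaps left by deleted ones
   and return the number of live records. The destination is never above
   the source, but a long run can overlap its own destination, which is
   why memmove() rather than memcpy() is used here. */
static size_t compact(struct record *recs, size_t n)
{
    assert(recs != NULL || n == 0);
    size_t out = 0;
    size_t in = 0;
    while (in < n) {
        if (recs[in].deleted) {
            in++;
            continue;
        }
        size_t end = in;                 /* find the end of the live run */
        while (end < n && !recs[end].deleted)
            end++;
        if (out != in)
            memmove(&recs[out], &recs[in], (end - in) * sizeof recs[0]);
        out += end - in;
        in = end;
    }
    return out;
}
```

Because the copy direction is known in advance (always toward lower addresses), this is the kind of extra information the post refers to: a caller could pick a different strategy instead of leaving the decision to memmove().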


From Tim Rentsch@21:1/5 to Anton Ertl on Sun Sep 8 22:44:54 2024

[email protected] (Anton Ertl) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there
is an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general
case) exercising undefined behaviour;

Yes, I can.

and then the compiler could eliminate the check or do something
else that's unexpected. Do you have such a check in mind that
does not exercise undefined behaviour in the general case?

Sure. I wouldn't have made my earlier statement otherwise.

2) Even if there is such a check, you have to be aware that there
is a potential problem with memcpy(). In that case the way to go
is to just use memmove().

The point of my previous comment was only to address the question
of whether any existing memcpy() calls are problematic. If all
of the checks return "no overlap" then memcpy() is not the problem.

That said, using memmove() in place of memcpy() is one way to get
around problems with undesired behavior from memcpy(), but depending
on circumstances there may be other ways that are better.

But that does not help you with the next "clever" idea that some
compiler or library maintainer has.

I have the impression that this is an editorial comment having
nothing to do with memcpy() or memmove(). If that impression
is wrong then I'm at a loss to understand what you are talking
about, and would you please elaborate.
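
For readers wondering what a UB-free overlap check could look like, one candidate is sketched below. This is a guess, not necessarily the check referred to above; regions_overlap and checked_memcpy are hypothetical names. It relies only on pointer equality, which C defines even for pointers into different objects, whereas the relational operators are not defined there:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: do [a, a+n) and [b, b+n) share at least one byte?
   Two equal-sized regions overlap exactly when the start of one lies
   inside the other. Cost is O(n), so this suits debug builds or test
   runs rather than hot paths. */
static bool regions_overlap(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    for (size_t i = 0; i < n; i++) {
        if (pa + i == pb || pb + i == pa)
            return true;
    }
    return false;
}

/* Hypothetical wrapper: memcpy() that reports an overlapping call
   when assertions are enabled. */
static void *checked_memcpy(void *dst, const void *src, size_t n)
{
    assert(!regions_overlap(dst, src, n));
    return memcpy(dst, src, n);
}
```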


From Anton Ertl@21:1/5 to [email protected] on Mon Sep 9 05:55:08 2024

[email protected] (MitchAlsup1) writes:

On Sun, 8 Sep 2024 15:32:02 +0000, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

And just for fun::

On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:

Here we have the three variants:

#include <limits.h>

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}

long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}

long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}

My 66000:
add   r3,R1,#-1       add   r3,r1,#-1       bepm  r1,.L4
bge   R3,.L4          bge   r3,.L4
8-bytes               8-bytes               4-bytes

I have a direct test for POSMAX in ISA that does not use a constant.

How does bge work in the first and second column? My impression was
that you are using an 88k-style flags-in-GPR architecture.

I just copied the RISC-V code

The RISC-V bge has two operands (plus the branch target), the bge in
your code has only one operand. Here's the RISC-V code:

RV64GC:
addi a5,a1,-1          addi a5,a1,-1          li   a5,-1
bge  a1,a5,10 <.L4>    bge  a1,a5,28 <.L6>    slli a5,a5,0x3f
                                              bne  a1,a5,40 <.L8>

Concerning the last column, the gcc developer who added the
transformation of bar2() into bar3() apparently had My66000 in mind.

...

BTW I had the comparisons to int-MAX/MIN in since about 2016.

The transformation was added to gcc after gcc-10 was released in 2020,
so my tongue-in-cheek theory is not falsified by the timing of events.
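
For anyone who wants to convince themselves that the bar() and bar3() conditions agree, a small sketch follows. The helper names are invented here; __builtin_sub_overflow is the gcc/clang builtin already used in bar(). The bar2() test b < b-1 is left out because it overflows for exactly the input of interest:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Condition used by bar(): true when b - 1 overflows (gcc/clang builtin). */
static bool dec_overflows_builtin(long b)
{
    long c;
    return __builtin_sub_overflow(b, 1, &c);
}

/* Condition used by bar3(): LONG_MIN is the only long for which
   b - 1 is not representable. */
static bool dec_overflows_compare(long b)
{
    return b == LONG_MIN;
}

/* Hypothetical spot check that both formulations give the same answer. */
static void check_agreement(long b)
{
    assert(dec_overflows_builtin(b) == dec_overflows_compare(b));
}
```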

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Terje Mathisen@21:1/5 to David Brown on Mon Sep 9 08:56:45 2024

David Brown wrote:

On 05/09/2024 19:04, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want
two's complement wrapping, and for the even fewer occasions when
you actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1)
as an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.

Loop counters can usually be signed or unsigned, and it usually makes
no difference. Array indices are also usually much the same signed or
unsigned, and it can feel more natural to use size_t here (an unsigned
type). It can make a difference to efficiency, however. On x86-64,
this code is 3 instructions with T as "unsigned long int" or "long int",
4 with "int", and 5 with "unsigned int".

int foo(int * p, T x) {
    int a = p[x++];
    int b = p[x++];
    return a + b;
}

;; assume p in rdi, x in rsi

mov eax,[rdi+rsi*4]
add eax,[rdi+rsi*4+4]
ret

Anyway, I count loop counters and array indices as "use of integer-type
variables", whether you prefer signed or unsigned.

OK

unsigned arithmetic is easier than signed integer arithmetic,
including comparisons that would result in a negative value, you just
have to make the test before subtracting, instead of checking if the
result was negative.

I can't follow that at all. Unsigned and signed arithmetic and
comparisons both work simply and as you'd expect. /Mixing/ signed and
unsigned types can get things wrong.

Oh yeah!

I.e I cannot easily replicate a downward loop that exits when the
counter becomes negative:

  for (int i = START; i >= 0; i-- ) {
      // Do something with data[i]
  }

One of my alternatives is

  unsigned u = start; // Cannot be less than zero
  if (u) {
      u++;
      do {
          u--;
          data[u]...
      } while (u);
  }

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal)
instead of JAE (Jump Above or Equal), but my version is far more verbose.

A more important thing is that the first version, with signed i, is
/vastly/ simpler and clearer in the source code.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

  for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

Or you could just write sane code that matches what you want to say.

:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to Bernd Linsel on Mon Sep 9 09:09:13 2024

Bernd Linsel wrote:

On 05.09.24 19:04, Terje Mathisen wrote:

One of my alternatives is

  unsigned u = start; // Cannot be less than zero
  if (u) {
      u++;
      do {
          u--;
          data[u]...
      } while (u);
  }

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal)
instead of JAE (Jump Above or Equal), but my version is far more verbose.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

  for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

What about:

for (unsigned u = start; u != ~0u; --u)

I like that one!

...

or even

for (unsigned u = start; (int)u >= 0; --u)

That is the one that I've actually been using, i.e. casting to the
corresponding signed type.

...

?

I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
on godbolt.org:
- 32 bit version: https://godbolt.org/z/TMhhx3nch
- 64 bit version: https://godbolt.org/z/8oxzTf5Gf

Thanks!
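
The unsigned countdown variants discussed above can be put side by side in a small sketch. sum_down and sum_down_wrap are hypothetical names; the first uses the common "compare, then decrement" idiom, which was not posted in the thread, and the second is the u != ~0u form quoted above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum data[start] down to data[0]. The condition i-- > 0 tests the old
   value and then decrements, so the body sees start, ..., 1, 0 and the
   loop ends without needing a signed counter. */
static long sum_down(const int *data, size_t start)
{
    assert(start != SIZE_MAX);          /* start + 1 must not wrap */
    long sum = 0;
    for (size_t i = start + 1; i-- > 0; )
        sum += data[i];
    return sum;
}

/* Variant from the thread: stop once u has wrapped around to all-ones.
   Unsigned wraparound is defined, so this is not undefined behavior. */
static long sum_down_wrap(const int *data, unsigned start)
{
    long sum = 0;
    for (unsigned u = start; u != ~0u; --u)
        sum += data[u];
    return sum;
}
```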

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Anton Ertl@21:1/5 to Tim Rentsch on Mon Sep 9 07:40:38 2024

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there
is an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general
case) exercising undefined behaviour;

Yes, I can.

and then the compiler could eliminate the check or do something
else that's unexpected. Do you have such a check in mind that
does not exercise undefined behaviour in the general case?

Sure. I wouldn't have made my earlier statement otherwise.

You also stated "I'm confident the people who wrote the C standard
would say such a program is strictly conforming." about a program with
implementation-defined behaviour, so I lack confidence in your claim.

2) Even if there is such a check, you have to be aware that there
is a potential problem with memcpy(). In that case the way to go
is to just use memmove().

The point of my previous comment was only to address the question
of whether any existing memcpy() calls are problematic. If all
of the checks return "no overlap" then memcpy() is not the problem.

At least for the test runs.

But that does not help you with the next "clever" idea that some
compiler or library maintainer has.

I have the impression that this is an editorial comment having
nothing to do with memcpy() or memmove(). If that impression
is wrong then I'm at a loss to understand what you are talking
about, and would you please elaborate.

There are at least 200 undefined behaviours in the C standard, and
according to some people, C programmers should avoid all of them. So
the possible breakage of memcpy() is just one of many problems that
the programmers should be aware of and that they should test for.

Just because we discussed memcpy() as one of the problems with this
approach does not mean that having a way to deal with memcpy() solves
the larger problem.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Tim Rentsch on Mon Sep 9 07:07:25 2024

Tim Rentsch <[email protected]> writes:

[email protected] (MitchAlsup1) writes:

So:
# define memcpy memmove

Incidentally, if one wants to do this, it's advisable to write

#undef memcpy

before the #define of memcpy.

and move forward with life--for the 2 extra cycles memmove costs
it saves everyone long term grief.

Is it two extra cycles? Here are some data points from <[email protected]>:

Haswell (Core i7-4790K), glibc 2.19
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K block size
   14   14   15   15   17   30   48   85  150  281  570 1370 memmove
   15   16   13   16   19   32   48   86  161  327  631 1420 memcpy

Skylake (Core i5-6600K), glibc 2.19
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K block size
   14   14   14   14   15   27   43   77  147  305  573 1417 memmove
   13   14   10   12   14   27   46   85  165  313  607 1350 memcpy

Zen (Ryzen 5 1600X), glibc 2.24
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K block size
   16   16   16   17   32   43   66  107  177  328  601 1225 memmove
   13   13   14   13   38   49   73  116  188  336  610 1233 memcpy

I don't see a consistent speedup of memcpy over memmove here.

However, when one uses memcpy(&var,ptr,8) or the like to perform an
unaligned access, gcc transforms this into a load (or store) without
the redefinition of memcpy, but into much slower code with the
redefinition (i.e., when using memmove instead of memcpy).
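
The memcpy() idiom for unaligned access mentioned above looks roughly like this. load_u64 and store_u64 are hypothetical helper names; whether the call collapses into a single load or store depends on the compiler and optimization level:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read 8 bytes from a possibly unaligned address, in native byte order.
   Copying through memcpy() avoids dereferencing a misaligned uint64_t *. */
static uint64_t load_u64(const void *p)
{
    assert(p != NULL);
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write 8 bytes to a possibly unaligned address, in native byte order. */
static void store_u64(void *p, uint64_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```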

Simply replacing memcpy() by memmove() of course will always
work, but there might be negative consequences beyond a cost
of 2 extra cycles -- for example, if a negative stride is
better performing than a positive stride, but the nature
of the compaction forces memmove() to always take the slower
choice.

If the two memory blocks don't overlap, memmove() can use the fastest
stride. If the two memory blocks overlap, memcpy() as implemented in
glibc is a bad idea.

The way to go for memmove() is:

On hardware where positive stride is faster:

if (((uintptr)(dest-src)) >= len)
return memcpy_posstride(dest,src,len)
else
return memcpy_negstride(dest,src,len)

On hardware where the negative stride is faster:

if (((uintptr)(src-dest)) >= len)
return memcpy_negstride(dest,src,len)
else
return memcpy_posstride(dest,src,len)

And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.

The benefit of this comparison over just comparing the addresses is
that the branch will have a much lower miss rate.
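
A compilable sketch of the positive-stride-preferred dispatch above, under the assumption that converting pointers to uintptr_t preserves byte distances (implementation-defined, but true on the usual flat-address-space targets). sketch_memmove, copy_forward and copy_backward are made-up names, and the byte loops stand in for real tuned copy routines:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a tuned positive-stride copy: lowest address first. */
static void *copy_forward(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    for (size_t i = 0; i < len; i++)
        d[i] = s[i];
    return dest;
}

/* Stand-in for a tuned negative-stride copy: highest address first. */
static void *copy_backward(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    for (size_t i = len; i-- > 0; )
        d[i] = s[i];
    return dest;
}

/* One unsigned comparison covers both safe cases for a forward copy:
   dest below src (the difference wraps to a huge value) and dest at or
   beyond src + len. Only dest inside [src, src + len) goes backward. */
static void *sketch_memmove(void *dest, const void *src, size_t len)
{
    assert((dest != NULL && src != NULL) || len == 0);
    if ((uintptr_t)dest - (uintptr_t)src >= len)
        return copy_forward(dest, src, len);
    else
        return copy_backward(dest, src, len);
}
```

The subtraction is done on integers, so it wraps instead of being undefined; the remaining assumption is the pointer-to-integer mapping itself.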

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Terje Mathisen@21:1/5 to Tim Rentsch on Mon Sep 9 10:20:00 2024

Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

Terje Mathisen <[email protected]> writes:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two people of
different age living in different parts of the world are telling
the same story. Maybe there exists an old popular book that said
that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?

Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?
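A rough C sketch of such a compaction, for illustration. The record layout and the name compact_records() are hypothetical. Runs of live records are moved as a block, so the source and destination ranges can overlap whenever a gap is shorter than the run that follows it, and the destination is never above the source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout, for illustration only. */
struct record {
    int deleted;   /* nonzero when the slot is marked as deleted */
    int payload;
};

/* Slide runs of live records toward the start of the array.
   Returns the number of live records. */
size_t compact_records(struct record *recs, size_t count)
{
    size_t out = 0;
    size_t in = 0;
    while (in < count) {
        if (recs[in].deleted) {
            in++;
            continue;
        }
        /* Find the end of this run of live records. */
        size_t run = in;
        while (run < count && !recs[run].deleted)
            run++;
        /* Destination is at or below the source; the ranges overlap
           when the gap (in - out) is smaller than the run length. */
        if (out != in)
            memmove(&recs[out], &recs[in], (run - in) * sizeof recs[0]);
        out += run - in;
        in = run;
    }
    return out;
}
```

With a memcpy() that is guaranteed to copy in increasing address order, the same loop would work with memcpy(); without that guarantee, memmove() is the call whose behaviour is defined here.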

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.

The ever-shallow David Brown first misses the point, then makes a
slightly incorrect statement, and finally makes a recommendation
that surely is familiar to every reader in the newsgroup.

Memmove() is always appropriate unless you are doing something
nefarious.

So:
#define memcpy memmove

Incidentally, if one wants to do this, it's advisable to write

#undef memcpy

before the #define of memcpy.

What really worries me is that I've been told (and shown in godbolt)
that memcpy() can be magic, i.e. the compiler is allowed to make it a NOP
when I use it to move data between an integer and float variable:

float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;

is deprecated, instead do something like this:

int32_t ix;
memcpy(&ix, &x, sizeof(ix));

and the compiler will see that x and ix can share the same register.

I don't suppose memmove() can be depended upon to do the same?
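For reference, a self-contained sketch of both variants, so the generated code can be compared in a compiler explorer. The function names are hypothetical, and it assumes float and uint32_t have the same size (and, for the value in the usage note, IEEE 754 binary32).

```c
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of a float as a 32-bit unsigned integer.
   Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* no aliasing violation */
    return bits;
}

/* Same operation with memmove(); the two objects cannot overlap,
   so the result is identical. */
uint32_t float_bits_move(float x)
{
    uint32_t bits;
    memmove(&bits, &x, sizeof bits);
    return bits;
}
```

On an IEEE 754 platform float_bits(1.0f) yields 0x3f800000. Whether both functions compile to a single register move is up to the compiler and its flags.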

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Terje Mathisen on Mon Sep 9 09:06:43 2024

Terje Mathisen <[email protected]> writes:

float invsqrt(float x)

[...]

int32_t ix;
memcpy(&ix, &x, sizeof(ix));

and the compiler will see that x and ix can share the same register.

I don't suppose memmove() can be dependent upon to do the same?

Nothing prevents the compiler from doing it with memmove(), and nothing
forces the compiler to do it with memcpy(). So a compiler could call the
function memcpy() for the code above, and optimize it as you prefer
with memmove(). What actual compilers do is something you can try
out. My experience is that memcpy() is given more love by compiler
maintainers than memmove(). It's as if, despite all the rhetoric that
C programmers should "sanitize" programs to get rid of undefined
behaviours in our programs, they actually prefer that we use functions
with less defined behaviour like memcpy() instead of functions with
more defined behaviour like memmove().

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to All on Mon Sep 9 12:26:57 2024

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Terje Mathisen on Mon Sep 9 12:22:19 2024

On Mon, 9 Sep 2024 10:20:00 +0200
Terje Mathisen <[email protected]> wrote:

Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

Terje Mathisen <[email protected]> writes:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was
defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two
people of different age living in different parts of the
world are telling the same story. Maybe there exists an old
popular book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?

Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.

The ever-shallow David Brown first misses the point, then makes a
slightly incorrect statement, and finally makes a recommendation
that surely is familiar to every reader in the newsgroup.

Memmove() is always appropriate unless you are doing something
nefarious.

So:
#define memcpy memmove

Incidentally, if one wants to do this, it's advisable to write

#undef memcpy

before the #define of memcpy.

What really worries me is that I've been told (and shown in godbolt)
that memcpy() can be magic, i.e. the compiler is allowed to make it
a NOP when I use it to move data between an integer and float variable:

float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;

is deprecated, instead do something like this:

int32_t ix;
memcpy(&ix, &x, sizeof(ix));

and the compiler will see that x and ix can share the same register.

I don't suppose memmove() can be depended upon to do the same?

Terje

In simple situations like shown above, memmove is as dependable as
memcpy.

I don't know if it is always true in more complex cases, where absence
of aliasing is less obvious to compiler. However, I'd expect that as
long as a copied item fits in register, the magic will work equally
with both memcpy and memmove.

It depends on compiler, too.
MSVC from VS2019 produces the same code for both variants d_to_u below.
But MSVC from VS2017 does not.

#include <stdint.h>
#include <string.h>

void d_to_u_cpy(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
}

#define memcpy memmove

void d_to_u_move(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 10:30:34 2024

Michael S <[email protected]> writes:

On Mon, 9 Sep 2024 10:20:00 +0200
Terje Mathisen <[email protected]> wrote:

float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;

[...]

int32_t ix;
memcpy(&ix, &x, sizeof(ix));

...

I don't know if it is always true in more complex cases, where absence
of aliasing is less obvious to compiler.

Something like

memmove(p, q, 8)

can be translated to something like

0: 48 8b 06 mov (%rsi),%rax
3: 48 89 07 mov %rax,(%rdi)

without any aliasing worries, and indeed, gcc-9, gcc-10, and gcc-12
do that.

However, I'd expect that as
long as a copied item fits in register, the magic will work equally
with both memcpy and memmove.

One would hope so, but here's what happens with gcc-12:

#include <string.h>

void foo1(char *p, char* q)
{
memcpy(p,q,32);
}

void foo2(char *p, char* q)
{
memmove(p,q,32);
}

gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

0000000000000000 <foo1>:
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
1a: 00 00 00 00
1e: 66 90 xchg %ax,%ax

0000000000000020 <foo2>:
20: ba 20 00 00 00 mov $0x20,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>

The jmp at offset 25 is probably a tail-call to memmove().

My guess is that xmm registers and unrolling are used here rather than
ymm registers because waking up the second 128 bits takes time. But
even with that, the code uses two different registers, and if
scheduled differently, could be used for implementing foo2():

0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
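
The same idea at C level, as a small sketch (move32() is a hypothetical name): if the whole source is read into temporaries before any byte of the destination is written, the copy is correct for overlapping buffers regardless of direction. Whether a given compiler turns this into the four instructions above is not guaranteed.

```c
#include <string.h>

/* Overlap-safe copy of exactly 32 bytes: read the whole source into a
   temporary before writing any byte of the destination. */
void move32(void *dest, const void *src)
{
    unsigned char tmp[32];
    memcpy(tmp, src, sizeof tmp);    /* all loads happen first */
    memcpy(dest, tmp, sizeof tmp);   /* then all stores */
}
```

Neither memcpy() call overlaps its own arguments, because tmp is a distinct local object.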

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Terje Mathisen on Mon Sep 9 13:03:19 2024

On 09/09/2024 08:56, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 19:04, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want
two's complement wrapping, and for the even fewer occasions when
you actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1)
as an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.

Loop counters can usually be signed or unsigned, and it usually makes
no difference. Array indices are also usually much the same signed or
unsigned, and it can feel more natural to use size_t here (an unsigned
type). It can make a difference to efficiency, however. On x86-64,
this code is 3 instructions with T as "unsigned long int" or "long
int", 4 with "int", and 5 with "unsigned int".

int foo(int * p, T x) {
     int a = p[x++];
     int b = p[x++];
     return a + b;
}

;; assume *p in rdi, x in rsi

mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret

Yes - that's three instructions for 64-bit type T. (To be clear, I had
counted the "ret" here.)

With 32-bit int for T, you need a "movsx rsi, esi" first to sign-extend
the 32-bit int parameter "x" to 64 bits. (That could be different for
different ABIs.) With 32-bit unsigned int for T you need an additional
instruction to make sure the result of the first "x++" is wrapped as
32-bit unsigned.

Or you could just write sane code that matches what you want to say.

:-)

Of course the fine line between "smart code" and "smart-arse code" is
somewhat subjective!

It also varies over time, and depends on the needs of the code.
Sometimes it makes sense to prioritise efficiency over readability - but
that is rare, and has been getting steadily rarer over the decades as
processors have been getting faster (disproportionally so for
inefficient code) and compilers have been getting better.

Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall. This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.

Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to All on Mon Sep 9 13:19:49 2024

On 08/09/2024 20:32, MitchAlsup1 wrote:

On Sun, 8 Sep 2024 6:25:10 +0000, David Brown wrote:

On 08/09/2024 02:17, MitchAlsup1 wrote:

On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

static uint64_t array[1024*1024*512+1];
static int SIZE = sizeof(array)/sizeof(uint64_t);

Surely you mean :

static const size_t array_size = sizeof(array) / sizeof(uint64_t);

I wanted SIZE to have the same type as i.

Okay, I suppose - though I would rather have it being an appropriate
type and, if necessary, change the type of "i". But I still don't get
your point - what has this "SIZE" of 0x20000001 got to do with a "START"
that you want to equal 0x80000001 ? Were you just trying to show that
it is possible to make the number 0x80000001 in code, and got the
numbers wrong? If you know that you might have numbers exceeding 32-bit
ranges, then you need to use a 64-bit type as the index variable - and
it can still happily be signed rather than writing more complicated code
just to force it into an obsessive rule about using unsigned types.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Anton Ertl on Mon Sep 9 04:32:17 2024

[email protected] (Anton Ertl) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

[...]

1) A strictly conforming program shall use only those features
of the language and library specified in this International
Standard. This excludes all programs that terminate,
including the "Hello, World" program. [...]

I don't know why you say this. Which aspects of the definition
for "strictly conforming program" do you think are violated by a
typical 'Hello, World' program?

A typical "Hello, World" program terminates, and as mentioned,
no terminating program can be strictly conforming, because it
exercises at least implementation-defined behaviour (e.g., look
at section 7.22.4.4 of C11).

I'm familiar with the exit() function and how the C standard
defines it. You should re-read the definition of strictly
conforming program, which says in part

It shall not produce output dependent on any unspecified,
undefined, or implementation-defined behavior

It is not any use of implementation-defined behavior that is off
limits, only those uses that produce output dependent on such
behavior. The return status of a program is not an output.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 14:58:54 2024

On Mon, 09 Sep 2024 10:30:34 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Mon, 9 Sep 2024 10:20:00 +0200
Terje Mathisen <[email protected]> wrote:

float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;

[...]

int32_t ix;
memcpy(&ix, &x, sizeof(ix));

...

I don't know if it is always true in more complex cases, where
absence of aliasing is less obvious to compiler.

Something like

memmove(p, q, 8)

can be translated to something like

0: 48 8b 06 mov (%rsi),%rax
3: 48 89 07 mov %rax,(%rdi)

without any aliasing worries, and indeed, gcc-9, gcc-10, and gcc-12
do that.

However, I'd expect that as
long as a copied item fits in register, the magic will work equally
with both memcpy and memmove.

One would hope so, but here's what happens with gcc-12:

#include <string.h>

void foo1(char *p, char* q)
{
memcpy(p,q,32);
}

void foo2(char *p, char* q)
{
memmove(p,q,32);
}

gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

0000000000000000 <foo1>:
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
1a: 00 00 00 00
1e: 66 90 xchg %ax,%ax

0000000000000020 <foo2>:
20: ba 20 00 00 00 mov $0x20,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>

The jmp at offset 25 is probably a tail-call to memmove().

My guess is that xmm registers and unrolling are used here rather than
ymm registers because waking up the second 128 bits takes time. But
even with that, the code uses two different registers, and if
scheduled differently, could be used for implementing foo2():

0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret

- anton

Try -march instead of -mavx2. E.g. -march=haswell
Sometimes gcc is beyond logic.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 11:11:04 2024

Michael S <[email protected]> writes:

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

At least that was claimed as the rationale for implementing a memcpy
with negative stride in glibc in 2010. Of course, we have every
reason to be skeptical, given that bullshit about undisclosed
performance advantages of their misdeeds is common in those circles.

And when somebody made the mistake of actually being a bit more
concrete with their claims, and I actually checked it
<http://www.complang.tuwien.ac.at/anton/autovectors/>, it turned out
that the claimed-better version had essentially the same performance
as the more benign version.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Mon Sep 9 13:39:40 2024

On 09/09/2024 11:22, Michael S wrote:

On Mon, 9 Sep 2024 10:20:00 +0200
Terje Mathisen <[email protected]> wrote:

Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

On 04/09/2024 18:07, Tim Rentsch wrote:

Terje Mathisen <[email protected]> writes:

Michael S wrote:

On Tue, 3 Sep 2024 17:41:40 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

3 years ago Terje Mathisen wrote that many years ago he read
that behaviour of memcpy() with overlapped src/dst was
defined.
https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
Mitch Alsup answered "That was true in 1983". So, two
people of different age living in different parts of the
world are telling the same story. Maybe there exists an old
popular book that said that it was defined?

It probably wasn't written in the official C standard, which I
couldn't have afforded to buy/read, but in a compiler runtime
doc?

Specifying that it would always copy from beginning to end of
the source buffer, in increasing address order meant that it
was guaranteed safe when used to compact buffers.

What is "compact buffers" ?

Assume a buffer consisting of records of some type, some of
them marked as deleted. Iterating over them while removing
the gaps means that you are always copying to a destination
lower in memory, right?

If all the records are in one large array, there is a simple
test to see if memcpy() must work or whether some alternative
should be used instead.

Such tests are usually built into implementations of memmove(),
which will choose to run forwards or backwards as needed. So you
might as well just call memmove() any time you are not sure
memcpy() is safe and appropriate.

The ever-shallow David Brown first misses the point, then makes a
slightly incorrect statement, and finally makes a recommendation
that surely is familiar to every reader in the newsgroup.

Memmove() is always appropriate unless you are doing something
nefarious.

So:
#define memcpy memmove

Incidentally, if one wants to do this, it's advisable to write

#undef memcpy

before the #define of memcpy.

What really worries me is that I've been told (and shown in godbolt)
that memcpy() can be magic, i.e. the compiler is allowed to make it
a NOP when I use it to move data between an integer and float variable:

float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;

is deprecated, instead do something like this:

int32_t ix;
memcpy(&ix, &x, sizeof(ix));

and the compiler will see that x and ix can share the same register.

I don't suppose memmove() can be depended upon to do the same?

Terje

In simple situations like shown above, memmove is as dependable as
memcpy.

I don't know if it is always true in more complex cases, where absence
of aliasing is less obvious to compiler. However, I'd expect that as
long as a copied item fits in register, the magic will work equally
with both memcpy and memmove.

That's my experience too, but as you say, it is compiler (and flag)
dependent.

In most such cases, there's no overlap so memcpy() is the common choice.
(Even if the same register is used as a result of optimisation,
logically the variables are independent.)

You could, I suppose, be trying to use memcpy() or memmove() on members
of a union in C++ (where type-punning using unions is UB, unlike in C).
Then you would have to use memmove() to be correct. (gcc can warn about
aliases and overlaps for the "restrict" parameters of memcpy() in simple
cases.)

It depends on compiler, too.
MSVC from VS2019 produces the same code for both variants d_to_u below.
But MSVC from VS2017 does not.

#include <stdint.h>
#include <string.h>

void d_to_u_cpy(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
}

#define memcpy memmove

void d_to_u_move(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 12:28:13 2024

Michael S <[email protected]> writes:

On Mon, 09 Sep 2024 10:30:34 GMT
[email protected] (Anton Ertl) wrote:

One would hope so, but here's what happens with gcc-12:

#include <string.h>

void foo1(char *p, char* q)
{
memcpy(p,q,32);
}

void foo2(char *p, char* q)
{
memmove(p,q,32);
}

gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

0000000000000000 <foo1>:
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
1a: 00 00 00 00
1e: 66 90 xchg %ax,%ax

0000000000000020 <foo2>:
20: ba 20 00 00 00 mov $0x20,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>

The jmp at offset 25 is probably a tail-call to memmove().

My guess is that xmm registers and unrolling are used here rather than
ymm registers because waking up the second 128 bits takes time. But
even with that, the code uses two different registers, and if
scheduled differently, could be used for implementing foo2():

0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret

- anton

Try -march instead of -mavx2. E.g. -march=haswell
Sometimes gcc is beyond logic.

For gcc -O3 -march=haswell I got the same result (with gcc-12). I
also tried -march=x86-64-v3 with the same result.

But gcc -O3 -march=x86-64-v4 produced:

0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 f8 77 vzeroupper
b: c3 ret
c: 0f 1f 40 00 nopl 0x0(%rax)

0000000000000010 <foo2>:
10: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
14: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
18: c5 f8 77 vzeroupper
1b: c3 ret

And when changing the length to 64:

0000000000000000 <foo1>:
0: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
6: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
c: c5 f8 77 vzeroupper
f: c3 ret

0000000000000010 <foo2>:
10: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
16: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
1c: c5 f8 77 vzeroupper
1f: c3 ret

But when changing the length to 63:

0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
12: c5 f8 77 vzeroupper
15: c3 ret
16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
1d: 00 00 00

0000000000000020 <foo2>:
20: ba 3f 00 00 00 mov $0x3f,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 16:08:47 2024

On Mon, 09 Sep 2024 12:28:13 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Mon, 09 Sep 2024 10:30:34 GMT
[email protected] (Anton Ertl) wrote:

One would hope so, but here's what happens with gcc-12:

#include <string.h>

void foo1(char *p, char* q)
{
memcpy(p,q,32);
}

void foo2(char *p, char* q)
{
memmove(p,q,32);
}

gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

0000000000000000 <foo1>:
0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret
13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
1a: 00 00 00 00
1e: 66 90 xchg %ax,%ax

0000000000000020 <foo2>:
20: ba 20 00 00 00 mov $0x20,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>

The jmp at offset 25 is probably a tail-call to memmove().

My guess is that xmm registers and unrolling are used here rather
than ymm registers because waking up the second 128 bits takes
time. But even with that, the code uses two different registers,
and if scheduled differently, could be used for implementing
foo2():

0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
12: c3 ret

- anton

Try -march instead of -mavx2. E.g. -march=haswell
Sometimes gcc is beyond logic.

For gcc -O3 -march=haswell I got the same result (with gcc-12). I
also tried -march=x86-64-v3 with the same result.

But gcc -O3 -march=x86-64-v4 produced:

My gcc was 14.1 and -O2. It produced the same code as yours below (for the
case of 32) with -march=haswell

0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 f8 77 vzeroupper
b: c3 ret
c: 0f 1f 40 00 nopl 0x0(%rax)

0000000000000010 <foo2>:
10: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
14: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
18: c5 f8 77 vzeroupper
1b: c3 ret

And when changing the length to 64:

0000000000000000 <foo1>:
0: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
6: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
c: c5 f8 77 vzeroupper
f: c3 ret

0000000000000010 <foo2>:
10: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
16: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
1c: c5 f8 77 vzeroupper
1f: c3 ret

And here I got different code for -march=tigerlake and
-march=znver4 despite both having approximately the same ISA.
It seems that for Tiger Lake gcc is over-concerned about the impact of
unaligned 64-byte accesses.

But when changing the length to 63:

0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
12: c5 f8 77 vzeroupper
15: c3 ret
16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
1d: 00 00 00

0000000000000020 <foo2>:
20: ba 3f 00 00 00 mov $0x3f,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Terje Mathisen on Mon Sep 9 15:21:27 2024

On Mon, 9 Sep 2024 08:56:45 +0200
Terje Mathisen <[email protected]> wrote:

David Brown wrote:

On 05/09/2024 19:04, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external
data, for anything involving bit manipulation (use of &, |, ^,
<< and >> on signed types is usually wrong, IMHO), as building
blocks in extended arithmetic types, for the few occasions when
you want two's complement wrapping, and for the even fewer
occasions when you actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have
for integer-type variables, with (like Mitch) a few apis that
use (-1) as an error signal that has to be handled with special
code.

You don't have loop counters, array indices, or integer
arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e the largest supported unsigned
type in the system, typically the same as u64.

Loop counters can usually be signed or unsigned, and it usually
makes no difference. Array indices are also usually much the same
signed or unsigned, and it can feel more natural to use size_t here
(an unsigned type). It can make a difference to efficiency,
however. On x86-64, this code is 3 instructions with T as
"unsigned long int" or "long int", 4 with "int", and 5 with
"unsigned int".

int foo(int * p, T x) {
    int a = p[x++];
    int b = p[x++];
    return a + b;
}

;; assume *p in rdi, x in rsi

mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret

more like
mov rax,[rdi+rsi*4]
add rax,[rdi+rsi*4+8]
ret

But that's not the point (==trap).
The point (==trap), I'd guess, is that for T=uint32_t code generator
has to account for possibility of x==2**32-1.
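
A tiny C illustration of that corner case. The function names are hypothetical, and only the index arithmetic is shown because an array of 2**32 ints is not practical to allocate in an example. With a 32-bit unsigned index the increment must wrap to 0, so the second access cannot simply be folded into "first address plus 4"; with a 64-bit index it can.

```c
#include <stdint.h>

/* Index used by the second p[x++] when the index type is 32-bit unsigned. */
uint64_t second_index_u32(uint32_t x)
{
    x++;            /* wraps to 0 when x == UINT32_MAX */
    return x;
}

/* Index used by the second p[x++] when the index type is 64-bit unsigned. */
uint64_t second_index_u64(uint64_t x)
{
    x++;            /* no wrap for any value that fits in 32 bits */
    return x;
}
```

For signed int the overflow case is undefined behaviour, which is what lets the compiler ignore it and get by with only the sign extension.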

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Sep 9 06:21:12 2024

Terje Mathisen <[email protected]> writes:

Bernd Linsel wrote:

On 05.09.24 19:04, Terje Mathisen wrote:

One of my alternatives is

unsigned u = start; // Cannot be less than zero
if (u) {
    u++;
    do {
        u--;
        data[u]...
    } while (u);
}

This typically results in effectively the same asm code as the
signed version, except for a bottom JGE (Jump (signed) Greater or
Equal) instead of JA (Jump Above or Equal), but my version is far
more verbose.

Alternatively, if you don't need all N bits of the unsigned type,
then you can subtract and check if the top bit is set in the
result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)

What about:

for (unsigned u = start; u != ~0u; --u)

I like that one!

...

or even

for (unsigned u = start; (int)u >= 0; --u)

That is the one that I've actually been using, i.e. casting to the
corresponding signed type.

I don't like either of these because they need a redundant
specification of the index variable's type (and similarly the
definition of TOPBIT depends on knowing that type). Needing to
redundantly know the type is dangerous because the two type
specifications might get out of sync. Instead, either

for (unsigned u = start; u != -1; --u)

or

for (unsigned u = start; u+1 != 0; --u)

avoids the danger of having types be out of sync (and also can be
used with signed types, not that I would advocate doing that).
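
A short C sketch of the second form in use; sum_down() is a hypothetical example function. The loop visits data[start] down to data[0] and stops when the decrement wraps the index around, without naming the index type a second time.

```c
/* Sum data[start] down to data[0] using an unsigned countdown index. */
long sum_down(const int *data, unsigned start)
{
    long total = 0;
    /* After u reaches 0, --u wraps to UINT_MAX and u + 1 becomes 0. */
    for (unsigned u = start; u + 1 != 0; --u)
        total += data[u];
    return total;
}
```

One caveat, stated as an assumption of the sketch: the index type must be at least as wide as int. For a narrower unsigned type such as unsigned short, integer promotion makes u + 1 an int that never equals 0, and the loop would not terminate as intended.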

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Anton Ertl on Mon Sep 9 06:24:35 2024

[email protected] (Anton Ertl) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (MitchAlsup1) writes:

So:
#define memcpy memmove

Incidentally, if one wants to do this, it's advisable to write

#undef memcpy

before the #define of memcpy.

and move forward with life--for the 2 extra cycles memmove costs
it saves everyone long term grief.

Is it two extra cycles? Here are some data points from <[email protected]>:

Haswell (Core i7-4790K), glibc 2.19
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K  block size
   14   14   15   15   17   30   48   85  150  281  570 1370  memmove
   15   16   13   16   19   32   48   86  161  327  631 1420  memcpy

Skylake (Core i5-6600K), glibc 2.19
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K  block size
   14   14   14   14   15   27   43   77  147  305  573 1417  memmove
   13   14   10   12   14   27   46   85  165  313  607 1350  memcpy

Zen (Ryzen 5 1600X), glibc 2.24
    1    8   32   64  128  256  512   1K   2K   4K   8K  16K  block size
   16   16   16   17   32   43   66  107  177  328  601 1225  memmove
   13   13   14   13   38   49   73  116  188  336  610 1233  memcpy

I don't see a consistent speedup of memcpy over memmove here.

However, when one uses memcpy(&var,ptr,8) or the like to perform an
unaligned access, gcc transforms this into a load (or store) without
the redefinition of memcpy, but into much slower code with the
redefinition (i.e., when using memmove instead of memcpy).
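
For reference, a minimal sketch of the unaligned-access idiom mentioned above. The helper name load_u64 is hypothetical, and whether the call really compiles to a single load depends on the compiler and optimisation level:

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from an address that may not be 8-byte aligned.
   load_u64 is an illustrative name, not something from the thread. */
uint64_t load_u64(const void *ptr)
{
    uint64_t var;
    memcpy(&var, ptr, sizeof var);   /* fixed, small size: a candidate for a plain load */
    return var;
}
```

The size is a compile-time constant and the destination is a local object, which is what gives the compiler the chance to replace the library call.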

Simply replacing memcpy() by memmove() of course will always
work, but there might be negative consequences beyond a cost
of 2 extra cycles -- for example, if a negative stride is
better performing than a positive stride, but the nature
of the compaction forces memmove() to always take the slower
choice.

If the two memory blocks don't overlap, memmove() can use the
fastest stride.

It /could/ use the fastest stride. Whether it /does/ use the
fastest stride is a different question (and one that may have
different answers on different platforms).


From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 16:30:50 2024

On Mon, 09 Sep 2024 12:28:13 GMT
[email protected] (Anton Ertl) wrote:

But when changing the length to 63:

0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
12: c5 f8 77 vzeroupper
15: c3 ret
16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
1d: 00 00 00

0000000000000020 <foo2>:
20: ba 3f 00 00 00 mov $0x3f,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>

- anton

An interesting question is which code I want in this case.
In the absence of -march options and with -O1|2|3 I want something like
this:

foo2:
movups (%rsi), %xmm0
movups 16(%rsi), %xmm1
movups 32(%rsi), %xmm2
movups 47(%rsi), %xmm3
movups %xmm0, (%rdi)
movups %xmm1, 16(%rdi)
movups %xmm2, 32(%rdi)
movups %xmm3, 47(%rdi)
ret

Without deep thinking I don't see why I would want anything
different for foo1().


From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Sep 9 06:41:15 2024

Terje Mathisen <[email protected]> writes:

Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

[...]

Memmove() is always appropriate unless you are doing something
nefarious.

So:
# define memcpy memmove

Incidentally, if one wants to do this, it's advisable to write

#undef memcpy

before the #define of memcpy.

What really worries me is that I've been told (and shown in
godbolt) that memcpy() can be magic, i.e. the compiler is allowed
to make it a NOP when I use it to move data between an integer and
float variable:

float invsqrt(float x)
{
...
int32_t ix = *(int32_t *) &x;

is deprecated, instead do something like this:

int32_t ix;
memcpy(&ix, &x, sizeof(ix));

and the compiler will see that x and ix can share the same
register.

I don't suppose memmove() can be depended upon to do the same?

In such cases I almost always use unions rather than memcpy()
or memmove():

float
invsqrt(float x){
int32_t ix = (union {float f; int32_t i32;}){ x } .i32;
// ...
}

No need for addresses, aliasing concerns, or any string.h
functions. And typically the unioning/deunioning produces
no generated code.

Of course it helps to have an appropriate union type predefined;
here I wrote it inline to make the example self-contained.
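
A minimal, self-contained sketch of the two approaches side by side. The union tag float_bits and both function names are illustrative assumptions, and the code assumes float and int32_t have the same size:

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits wide");

/* Predefined union type for reinterpreting the bits of a float. */
union float_bits { float f; int32_t i32; };

/* Union approach: store as float, read back as int32_t. */
int32_t bits_via_union(float x)
{
    union float_bits u = { .f = x };
    return u.i32;
}

/* memcpy approach: copy the object representation byte for byte. */
int32_t bits_via_memcpy(float x)
{
    int32_t ix;
    memcpy(&ix, &x, sizeof ix);
    return ix;
}
```

Both functions return the same bit pattern; on an IEEE 754 platform, passing 1.0f yields 0x3f800000 either way.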


From Tim Rentsch@21:1/5 to Anton Ertl on Mon Sep 9 08:31:13 2024

[email protected] (Anton Ertl) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there
is an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general
case) exercising undefined behaviour;

Yes, I can.

and then the compiler could eliminate the check or do something
else that's unexpected. Do you have such a check in mind that
does not exercise undefined behaviour in the general case?

Sure. I wouldn't have made my earlier statement otherwise.

You also stated "I'm confident the people who wrote the C standard
would say such a program is strictly conforming." about a program with
implementation-defined behaviour, so I lack confidence in your claim.

Oh? Do you have some reason to think your sense of the beliefs and
attitudes of people on the ISO C committee is better than mine?

2) Even if there is such a check, you have to be aware that there
is a potential problem with memcpy(). In that case the way to go
is to just use memmove().

The point of my previous comment was only to address the question
of whether any existing memcpy() calls are problematic. If all
of the checks return "no overlap" then memcpy() is not the problem.

At least for the test runs.

Yes, the notion is to test exactly the runs that customers say
are giving problems, if necessary by having customers run a
version with the overlapping checks put in.

But that does not help you with the next "clever" idea that some
compiler or library maintainer has.

I have the impression that this is an editorial comment having
nothing to do with memcpy() or memmove(). If that impression
is wrong then I'm at a loss to understand what you are talking
about, and would you please elaborate.

There are at least 200 undefined behaviours in the C standard, and
according to some people, C programmers should avoid all of them. So
the possible breakage of memcpy() is just one of many problems that
the programmers should be aware of and that they should test for.

Just because we discussed memcpy() as one of the problems with this
approach does not mean that having a way to deal with memcpy() solves
the larger problem.

So you're saying my impression that your comment didn't really
have anything to do with memcpy() or memmove() is right?


From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 15:02:51 2024

Michael S <[email protected]> writes:

On Mon, 09 Sep 2024 12:28:13 GMT
[email protected] (Anton Ertl) wrote:

But when changing the length to 63:

...

An interesting question is which code I want in this case.
In the absence of -march options and with -O1|2|3 I want something like
this:

foo2:
movups (%rsi), %xmm0
movups 16(%rsi), %xmm1
movups 32(%rsi), %xmm2
movups 47(%rsi), %xmm3
movups %xmm0, (%rdi)
movups %xmm1, 16(%rdi)
movups %xmm2, 32(%rdi)
movups %xmm3, 47(%rdi)
ret

Yes.

Without deep thinking I don't see why I would want anything
different for foo1().

I don't think that deep thinking helps here. One could try to measure microbenchmarks, but do they actually represent application use?

Given that the code is inlined, you can reduce register pressure (and
potential spilling and refilling cost) with:

foo1:
movups (%rsi), %xmm0
movups %xmm0, (%rdi)
movups 16(%rsi), %xmm0
movups %xmm0, 16(%rdi)
movups 32(%rsi), %xmm0
movups %xmm0, 32(%rdi)
movups 47(%rsi), %xmm0
movups %xmm0, 47(%rdi)

Interestingly, gcc uses this kind of scheduling, but different
register names, squandering that advantage of its scheduling. But I
did not test that in a situation where register pressure plays a role.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Brett@21:1/5 to David Brown on Mon Sep 9 19:25:51 2024

David Brown <[email protected]> wrote:

On 09/09/2024 08:56, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 19:04, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external data,
for anything involving bit manipulation (use of &, |, ^, << and >>
on signed types is usually wrong, IMHO), as building blocks in
extended arithmetic types, for the few occasions when you want
two's complement wrapping, and for the even fewer occasions when
you actually need that last bit of range.

That last paragraph enumerates pretty much all the uses I have for
integer-type variables, with (like Mitch) a few apis that use (-1)
as an error signal that has to be handled with special code.

You don't have loop counters, array indices, or integer arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
fine with unsigned i, arrays always use a zero base so in Rust the
only array index type is usize, i.e. the largest supported unsigned
type in the system, typically the same as u64.

Loop counters can usually be signed or unsigned, and it usually makes
no difference. Array indices are also usually much the same signed or
unsigned, and it can feel more natural to use size_t here (an unsigned
type). It can make a difference to efficiency, however. On x86-64,
this code is 3 instructions with T as "unsigned long int" or "long
int", 4 with "int", and 5 with "unsigned int".

int foo(int * p, T x) {
     int a = p[x++];
     int b = p[x++];
     return a + b;
}

;; assume *p in rdi, x in rsi

mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret

Yes - that's three instructions for 64-bit type T. (To be clear, I had counted the "ret" here.)

With 32-bit int for T, you need a "movsx rsi, esi" first to sign-extend
the 32-bit int parameter "x" to 64 bits. (That could be different for
different ABIs.) With 32-bit unsigned int for T you need an additional
instruction to make sure the result of the first "x++" is wrapped as
32-bit unsigned.

Or you could just write sane code that matches what you want to say.

:-)

Of course the fine line between "smart code" and "smart-arse code" is somewhat subjective!

It also varies over time, and depends on the needs of the code.
Sometimes it makes sense to prioritise efficiency over readability - but
that is rare, and has been getting steadily rarer over the decades as processors have been getting faster (disproportionally so for
inefficient code) and compilers have been getting better.

Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good object
code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.

I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule
it even better.

It’s older, lesser CPUs where your hand-optimized code might fail hard, and
I know of few examples of that. None, actually.

This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.

Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).


From Stefan Monnier@21:1/5 to All on Mon Sep 9 15:55:32 2024

So it's all up to the programmer, who often doesn't know either.
Other than using CompCert, I don't know of any reliable way for
a programmer to make sure his C code does not suffer from UB.

There is no foolproof or complete method for C. There are other
languages for which formal methods can come closer to proving the
correctness of the code, but for most practical cases this is infeasible.

I'm not talking about proving that your code is correct. I'm talking
about making sure that your code can do only those things that you
wrote, as opposed to the situation with UB which includes all behaviors including those not written in your code.

Any strongly typed language (Javascript, Python, Java, Haskell, ...)
gives you such a guarantee with absolutely no effort required on the
part of the programmer.

Stefan


From MitchAlsup1@21:1/5 to Michael S on Mon Sep 9 20:52:29 2024

On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

When the negative stride can be compared to zero, yes. else no.
But the performance gain is often zero and sometimes negative.


From George Neuner@21:1/5 to Anton Ertl on Mon Sep 9 23:27:24 2024

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.

Adding size_t to a pointer yields another pointer of the same type.

All of gcc, clang and MSVC seem happy with this.

2) Even if there is such a check, you have to be aware that there is a
potential problem with memcpy(). In that case the way to go is to
just use memmove(). But that does not help you with the next "clever"
idea that some compiler or library maintainer has.

- anton


From Thomas Koenig@21:1/5 to Michael S on Tue Sep 10 05:22:32 2024

Michael S <[email protected]> schrieb:

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

Depends on what the alternative is.

For a Fortran assignment

a(n1:n2) = a(n3:n4)

the semantics of the language demand that the RHS is evaluated
completely before the assignment. In the case of the wrong
kind of overlap, a negative stride can be used instead of
using an array temporary.
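
A rough C sketch of that idea, using array indices rather than Fortran sections. The function name copy_within and its parameters are hypothetical, and this is only one way a compiler or library might pick the direction:

```c
#include <stddef.h>

/* Copy n elements from a[src..] to a[dst..] within one array, behaving as if
   the whole source had been read before any element was stored. */
void copy_within(int *a, size_t dst, size_t src, size_t n)
{
    if (dst <= src) {
        /* Forward copy: each store lands at or below the next load. */
        for (size_t i = 0; i < n; i++)
            a[dst + i] = a[src + i];
    } else {
        /* Destination is above the source: walk backwards (negative stride)
           so that no source element is overwritten before it is read. */
        for (size_t i = n; i-- > 0; )
            a[dst + i] = a[src + i];
    }
}
```

With a = {1, 2, 3, 4, 5, 6}, the call copy_within(a, 2, 0, 4) leaves {1, 2, 1, 2, 3, 4}, which is what copying through a temporary would produce.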


From Michael S@21:1/5 to [email protected] on Tue Sep 10 10:35:31 2024

On Mon, 9 Sep 2024 20:52:29 +0000
[email protected] (MitchAlsup1) wrote:

On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

When the negative stride can be compared to zero, yes. else no.
But the performance gain is often zero and sometimes negative.

Direction of the count is not related to the sign of pointer's
stride.


From Michael S@21:1/5 to Thomas Koenig on Tue Sep 10 10:36:55 2024

On Tue, 10 Sep 2024 05:22:32 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

Depends on what the alternative is.

For a Fortran assignment

a(n1:n2) = a(n3:n4)

the semantics of the language demand that the RHS is evaluated
completely before the assignment. In the case of the wrong
kind of overlap, a negative stride can be used instead of
using an array temporary.

That sounds like memmove. The context of discussion was memcpy.


From Michael S@21:1/5 to George Neuner on Tue Sep 10 11:21:01 2024

On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.

According to my understanding, your 'can be considered' part is not
codified in the C Standard.

Adding size_t to a pointer yields another pointer of the same type.

All of gcc, clang and MSVC seem happy with this.

It works. But is it guaranteed to work in the future by some sort of
document? I am pretty sure that no such guarantee exists in gcc and
MSVC docs. I did not look in clang docs. Trying to find anything in
LLVM/clang docs makes me sad.


From Anton Ertl@21:1/5 to George Neuner on Tue Sep 10 08:19:32 2024

George Neuner <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.

Yes, that would be reasonable. Unfortunately, "optimizations" that
assume that undefined behaviour does not happen are not justified by
assigning reasonable meaning to language constructs, but by giving
only the little meaning to language constructs that the standard
requires, and in case of relational comparisons (<, >, etc.) between pointers to
different objects, the C standard does not define a meaning for that.

All of gcc, clang and MSVC seem happy with this.

But the next version of gcc or clang might see such a check and decide
to bite you.

One can cast the pointers into an uintptr_t, and try to do the check
there. AFAIK the result would be implementation-defined, but on an architecture with a flat address space it's unlikely that they will
find a way to compile the code in a different way than the programmer
intended without making "relevant" programs slower.
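
A minimal sketch of such a check, assuming a flat address space in which converting a pointer to uintptr_t yields the plain address. The function name regions_overlap is hypothetical, uintptr_t is an optional type, and the conversion result is implementation-defined, so this is not strictly portable C:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Report whether [dst, dst+len) and [src, src+len) share at least one byte.
   The comparison is done on integers, not on pointers, so no relational
   comparison between pointers to different objects takes place. */
bool regions_overlap(const void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    if (len == 0)
        return false;                          /* empty regions cannot overlap */
    uintptr_t gap = (d > s) ? d - s : s - d;   /* distance between the two starts */
    return gap < len;
}
```

A debugging build could call this before each memcpy() and log or abort when it returns true.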

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to Michael S on Tue Sep 10 12:50:20 2024

On Mon, 9 Sep 2024 15:21:27 +0300
Michael S <[email protected]> wrote:

On Mon, 9 Sep 2024 08:56:45 +0200
Terje Mathisen <[email protected]> wrote:

David Brown wrote:

On 05/09/2024 19:04, Terje Mathisen wrote:

David Brown wrote:

On 05/09/2024 11:12, Terje Mathisen wrote:

David Brown wrote:

Unsigned types are ideal for "raw" memory access or external
data, for anything involving bit manipulation (use of &, |, ^,
<< and >> on signed types is usually wrong, IMHO), as building
blocks in extended arithmetic types, for the few occasions
when you want two's complement wrapping, and for the even
fewer occasions when you actually need that last bit of
range.

That last paragraph enumerates pretty much all the uses I have
for integer-type variables, with (like Mitch) a few apis that
use (-1) as an error signal that has to be handled with special
code.

You don't have loop counters, array indices, or integer
arithmetic?

Loop counters of the for (i= 0; i < LIMIT; i++) type are of
course fine with unsigned i, arrays always use a zero base so in
Rust the only array index type is usize, i.e. the largest
supported unsigned type in the system, typically the same as
u64.

Loop counters can usually be signed or unsigned, and it usually
makes no difference. Array indices are also usually much the same
signed or unsigned, and it can feel more natural to use size_t
here (an unsigned type). It can make a difference to efficiency,
however. On x86-64, this code is 3 instructions with T as
"unsigned long int" or "long int", 4 with "int", and 5 with
"unsigned int".

int foo(int * p, T x) {
    int a = p[x++];
    int b = p[x++];
    return a + b;
}

;; assume *p in rdi, x in rsi

mov rax,[rdi+rsi]
add rax,[rdi+rsi+8]
ret

more like
mov rax,[rdi+rsi*4]
add rax,[rdi+rsi*4+8]
ret

Should be:
mov eax,[rdi+rsi*4]
add eax,[rdi+rsi*4+4]
ret
:(

But that's not the point (==trap).
The point (==trap), I'd guess, is that for T=uint32_t code generator
has to account for possibility of x==2**32-1.
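
To make the wrap-around case concrete, here is a small sketch. The helper names are hypothetical; they compute the index that the second p[x++] in foo() would use for a given starting x:

```c
#include <stdint.h>

/* Index used by the second array access when T is a 32-bit unsigned type.
   The increment is performed modulo 2**32, so UINT32_MAX wraps to 0. */
uint32_t second_index_u32(uint32_t x)
{
    x++;
    return x;
}

/* The same step when T is a 64-bit unsigned type: no 32-bit wrap occurs,
   so the compiler can fold the +1 into the addressing mode. */
uint64_t second_index_u64(uint64_t x)
{
    x++;
    return x;
}
```

For x == 2**32-1 the 32-bit version yields index 0 while the 64-bit version yields 2**32, which is why the 32-bit unsigned case cannot simply address p + 4*x + 4.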


From Michael S@21:1/5 to Anton Ertl on Tue Sep 10 11:49:03 2024

On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)

In theory, on an unusual hardware platform it can give false positives.
Maybe, for the task at hand, that's o.k.


From Anton Ertl@21:1/5 to Thomas Koenig on Tue Sep 10 08:33:45 2024

Thomas Koenig <[email protected]> writes:

Michael S <[email protected]> schrieb:

[on memcpy() where glibc used negative stride on some hardware and
existing binaries no longer worked as intended]

Does hardware on which negative stride is faster really exist?

Depends on what the alternative is.

For a Fortran assignment

a(n1:n2) = a(n3:n4)

the semantics of the language demand that the RHS is evaluated
completely before the assignment. In the case of the wrong
kind of overlap, a negative stride can be used instead of
using an array temporary.

Which is a completely different situation from the one that was
assumed by Ulrich Drepper: that there is no overlap between the source
and the target of memcpy(), and if there is, the programmer "should
never have been allowed to touch a keyboard" (i.e., the user of the
programmer's program deserves the breakage). So Ulrich Drepper
considered himself free to use an arbitrary stride, with no language
semantics limiting him. And he claimed that for some hardware,
negative stride is faster.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From David Brown@21:1/5 to Brett on Tue Sep 10 12:47:37 2024

On 09/09/2024 21:25, Brett wrote:

David Brown <[email protected]> wrote:

Of course the fine line between "smart code" and "smart-arse code" is
somewhat subjective!

It also varies over time, and depends on the needs of the code.
Sometimes it makes sense to prioritise efficiency over readability - but
that is rare, and has been getting steadily rarer over the decades as
processors have been getting faster (disproportionally so for
inefficient code) and compilers have been getting better.

Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good object
code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.

I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule it even better.

I would agree with you there. For the same object code, newer CPUs
(with the same ISA) are typically faster for a variety of reasons.
There may be the odd regression, but it is hard to market a newer CPU if
it is slower than the older ones!

However, my point was that "hand-optimised" source code can lead to
poorer results on newer /compilers/ compared to simpler source code. If
you've googled for "bit twiddling hacks" for cool tricks, or written
something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the
results will be slower with a modern compiler and modern cpu, even
though the "hand-optimised" version might have been faster two decades
ago. You can expect the modern tool to convert the multiplication into
shifts and adds if that is more efficient on the target, or a
multiplication if that is best on the target. But you can't expect the compiler to turn the shifts and adds into a multiplication. (Sometimes
it can, but you can't expect it to.)
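
A minimal sketch of the two source forms being compared. The function names are illustrative, and which form produces better object code depends on the compiler and target:

```c
#include <stdint.h>

/* Plain form: states the intent and leaves instruction selection
   (multiply, shifts and adds, or address arithmetic) to the compiler. */
uint32_t times21_plain(uint32_t x)
{
    return x * 21u;
}

/* "Hand-optimised" form from the post: 16x + 4x + x. */
uint32_t times21_shifts(uint32_t x)
{
    return (x << 4) + (x << 2) + x;
}
```

Both functions return the same value for every input, including inputs where the result wraps modulo 2**32; they differ only in how much freedom the source text leaves the compiler.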

It’s older, lesser CPUs where your hand-optimized code might fail hard, and
I know of few examples of that. None, actually.

This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.

Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).


From Tim Rentsch@21:1/5 to Michael S on Tue Sep 10 07:45:07 2024

Michael S <[email protected]> writes:

On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.

According to my understanding, your 'can be considered' part is not
codified in the C Standard.

Right.

Adding size_t to a pointer yields another pointer of the same type.

All of gcc, clang and MSVC seem happy with this.

It works. But is it guaranteed to work in the future by some sort of
document? I am pretty sure that no such guarantee exists in gcc and
MSVC docs. I did not look in clang docs. Trying to find anything in
LLVM/clang docs makes me sad.

What is being sought is something that works on any implementation
allowed by the C standard, including those that exist only in
someone's imagination.


From Terje Mathisen@21:1/5 to Brett on Tue Sep 10 11:30:28 2024

Brett wrote:

David Brown <[email protected]> wrote:

Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good object
code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.

I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule it even better.

Not true:

My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium I got
it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).

When the PentiumPro came out, it did a 10-20 cycle stall for every pair
of characters, so about an order of magnitude slower in cycle count.
(But only about 3X clock time due to being 200 instead of 60 MHz.)

It’s older, lesser CPUs where your hand-optimized code might fail hard, and
I know of few examples of that. None, actually.

This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.

Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).

Right.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Tim Rentsch@21:1/5 to Michael S on Tue Sep 10 07:37:59 2024

Michael S <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)

In theory, on an unusual hardware platform it can give false positives.
Maybe, for the task at hand, that's o.k.

The challenge is to find portable C that doesn't enter the arena
of undefined behavior (and also detects exactly those cases where
overlap occurs), and that is quite a stringent criterion.

The comparison shown works if src and dst both point to elements
of the same array. But if they don't, comparing pointers to see
if one is less than another (or any of <, <=, >, >=) is undefined
behavior. At the bit level it wouldn't surprise me to learn that
the test shown always returns accurate information. However the
C standard doesn't promise that a bit-level comparison will be
done, and implementations are allowed to do anything at all for
this test in cases where src and dst point to (somewhere within)
different top-level objects. What the hardware does doesn't
matter - what needs to be satisfied are the rules of the C
standard, and they are less forgiving.
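To make the distinction concrete, here is a minimal sketch (the function names are made up for illustration, they are not from the thread). Ordered comparison is only defined when both pointers point into, or one past the end of, the same array; equality comparison is defined for any two valid object pointers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Defined only if p and q point into (or one past the end of) the SAME
   array object. Calling this with pointers into two unrelated objects
   is undefined behaviour, whatever the hardware would do. */
bool before_in_same_array(const char *p, const char *q)
{
    return p < q;
}

/* Defined for any two valid object pointers, related or not:
   == and != carry no same-array requirement. */
bool same_address(const void *p, const void *q)
{
    return p == q;
}
```

So the difficulty in writing a portable overlap test lies entirely in the ordered comparison, not in comparing for equality.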

I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.


From Anton Ertl@21:1/5 to Michael S on Tue Sep 10 15:30:26 2024

Michael S <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:
There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)

In theory, on an unusual hardware platform it can give false positives.

That is probably the original motivation for that lack of definition
(e.g., compare only the offset on large-model 8086).

However, if the compiler ATUBDNH (assumes that undefined behaviour does
not happen), that assumption can lead to the "knowledge" that src and
dst point into the same object, and that may produce unintended results
beyond false positives on some hardware platforms.

I have not heard about a C compiler that has this misfeature, but I
would not be surprised if it shows up at some point (hopefully with
some flag to define the ordering of pointers to different objects).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From George Neuner@21:1/5 to [email protected] on Tue Sep 10 11:33:05 2024

On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
<[email protected]> wrote:

On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you
cannot write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.

According to my understanding, your 'can be considered' part is not
codified in the C Standard.

Adding size_t to a pointer yields another pointer of the same type.

All of gcc, clang and MSVC seem happy with this.

It works. But is it guaranteed to work in the future by some sort of
document? I am pretty sure that no such guarantee exists in gcc and
MSVC docs. I did not look in clang docs. Trying to find anything in
LLVM/clang docs makes me sad.

I know that it has worked as expected with every version of gcc and
Microsoft I've used since 1988. [clang I don't use, but I tried it on godbolt.org with the most recent version]

Will it continue to work ... who knows?

I definitely am NOT an expert on the C standard, but thinking about
it, it occurred to me that if an array is explicitly defined that
*might* cover all memory (or at least all heap), then the compiler
would have to honor any apparent pointers into it.

E.g., char (*all_memory)[] = 0;

None of the compilers at godbolt seem to need this to compare
arbitrary addresses as char*, but all accept it.

Obviously speculation, but it's the best I have.


From Anton Ertl@21:1/5 to David Brown on Tue Sep 10 17:16:07 2024

David Brown <[email protected]> writes:

However, my point was that "hand-optimised" source code can lead to
poorer results on newer /compilers/ compared to simpler source code. If
you've googled for "bit twiddling hacks" for cool tricks, or written
something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the
results will be slower with a modern compiler and modern cpu, even
though the "hand-optimised" version might have been faster two decades
ago. You can expect the modern tool to convert the multiplication into
shifts and adds if that is more efficient on the target, or a
multiplication if that is best on the target. But you can't expect the
compiler to turn the shifts and adds into a multiplication.

Why not? Let's see:

[b3:~/tmp:109062] gcc -Os -c xxx-mul.c && objdump -d xxx-mul.o

xxx-mul.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
0: 48 6b c7 15 imul $0x15,%rdi,%rax
4: c3 ret
[b3:~/tmp:109063] gcc -O3 -c xxx-mul.c && objdump -d xxx-mul.o

xxx-mul.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
0: 48 8d 04 bf lea (%rdi,%rdi,4),%rax
4: 48 8d 04 87 lea (%rdi,%rax,4),%rax
8: c3 ret

So gcc-12 obviously understands that your "hand-optimized" version is equivalent to the multiplication, and with -O3 then decides that the
leas are faster.
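The source file xxx-mul.c is not shown in the post. A plausible reconstruction, offered only as an assumption (the type and the name foo_mul are guesses; only the name foo appears in the disassembly), would be the shift-and-add form next to the plain multiplication it is equivalent to:

```c
/* Hypothetical reconstruction of xxx-mul.c; the original source is not
   quoted in the thread. unsigned long is used so that the shifts are
   well defined for every input value. */
unsigned long foo(unsigned long x)
{
    /* 16x + 4x + x = 21x, written the "hand-optimised" way */
    return (x << 4) + (x << 2) + x;
}

/* The straightforward form that the shifts and adds stand for. */
unsigned long foo_mul(unsigned long x)
{
    return x * 21;
}
```

Both functions compute the same value for every input (modulo 2^N), which is what lets a compiler choose between imul and the lea sequence regardless of how the source is spelled.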

(Sometimes it can, but you can't expect it to.)

That also works the other way.

But it becomes really annoying when I intend it not to perform a transformation, and it performs the transformation, like when writing
"-(x>0)" and the compiler turns that into a conditional branch. These
days gcc does not do that, but I have just seen another twist:

long bar(long x)
{
return -(x>0);
}

gcc-12 -O3 turns this into:

10: 31 c0 xor %eax,%eax
12: 48 85 ff test %rdi,%rdi
15: 0f 9f c0 setg %al
18: f7 d8 neg %eax
1a: 48 98 cltq
1c: c3 ret

So sign-extension optimization is apparently still lacking.
Clang-14 handles this fine:

10: 31 c0 xor %eax,%eax
12: 48 85 ff test %rdi,%rdi
15: 0f 9f c0 setg %al
18: 48 f7 d8 neg %rax
1b: c3 ret
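For readers unfamiliar with the idiom: -(x>0) is typically written to obtain an all-ones or all-zeros mask without a branch. A small sketch of how such a mask is used (the function name is made up for illustration, and nothing is claimed here about what code any particular compiler emits for it):

```c
/* Returns x if x > 0, otherwise 0, using the mask idiom. */
long keep_positive(long x)
{
    /* (x > 0) has type int and value 0 or 1; converting to long before
       negating yields 0 or -1 (all bits set) in the full width of long. */
    long mask = -(long)(x > 0);
    return x & mask;
}
```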

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Brett@21:1/5 to Terje Mathisen on Tue Sep 10 18:03:01 2024

Terje Mathisen <[email protected]> wrote:

Brett wrote:

David Brown <[email protected]> wrote:

Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.

I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule
it even better.

Not true:

My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium I got
it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).

When the PentiumPro came out, it did a 10-20 cycle stall for every pair
of characters, so about an order of magnitude slower in cycle count.
(But only about 3X clock time due to being 200 instead of 60 MHz.)

But how big a slowdown did the unoptimized code get?

Are you describing a glass jaw handling unpredictable branches on a CPU
with a much longer pipeline?

A shorter pipeline with better worst case handling is going to do better,
even if older. Intel was going for high clock benchmark speed, not
performance.

It’s older, lesser CPUs where your hand optimized code might fail hard, and
I know of few examples of that. None, actually.

This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.

Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).

Right.

Terje


From MitchAlsup1@21:1/5 to Michael S on Tue Sep 10 18:36:38 2024

On Tue, 10 Sep 2024 7:35:31 +0000, Michael S wrote:

On Mon, 9 Sep 2024 20:52:29 +0000
[email protected] (MitchAlsup1) wrote:

On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

When the negative stride can be compared to zero, yes. else no.
But the performance gain is often zero and sometimes negative.

Direction of the count is not related to the sign of pointer's
stride.

For the record; I was responding to an array index stride not a
pointer stride.


From David Brown@21:1/5 to Anton Ertl on Tue Sep 10 20:45:53 2024

On 10/09/2024 19:16, Anton Ertl wrote:

David Brown <[email protected]> writes:

However, my point was that "hand-optimised" source code can lead to
poorer results on newer /compilers/ compared to simpler source code. If
you've googled for "bit twiddling hacks" for cool tricks, or written
something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the
results will be slower with a modern compiler and modern cpu, even
though the "hand-optimised" version might have been faster two decades
ago. You can expect the modern tool to convert the multiplication into
shifts and adds if that is more efficient on the target, or a
multiplication if that is best on the target. But you can't expect the
compiler to turn the shifts and adds into a multiplication.

Why not? Let's see:

[b3:~/tmp:109062] gcc -Os -c xxx-mul.c && objdump -d xxx-mul.o

xxx-mul.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
0: 48 6b c7 15 imul $0x15,%rdi,%rax
4: c3 ret
[b3:~/tmp:109063] gcc -O3 -c xxx-mul.c && objdump -d xxx-mul.o

xxx-mul.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
0: 48 8d 04 bf lea (%rdi,%rdi,4),%rax
4: 48 8d 04 87 lea (%rdi,%rax,4),%rax
8: c3 ret

So gcc-12 obviously understands that your "hand-optimized" version is equivalent to the multiplication, and with -O3 then decides that the
leas are faster.

(Sometimes it can, but you can't expect it to.)

Again - sometimes a compiler will recognise a particular hand-optimised pattern, turn it back to something logically simpler, then optimise from
there. But you cannot /expect/ that. On the whole, compilers are more
likely to recognise clear and simple patterns than complex ones,
especially using bit manipulation in odd ways.

There will always be exceptions, this is just a general rule.

And a related general rule is that /humans/ are much better at
understanding clear code written in a logical way, than something weird
and hand-optimised.

That also works the other way.

But it becomes really annoying when I intend it not to perform a transformation, and it performs the transformation, like when writing "-(x>0)" and the compiler turns that into a conditional branch. These
days gcc does not do that, but I have just seen another twist:

long bar(long x)
{
return -(x>0);
}

gcc-12 -O3 turns this into:

10: 31 c0 xor %eax,%eax
12: 48 85 ff test %rdi,%rdi
15: 0f 9f c0 setg %al
18: f7 d8 neg %eax
1a: 48 98 cltq
1c: c3 ret

So sign-extension optimization is apparently still lacking.
Clang-14 handles this fine:

10: 31 c0 xor %eax,%eax
12: 48 85 ff test %rdi,%rdi
15: 0f 9f c0 setg %al
18: 48 f7 d8 neg %rax
1b: c3 ret

One day, perhaps, compilers will be perfect. But not yet :-(


From Michael S@21:1/5 to Tim Rentsch on Tue Sep 10 22:27:02 2024

On Tue, 10 Sep 2024 07:37:59 -0700
Tim Rentsch <[email protected]> wrote:

I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.

Unfortunately, my solution is wrong and the mistake is not even subtle.


From Stefan Monnier@21:1/5 to All on Tue Sep 10 15:09:41 2024

Again - sometimes a compiler will recognise a particular hand-optimised pattern, turn it back to something logically simpler, then optimise from there. But you cannot /expect/ that.

You might even consider those as performance bugs, since the
hand-optimized code is sometimes chosen specifically to try and impose
a particular kind of code. Compilers' "optimizations" are usually just
heuristics, so compilers are often better off not being "too clever" so
as to allow manual optimization to override the heuristics: if
programmers want to use the heuristics, they should write
simple & clear code.

Stefan


From Michael S@21:1/5 to Brett on Tue Sep 10 22:34:11 2024

On Tue, 10 Sep 2024 18:03:01 -0000 (UTC)
Brett <[email protected]> wrote:

Terje Mathisen <[email protected]> wrote:

Brett wrote:

David Brown <[email protected]> wrote:

Often you get the most efficient results by writing code clearly
and simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same
source to be used on different targets or different variants of a
target - few people can track the instruction scheduling and
timings on multiple processors better than a good compiler. (And
the few people who /can/ do that spend their time chatting in
comp.arch instead of writing code...) When you do hand-made
micro-optimisations, these can work against the compiler and give
poorer results overall.

I know of no example where hand optimized code does worse on a
newer CPU. A newer CPU with bigger OoOe will effectively unroll
your code and schedule it even better.

Not true:

My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium
I got it to run at 1.5 clock cycles per character (40 MB/s on a 60
MHz Pentium).

When the PentiumPro came out, it did a 10-20 cycle stall for every
pair of characters, so about an order of magnitude slower in cycle
count. (But only about 3X clock time due to being 200 instead of 60
MHz.)

But how big a slowdown did the unoptimized code get?

Are you describing a glass jaw handling unpredictable branches on a
CPU with a much longer pipeline?

No, the glass jaw of PPro described by Terje is known as partial
register stall.

A shorter pipeline with better worst case handling is going to do
better, even if older. Intel was going for high clock benchmark
speed, not performance.

Typically, PPro was much faster than Pentium clock-for-clock,
especially so when running 32-bit software.
But it had a few weak points.


From Michael S@21:1/5 to [email protected] on Tue Sep 10 22:35:16 2024

On Tue, 10 Sep 2024 18:36:38 +0000
[email protected] (MitchAlsup1) wrote:

On Tue, 10 Sep 2024 7:35:31 +0000, Michael S wrote:

On Mon, 9 Sep 2024 20:52:29 +0000
[email protected] (MitchAlsup1) wrote:

On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

On Mon, 09 Sep 2024 07:07:25 GMT
[email protected] (Anton Ertl) wrote:

Does hardware on which negative stride is faster really exist?

When the negative stride can be compared to zero, yes. else no.
But the performance gain is often zero and sometimes negative.

Direction of the count is not related to the sign of pointer's
stride.

For the record; I was responding to an array index stride not a
pointer stride.

Same thing


From Josh Vanderhoof@21:1/5 to Anton Ertl on Tue Sep 10 16:44:21 2024

[email protected] (Anton Ertl) writes:

George Neuner <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The result of comparing pointers to two elements of the same array is
defined. Cast to (char*), both src and dst can be considered to point
to elements of the [address space sized] char array at address zero.

Yes, that would be reasonable. Unfortunately, "optimizations" that
assume that undefined behaviour does not happen are not justified by
assigning reasonable meaning to language constructs, but by giving
only the little meaning to language constructs that the standard
requires, and in the case of ordered comparisons (<, <=, >, >=) between
pointers to different objects, the C standard does not define a meaning.

All of gcc, clang and MSVC seem happy with this.

But the next version of gcc or clang might see such a check and decide
to bite you.

One can cast the pointers into an uintptr_t, and try to do the check
there. AFAIK the result would be implementation-defined, but on an architecture with a flat address space it's unlikely that they will
find a way to compile the code in a different way than the programmer intended without making "relevant" programs slower.

It is legal to test for equality between pointers to different objects
so you could test for overlap by testing against every element in the
array. It seems like it should be possible for the compiler to figure
out what's happening and optimize those tests away, but unfortunately
no compiler I tested did it.
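A minimal sketch of that equality-only test, assuming both regions have the same length len as in a memcpy() call (the function name is made up for illustration). It only ever uses ==, which is defined between pointers to different objects, and it only forms pointers that lie inside the two regions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Two regions of equal length overlap exactly when the start of one
   lies inside the other, so it is enough to compare each start against
   every element address of the other region. O(len) unless a compiler
   learns to collapse the loop. */
bool overlaps_by_equality(const char *src, const char *dst, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (src + i == dst || dst + i == src)
            return true;
    }
    return false;
}
```

The price of staying inside defined behaviour is the linear loop, which is the cost complained about above.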


From Michael S@21:1/5 to Michael S on Wed Sep 11 02:24:13 2024

On Tue, 10 Sep 2024 22:27:02 +0300
Michael S <[email protected]> wrote:

On Tue, 10 Sep 2024 07:37:59 -0700
Tim Rentsch <[email protected]> wrote:

I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.

Unfortunately, my solution is wrong and the mistake is not even subtle.

This one appears to work: (src < dst+len) == (dst < src+len)
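A sketch of that predicate, with two assumptions that are not part of the post: the arithmetic is done on uintptr_t (an optional type whose conversion from pointers is implementation-defined, but meaningful on flat address spaces), and the function name is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* (src < dst+len) == (dst < src+len), evaluated on integers.
   For len > 0 the two comparisons cannot both be false, so == behaves
   like && here. Caveats: with len == 0 and src == dst both sides are
   false and the result is "true"; regions that wrap around the top of
   the address space are not handled. */
bool ranges_overlap(const void *src, const void *dst, size_t len)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    return (s < d + len) == (d < s + len);
}
```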


From Brett@21:1/5 to Michael S on Wed Sep 11 05:47:59 2024

Michael S <[email protected]> wrote:

On Tue, 10 Sep 2024 18:03:01 -0000 (UTC)
Brett <[email protected]> wrote:

Terje Mathisen <[email protected]> wrote:

Brett wrote:

David Brown <[email protected]> wrote:

Often you get the most efficient results by writing code clearly
and simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same
source to be used on different targets or different variants of a
target - few people can track the instruction scheduling and
timings on multiple processors better than a good compiler. (And
the few people who /can/ do that spend their time chatting in
comp.arch instead of writing code...) When you do hand-made
micro-optimisations, these can work against the compiler and give
poorer results overall.

I know of no example where hand optimized code does worse on a
newer CPU. A newer CPU with bigger OoOe will effectively unroll
your code and schedule it even better.

Not true:

My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium
I got it to run at 1.5 clock cycles per character (40 MB/s on a 60
MHz Pentium).

When the PentiumPro came out, it did a 10-20 cycle stall for every
pair of characters, so about an order of magnitude slower in cycle
count. (But only about 3X clock time due to being 200 instead of 60
MHz.)

But how big a slowdown did the unoptimized code get?

Are you describing a glass jaw handling unpredictable branches on a
CPU with a much longer pipeline?

No, the glass jaw of PPro described by Terje is known as partial
register stall.

That is an exception that proves the rule. ;)

A shorter pipeline with better worst case handling is going to do
better, even if older. Intel was going for high clock benchmark
speed, not performance.

Typically, PPro was much faster than Pentium clock-for-clock,
especially so when running 32-bit software.
But it had a few weak points.


From Terje Mathisen@21:1/5 to Tim Rentsch on Wed Sep 11 13:07:33 2024

Tim Rentsch wrote:

Michael S <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)

Does that work for dst < src? What if dst+len < src?

I.e. no overlap?

The first test will be false while the second test will always be true
when src >= dst, so I think it will have false positives?

What about:

max(src,dst) < (min(src,dst)+len)

If you have a min/max circuit, i.e a two-element sorter, then it could
be quite efficient, otherwise run the min first, then the max and the
add during the second cycle, before the less than test in the third cycle.
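In C the min/max form could be sketched as follows. This is an illustration only: it assumes a flat address space and does the comparison on uintptr_t values (implementation-defined conversion), and the function name is made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* max(src,dst) < min(src,dst) + len, rearranged to
   max - min < len so that the addition cannot wrap around. */
bool overlap_minmax(const void *src, const void *dst, size_t len)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    uintptr_t lo = s < d ? s : d;   /* min(src, dst) */
    uintptr_t hi = s < d ? d : s;   /* max(src, dst) */
    return hi - lo < len;
}
```

With len == 0 this reports no overlap even when src == dst.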

In theory, on an unusual hardware platform it can give false positives.
Maybe, for the task at hand, that's o.k.

The challenge is to find portable C that doesn't enter the arena
of undefined behavior (and also detects exactly those cases where
overlap occurs), and that is quite a stringent criterion.

The comparison shown works if src and dst both point to elements
of the same array. But if they don't, comparing pointers to see
if one is less than another (or any of <, <=, >, >=) is undefined
behavior. At the bit level it wouldn't surprise me to learn that
the test shown always returns accurate information. However the
C standard doesn't promise that a bit-level comparison will be
done, and implementations are allowed to do anything at all for
this test in cases where src and dst point to (somewhere within)
different top-level objects. What the hardware does doesn't
matter - what needs to be satisfied are the rules of the C
standard, and they are less forgiving.

I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.

I do believe though that in reality it could be faster to use the
branchy version, and let the branch predictors do their job instead of
having to wait to evaluate all three terms:

bool is_overlap(char *src, char *dst, size_t len)
{
if (src < dst) {
return (src+len > dst);
}
return (dst+len > src);
}

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to Brett on Wed Sep 11 13:31:50 2024

Brett wrote:

Terje Mathisen <[email protected]> wrote:

Brett wrote:

David Brown <[email protected]> wrote:

Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.

I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule
it even better.

Not true:

My favorite benchmark program for 20+ years was Word Count, I
re-optimized that for every new x86 generation, and on the Pentium I got
it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).
When the PentiumPro came out, it did a 10-20 cycle stall for every pair
of characters, so about an order of magnitude slower in cycle count.
(But only about 3X clock time due to being 200 instead of 60 MHz.)

But how big a slowdown did the unoptimized code get?

The gcc-optimized unix wc was probably still slower than my glass
jaw-hitting asm code: The issue was partial register stalls, where I had
been using the relatively tricky concept of interleaving updates to the
BL and BH halves of BX, then using BX to index into a table of combined
word and line increments:

add dx,ax
mov ax,incr_table[bx]
mov bl,extra_segment[di]
mov di,[si+offset]

followed by

add dx,ax
mov ax,incr_table[bx+16] ;; Transposed table interleaved at +16
mov bh,extra_segment[di]
mov di,[si+offset+2]

All of the above unrolled 64 times so that the code would load & count
256 characters with zero branches.
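For readers who do not read 16-bit x86 assembly, the underlying idea (classify each character, then look up a combined word+line increment from the previous and current class) can be sketched in C. This is not a translation of the code above: the BL/BH pairing, the transposed table and the unrolling are left out, and the names, the three-way classification and the 32/32-bit packing are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { size_t lines, words; } wc_result;

/* Table-driven word/line count: no data-dependent branches in the loop.
   Words are counted in the low 32 bits of one accumulator and lines in
   the high 32 bits, so this sketch assumes fewer than 2^32 words. */
wc_result wc_table(const unsigned char *buf, size_t n)
{
    /* Character classes: 0 = word character, 1 = blank, 2 = newline */
    uint8_t cls[256];
    for (int c = 0; c < 256; c++)
        cls[c] = (c == '\n') ? 2
               : (c == ' ' || c == '\t' || c == '\r' ||
                  c == '\v' || c == '\f') ? 1 : 0;

    /* incr[prev][cur]: +1 word when a word starts, +1 line on newline */
    uint64_t incr[3][3];
    for (int prev = 0; prev < 3; prev++)
        for (int cur = 0; cur < 3; cur++)
            incr[prev][cur] = ((cur == 2) ? (UINT64_C(1) << 32) : 0)
                            + ((prev != 0 && cur == 0) ? 1 : 0);

    uint64_t packed = 0;
    unsigned prev = 1;              /* treat start of buffer as blank */
    for (size_t i = 0; i < n; i++) {
        unsigned cur = cls[buf[i]];
        packed += incr[prev][cur];
        prev = cur;
    }
    return (wc_result){ (size_t)(packed >> 32),
                        (size_t)(packed & 0xffffffffu) };
}
```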

Are you describing a glass jaw handling unpredictable branches on a CPU
with a much longer pipeline?

Partial register stalls were the single largest glass jaw on the
PentiumPro, but they were very rare in compiled code.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Michael S@21:1/5 to Terje Mathisen on Wed Sep 11 14:51:16 2024

On Wed, 11 Sep 2024 13:07:33 +0200
Terje Mathisen <[email protected]> wrote:

Tim Rentsch wrote:

Michael S <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?

The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)

Does that work for dst < src? What if dst+len < src?

No, it doesn't. See the followup post.


From Anton Ertl@21:1/5 to Josh Vanderhoof on Wed Sep 11 10:38:24 2024

Josh Vanderhoof <[email protected]> writes:

[email protected] (Anton Ertl) writes:

George Neuner <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

...

It is legal to test for equality between pointers to different objects
so you could test for overlap by testing against every element in the
array. It seems like it should be possible for the compiler to figure
out what's happening and optimize those tests away, but unfortunately
no compiler I tested did it.

That would be an interesting result of the ATUBDNH lunacy: programmers
would see themselves forced to write workarounds such as the one you
suggest (with terrible performance when not optimized), and then C
compiler maintainers would see themselves forced to optimize this kind
of code. The end result would be that both parties have to put in
more effort to eventually get the same result as if ordered comparison
between different objects had been defined from the start.

For now, the ATUBDNH advocates tell programmers that they have to work
around the lack of definition, but there is usually no optimization
for that.

One case where things work somewhat along the lines you suggest is
unaligned accesses. Traditionally, if knowing that the hardware
supports unaligned accesses, for a 16-bit load one would write:

int16_t foo1(int16_t *p)
{
return *p;
}

If one does not know that the hardware supports unaligned accesses,
the traditional way to perform such an access (little-endian) is
something like:

int16_t foo2(int16_t *p)
{
unsigned char *q = (unsigned char *) p;
return (int16_t)(q[0] + (q[1] << 8));
}

Now, several years ago, somebody told me that the proper way is as
follows:

int16_t foo3(int16_t *p)
{
int16_t v;
memcpy(&v,p,2);
return v;
}

That way looked horribly inefficient to me, with v having to reside in
memory instead of in a register and then the expensive function call,
and all the decisions that memcpy() has to take depending on the
length argument. But gcc optimizes this idiom into an unaligned load
rather than taking all the steps that I expected (however, I have seen
cases where the code produced on hardware that supports unaligned
accesses is worse than that for foo1()). Of course, if you also want
to support less sophisticated compilers, this idiom may be really slow
on those, although not quite as expensive as your containment check.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Terje Mathisen on Wed Sep 11 15:35:00 2024

On Wed, 11 Sep 2024 13:07:33 +0200
Terje Mathisen <[email protected]> wrote:

I do believe though that in reality it could be faster to use the
branchy version, and let the branch predictors do their job instead
of having to wait to evaluate all three terms:

bool is_overlap(char *src, char *dst, size_t len)
{
if (src < dst) {
return (src+len > dst);
}
return (dst+len > src);
}

Terje

I think that under assumptions that overlaps are very rare and that we
have wide OoO CPU, one-branch solution would be faster than multiple
branches.
Assuming Windows x64 coding conventions (dst==RCX, src==RDX, len=R8)
and using algorithm that I posted at night:

lea rax, [rcx,r8] ; rax = dst+len
lea r9, [rdx,r8] ; r9 = src+len
cmp rdx, rax
setb al ; al = src < dst+len
cmp rcx, r9
setb r9b ; r9b = dst < src+len
cmp al, r9b
je handle_overlap
; there is no overlap

The important observation here is that for as long as the branch
predictor correctly predicts that the branch is not taken, all previous
calculations are off the critical latency path. So, the fact that
there are 7 instructions before the branch that have latency of ~4 clocks
does not matter.

On the other hand, in your branchy variant the second branch is easy
to predict, but the first branch if (src < dst) not necessarily easy.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Wed Sep 11 08:08:30 2024

Michael S <[email protected]> writes:

On Tue, 10 Sep 2024 07:37:59 -0700
Tim Rentsch <[email protected]> wrote:

I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.

Unfortunately, my solution is wrong and mistake is not even subtle.

Oh that's okay, I appreciate it nonetheless. :)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Tim Rentsch on Wed Sep 11 08:07:34 2024

Tim Rentsch <[email protected]> writes:

Michael S <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT
[email protected] (Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will work
on platforms one doesn't have, but there is a relatively simple
and portable way to tell if some memcpy() call crosses over into
the realm of undefined behavior.

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

The check that reliably catches all overlaps seems easy.
E.g. (src <= dst) == (src+len > dst)

In theory, on unusual hardware platform it can give false positives.
May be, for task in hand that's o.k.

The challenge is to find portable C that doesn't enter the arena
of undefined behavior (and also detects exactly those cases where
overlap occurs), and that is quite a stringent criterion.

The comparison shown works if src and dst both point to elements
of the same array. [...]

Sorry, that statement isn't right. I accepted the stated test as
being accurate without checking it myself. My bad. :)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Wed Sep 11 09:08:12 2024

Michael S <[email protected]> writes:

On Tue, 10 Sep 2024 22:27:02 +0300
Michael S <[email protected]> wrote:

On Tue, 10 Sep 2024 07:37:59 -0700
Tim Rentsch <[email protected]> wrote:

I should add that I appreciate your proposed solution; it's
better than what I think I would have come up with under a
similar set of assumptions.

Unfortunately, my solution is wrong and mistake is not even subtle.

This one appears to work: (src < dst+len) == (dst < src+len)

Yes. This time I checked it. :)
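
A small brute-force harness is one way to gain confidence in this test.
The sketch below is an illustration, not code from the thread: the names
overlap_eq, overlap_ref and count_mismatches are hypothetical, both
pointers are kept inside one buffer so that the relational comparisons
stay defined, and len > 0 is assumed (for len == 0 with src == dst the
expression reports an overlap).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Overlap test quoted above. Both pointers must point into the same
   array; otherwise the relational comparisons are undefined. */
static bool overlap_eq(const char *src, const char *dst, size_t len)
{
    assert(len > 0);   /* assumption: empty intervals are not checked */
    return (src < dst + len) == (dst < src + len);
}

/* Reference answer computed from array indices instead of pointers. */
static bool overlap_ref(size_t s, size_t d, size_t len)
{
    return s < d + len && d < s + len;
}

/* Compare the two for every placement inside one buffer.
   Returns the number of disagreements (0 means the test held). */
static unsigned count_mismatches(void)
{
    enum { N = 16 };
    static char buf[N];
    unsigned bad = 0;

    for (size_t len = 1; len <= 5; len++)
        for (size_t s = 0; s + len <= N; s++)
            for (size_t d = 0; d + len <= N; d++)
                if (overlap_eq(buf + s, buf + d, len) != overlap_ref(s, d, len))
                    bad++;
    return bad;
}
```

For len > 0 at least one of the two comparisons is always true, so the
equality holds exactly when both are true, which is the overlap case.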

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to All on Wed Sep 11 09:29:04 2024

Josh Vanderhoof <[email protected]> writes:

[how to write a portable, UB-free check if memcpy() intervals overlap]

It is legal to test for equality between pointers to different objects

Right. This observation is the key insight.

so you could test for overlap by testing against every element in the
array.

For a complete test, compare the address of every element in both
arrays. For example:

#include <stddef.h>

_Bool
memcpy_intervals_overlap( void *const vd, void *const vs, size_t n ){
char *d = vd, *s = vs;
size_t k = 0;

while( k < n && d != vs && s != vd ) k++, d++, s++;

return k < n;
}
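
One possible use of such a check is as a debug-time guard around
memcpy(). The sketch below is an assumption for illustration only:
checked_memcpy is a hypothetical name, and the check is repeated so that
the example compiles on its own. Since it compares every element, its
cost is linear in n.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Equality-only overlap check, repeated from the post above. */
static _Bool
memcpy_intervals_overlap( void *const vd, void *const vs, size_t n ){
    char *d = vd, *s = vs;
    size_t  k = 0;

    while(  k < n  &&  d != vs  &&  s != vd  )   k++, d++, s++;

    return  k < n;
}

/* Hypothetical wrapper: fail an assertion in debug builds if the two
   intervals overlap, otherwise forward to memcpy(). */
static void *
checked_memcpy( void *vd, void *vs, size_t n ){
    assert( !memcpy_intervals_overlap( vd, vs, n ) );
    return  memcpy( vd, vs, n );
}
```

With NDEBUG defined the assertion disappears and only the plain
memcpy() call remains.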

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Wed Sep 11 19:52:21 2024

On Wed, 11 Sep 2024 09:29:04 -0700
Tim Rentsch <[email protected]> wrote:

Josh Vanderhoof <[email protected]> writes:

[how to write a portable, UB-free check if memcpy() intervals overlap]

It is legal to test for equality between pointers to different
objects

Right. This observation is the key insight.

Real mode x86 C compilers operating in Large and Compact Models that
were popular on IBM-compatible PCs 30-40 years ago could have more than
one representation for the pointer to the same memory location. If my
memory serves me, the rules of pointers comparison for equality were
the same as rules of comparison for <>. In both cases for reliable
result pointers had to be explicitly normalized (i.e. converted from
'far' to 'huge' or something like that).

It was long time ago and even back then I didn't use Large model very
often, so it's possible that I misremember. But if I remember
correctly, does it mean that those C compilers now would be considered
non-compliant?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Wed Sep 11 17:34:38 2024

Michael S <[email protected]> writes:

On Wed, 11 Sep 2024 09:29:04 -0700
Tim Rentsch <[email protected]> wrote:

Josh Vanderhoof <[email protected]> writes:

[how to write a portable, UB-free check if memcpy() intervals overlap]

It is legal to test for equality between pointers to different
objects

Right. This observation is the key insight.

Real mode x86 C compilers operating in Large and Compact Models that
were popular on IBM-compatible PCs 30-40 years ago could have more than
one representation for the pointer to the same memory location. If my
memory serves me, the rules of pointers comparison for equality were
the same as rules of comparison for <>. In both cases for reliable
result pointers had to be explicitly normalized (i.e. converted from
'far' to 'huge' or something like that).

It was long time ago and even back then I didn't use Large model very
often, so it's possible that I misremember. But if I remember
correctly, does it mean that those C compilers now would be considered
non-compliant?

The C standard was first ratified (by ANSI) in 1989. The rules
for pointer comparison were clarified in the C99 standard, but it
has always been true that pointers to the same object have to
compare equal.

C environments that have things like 'far' or 'huge' pointers,
etc, are not standard C but must have extensions so that they can
deal with the different kinds of pointers. Depending on how the
non-standard kinds of pointer worked, the implementation might or
might not be conforming. Most likely though it's a moot point
because once a program starts using an extension all the rules
can change, and the C standard allows that. It's only programs
that look like really standard C that have to do what the C
standard says (for the implementation to be conforming); any
code that declares a 'far' pointer or 'huge' pointer certainly
isn't standard C.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to BGB on Thu Sep 12 03:12:11 2024

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianess (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to George Neuner on Thu Sep 12 04:04:06 2024

George Neuner <[email protected]> writes:

On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
<[email protected]> wrote:

On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will
work on platforms one doesn't have, but there is a relatively
simple and portable way to tell if some memcpy() call crosses
over into the realm of undefined behavior.

1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?

The result of comparing pointers to two elements of the same array
is defined. Cast to (char*), both src and dst can be considered
to point to elements of the [address space sized] char array at
address zero.

According to my understanding, your 'can be considered' part is not
codified in the C Standard.

Adding size_t to a pointer yields another pointer of the same
type.

In terms of types, that is right, but the addition works only if
the pointer points into an array large enough to include the
result of the addition (the result is also allowed to be just one
past the end of the array).

All of gcc, clang and MSVC seem happy with this.

It works. But is it guaranteed to work in the future by some sort
of document? I am pretty sure that no such guarantee exists in gcc
and MSVC docs. I did not look in clang docs. Trying to find
anythings in LLVM/clang docs makes me sad.

I know that it has worked as expected with every version of gcc
and Microsoft I've used since 1988. [clang I don't use, but I
tried it on godbolt.org with the most recent version]

Will it continue to work ... who knows?

I definitely am NOT an expert on the C standard, but thinking
about it, it occurred to me that if an array is explicitly defined
that *might* cover all memory (or at least all heap), then the
compiler would have to honor any apparent pointers into it.

E.g., char (*all_memory)[] = 0;

This declaration introduces a pointer, not an array. Similarly
the declaration

char (*great_white_array)[ 999999999999999999 ] = 0;

does not introduce an array but just a pointer (and initializes
the pointer to be a null pointer). There is no humongous array.

None of the compilers at godbolt seem to need this to compare
arbitrary addresses as char*, but all accept it.

The given declaration of 'all_memory' is strictly conforming.
It must be accepted by any conforming C implementation (which
all of gcc, clang, and MSVC purport to be, IIUC).

Obviously speculation, but it's the best I have.

It's important to realize that there are two distinct questions.
One, does the code work (in a given implementation)? Two, does
the code satisfy the rules given in the C standard?

Unfortunately having an answer to the first question does not by
itself give enough information to answer the second question.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Thu Sep 12 14:10:33 2024

On Wed, 11 Sep 2024 17:34:38 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Wed, 11 Sep 2024 09:29:04 -0700
Tim Rentsch <[email protected]> wrote:

Josh Vanderhoof <[email protected]> writes:

[how to write a portable, UB-free check if memcpy() intervals
overlap]

It is legal to test for equality between pointers to different
objects

Right. This observation is the key insight.

Real mode x86 C compilers operating in Large and Compact Models that
were popular on IBM-compatible PCs 30-40 years ago could have more
than one representation for the pointer to the same memory
location. If my memory serves me, the rules of pointers comparison
for equality were the same as rules of comparison for <>. In both
cases for reliable result pointers had to be explicitly normalized
(i.e. converted from 'far' to 'huge' or something like that).

It was long time ago and even back then I didn't use Large model
very often, so it's possible that I misremember. But if I remember
correctly, does it mean that those C compilers now would be
considered non-compliant?

The C standard was first ratified (by ANSI) in 1989. The rules
for pointer comparison were clarified in the C99 standard, but it
has always been true that pointers to the same object have to
compare equal.

C environments that have things like 'far' or 'huge' pointers,
etc, are not standard C but must have extensions so that they can
deal with the different kinds of pointers. Depending on how the
non-standard kinds of pointer worked, the implementation might or
might not be conforming. Most likely though it's a moot point
because once a program starts using an extension all the rules
can change, and the C standard allows that. It's only programs
that look like really standard C that have to do what the C
standard says (for the implementation to be conforming); any
code that declares a 'far' pointer or 'huge' pointer certainly
isn't standard C.

In Compact and Large models data pointers are 'far' by default. So,
the source doesn't have to use non-standard declarations.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Thu Sep 12 14:29:48 2024

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianess (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to
use. Whether or not the target/compiler allows misaligned memory
access; If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Anton Ertl on Thu Sep 12 06:06:46 2024

[email protected] (Anton Ertl) writes:

[considering which way to copy with memmove()]

If the two memory blocks don't overlap, memmove() can use the
fastest stride. [...]

The way to go for memmove() is:

On hardware where positive stride is faster:

if (((uintptr)(dest-src)) >= len)
return memcpy_posstride(dest,src,len)
else
return memcpy_negstride(dest,src,len)

On hardware where the negative stride is faster:

if (((uintptr)(src-dest)) >= len)
return memcpy_negstride(dest,src,len)
else
return memcpy_posstride(dest,src,len)

And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.

Code inside the implementation is allowed to exploit internal
knowledge.

The benefit of this comparison over just comparing the addresses
is that the branch will have a much lower miss rate.

It's a clever idea. It suffers from a few shortcomings.

First, the type name is uintptr_t. Also, uintptr_t might not
exist.

Second, uintptr_t might be small, leading to incorrect behavior
in some cases. Better to use a large unsigned type that is
known to exist, either unsigned long long or uintmax_t.

Third, pointer subtraction is not guaranteed to work for large
differences because ptrdiff_t might not be big enough. This is
just a technicality because presumably the implementation would
know how big ptrdiff_t is and wouldn't use this approach if it
were too small. That said, it's something to keep in mind if the
code is meant to be used on other systems.

Last but not least, having two different code blocks for the
different preferences is clunky. The two blocks can be
combined by fusing the two test expressions into a single
expression, as for example

#ifndef PREFER_UPWARDS
#define PREFER_UPWARDS 1
#endif/*PREFER_UPWARDS*/

extern void* ascending_copy( void*, const void*, size_t );
extern void* descending_copy( void*, const void*, size_t );

void *
good_memmove( void *vd, const void *vs, size_t n ){
const char *d = vd;
const char *s = vs;
_Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;

return
upwards
? ascending_copy( vd, vs, n )
: descending_copy( vd, vs, n );
}

Using the preprocessor symbol PREFER_UPWARDS to select between
the two preferences (ascending or descending) allows the choice
to be made by a -D compiler option, and we can expect the compiler
to optimize away the part of the test that is never used.
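
To try the direction test in isolation, the two copy routines can be
given simple byte-wise bodies. The sketch below is only an assumption
about how they might look: the definitions of ascending_copy and
descending_copy and the name sketch_memmove are hypothetical. The
pointer subtraction is defined only when both pointers refer to the same
array, which is what the example relies on.

```c
#include <assert.h>
#include <stddef.h>

#ifndef PREFER_UPWARDS
#define PREFER_UPWARDS 1
#endif

/* Hypothetical byte-wise copy, lowest address first. */
static void *ascending_copy(void *vd, const void *vs, size_t n)
{
    char *d = vd;
    const char *s = vs;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return vd;
}

/* Hypothetical byte-wise copy, highest address first. */
static void *descending_copy(void *vd, const void *vs, size_t n)
{
    char *d = vd;
    const char *s = vs;
    for (size_t i = n; i-- > 0; )
        d[i] = s[i];
    return vd;
}

/* Same single-comparison direction test as in the post above. */
static void *sketch_memmove(void *vd, const void *vs, size_t n)
{
    assert(vd != NULL && vs != NULL);
    const char *d = vd;
    const char *s = vs;
    /* A negative difference wraps to a huge unsigned value, so one
       unsigned comparison covers both "below" and "far above". */
    _Bool upwards = PREFER_UPWARDS ? d - s + 0ull >= n : s - d + 0ull < n;

    return upwards ? ascending_copy(vd, vs, n)
                   : descending_copy(vd, vs, n);
}
```

Moving four bytes of "abcdefgh" from offset 0 to offset 2 takes the
descending path, and moving from offset 2 to offset 0 takes the
ascending path when PREFER_UPWARDS is 1.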

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Anton Ertl on Thu Sep 12 06:17:24 2024

[email protected] (Anton Ertl) writes:

Josh Vanderhoof <[email protected]> writes:

[email protected] (Anton Ertl) writes:

George Neuner <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?

...

It is legal to test for equality between pointers to different
objects so you could test for overlap by testing against every
element in the array. It seems like it should be possible for the
compiler to figure out what's happening and optimize those tests
away, but unfortunately no compiler I tested did it.

That would be an interesting result of the ATUBDNH lunacy:
programmers would see themselves forced to write workarounds such
as the one you suggest (with terrible performance when not
optimized), and then C compiler maintainers would see themselves
forced to optimize this kind of code. The end result would be
that both parties have to put in more effort to eventually get the
same result as if ordered comparison between different objects had
been defined from the start.

For now, the ATUBDNH advocates tell programmers that they have to
work around the lack of definition, but there is usually no
optimization for that.

This reaction doesn't fit the case here. The C standard already
provides a way to do what is needed, namely memmove(). The code
being discussed in this thread is relevant only because someone
(may have) wrongly used memcpy() rather than memmove(). As has
been pointed out, all of the worries around this problem can be
avoided by simply using memmove() rather than memcpy().

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to BGB on Thu Sep 12 16:18:45 2024

On 11/09/2024 20:51, BGB wrote:

On 9/11/2024 5:38 AM, Anton Ertl wrote:

Josh Vanderhoof <[email protected]> writes:

[email protected] (Anton Ertl) writes:

George Neuner <[email protected]> writes:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

1) At first I thought that yes, one could just check whether there is
an overlap of the memory areas. But then I remembered that you cannot
write such a check in standard C without (in the general case)
exercising undefined behaviour; and then the compiler could eliminate
the check or do something else that's unexpected. Do you have such a
check in mind that does not exercise undefined behaviour in the
general case?

...

It is legal to test for equality between pointers to different objects
so you could test for overlap by testing against every element in the
array. It seems like it should be possible for the compiler to figure
out what's happening and optimize those tests away, but unfortunately
no compiler I tested did it.

That would be an interesting result of the ATUBDNH lunacy: programmers
would see themselves forced to write workarounds such as the one you
suggest (with terrible performance when not optimized), and then C
compiler maintainers would see themselves forced to optimize this kind
of code. The end result would be that both parties have to put in
more effort to eventually get the same result as if ordered comparison
between different objects had been defined from the start.

For now, the ATUBDNH advocates tell programmers that they have to work
around the lack of definition, but there is usually no optimization
for that.

One case where things work somewhat along the lines you suggest is
unaligned accesses. Traditionally, if knowing that the hardware
supports unaligned accesses, for a 16-bit load one would write:

int16_t foo1(int16_t *p)
{
   return *p;
}

If one does not know that the hardware supports unaligned accesses,
the traditional way to perform such an access (little-endian) is
something like:

int16_t foo2(int16_t *p)
{
   unsignedchar *q = p;
   return (int16_t)(q[0] + (q[1]>>8));
}

Correcting the typos (in case anyone wants to copy-and-paste to
godbolt.org for testing):

int16_t foo2(int16_t *p)
{
unsigned char *q = (unsigned char *) p;
return (int16_t)(q[0] + (q[1] << 8));
}

Now, several years ago, somebody told me that the proper way is as
follows:

int16_t foo3(int16_t *p)
{
    int16_t v;
    memcpy(&v,p,2);
    return v;
}

That way looked horribly inefficient to me, with v having to reside in
memory instead of in a register and then the expensive function call,
and all the decisions that memcpy() has to take depending on the
length argument. But gcc optimizes this idiom into an unaligned load
rather than taking all the steps that I expected (however, I have seen
cases where the code produced on hardware that supports unaligned
accesses is worse than that for foo1()). Of course, if you also want
to support less sophisticated compilers, this idiom may be really slow
on those, although not quite as expensive as your containment check.

It is an unfortunate truth that code that is correct can be inefficient
on some compilers, while code that is efficient on those compilers is
not correct (according to the C standards) and can fail on other
compilers. I may be a "ATUBDNH advocate", but I can certainly
acknowledge that much. The C standard is concerned with the behaviour
of the code, not its efficiency, and it has always been a fact of life
for C programmers that different compilers give better or worse results
for different ways of writing source code. Not all code can be written
portably /and/ efficiently, without at least some conditional compilation.

foo1() is defined behaviour if and only if the pointer is correctly
aligned. For a stand-alone function,

foo2() above is perfectly correct C and has fully defined behaviour
(with the obvious assumptions that CHAR_BIT is 8 and that int16_t
exists), but only gives the correct results for little-endian systems.

foo3() is correct regardless of the endianness (with the same
assumptions about the targets), but efficiency can vary.
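
The difference between the two approaches can be seen on a byte buffer
with a deliberately odd offset. This is a minimal sketch with
hypothetical names (load16_le, load16_native), taking byte and void
pointers instead of int16_t * so that no misaligned int16_t pointer is
ever formed; it assumes CHAR_BIT is 8 and that int16_t exists.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise load in the style of foo2(): defined for any alignment and
   always interprets the two bytes as little-endian. The conversion to
   int16_t is implementation-defined for values above INT16_MAX. */
static int16_t load16_le(const unsigned char *q)
{
    assert(q != NULL);
    return (int16_t)(uint16_t)(q[0] | ((unsigned)q[1] << 8));
}

/* memcpy-based load in the style of foo3(): defined for any alignment,
   and the result follows the byte order of the host. */
static int16_t load16_native(const void *p)
{
    int16_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

For the bytes 0x34, 0x12 at an odd offset, load16_le() yields 0x1234 on
any host, while load16_native() yields 0x1234 on a little-endian host
and 0x3412 on a big-endian one.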

Testing these on godbolt.org with gcc and MSVC shows that both optimise
the memcpy() into a single 16-bit load. MSVC does not recognize the
pattern in foo2() and generates poor code for it (it even uses an "imul"
instruction!).

Another alternative is:

int16_t foo1v(int16_t *p)
{
volatile int16_t * q = p;
return *q;
}

The C standard does not say exactly what this will do, but you can
expect the compiler to generate the load, even if it knows "p" is
misaligned, and even if it knows the target does not support misaligned accesses. Of course, this has implications for optimisations as the
compiler can't re-order such loads.

Would be nice, say, if there were semi-standard compiler macros for
various things:

Ask, and you shall receive! (Well, sometimes you might receive.)

Endianess (macros exist, typically compiler specific);
    And, apparently GCC and Clang can't agree on which strategy to use.

#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
...
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
...
#else
...
#endif

Works in gcc, clang and MSVC.

And C23 has the <stdbit.h> header with many convenient little "bit and
byte" utilities, including endian detection:

#include <stdbit.h>
#if __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_LITTLE__
...
#elif __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_BIG__
...
#else
...
#endif

Whether or not the target/compiler allows misaligned memory access;
    If set, one may use misaligned access.

Why would you need that? Any decent compiler will know what is allowed
for the target (perhaps partly on the basis of compiler flags), and will generate the best allowed code for accesses like foo3() above.

Whether or not memory uses a single address space;
    If set, all pointer comparisons are allowed.

Pointer comparisons are always allowed for equality tests if they are
pointers to objects of compatible types. (Function pointers cannot be
compared at all.)

For other relational tests, the pointers must point to sub-objects of
the same aggregate object. (That means they can't be null pointers,
misaligned pointers, invalid pointers or pointers going nowhere.) This
is independent of how the address space(s) are organised on the target
machine.

What you /can/ do, on pretty much any implementation with a single
linear address space, is convert pointers to uintptr_t and then compare
them. There may be some targets for which there is no uintptr_t, or
where the mapping from pointer to integer does not match with the
address, but that would be very unusual.
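
A minimal sketch of that approach, assuming a single linear address
space, that uintptr_t exists, and that the pointer-to-integer mapping
follows the address (none of which the C standard guarantees). The name
ranges_overlap_uintptr is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the len-byte ranges starting at a and b overlap.
   The comparison is done on integers, so no relational operator is
   ever applied to pointers into different objects. */
static bool ranges_overlap_uintptr(const void *a, const void *b, size_t len)
{
    assert(a != NULL && b != NULL);
    uintptr_t ua = (uintptr_t)a;
    uintptr_t ub = (uintptr_t)b;
    uintptr_t dist = ua > ub ? ua - ub : ub - ua;   /* |a - b| */

    return dist < len;
}
```

The result is only as meaningful as the implementation-defined
conversion to uintptr_t, so this is a portability trade-off and not a
strictly conforming check.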

I can't think when you would need to do such comparisons, however, other
than to implement memmove - and library functions can use any kind of implementation-specific feature they like.

Clang:
__LITTLE_ENDIAN__, __BIG_ENDIAN__
One or the other is defined based on endian.
GCC:
__BYTE_ORDER__ which may equal one of:
    __ORDER_LITTLE_ENDIAN__
    __ORDER_BIG_ENDIAN__
    __ORDER_PDP_ENDIAN__
MSVC:
REG_DWORD is one of:
    REG_DWORD_LITTLE_ENDIAN
    REG_DWORD_BIG_ENDIAN

GCC:
__SIZEOF_type__ //gives sizeof various types

See above.

Possible:
__MINALIGN_type__ //minimum allowed alignment for type

_Alignof(type) has been around since C11.

Maybe also alias pointer control:
__POINTER_ALIAS__
    __POINTER_ALIAS_CONSERVATIVE__
    __POINTER_ALIAS_STRICT__

Where, pointer alias can be declared, and:
If conservative, then conservative semantics are being used.
    Pointers may be freely cast without concern for pointer aliasing.
    Compiler will assume that "non restrict" pointer stores may alias.
If strict, the compiler is using TBAA semantics.
    Compiler may assume that aliasing is based on pointer types.

Faffing around with pointer types - breaking the "effective type" rules
- has been a bad idea and risky behaviour since C was standardised. You
never need to do it. (I accept, however, that on some weaker or older compilers "doing the right thing" can be noticeably less efficient than
writing bad code.) Just get a half-decent compiler and use memcpy().
For any situation where you might think casting pointer types would be a
good idea, your sizes are small and known at compile time, so they are
easy for the compiler to optimise.

If you /must/ do such casts, or you are dealing with questionable
quality code that uses them, at least add this to your code:

#ifdef __GNUC__
#pragma GCC optimize("-fno-strict-aliasing")
#endif

It won't make the code correct if you are using a compiler other than
gcc or clang, but it's a help.

And as a general rule, if you feel you really want to break the rules of
C and still get something useful out at the end, use "volatile" liberally.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Terje Mathisen on Thu Sep 12 07:33:13 2024

Terje Mathisen <[email protected]> writes:

[how to detect interval overlap]

What about:

max(src,dst) < (min(src,dst)+len)

If you have a min/max circuit, i.e. a two-element sorter, then it
could be quite efficient, otherwise run the min first, then the
max and the add during the second cycle, before the less than test
in the third cycle.

[...]

I do believe though that in reality it could be faster to use the
branchy version, and let the branch predictors do their job
instead of having to wait to evaluate all three terms:

bool is_overlap(char *src, char *dst, size_t len)
{
    if (src < dst) {
        return (src+len > dst);
    }
    return (dst+len > src);
}

Note that there are two distinct problems that are relevant to
the discussion: is there any overlap, and is there overlap of
the wrong kind. The question of Is there any overlap can be
done with a simple comparison if there is a non-branching abs()
function available (assuming a flat linear address space):

if( abs( source - destination ) < n ) ...

The question of Is there overlap of wrong kind, which is like
what memmove would want to ask, can be done with a single
comparison if the bad direction is known in advance, and fixed.
An example is given by Anton, and a revision of that in my
recent response to his posting.
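
The "any overlap" test can be sketched as follows, assuming a flat linear address space in which converting a pointer to uintptr_t yields its address (the C standard leaves that conversion implementation-defined). The name ranges_overlap is made up for this example, and the absolute difference is computed on unsigned values so there is no signed overflow to worry about.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if [src, src+n) and [dst, dst+n) share at least one byte.
   Assumes a flat address space; see the note above. */
static bool ranges_overlap(const void *src, const void *dst, size_t n)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    uintptr_t distance = s > d ? s - d : d - s;   /* |src - dst| */
    return distance < n;
}
```

Whether the conditional compiles to a branch or to a branch-free sequence is up to the compiler and target.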


From David Brown@21:1/5 to Michael S on Thu Sep 12 16:34:31 2024

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
    And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
    If set, one may use misaligned access.
Whether or not memory uses a single address space;
    If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

I fully agree that C is not, and should not be seen as, a "high-level
assembly language". But it is a language that is very useful to
"hardware-type folks", and there are a few things that could make it
easier to write more portable code if they were standardised. As it is,
we just have to accept that some things are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

And how should that be defined? And what is its "practical" definition?
My preference would be a hard compile-time error, but specifying that
in the standards would force compilers to do more analysis and checking
than the standards can reasonably enforce.

clang can warn on this - I am disappointed to see that gcc does not.
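
For comparison, here is a sketch of one way to touch "the byte after x" without indexing past the end of the array: address the whole struct as unsigned char and use offsetof. The struct tag bar_t and the helper names are invented for this example. As far as I know this form is defined whatever padding the compiler inserts, whereas bar.x[8] is not.

```c
#include <stddef.h>

struct bar_t {
    char x[8];
    int  y;
};

/* Store v into the first byte of bar->y by viewing the struct as bytes. */
static void poke_first_byte_of_y(struct bar_t *bar, unsigned char v)
{
    unsigned char *bytes = (unsigned char *)bar;
    bytes[offsetof(struct bar_t, y)] = v;
}

/* Read the same byte back. */
static unsigned char peek_first_byte_of_y(const struct bar_t *bar)
{
    const unsigned char *bytes = (const unsigned char *)bar;
    return bytes[offsetof(struct bar_t, y)];
}
```

Which value y ends up with after such a store depends on byte order, so the result is still implementation-specific even though the access itself is fine.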


From Anton Ertl@21:1/5 to Tim Rentsch on Thu Sep 12 14:20:42 2024

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

[considering which way to copy with memmove()]

If the two memory blocks don't overlap, memmove() can use the
fastest stride. [...]

The way to go for memmove() is:

On hardware where positive stride is faster:

if ((uintptr_t)(dest-src) >= len)
    return memcpy_posstride(dest,src,len);
else
    return memcpy_negstride(dest,src,len);

On hardware where the negative stride is faster:

if ((uintptr_t)(src-dest) >= len)
    return memcpy_negstride(dest,src,len);
else
    return memcpy_posstride(dest,src,len);

And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.

...

Last but not least, having two different code blocks for the
different preferences is clunky. The two blocks can be
combined by fusing the two test expressions into a single
expression, as for example

#ifndef PREFER_UPWARDS
#define PREFER_UPWARDS 1
#endif/*PREFER_UPWARDS*/

extern void* ascending_copy( void*, const void*, size_t );
extern void* descending_copy( void*, const void*, size_t );

void *
good_memmove( void *vd, const void *vs, size_t n ){
const char *d = vd;
const char *s = vs;
_Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;

return
upwards
? ascending_copy( vd, vs, n )
: descending_copy( vd, vs, n );
}

Using the preprocessor symbol PREFER_UPWARDS to select between
the two preferences (ascending or descending) allows the choice
to be made by a -D compiler option, and we can expect the compiler
to optimize away the part of the test that is never used.
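
To make the direction test concrete, here is a self-contained sketch for the "prefer upwards" case only. The two copy routines are plain byte loops standing in for whatever tuned code a real library would use, and sketch_memmove is a made-up name. It assumes a flat address space and relies on unsigned wrap-around: when the destination lies below the source, d - s wraps to a huge value and the ascending copy is chosen.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy from the lowest address upwards. Safe when dest is below src,
   or when the blocks do not overlap. */
static void *ascending_copy(void *vd, const void *vs, size_t n)
{
    unsigned char *d = vd;
    const unsigned char *s = vs;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return vd;
}

/* Copy from the highest address downwards. Safe when dest is above src. */
static void *descending_copy(void *vd, const void *vs, size_t n)
{
    unsigned char *d = vd;
    const unsigned char *s = vs;
    while (n--)
        d[n] = s[n];
    return vd;
}

/* One unsigned comparison picks the direction: the descending copy is
   needed only when dest lies inside [src, src+n). */
static void *sketch_memmove(void *vd, const void *vs, size_t n)
{
    uintptr_t d = (uintptr_t)vd;
    uintptr_t s = (uintptr_t)vs;
    return (d - s >= n) ? ascending_copy(vd, vs, n)
                        : descending_copy(vd, vs, n);
}
```

The conversion of pointers to uintptr_t is implementation-defined, so this is the kind of code that belongs inside an implementation rather than in portable source.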

That's clever, but for usage in glibc or the like the clunky version
is the preferred one: memmove() is usually called through the dynamic
linking mechanism, and which implementation is actually called is
selected based on the hardware that it runs on (what does it do when
the program is linked statically?). There seem to be quite a few
memmove() (and __memmove_chk()) implementations in glibc-2.36 on
AMD64:

__memmove_chk
__memmove_sse2_unaligned_erms
__memmove_chk
__memmove_chk_erms
__memmove_chk_evex_unaligned
__memmove_chk_avx_unaligned
__memmove_chk_ssse3
__memmove_chk_sse2_unaligned
__memmove_erms
__memmove_avx512_unaligned
__memmove_evex_unaligned
__memmove_evex_unaligned_erms
__memmove_avx_unaligned
__memmove_avx_unaligned_erms
__memmove_avx_unaligned_rtm
__memmove_ssse3
__memmove_sse2_unaligned
__memmove_chk_sse2_unaligned_erms
__memmove_chk_avx512_no_vzeroupper
__memmove_chk_avx512_unaligned
__memmove_chk_avx512_unaligned_erms
__memmove_chk_evex_unaligned_erms
__memmove_chk_avx_unaligned_erms
__memmove_chk_avx_unaligned_rtm
__memmove_chk_avx_unaligned_erms_rtm
__memmove_avx512_no_vzeroupper
__memmove_avx512_unaligned_erms
__memmove_avx_unaligned_erms_rtm

From what I read, __memmove_chk() (which has an additional destlen
parameter) is apparently not intended to be called explicitly from the
source code, so I guess that some compilers generate calls to it.
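
As a rough illustration of what a checked variant with a destlen parameter might do (this is a guess at the idea, not glibc's code; memmove_checked is a made-up name): refuse the copy when it would not fit in the destination, otherwise forward to the ordinary memmove().

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical checked wrapper: destlen is the number of bytes known to be
   available at dest. Abort instead of writing past the end. */
static void *memmove_checked(void *dest, const void *src,
                             size_t len, size_t destlen)
{
    if (len > destlen)
        abort();
    return memmove(dest, src, len);
}
```

A compiler that knows the size of the destination object at compile time could substitute such a call for a plain memmove() and fill in destlen itself.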

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Tim Rentsch@21:1/5 to Anton Ertl on Thu Sep 12 08:03:52 2024

[email protected] (Anton Ertl) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

[considering which way to copy with memmove()]

If the two memory blocks don't overlap, memmove() can use the
fastest stride. [...]

The way to go for memmove() is:

On hardware where positive stride is faster:

if ((uintptr_t)(dest-src) >= len)
    return memcpy_posstride(dest,src,len);
else
    return memcpy_negstride(dest,src,len);

On hardware where the negative stride is faster:

if ((uintptr_t)(src-dest) >= len)
    return memcpy_negstride(dest,src,len);
else
    return memcpy_posstride(dest,src,len);

And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.

...

Last but not least, having two different code blocks for the
different preferences is clunky. The two blocks can be
combined by fusing the two test expressions into a single
expression, as for example

#ifndef PREFER_UPWARDS
#define PREFER_UPWARDS 1
#endif/*PREFER_UPWARDS*/

extern void* ascending_copy( void*, const void*, size_t );
extern void* descending_copy( void*, const void*, size_t );

void *
good_memmove( void *vd, const void *vs, size_t n ){
const char *d = vd;
const char *s = vs;
_Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;

return
upwards
? ascending_copy( vd, vs, n )
: descending_copy( vd, vs, n );
}

Using the preprocessor symbol PREFER_UPWARDS to select between
the two preferences (ascending or descending) allows the choice
to be made by a -D compiler option, and we can expect the compiler
to optimize away the part of the test that is never used.

That's clever, but for usage in glibc or the like the clunky version
is the preferred one: [elaboration]

That's irrelevant to the point I was making. People working
inside an implementation can take advantage of knowledge unknown
to people working at the source code level. My comment was only
about what is visible at the source code level, not about the
unknown hidden workings of some particular implementation.


From Michael S@21:1/5 to Tim Rentsch on Thu Sep 12 18:43:07 2024

On Thu, 12 Sep 2024 08:15:29 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Wed, 11 Sep 2024 17:34:38 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

Real mode x86 C compilers operating in Large and Compact Models
that were popular on IBM-compatible PCs 30-40 years ago could
have more than one representation for the pointer to the same
memory location. If my memory serves me, the rules of pointers
comparison for equality were the same as rules of comparison for
<>. In both cases for reliable result pointers had to be
explicitly normalized (i.e. converted from 'far' to 'huge' or
something like that).

It was long time ago and even back then I didn't use Large model
very often, so it's possible that I misremember. But if I
remember correctly, does it mean that those C compilers now would
be considered non-compliant?

The C standard was first ratified (by ANSI) in 1989. The rules
for pointer comparison were clarified in the C99 standard, but it
has always been true that pointers to the same object have to
compare equal.

C environments that have things like 'far' or 'huge' pointers,
etc, are not standard C but must have extensions so that they can
deal with the different kinds of pointers. Depending on how the
non-standard kinds of pointer worked, the implementation might or
might not be conforming. Most likely though it's a moot point
because once a program starts using an extension all the rules
can change, and the C standard allows that. It's only programs
that look like really standard C that have to do what the C
standard says (for the implementation to be conforming); any
code that declares a 'far' pointer or 'huge' pointer certainly
isn't standard C.

In Compact and Large models data pointers are 'far' by default. So,
the source doesn't have to use non-standard declarations.

In that case, if the defaulted 'far' pointers don't follow the
rules given in the C standard for regular pointers, then the
implementation is not conforming. Extensions are allowed only if
they don't change the behavior of any strictly conforming
program. If undecorated pointer declarations don't observe this
condition then it's not a valid extension, which in turn causes
the implementation to be non-conforming.

Thinking about it, there was likely no way to create aliases using
only standard language constructs. That's assuming that any use of
preserved values of pointers to de-allocated heap storage, including
use for comparison, is non-standard.
So it probably was a conforming implementation in the end.


From Thomas Koenig@21:1/5 to Tim Rentsch on Thu Sep 12 15:53:33 2024

Tim Rentsch <[email protected]> schrieb:

Code inside the implementation is allowed to exploit internal
knowledge.

Which is a cause of envy for people who don't...

glibc can compare pointers all it wants if it knows that the
pointer model is flat.


From Tim Rentsch@21:1/5 to Michael S on Thu Sep 12 08:15:29 2024

Michael S <[email protected]> writes:

On Wed, 11 Sep 2024 17:34:38 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Wed, 11 Sep 2024 09:29:04 -0700
Tim Rentsch <[email protected]> wrote:

Josh Vanderhoof <[email protected]> writes:

[how to write a portable, UB-free check if memcpy() intervals
overlap]

It is legal to test for equality between pointers to different
objects

Right. This observation is the key insight.
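
One way to read that insight, sketched here under the assumption that both pointers address at least n valid bytes: walk each block and use only == comparisons, which are allowed between pointers into different objects. The name overlaps_portably is invented for this example. It costs O(n) comparisons, so it is a demonstration of what is expressible in portable C rather than something to put on a fast path.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if [a, a+n) and [b, b+n) share a byte. Uses pointer equality only,
   never a relational comparison between unrelated pointers. */
static bool overlaps_portably(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    for (size_t i = 0; i < n; i++) {
        /* Two equal-length blocks overlap exactly when the start of one
           falls inside the other. */
        if (pa + i == pb || pb + i == pa)
            return true;
    }
    return false;
}
```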

Real mode x86 C compilers operating in Large and Compact Models that
were popular on IBM-compatible PCs 30-40 years ago could have more
than one representation for the pointer to the same memory
location. If my memory serves me, the rules of pointers comparison
for equality were the same as rules of comparison for <>. In both
cases for reliable result pointers had to be explicitly normalized
(i.e. converted from 'far' to 'huge' or something like that).

It was long time ago and even back then I didn't use Large model
very often, so it's possible that I misremember. But if I remember
correctly, does it mean that those C compilers now would be
considered non-compliant?

The C standard was first ratified (by ANSI) in 1989. The rules
for pointer comparison were clarified in the C99 standard, but it
has always been true that pointers to the same object have to
compare equal.

C environments that have things like 'far' or 'huge' pointers,
etc, are not standard C but must have extensions so that they can
deal with the different kinds of pointers. Depending on how the
non-standard kinds of pointer worked, the implementation might or
might not be conforming. Most likely though it's a moot point
because once a program starts using an extension all the rules
can change, and the C standard allows that. It's only programs
that look like really standard C that have to do what the C
standard says (for the implementation to be conforming); any
code that declares a 'far' pointer or 'huge' pointer certainly
isn't standard C.

In Compact and Large models data pointers are 'far' by default. So,
the source doesn't have to use non-standard declarations.

In that case, if the defaulted 'far' pointers don't follow the
rules given in the C standard for regular pointers, then the
implementation is not conforming. Extensions are allowed only if
they don't change the behavior of any strictly conforming
program. If undecorated pointer declarations don't observe this
condition then it's not a valid extension, which in turn causes
the implementation to be non-conforming.


From MitchAlsup1@21:1/5 to Anton Ertl on Thu Sep 12 17:57:52 2024

On Thu, 12 Sep 2024 14:20:42 +0000, Anton Ertl wrote:

That's clever, but for usage in glibc or the like the clunky version
is the preferred one: memmove() is usually called through the dynamic
linking mechanism, and which implementation is actually called is
selected based on the hardware that it runs on (what does it do when
the program is linked statically?). There seem to be quite a few
memmove() (and __memmove_chk()) implementations in glibc-2.36 on
AMD64:

__memmove_chk
__memmove_sse2_unaligned_erms
__memmove_chk
__memmove_chk_erms
__memmove_chk_evex_unaligned
__memmove_chk_avx_unaligned
__memmove_chk_ssse3
__memmove_chk_sse2_unaligned
__memmove_erms
__memmove_avx512_unaligned
__memmove_evex_unaligned
__memmove_evex_unaligned_erms
__memmove_avx_unaligned
__memmove_avx_unaligned_erms
__memmove_avx_unaligned_rtm
__memmove_ssse3
__memmove_sse2_unaligned
__memmove_chk_sse2_unaligned_erms
__memmove_chk_avx512_no_vzeroupper
__memmove_chk_avx512_unaligned
__memmove_chk_avx512_unaligned_erms
__memmove_chk_evex_unaligned_erms
__memmove_chk_avx_unaligned_erms
__memmove_chk_avx_unaligned_rtm
__memmove_chk_avx_unaligned_erms_rtm
__memmove_avx512_no_vzeroupper
__memmove_avx512_unaligned_erms
__memmove_avx_unaligned_erms_rtm

All of these compile to the MM instruction in My 66000,
including the memcpy() variants.


From MitchAlsup1@21:1/5 to All on Thu Sep 12 18:11:39 2024

On Thu, 12 Sep 2024 17:57:52 +0000, MitchAlsup1 wrote:

On Thu, 12 Sep 2024 14:20:42 +0000, Anton Ertl wrote:

That's clever, but for usage in glibc or the like the clunky version
is the preferred one: memmove() is usually called through the dynamic
linking mechanism, and which implementation is actually called is
selected based on the hardware that it runs on (what does it do when
the program is linked statically?). There seem to be quite a few
memmove() (and __memmove_chk()) implementations in glibc-2.36 on
AMD64:

__memmove_chk
__memmove_sse2_unaligned_erms
__memmove_chk
__memmove_chk_erms
__memmove_chk_evex_unaligned
__memmove_chk_avx_unaligned
__memmove_chk_ssse3
__memmove_chk_sse2_unaligned
__memmove_erms
__memmove_avx512_unaligned
__memmove_evex_unaligned
__memmove_evex_unaligned_erms
__memmove_avx_unaligned
__memmove_avx_unaligned_erms
__memmove_avx_unaligned_rtm
__memmove_ssse3
__memmove_sse2_unaligned
__memmove_chk_sse2_unaligned_erms
__memmove_chk_avx512_no_vzeroupper
__memmove_chk_avx512_unaligned
__memmove_chk_avx512_unaligned_erms
__memmove_chk_evex_unaligned_erms
__memmove_chk_avx_unaligned_erms
__memmove_chk_avx_unaligned_rtm
__memmove_chk_avx_unaligned_erms_rtm
__memmove_avx512_no_vzeroupper
__memmove_avx512_unaligned_erms
__memmove_avx_unaligned_erms_rtm

All of these compile to the MM instruction in My 66000,
including the memcpy() variants.

But the list above is a symptom of not providing the right abstraction
for memmove() in the ISA to begin with.


From Michael S@21:1/5 to Terje Mathisen on Thu Sep 12 23:10:16 2024

On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:

Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.

I am trying to compare the speed of a few compiled languages in one benchmark
that I find interesting.
In order to make the comparison I have to port a test bench first, because
while most of these languages are able, with various levels of
difficulty, to call C routines, none of them can be called from 'C',
at least at my level of knowledge.

Porting test bench from C to Go was quite easy, the only part that I
didn't grasp immediately was related to time measurements.

Today I started Rust port and it is VERY much harder. After several
hours of reading of various tutorials, examples and Stack Overflow
articles I still don't know how to write
switch (argv[1][0]) {
case 't':
case 'T':
x = 42;
break;
}

At this rate, I am not sure that my motivation will last long enough to
finish the porting.

Rust also gets rid of the horrible external library/configure/cmake
mess that kept me from successfully compiling the reference LAStools
lidar code for nearly 10 years.

Using the Rust port I just tell cargo to add it to my project and
that's it.

Terje


From MitchAlsup1@21:1/5 to Michael S on Thu Sep 12 20:58:18 2024

On Thu, 12 Sep 2024 20:10:16 +0000, Michael S wrote:

On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:

Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.

I am trying to compare speed of few compiled languages in one benchmark
that I find interesting.
In order to make comparison I have to port a test bench first, because
while most of this languages are able, with various level of
difficulties, to call C routines, none of them can be called from 'C',
at least at my level of knowledge.

FORTRAN 77 passes arguments indirectly so the subroutine can write to
the location storing the argument--giving it IN-OUT capabilities.
I never found this indirection a bother when calling FORTRAN
from C.

Since C only has IN style arguments (in ADA parlance):
ADA OUT and INOUT arguments require the compiler knowing about the
OUT nature of the argument, so, upon return, it can place the OUT
argument variables back where they belong.


From MitchAlsup1@21:1/5 to BGB on Thu Sep 12 21:52:32 2024

On Thu, 12 Sep 2024 21:14:18 +0000, BGB wrote:

This is because in some cases, the performance overhead of copying the
last (sz&31) bytes is significant, say:
rsz=cte-ct;
if(rsz)
{
if(rsz&16)
{
v0=((u64 *)cs)[0]; v1=((u64 *)cs)[1];
((u64 *)ct)[0]=v0; ((u64 *)ct)[1]=v1;
cs+=16; ct+=16;
}
if(rsz&8)
{
v0=((u64 *)cs)[0];
((u64 *)ct)[0]=v0;
cs+=8; ct+=8;
}
if(rsz&4)
{
v0=((u32 *)cs)[0];
((u32 *)ct)[0]=v0;
cs+=4; ct+=4;
}
if(rsz&2)
{
v0=((u16 *)cs)[0];
((u16 *)ct)[0]=v0;
cs+=2; ct+=2;
}
if(rsz&1)
{
v0=((byte *)cs)[0];
((byte *)ct)[0]=v0;
cs++; ct++;
}
}

For small copies with awkward sizes, this tailing part can cost more
than the whole rest of the copy.

A fine rendition of why this should be in HW as an instruction.
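
Until such an instruction is available, the same tail handling can be written without the pointer casts by using fixed-size memcpy() calls, which compilers commonly turn into single loads and stores. This is a sketch with a made-up name, copy_tail, and it assumes rsz is less than 32 as in the quoted code.

```c
#include <stddef.h>
#include <string.h>

/* Copy the last rsz (< 32) bytes of a transfer, one power of two at a time.
   Each memcpy has a constant size, so no misaligned typed access is
   written in the source. */
static void copy_tail(unsigned char *ct, const unsigned char *cs, size_t rsz)
{
    if (rsz & 16) { memcpy(ct, cs, 16); cs += 16; ct += 16; }
    if (rsz & 8)  { memcpy(ct, cs, 8);  cs += 8;  ct += 8;  }
    if (rsz & 4)  { memcpy(ct, cs, 4);  cs += 4;  ct += 4;  }
    if (rsz & 2)  { memcpy(ct, cs, 2);  cs += 2;  ct += 2;  }
    if (rsz & 1)  { *ct = *cs; }
}
```

The chain of up to five tests is still there, which is the cost being complained about; only the aliasing and alignment concerns go away.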


From Tim Rentsch@21:1/5 to Michael S on Thu Sep 12 18:33:18 2024

Michael S <[email protected]> writes:

On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:

Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.

I am trying to compare speed of few compiled languages in one benchmark
that I find interesting.
In order to make comparison I have to port a test bench first, because
while most of this languages are able, with various level of
difficulties, to call C routines, none of them can be called from 'C',
at least at my level of knowledge.

Porting test bench from C to Go was quite easy, the only part that I
didn't grasp immediately was related to time measurements.

Today I started Rust port and it is VERY much harder. After several
hours of reading of various tutorials, examples and Stack Overflow
articles I still don't know how to write
switch (argv[1][0]) {
case 't':
case 'T':
x = 42;
break;
}

At this rate, I am not sure that my motivation will last long enough to
finish the porting.

Disclaimer: I have very little experience with Rust. The
example shown below looks like Rust but may very well have
syntax errors (or worse).

match argv[1][0] {
't' | 'T' => { x = 42; }
_ => { }
}

The _ pattern matches anything that hasn't been matched (and
may be necessary, I'm not sure about that).


From Thomas Koenig@21:1/5 to Michael S on Fri Sep 13 05:40:08 2024

Michael S <[email protected]> schrieb:

In order to make comparison I have to port a test bench first, because
while most of this languages are able, with various level of
difficulties, to call C routines, none of them can be called from 'C',
at least at my level of knowledge.

If you declare a Fortran procedure BIND(C), you can call it from C.
gfortran will give you the C prototype with -fc-prototypes.

Or, if you don't declare it BIND(C) and it uses old-style code,
you can use -fc-prototypes-external.


From Michael S@21:1/5 to Thomas Koenig on Fri Sep 13 11:52:35 2024

On Fri, 13 Sep 2024 05:40:08 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

In order to make comparison I have to port a test bench first,
because while most of this languages are able, with various level of
difficulties, to call C routines, none of them can be called from
'C', at least at my level of knowledge.

If you declare a Fortran procedure BIND(C), you can call it from C.
gfortran will give you the C prototype with -fc-prototypes.

Or, if you don't declare it BIND(C) and it uses old-style code,
you can use -fc-prototypes-external.

Thank you, but Fortran was not in the list of the languages that I
wanted to test.


From Michael S@21:1/5 to Tim Rentsch on Fri Sep 13 12:04:17 2024

On Thu, 12 Sep 2024 18:33:18 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:

Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.

I am trying to compare speed of few compiled languages in one
benchmark that I find interesting.
In order to make comparison I have to port a test bench first,
because while most of this languages are able, with various level of
difficulties, to call C routines, none of them can be called from
'C', at least at my level of knowledge.

Porting test bench from C to Go was quite easy, the only part that I
didn't grasp immediately was related to time measurements.

Today I started Rust port and it is VERY much harder. After several
hours of reading of various tutorials, examples and Stack Overflow
articles I still don't know how to write
switch (argv[1][0]) {
case 't':
case 'T':
x = 42;
break;
}

At this rate, I am not sure that my motivation will last long
enough to finish the porting.

Disclaimer: I have very little experience with Rust. The
example shown below looks like Rust but may very well have
syntax errors (or worse).

match argv[1][0] {
't' | 'T' => { x = 42; }
_ => { }
}

The _ pattern matches anything that hasn't been matched (and
may be necessary, I'm not sure about that).

My hurdle is related to the [0] part rather than to the switch/case part.
Accessing the nth character of a String (or of str? Or &str? I am still
trying to figure out the difference.) is not as simple as in C or Go.
One person on Stack Overflow said that he was able to figure it out
after he learned the difference between std::string and
std::string_view in C++. Maybe I should follow the same process. But
I don't want to. I don't plan to become an expert Rust programmer,
but rather want to do a simple benchmark.


From Terje Mathisen@21:1/5 to Michael S on Fri Sep 13 12:05:10 2024

Michael S wrote:

On Thu, 12 Sep 2024 18:33:18 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Tue, 3 Sep 2024 17:46:38 +0200
Terje Mathisen <[email protected]> wrote:

Q&D programming is still far faster for me in C, but using Rust I
don't have to worry about how well the compiler will be able to
optimize my code, it is pretty much always close to speed of light
since the entire aliasing issue goes away.

I am trying to compare speed of few compiled languages in one
benchmark that I find interesting.
In order to make comparison I have to port a test bench first,
because while most of this languages are able, with various level of
difficulties, to call C routines, none of them can be called from
'C', at least at my level of knowledge.

Porting test bench from C to Go was quite easy, the only part that I
didn't grasp immediately was related to time measurements.

Today I started Rust port and it is VERY much harder. After several
hours of reading of various tutorials, examples and Stack Overflow
articles I still don't know how to write
switch (argv[1][0]) {
case 't':
case 'T':
x = 42;
break;
}

At this rate, I am not sure that my motivation will last long
enough to finish the porting.

Disclaimer: I have very little experience with Rust. The
example shown below looks like Rust but may very well have
syntax errors (or worse).

match argv[1][0] {
't' | 'T' => { x = 42; }
_ => { }
}

The _ pattern matches anything that hasn't been matched (and
may be necessary, I'm not sure about that).

My hurdle is related to the [0] part rather than to the switch/case part.
Accessing the nth character of a String (or of str? Or &str? I am still
trying to figure out the difference.) is not as simple as in C or Go.
One person on Stack Overflow said that he was able to figure it out
after he learned the difference between std::string and
std::string_view in C++. Maybe I should follow the same process. But
I don't want to. I don't plan to become an expert Rust programmer,
but rather want to do a simple benchmark.

Rust strings _always_ use utf8! If you use the .as_bytes() casting then
you can in fact address the underlying u8 bytes, and since you will be
working with 7-bit ascii only, that will not make any difference to you.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Tim Rentsch@21:1/5 to Michael S on Fri Sep 13 04:12:21 2024

Michael S <[email protected]> writes:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
    And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
    If set, one may use misaligned access.
Whether or not memory uses a single address space;
    If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

Why not?

Because it's not needed, and would make things worse rather
than better. The result would be a bigger language but not
a better language.

I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain
limited classes of buffer overflows.

Eliminating undefined behavior is not what's being asked for.
These two questions are not the same.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.


From Michael S@21:1/5 to Tim Rentsch on Fri Sep 13 14:29:04 2024

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
    And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
    If set, one may use misaligned access.
Whether or not memory uses a single address space;
    If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

Why not?

Because it's not needed, and would make things worse rather
than better. The result would be a bigger language but not
a better language.

I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain
limited classes of buffer overflows.

Eliminating undefined behavior is not what's being asked for.
These two questions are not the same.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Fri Sep 13 14:44:11 2024

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

Why not?

Because it's not needed, and would make things worse rather
than better. The result would be a bigger language but not
a better language.

I beg to differ.
Yes, the standard would be bigger. And yes, a few unimportant benchmarks
would run a little slower. But the job of compiler writers would be
simpler and less exciting (good thing!). Most importantly,
programming in the resulting language would feel more predictable.

I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain
limited classes of buffer overflows.

Eliminating undefined behavior is not what's being asked for.
These two questions are not the same.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.
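
For comparison, a minimal sketch of how the same store can be written without indexing past the end of x: address the byte through the object representation of the whole struct. The names struct bar_t and poke_byte_after_x are made up for illustration, and the result only lands in y if the implementation places y at offset 8, which is typical but not required by the standard.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical named version of the anonymous struct from the post. */
struct bar_t {
    char x[8];
    int y;
};

/* Writes `value` to the byte 8 positions past the start of x, addressing
   it through the whole struct's bytes instead of through x[8].
   Returns the resulting value of y. */
static int poke_byte_after_x(unsigned char value)
{
    struct bar_t bar;
    unsigned char *raw = (unsigned char *)&bar;

    memset(&bar, 0, sizeof bar);
    raw[offsetof(struct bar_t, x) + 8] = value;
    return bar.y;
}
```

With y at offset 8, this should give the endianness-dependent result described above: the written byte becomes the lowest-addressed byte of y.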

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to BGB on Fri Sep 13 17:30:35 2024

On 12/09/2024 23:14, BGB wrote:

On 9/12/2024 9:18 AM, David Brown wrote:

On 11/09/2024 20:51, BGB wrote:

On 9/11/2024 5:38 AM, Anton Ertl wrote:

Josh Vanderhoof <[email protected]> writes:

[email protected] (Anton Ertl) writes:

<snip lots>

Would be nice, say, if there were semi-standard compiler macros for
various things:

Ask, and you shall receive! (Well, sometimes you might receive.)

   Endianness (macros exist, typically compiler specific);
     And, apparently GCC and Clang can't agree on which strategy to use.

#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
...
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
...
#else
...
#endif

Works in gcc, clang and MSVC.

Technically now also in BGBCC, since I have just recently added it.

Good idea.

And C23 has the <stdbit.h> header with many convenient little "bit and
byte" utilities, including endian detection:

#include <stdbit.h>
#if __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_LITTLE__
...
#elif __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_BIG__
...
#else
...
#endif

This is good at least.

Though, generally takes a few years before new features become usable.
Like, it is only in recent years that it has become "safe" to use most
parts of C99.

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.

<stdbit.h> is, however, in the standard library rather than the
compiler, and they can be a bit slow to catch up.
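
Until <stdbit.h> is widely available, one possible stopgap is to use the predefined macros where they exist and fall back to a run-time probe otherwise. This is only a sketch; is_little_endian is a made-up helper name, and the fallback only distinguishes little-endian from everything else.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian target, 0 otherwise. */
static int is_little_endian(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    /* Compile-time answer where the compiler predefines these macros. */
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
    /* Run-time probe: look at the lowest-addressed byte of the value 1. */
    const uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
#endif
}
```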

   Whether or not the target/compiler allows misaligned memory access;
     If set, one may use misaligned access.

Why would you need that? Any decent compiler will know what is
allowed for the target (perhaps partly on the basis of compiler
flags), and will generate the best allowed code for accesses like
foo3() above.

Imagine you have compilers that are smart enough to turn "memcpy()" into
a load and store, but not smart enough to optimize away the memory
accesses, or fully optimize away the wrapper functions...

Why would I do that? If I want to have efficient object code, I use a
good compiler. Under what realistic circumstances would you need to
have highly efficient results but be unable to use a good optimising
compiler? Compilers have been inlining code for 30 years at least
(that's when I first saw it) - this is not something new and rare.

So, for best results, the best case option is to use a pointer cast and dereference.

For some cases, one may also need to know whether or not they can access
the pointers in a misaligned way (and whether doing so would be better
or worse than something like "memcpy()").

Again, I cannot see a /real/ situation where that would be relevant.
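
The memcpy() approach under discussion can be sketched as below. load_u32_unaligned is a hypothetical helper name; whether the call collapses to a single load instruction depends on the compiler and target, so treat that part as an expectation rather than a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value from an address with no alignment requirement.
   The bytes are interpreted in the target's native byte order. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}
```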

   Whether or not memory uses a single address space;
     If set, all pointer comparisons are allowed.

Pointer comparisons are always allowed for equality tests if they are
pointers to objects of compatible types. (Function pointers cannot be
compared at all.)

For other relational tests, the pointers must point to sub-objects of
the same aggregate object. (That means they can't be null pointers,
misaligned pointers, invalid pointers or pointers going nowhere.)
This is independent of how the address space(s) are organised on the
target machine.

What you /can/ do, on pretty much any implementation with a single
linear address space, is convert pointers to uintptr_t and then
compare them. There may be some targets for which there is no
uintptr_t, or where the mapping from pointer to integer does not match
with the address, but that would be very unusual.

I can't think when you would need to do such comparisons, however,
other than to implement memmove - and library functions can use any
kind of implementation-specific feature they like.
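
A minimal sketch of that uintptr_t comparison, used here as an overlap test of the kind memmove needs. regions_overlap is a made-up name, and the code assumes what is said above: a single linear address space where the pointer-to-integer mapping matches the address. The standard itself only makes the conversion implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-byte regions starting at a and b share any byte.
   Assumes uintptr_t exists and that converted pointers order the same
   way as the underlying addresses. */
static int regions_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t ua = (uintptr_t)a;
    uintptr_t ub = (uintptr_t)b;

    /* Distance between the start addresses, computed without wrapping. */
    uintptr_t distance = ua < ub ? ub - ua : ua - ub;

    return distance < n;
}
```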

Yeah.

My "_memlzcpy()" functions do a lot of relative comparisons (more than
needed for memmove):
dst<=src: memmove
(dst-src)>=sz: memcpy
(dst-src)>=32: can copy with 32B blocks
(dst-src)>=16: can copy with 16B blocks
(dst-src)>= 8: can copy with 8B blocks
1/2/4: Generate a full-block fill pattern
3/5/6/7: partial fill pattern (16B block with irregular step)

If this is something for your library for your compiler, then of course
you are free to do anything you want here - standard library code does
not need to be portable, but is free to use any kind of compiler "magic"
it likes. (For example, gcc has lots of builtins and extensions that
are not targeted at normal code, but are targeted specifically at
library writers.)

There is a difference here between "_memlzcpy()" and "_memlzcpyf()" in
that:
the former will always copy an exact number of bytes;
the latter may write 16-32 bytes over the limit.

It may do /what/ ? That is a scary function!

Possible:
   __MINALIGN_type__ //minimum allowed alignment for type

_Alignof(type) has been around since C11.

_Alignof tells the native alignment, not the minimum.

It is the same thing.

Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would give
1 if the target supports misaligned pointers.

The alignment of types in C is given by _Alignof. Hardware may support unaligned accesses - C does not. (By that, I mean that unaligned
accesses are UB.)
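
As a small illustration of using the C11 alignment query rather than a target-specific macro, here is a sketch that checks whether an address is suitably aligned for uint32_t before it is dereferenced as one. is_aligned_for_u32 is a hypothetical name, and the check relies on the uintptr_t value reflecting the address, which is an assumption about the implementation.

```c
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 if p satisfies the alignment the implementation requires
   for uint32_t, 0 otherwise. */
static int is_aligned_for_u32(const void *p)
{
    return ((uintptr_t)p % alignof(uint32_t)) == 0;
}
```

When the check fails, copying the bytes out with memcpy() is the portable alternative to dereferencing the pointer directly.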

Maybe also alias pointer control:
   __POINTER_ALIAS__
     __POINTER_ALIAS_CONSERVATIVE__
     __POINTER_ALIAS_STRICT__

Where, pointer alias can be declared, and:
   If conservative, then conservative semantics are being used.
     Pointers may be freely cast without concern for pointer aliasing.
     Compiler will assume that "non restrict" pointer stores may alias.
   If strict, the compiler is using TBAA semantics.
     Compiler may assume that aliasing is based on pointer types.

Faffing around with pointer types - breaking the "effective type"
rules - has been a bad idea and risky behaviour since C was
standardised. You never need to do it. (I accept, however, that on
some weaker or older compilers "doing the right thing" can be
noticeably less efficient than writing bad code.) Just get a
half-decent compiler and use memcpy(). For any situation where you
might think casting pointer types would be a good idea, your sizes are
small and known at compile time, so they are easy for the compiler to
optimise.
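
A minimal sketch of the memcpy() alternative to casting pointer types, using the common case of viewing a float's bits as an integer. float_bits is a made-up name; the example assumes float is 32 bits wide, which the static assertion checks at compile time.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

/* Returns the object representation of f as a 32-bit integer without
   breaking the effective type rules. */
static uint32_t float_bits(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);
    return u;
}
```

The size is small and known at compile time, which is the situation described above where a compiler has an easy job turning the copy into a register move.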

It depends.

In some things, like my ELF and PE/COFF program loaders, the code can
get particularly nasty in these areas...

It may look simpler in the code to do this kind of thing, but it is not /necessary/ and it is not safe unless you are writing non-portable code
and are sure it will only be used on a compiler that supports it. Thus
the Linux kernel requires "-fno-strict-aliasing", because some of the
Linux kernel authors write crap C code. (Or, to be a bit fairer, some
of the code in the Linux kernel is very old and comes from a time when
writing things correctly while generating efficient results would need
more effort.)

And as a general rule, if you feel you really want to break the rules
of C and still get something useful out at the end, use "volatile"
liberally.

I have used "volatile" here to good effect.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to David Brown on Fri Sep 13 15:55:39 2024

David Brown <[email protected]> schrieb:

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

What about VLAs?

There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.

It is almost impossible to gather statistics on compiler use,
especially with free compilers, but what about MSVC and icc?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to [email protected] on Fri Sep 13 13:11:37 2024

On Thu, 12 Sep 2024 04:04:06 -0700, Tim Rentsch
<[email protected]> wrote:

George Neuner <[email protected]> writes:

On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
<[email protected]> wrote:

On Mon, 09 Sep 2024 23:27:24 -0400
George Neuner <[email protected]> wrote:

On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
(Anton Ertl) wrote:

Tim Rentsch <[email protected]> writes:

[email protected] (Anton Ertl) writes:

There was still no easy way to determine whether your software
that calls memcpy() actually works as expected on all hardware,

There may not be a way to tell if memcpy()-calling code will
work on platforms one doesn't have, but there is a relatively
simple and portable way to tell if some memcpy() call crosses
over into the realm of undefined behavior.

1) At first I thought that yes, one could just check whether
there is an overlap of the memory areas. But then I remembered
that you cannot write such a check in standard C without (in the
general case) exercising undefined behaviour; and then the
compiler could eliminate the check or do something else that's
unexpected. Do you have such a check in mind that does not
exercise undefined behaviour in the general case?

The result of comparing pointers to two elements of the same array
is defined. Cast to (char*), both src and dst can be considered
to point to elements of the [address space sized] char array at
address zero.

According to my understanding, your 'can be considered' part is not
codified in the C Standard.

Adding size_t to a pointer yields another pointer of the same
type.

In terms of types, that is right, but the addition works only if
the pointer points into an array large enough to include the
result of the addition (the result is also allowed to be just one
past the end of the array).

All of gcc, clang and MSVC seem happy with this.

It works. But is it guaranteed to work in the future by some sort
of document? I am pretty sure that no such guarantee exists in gcc
and MSVC docs. I did not look in clang docs. Trying to find
anything in LLVM/clang docs makes me sad.

I know that it has worked as expected with every version of gcc
and Microsoft I've used since 1988. [clang I don't use, but I
tried it on godbolt.org with the most recent version]

Will it continue to work ... who knows?

I definitely am NOT an expert on the C standard, but thinking
about it, it occurred to me that if an array is explicitly defined
that *might* cover all memory (or at least all heap), then the
compiler would have to honor any apparent pointers into it.

E.g., char (*all_memory)[] = 0;

This declaration introduces a pointer, not an array. Similarly
the declaration

char (*great_white_array)[ 999999999999999999 ] = 0;

does not introduce an array but just a pointer (and initializes
the pointer to be a null pointer). There is no humongous array.

Of course there is no actual array ... the point was to (try to)
define *something* such that the compiler would think there was an
array and consider any char* as possibly pointing to an element of
that array.
[And yes! it might end up pessimizing character manipulating code.]

The C standard guarantees that pointers to 2 elements of the same
array are comparable, and current (and past) compilers do allow
comparing arbitrary pointers when cast to char* without needing an
actual char array that covers the addresses.

But a guarantee wrt the standard requires the compiler to at least
*think* there is such an array. The question is how to do that.

None of the compilers at godbolt seem to need this to compare
arbitrary addresses as char*, but all accept it.

The given declaration of 'all_memory' is strictly conforming.
It must be accepted by any conforming C implementation (which
all of gcc, clang, and MSVC purport to be, IIUC).

Obviously speculation, but it's the best I have.

It's important to realize that there are two distinct questions.
One, does the code work (in a given implementation)? Two, does
the code satisfy the rules given in the C standard?

Unfortunately having an answer to the first question does not by
itself give enough information to answer the second question.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Fri Sep 13 10:42:22 2024

Michael S <[email protected]> writes:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

Why not?

Because it's not needed, and would make things worse rather
than better. The result would be a bigger language but not
a better language.

I beg to differ.
Yes, the standard would be bigger. And yes, a few unimportant benchmarks
would run a little slower. But the job of compiler writers would be
simpler and less exciting (good thing!). Most importantly,
programming in the resulting language would feel more predictable.

I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain
limited classes of buffer overflows.

Eliminating undefined behavior is not what's being asked for.
These two questions are not the same.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

I think the consequences of changes like the ones you suggest
would be much larger than you think they would be. The result
would change C into a completely different language.

Also I think the percentage of code where such considerations are
relevant is extremely small, significantly less than a thousandth
of a percent. That's a mighty small tail wagging a mighty large
dog.

I'm not trying to convince anyone; just stating a personal view.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to George Neuner on Fri Sep 13 11:09:01 2024

George Neuner <[email protected]> writes:

On Thu, 12 Sep 2024 04:04:06 -0700, Tim Rentsch
<[email protected]> wrote:

George Neuner <[email protected]> writes:

[...]

I definitely am NOT an expert on the C standard, but thinking
about it, it occurred to me that if an array is explicitly defined
that *might* cover all memory (or at least all heap), then the
compiler would have to honor any apparent pointers into it.

E.g., char (*all_memory)[] = 0;

This declaration introduces a pointer, not an array. Similarly
the declaration

char (*great_white_array)[ 999999999999999999 ] = 0;

does not introduce an array but just a pointer (and initializes
the pointer to be a null pointer). There is no humongous array.

Of course there is no actual array ... the point was to (try to)
define *something* such that the compiler would think there was an
array and consider any char* as possibly pointing to an element of
that array.
[And yes! it might end up pessimizing character manipulating code.]

The C standard guarantees that pointers to 2 elements of the same
array are comparable, and current (and past) compilers do allow
comparing arbitrary pointers when cast to char* without needing an
actual char array that covers the addresses.

But a guarantee wrt the standard requires the compiler to at least
*think* there is such an array. The question is how to do that.

What the compiler thinks is irrelevant. It's only what the C
standard thinks that matters.

If someone fools the C compiler today they might (or might not)
get what they want or expect. But fooling the compiler is a
risky strategy, and it's almost never needed; people try to
trick compilers a lot more often than circumstances actually
warrant, and that is not even counting non-guarantees about
future behavior.

Incidentally, I am in the middle of porting some code from one
platform to another. Code works fine on the original platform,
millions of tests are picture perfect. No undefined behavior in
sight. The new platform is a nightmare, thanks to a certain
well-known company headquartered in the state of Washington. Any
concerns about what happens with undefined behavior are so far
down the list we'd need a telescope to see them. Given this
recent experience, it's hard for me to get too worked up about
defining these over-the-edge cases.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to David Brown on Fri Sep 13 13:09:00 2024

On 9/3/2024 4:14 PM, David Brown wrote:

On 03/09/2024 18:54, Stephen Fuld wrote:

On 9/2/2024 11:23 PM, David Brown wrote:

On 02/09/2024 18:46, Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

And also for Rust versus C++ ?

I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.

I want to know about both :-)

In my field, small-systems embedded development, C has been dominant for
a long time, but C++ use is increasing. Most of my new stuff in recent times has been C++. There are some in the field who are trying out
Rust, so I need to look into it myself - either because it is a better
choice than C++, or because customers might want it.

My impression - based on hearsay for Rust as I have no experience -
is that the key point of Rust is memory "safety". I use scare-quotes
here, since it is simply about correct use of dynamic memory and
buffers.

I agree that memory safety is the key point, although I gather that it
has other features that many programmers like.

Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many features. But is it overall up or down, for /my/ uses?

Examples of things that I think are good in Rust are making variables immutable by default and pattern matching. Steps down include lack of function overloading

Rust's generic functions are not sufficient?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Michael S on Fri Sep 13 21:39:39 2024

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Sep 13 23:16:19 2024

On Fri, 13 Sep 2024 21:39:39 +0000, Thomas Koenig wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety.

FORTRAN allowed:
subroutine1:
COMMON /ALPHA/i,j,k,l,m,n
subroutine2:
COMMON /ALPHA/x,y,z
expecting {i,j}, which are INTEGER*4, to overlap with x, a REAL*8; ...
{Completely neglecting the BE/LE problems,...}

Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to [email protected] on Sat Sep 14 07:25:00 2024

MitchAlsup1 <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 +0000, Thomas Koenig wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety.

FORTRAN allowed:
subroutine1:
COMMON /ALPHA/i,j,k,l,m,n
subroutine2:
COMMON /ALPHA/x,y,z
expecting {i,j}, which are INTEGER*4, to overlap with x, a REAL*8; ...
{Completely neglecting the BE/LE problems,...}

Not only that, also different FP formats...

The only thing that was guaranteed is the storage unit. An INTEGER
and a REAL each occupy one storage unit, a DOUBLE PRECISION occupies
two. Through EQUIVALENCE or through different COMMON blocks in
different procedures, an INTEGER and a REAL can occupy the same
storage location. And if a value was assigned to a variable of
one type (the entity became defined, in standardese) the variable
with the same storage location becomes undefined (at least as far
back as Fortran 77, I didn't check earlier).

This was very widely ignored, people used COMMON and EQUIVALENCE
for type punning all the time.

There also was the issue of alignment; by playing tricks with
EQUIVALENCE, you could put a double precision variable on an
unaligned memory location. With the advent of the RISC CPUs which
didn't support this, this became the most-ignored provision in the
standard (but with a flag to restore standard-conforming behavior).

Hmm... what were the alignment restrictions on double precision
on the /360?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to BGB on Sat Sep 14 08:24:29 2024

BGB <[email protected]> schrieb:

On 9/13/2024 10:55 AM, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

What about VLAs?

IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.

It's only been 25 years. You have to give Microsoft a bit of
time to catch up. I'm sure they will get there by 2099.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kent Dickey@21:1/5 to Scott Lurndal on Sat Sep 14 13:08:05 2024

In article <UxpCO.174965$[email protected]>,
Scott Lurndal <[email protected]> wrote:

Bernd Linsel <[email protected]> writes:

On 05.09.24 19:04, Terje Mathisen wrote:

One of my alternatives is

unsigned u = start; // Cannot be less than zero
if (u) {
u++;
do {
u--;
data[u]...
} while (u);
}

This typically results in effectively the same asm code as the signed
version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
of JA (Jump Above or Equal), but my version is far more verbose.

Alternatively, if you don't need all N bits of the unsigned type, then
you can subtract and check if the top bit is set in the result:

for (unsigned u = start; (u & TOPBIT) == 0; u--)

Terje

What about:

for (unsigned u = start; u != ~0u; --u)

This is the form we use most when we need
to work in reverse.

...

or even

for (unsigned u = start; (int)u >= 0; --u)
...

?

I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
on godbolt.org:
- 32 bit version: https://godbolt.org/z/TMhhx3nch
- 64 bit version: https://godbolt.org/z/8oxzTf5Gf

No significant differences in code generation for unsigned vs. signed.

This discussion wandered into many subthreads, but I only want to make
one post and chose here.

When you write code working on signed numbers and do something like:

(a < 0) || (a >= max)

Then the compiler realizes if you treat 'a' as unsigned, this is just:

(unsigned)a >= max

since any negative number, treated as unsigned, will be larger than the
largest positive signed number. So, to do loops which count down and
have any stride using an unsigned loop count:

for(u = start; u <= start; u -= step)

With the usual caveats (start must be a valid signed number, and step
cannot be so large that start + step crosses the signed boundary).
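
Both idioms can be sketched as follows. out_of_range and sum_down are made-up names for illustration; the sketch assumes max is non-negative, step is non-zero, and start and step both fit in a signed int, which are the caveats listed above.

```c
/* Single-comparison form of (a < 0) || (a >= max), assuming max >= 0.
   A negative a converts to a large unsigned value and fails the test. */
static int out_of_range(int a, int max)
{
    return (unsigned)a >= (unsigned)max;
}

/* Sums data[start], data[start - step], ... down to the last index that
   is still >= 0. The loop ends when u wraps around past zero, because
   the wrapped value is then larger than start. */
static unsigned sum_down(const unsigned *data, unsigned start, unsigned step)
{
    unsigned total = 0;

    for (unsigned u = start; u <= start; u -= step)
        total += data[u];
    return total;
}
```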

But: unsigned numbers in C have some dangers, which no one here has
mentioned. Some code presented comes CLOSE to being wrong, but gets
lucky. With "int" being 32-bits, C promotion rules around unsigned
ints, signed ints, and unsigned 64-bit can create trouble.

uint64_t dval; uint32_t val32; int a;

val32 = 1; dval = 1; a = 1;
dval = val32 - 2 + dval;

C will do (val32 - 2) first, which is (1U - 2), i.e. 0xffff_ffff, and
then add dval, and the result is 0x1_0000_0000.
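
A minimal sketch of the pitfall and one way around it, assuming int is 32 bits wide so that uint32_t does not promote to int. The function names mixed_naive and mixed_widened are made up for illustration.

```c
#include <stdint.h>

/* The subtraction is done in 32-bit unsigned arithmetic and wraps
   before the result is widened for the addition. */
static uint64_t mixed_naive(uint32_t val32, uint64_t dval)
{
    return val32 - 2 + dval;
}

/* Widening first keeps the whole expression in 64-bit arithmetic, so
   the intermediate wraparound cancels out in the final sum. */
static uint64_t mixed_widened(uint32_t val32, uint64_t dval)
{
    return (uint64_t)val32 - 2 + dval;
}
```

With val32 = 1 and dval = 1, the first form gives 0x1_0000_0000 as described, while the second gives 0.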

Signed numbers don't have this risk, so if you're doing known small loops,
you can just use ints. If you're doing possibly large loops, just use
int64_t.

Bringing it back to "architecture": as Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you,
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).
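
As a reminder of what the data-model names mean, here is a rough sketch that classifies the current platform by the sizes of int, long and pointers. The function name data_model is hypothetical, and the classification assumes 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the conventional name of the C data model in use,
   judged only by sizeof(int), sizeof(long) and sizeof(void *). */
const char *data_model(void)
{
    size_t i = sizeof(int), l = sizeof(long), p = sizeof(void *);

    if (i == 4 && l == 4 && p == 4) return "ILP32";
    if (i == 4 && l == 4 && p == 8) return "LLP64"; /* e.g. 64-bit Windows */
    if (i == 4 && l == 8 && p == 8) return "LP64";  /* most 64-bit Unix ABIs */
    if (i == 8 && l == 8 && p == 8) return "ILP64";
    return "other";
}
```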

Kent


From Anton Ertl@21:1/5 to Kent Dickey on Sat Sep 14 13:26:52 2024

[email protected] (Kent Dickey) writes:

Bringing it back to "architecture": as Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you,
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).

We have now had more than 30 years of everyone involved catering for
this mistake. Given their goals, I think that RISC-V made the
right choice for int in their ABI, even if the original choice
by the MIPS and Alpha people, which they follow like everyone else, was
wrong.

That being said, one option would be to introduce another ABI and API
with 64-bit int (and maybe 32-bit long short int), and programmers
could choose whether to program for the ILP API, or the int=int32_t
API. Would the ILP API/ABI fare better than x32? I doubt it, even
though I would support it. This ship probably has sailed.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to Thomas Koenig on Sat Sep 14 21:59:22 2024

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs. What I propose is to formally codify 50 y.o.
existing practice.
And no, it's both much easier to follow than old FORTRAN common blocks
and has wider scope (applies to all storage classes, rather than just
to global).


From Thomas Koenig@21:1/5 to Michael S on Sat Sep 14 19:02:43 2024

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs. What I propose is to formally codify 50 y.o.
existing practice.

So was Fortran's misuse of COMMON Blocks.

And no, it's both much easier to follow than old FORTRAN common blocks

You want to allow array bounds violations and type punning rolled into
one?

I beg to differ that this is in any way easier, or better.

and has wider scope (applies to all storage classes, rather than just
to global).

You're correct, the potential for mischief is far greater.


From Thomas Koenig@21:1/5 to Kent Dickey on Sat Sep 14 19:00:35 2024

Kent Dickey <[email protected]> schrieb:

When you write code working on signed numbers and do something like:

(a < 0) || (a >= max)

Then the compiler realizes if you treat 'a' as unsigned, this is just:

(unsigned)a >= max

For which definition of a and max exactly?

It certainly does not do so for

_Bool foo(int a, int max)
{
return (a < 0) || (a >= max);
}


From Thomas Koenig@21:1/5 to [email protected] on Sat Sep 14 19:26:13 2024

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast as (or possibly faster than)
64-bit instructions, as AMD64 and ARM have shown.

And having a smaller memory footprint is also beneficial, especially
for caches.

(Plus, there are FORTRAN's storage association rules, but these should
be less used by now. But for a 64-bit integer, they pretty much would
require a 64-bit REAL and a 128-bit DOUBLE PRECISION).


From MitchAlsup1@21:1/5 to Anton Ertl on Sat Sep 14 19:11:30 2024

On Sat, 14 Sep 2024 13:26:52 +0000, Anton Ertl wrote:

[email protected] (Kent Dickey) writes:

Bringing it back to "architecture": as Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you,
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).

We have now had more than 30 years of everyone involved catering for
this mistake. Given their goals, I think that RISC-V made the
right choice for int in their ABI, even if the original choice
by the MIPS and Alpha people, which they follow like everyone else, was
wrong.

Until the advent of int32_t the only way to get a known 32-bit container
was int. But I agree with the notion that ILP64 should be universal now,
and if you want/need something smaller, use some other type indicator
than int.

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That being said, one option would be to introduce another ABI and API
with 64-bit int (and maybe 32-bit long short int), and programmers
could choose whether to program for the ILP API, or the int=int32_t
API. Would the ILP API/ABI fare better than x32? I doubt it, even
though I would support it. This ship probably has sailed.

- anton


From Kent Dickey@21:1/5 to [email protected] on Sat Sep 14 19:57:04 2024

In article <vc4mgj$1khmk$[email protected]>,
Thomas Koenig <[email protected]> wrote:

Kent Dickey <[email protected]> schrieb:

When you write code working on signed numbers and do something like:

(a < 0) || (a >= max)

Then the compiler realizes if you treat 'a' as unsigned, this is just:

(unsigned)a >= max

For which definition of a and max exactly?

It certainly does not do so for

_Bool foo(int a, int max)
{
return (a < 0) || (a >= max);
}

Sorry, I should have made it clear: this is for max >= 0 (but not
necessarily an unsigned variable), and in my code max is a constant,
which is how the compiler knows it's positive. I have this in my code
all the time to validate function inputs--a negative number is bad, and
a number beyond a certain reasonable value is bad. And I let the
compiler optimize the check to (unsigned)a >= (unsigned)max.
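
A minimal sketch of that input-validation pattern, with a made-up constant limit MAX_INDEX and hypothetical function names. It only illustrates that the two-test and one-test forms agree when the limit is a non-negative constant; whether a given compiler merges them is up to its optimizer.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

enum { MAX_INDEX = 100 };   /* hypothetical non-negative limit */

/* The check as written in source: two signed comparisons. */
bool out_of_range_two_tests(int a)
{
    return (a < 0) || (a >= MAX_INDEX);
}

/* The single unsigned comparison it can be reduced to: negative
   values of a become large unsigned values above MAX_INDEX. */
bool out_of_range_one_test(int a)
{
    return (unsigned)a >= (unsigned)MAX_INDEX;
}

/* True when both forms give the same answer for a. */
bool checks_agree(int a)
{
    return out_of_range_two_tests(a) == out_of_range_one_test(a);
}
```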

Kent


From Thomas Koenig@21:1/5 to Michael S on Sat Sep 14 20:14:23 2024

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would be
bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs.

Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.

struct {
char x[8];
int y;
} bar;

Assume

bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y

The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.

So, either bar.y is treated as if it was volatile, or hard-to-detect
bugs would appear because, with optimization, the assignment would
sometimes change the value of bar.y and sometimes not.
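
To make the scenario concrete, here is a sketch of the in-bounds case under today's rules. The names bar_t, store_then_read and demo_store_then_read are hypothetical. Because a store through bar->x[i] with a valid i cannot touch bar->y, a compiler is currently free to return the constant without reloading; the proposal under discussion would take that freedom away whenever i is not known.

```c
#include <assert.h>
#include <stddef.h>

struct bar_t {
    char x[8];
    int  y;
};

/* Under current C rules the store to x[i] cannot alias y, so the
   return value may be folded to the constant 1234. */
int store_then_read(struct bar_t *bar, size_t i)
{
    assert(i < sizeof bar->x);   /* keep the access in bounds */
    bar->y = 1234;
    bar->x[i] = 42;
    return bar->y;
}

/* Small driver so the function can be exercised without globals. */
int demo_store_then_read(size_t i)
{
    struct bar_t bar = { {0}, 0 };
    return store_then_read(&bar, i);
}
```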


From Bernd Linsel@21:1/5 to Kent Dickey on Sat Sep 14 22:18:12 2024

On 14.09.24 21:57, Kent Dickey wrote:

In article <vc4mgj$1khmk$[email protected]>,
Thomas Koenig <[email protected]> wrote:

Kent Dickey <[email protected]> schrieb:

When you write code working on signed numbers and do something like:

(a < 0) || (a >= max)

Then the compiler realizes if you treat 'a' as unsigned, this is just:

(unsigned)a >= max

For which definition of a and max exactly?

It certainly does not do so for

_Bool foo(int a, int max)
{
return (a < 0) || (a >= max);
}

Sorry, I should have made it clear: this is for max >= 0 (but not
necessarily an unsigned variable), and in my code max is a constant,
which is how the compiler knows it's positive. I have this in my code
all the time to validate function inputs--a negative number is bad, and
a number beyond a certain reasonable value is bad. And I let the
compiler optimize the check to (unsigned)a >= (unsigned)max.

Kent

And that's the information the compiler was missing to optimize foo() in
the same way:

_Bool foo1(int a, int max)
{
if (__builtin_expect(max < 0, 0)) __builtin_unreachable();

return a < 0 || a >= max;
}

_Bool foo2(int a, int max)
{
return (unsigned)a >= (unsigned)max;
}

compiles to:

foo1:
cmp edi, esi
setnb al
ret
foo2:
cmp edi, esi
setnb al
ret

(x86-64 gcc 14.2 -Wall -Wextra -Wpedantic -O3 -fexpensive-optimizations)

--
Bernd Linsel


From Michael S@21:1/5 to Thomas Koenig on Sat Sep 14 23:53:40 2024

On Sat, 14 Sep 2024 08:24:29 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

BGB <[email protected]> schrieb:

On 9/13/2024 10:55 AM, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

Most of the commonly used parts of C99 have been "safe" to use
for 20 years. There were a few bits that MSVC did not implement
until relatively recently, but I think even they have caught up now.

What about VLAs?

IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.

It's only been 25 years. You have to give Microsoft a bit of
time to catch up. I'm sure they will get there by 2099.

Microsoft does not see ISO C as their primary language.
They are willing to do the easy stuff, but seem very reluctant to
implement anything that is fundamentally incompatible with C++.
Both VLA and _Complex fall under the latter category.
Both were optional in C11/17.
However in C23, while VLAs are still optional, variably-modified types,
which are also fundamentally incompatible with C++, became mandatory.
I wonder what Microsoft will do about it.
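
For code that has to build on compilers with and without these features, C11 provides the feature-test macros __STDC_NO_VLA__ and __STDC_NO_COMPLEX__. The sketch below uses them; the function names has_vla and has_complex are invented, and the C99 branch assumes a conforming hosted implementation.

```c
#include <assert.h>

/* 1 if the implementation claims VLA support, 0 otherwise. */
int has_vla(void)
{
#if defined(__STDC_NO_VLA__)
    return 0;                 /* C11 and later: VLAs opted out */
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
    return 1;                 /* C99, or C11+ without the opt-out */
#else
    return 0;                 /* C90 or not a C compiler */
#endif
}

/* 1 if the implementation claims _Complex support, 0 otherwise. */
int has_complex(void)
{
#if defined(__STDC_NO_COMPLEX__)
    return 0;
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
    return 1;
#else
    return 0;
#endif
}
```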


From Michael S@21:1/5 to David Brown on Sun Sep 15 00:11:53 2024

On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianness (macros exist, typically compiler specific);
  And, apparently GCC and Clang can't agree on which strategy
  to use.
Whether or not the target/compiler allows misaligned memory access;
  If set, one may use misaligned access.
Whether or not memory uses a single address space;
  If set, all pointer comparisons are allowed.
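
As an illustration of the endianness item, here is a sketch of a portable run-time probe, cross-checked against the __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ macros that GCC and Clang predefine. The function names are hypothetical; on compilers without those macros the cross-check simply reports agreement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Inspect the first byte of a known 32-bit value in memory. */
int is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Compare the run-time probe with the compiler's own macro,
   when that macro is available. */
int agrees_with_compiler_macro(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    return is_little_endian() == (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__);
#else
    return 1;   /* nothing to compare against */
#endif
}
```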

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

I fully agree that C is not, and should not be seen as, a "high-level assembly language". But it is a language that is very useful to "hardware-type folks", and there are a few things that could make it
easier to write more portable code if they were standardised. As it
is, we just have to accept that some things are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

And how should that be defined?

bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));
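
The memcpy form is already well-defined in today's C, so it can be tried directly. In this sketch the names bar_t and poke_first_byte_of_y are hypothetical. Note that it writes to the first byte of y regardless of padding, whereas bar.x[8] would only land there if no padding separates x and y.

```c
#include <assert.h>
#include <string.h>

struct bar_t {
    char x[8];
    int  y;
};

/* Overwrite the lowest-addressed byte of y and return y.
   The result depends on byte order: 42 on little-endian machines,
   42 * 2**24 on big-endian machines with a 4-byte int. */
int poke_first_byte_of_y(void)
{
    struct bar_t bar;
    char tmp = 42;

    bar.y = 0;
    memcpy(&bar.y, &tmp, sizeof tmp);
    return bar.y;
}
```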


From Michael S@21:1/5 to Thomas Koenig on Sun Sep 15 00:19:39 2024

On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would
be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs.

Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.

struct {
char x[8];
int y;
} bar;

Assume

bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y

The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.

Yes, exactly.

So, either bar.y is treated as if it was volatile, or hard-to-detect
bugs would appear because, with optimization, the assignment would
sometimes change the value of bar.y and sometimes not.

No, the semantics is that the compiler has to reload bar.y if it keeps
it in a register. An optimizer that does anything else is buggy.


From Tim Rentsch@21:1/5 to Thomas Koenig on Sat Sep 14 19:38:36 2024

Thomas Koenig <[email protected]> writes:

BGB <[email protected]> schrieb:

On 9/13/2024 10:55 AM, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

What about VLAs?

IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.

It's only been 25 years. You have to give Microsoft a bit of
time to catch up. I'm sure they will get there by 2099.

Microsoft is never going to catch up because they don't want to
catch up. The choice to offer a sub-standard C compiler is the
result of a business decision, not a technical decision; they
want to steer people away from open environments and towards
their proprietary environments. The world would be a better
place if Microsoft had been broken up in the judgment of the
anti-trust action 20+ years ago. And they certainly deserved
it.


From Tim Rentsch@21:1/5 to BGB on Sat Sep 14 20:07:03 2024

BGB <[email protected]> writes:

On 9/12/2024 5:12 AM, Tim Rentsch wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros for
various things:
Endianness (macros exist, typically compiler specific);
  And, apparently GCC and Clang can't agree on which strategy to use.
Whether or not the target/compiler allows misaligned memory access;
  If set, one may use misaligned access.
Whether or not memory uses a single address space;
  If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

There are a few ways things can go:
Define rules, have one of N permutations for how those rules can go;
How it often worked in practice.
Throw up hands and say it is unknowable.
What a lot of "portability" people assert.
Do whatever gives the fastest results in standardized benchmarks.
What many compiler maintainers want.

These options come from the perspective of someone writing a
compiler. That is very different from the perspective of someone
writing a language definition. More than 60 years ago we learned
the lesson that we shouldn't let machine architectures be defined
just by what the hardware does. The same lesson applies to
defining a programming language just by what compilers do, or
even just what compilers can do.


From Thomas Koenig@21:1/5 to Michael S on Sun Sep 15 08:05:47 2024

Michael S <[email protected]> schrieb:

On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would
be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs.

Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.

struct {
char x[8];
int y;
} bar;

Assume

bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y

The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.

Yes, exactly.

So, volatile for all structs, plus prescribed behavior on
array overruns.

At the risk of repeating myself: This is an extremely bad idea.

I rest my case.


From Michael S@21:1/5 to Thomas Koenig on Sun Sep 15 12:50:06 2024

On Sun, 15 Sep 2024 08:05:47 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by
implementation. And in practice it is. Just not in
theory.

Do you mean union rather than struct? And do you mean
bar.x[7] rather than bar.x[8]? Surely no one would expect
that storing into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think
should be defined by the C standard but is not? And the
same question for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior
would be bar.y==42 on LE machines and bar.y==42*2**24 on BE
machines. As it actually happens in reality with all
production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers
(TM) have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs.

Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.

struct {
char x[8];
int y;
} bar;

Assume

bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y

The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.

Yes, exactly.

So, volatile for all structs,

No.
Accesses to fields of a struct should be ordered only relative to
accesses to other fields *of the same instance* of the struct. And,
of course, the usual 'as if' rule applies, so an optimizing compiler
can figure out that bar.x[7] and bar.y do not overlap and thus generate
code knowing that a write to one does not clobber the other.
That's pretty far from the semantics of volatile.

plus prescribed behavior on array overruns.

Only within the bounds of the struct. bar.x[12] remains UB.

At the risk of repeating myself: This is an extremely bad idea.

I rest my case.

You seem to think that C should be as optimizable and as full of UBs as
Fortran. Many compiler authors agree with you.
I have a different idea. IMHO, your party exploits the letter of the C
standard in violation of its spirit.


From Thomas Koenig@21:1/5 to Waldek Hebisch on Sun Sep 15 12:30:22 2024

Waldek Hebisch <[email protected]> schrieb:

[...]

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

That has two drawbacks: a minor one, that you need to know that
there is no padding between 'x' and 'y'.

Similar to Fortran's problems with unaligned variables in COMMON
blocks.

Major drawback
is that it would forbid bounds checking for array accesses.
In code like above it is easy to spot out of bound access at
compile time.

And it happens:

$ cat x.c

struct {
char x[8];
int y;
} bar;

void foo()
{
bar.y = 0;
bar.x[8] = 42;
}
$ gcc -O2 -c x.c
x.c: In function 'foo':
x.c:10:12: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
10 | bar.x[8] = 42;
| ~~~~~~~~~^~~~
x.c:3:9: note: at offset 8 into destination object 'x' of size 8
3 | char x[8];
| ^


From Thomas Koenig@21:1/5 to Michael S on Sun Sep 15 12:38:32 2024

On 2024-09-15, Michael S <[email protected]> wrote:

You seem to think that C should be as optimizable and as full of UBs as Fortran.

The only place where "undefined behavior" is mentioned in the Fortran
standards is with reference to C.

Many compiler authors agree with you.
I have different idea.

You don't appear to believe in specifications.

IMHO, your party exploits the letter of C
standard in violation to its spirit.

If you meet the spirit of the C standard, say hello to him for me.


From Michael S@21:1/5 to Waldek Hebisch on Sun Sep 15 15:40:38 2024

On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:

Michael S <[email protected]> wrote:

On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

And how should that be defined?

bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));

That has two drawbacks: a minor one, that you need to know that
there is no padding between 'x' and 'y'.

Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding
algorithms.
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.
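
Until the standard says anything of the kind, the padding a particular compiler inserts can at least be measured with offsetof. A small sketch, with the hypothetical names bar_t and padding_before_y:

```c
#include <assert.h>
#include <stddef.h>

struct bar_t {
    char x[8];
    int  y;
};

/* Number of padding bytes between the end of x and the start of y.
   sizeof on a member reached through a null pointer is fine here,
   because the operand of sizeof is not evaluated. */
size_t padding_before_y(void)
{
    size_t end_of_x = offsetof(struct bar_t, x)
                    + sizeof(((struct bar_t *)0)->x);
    return offsetof(struct bar_t, y) - end_of_x;
}
```

On common ABIs where int is 4-byte aligned, the 8-byte array already ends on an int boundary, so one would expect this to return 0.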

Major drawback
is that it would forbid bounds checking for array accesses.
In code like above it is easy to spot out of bound access at
compile time. Even with variable index compiler knows size
of 'x' so can insert bounds checking code (and AFAIK if you
insist leading compilers will do this).

More generally, assuming cooperating compiler modern C has enough
features to eliminate out of bounds array indexing.

In general, only by means of fat pointers.
Fat pointers break existing ABIs.
Also, if fat pointers are what I want, then I already have them in a few
mainstream languages where they are integrated much better than they
will ever be in "checked C".

More precisely,
I mean compiler which inserts bounds check where they are needed
and warns or rejects constructs that can not be checked. I claim
that it is possible to write nontrivial programs in "checked C".
With change as above very important language construct would be
uncheckable.

BTW: If you need such behaviour you can get what you want by
using unions, so there is no need to break language for folks
that do not need this.
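
One possible reading of the union suggestion, sketched under assumptions: overlay the struct with an unsigned char array of the same size and address the byte at the offset of y through that array. The type and function names are invented, and this is only a guess at what was meant.

```c
#include <assert.h>
#include <stddef.h>

struct bar_t {
    char x[8];
    int  y;
};

/* The byte view covers the whole struct, so indexing it at the
   offset of y is an in-bounds access to the bytes array. */
union bar_u {
    struct bar_t  f;
    unsigned char bytes[sizeof(struct bar_t)];
};

/* Returns 42 on little-endian machines, 42 * 2**24 on big-endian
   machines with a 4-byte int. */
int poke_via_union(void)
{
    union bar_u u;

    u.f.y = 0;
    u.bytes[offsetof(struct bar_t, y)] = 42;
    return u.f.y;
}
```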

Such behavior is sometimes handy, but I can easily live without it.
Its potential usefulness is not my motivation. My motivation is
eliminating as many UBs as is practically possible.


From Michael S@21:1/5 to Thomas Koenig on Sun Sep 15 15:46:03 2024

On Sun, 15 Sep 2024 12:38:32 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

On 2024-09-15, Michael S <[email protected]> wrote:

You seem to think that C should be as optimizable and as full of
UBs as Fortran.

The only place where "undefined behavior" is mentioned in the Fortran standards is with reference to C.

The rest of the time they write "program shouldn't" or "when xyz
the program is ill-formed" or something like that. But the meaning is
exactly the same as UB in C.

Many compiler authors agree with you.
I have different idea.

You don't appear to believe in specifications.

IMHO, your party exploits the letter of C
standard in violation to its spirit.

If you meet the spirit of the C standard, say hello to him for me.

If I meet him, I'd try to drink him.


From Waldek Hebisch@21:1/5 to Michael S on Sun Sep 15 12:19:02 2024

Michael S <[email protected]> wrote:

On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

I fully agree that C is not, and should not be seen as, a "high-level
assembly language". But it is a language that is very useful to
"hardware-type folks", and there are a few things that could make it
easier to write more portable code if they were standardised. As it
is, we just have to accept that some things are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

And how should that be defined?

bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));

That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'. Major drawback
is that it would forbid bounds checking for array accesses.
In code like above it is easy to spot out of bound access at
compile time. Even with variable index compiler knows size
of 'x' so can insert bounds checking code (and AFAIK if you
insist leading compilers will do this).

More generally, assuming cooperating compiler modern C has enough
features to eliminate out of bounds array indexing. More precisely,
I mean compiler which inserts bounds check where they are needed
and warns or rejects constructs that can not be checked. I claim
that it is possible to write nontrivial programs in "checked C".
With change as above very important language construct would be
uncheckable.

BTW: If you need such behaviour you can get what you want by
using unions, so there is no need to break language for folks
that do not need this.
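
A minimal sketch of the union route, for anyone who wants the store just past 'x' to be well defined. The names bar_fields, bar_view and store_first_byte_of_y are made up for illustration. The sketch assumes an int without padding bits, and it uses offsetof so that any padding between 'x' and 'y' is looked up rather than guessed.

```c
#include <stddef.h>

/* Same members as the struct under discussion; the names are illustrative. */
struct bar_fields {
    char x[8];
    int y;
};

/* Overlay the struct with a raw byte view of the same storage. */
union bar_view {
    struct bar_fields f;
    unsigned char raw[sizeof(struct bar_fields)];
};

/*
 * Store one byte into the first byte of y through the byte view,
 * then read y back through the struct view.
 */
int store_first_byte_of_y(unsigned char value)
{
    union bar_view bar;

    bar.f.y = 0;
    /* offsetof gives the real position of y, including any padding after x. */
    bar.raw[offsetof(struct bar_fields, y)] = value;
    return bar.f.y;
}
```

With a 4-byte int, store_first_byte_of_y(42) should give 42 on a little-endian target and 42*2**24 on a big-endian one, while plain bar.x[i] indexing stays checkable against the declared size of 'x'.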

--
Waldek Hebisch

From John Dallman@21:1/5 to Michael S on Sun Sep 15 15:41:00 2024

In article <[email protected]>, [email protected] (Michael S) wrote:

Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding algorithms.

It is, and they do. I've used a lot of different compilers over the last
29 years, needing to know about padding for a DIY varargs, and I've never
had problems with finding out what the padding was.

It can usually be described quite briefly, by saying that all data types
are naturally aligned. The only variant of that I've encountered is on
32-bit x86 Linux and 32-bit POWER AIX where in both cases 8-byte doubles
were 4-byte aligned.
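
When the documentation is not at hand, the padding can also be read back from the compiler itself. The following is only a sketch: struct sample and the helper names are invented for illustration, and _Alignof needs a C11 compiler.

```c
#include <stddef.h>

/* Illustrative struct: a 1-byte member followed by a double and a short. */
struct sample {
    char c;
    double d;
    short s;
};

/* Padding bytes the compiler placed between c and d. */
size_t padding_before_d(void)
{
    return offsetof(struct sample, d)
         - (offsetof(struct sample, c) + sizeof(char));
}

/* Padding bytes between d and s. */
size_t padding_before_s(void)
{
    return offsetof(struct sample, s)
         - (offsetof(struct sample, d) + sizeof(double));
}

/* Tail padding that rounds the struct size up to its alignment. */
size_t tail_padding(void)
{
    return sizeof(struct sample)
         - (offsetof(struct sample, s) + sizeof(short));
}

/* Alignment the implementation requires for double (C11 _Alignof). */
size_t double_alignment(void)
{
    return _Alignof(double);
}
```

On a naturally aligned 64-bit target one would expect padding_before_d() to be 7; on the 32-bit x86 Linux case mentioned above, where doubles are 4-byte aligned, one would expect 3.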

The C standard specifies that struct members shall be stored in memory in
the same order as they appear in the declaration. It does not specify
padding because the standard committee feel they need to allow C to work
on machines that are not byte-addressed or are otherwise weird.

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed
by another field of integer type with the same or narrower width then
there should be no padding in-between.

That would be fine if you were willing to confine yourself to
byte-addressed machines.

John

From David Brown@21:1/5 to Thomas Koenig on Sun Sep 15 17:50:15 2024

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast as (or possibly faster than)
64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).

A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have a
load with post-increment instruction in the loop. You can also do that
with x as a 32-bit int, unless you are of the opinion that enough apples
added to a pile should give a negative number of apples. But with a
wrapping type for x - such as unsigned int in C or modulo types in Ada,
you have little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo operation. I
really can't see this all being handled by a single instruction.
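
As a sketch of the two cases, here is the same loop written once with a signed and once with an unsigned 32-bit index. The function names are invented for illustration, and which instructions a given compiler actually picks is target-dependent and not something this code demonstrates by itself.

```c
/*
 * Signed index: overflow of x would be undefined behaviour, so the compiler
 * may assume it never happens and keep a single running pointer (p + x).
 */
long sum_signed_index(const int *p, int start, int count)
{
    long total = 0;
    int x = start;

    for (int n = 0; n < count; n++) {
        int y = p[x++];
        total += y;
    }
    return total;
}

/*
 * Unsigned index: x++ is defined to wrap modulo UINT_MAX + 1, so on a 64-bit
 * target the compiler must preserve that wrap unless it can prove that it
 * cannot occur.
 */
long sum_unsigned_index(const int *p, unsigned int start, int count)
{
    long total = 0;
    unsigned int x = start;

    for (int n = 0; n < count; n++) {
        int y = p[x++];
        total += y;
    }
    return total;
}
```

Both functions return the same result for in-range indices; the difference is only in what the compiler is allowed to assume about x.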

Of course you could add a 32-bit zero extend or sign extend to many
32-bit ALU instructions and save some instructions - many architectures
already support that kind of thing.

And having a smaller memory footprint is also beneficial, especially
for caches.

(Plus, there are FORTRAN's storage association rules, but these should
be less used by now. But for a 64-bit integer, they pretty much would require a 64-bit REAL and a 128-bit DOUBLE PRECISION).

From Scott Lurndal@21:1/5 to Michael S on Sun Sep 15 15:45:39 2024

Michael S <[email protected]> writes:

On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:

That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'.

Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding algorithms.

This is definitely in the realm of the processor ABI, not
the compiler. And, most processor ABIs do document the
padding requirements (which generally reflect optimal hardware
access rules).

Most C and C++ compilers provide support for "packed" structures
when the programmer wishes explicit control over structure
member layout.
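
A small sketch of what that looks like in practice. The pack pragma used here is a compiler extension (accepted by GCC, Clang and MSVC), not standard C, and the struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the ABI decides where value goes. */
struct natural_layout {
    uint8_t tag;
    uint32_t value;
};

/* Packed layout: no padding is inserted between members. */
#pragma pack(push, 1)
struct packed_layout {
    uint8_t tag;
    uint32_t value;
};
#pragma pack(pop)

/* Offset of value under the ABI's default rules. */
size_t natural_value_offset(void)
{
    return offsetof(struct natural_layout, value);
}

/* Offset of value when the struct is packed. */
size_t packed_value_offset(void)
{
    return offsetof(struct packed_layout, value);
}
```

The packed struct places value at offset 1, at the price of misaligned accesses that may be slower, or need special code sequences, on some targets.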

From David Brown@21:1/5 to Stephen Fuld on Sun Sep 15 17:53:32 2024

On 13/09/2024 22:09, Stephen Fuld wrote:

On 9/3/2024 4:14 PM, David Brown wrote:

On 03/09/2024 18:54, Stephen Fuld wrote:

On 9/2/2024 11:23 PM, David Brown wrote:

On 02/09/2024 18:46, Stephen Fuld wrote:

On 9/2/2024 1:23 AM, Terje Mathisen wrote:

Anyway, that is all mostly moot since I'm using Rust for this kind
of programming now. :-)

Can you talk about the advantages and disadvantages of Rust versus C?

And also for Rust versus C++ ?

I asked about C versus Rust as Terje explicitly mentioned those two
languages, but you make a good point in general.

I want to know about both :-)

In my field, small-systems embedded development, C has been dominant
for a long time, but C++ use is increasing. Most of my new stuff in
recent times has been C++. There are some in the field who are trying
out Rust, so I need to look into it myself - either because it is a
better choice than C++, or because customers might want it.

My impression - based on hearsay for Rust as I have no experience -
is that the key point of Rust is memory "safety". I use
scare-quotes here, since it is simply about correct use of dynamic
memory and buffers.

I agree that memory safety is the key point, although I gather that
it has other features that many programmers like.

Sure. There are certainly plenty of things that I think are a better
idea in a modern programming language and that make it a good step up
compared to C. My key interest is in comparison to C++ - it is a step
up in some ways, a step down in others, and a step sideways in many
features. But is it overall up or down, for /my/ uses?

Examples of things that I think are good in Rust are making variables
immutable by default and pattern matching. Steps down include lack of
function overloading

Rust's generic functions are not sufficient?

I don't know Rust well enough to say for sure, but certainly in C++ a
generic function (a template function) and an overloaded function are completely different things.

From David Brown@21:1/5 to Michael S on Sun Sep 15 18:02:35 2024

On 14/09/2024 23:11, Michael S wrote:

On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

[...]

Would be nice, say, if there were semi-standard compiler macros
for various things:
Endianess (macros exist, typically compiler specific);
And, apparently GCC and Clang can't agree on which strategy
to use. Whether or not the target/compiler allows misaligned
memory access; If set, one may use misaligned access.
Whether or not memory uses a single address space;
If set, all pointer comparisons are allowed.

[elaborations on the above]

I suppose it's natural for hardware-type folks to want features
like this to be part of standard C. In a sense what is being
asked is to make C a high-level assembly language. But that's
not what C is. Nor should it be.

I fully agree that C is not, and should not be seen as, a "high-level
assembly language". But it is a language that is very useful to
"hardware-type folks", and there are a few things that could make it
easier to write more portable code if they were standardised. As it
is, we just have to accept that some things are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation. And
in practice it is. Just not in theory.

And how should that be defined?

bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));

No, it should not.

It should be "defined" like any other buffer overflow - if there is some
kind of checking mechanism possible and enabled, at compile time or
run-time, then that should trigger and tell you you've got a bug in your
code. If not - well, that's the way programming works. You are
responsible for writing correct code.

If you want the behaviour you describe here, then you might like to try:

union {
char x[9];
struct {
char padding[8];
int y;
};
} bar;

I can understand people wanting C to behave in a different way from the
way it is defined. I can understand people wanting to write code that
seems simple, clear and efficient, even though the C rules say it is
wrong. I can understand people wanting to continue using code
constructs that they know are wrong, because they used to get away with
it. I can understand people wanting some kind of limits to how bad
things can go for undefined behaviour (I think this comes from some
fundamental misunderstandings about how programming works, but I can
understand people wanting it).

But I really cannot get my head around the idea that someone would want
to be able to write code that is /clearly/ wrong, totally unnecessary,
and /clearly/ against the rules of the language, but somehow want the
compiler to give specific behaviour to that mistake.

It's like saying you want "1 / 0" to be defined as 6, because 6 is your favourite number.

From David Brown@21:1/5 to Michael S on Sun Sep 15 18:09:56 2024

On 15/09/2024 14:40, Michael S wrote:

On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:

Michael S <[email protected]> wrote:

On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

And how should that be defined?

bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));

That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'.

Padding is another thing that should be Implementation Defined.

It is.

I.e. compiler should provide complete documentation of its padding algorithms.

They do. Or, they should. Often they are lazy and say "defined by the platform ABI". Really, it is only the alignments that are needed.

C defines the minimum padding between members in a struct - you get the
padding needed to ensure that members are correctly aligned. I don't
think the C standards disallow additional padding, but it would be an extraordinarily strange implementation if there were anything more than
this minimum padding.

But I certainly wouldn't mind if the standards dictated this minimum
padding, and then there would be nothing left to the implementation
other than alignments.

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by another field of integer type with the same or narrower width then
there should be no padding in-between.

From David Brown@21:1/5 to John Dallman on Sun Sep 15 18:14:40 2024

On 15/09/2024 16:41, John Dallman wrote:

In article <[email protected]>, [email protected] (Michael S) wrote:

Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding
algorithms.

It is, and they do. I've used a lot of different compilers over the last
29 years, needing to know about padding for a DIY varargs, and I've never
had problems with finding out what the padding was.

It can usually be described quite briefly, by saying that all data types
are naturally aligned. The only variant of that I've encountered is on
32-bit x86 Linux and 32-bit POWER AIX where in both cases 8-byte doubles
were 4-byte aligned.

It is better to say types are naturally aligned up to a maximum
appropriate for the architecture (usually the width of general-purpose registers and/or pointers). Then there are far fewer exceptions.

(So on 8-bit devices you usually see single byte alignment even for
64-bit types.)

The C standard specifies that struct members shall be stored in memory in
the same order as they appear in the declaration. It does not specify
padding because the standard committee feel they need to allow C to work
on machines that are not byte-addressed or are otherwise weird.

It specifies that there can be padding between members, and members need
to be aligned, so it gives the minimum padding (though the alignment requirements are implementation-defined). But it gives no maximum
padding, AFAIK.

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed
by another field of integer type with the same or narrower width then
there should be no padding in-between.

That would be fine if you were willing to confine yourself to
byte-addressed machines.

There would not be padding between one integer type and another member
of the same or smaller integer type, unless you have a very odd
architecture or niche features (like, say, an int24_t with 1-byte
alignment followed by an int16_t with 2-byte alignment).

From Waldek Hebisch@21:1/5 to Michael S on Sun Sep 15 16:43:45 2024

Michael S <[email protected]> wrote:

On Sun, 15 Sep 2024 08:05:47 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by
implementation. And in practice it is. Just not in
theory.

Do you mean union rather than struct? And do you mean
bar.x[7] rather than bar.x[8]? Surely no one would expect
that storing into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think
should be defined by the C standard but is not? And the
same question for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior
would be bar.y==42 on LE machines and bar.y==42*2**24 on BE
machines. As it actually happens in reality with all
production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers
(TM) have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs.

Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.

struct {
char x[8];
int y;
} bar;

Assume

bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y

The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.

Yes, exactly.

So, volatile for all structs,

No.
Access to field of struct's should be ordered only relatively to
accesses to other fields *of the same instance* of the struct. And,
of course, usual 'as if' applies, so optimizing compiler can figure out
that bar.x[7] and bar.y do not overlap and thus generate code knowing
that write to one does not clobber the other.
That's pretty far from semantics of volatile.

plus prescribed behavior on array overruns.

Only within the bounds of the struct. bar.x[12] remains UB.

At the risk of repeating myself: This is an extremely bad idea.

I rest my case.

You seem to think that C should be as optimizable and as full of UBs as Fortran. Many compiler authors agree with you.
I have different idea. IMHO, your party exploits the letter of C
standard in violation to its spirit.

In my copy (a translation) of K&R there is a passage which
says that C tries to define useful things but, unlike PL/I, does
not define things merely to make them defined. The PL/I experience
was that many defined behaviours were bugs, but because of the language
definition the compiler silently accepted them and generated code.
The trouble was that the program was doing something different from what
the programmer intended. Anyway, the passage in K&R that I mention
and advice given in other places (like "implementation may do
different things, program should not depend on any particular
behaviour") for me mean that UB was part of the _original_ C spirit.
Later came folks from the "do not break my code" camp, and they do
have _some_ points. But they do not represent the point of view
of the creators of the language.

BTW: The wording in the Pascal standard is quite different; in particular
Pascal uses the terms "error" and "behaviour not defined by the standard".
But the spirit is the same as in C: break the rules and your program
may do whatever it wishes. The main difference is that C adopted the
"trust the programmer" philosophy and offers several unsafe
constructs not present in Pascal. And with the unsafe constructs
came the associated UB.

--
Waldek Hebisch

From Waldek Hebisch@21:1/5 to Michael S on Sun Sep 15 16:22:13 2024

Michael S <[email protected]> wrote:

On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:

Michael S <[email protected]> wrote:

On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

And how should that be defined?

bar.x[8] = 42 should be defined to be the same as
char tmp = 42;
memcpy(&bar.y, &tmp, sizeof(tmp));

That has two drawbacks: minor one that you need to know that
there are no padding between 'x' and 'y'.

Padding is another thing that should be Implementation Defined.
I.e. compiler should provide complete documentation of its padding algorithms.
In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by another field of integer type with the same or narrower width then
there should be no padding in-between.

Major drawback
is that it would forbid bounds checking for array accesses.
In code like above it is easy to spot out of bound access at
compile time. Even with variable index compiler knows size
of 'x' so can insert bounds checking code (and AFAIK if you
insist leading compilers will do this).

More generally, assuming cooperating compiler modern C has enough
features to eliminate out of bounds array indexing.

In general, only by means of fat pointers.
Fat pointers break existing ABIs.
Also if fat pointers is what I want then I already have them in few mainstream languages where they are integrated much better than they
will ever be in "checked C".

No. When array declaration (or allocation) is visible adding checks
is trivial, so the problem is passing size information to functions.
As long as arrays have fixed sizes one can declare size of function
argument using qualifier "static", like in

void foo(int a[static 20]);

For arrays of variable size there are variably modified types.
Standard botched this, essentially saying that size info in
the prototype should be ignored, but in non-conforming mode
compiler may require size info and check it for correctness.

The point is that a "checked program" can be compiled by a standard
C compiler. And as long as all accesses are in bounds, "checked
code" is ABI compatible with unchecked code. Of course,
if you take a random C program, then with probability close to 1
it will be rejected by a checking compiler. But if you pass
variable-sized arrays the called routine needs _some_ way to
find out how big the array is. And using vmt-s is a reasonable
way to pass size info. To make it more useful vmt-s should be
beefed up, in particular to cover pointers inside structures.
But even as it is now one can write useful checkable programs
in C.
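
A short sketch of both forms of size-carrying parameter. The function names are invented for illustration; a conforming compiler is not required to check either bound, so the sizes here serve as information that a checking compiler (or a warning pass) could use.

```c
#include <stddef.h>

/* Fixed size: "static 20" promises that a points to at least 20 elements. */
int sum_fixed(const int a[static 20])
{
    int total = 0;

    for (size_t i = 0; i < 20; i++)
        total += a[i];
    return total;
}

/* Variably modified parameter: the bound n is part of the declared type. */
int sum_variable(size_t n, const int a[n])
{
    int total = 0;

    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

Both functions receive an ordinary pointer at the ABI level, which is why code written this way stays link-compatible with unchecked callers.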

--
Waldek Hebisch

From Scott Lurndal@21:1/5 to Robert Finch on Sun Sep 15 17:07:58 2024

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case its for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;
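
An alternative that avoids the byte-order #if altogether is to read the register as a plain 64-bit value and use shifts and masks. This is only a sketch: extract_bits, gic_ecc_sbe and gic_ecc_dbe are hypothetical helper names, and the bit positions (sbe in bits 0..8, dbe in bits 32..40) are inferred from the reserved_9_31 and reserved_41_63 field names, assuming those count from the least significant bit.

```c
#include <stdint.h>

/* Extract width bits starting at bit lsb; width must be in the range 1..63. */
static inline uint64_t extract_bits(uint64_t reg, unsigned lsb, unsigned width)
{
    return (reg >> lsb) & ((UINT64_C(1) << width) - 1);
}

/* Single-bit-error field: assumed to occupy bits 0..8 of the register. */
unsigned gic_ecc_sbe(uint64_t statusr)
{
    return (unsigned)extract_bits(statusr, 0, 9);
}

/* Double-bit-error field: assumed to occupy bits 32..40 of the register. */
unsigned gic_ecc_dbe(uint64_t statusr)
{
    return (unsigned)extract_bits(statusr, 32, 9);
}
```

The shifts operate on the value, not on the memory layout, so the same source works regardless of how a compiler orders bit-fields within a storage unit.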

From Anton Ertl@21:1/5 to Michael S on Sun Sep 15 17:46:12 2024

Michael S <[email protected]> writes:

Padding is another thing that should be Implementation Defined.

It is. It's defined in the ABI, so when the compiler documents to
follow some ABI, you automatically get that ABI's structure layout.
And if a compiler does not follow an ABI, it is practically useless.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to Scott Lurndal on Sun Sep 15 17:21:42 2024

On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case its for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;

Which brings to mind a slight different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C?
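
Standard C leaves it implementation-defined whether a bit-field that does not fit in the current storage unit straddles into the next one, so a portable declaration of such a field is not available. What can be written portably is a pair of accessors. The sketch below assumes the 128-bit struct is held as two 64-bit words with w[0] carrying bits 0..63; struct wide128, get_field and set_field are invented names.

```c
#include <stdint.h>

#define FIELD_LSB   53u   /* first bit of the field within the 128 bits */
#define FIELD_WIDTH 17u   /* field covers bits 53..69 */
#define FIELD_MASK  ((UINT64_C(1) << FIELD_WIDTH) - 1)

/* Assumed representation: w[0] holds bits 0..63, w[1] holds bits 64..127. */
struct wide128 {
    uint64_t w[2];
};

/* Read the 17-bit field that straddles the two words. */
uint32_t get_field(const struct wide128 *r)
{
    uint64_t lo = r->w[0] >> FIELD_LSB;           /* bits 53..63 -> 0..10  */
    uint64_t hi = r->w[1] << (64u - FIELD_LSB);   /* bits 64..69 -> 11..16 */

    return (uint32_t)((lo | hi) & FIELD_MASK);
}

/* Write the field, leaving every other bit of both words unchanged. */
void set_field(struct wide128 *r, uint32_t value)
{
    uint64_t v = (uint64_t)value & FIELD_MASK;

    r->w[0] = (r->w[0] & ~(FIELD_MASK << FIELD_LSB)) | (v << FIELD_LSB);
    r->w[1] = (r->w[1] & ~(FIELD_MASK >> (64u - FIELD_LSB)))
            | (v >> (64u - FIELD_LSB));
}
```

An architecture with a cross-container extract instruction could in principle be targeted by a compiler recognising this shift-and-or pattern, but that is a property of the compiler, not something the C source can require.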

From David Brown@21:1/5 to Michael S on Sun Sep 15 20:13:44 2024

On 14/09/2024 23:19, Michael S wrote:

On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 04:12:21 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

struct {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

Do you mean union rather than struct? And do you mean bar.x[7]
rather than bar.x[8]? Surely no one would expect that storing
into bar.x[8] should be well-defined behavior.

If the code were this

union {
char x[8];
int y;
} bar;
bar.y = 0; bar.x[7] = 42;

and assuming sizeof(int) == 4, what is it that you think should
be defined by the C standard but is not? And the same question
for a struct if that is what you meant.

No, I mean struct and I mean 8.
And I mean that a typical implementation-defined behavior would
be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
As it actually happens in reality with all production compilers.

Ah, you want to re-introduce Fortran's storage association and
common blocks, but without the type safety. Good idea, that.
That created *really* interesting bugs, and Real Programmers (TM)
have to have something that pays their salaries, right?

SCNR

What I wrote is how all production C compilers work today. So it
will add no new bugs.

Maybe I should be a little bit more precise in why I think this
is an extremely bad idea.

struct {
char x[8];
int y;
} bar;

Assume

bar.y = 1234;
bar.x[i] = 42; // The compiler does not know i
// Do something with bar.y

The compiler should then treat the access to bar.x[i] as if bar.y
was clobbered by the assignment statement, and reload bar.y if
it was kept in a register? That is the semantics you propose.

Yes, exactly.

Contrary to your imagination - compilers have /never/ followed your
proposed semantics. The oldest gcc version I found on godbolt.org is
3.4.6 from 2006, and given:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.

Your proposed semantics are extremely unexpected for most C developers,
would involve pretty much a complete re-write of the C model if they
were to be applied consistently to other aspects of C, would have a
significant impact on code efficiency, and they are not something anyone
has used or relied on before.

So, either bar.y is treated as if it was volatile, or hard-to-detect
bugs would appear because, with optimization, the assignment would
sometimes change the value of bar.y and sometimes not.

No, semantics is that compiler has to reload bar.y if it keeps it in register. Optimizer that does anything else is buggy.

Well, buggy according to your hypothetical semantics. Not buggy
according to the way C has always worked, and the way C compilers
generate code.

From David Brown@21:1/5 to Robert Finch on Sun Sep 15 20:47:11 2024

On 15/09/2024 18:52, Robert Finch wrote:

On 2024-09-15 12:09 p.m., David Brown wrote:

On 15/09/2024 14:40, Michael S wrote:

On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
Waldek Hebisch <[email protected]> wrote:

Michael S <[email protected]> wrote:

On Thu, 12 Sep 2024 16:34:31 +0200
David Brown <[email protected]> wrote:

On 12/09/2024 13:29, Michael S wrote:

On Thu, 12 Sep 2024 03:12:11 -0700
Tim Rentsch <[email protected]> wrote:

BGB <[email protected]> writes:

I fully agree that C is not, and should not be seen as, a
"high-level assembly language". But it is a language that is very
useful to "hardware-type folks", and there are a few things that
could make it easier to write more portable code if they were
standardised. As it is, we just have to accept that some things
are not portable.

Why not?
I don't see practical need for all those UBs apart from buffer
overflow. More so, I don't see the need for UB in certain limited
classes of buffer overflows.

struct {
   char x[8];
   int y;
} bar;
bar.y = 0; bar.x[8] = 42;

IMHO, here behavior should be fully defined by implementation.
And in practice it is. Just not in theory.

And how should that be defined?

bar.x[8] = 42 should be defined to be the same as
   char tmp = 42;
   memcpy(&bar.y, &tmp, sizeof(tmp));

That has two drawbacks: a minor one is that you need to know that
there is no padding between 'x' and 'y'.

Padding is another thing that should be Implementation Defined.

It is.

I.e. compiler should provide complete documentation of its padding
algorithms.

They do. Or, they should. Often they are lazy and say "defined by
the platform ABI". Really, it is only the alignments that are needed.

C defines the minimum padding between members in a struct - you get
the padding needed to ensure that members are correctly aligned. I
don't think the C standards disallow additional padding, but it would
be an extraordinarily strange implementation if there were anything
more than this minimum padding.

But I certainly wouldn't mind if the standards dictated this minimum
padding, and then there would be nothing left to the implementation
other than alignments.

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

Generally, they are packed if you make the fields of the same type, but
if you change the type you get a new block that is aligned appropriately
for the type you gave. It is certainly the case that bit-field struct
layout is complicated, not well-specified in the C standards, and often
not as well documented as it could be by compilers.

When I use bit-field layouts and the layout matters (such as for an I/O
device, rather than just to collect lots of small bits of data in less
memory), I like to give any padding explicitly. And I put a
static_assert on the size of the struct, to be sure I haven't got it
wrong. Such code is, naturally, never intended to be very portable.
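
A minimal sketch of that habit, assuming a 32-bit unsigned int. The register, its field names and its widths are made up for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>

/* Hypothetical 32-bit device control register. Every bit is accounted
 * for, so the compiler has no room to insert padding of its own. */
struct ctrl_reg {
    unsigned int enable    : 1;
    unsigned int mode      : 3;
    unsigned int reserved0 : 4;   /* explicit padding up to bit 8 */
    unsigned int divider   : 8;
    unsigned int reserved1 : 16;  /* explicit padding up to bit 32 */
};

/* Fails the build if the layout does not come out as one 32-bit word. */
static_assert(sizeof(struct ctrl_reg) * CHAR_BIT == 32,
              "ctrl_reg must occupy exactly 32 bits");
```

The assertion only pins down the size. The order of the fields within the word is still chosen by the implementation, which is why such code is tied to one compiler and target.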

From David Brown@21:1/5 to All on Sun Sep 15 20:48:48 2024

On 15/09/2024 19:21, MitchAlsup1 wrote:

On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

    struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
        uint64_t reserved_41_63              : 23;
        uint64_t dbe                         : 9; /**< R/W1C/H - RAM ECC DBE detected. */
        uint64_t reserved_9_31               : 23;
        uint64_t sbe                         : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
        uint64_t sbe                         : 9;
        uint64_t reserved_9_31               : 23;
        uint64_t dbe                         : 9;
        uint64_t reserved_41_63              : 23;
#endif
    } s;

Which brings to mind a slightly different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.

You do so inconveniently, perhaps with access inline functions rather
than a bit-field struct.

Fortunately, not many hardware designers are that sadistic. (Or perhaps
they /are/ that sadistic, but lack the imagination for that particular
trick.)
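
As a rough sketch of what such accessor functions could look like for the 17-bit field at bits 53..69, with the 128-bit container modelled as two 64-bit words. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit container as two 64-bit words; w[0] holds bits 0..63. */
struct reg128 {
    uint64_t w[2];
};

enum { FIELD_LSB = 53, FIELD_WIDTH = 17 };
#define FIELD_MASK ((UINT64_C(1) << FIELD_WIDTH) - 1)

/* Read the 17-bit field occupying bits 53..69. */
static inline uint32_t field_get(const struct reg128 *r)
{
    uint64_t lo = r->w[0] >> FIELD_LSB;          /* bits 53..63 -> 0..10  */
    uint64_t hi = r->w[1] << (64 - FIELD_LSB);   /* bits 64..69 -> 11..16 */
    return (uint32_t)((lo | hi) & FIELD_MASK);
}

/* Write the 17-bit field, leaving all other bits untouched. */
static inline void field_set(struct reg128 *r, uint32_t value)
{
    uint64_t v = value & FIELD_MASK;
    r->w[0] = (r->w[0] & ~(FIELD_MASK << FIELD_LSB)) | (v << FIELD_LSB);
    r->w[1] = (r->w[1] & ~(FIELD_MASK >> (64 - FIELD_LSB)))
            | (v >> (64 - FIELD_LSB));
}
```

For a real memory-mapped register the words would also need volatile access and a defined order for the two halves, which this sketch leaves out.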

From David Brown@21:1/5 to Thomas Koenig on Sun Sep 15 21:03:00 2024

On 13/09/2024 17:55, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

What about VLAs?

I don't know if MSVC has VLAs - it's not a tool I ever use, so I don't
have the details in my head.

But perhaps VLAs don't count as "commonly used parts of C99". I have
only occasionally had use for real VLAs in my own programming (more
often I have local arrays whose size is a const known at compile time,
but not syntactically a constant expression - then you have something
that is technically a VLA but which the compiler can handle just like a
normal fixed size array). A lot of people seem to get in a fluster when
you talk about VLAs, and think their inclusion in the C standards was
inspired by the demons trying to escape people's noses.
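
A small sketch of the "technically a VLA" case, assuming a compiler that supports VLAs (they are optional from C11 on, and the thread notes MSVC lacks them). The function name is made up.

```c
#include <assert.h>

/* Sums the squares 0*0 .. 7*7 using a local scratch array. */
static int sum_of_squares(void)
{
    const int count = 8;   /* a const object, not an integer constant expression in C */
    int squares[count];    /* so this is formally a VLA and cannot take an initializer */
    int total = 0;

    for (int i = 0; i < count; i++)
        squares[i] = i * i;
    for (int i = 0; i < count; i++)
        total += squares[i];
    return total;
}
```

Since the value of count is visible at compile time, an optimising compiler is free to treat the array exactly like int squares[8].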

There are a few more obscure parts of C99 that are often poorly
implemented, such as some of the floating point details, and many
embedded compilers omit much of the wide character stuff.

I suppose you could argue that my claim is tautological - parts of C99
that are not implemented in the mainstream C compilers will of course
not be commonly used!

There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.

It is almost impossible to gather statistics on compiler use,
especially with free compilers, but what about MSVC and icc?

MSVC is rarely used for C - it is primarily a C++ tool. Traditionally,
you have had closer to modern C support using MSVC in C++ mode than in C
mode.

As for icc, I don't think it is nearly as popular as it used to be, but
I have no statistics to back that up. However, I believe it has kept up
with the standards (as well as compatibility with many of gcc and
clang's extensions). I don't know about C23 support.

From David Brown@21:1/5 to BGB on Sun Sep 15 21:09:47 2024

On 14/09/2024 04:39, BGB wrote:

On 9/13/2024 10:55 AM, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

What about VLAs?

IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.

Thanks - you know it far better than I do.

There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.

It is almost impossible to gather statistics on compiler use,
especially with free compilers, but what about MSVC and icc?

From what I gather:
GCC and Clang are popular for most mainline targets;
    GCC is the dominant C compiler on Linux.

It is also far and away the dominant compiler for embedded systems -
both embedded Linux and small embedded systems.

MSVC is popular on Windows
    Has been essentially freeware/freemium for over a decade;
    Visual Studio has a fairly good debugger;
    Targets limited to things you can run Windows on (x86, X64, ARM)

MSVC is mainly used for C++ - or for a C-like subset of C++.

.

TinyCC, popular for niche use, but limited range of targets;
    x86, ARM, experimental RISC-V.
SDCC, popular for 8/16 bit targets;

SDCC has never been very popular. For the targets SDCC supports, Keil
(8051) and IAR (many small CISC targets) are far more common. But for
these kinds of devices, you are never working in anything close to
standard C anyway.

CC65, popular for 6502 and 65C816;

That's getting /really/ obscure now. There are thousands of C compilers
that are used, or have been used, for various microcontrollers. But if
you sum all their uses over the last decade, it will not be close to 1%
of the total use of C compilers.

From Tim Rentsch@21:1/5 to Scott Lurndal on Sun Sep 15 12:37:28 2024

[email protected] (Scott Lurndal) writes:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;

Probably many people know that this code depends on an
implementation-defined extension (allowing uint64_t as
the type of a bitfield) and is not guaranteed to be
portable. Using 'unsigned' instead would be portable
(assuming typical 32-bit ints, etc).
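
For illustration, the little-endian branch of the struct rewritten with plain unsigned bit-fields might look like this. It is a sketch assuming unsigned int is at least 32 bits; the struct name is hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>

static_assert(sizeof(unsigned) * CHAR_BIT >= 32,
              "this layout assumes at least 32-bit unsigned int");

/* Same four fields, but each pair now fits one 32-bit unit, so no
 * field needs a bit-field type wider than unsigned int. */
struct ecc_status {
    unsigned sbe            : 9;   /* single-bit error flags */
    unsigned reserved_9_31  : 23;
    unsigned dbe            : 9;   /* double-bit error flags */
    unsigned reserved_41_63 : 23;
};
```

Only the bit-field type becomes portable this way. The allocation order of the fields within each unit remains implementation-defined.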

From MitchAlsup1@21:1/5 to David Brown on Sun Sep 15 19:13:31 2024

On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:

On 15/09/2024 19:21, MitchAlsup1 wrote:

On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

    struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
        uint64_t reserved_41_63              : 23;
        uint64_t dbe                         : 9; /**< R/W1C/H - RAM ECC DBE detected. */
        uint64_t reserved_9_31               : 23;
        uint64_t sbe                         : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
        uint64_t sbe                         : 9;
        uint64_t reserved_9_31               : 23;
        uint64_t dbe                         : 9;
        uint64_t reserved_41_63              : 23;
#endif
    } s;

Which brings to mind a slightly different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.

You do so inconveniently, perhaps with access inline functions rather
than a bit-field struct.

Fortunately, not many hardware designers are that sadistic. (Or perhaps
they /are/ that sadistic, but lack the imagination for that particular trick.)

In My 66000 ISA it is both efficient and straightforward::

i = struct.field;
..
struct.field = j;

CARRY Rsf1,{I}
SRA Ri,Rsf0,<17,53>
and
CARRY Rsf1,{O}
INS Rsf0,Rj,<52,17>

Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
need for these registers to be sequential.

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

If the ISA has any realistically efficient grasp on multi-precision
integer operations, these fall out almost for free.

From David Brown@21:1/5 to BGB on Sun Sep 15 21:40:59 2024

On 14/09/2024 08:34, BGB wrote:

On 9/13/2024 10:30 AM, David Brown wrote:

On 12/09/2024 23:14, BGB wrote:

On 9/12/2024 9:18 AM, David Brown wrote:

On 11/09/2024 20:51, BGB wrote:

On 9/11/2024 5:38 AM, Anton Ertl wrote:

Josh Vanderhoof <[email protected]> writes:

[email protected] (Anton Ertl) writes:

<snip lots>

Though, generally takes a few years before new features become usable.
Like, it is only in recent years that it has become "safe" to use
most parts of C99.

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

Until VS2013, the most one could really use was:
// comments
long long
Otherwise, it was basically C90.
'stdint.h'? Nope.
Ability to declare variables wherever? Nope.
...

Nonsense.

MS basically gave up on C and concentrated on C++ (then later C# and
other languages). Their C compiler gained the parts of C99 that were in
common with C++ - and anyway, most people (that I have heard of) using
MSVC for C programming actually use the C++ compiler but stick
approximately to a C subset. And this has been the case for a /long/
time - long before 2013.

After this, it was piecewise.
Though, IIRC, still no VLAs or similar.

That I believe.

There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.

<stdbit.h> is, however, in the standard library rather than the
compiler, and they can be a bit slow to catch up.

FWIW:
I had been adding parts of newer standards in my case, but it is more hit/miss (more adding parts as they seem relevant).

Clearly your own compiler will only support the bits of C that you
implement. But I am not sure that it counts as a "serious, general
purpose C compiler in mainstream use" - no offence implied!

   Whether or not the target/compiler allows misaligned memory access;
     If set, one may use misaligned access.

Why would you need that? Any decent compiler will know what is
allowed for the target (perhaps partly on the basis of compiler
flags), and will generate the best allowed code for accesses like
foo3() above.

Imagine you have compilers that are smart enough to turn "memcpy()"
into a load and store, but not smart enough to optimize away the
memory accesses, or fully optimize away the wrapper functions...

Why would I do that? If I want to have efficient object code, I use a
good compiler. Under what realistic circumstances would you need to
have highly efficient results but be unable to use a good optimising
compiler? Compilers have been inlining code for 30 years at least
(that's when I first saw it) - this is not something new and rare.

Say, you are using a target where you can't use GCC or similar.

Which target would that be? Excluding personal projects, some very
niche devices, and long-outdated small CISC chips, there really aren't
many devices that don't have a GCC and clang port. Of course there
/are/ processors that gcc does not support, but almost nobody writes
code that has to be portable to such devices.

And as for optimising compilers, I used at least two different
optimising compilers in the mid nineties that inlined code
automatically, before using gcc. (I can't remember if they inlined
memcpy - it was a long time ago!). Optimising compilers are not a new
concept, and are not limited to gcc and clang.

Say:
BJX2, haven't ported GCC as it looks like a pain;
Also GCC is big and slow to recompile.

6502 and 65C816, because these are old and probably not worth the effort
from GCC's POV.

Various other obscure/niche targets.

Say, SH-5, which never saw a production run (it was a 64-bit successor
to SH-4), but seemingly around the time Hitachi spun-out Renesas, the
SH-5 essentially got canned. And, it apparently wasn't worth it for GCC
to maintain a target for which there were no actual chips (comparably
the SH-2 and SH-4 lived on a lot longer due to having niche uses).

It would be quite ridiculous to limit the way you write code because of possible limitations for non-existent compilers for target devices that
have never been made.

So, for best results, the best case option is to use a pointer cast
and dereference.

For some cases, one may also need to know whether or not they can
access the pointers in a misaligned way (and whether doing so would
be better or worse than something like "memcpy()").

Again, I cannot see a /real/ situation where that would be relevant.

I can think of a few.

Most often though it is in things like data compression/decompression
code, where there is often a lot of priority on "gotta go fast".

I still cannot see any situation where it would be relevant. If I need
to read 4 bytes of memory from an address, and don't know if the address
is uint32_t aligned or not, I would use memcpy(). The compiler would
know if unaligned 32-bit reads are supported or not for the target, or
if it is faster to use them or use byte reads. That's the compiler's
job - I'm the programmer, not the micro-manager.
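
The memcpy() approach described above can be sketched as a pair of small helpers. The names load_u32 and store_u32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value in native byte order from any address,
 * aligned or not. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Store a 32-bit value in native byte order to any address. */
static inline void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

The source code is the same for every target; whether the call becomes a single load or a series of byte accesses is the compiler's decision.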

And if I know that for a particular target there are particular
instructions that could be more efficient but are unknown to the
compiler (perhaps there are odd SIMD instructions), and it is worth the
effort to use them, then I would be writing that code for the specific
target. That's target-specific conditional compilation, and I still
have no need to know if the target can access misaligned data.

There is a difference here between "_memlzcpy()" and "_memlzcpyf()"
in that:
   the former will always copy an exact number of bytes;
   the latter may write 16-32 bytes over the limit.

It may do /what/ ? That is a scary function!

This is why the latter have an 'f' extension (for "fast").

I can accept that there are cases (such as you describe below) where
this might be useful, but I would not be identifying it just with an "f".

There are cases where it may be desirable to have the function write
past the end in the name of speed, and others where this would not be acceptable.

Hence why there are 2 functions.

The main intended use-case for _memlzcpyf() being use for match-copying
in something like my LZ4 decoder, where one may pad the decode buffer by
an extra 32 bytes.

Also my RP2 decoder works in a similar way.
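
To make the trade-off concrete, here is a hedged sketch of an exact copy next to a chunked one that may overrun. It is not BGB's _memlzcpyf(): the names, the 16-byte chunk size and the no-overlap requirement are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { COPY_CHUNK = 16 };

/* Hypothetical "fast" copy: always moves whole 16-byte chunks, so it may
 * read and write up to COPY_CHUNK - 1 bytes beyond n. The caller must
 * provide that much slack after both buffers, and they must not overlap. */
static void copy_fast_overrun(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    for (size_t done = 0; done < n; done += COPY_CHUNK)
        memcpy(dst + done, src + done, COPY_CHUNK);
}

/* Exact variant: never touches bytes beyond n. */
static void copy_exact(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}
```

A real LZ match copy also has to cope with source and destination overlapping at short distances, which this sketch deliberately excludes.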

Possible:
   __MINALIGN_type__ //minimum allowed alignment for type

_Alignof(type) has been around since C11.

_Alignof tells the native alignment, not the minimum.

It is the same thing.

Not necessarily, it wouldn't make sense for _Alignof to return 1 for all
the basic integer types.

Of course it makes sense to do that, on targets where an alignment of 1
is safe and efficient.

But, for "minimum alignment" it may make sense
to return 1 for anything that can be accessed unaligned.

Again, I see no use for this.

Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would
give 1 if the target supports misaligned pointers.

The alignment of types in C is given by _Alignof. Hardware may
support unaligned accesses - C does not. (By that, I mean that
unaligned accesses are UB.)
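
A short sketch of querying the alignment the implementation actually requires, using the C11 alignof macro from <stdalign.h>. The helper names are hypothetical, and the pointer check relies on the usual (implementation-defined) conversion of pointers to uintptr_t.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Alignment requirement of int32_t on this implementation. */
static size_t int32_alignment(void)
{
    return alignof(int32_t);
}

/* Nonzero if p is suitably aligned for an int32_t access. */
static int is_aligned_for_int32(const void *p)
{
    return ((uintptr_t)p % alignof(int32_t)) == 0;
}
```

On a target where the implementation decides that 1 is a safe and efficient alignment for int32_t, int32_alignment() would simply return 1.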

The point of __MINALIGN_type__ would be:
If the compiler defines it, and it is defined as 1, then this allows the compiler to be able to tell the program that it is safe to use this type
in an unaligned way.

For what purpose?

This also applies to targets where some types are unaligned but others
are not:
Say, if all integer types 64 bits or less are unaligned, but 128-bit
types are not.

For what purpose? And why do you want to worry about totally
hypothetical systems?

Most of this is being compiled by BGBCC for a 50 MHz CPU.

So, the CPU is slow and the compiler doesn't generate particularly
efficient code unless one writes it in a way it can use effectively.

Which often means trying to write C like it was assembler and manually organizing statements to try to minimize value dependencies (often
caching any values in variables, and using lots of variables).

In this case, the equivalent of "-fwrapv -fno-strict-aliasing" is the
default semantics.

Generally, MSVC also responds well to a similar coding style as used for BGBCC (or, as it more happened, the coding styles that gave good results
in MSVC also tended to work well in BGBCC).

Note that MSVC most certainly does /not/ work like "gcc -fwrapv" -
signed integer overflow is UB in MSVC, and it generates code that
assumes it never happens. There is an obscure officially undocumented
(or documented unofficially, if you prefer) flag to turn off such optimisations.

Last I read about it, they had no plans to do any type-based alias
analysis, but nor did they rule out the possibility in the future.
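
Code that must behave the same with or without wrapping semantics can avoid executing the overflow at all. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Adds a and b. Returns false and leaves *out untouched if the
 * mathematical result does not fit in int; the overflowing addition
 * itself is never evaluated. */
static bool add_checked(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}
```

Because the range test happens before the addition, the result does not depend on how a given compiler treats signed overflow.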

From Michael S@21:1/5 to David Brown on Sun Sep 15 22:42:35 2024

On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:

On 14/09/2024 23:19, Michael S wrote:

Yes, exactly.

Contrary to your imagination - compilers have /never/ followed your
proposed semantics. The oldest gcc version I found on godbolt.org is
3.4.6 from 2006, and given:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.

No other compiler on godbolt is doing it, except possibly gcc clones.
Not even clang, whose former leader wrote "Nasal Manifest".

From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 15 12:51:04 2024

[email protected] (MitchAlsup1) writes:

On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;

Which brings to mind a slightly different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.

The 17-bit bit-field can be specified in the usual way. Example:

struct bitfield_example {
unsigned one : 32;
unsigned two : 20;
unsigned hmm : 17;
};

An implementation is allowed to use up the last 12 bits of the
first 64-bit unit and the first 5 bits of the next 64-bit unit.
But, whether that happens or not is up to the implementation.
The bitfield for member 'hmm' could instead be put entirely in
the second 64-bit unit, with the last 12 bits of the first 64-bit
unit simply left as padding. There is no standard way to force
it.
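
To see which choice a given implementation makes, one can look at the size of the struct. A sketch, assuming unsigned int is at least 32 bits; the helper name is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct bitfield_example {
    unsigned one : 32;
    unsigned two : 20;
    unsigned hmm : 17;
};

/* Size of the struct in bits. The three fields need 69 bits; anything
 * beyond that is padding chosen by the implementation, for example
 * because 'hmm' was not allowed to straddle a storage unit. */
static size_t bitfield_example_bits(void)
{
    return sizeof(struct bitfield_example) * CHAR_BIT;
}
```

The fields hold their full range of values either way. Only the layout, and therefore the size, differs between implementations.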

From Tim Rentsch@21:1/5 to Michael S on Sun Sep 15 12:54:04 2024

Michael S <[email protected]> writes:

My motivation is eliminating as many UBs as is practically
possible.

I think I understand what it is you want. What sort of case can
you make that other people should want it, or that I should want
it? So far I'm a very long way from being convinced.

From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Sep 15 21:05:05 2024

On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;

Which brings to mind a slightly different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.

The 17-bit bit-field can be specified in the usual way. Example:

struct bitfield_example {
unsigned one : 32;
unsigned two : 20;
unsigned hmm : 17;
};

An implementation is allowed to use up the last 12 bits of the
first 64-bit unit and the first 5 bits of the next 64-bit unit.
But, whether that happens or not is up to the implementation.
The bitfield for member 'hmm' could instead be put entirely in
the second 64-bit unit, with the last 12 bits of the first 64-bit
unit simply left as padding. There is no standard way to force
it.

From Scott Lurndal@21:1/5 to Tim Rentsch on Sun Sep 15 23:43:08 2024

Tim Rentsch <[email protected]> writes:

[email protected] (Scott Lurndal) writes:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;

Probably many people know that this code depends on an
implementation-defined extension (allowing uint64_t as
the type of a bitfield) and is not guaranteed to be
portable. Using 'unsigned' instead would be portable
(assuming typical 32-bit ints, etc).

Portability in this case was not necessary. In any case,
it's portable to clang and gcc, which is good enough.

From Tim Rentsch@21:1/5 to Kent Dickey on Sun Sep 15 18:32:51 2024

[email protected] (Kent Dickey) writes:

[examples of descending loops with unsigned loop variables]

This discussion wandered into many subthreads, but I only want to make
one post and chose here.

When you write code working on signed numbers and do something like:

(a < 0) || (a >= max)

Then the compiler realizes if you treat 'a' as unsigned, this is just:

(unsigned)a >= max

since any negative number, treated as unsigned, will be larger than the largest positive signed number. So, to do loops which count down and
have any stride using an unsigned loop count:

for(u = start; u <= start; u -= step)

With the usual caveats (start must be a valid signed number, and step
cannot be so large that start + step crosses the signed boundary).

Clever, although maybe too tricky. Better if start and step are
also unsigned, in which case a safe test is easily seen to be
start + step > start.
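
Both tricks from the quoted post can be sketched in a few lines. The function names are hypothetical, and the preconditions in the comments are assumptions the caller has to meet.

```c
#include <assert.h>
#include <limits.h>

/* Two-comparison form of the bounds check. */
static int out_of_range_signed(int a, int max)
{
    return (a < 0) || (a >= max);
}

/* Single-comparison form; equivalent when max is non-negative, because
 * a negative a converts to a value above INT_MAX. */
static int out_of_range_unsigned(int a, int max)
{
    return (unsigned)a >= (unsigned)max;
}

/* Counts the iterations of a descending loop with an unsigned counter.
 * The loop ends when u wraps below zero and so becomes larger than
 * start. Requires step != 0 and step <= UINT_MAX - start. */
static unsigned count_down_iterations(unsigned start, unsigned step)
{
    unsigned iterations = 0;
    for (unsigned u = start; u <= start; u -= step)
        iterations++;
    return iterations;
}
```

With start = 10 and step = 3 the loop visits 10, 7, 4 and 1, then wraps and stops.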

But: unsigned numbers in C have some dangers, which no one here has mentioned. Some code presented comes CLOSE to being wrong, but gets
lucky. With "int" being 32-bits, C promotion rules around unsigned
ints, signed ints, and unsigned 64-bit can create trouble.

uint64_t dval; uint32_t val32; int a;

val32 = 1; dval = 1; a = 1;
dval = val32 - 2 + dval;

C will do (val32 - 2) first, which is (1U - 2) which is 0xffff_ffff, and
then add dval, and the result is 0x1_0000_0000.

Not really interesting. It's usually a mistake to mix different
types, whether or not the types have different signedness. Arithmetic
is one problem but assignment is another. Using the same type
throughout avoids surprises like this one.
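
The surprise is easy to reproduce. This sketch assumes a 32-bit int, as in the quoted post; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mixed-width expression from the post: with 32-bit int the
 * subtraction is done in 32-bit unsigned arithmetic and wraps to
 * 0xFFFFFFFF before the value is widened to 64 bits. */
static uint64_t mixed_width(uint32_t val32, uint64_t dval)
{
    return val32 - 2 + dval;
}

/* Same computation with the first operand widened up front, so the
 * whole expression is evaluated in 64-bit unsigned arithmetic. */
static uint64_t same_width(uint32_t val32, uint64_t dval)
{
    return (uint64_t)val32 - 2 + dval;
}
```

With both inputs equal to 1, the first form yields 0x100000000 and the second yields 0.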

Signed numbers don't have this risk, so if you're doing known small loops, you can just use ints. If you're doing possibly large loops, just use int64_t.

I consider this bad advice. Loops are doing something with the loop
variable, and its type should be chosen according to how it is used.
If the loop variable represents an index, or a length, or count, it
should be unsigned (or unsigned long, etc). If the loop variable
represents degrees C or F, or some other naturally signed measure it
should be signed (or maybe floating point). What kind of loop it
is, whether ascending or descending, or what the increment is, etc,
is secondary; a more important factor is what sort of value is
being represented, and in almost all cases that is what should
determine the type used.

Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense would go away. Any new architecture should make C ILP64 (looking at you RISC-V, missing yet another opportunity to not make the same mistakes as everyone else).

I believe this view is shortsighted. The big mistake is developers
hardcoding types everywhere - especially int, but also long, and
their unsigned variants. It's almost never a good idea to hardcode
a specific width (eg, uint32_t) in a type name used for parameters
or local variables, but that is by far a very common practice.
Names of types should reflect how the variable is meant to be used,
not the specifics of what sort of register it goes into. The more
firmly we cement our programs to specific hardware choices, the
greater the pain when those choices need to change, either due to
time or moving to a different platform. The key is to keep things
light and flexible, not encrusted onto fixed hardware choices like
barnacles on the hull of a ship.


From Tim Rentsch@21:1/5 to Michael S on Sun Sep 15 18:47:06 2024

Michael S <[email protected]> writes:

On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.

No other compiler on godbolt is doing it, except possibly gcc clones.
Not even clang, whose former leader wrote "Nasal Manifest".

Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.


From Tim Rentsch@21:1/5 to Scott Lurndal on Sun Sep 15 18:51:28 2024

[email protected] (Scott Lurndal) writes:

Tim Rentsch <[email protected]> writes:

[email protected] (Scott Lurndal) writes:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by
Standard itself. Not in this particular case, but, for
example, it could be defined that when field of one integer
type is immediately followed by another field of integer type
with the same or narrower width then there should be no padding
in-between.

What about bit-fields in a struct? I believe they are usually
packed. In case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;

Probably many people know that this code depends on an
implementation-defined extension (allowing uint64_t as
the type of a bitfield) and is not guaranteed to be
portable. Using 'unsigned' instead would be portable
(assuming typical 32-bit ints, etc).

Portability in this case was not necessary. In any case,
it's portable to clang and gcc, which is good enough.

I'm not criticizing the code; just pointing out an aspect
in case some people weren't aware of it.
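
A minimal sketch of one way to avoid both the uint64_t bit-field extension and the byte-order #if: read the register as a plain uint64_t and use shifts and masks. The bit positions are inferred from the field names in the quoted struct (sbe in bits 0..8, dbe in bits 32..40); the macro and function names are hypothetical:

```c
#include <stdint.h>

/* Both fields are 9 bits wide; positions inferred from the field names. */
#define ECC_FIELD_MASK UINT64_C(0x1ff)
#define ECC_SBE_SHIFT  0
#define ECC_DBE_SHIFT  32

/* Extract the single-bit-error field from a raw register value. */
static inline unsigned ecc_sbe(uint64_t reg)
{
    return (unsigned)((reg >> ECC_SBE_SHIFT) & ECC_FIELD_MASK);
}

/* Extract the double-bit-error field from a raw register value. */
static inline unsigned ecc_dbe(uint64_t reg)
{
    return (unsigned)((reg >> ECC_DBE_SHIFT) & ECC_FIELD_MASK);
}

/* Build a register value from the two fields; reserved bits stay zero. */
static inline uint64_t ecc_make(unsigned sbe, unsigned dbe)
{
    return ((uint64_t)(dbe & ECC_FIELD_MASK) << ECC_DBE_SHIFT) |
           ((uint64_t)(sbe & ECC_FIELD_MASK) << ECC_SBE_SHIFT);
}
```

Shifts operate on values rather than on storage layout, so the same source works regardless of the target's byte order.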


From David Brown@21:1/5 to BGB on Mon Sep 16 09:01:08 2024

On 15/09/2024 06:42, BGB wrote:

On 9/14/2024 8:26 AM, Anton Ertl wrote:

[email protected] (Kent Dickey) writes:

Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).

We now have had more than 30 years of catering for this mistake by
everyone involved. Given their goals, I think that RISC-V made the
right choice for int in their ABI, even if it was the original choice
by the MIPS and Alpha people that they follow, like everyone else, was
wrong.

That being said, one option would be to introduce another ABI and API
with 64-bit int (and maybe 32-bit long short int), and programmers
could choose whether to program for the ILP API, or the int=int32_t
API. Would the ILP API/ABI fare better than x32? I doubt it, even
though I would support it. This ship probably has sailed.

Changing the size of 'int' would likely be a massive pain from a
software compatibility POV (possibly affecting things much more than
changing the size of pointers, or the size of 'long'; which was a major source of pain during the 32 to 64 bit migration).

When my project got started, I was originally going with 32-bit 'long',
like MSVC, but then switched over to keeping 'long' matched with the
pointer size, as code that assumed sizeof(long)==sizeof(void *) was more common than code that assumed sizeof(long)==4 (it was more common for
code to use 'int' as the de-facto 32-bit type), as well as this being a
more useful assumption (though this assumption breaks with 128 bit
pointers).

Changing sizeof(int) to be anything other than 4 is likely to break significant amounts of code, and pretty much anything that reads/writes structs to files or similar for data storage.
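
A minimal sketch of how file I/O can be made independent of sizeof(int) and of struct padding: encode each field explicitly at a fixed width and byte order instead of writing the struct's memory image. The record layout and all names here are hypothetical:

```c
#include <stdint.h>

/* Hypothetical record: two 32-bit fields, stored little-endian on disk. */
struct record {
    uint32_t id;
    int32_t  value;
};
enum { RECORD_BYTES = 8 };

/* Store a 32-bit value as four little-endian bytes. */
static void put_le32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v & 0xff);
    out[1] = (unsigned char)((v >> 8) & 0xff);
    out[2] = (unsigned char)((v >> 16) & 0xff);
    out[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Load a 32-bit value from four little-endian bytes. */
static uint32_t get_le32(const unsigned char *in)
{
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}

void record_encode(unsigned char out[RECORD_BYTES], const struct record *r)
{
    put_le32(out, r->id);
    put_le32(out + 4, (uint32_t)r->value);
}

void record_decode(struct record *r, const unsigned char in[RECORD_BYTES])
{
    r->id = get_le32(in);
    /* Converting back to int32_t assumes the usual two's complement
       behaviour for out-of-range values (implementation-defined in C). */
    r->value = (int32_t)get_le32(in + 4);
}
```

The byte buffer, not the struct, is what gets written to the file, so the in-memory layout is free to change.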

But, yes, this is even with the whole thing that on a 64-bit machine,
32-bit integers are typically handled in a way where they are sign or
zero extended to 64 bits.

Granted, a better alternative might be to rework code to generally use
the "stdint.h" types, and to use "intptr_t" for integer types matched to
the size of a pointer, ...

uintptr_t is usually a more natural choice - on almost all systems, it
is representing an address, and those are unsigned.
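
A minimal sketch of uintptr_t used for address arithmetic, here an alignment check. The function name is hypothetical, and the code assumes the near-universal (but implementation-defined) mapping of pointers to flat integer addresses:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if p is aligned to `align`, which must be a power of two.
   The pointer is converted to an unsigned integer so the low bits can be
   tested without any signedness surprises. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (uintptr_t)(align - 1)) == 0;
}
```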

The other biggest hindrance (apart from breaking unwarranted assumptions
about sizes in existing code) to 64-bit int is the number of fundamental integer types in C. You have char, short, int, long and long long. So
if int is 64-bit, there are not sufficient standard types to have 8-bit,
16-bit and 32-bit types as well. But at the other end you have int,
long and long long that are all 64-bit (perhaps one of them might be
128-bit). The integer type system in C was made at a time when 16-bit
systems were common and 32-bit would be more than enough for anyone, and
before the world settled on 8-bit bytes and powers of two for integer sizes.

I think using the <stdint.h> types for anything size-specific makes a
lot of sense. For a lot of things, exact sizes don't matter, 32-bit int
(but often not unsigned int) is as efficient as anything else, and
assuming at least 32 bits is not a hindrance to portability. But I would
be reluctant to use "short", "long" or "long long" in any code -
<stdint.h> types do a much better job.


From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Sep 16 06:50:45 2024

Tim Rentsch <[email protected]> schrieb:

Michael S <[email protected]> writes:

On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.

No other compiler on godbolt is doing it, except possibly gcc clones.
Not even clang, whose former leader wrote "Nasal Manifest".

Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.

Same for current gcc trunk (bleeding edge development version).


From Thomas Koenig@21:1/5 to David Brown on Mon Sep 16 07:17:44 2024

David Brown <[email protected]> schrieb:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).

It has happened, see the illegal (but sometimes useful)
6502 instructions, or the recent RISC-V implementation snafu
(GhostWrite).

A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have a
load with post-increment instruction in the loop. You can also do that
with x as a 32-bit int, unless you are of the opinion that enough apples added to a pile should give a negative number of apples.

But of course it should!

But wait, no, the number of apples should become zero if you add
enough of them.

But wait... maybe if the pile becomes too large, then the apples
will no longer be individual apples, but will be crushed under
their weight, a bit like https://what-if.xkcd.com/4/ .

But with a
wrapping type for x - such as unsigned int in C or modulo types in Ada,
you have little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo operation. I
really can't see this all being handled by a single instruction.

One reason not to use such a wrapping type.

Although, if you have (R1+R2) addressing and a 32-bit addition, this
could actually work, but not with a post-increment instruction.
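
A minimal sketch of the two loop shapes under discussion, for comparing generated code in a compiler explorer. The function names are hypothetical, and how a given compiler and target actually treat each version is not claimed here:

```c
#include <stddef.h>

/* Pointer-sized index: p + x can be kept in one register and bumped
   each iteration, which suits a load with post-increment. */
long sum_size_t_index(const int *p, size_t x, size_t count)
{
    long sum = 0;
    while (count-- > 0)
        sum += p[x++];
    return sum;
}

/* Wrapping 32-bit index: x++ is defined to wrap from UINT_MAX to 0,
   and the compiler has to preserve that behaviour, which can force it
   to keep p and x separate and add them on every load. */
long sum_unsigned_index(const int *p, unsigned x, size_t count)
{
    long sum = 0;
    while (count-- > 0)
        sum += p[x++];
    return sum;
}
```

For indices that stay in range the two functions return the same result; they differ only in what the language requires at the wrap-around point.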


From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Sep 16 07:25:38 2024

Tim Rentsch <[email protected]> schrieb:

If the loop variable
represents degrees C or F, or some other naturally signed measure it
should be signed (or maybe floating point).

The first one is a bad idea because temperature is a continuous
physical quantity.

The second has bad implications for constructs like

DO R = 0.0, 1.0, 0.1

where it will depend on details of floating point arithmetic whether the
number of loop trips is 10 or 11.

You can argue that people can write

DO R=0.0, 1.05, 0.1

but this construct was error-prone enough that it was deleted
from the Fortran standards.

What kind of loop it
is, whether ascending or descending, or what the increment is, etc,
is secondary; a more important factor is what sort of value is
being represented, and in almost all cases that is what should
determine the type used.

Not for floating point numbers. For that, you should simply do

DO I=0,10
R = I * 0.1

or

R = 0.0
DO I=0,10
...
R = R + 0.1
END DO

whichever rounding error you prefer.
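
The same contrast can be sketched in C, the language of most other examples in this thread. The function names are hypothetical. The accumulating loop's trip count depends on how 0.1 rounds in the floating point format in use, so no particular count is claimed for it; the integer-driven loop always runs 11 times:

```c
/* Loop variable is a double that accumulates 0.1: whether the last
   trip (r close to 1.0) runs depends on accumulated rounding error. */
int trips_accumulating(void)
{
    int trips = 0;
    for (double r = 0.0; r <= 1.0; r += 0.1)
        trips++;
    return trips;
}

/* Loop driven by an integer; r is derived from it on each pass, so the
   trip count is exact and each r carries only one rounding error. */
int trips_integer_driven(double *last_r)
{
    int trips = 0;
    for (int i = 0; i <= 10; i++) {
        *last_r = i * 0.1;
        trips++;
    }
    return trips;
}
```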

Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).

I believe this view is shortsighted. The big mistake is developers hardcoding types everywhere - especially int, but also long, and
their unsigned variants. It's almost never a good idea to hardcode
a specific width (eg, uint32_t) in a type name used for parameters
or local variables, but that is by far a very common practice.

Hence Fortran's SELECTED_REAL_KIND and SELECTED_INT_KIND...


From Michael S@21:1/5 to Tim Rentsch on Mon Sep 16 11:34:56 2024

On Sun, 15 Sep 2024 18:47:06 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.

No other compiler on godbolt is doing it, except possibly gcc
clones. Not even clang, whose former leader wrote "Nasal Manifest".

Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.

I didn't mean to say that gcc3 is the only gcc version that returns non-overwritten value.
I meant to say that all gcc versions are in one camp and the rest of the
compilers represented on Godbolt are in the other camp.


From Terje Mathisen@21:1/5 to David Brown on Mon Sep 16 10:37:47 2024

David Brown wrote:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having assembly instructions with undefined behaviour (imagine the terror that would bring to some people!).

A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have a load with post-increment instruction in the loop. You can also do that with x as a 32-bit int, unless you are of the opinion that enough apples added to a pile should give a negative number of apples. But with a wrapping type for x - such as unsigned int in C or modulo types in Ada,
you have little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo operation. I really can't see this all being handled by a single instruction.

This becomes much simpler in Rust where usize is the only legal index type:

Yeah, you have to actually write it as

y = p[x];
x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From David Brown@21:1/5 to All on Mon Sep 16 10:34:19 2024

On 15/09/2024 21:13, MitchAlsup1 wrote:

On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:

On 15/09/2024 19:21, MitchAlsup1 wrote:

On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually
packed. In case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

    struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
        uint64_t reserved_41_63              : 23;
        uint64_t dbe                         : 9; /**< R/W1C/H - RAM ECC DBE detected. */
        uint64_t reserved_9_31               : 23;
        uint64_t sbe                         : 9; /**< R/W1C/H - RAM ECC SBE detected. */
#else
        uint64_t sbe                         : 9;
        uint64_t reserved_9_31               : 23;
        uint64_t dbe                         : 9;
        uint64_t reserved_41_63              : 23;
#endif
    } s;

Which brings to mind a slightly different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.

You do so inconveniently, perhaps with access inline functions rather
than a bit-field struct.

Fortunately, not many hardware designers are that sadistic. (Or perhaps
they /are/ that sadistic, but lack the imagination for that particular
trick.)
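
A minimal sketch of the "access inline functions" approach for the field in the question (17 bits occupying bits 53..69 of a 128-bit struct). The container type and function names are hypothetical, and the field is treated as unsigned here:

```c
#include <stdint.h>

/* Hypothetical 128-bit container: lo holds bits 0..63, hi bits 64..127. */
struct wide128 {
    uint64_t lo;
    uint64_t hi;
};

/* 17-bit field at bits 53..69: 11 bits (53..63) live in lo and
   6 bits (64..69) live in hi. */
#define FIELD_MASK UINT64_C(0x1ffff)

static inline uint32_t wide_get_field(const struct wide128 *w)
{
    /* Low 11 bits come from the top of lo, high 6 bits from the bottom of hi. */
    return (uint32_t)(((w->lo >> 53) | (w->hi << 11)) & FIELD_MASK);
}

static inline void wide_set_field(struct wide128 *w, uint32_t v)
{
    uint64_t f = v & FIELD_MASK;
    /* Clear and refill bits 53..63 of lo; the shift discards f's top 6 bits. */
    w->lo = (w->lo & ~(UINT64_C(0x7ff) << 53)) | (f << 53);
    /* Clear and refill bits 0..5 of hi with f's top 6 bits. */
    w->hi = (w->hi & ~UINT64_C(0x3f)) | (f >> 11);
}
```

The split has to be spelled out by hand, which is the inconvenience being described.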

In My 66000 ISA it is both efficient and straightforward::

That does not change that it is inconvenient in C, which is what you
asked about. For any ISA, there will always be things that can easily be
written in C that are awkward in assembly, and vice versa.

    i = struct.field;
..
    struct.field = j;

    CARRY    Rsf1,{I}
    SRA      Ri,Rsf0,<17,53>
and
    CARRY    Rsf1,{O}
    INS      Rsf0,Rj,<52,17>

Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
need for these registers to be sequential.

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).

Some hardware designers seem to have no understanding of or
consideration for the software folks that will use their designs. "HW
Sadism" is no doubt too strong a term - ignorance and a lack of
consideration is more realistic.

If the ISA has any realistically efficient grasp on multi-precision
integer operations, these fall out almost for free.

I can't see that. I am not saying you are wrong, but I don't see the connection.


From David Brown@21:1/5 to Tim Rentsch on Mon Sep 16 11:14:52 2024

On 15/09/2024 21:51, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

Robert Finch <[email protected]> writes:

On 2024-09-15 12:09 p.m., David Brown wrote:

In addition, some padding-related things can be defined by Standard
itself. Not in this particular case, but, for example, it could be
defined that when field of one integer type is immediately followed by
another field of integer type with the same or narrower width then
there should be no padding in-between.

What about bit-fields in a struct? I believe they are usually packed. In
case it's for something like an I/O device.

That's a bit more complicated as it depends on the target byte-order.

e.g.

struct GIC_ECC_INT_STATUSR_s {
#if __BYTE_ORDER == __BIG_ENDIAN
uint64_t reserved_41_63 : 23;
uint64_t dbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t sbe : 9;
#else
uint64_t sbe : 9;
uint64_t reserved_9_31 : 23;
uint64_t dbe : 9;
uint64_t reserved_41_63 : 23;
#endif
} s;

Which brings to mind a slightly different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.

The 17-bit bitfield can be specified in the usual way. Example:

struct bitfield_example {
unsigned one : 32;
unsigned two : 20;
unsigned hmm : 17;
};

An implementation is allowed to use up the last 12 bits of the
first 64-bit unit and the first 5 bits of the next 64-bit unit.
But, whether that happens or not is up to the implementation.
The bitfield for member 'hmm' could instead be put entirely in
the second 64-bit unit, with the last 12 bits of the first 64-bit
unit simply left as padding. There is no standard way to force
it.

Yes, implementations get to choose this, with most implementations
following the specifications from the ABI for the target.

Many implementations have a way to specify tighter packing, but
naturally this is not standardised. But it can give a picture of the differences in code generation between the two options, which makes it
easy to see why most compilers do not split bit-fields across two
storage units.

(There is a standard way to specify that "hmm" above is /not/ packed
across two units - adding a field "unsigned : 0;" between "two" and
"hmm" forces this.)

<https://godbolt.org/z/sYxWjM766>
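
A minimal sketch of the zero-width bit-field mentioned above, assuming unsigned int is at least 32 bits wide. The struct and function names are hypothetical. Whether the first struct actually straddles storage units is up to the implementation, so the code makes no claim about its size:

```c
#include <limits.h>
#include <stddef.h>

/* 'hmm' may or may not be split across two storage units. */
struct bitfield_may_straddle {
    unsigned one : 32;
    unsigned two : 20;
    unsigned hmm : 17;
};

/* The unnamed zero-width field closes the current storage unit, so
   'hmm' is guaranteed to start in a fresh one. */
struct bitfield_no_straddle {
    unsigned one : 32;
    unsigned two : 20;
    unsigned     : 0;
    unsigned hmm : 17;
};

/* Store a value in the 17-bit field and read it back; values wider
   than 17 bits are reduced modulo 2^17, as for any unsigned bit-field. */
unsigned roundtrip_hmm(unsigned v)
{
    struct bitfield_no_straddle s = {0};
    s.hmm = v;
    return s.hmm;
}
```

Comparing sizeof for the two structs on a given compiler shows whether that implementation packs across units by default.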


From David Brown@21:1/5 to BGB on Mon Sep 16 11:27:15 2024

On 16/09/2024 09:18, BGB wrote:

On 9/15/2024 12:46 PM, Anton Ertl wrote:

Michael S <[email protected]> writes:

Padding is another thing that should be Implementation Defined.

It is. It's defined in the ABI, so when the compiler documents to
follow some ABI, you automatically get that ABI's structure layout.
And if a compiler does not follow an ABI, it is practically useless.

Though, there also isn't a whole lot of freedom of choice here regarding layout.

If member ordering or padding differs from typical expectations, then
any code which serializes structures to files is liable to break, and
this practice isn't particularly uncommon.

Your expectations here should match up with the ABI - otherwise things
are going to go wrong pretty quickly. But I think most ABIs will have
fairly sensible choices for padding and alignments.

Say, typical pattern:
Members are organized in the same order they appear in the source code;

That is required by the C standards. (A compiler can re-arrange the
order if that does not affect any observable behaviour. gcc used to
have an optimisation option that allowed it to re-arrange struct
ordering when it was safe to do so, but it was removed as it was rarely
used and a serious PITA to support with LTO.)

If the current position is not a multiple of the member's alignment, it
is padded to an offset that is a multiple of the member's alignment;

That is a requirement in the C standards.

The only implementation-defined option is whether or not there is
/extra/ padding - and I have never seen that in practice. (And there
are more implementation-defined options for bit-fields.)

For primitive types, the alignment is equal to the size, which is also a power of 2;

That is the norm, up to the maximum appropriate alignment for the
architecture. A 16-bit cpu has nothing to gain by making 32-bit types
32-bit aligned.

If needed, the total size of the struct is padded to a multiple of the largest alignment of the struct members.

That is required by the C standards.
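
A minimal sketch for inspecting the layout rules listed above on a given compiler. The struct and function names are hypothetical, and the printed numbers depend on the target ABI, so none are claimed here:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct with members of differing alignment. */
struct sample {
    char  tag;    /* first member: always at offset 0 */
    int   count;  /* padded up to a multiple of int's alignment */
    short flag;   /* follows count, in declaration order */
};

/* Print each member's offset, then the total size and alignment. */
void print_sample_layout(void)
{
    printf("tag   at %zu\n", offsetof(struct sample, tag));
    printf("count at %zu\n", offsetof(struct sample, count));
    printf("flag  at %zu\n", offsetof(struct sample, flag));
    printf("size %zu, alignment %zu\n",
           sizeof(struct sample), (size_t)_Alignof(struct sample));
}
```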

For C++ classes, it is more chaotic (and more compiler dependent), but:

Not really, no. Apart from a few hidden bits such as pointers to handle virtual methods and virtual inheritance, the data fields are ordered,
padded and aligned just like in C structs. And these hidden pointers
follow the same rules as any other pointer.

The only other special bit is empty base class optimisation, and that's
pretty simple too.


From Niklas Holsti@21:1/5 to Thomas Koenig on Mon Sep 16 12:26:20 2024

On 2024-09-16 10:25, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

If the loop variable
represents degrees C or F, or some other naturally signed measure it
should be signed (or maybe floating point).

The first one is a bad idea because temperature is a continuous
physical quantity.

The second has bad implications for constructs like

DO R = 0.0, 1.0, 0.1

where it will depend on details of floating point arithmetic whether the
number of loop trips is 10 or 11.

You can argue that people can write

DO R=0.0, 1.05, 0.1

but this construct was error-prone enough that it was deleted
from the Fortran standards.

What kind of loop it
is, whether ascending or descending, or what the increment is, etc,
is secondary; a more important factor is what sort of value is
being represented, and in almost all cases that is what should
determine the type used.

Not for floating point numbers. For that, you should simply do

DO I=0,10
R = I * 0.1

or

R = 0.0
DO I=0,10
...
R = R + 0.1
END DO

whichever rounding error you prefer.

Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
C/C++ is a mistake. It should always have been ILP64, and this nonsense
would go away. Any new architecture should make C ILP64 (looking at you
RISC-V, missing yet another opportunity to not make the same mistakes as
everyone else).

I believe this view is shortsighted. The big mistake is developers
hardcoding types everywhere - especially int, but also long, and
their unsigned variants. It's almost never a good idea to hardcode
a specific width (eg, uint32_t) in a type name used for parameters
or local variables, but that is by far a very common practice.

I agree. This issue guided the design of the scalar type system in Ada.

C programmers can use typedef to get part way there, but not all the way because typedefs are still weakly typed.
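
A minimal sketch of what "weakly typed" means for typedefs, with single-member structs as one common workaround. All type and function names are hypothetical:

```c
/* Two typedefs of int: distinct names, but one and the same type. */
typedef int celsius_t;
typedef int apple_count_t;

static inline celsius_t warmer(celsius_t t, int delta)
{
    return t + delta;
}

/* Compiles without any diagnostic, although apples are passed where
   a temperature is expected: the typedef names carry no type safety. */
static inline celsius_t typedef_mixup(apple_count_t apples)
{
    return warmer(apples, 0);
}

/* Wrapping the value in a struct creates genuinely distinct types. */
struct celsius     { int degrees; };
struct apple_count { int n; };

static inline struct celsius warmer_strict(struct celsius t, int delta)
{
    t.degrees += delta;
    return t;
}

/* Passing a struct apple_count to warmer_strict() would be rejected
   by the compiler as an incompatible argument type. */
```

The struct wrappers give distinct types but not the range and precision declarations that the Ada and Fortran facilities provide.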

Hence Fortran's SELECTED_REAL_KIND and SELECTED_INT_KIND...

And the way Ada programmers can define application-specific types with
the ranges and precisions the application needs.


From David Brown@21:1/5 to BGB on Mon Sep 16 13:12:15 2024

On 16/09/2024 02:00, BGB wrote:

On 9/15/2024 2:09 PM, David Brown wrote:

On 14/09/2024 04:39, BGB wrote:

On 9/13/2024 10:55 AM, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

Most of the commonly used parts of C99 have been "safe" to use for 20
years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

What about VLAs?

IIRC, VLAs and _Complex and similar still don't work in MSVC.
Most of the rest does now at least.

Thanks - you know it far better than I do.

I use it fairly often.
Mostly VS2022 at present.

There are only two serious, general purpose C compilers in mainstream
use - gcc and clang, and both support almost all of C23 now. But it
will take a while for the more niche tools, such as some embedded
compilers, to catch up.

It is almost impossible to gather statistics on compiler use,
especially with free compilers, but what about MSVC and icc?

From what I gather:
   GCC and Clang are popular for most mainline targets;
     GCC is the dominant C compiler on Linux.

It is also far and away the dominant compiler for embedded systems -
both embedded Linux and small embedded systems.

Albeit, ones with semi-popular CPU architectures.

What do you mean by that? ARM currently has perhaps 90% of the market
for small embedded systems, and gcc is used for development on perhaps
85% of those systems. Major non-ARM microcontroller cores include AVR,
RISC-V, ESP-32 and the undying PIC16x and 8051 cores. Only the last two
there do not have gcc ports, but those devices have almost died out of
the market for new designs. clang is still the "new kid on the block"
for small-systems embedded development, and has not yet made a big
impact. And of course there are a range of high-price commercial
toolchains that are very popular in some areas, but not a big fraction
of users overall.

Though, GCC and Linux kinda go together here.

Small embedded systems don't run Linux. And the people developing for
them usually do so on Windows, not Linux. So in the embedded
development world, gcc dominates, Linux does not. (But for embedded
Linux systems, gcc dominates.)

Say, one isn't going to find Linux ported to targets outside the scope
of GCC,

True.

and GCC isn't too interested outside the scope of targets that
could potentially run Linux and see at least semi-widespread use.

False.

Much of the development work in gcc is done based on which company pays
for the work, and many of the biggest commercial backers have an
interest in Linux (Intel, AMD, ARM, IBM, Google, Facebook, etc. - even Microsoft). But the gcc ports for smaller microcontrollers also have
their commercial backers, and they only need to concentrate on the
backend - they get most of the benefits (new language support, most optimisations, static error checking, etc.) for free.

The huge majority of current embedded systems use ARM Cortex-M cores.
The huge majority of these run software developed with gcc. None of
them run Linux.

.

   TinyCC, popular for niche use, but limited range of targets;
     x86, ARM, experimental RISC-V.
   SDCC, popular for 8/16 bit targets;

SDCC has never been very popular. For the targets SDCC supports, Keil
(8051) and IAR (many small CISC targets) are far more common. But for
these kinds of devices, you are never working in anything close to
standard C anyway.

OK.

I had mostly heard of people using SDCC here.

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both. They are not
representative of developers. And while I think SDCC is a very
impressive project and it would be my own first choice if I were working
with brain-dead 8-bitters, its popularity is close to negligible. And
that is in a market for 8-bit cores that is rapidly disappearing.

   CC65, popular for 6502 and 65C816;

That's getting /really/ obscure now. There are thousands of C
compilers that are used, or have been used, for various
microcontrollers. But if you sum all their uses over the last decade,
it will not be close to 1% of the total use of C compilers.

This is mostly for the crowd still messing around with a few older systems:
Commodore 64/128
Apple II / II/C / II/E
Apple IIGS
NES and SNES
...

It is not a "crowd" - it's a small group of oddballs and enthusiasts. I
fully support them, and playing with these things is a great hobby. I
would maybe be doing that too, if I had twice as many hours in the week.
But talking about "popular compilers like gcc and CC65" is like
talking about "popular sports like football and Inuit ear pulling contests".

Also, some newer projects, like the "Commander X16" are also using CC65
(it was based around a 65C816 being used in a 6502 compatibility mode).

Where, AFAIK, GCC proper has little interest in these targets.

The GCC community would be quite happy to support such targets, but
someone would need to make the port. And the architecture of the gcc
compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
getting good results for a cpu like the 6502 from gcc would be an
extraordinary level of effort. It makes a lot more sense to look at
tools like SDCC with an architecture that fits better.


From David Brown@21:1/5 to Michael S on Mon Sep 16 12:39:54 2024

On 16/09/2024 10:34, Michael S wrote:

On Sun, 15 Sep 2024 18:47:06 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.

No other compiler on godbolt is doing it, except possibly gcc
clones. Not even clang, whose former leader wrote "Nasal Manifest".

Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.

I didn't mean to say that gcc3 is the only gcc version that returns non-overwritten value.

I also did not mean to imply that - I meant merely to show that gcc has generated code this way since at least that version.

I meant to say that all gcc versions are in one camp and the rest of the compilers represented on Godbolt are in the other camp.

Yes, but you were wrong about that. And even if you were right, it
would still be irrelevant - your argument that "what I wrote is how all production C compilers work today" has been shattered. The most-used C compiler does not work as you thought, and has not done so for at least
20 years. Indeed, for some targets (such as 32-bit ARM that I tested)
it does the write to bar.x[i] first, then the write to bar.y, because
that makes more sense from an instruction scheduling viewpoint.


From David Brown@21:1/5 to Michael S on Mon Sep 16 12:32:39 2024

On 15/09/2024 21:42, Michael S wrote:

On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:

On 14/09/2024 23:19, Michael S wrote:

Yes, exactly.

Contrary to your imagination - compilers have /never/ followed your
proposed semantics. The oldest gcc version I found on godbolt.org is
3.4.6 from 2006, and given:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.

No other compiler on godbolt is doing it, except possibly gcc clones.
Not even clang, whose former leader wrote "Nasal Manifest".

Is this going to be a "No true Scotsman" argument? Or did you forget to
enable optimisations when testing /all/ the compilers on godbolt?

I tested a couple more.

With gcc for 32-bit ARM, the code re-arranges the stores - bar.x[i] gets
the value of 42 before the store to bar.y is done, and bar.y is not
reloaded. This is perfectly valid code generation.

icc generates the same code as gcc for x86-64, other than the order of
the first two instructions.

Compilers are, of course, free to re-read bar.y. But they are not
obliged to. And a good enough optimising compiler will not re-read
bar.y because it is a waste of instruction cycles. Most of the C
compilers on godbolt do not optimise as well as gcc does, though some
(like clang and icc) will do better in a minority of cases. I know of a
number of other heavily optimising compilers that are not on godbolt
because they have high costs and licenses that forbid that kind of use.

However, what we have from godbolt is a clear pattern - there is
absolutely no basis for suggesting that accessing bar.x[] beyond the
defined limit of the array is defined in any way, either within the C standards, practical real-world compilers, or documented extensions in compilers.
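
As a minimal sketch (not from the original posts), the same function can be exercised while staying inside defined behaviour by rejecting indices outside 0..7. The name foo_in_bounds and the -1 error value are hypothetical, chosen only for illustration:

```c
#include <stddef.h>

struct Bar {
    char x[8];
    int y;
};

static struct Bar bar;

/* Same body as foo() above, but out-of-range indices are refused
   instead of relying on whatever code the compiler happens to emit. */
int foo_in_bounds(int i) {
    if (i < 0 || (size_t)i >= sizeof bar.x) {
        return -1;      /* hypothetical error value */
    }
    bar.y = 1234;
    bar.x[i] = 42;      /* cannot touch bar.y when 0 <= i < 8 */
    return bar.y;       /* 1234 whether or not bar.y is reloaded */
}
```

For in-range indices the result is the same whether the compiler reloads bar.y or reuses the constant, which is why skipping the reload is a valid optimisation.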


From David Brown@21:1/5 to BGB on Mon Sep 16 14:30:29 2024

On 16/09/2024 01:54, BGB wrote:

On 9/15/2024 2:40 PM, David Brown wrote:

On 14/09/2024 08:34, BGB wrote:

On 9/13/2024 10:30 AM, David Brown wrote:

On 12/09/2024 23:14, BGB wrote:

On 9/12/2024 9:18 AM, David Brown wrote:

On 11/09/2024 20:51, BGB wrote:

On 9/11/2024 5:38 AM, Anton Ertl wrote:

Josh Vanderhoof <[email protected]> writes:

[email protected] (Anton Ertl) writes:

<snip lots>

Though, generally takes a few years before new features become usable.
Like, it is only in recent years that it has become "safe" to use
most parts of C99.

Most of the commonly used parts of C99 have been "safe" to use for
20 years. There were a few bits that MSVC did not implement until
relatively recently, but I think even they have caught up now.

Until VS2013, the most one could really use was:
   // comments
   long long
Otherwise, it was basically C90.
   'stdint.h'? Nope.
   Ability to declare variables wherever? Nope.
   ...

Nonsense.

MS basically gave up on C and concentrated on C++ (then later C# and
other languages). Their C compiler gained the parts of C99 that were
in common with C++ - and anyway, most people (that I have heard of)
using MSVC for C programming actually use the C++ compiler but stick
approximately to a C subset. And this has been the case for a /long/
time - long before 2013.

Go and try to write C with variables not declared at the start of a
block in VS2008 or similar and see how far you get...

While, it may work in C++ mode, it did not work in C mode.

You have tried it, I have not, so I will take your word for it. Perhaps
those I heard of using it were, as you say, compiling in C++ mode - my understanding is that is very common with MSVC.

IIRC, the ability to declare variables wherever got added in VS2013.
Looks like 'stdint.h' got added in VS2010.

<stdint.h> became part of C++ in C++11, but most C and C++ compilers
have had it since shortly after C99 came out, even if they did not
support much more of C99.

I can sort of understand MS being lazy about supporting new C standards
and features that required effort - after all, very few people use MSVC
in C mode. But a <stdint.h> header only takes a dozen lines on a fixed platform and is directly useful in C++ as well as C. I suppose at that
time MS was still desperately trying to fight against anything that was
open and not tying people into their systems, so they'd rather people
used DWORD and the like than uint32_t.

   Whether or not the target/compiler allows misaligned memory access;
     If set, one may use misaligned access.

Why would you need that? Any decent compiler will know what is
allowed for the target (perhaps partly on the basis of compiler
flags), and will generate the best allowed code for accesses like
foo3() above.

Imagine you have compilers that are smart enough to turn "memcpy()"
into a load and store, but not smart enough to optimize away the
memory accesses, or fully optimize away the wrapper functions...

Why would I do that? If I want to have efficient object code, I use
a good compiler. Under what realistic circumstances would you need
to have highly efficient results but be unable to use a good
optimising compiler? Compilers have been inlining code for 30 years
at least (that's when I first saw it) - this is not something new
and rare.

Say, you are using a target where you can't use GCC or similar.

Which target would that be? Excluding personal projects, some very
niche devices, and long-outdated small CISC chips, there really aren't
many devices that don't have a GCC and clang port. Of course there
/are/ processors that gcc does not support, but almost nobody writes
code that has to be portable to such devices.

And as for optimising compilers, I used at least two different
optimising compilers in the mid nineties that inlined code
automatically, before using gcc. (I can't remember if they inlined
memcpy - it was a long time ago!). Optimising compilers are not a new
concept, and are not limited to gcc and clang.

It also depends on what one considers optimizing.

Yes, that's a fair point. As far as the C language is concerned,
there's no such thing - any generated code that gives the same (or
equally valid) observable behaviour is simply an alternative output for
the compiler. But it generally means that the compiler makes more than
a minimal effort to generate more efficient results.

But, like:
Allocates variables into registers;
Evaluates expressions involving constants;
Turns "memcpy()" into inlined loads/stores in some cases;
    Essentially treating it like a builtin function.
...

Well, at least BGBCC does this much.

Very good.

Things it doesn't do though:
Loop unrolling;

Loop unrolling can be difficult in a compiler - it's also not always a
good thing in the end (cache arrangements can sometimes mean a real loop
is faster than an unrolled loop).

Inline functions;
...

Inlining small functions is a /very/ useful optimisation, IMHO,
especially when it happens before other optimisations like constant propagation.

There is a partial feature to cache member loads and array loads within
a basic-block, but will flush any such cached values whenever a memory
store happens.

Say:
i=foo->bar->x + foo->bar->y;
Will cache and reuse the first foo->bar.
But, if you do:
*ptr=0;
Or:
foo->z=3;
It will flush any memory of the cached values (unless the pointers are 'restrict').

There is an option to disable this caching though (at which point it
will always do each member load). But, unlike TBAA, this optimization is
less prone to break stuff.

Slow and always correct is better than fast and sometimes wrong!
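
A minimal sketch of why the cached member load has to be flushed on a store, and what 'restrict' changes. The struct and function names here are hypothetical and are not taken from BGBCC:

```c
struct Inner { int x; int y; };
struct Outer { struct Inner *bar; int z; };

/* Without restrict, the store through ptr may alias foo->bar->y (or even
   foo->bar itself is safe here only because the types differ), so a
   conservative compiler re-reads foo->bar and foo->bar->y after it. */
int sum_with_store(struct Outer *foo, int *ptr) {
    int a = foo->bar->x;
    *ptr = 0;
    return a + foo->bar->y;
}

/* With restrict on ptr, the caller promises that the object modified
   through ptr is not accessed any other way inside this function, so
   the first load of foo->bar may be reused across the store. */
int sum_with_restrict(struct Outer *foo, int *restrict ptr) {
    int a = foo->bar->x;
    *ptr = 0;
    return a + foo->bar->y;
}
```

Calling sum_with_store() with ptr pointing at foo->bar->y is valid and must observe the store; calling sum_with_restrict() that way would break the promise made by restrict.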

It also has a special feature whereby small leaf functions that fit entirely in scratch registers may skip creation of a stack frame.

But, I can note that even with these limitations, BGBCC+BJX2 still seems
to be beating RV64G + "GCC -O3" in terms of performance in my tests
(well, mostly because clever compiler can't beat ISA limitations).

Say:
BJX2, haven't ported GCC as it looks like a pain;
   Also GCC is big and slow to recompile.

6502 and 65C816, because these are old and probably not worth the
effort from GCC's POV.

Various other obscure/niche targets.

Say, SH-5, which never saw a production run (it was a 64-bit
successor to SH-4), but seemingly around the time Hitachi spun-out
Renesas, the SH-5 essentially got canned. And, it apparently wasn't
worth it for GCC to maintain a target for which there were no actual
chips (comparably the SH-2 and SH-4 lived on a lot longer due to
having niche uses).

It would be quite ridiculous to limit the way you write code because
of possible limitations for non-existent compilers for target devices
that have never been made.

Hitachi did release an ISA spec for SH-5 at least (and it might have
worked OK, if Renesas had pushed "upwards" rather than focusing almost exclusively on the small embedded / microcontroller space).

Pushing upwards would have been a waste of money.

But, at present, people trying to worry about portability to things with non-power-of-2 integers, non-8-bit bytes, non-twos-complement
arithmetic, etc, has a similar level of validity (or non-validity) to
writing code for ISA's which never saw a release in "actual silicon".

Agreed. There /are/ cores that have such features, like DSPs and very specialised cores, but the code you use on them is equally specialised.
You don't need to port back and forth between such cores and "normal"
targets.

If the compiler is naive (wrt inline memcpy):
memcpy(&v, cs, 8);
rl=(v>>4)&15;
Needs 5 instructions, but:
v=*(uint64_t *)cs;
rl=(v>>4)&15;
Uses 3 instructions.

Having the compiler turn the former into the latter is possible, but
would require more complex pattern matching, and would likely need to be handled in the frontend (rather than in the function-call operation in
the backend).

Can I recommend you try to implement gcc's __builtin_constant_p()
function that determines if the result of an expression is known at
compile time? (It's fine to have false negatives for complicated
cases.) But it needs to be evaluated at compile time and used for
dead-code elimination, otherwise there's little point.

Then your standard library implementation of memcpy (assuming unaligned accesses are allowed) can be something approximately like :

#define memcpy(s1, s2, n) \
if (__builtin_constant_p(n)) { \
if (n == 1) { \
uint8_t * p = (uint8_t *) s1; \
const uint8_t * q = (const uint8_t *) s2; \
*p = *q; \
} else if (n == 2) { \
uint16_t * p = (uint16_t *) s1; \
const uint16_t * q = (const uint16_t *) s2; \
*p = *q; \
} else if (n == 4) { \
uint32_t * p = (uint32_t *) s1; \
const uint32_t * q = (const uint32_t *) s2; \
*p = *q; \
} else { \
__real_memcpy(s1, s2, n); \
} \
} else { \
__real_memcpy(s1, s2, n); \
}

This is missing several details to make it safe and to match the
standard library specifications, but I believe it should be possible to
do something along those lines. (Implementing gcc's statement
expressions would help too.)

Not necessarily, it wouldn't make sense for _Alignof to return 1 for
all the basic integer types.

Of course it makes sense to do that, on targets where an alignment of
1 is safe and efficient.

Tradition dictates that struct members are padded and aligned to their native alignment (usually equal to the size of the base type), unless
the struct is 'packed'.

No, tradition dictates that there is a maximum to the alignment,
matching the size of the architecture. 16-bit implementations rarely
have any type alignment greater than 16-bit, 32-bit implementations
rarely have any alignment greater than 32-bit, and 8-bit implementations
rarely have any alignment greater than 8-bit.

An implementation where all structs are packed by default could have unforeseen consequences...

Yes - such as poor performance. And of course some programmers make unwarranted assumptions about alignments and paddings.

Presumably, _Alignof would give the same alignment as would appear in
structs or similar.

Yes. C requires that.

But, for "minimum alignment" it may make sense to return 1 for
anything that can be accessed unaligned.

Again, I see no use for this.

The main alternatives:
Detect target architecture and "know" whether the architecture is unaligned-safe (ye olde mess of ifdef's);
Have a global PP define that applies to all types, but this doesn't
allow for cases where some types are unaligned safe but others are not.

One possibility could be __minalign__(type), but (unlike doing it with preprocessor defines), one could not likely use it in preprocessor expressions.

#if __MINALIGN_LONG__==1
...
#else
...
#endif

Works, but:
#if _Alignof(long)==1
...

Poses problems, as generally the preprocessor is not able to evaluate
things like this.
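
For illustration only: although #if cannot evaluate _Alignof, C11 does accept it in _Static_assert, in enum constants, and in an ordinary if that the compiler can fold away. A minimal sketch, with hypothetical names:

```c
#include <stdint.h>

/* _Alignof is an integer constant expression, so it works here even
   though the preprocessor cannot evaluate it in #if. */
_Static_assert(_Alignof(int32_t) <= sizeof(int32_t),
               "alignment never exceeds the size of the type");

enum { LONG_ALIGN = _Alignof(long) };   /* hypothetical name */

/* An ordinary if on a constant condition; an optimising compiler can
   drop the dead branch, much as #if/#else would. */
int long_is_byte_aligned(void) {
    if (_Alignof(long) == 1) {
        return 1;
    } else {
        return 0;
    }
}
```

Note that this only reports the alignment the implementation requires for the type; it says nothing about whether the hardware tolerates misaligned accesses.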

Scrap all that and have functions to read or write from a given address
with specified sizes, using whatever method the compiler sees as most
efficient and supported by the target. Or implement memcpy()
optimisations for small known sizes, and use that.
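
A minimal sketch of such access functions, assuming only the standard memcpy(). The names load_u64, store_u64 and second_nibble are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from any address, aligned or not. A compiler that
   treats memcpy() as a builtin can turn this into a single load on
   targets that allow unaligned access. */
static inline uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 64-bit value to any address, aligned or not. */
static inline void store_u64(void *p, uint64_t v) {
    memcpy(p, &v, sizeof v);
}

/* The "rl=(v>>4)&15" example from earlier, written via the helper. */
static inline unsigned second_nibble(const unsigned char *cs) {
    return (unsigned)((load_u64(cs) >> 4) & 15);
}
```

The calling code then never casts pointers itself, and the choice between one load and several byte loads is left to the compiler and target.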

Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would
give 1 if the target supports misaligned pointers.

The alignment of types in C is given by _Alignof. Hardware may
support unaligned accesses - C does not. (By that, I mean that
unaligned accesses are UB.)

The point of __MINALIGN_type__ would be:
If the compiler defines it, and it is defined as 1, then this allows
the compiler to be able to tell the program that it is safe to use
this type in an unaligned way.

For what purpose?

Probably for unaligned deref's on targets where "memcpy()" is a less desirable option (say, if it takes several additional CPU instructions).

Make a better memcpy() implementation instead.

This also applies to targets where some types are unaligned but
others are not:
Say, if all integer types 64 bits or less are unaligned, but 128-bit
types are not.

For what purpose? And why do you want to worry about totally
hypothetical systems?

Note that a lot of what I am describing here is true of BJX2.

Are you saying that you have no alignment restrictions for types up to
64-bits (that is, they are placed at any address), but /do/ have
alignment restrictions for 128-bit types? That would be so strange that
I suspect I am misunderstanding you.

Perhaps you are saying that unaligned accesses are allowed for types up
to 64-bit even though the types are normally aligned for efficiency, but unaligned accesses are not allowed for 128-bit types? That is a lot
more plausible, especially if there is a special implementation for
128-bit accesses. (On x86-64 there are some SIMD vector instructions
that do not support unaligned accesses.)

It is also true of __m128 and similar in MSVC.
__m128 v;
v=*(__m128 *)someptr;
May explode if someptr is not 16-byte aligned, as it may emit a "MOVDQA"
or similar (rather than MOVDQU).

But, in both cases, if "int *" or "long *" is misaligned, both are fine
with it.

This is all quite simple to handle - don't faff around converting
pointer types unless you know exactly what you are doing, and you know
it is safe to do and your alignments are correct according to the ABI requirements. A decent C compiler is not going to give you incorrect alignments unless you go out of your way to create them via explicit
code (i.e., using casts).

The only time you get problems is if your compiler makes certain
assumptions (such as gcc x86-64 assuming 16-byte stack pointer
alignment) and you have an OS that does something stupid (like Windows
not necessarily aligning the stack pointer properly before calling
callbacks). For that, you want compiler help.

There may be other compilers in a similar camp.

But, then again, it is kinda hypothetical in the sense to claim that one can't cast and deref a pointer, since on most existing targets, it works without issue (except that on GCC one may also need to use 'volatile').

I've worked with targets where unaligned access does not work - or where
it is immensely slow. This is something that the compiler should get
right, and the user should rely on the compiler.

Most of this is being compiled by BGBCC for a 50 MHz CPU.

So, the CPU is slow and the compiler doesn't generate particularly
efficient code unless one writes it in a way it can use effectively.

Which often means trying to write C like it was assembler and
manually organizing statements to try to minimize value dependencies
(often caching any values in variables, and using lots of variables).

In this case, the equivalent of "-fwrapv -fno-strict-aliasing" is the
default semantics.

Generally, MSVC also responds well to a similar coding style as used
for BGBCC (or, as it more happened, the coding styles that gave good
results in MSVC also tended to work well in BGBCC).

Note that MSVC most certainly does /not/ work like "gcc -fwrapv" -
signed integer overflow is UB in MSVC, and it generates code that
assumes it never happens. There is an obscure officially undocumented
(or documented unofficially, if you prefer) flag to turn off such
optimisations.

Last I read about it, they had no plans to do any type-based alias
analysis, but nor did they rule out the possibility in the future.

I haven't seen any issues with MSVC and this sort of code usually works
as expected...

But, a lot of times, one has to supply these options to GCC otherwise
the code will break. So, it almost makes sense to assume these semantics
as a default.

What you mean is that for some bad code, you have to supply these flags
or you face "garbage in, garbage out". The code was already broken if
these flags are needed for it to behave as the programmer intended.
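
As a sketch of what code that does not depend on "-fwrapv" can look like: check for overflow before doing the signed addition, and use an unsigned type on the occasions when wrapping is actually wanted. The function names are hypothetical:

```c
#include <limits.h>
#include <stdbool.h>

/* Signed addition that reports overflow instead of invoking UB.
   The range check is done before the addition, never after. */
bool checked_add(int a, int b, int *out) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}

/* Unsigned arithmetic wraps modulo 2^N by definition, so this is
   well defined on every conforming compiler, with or without flags. */
unsigned wrapping_add(unsigned a, unsigned b) {
    return a + b;
}
```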

In the case of BGBCC, I decided to make these semantics the default as a matter of a policy decision.

And IMHO that's a /really/ bad idea. Instead of telling users "we know
you write shit code - so I'll assume your source code might be shit,
even if the results are worse when you write good code", why not
encourage people to write code correctly by giving them the best results
for correct code? And if possible, give them tools - static and
run-time - to help spot their mistakes, rather than blessing those
mistakes as a new norm.

There is some talk about pointer provenance semantics for C (apparently
semi controversial), but admittedly thus far I don't fully understand
the idea.

It is complicated, but has big potential for improving static analysis, run-time checkers, and code optimisations. One thing you can be sure of is
that encouraging people to break the current C rules is only going to
make it more likely that they will have trouble in the future.


From David Brown@21:1/5 to Thomas Koenig on Mon Sep 16 14:45:44 2024

On 16/09/2024 09:17, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).

It has happened, see the illegal (but sometimes useful)
6502 instructions, or the recent RISC-V implementation snafu
(GhostWrite).

I have seen plenty of undefined behaviour in ISA's over the years. (A
very common case is that instruction encodings that are not specified
are left as UB so that later extensions to the ISA can use them.) I was
just thinking of the reactions you'd get if you made an ISA where
attempting to overflow signed integer arithmetic was UB at the hardware
level, so that you could get faster and simpler instructions.

A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have a
load with post-increment instruction in the loop. You can also do that
with x as a 32-bit int, unless you are of the opinion that enough apples
added to a pile should give a negative number of apples.

But of course it should!

But wait, no, the number of apples should become zero if you add
enough of them.

But wait... maybe if the pile becomes too large, then the apples
will no longer be individual apples, but will be crushed under
their weight, a bit like https://what-if.xkcd.com/4/ .

:-)

But with a
wrapping type for x - such as unsigned int in C or modulo types in Ada,
you have little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo operation. I
really can't see this all being handled by a single instruction.

One reason not to use such a wrapping type.

Agreed.

Although, if you have (R1+R2) addressing and a 32-bit addition, this
could actually work, but not with a post-increment instruction.

Yes, but assuming you have 64-bit pointers you'd need a 64-bit + 32-bit addition. That could work, but I think you'd end up making your ISA a
fair bit more complicated for little gain (compared to just using UB
overflow int types and not going overboard in the software).
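
To make the two cases concrete, here is a minimal sketch of the loop with a signed and with an unsigned 32-bit index. The function names are hypothetical, and what instructions a given compiler emits for each is not claimed here:

```c
/* Signed index: overflow of x would be UB, so on a 64-bit target the
   compiler may keep a single 64-bit pointer "p + x" and step it with a
   post-increment load. */
long sum_int_index(const int *p, int n) {
    long total = 0;
    int x = 0;
    while (x < n) {
        int y = p[x++];
        total += y;
    }
    return total;
}

/* Unsigned index: x is defined to wrap modulo 2^32, so in general the
   compiler has to preserve that wrapping unless it can prove from the
   loop bound that it never happens. */
long sum_unsigned_index(const int *p, unsigned n) {
    long total = 0;
    unsigned x = 0;
    while (x < n) {
        int y = p[x++];
        total += y;
    }
    return total;
}
```

In this particular loop the bound "x < n" lets a good compiler prove that no wrap occurs, so both versions can end up identical; the difference shows when no such bound is visible.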


From David Brown@21:1/5 to Terje Mathisen on Mon Sep 16 14:48:50 2024

On 16/09/2024 10:37, Terje Mathisen wrote:

David Brown wrote:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror
that would bring to some people!).

A classic example would be for "y = p[x++];" in a loop. For a 64-bit
type x, you would set up one register once with "p + x", and then have
a load with post-increment instruction in the loop. You can also do
that with x as a 32-bit int, unless you are of the opinion that enough
apples added to a pile should give a negative number of apples. But
with a wrapping type for x - such as unsigned int in C or modulo types
in Ada, you have little choice but to hold "p" and "x" separately in
registers, add them for every load, and do the increment and modulo
operation. I really can't see this all being handled by a single
instruction.

This becomes much simpler in Rust where usize is the only legal index type:

Yeah, you have to actually write it as

y = p[x];
x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to do too
much in a single expression or statement, but some C constructs are
common enough that I am happy with them. It would be hard to formulate concrete rules here.)

And the resulting object code is less efficient than you get with signed
int and "y = p[x++];" (or "y = p[x]; x++;") in C.


From Michael S@21:1/5 to David Brown on Mon Sep 16 16:04:02 2024

On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

David Brown wrote:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the
notion of int from K&R days.

That's a designer's choice, I think. It is possible to add
32-bit instructions which should be as fast (or possibly faster)
than 64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not
so easy without either making rather complicated instructions or
having assembly instructions with undefined behaviour (imagine the
terror that would bring to some people!).

A classic example would be for "y = p[x++];" in a loop. For a
64-bit type x, you would set up one register once with "p + x",
and then have a load with post-increment instruction in the loop.
You can also do that with x as a 32-bit int, unless you are of the
opinion that enough apples added to a pile should give a negative
number of apples. But with a wrapping type for x - such as
unsigned int in C or modulo types in Ada, you have little choice
but to hold "p" and "x" separately in registers, add them for
every load, and do the increment and modulo operation. I really
can't see this all being handled by a single instruction.

This becomes much simpler in Rust where usize is the only legal
index type:

Yeah, you have to actually write it as

y = p[x];
x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is an improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)

And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

It's not less efficient. usize in Rust is approximately the same as
size_t in C. With one exception that usize overflow panics under debug
build.


From David Brown@21:1/5 to Michael S on Mon Sep 16 16:09:38 2024

On 16/09/2024 15:04, Michael S wrote:

On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

David Brown wrote:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the
notion of int from K&R days.

That's a designer's choice, I think. It is possible to add
32-bit instructions which should be as fast (or possibly faster)
than 64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not
so easy without either making rather complicated instructions or
having assembly instructions with undefined behaviour (imagine the
terror that would bring to some people!).

A classic example would be for "y = p[x++];" in a loop. For a
64-bit type x, you would set up one register once with "p + x",
and then have a load with post-increment instruction in the loop.
You can also do that with x as a 32-bit int, unless you are of the
opinion that enough apples added to a pile should give a negative
number of apples. But with a wrapping type for x - such as
unsigned int in C or modulo types in Ada, you have little choice
but to hold "p" and "x" separately in registers, add them for
every load, and do the increment and modulo operation. I really
can't see this all being handled by a single instruction.

This becomes much simpler in Rust where usize is the only legal
index type:

Yeah, you have to actually write it as

y = p[x];
x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)

And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

It's not less efficient. usize in Rust is approximately the same as
size_t in C.

Ah, okay - I was thinking of it as a C unsigned int.

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to overflow, as
long as there is some other way to get efficient wrapping on the rare
occasions when you need it.

But I am completely against the idea that you have different defined
semantics for different builds. Run-time errors in a debug/test build
and undefined behaviour in release mode is fine - defining the behaviour
of overflow in release mode (other than possibly to the same run-time
checking) is wrong.


From EricP@21:1/5 to David Brown on Mon Sep 16 10:51:21 2024

David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:

On 15/09/2024 19:21, MitchAlsup1 wrote:

Which brings to mind a slight different but related bit-field issue.

If one has an architecture that allows a bit-field to span a register
sized container, how does one specify that bit-field in C ??

So, assume a register contains 64-bits and we have a 17-bit field
starting at bit 53 and continuing to bit 69 of a 128-bit struct.
How would one "properly" specify this in C.

You do so inconveniently, perhaps with access inline functions rather
than a bit-field struct.
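
A minimal sketch of such access functions for the 17-bit field at bits 53..69, assuming the 128-bit struct is held as two 64-bit words and the field is treated as unsigned. The type and function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical 128-bit container: lo holds bits 0..63, hi bits 64..127. */
typedef struct { uint64_t lo, hi; } wide128;

enum { FIELD_POS = 53, FIELD_LEN = 17 };

/* Read the field: 11 bits come from the top of lo, 6 from the bottom of hi. */
static inline uint32_t field_get(const wide128 *w) {
    uint64_t low_part  = w->lo >> FIELD_POS;
    uint64_t high_part = w->hi << (64 - FIELD_POS);
    return (uint32_t)((low_part | high_part) & ((1u << FIELD_LEN) - 1));
}

/* Write the field, leaving all other bits of both words unchanged. */
static inline void field_set(wide128 *w, uint32_t v) {
    uint64_t mask = (UINT64_C(1) << FIELD_LEN) - 1;
    uint64_t val  = v & mask;
    w->lo = (w->lo & ~(mask << FIELD_POS)) | (val << FIELD_POS);
    w->hi = (w->hi & ~(mask >> (64 - FIELD_POS))) | (val >> (64 - FIELD_POS));
}
```

A signed field would additionally need sign extension from bit 16 in field_get().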

Fortunately, not many hardware designers are that sadistic. (Or perhaps
they /are/ that sadistic, but lack the imagination for that particular
trick.)

In My 66000 ISA it is both efficient and straightforward::

That does not change that it is inconvenient in C, which is what you
asked about. For any ISA, there will always be things that can easily be written in C that are awkward in assembly, and vice versa.

i = struct.field;
..
struct.field = j;

CARRY Rsf1,{I}
SRA Ri,Rsf0,<17,53>
and
CARRY Rsf1,{O}
INS Rsf0,Rj,<52,17>

Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
need for these registers to be sequential.

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISA from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even beneficial from the hardware viewpoint, but really inconvenient on the software side (whether you use bit-fields or not).

Some hardware designers seem to have no understanding of or
consideration for the software folks that will use their designs. "HW Sadism" is no doubt too strong a term - ignorance and a lack of
consideration is more realistic.

If the ISA has any realistically efficient grasp on multi-precision
integer operations, these fall out almost for free.

I can't see that. I am not saying you are wrong, but I don't see the connection.

These double-width bit-field straddle operations show up at 32-bits.
Various FP64 formats (DEC's middle-endian FP being the worst example),
Intel page table entries and segment/gate descriptors, come to mind.

It's just going to take a while for double-width things to show up
at the 64-bit level. But if FP128 becomes a reality...

Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.

I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an extra source register and two dest registers, which has a number of consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.
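
As a rough illustration of the "two destination registers" idea, the operations listed above can be modelled in C as functions returning a pair. This is only a sketch: the type u64_pair and the function names are invented here, and the widening multiply is written with 32-bit halves so that it does not depend on a compiler-specific 128-bit type.

```c
#include <assert.h>
#include <stdint.h>

/* Two results, standing in for the two destination registers. */
typedef struct { uint64_t lo, hi; } u64_pair;

/* Add without flags: lo = sum, hi = carry out (0 or 1). */
u64_pair add_carry_out(uint64_t a, uint64_t b)
{
    u64_pair r;
    r.lo = a + b;            /* wraps modulo 2^64 */
    r.hi = (r.lo < a);       /* carry happened iff the sum wrapped */
    return r;
}

/* Full-width multiply: 64 x 64 -> 128 bits, returned as hi:lo. */
u64_pair mul_wide(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    /* Middle column: at most three 32-bit values, so it cannot overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    u64_pair r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}

/* Divide producing both results: lo = quotient, hi = remainder. */
u64_pair div_rem(uint64_t n, uint64_t d)
{
    assert(d != 0);
    u64_pair r = { n / d, n % d };
    return r;
}
```

Each function consumes two sources and produces two results, which is the register-port cost mentioned above.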

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to David Brown on Mon Sep 16 17:33:37 2024

On Mon, 16 Sep 2024 16:09:38 +0200
David Brown <[email protected]> wrote:

On 16/09/2024 15:04, Michael S wrote:

On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

David Brown wrote:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the
notion of int from K&R days.

That's a designer's choice, I think. It is possible to add
32-bit instructions which should be as fast (or possibly faster)
than 64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's
not so easy without either making rather complicated
instructions or having assembly instructions with undefined
behaviour (imagine the terror that would bring to some people!).

A classic example would be for "y = p[x++];" in a loop. For a
64-bit type x, you would set up one register once with "p + x",
and then have a load with post-increment instruction in the loop.
You can also do that with x as a 32-bit int, unless you are of
the opinion that enough apples added to a pile should give a
negative number of apples. But with a wrapping type for x -
such as unsigned int in C or modulo types in Ada, you have
little choice but to hold "p" and "x" separately in registers,
add them for every load, and do the increment and modulo
operation. I really can't see this all being handled by a
single instruction.
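
A small C sketch may make the difference concrete. To keep it testable it uses uint8_t as a scaled-down stand-in for a 32-bit unsigned index on a 64-bit machine; the function names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Index type as wide as a pointer: the compiler is free to turn this
 * into a single pointer that is bumped on every iteration. */
long sum_linear(const int *p, size_t x, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[x++];
    return s;
}

/* Narrow wrapping index: x++ must wrap modulo 256, so p and x have to be
 * kept apart and p + x recomputed for each load. */
long sum_wrapping(const int *p, uint8_t x, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[x++];
    return s;
}
```

Starting at index 250 and reading 10 elements, sum_linear walks on to element 259 while sum_wrapping goes back to element 0 after 255, so the two loops are not interchangeable.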

This becomes much simpler in Rust where usize is the only legal
index type:

Yeah, you have to actually write it as

y = p[x];
x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to
do too much in a single expression or statement, but some C
constructs are common enough that I am happy with them. It would
be hard to formulate concrete rules here.)

And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

It's not less efficient. usize in Rust is approximately the same as
size_t in C.

Ah, okay - I was thinking of it as a C unsigned int.

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to overflow,
as long as there is some other way to get efficient wrapping on the
rare occasions when you need it.

Rust has it in the form of the built-in wrapping_*() methods.
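
For comparison, a rough C analogue of the same split between explicit wrapping and checked arithmetic. This is a sketch only: __builtin_add_overflow is a GCC/Clang extension rather than standard C, and the function names are invented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Explicit wrapping: unsigned arithmetic in C is defined to wrap. */
size_t add_wrapping(size_t a, size_t b)
{
    return a + b;
}

/* Checked add: returns false on overflow and leaves *out untouched. */
bool add_checked(size_t a, size_t b, size_t *out)
{
    size_t r;
    if (__builtin_add_overflow(a, b, &r))
        return false;
    *out = r;
    return true;
}
```

The caller chooses the behaviour per operation, rather than the build mode choosing it for the whole program.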

But I am completely against the idea that you have different defined semantics for different builds. Run-time errors in a debug/test
build and undefined behaviour in release mode is fine - defining the behaviour of overflow in release mode (other than possibly to the
same run-time checking) is wrong.

On the one hand, Rust manual says that integer overflow in release mode
wraps. On the other hand, it says that "Relying on integer overflow’s wrapping behavior is considered an error."
It does not sound particularly consistent, and it is rather close to
the worst of both worlds.

However on more important issue of out-of-bound array access Rust is consistent,

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to David Brown on Mon Sep 16 15:34:25 2024

David Brown <[email protected]> schrieb:

The GCC community would be quite happy to support such targets, but
someone would need to make the port. And the architecture of the gcc compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
getting good results for a cpu like the 6502 from gcc would be an extraordinary level of effort. It makes a lot more sense to look at
tools like SDCC with an architecture that fits better.

Native compilation of gcc on a 6502 would be... interesting.

But I think an adaption of gcc to a 6502 could actually work if
the zero page was treated as 128 16-bit registers. Not going
there, though :-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to David Brown on Mon Sep 16 11:39:55 2024

David Brown wrote:

On 16/09/2024 15:04, Michael S wrote:

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to overflow, as
long as there is some other way to get efficient wrapping on the rare occasions when you need it.

But I am completely against the idea that you have different defined semantics for different builds. Run-time errors in a debug/test build
and undefined behaviour in release mode is fine - defining the behaviour
of overflow in release mode (other than possibly to the same run-time checking) is wrong.

In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug builds.
In my C code I have Assert() and AssertDbg(). Assert checks stay in the
production code; AssertDbg checks are only in the debug builds.

Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks enabled.

But debug, optimize, and checking are separate controls.
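
A minimal sketch of how such a pair of macros might look in C. The macro names follow the post, but everything else is an assumption: the DEBUG_BUILD switch, the assert_failed handler and the failure counter are invented for illustration, and a real handler would more likely log and abort.

```c
#include <stdio.h>

/* Count of failed checks (hypothetical; lets the sketch be exercised). */
unsigned assert_failures = 0;

void assert_failed(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
    assert_failures++;
}

/* Always compiled in, for production and debug builds alike. */
#define Assert(cond) \
    ((cond) ? (void)0 : assert_failed(#cond, __FILE__, __LINE__))

/* Compiled in only when the (hypothetical) DEBUG_BUILD macro is defined. */
#ifdef DEBUG_BUILD
#define AssertDbg(cond) Assert(cond)
#else
#define AssertDbg(cond) ((void)0)
#endif
```

Note that DEBUG_BUILD here is independent of the optimization level, which is the point being made: checking, debugging and optimizing are three separate switches.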

In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1 instruction
to increment 'x' then it knows no overflow is possible and then
can make all the other optimizations that C assumes are true.

And on those compilers checks can be controlled with quite fine resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.

This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to EricP on Mon Sep 16 18:58:57 2024

On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:

David Brown wrote:

On 16/09/2024 15:04, Michael S wrote:

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient
wrapping on the rare occasions when you need it.

But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a
debug/test build and undefined behaviour in release mode is fine -
defining the behaviour of overflow in release mode (other than
possibly to the same run-time checking) is wrong.

In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stay in
the production code, AssertDbg are only in the debug builds.

Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks
enabled.

But debug, optimize, and checking are separate controls.

In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1
instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.

And on those compilers checks can be controlled with quite fine
resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.

This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.

If the ability to control compiler checks was standard on DEC Ada then it
made DEC Ada non-standard.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to David Brown on Mon Sep 16 09:02:38 2024

On 9/16/2024 4:12 AM, David Brown wrote:

snip

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.

I resemble that remark! :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Thomas Koenig on Mon Sep 16 18:59:21 2024

On 16/09/2024 17:34, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

The GCC community would be quite happy to support such targets, but
someone would need to make the port. And the architecture of the gcc
compiler suite is best suited to processors with reasonably regular and
orthogonal ISAs with plenty of registers and at least 16-bit width -
getting good results for a cpu like the 6502 from gcc would be an
extraordinary level of effort. It makes a lot more sense to look at
tools like SDCC with an architecture that fits better.

Native compilation of gcc on a 6502 would be... interesting.

The 6502 would be a target, rather than a host!

Of course there were C compilers, and many other languages, running on
the 6502 BBC Micro and BBC Master computers. But those tools were a bit
more compact than gcc :-)

But I think an adaption of gcc to a 6502 could actually work if
the zero page was treated as 128 16-bit registers. Not going
there, though :-)

That would be a starting point, yes. But I would not use the whole zero
page there - perhaps just the first 32 bytes (and therefore 16 16-bit registers). Having a huge register bank would make function calls tough
when you have to stack all the callee-saved registers in your one-page
stack!

With 16 register pairs, you would get close to how the AVR is
treated in gcc - it has 32 8-bit registers which are, for many purposes, handled in pairs by the compiler. (Lowering ALU operations on 16-bit
register pairs to 8-bit operations on single registers is done mostly as peephole optimisations at the backend.)
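
To illustrate the kind of lowering meant here, this is a C model of a 16-bit add on a register pair expressed as two 8-bit operations, in the manner of an ADD followed by an ADC. It is a sketch of the idea rather than of what gcc's AVR backend emits, and the function name is invented.

```c
#include <stdint.h>

/* 16-bit add performed as two 8-bit adds with a carry between them. */
void add16_by_bytes(uint8_t a_lo, uint8_t a_hi,
                    uint8_t b_lo, uint8_t b_hi,
                    uint8_t *r_lo, uint8_t *r_hi)
{
    unsigned lo_sum = (unsigned)a_lo + b_lo;   /* low bytes, like ADD */
    unsigned carry  = lo_sum >> 8;             /* carry out: 0 or 1 */
    *r_lo = (uint8_t)lo_sum;
    *r_hi = (uint8_t)(a_hi + b_hi + carry);    /* high bytes, like ADC */
}
```

On a 6502 with zero-page "registers" the same pattern would apply, only with each byte living in memory rather than in a register file.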

Someone did manage to get Linux running on an 8-bit AVR (by having the
AVR run an ARM emulator, and using ARM Linux). I'm sure the same
technique could be used to host Linux on a 6502 and run gcc on it,
though you might not consider that "native". And "run" might be a bit
of a misnomer - the AVR was a lot faster than a 6502, and it took 6
hours to boot to login.

<https://dmitry.gr/?r=05.Projects&proj=07.%20Linux%20on%208bit>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Michael S on Mon Sep 16 13:02:46 2024

Michael S wrote:

On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:

David Brown wrote:

On 16/09/2024 15:04, Michael S wrote:

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient
wrapping on the rare occasions when you need it.

But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a
debug/test build and undefined behaviour in release mode is fine -
defining the behaviour of overflow in release mode (other than
possibly to the same run-time checking) is wrong.

In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stay in
the production code, AssertDbg are only in the debug builds.

Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks
enabled.

But debug, optimize, and checking are separate controls.

In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1
instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.

And on those compilers checks can be controlled with quite fine
resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.

This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.

If the ability to control compiler checks was standard on DEC Ada then it
made DEC Ada non-standard.

No, pragma SUPPRESS (check_identifier [, ON => name]);
is defined by the Ada85 standard, with 9 different kinds of checks that
can be suppressed on named type, object, routine, task, etc within a scope.
But support for pragmas and what they do is defined as implementation dependent, and may also be extended.

pragma suppress (INDEX_CHECK, myArray);

if supported, eliminates bounds check on just that array.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Mon Sep 16 18:44:47 2024

On 16/09/2024 16:33, Michael S wrote:

On Mon, 16 Sep 2024 16:09:38 +0200
David Brown <[email protected]> wrote:

On 16/09/2024 15:04, Michael S wrote:

On Mon, 16 Sep 2024 14:48:50 +0200

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to overflow,
as long as there is some other way to get efficient wrapping on the
rare occasions when you need it.

Rust has it in form of builtin functions wrapping_*()

Okay.

But I am completely against the idea that you have different defined
semantics for different builds. Run-time errors in a debug/test
build and undefined behaviour in release mode is fine - defining the
behaviour of overflow in release mode (other than possibly to the
same run-time checking) is wrong.

On the one hand, Rust manual says that integer overflow in release mode wraps. On the other hand, it says that "Relying on integer overflow’s wrapping behavior is considered an error."
It does not sound particularly consistent and rather close to worst of
both worlds.

Yes. If it is not behaviour you can rely on (and if it is a run-time
error in debug mode, you certainly can't rely on it!) then the compiler
should be able to optimise ignoring the possibility of it happening,
when checks are not enabled.

However on more important issue of out-of-bound array access Rust is consistent,

I think it is difficult to determine the relative importance of
out-of-bounds array access and contradictory documentation about basic arithmetic!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to David Brown on Mon Sep 16 17:51:44 2024

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even beneficial from the hardware viewpoint, but really inconvenient on the software side (whether you use bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Mon Sep 16 17:57:49 2024

On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:

On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:

It's not less efficient. usize in Rust is approximately the same as
size_t in C. With one exception that usize overflow panics under debug
build.

One can and should argue that::

*p++;

should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to Michael S on Mon Sep 16 21:06:48 2024

On 2024-09-16 18:58, Michael S wrote:

On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:

David Brown wrote:

On 16/09/2024 15:04, Michael S wrote:

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient
wrapping on the rare occasions when you need it.

But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a
debug/test build and undefined behaviour in release mode is fine -
defining the behaviour of overflow in release mode (other than
possibly to the same run-time checking) is wrong.

In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stay in
the production code, AssertDbg are only in the debug builds.

Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks
enabled.

But debug, optimize, and checking are separate controls.

In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1
instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.

And on those compilers checks can be controlled with quite fine
resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.

This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.

If the ability to control compiler checks was standard on DEC Ada then it
made DEC Ada non-standard.

No, it means that DEC Ada could be used as a standard-conforming Ada
compiler or as a non-conforming compiler, to a user-chosen extent.

The recommended approach today (for applications where it matters) is to
use static analysis of the Ada code (e.g. SPARK or other tools) to prove
that run-time errors cannot happen, which then makes it possible to omit
the corresponding run-time checks while staying compliant.

I don't know if Rust code can be analysed as easily and completely as
Ada code can. But Ada compilers usually allow fine-grained control over
which checks are applied where, not just a single choice between "debug"
and "production" builds.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to All on Mon Sep 16 20:08:42 2024

On 16/09/2024 19:57, MitchAlsup1 wrote:

On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:

On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:

It's not less efficient. usize in Rust is approximately the same as
size_t in C. With one exception that usize overflow panics under debug
build.

One can and should argue that::

*p++;

should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.

That is outside the scope of C, which has no concept of address space boundaries, or even an OS (other than as something that makes the
standard library functions work).

Of course it is perfectly fine if, on any given implementation, trying
to access through an invalid pointer (including beyond the end of an
array) results in some kind of panic, crash, OS exception, or other
error. Those are all valid for UB. But it is not possible or practical
to specify or require such action from a language. At best, a language
could say that some kind of run-time error handling must be supported
and that it is triggered by certain kinds of out of bounds accesses
(defined by the language, not by address space boundaries). Even then,
you are not going to be able to detect all invalid pointer uses while maintaining low-level and efficient direct pointer usage.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to All on Mon Sep 16 20:11:20 2024

On 16/09/2024 19:51, MitchAlsup1 wrote:

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..

The folks designing those register setups had a choice, and made a bad
choice from the viewpoint of software (whether it be C, assembly, or any
other language).

It's conceivable that it was the right choice on balance, considering
many factors. And it's certainly more believable that it was an
appropriate choice when sizes were smaller. It is less believable that
there is an overwhelming need to cross a 64-bit boundary.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to [email protected] on Mon Sep 16 18:22:40 2024

[email protected] (MitchAlsup1) writes:

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,

I'm not aware of any PCIe device where a field straddles the
boundary between two 64-bit registers. There are many devices
that split a 64-bit address across two 32-bit registers; including
the BAR registers in the configuration space, DMA addresses, etc.

Our CSR tool explicitly forbids field definitions that cross 64-bit
boundaries. If necessary, the logic designer will instead define
two smaller fields that software is required to combine explicitly.
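
A sketch of what "combine explicitly" looks like on the software side, in C. The register contents and field widths in any real device come from its datasheet; the function names and the 12-bit split used in the usage note are purely hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit address split across two 32-bit registers. */
uint64_t combine_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* A logical field defined as two smaller fields: the low `low_width`
 * bits come from one register field, the rest from another. */
uint64_t combine_split_field(uint64_t low_part, unsigned low_width,
                             uint64_t high_part)
{
    assert(low_width >= 1 && low_width < 64);
    uint64_t low_mask = (UINT64_C(1) << low_width) - 1;
    return (high_part << low_width) | (low_part & low_mask);
}
```

For example, combine_split_field(0xCDE, 12, 0xAB) rebuilds the value 0xABCDE from a 12-bit low field and an 8-bit high field, with neither field crossing a register boundary.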

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dombo@21:1/5 to David Brown on Mon Sep 16 22:15:35 2024

On 16-09-2024 13:12, David Brown wrote:

On 16/09/2024 02:00, BGB wrote:

On 9/15/2024 2:09 PM, David Brown wrote:
This is mostly for the crowd still messing around with a few older
systems:
   Commodore 64/128
   Apple II / II/C / II/E
   Apple IIGS
   NES and SNES
   ...

It is not a "crowd" - it's a small group of oddballs and enthusiasts. I fully support them, and playing with these things is a great hobby. I
would maybe be doing that too, if I had twice as many hours in the week.
But talking about "popular compilers like gcc and CC65" is like
talking about "popular sports like football and Inuit ear pulling
contests".

Also, some newer projects, like the "Commander X16" are also using
CC65 (it was based around a 65C816 being used in a 6502 compatibility
mode).

Where, AFAIK, GCC proper has little interest in these targets.

The GCC community would be quite happy to support such targets, but
someone would need to make the port. And the architecture of the gcc compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
getting good results for a cpu like the 6502 from gcc would be an extraordinary level of effort.

I wouldn't be surprised if someone had a go at creating a 6502 back-end
for GCC. There is a 6502 back-end for LLVM (https://llvm-mos.org), it
appears that someone put a serious amount of effort into this. I've
played a little with it (using Compiler Explorer site -
https://godbolt.org/). Considering the limitations of the 6502 it seemed
to be able to produce relatively decent code from C++ code when
optimizations are enabled. Like AVR-GCC you do need to keep in mind that
the target is an 8-bitter and its limitations to get somewhat reasonable
code out of it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bill Findlay@21:1/5 to Niklas Holsti on Mon Sep 16 22:40:33 2024

On 16 Sep 2024, Niklas Holsti wrote
(in article <[email protected]>):
...

The recommended approach today (for applications where it matters) is to
use static analysis of the Ada code (e.g. SPARK or other tools) to prove
that run-time errors cannot happen, which then makes it possible to omit
the corresponding run-time checks while staying compliant.

I don't know if Rust code can be analysed as easily and completely as
Ada code can. But Ada compilers usually allow fine-grained control over
which checks are applied where, not just a single choice between "debug"
and "production" builds.

I find, without using SPARK or any analysis (other than that done
by the compiler) that going from all Ada language-defined checks
ON to all OFF gains < 5% in speed.

So all checks are left ON in "production" builds.

--
Bill Findlay

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to David Brown on Mon Sep 16 20:15:59 2024

David Brown <[email protected]> schrieb:

On 16/09/2024 09:17, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).

It has happened, see the illegal (but sometimes useful)
6502 instructions, or the recent RISC-V implementation snafu
(GhostWrite).

I have seen plenty of undefined behaviour in ISA's over the years. (A
very common case is that instruction encodings that are not specified
are left as UB so that later extensions to the ISA can use them.)

A much better idea is to raise an exception, that way you can
be sure that nobody uses it for nefarious purposes.

I was
just thinking of the reactions you'd get if you made an ISA where
attempting to overflow signed integer arithmetic was UB at the hardware level, so that you could get faster and simpler instructions.

Hard to see how this would be possible... but I realize this
is a hypothetical example.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Bill Findlay on Mon Sep 16 20:00:34 2024

Bill Findlay wrote:

On 16 Sep 2024, Niklas Holsti wrote
(in article <[email protected]>):
....

The recommended approach today (for applications where it matters) is to
use static analysis of the Ada code (e.g. SPARK or other tools) to prove
that run-time errors cannot happen, which then makes it possible to omit
the corresponding run-time checks while staying compliant.

I don't know if Rust code can be analysed as easily and completely as
Ada code can. But Ada compilers usually allow fine-grained control over
which checks are applied where, not just a single choice between "debug"
and "production" builds.

I find, without using SPARK or any analysis (other than that done
by the compiler) that going from all Ada language-defined checks
ON to all OFF gains < 5% in speed.

So all checks are left ON in "production" builds.

I found the same 5% performance cost in my tests with DEC Ada85.
Most code was pretty optimal too.

The one case where I found DEC's compiler made a complete pig's breakfast
of the generated code was scanning a character string backwards:

function TrimBlanks (str : in string) return Natural is
   n : natural;  -- unused: hidden by the loop parameter n below
begin
   for n in reverse str'range loop
      if str(n) /= ' ' then
         return n;
      end if;
   end loop;
   return 0;
end;

Godbolt x86-64 gnat Ada 14.2 -O3 gives

# Compilation provided by Compiler Explorer at https://godbolt.org/
.LC0:
.ascii "example.adb"
.zero 1
_ada_trimblanks:
mov ecx, DWORD PTR [rsi] # _1, str$P_BOUNDS_12->LB0
movsx rax, DWORD PTR [rsi+4] #, str$P_BOUNDS_12->UB0
cmp ecx, eax # _1, _3
jg .L5 #,
movsx rdx, ecx # _2, _1
add rax, 1 # I.0_8,
sub rdi, rdx # _22, _2
jmp .L4 #
.L3:
cmp rdx, rax # _2, I.0_8
je .L5 #,
.L4:
sub rax, 1 # I.0_8,
cmp BYTE PTR [rdi+rax], 32 # MEM <character>
je .L3 #,
mov edx, eax # <retval>, I.0_8
test ecx, eax # _1, I.0_8
jns .L1 #,
push rax #
mov esi, 6 #,
mov edi, OFFSET FLAT:.LC0 #,
call __gnat_rcheck_CE_Range_Check #
.L5:
xor edx, edx # <retval>
.L1:
mov eax, edx #, <retval>
ret

Gnat doesn't realize that the subtype of string's index is Positive,
range 1..Integer'last and therefore inside the range of return
subtype Natural, range 0..Integer'last, and therefore the "return n"
cannot fail, and this code that tests n in that range and throws
an exception is unnecessary:

test ecx, eax # _1, I.0_8
jns .L1 #,
push rax #
mov esi, 6 #,
mov edi, OFFSET FLAT:.LC0 #,
call __gnat_rcheck_CE_Range_Check #

Also it didn't need to use edx at all.

The code it should have generated is

_ada_trimblanks:
movsx rcx, DWORD PTR [rsi] # _1, str$P_BOUNDS_12->LB0
movsx rax, DWORD PTR [rsi+4] #, str$P_BOUNDS_12->UB0
cmp rcx, rax # _1, _3
jg .L5 # null range?
sub rdi, rcx # bias base so [rdi+rax] addresses str(rax)
.L4:
cmp BYTE PTR [rdi+rax], 32 # MEM <character>
jne .L1 # non-blank found: return its index
dec rax
cmp rcx, rax
jle .L4 # if rax >= rcx loop
.L5:
xor eax, eax # <retval>
.L1:
ret

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Niklas Holsti on Mon Sep 16 20:04:15 2024

Niklas Holsti wrote:

On 2024-09-16 18:58, Michael S wrote:

On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:

David Brown wrote:

On 16/09/2024 15:04, Michael S wrote:

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient
wrapping on the rare occasions when you need it.

But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a
debug/test build and undefined behaviour in release mode is fine -
defining the behaviour of overflow in release mode (other than
possibly to the same run-time checking) is wrong.

In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert stay in
the production code, AssertDbg are only in the debug builds.

Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks
enabled.

But debug, optimize, and checking are separate controls.

In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1
instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.
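
As a rough illustration of the point above (not DEC's implementation): a checked add can be sketched in C with `__builtin_add_overflow`, which is a GCC/Clang extension rather than standard C. The helper names `add_checked` and `increment_or_trap` are made up for this example. Once the check has passed, the compiler knows the result is in range, much as it would after a trapping add instruction.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: adds a and b and reports overflow instead of
 * wrapping. Relies on the GCC/Clang builtin __builtin_add_overflow. */
bool add_checked(int a, int b, int *sum)
{
    assert(sum != NULL);
    /* The builtin returns true when the exact result does not fit. */
    return !__builtin_add_overflow(a, b, sum);
}

/* Increment that treats overflow as a fatal check failure, similar in
 * spirit to a trapping add. The check stays in production builds. */
int increment_or_trap(int x)
{
    int r;
    if (!add_checked(x, 1, &r))
        abort();
    return r;
}
```

Calling `increment_or_trap(INT_MAX)` aborts the program instead of wrapping or invoking undefined behaviour.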

And on those compilers checks can be controlled with quite fine
resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.

This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.

If the ability to control compiler checks was standard on DEC Ada then it
made DEC Ada non-standard.

No, it means that DEC Ada could be used as a standard-conforming Ada
compiler or as a non-conforming compiler, to a user-chosen extent.

The recommended approach today (for applications where it matters) is to
use static analysis of the Ada code (e.g. SPARK or other tools) to prove
that run-time errors cannot happen, which then makes it possible to omit
the corresponding run-time checks while staying compliant.

DEC Ada did that too. It seems to me this optimization is a relatively
straightforward "propagation of constants" type of problem.
Most subtypes have a constant range

subtype Sub1T is integer range 1..100;

For ones with dynamic range

subtype Sub2T is integer range x..y;

then the worst-case range can be inferred from the range attributes
of the subtypes of x and y

Sub2T'first = min (x'first, y'first)
Sub2T'last = max (x'last, y'last)

And there is a check for a null range, where upper bound is less than
lower bound.

Then all these constant attribute values propagate onto the variables
declared with that subtype.

That should allow most checks to evaporate as compares of constant values.
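
A minimal sketch of that idea in C, assuming ranges are simple closed intervals. The type `struct range` and the functions `worst_case` and `check_needed` are invented for this illustration and do not come from any real compiler.

```c
#include <assert.h>
#include <stdbool.h>

/* Closed interval [first, last]; first > last denotes a null range. */
struct range { long first, last; };

long min_l(long a, long b) { return a < b ? a : b; }
long max_l(long a, long b) { return a > b ? a : b; }

/* Worst-case bounds of "range x..y" when only the subtype ranges of
 * x and y are known, using the min/max rule given above. */
struct range worst_case(struct range x, struct range y)
{
    struct range r = { min_l(x.first, y.first), max_l(x.last, y.last) };
    return r;
}

/* A run-time range check on converting a value of subtype src to
 * subtype dst is needed only if src is not wholly inside dst.
 * The target range is assumed to be non-null in this sketch. */
bool check_needed(struct range src, struct range dst)
{
    assert(dst.first <= dst.last);
    if (src.first > src.last)
        return false;               /* null range: no values to check */
    return src.first < dst.first || src.last > dst.last;
}
```

With Positive as 1..Integer'Last and Natural as 0..Integer'Last, `check_needed(positive, natural)` is false, which is the case where the check compares constants only and can be dropped at compile time.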

I don't know if Rust code can be analysed as easily and completely as
Ada code can. But Ada compilers usually allow fine-grained control over
which checks are applied where, not just a single choice between "debug"
and "production" builds.


From Waldek Hebisch@21:1/5 to David Brown on Tue Sep 17 01:36:46 2024

David Brown <[email protected]> wrote:

On 16/09/2024 19:51, MitchAlsup1 wrote:

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than misaligned
DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..

The folks designing those register setups had a choice, and made a bad
choice from the viewpoint of software (whether it be C, assembly, or any other language).

It's conceivable that it was the right choice on balance, considering
many factors. And it's certainly more believable that it was an
appropriate choice when sizes were smaller. It is less believable that
there is an overwhelming need to cross a 64-bit boundary.

Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data structures
lead to better utilization of caches and busses, and the effect due to
this was larger than the cost of extra instructions. So the need to cross
a 64-bit boundary may be rare, but there will be cases when it is the best
choice.

--
Waldek Hebisch


From MitchAlsup1@21:1/5 to EricP on Tue Sep 17 01:35:17 2024

On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:

Bill Findlay wrote:
I found the same 5% performance cost in my tests with DEC Ada85.
Most code was pretty optimal too.

The one thing I found DEC's compiler made a complete pigs breakfast
of the generated code was scanning a character string backwards:

Bacon, sausage, and ham.

Sounds yummy. Code not so much.


From Tim Rentsch@21:1/5 to Thomas Koenig on Mon Sep 16 19:51:26 2024

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

If the loop variable
represents degrees C or F, or some other naturally signed measure it
should be signed (or maybe floating point).

The first one is a bad idea because temperature is a continuous
physical quantity.

That doesn't mean that a quantity representing degrees C or
degrees F in a computer program always has to be a continuous
measure. Sometimes a signed integer for degrees is what's
needed. It depends on circumstances.

The second has bad implications for constructs like

DO R = 0.0, 1.0, 0.1

where it will depend on details of floating point arithmetic whether the
number of loop trips is 10 or 11.

You can argue that people can write

DO R=0.0, 1.05, 0.1

but this construct was error-prone enough that it was deleted
from the Fortran standards.

What kind of loop it
is, whether ascending or descending, or what the increment is, etc,
is secondary; a more important factor is what sort of value is
being represented, and in almost all cases that is what should
determine the type used.

Not for floating point numbers. For that, you should simply do

DO I=0,10
R = I * 0.1

or

R = 0.0
DO I=0,10
...
R = R + 0.1
END DO

whichever rounding error you prefer.

In cases like these I mean R as the loop variable. The extra
stuff is incidental scaffolding there only to make sure R
takes on all the appropriate values.
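
For comparison, here is roughly what the two integer-driven variants look like in C. This is only a sketch: the function names are invented, and it uses double where default Fortran REAL would be single precision. In both versions the trip count is fixed by the integer index, and only the rounding of R differs.

```c
#include <assert.h>

/* R computed as i * step: one rounding per value, no accumulation.
 * Writes n + 1 values to r[] and returns the number of loop trips. */
int fill_by_multiply(double *r, int n, double step)
{
    int trips = 0;
    assert(n >= 0);
    for (int i = 0; i <= n; i++) {
        r[i] = i * step;
        trips++;
    }
    return trips;
}

/* R accumulated by repeated addition: rounding errors can add up,
 * but the trip count still comes from the integer index. */
int fill_by_accumulate(double *r, int n, double step)
{
    int trips = 0;
    double acc = 0.0;
    assert(n >= 0);
    for (int i = 0; i <= n; i++) {
        r[i] = acc;
        acc += step;
        trips++;
    }
    return trips;
}
```

Either way the loop runs exactly n + 1 times, so the "10 or 11 trips" question never arises; the last values of the two arrays may differ in the final bits.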


From Tim Rentsch@21:1/5 to [email protected] on Mon Sep 16 19:34:44 2024

[email protected] (MitchAlsup1) writes:

On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

I didn't see any content from you in this last posting
of yours.


From Tim Rentsch@21:1/5 to Michael S on Mon Sep 16 19:33:22 2024

Michael S <[email protected]> writes:

On Sun, 15 Sep 2024 18:47:06 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sun, 15 Sep 2024 20:13:44 +0200
David Brown <[email protected]> wrote:

struct Bar {
char x[8];
int y;
} bar;

int foo(int i) {
bar.y = 1234;
bar.x[i] = 42;
return bar.y;
}

It generates:

foo:
movslq %edi,%rdi
movl $1234, %eax
movl $1234, bar+8(%rip)
movb $42, bar(%rdi)
ret

That is, y is /not/ reloaded after bar.x[i] is set.
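
The `bar+8` in the listing is simply the offset of `y` inside `struct Bar` on the usual ABIs. Indexing `bar.x` outside 0..7 is undefined behaviour in C, which is what permits a compiler to return the constant without reloading. A small sketch of a bounds-checked variant follows; `foo_checked` and `bar_y_offset` are names invented for this example, and the -1 error return is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

struct Bar {
    char x[8];
    int y;
} bar;

/* y must start at or after the 8 bytes occupied by x. */
static_assert(offsetof(struct Bar, y) >= 8, "y lies after x");

/* Variant of foo() that rejects out-of-range indices, so the store to
 * bar.x[i] can never touch bar.y. */
int foo_checked(int i)
{
    if (i < 0 || i >= (int)sizeof bar.x)
        return -1;
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
}

/* Byte offset of y within struct Bar. */
size_t bar_y_offset(void)
{
    return offsetof(struct Bar, y);
}
```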

No other compiler on godbolt is doing it, except possibly gcc
clones. Not even clang, whose former leader wrote "Nasal Manifest".

Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
both show bar.y not being overwritten (optimization levels -O1 or -O2)
when foo() is called.

I didn't mean to say that gcc3 is the only gcc version that returns
a non-overwritten value. I meant to say that all gcc versions are in one
camp and the rest of the compilers represented on godbolt are in the
other camp.

Okay.

Please note that I didn't mean to dispute your statement,
which is about compilers on godbolt. I meant only to give
an isolated data point that might be related.


From Terje Mathisen@21:1/5 to David Brown on Tue Sep 17 08:07:44 2024

David Brown wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

This becomes much simpler in Rust where usize is the only legal index
type:

Yeah, you have to actually write it as

  y = p[x];
  x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is an improvement for the programmer. (In general, I dislike trying to do too much in a single expression or statement, but some C constructs are
common enough that I am happy with them. It would be hard to formulate concrete rules here.)

And the resulting object code is less efficient than you get with signed
int and "y = p[x++];" (or "y = p[x]; x++;") in C.

Is that true? I'll have to check godbolt myself if that is really the case!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Sep 17 08:43:22 2024

Stephen Fuld wrote:

On 9/16/2024 4:12 AM, David Brown wrote:

snip

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.

I resemble that remark! :-)

Ditto, probably...

I'm 67 (but not yet retired), I taught myself the Trachtenberg
algorithms for mental arithmetic when I was around 12 (was reminded of
this last night when I watched Gifted on netflix), I mail ordered what
was probably the first Rubik's cube to get to Norway. (And developed
three different algorithms to solve it, but I only remember the last one
now which I had optimized for simplicity, not speed.)

Those, along with high school chess and orienteering mapping should
count as nerdy pursuits, right?

Winning the County Yo-Yo championship would be less so?

Regards to all the regulars here, I do consider many of you friends that
I just haven't met yet.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to All on Tue Sep 17 08:46:07 2024

MitchAlsup1 wrote:

On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:

On Mon, 16 Sep 2024 14:48:50 +0200
David Brown <[email protected]> wrote:

It's not less efficient. usize in Rust is approximately the same as
size_t in C. With one exception that usize overflow panics under debug
build.

One can and should argue that::

#p++;

should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.

I'm pretty sure you meant *p++; since the hash mark (#) is a comment
separator in many languages. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to EricP on Tue Sep 17 08:20:15 2024

EricP wrote:

These double-width bit-field straddle operations show up at 32-bits.
Various FP64 formats (DEC's middle-endian FP being the worst example),
Intel page table entries and segment/gate descriptors, come to mind.

Lots of them in 32-bit code!

It's just going to take a while for double-width things to show up
at the 64-bit level. But if FP128 becomes a reality...

If???

Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.

Nothing likely about it: LZ4 is pretty much the only compression
algorithm/lossless codec that never straddles, all the rest tend to
treat the source data as a single bitstream of arbitrary length, except
for some built-in chunking mechanism which simplifies faster scanning.

The core of the algorithm always starts with knowing the endianness,
then picking up 32 or 64-bit chunks of input data (byte-flipping if
needed) and then extracting the next N bits either from the top or bottom
of the buffer register.

Almost by definition, this is not code that a compiler is set up to help
you get correct.
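
A simplified sketch of such a bit reader in C, taking bits from the top of a 64-bit buffer register. The names (`struct bitreader`, `br_init`, `br_refill`, `br_get`) are invented for this example. It refills a byte at a time, which sidesteps the endianness question, whereas a tuned codec would load whole 32 or 64-bit chunks and byte-flip as needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader. */
struct bitreader {
    const uint8_t *src;   /* next unread input byte */
    const uint8_t *end;   /* one past the last input byte */
    uint64_t buf;         /* unread bits, left-aligned */
    unsigned count;       /* number of valid bits in buf */
};

void br_init(struct bitreader *br, const uint8_t *data, size_t len)
{
    br->src = data;
    br->end = data + len;
    br->buf = 0;
    br->count = 0;
}

/* Top up the buffer register while a whole byte still fits. */
void br_refill(struct bitreader *br)
{
    while (br->count <= 56 && br->src < br->end) {
        br->buf |= (uint64_t)*br->src++ << (56 - br->count);
        br->count += 8;
    }
}

/* Extract the next n bits (1..32); exhausted input reads as zero bits. */
uint32_t br_get(struct bitreader *br, unsigned n)
{
    assert(n >= 1 && n <= 32);
    if (br->count < n)
        br_refill(br);
    uint32_t v = (uint32_t)(br->buf >> (64 - n));
    br->buf <<= n;
    br->count = br->count > n ? br->count - n : 0;
    return v;
}
```

Reading 4, 8 and 4 bits from the bytes 0xA5 0xC3 yields 0xA, 0x5C and 0x3; the 8-bit field in the middle straddles the byte boundary, which is the small-scale version of the word straddle discussed here.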

I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an extra source register and two dest registers, which has a number of consequences. But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.

Very nice!

This means that you can do integer IMAC(), right?

(hi, lo) = imac(a, b, c); // == a*b+c

The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.
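
The no-overflow claim is easy to check: with all four 64-bit inputs at their maximum, (2^64 - 1)^2 + 2*(2^64 - 1) = 2^128 - 2^65 + 1 + 2^65 - 2 = 2^128 - 1, which is exactly the largest double-width value. A sketch in C, assuming the `unsigned __int128` extension of GCC/Clang; `struct u128` and `imaa` are names made up for this example, not any real ISA or library interface.

```c
#include <assert.h>
#include <stdint.h>

struct u128 { uint64_t hi, lo; };

/* a*b + c + d with a double-width result.
 * unsigned __int128 is a GCC/Clang extension, not standard C. */
struct u128 imaa(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    unsigned __int128 t = (unsigned __int128)a * b;
    t += c;
    assert(t >= c);               /* no wrap-around of the 128-bit sum */
    t += d;
    assert(t >= d);               /* still no wrap-around */
    struct u128 r = { (uint64_t)(t >> 64), (uint64_t)t };
    return r;
}
```

In a multi-precision multiply loop, c would typically be the incoming carry word and d the existing digit of the result being accumulated into.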

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Michael S@21:1/5 to [email protected] on Tue Sep 17 11:12:16 2024

On Tue, 17 Sep 2024 01:35:17 +0000
[email protected] (MitchAlsup1) wrote:

On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:

Bill Findlay wrote:
I found the same 5% performance cost in my tests with DEC Ada85.
Most code was pretty optimal too.

The one thing I found DEC's compiler made a complete pigs breakfast
of the generated code was scanning a character string backwards:

Bacon, sausage, and ham.

Sounds yummy. Code not so much.

It seems that you and EricP give different (not to say an opposite)
meaning to the phrase "complete pigs breakfast".


From Michael S@21:1/5 to Terje Mathisen on Tue Sep 17 11:21:14 2024

On Tue, 17 Sep 2024 08:20:15 +0200
Terje Mathisen <[email protected]> wrote:

EricP wrote:

These double-width bit-field straddle operations show up at 32-bits. Various FP64 formats (DEC's middle-endian FP being the worst
example), Intel page table entries and segment/gate descriptors,
come to mind.

Lots of them in 32-bit code!

Lots of what in 32-bit code?


From Tim Rentsch@21:1/5 to Niklas Holsti on Tue Sep 17 01:38:03 2024

Niklas Holsti <[email protected]d> writes:

On 2024-09-16 10:25, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

[attribution lost]

Bringing it back to "architecture" Like Anton Ertl has said, LP64
for C/C++ is a mistake. It should always have been ILP64, and
this nonsense would go away. Any new architecture should make C
ILP64 (looking at you RISC-V, missing yet another opportunity to
not make the same mistakes as everyone else).

I believe this view is shortsighted. The big mistake is
developers hardcoding types everywhere - especially int, but
also long, and their unsigned variants. It's almost never a
good idea to hardcode a specific width (eg, uint32_t) in a type
name used for parameters or local variables, but that is by far
a very common practice.

I agree. This issue guided the design of the scalar type system
in Ada.

C programmers can use typedef to get part way there, but not all
the way because typedefs are still weakly typed.

I don't agree with this characterization. There are different kinds
of concerns here, but they don't form a linear progression.
Granted, C has a limited type system, but typedef is not part of
the type system, and it's important not to confuse the two. My
comment is only about what names of types are used, not about the
nature of type systems. As it happens I don't think the Ada type
system is where type systems should be heading, but that is a
separate discussion from my earlier comment.


From David Brown@21:1/5 to Thomas Koenig on Tue Sep 17 11:15:36 2024

On 16/09/2024 22:15, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

On 16/09/2024 09:17, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

On 14/09/2024 21:26, Thomas Koenig wrote:

MitchAlsup1 <[email protected]> schrieb:

In many cases int is slower now than long -- which violates the notion
of int from K&R days.

That's a designer's choice, I think. It is possible to add 32-bit
instructions which should be as fast (or possibly faster) than
64-bit instructions, as AMD64 and ARM have shown.

For some kinds of instructions, that's true - for others, it's not so
easy without either making rather complicated instructions or having
assembly instructions with undefined behaviour (imagine the terror that
would bring to some people!).

It has happened, see the illegal (but sometimes useful)
6502 instructions, or the recent RISC-V implementation snafu
(GhostWrite).

I have seen plenty of undefined behaviour in ISA's over the years. (A
very common case is that instruction encodings that are not specified
are left as UB so that later extensions to the ISA can use them.)

A much better idea is to raise an exception, that way you can
be sure that nobody uses it for nefarious purposes.

Sure. But not all processors are big enough to support such exceptions
- many of those I have used are really small. (An "unimplemented
instruction" exception also lets you use it for non-nefarious purposes,
such as supporting binary compatibility with other members of the
processor family, or as convenient user extensions.)

I was
just thinking of the reactions you'd get if you made an ISA where
attempting to overflow signed integer arithmetic was UB at the hardware
level, so that you could get faster and simpler instructions.

Hard to see how this would be possible... but I realize this
is a hypothetical example.

Yes.


From David Brown@21:1/5 to Terje Mathisen on Tue Sep 17 11:21:38 2024

On 17/09/2024 08:07, Terje Mathisen wrote:

David Brown wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

This becomes much simpler in Rust where usize is the only legal index
type:

Yeah, you have to actually write it as

  y = p[x];
  x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)

And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

Is that true? I'll have to check godbolt myself if that is really the case!

It is not true - or at least, it shouldn't be true. I had thought the
Rust code was using the equivalent of a C "unsigned int" here, which
would require extra code for wrapping semantics. But that was just my misunderstanding of Rust and its types - with a 64-bit unsigned type, it
should give the same results as C. However, there's no harm in checking
it and letting us know.

(I've previously shown how "y = p[x++];" in C is less efficient on
x86-64 if x is "unsigned int", compared to "int" or 64-bit types for x.)
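
To make the comparison concrete, these are the three index types under discussion, written as small C loops (function names invented for this sketch). The comments describe what a compiler is allowed to do, not what any particular one is guaranteed to emit; checking the generated code on godbolt is still the only way to be sure.

```c
#include <assert.h>
#include <stddef.h>

/* Signed 32-bit index: overflow of x is undefined behaviour, so the
 * compiler may widen x to 64 bits once and keep it there. */
long sum_int(const int *p, int n)
{
    long s = 0;
    int x = 0;
    assert(n >= 0);
    while (x < n)
        s += p[x++];
    return s;
}

/* Unsigned 32-bit index: x is defined to wrap modulo 2^32, so the
 * compiler may need extra zero-extension unless it can prove that
 * no wrap happens. */
long sum_uint(const int *p, unsigned n)
{
    long s = 0;
    unsigned x = 0;
    while (x < n)
        s += p[x++];
    return s;
}

/* Pointer-sized unsigned index, the closest C analogue of Rust's usize. */
long sum_size(const int *p, size_t n)
{
    long s = 0;
    size_t x = 0;
    while (x < n)
        s += p[x++];
    return s;
}
```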


From David Brown@21:1/5 to Waldek Hebisch on Tue Sep 17 11:29:15 2024

On 17/09/2024 03:36, Waldek Hebisch wrote:

David Brown <[email protected]> wrote:

On 16/09/2024 19:51, MitchAlsup1 wrote:

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than misaligned
DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..

The folks designing those register setups had a choice, and made a bad
choice from the viewpoint of software (whether it be C, assembly, or any
other language).

It's conceivable that it was the right choice on balance, considering
many factors. And it's certainly more believable that it was an
appropriate choice when sizes were smaller. It is less believable that
there is an overwhelming need to cross a 64-bit boundary.

Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data structures
lead to better utilization of caches and busses, and the effect due to
this was larger than the cost of extra instructions. So the need to cross
a 64-bit boundary may be rare, but there will be cases when it is the best
choice.

It is possible, but I think it is rare.

Perhaps my perception is biased from working with microcontrollers,
where you often don't have caches and instruction speeds are not nearly
as much faster than ram access speeds as you see in modern x86 systems.

The other thing I don't like about split bit-fields is that there is
typically no way to do atomic updates, which can mean you need extra
care to keep things correct.


From David Brown@21:1/5 to BGB on Tue Sep 17 11:39:43 2024

On 16/09/2024 21:46, BGB wrote:

On 9/16/2024 4:27 AM, David Brown wrote:

On 16/09/2024 09:18, BGB wrote:

On 9/15/2024 12:46 PM, Anton Ertl wrote:

Michael S <[email protected]> writes:

Padding is another thing that should be Implementation Defined.

It is. It's defined in the ABI, so when the compiler documents to
follow some ABI, you automatically get that ABI's structure layout.
And if a compiler does not follow an ABI, it is practically useless.

Though, there also isn't a whole lot of freedom of choice here
regarding layout.

If member ordering or padding differs from typical expectations, then
any code which serializes structures to files is liable to break, and
this practice isn't particularly uncommon.

Your expectations here should match up with the ABI - otherwise things
are going to go wrong pretty quickly. But I think most ABIs will have
fairly sensible choices for padding and alignments.

Yeah. It is "almost fixed", as there are a lot of programs that are
liable to break if these assumptions differ.

Say, typical pattern:
Members are organized in the same order they appear in the source code;

That is required by the C standards. (A compiler can re-arrange the
order if that does not affect any observable behaviour. gcc used to
have an optimisation option that allowed it to re-arrange struct
ordering when it was safe to do so, but it was removed as it was
rarely used and a serious PITA to support with LTO.)

OK.

If the current position is not a multiple of the member's alignment,
it is padded to an offset that is a multiple of the member's alignment;

That is a requirement in the C standards.

The only implementation-defined option is whether or not there is /
extra/ padding - and I have never seen that in practice. (And there
are more implementation-defined options for bit-fields.)

Extra padding seems like it wouldn't have much benefit.

No, generally not - which is why it would be a really strange
implementation if it had extra padding. It's possible that extra
padding at the end of a struct could lead to more efficient array access
by aligning to cache line sizes, but I think such things are better left
to the programmer (possibly with the aid of compiler extensions) rather
than attempting to specify them in the ABI.

Albeit, types like _Bool in my implementation are padded to a full byte
(it is treated as an "unsigned char" that is assumed to always hold
either 0 or 1).

That's the usual way to handle them.

For primitive types, the alignment is equal to the size, which is
also a power of 2;

That is the norm, up to the maximum appropriate alignment for the
architecture. A 16-bit cpu has nothing to gain by making 32-bit types
32-bit aligned.

This comes up as an issue in some Windows file formats, where one can't
just naively use a struct with 32-bit fields because some 32-bit members
only have 16-bit alignment.

Ah, the joys of using ancient formats with new systems!

My comment above was in reference to data remaining on the system,
rather than moving off-system.

If I am making a format that is accessible externally - a file format, a network packet, etc., - I generally make sure all types are "naturally"
aligned up to at least 8-byte types, even if the processor's maximum
useful alignment is much smaller.

If needed, the total size of the struct is padded to a multiple of
the largest alignment of the struct members.

That is required by the C standards.
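
The layout rules listed so far can be checked mechanically with `offsetof`, `sizeof` and `_Alignof` (C11). A small sketch; `struct sample` and `is_aligned` are invented for this example, and the exact offsets depend on the ABI, so only the relationships are stated in the comments.

```c
#include <assert.h>
#include <stddef.h>

struct sample {
    char   tag;     /* first member, always at offset 0 */
    int    count;   /* padded up to a multiple of _Alignof(int) */
    char   flag;
    double value;   /* padded up to a multiple of _Alignof(double) */
    char   last;    /* followed by tail padding */
};

/* True when offset is a multiple of align. */
int is_aligned(size_t offset, size_t align)
{
    assert(align != 0);
    return offset % align == 0;
}
```

For example, `is_aligned(sizeof(struct sample), _Alignof(struct sample))` holds on any conforming implementation, because array elements of the struct must each be correctly aligned.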

For C++ classes, it is more chaotic (and more compiler dependent), but:

Not really, no. Apart from a few hidden bits such as pointers to
handle virtual methods and virtual inheritance, the data fields are
ordered, padded and aligned just like in C structs. And these hidden
pointers follow the same rules as any other pointer.

The only other special bit is empty base class optimisation, and
that's pretty simple too.

For simple cases, they may match up, like a POD class may look just like
an equivalent struct, or single-inheritance classes with virtual methods
like a struct with a vtable, etc... But in more complex cases there may
be compiler differences (along with differences in things like name
mangling, etc).

I've never seen or heard of a case where there is anything
unexpected here.

Sure, different C++ implementations or ABIs might have different details
around these hidden pointers and the way they organise their vtables.
But they are still hidden /pointers/, and these are aligned and padded
like any other pointer. Even if the hidden data contained a bunch of
extra bits, flags, etc., to handle complicated inheritance setups, these
would still be padded and aligned like any other structs with bits,
flags, etc.

Though, unlike with structs, programs seem less inclined to rely on the memory layout specifics of class instances.

Of course they shouldn't be relying on such details!


From Terje Mathisen@21:1/5 to Michael S on Tue Sep 17 11:42:47 2024

Michael S wrote:

On Tue, 17 Sep 2024 08:20:15 +0200
Terje Mathisen <[email protected]> wrote:

EricP wrote:

These double-width bit-field straddle operations show up at 32-bits.
Various FP64 formats (DEC's middle-endian FP being the worst
example), Intel page table entries and segment/gate descriptors,
come to mind.

Lots of them in 32-bit code!

Lots of what in 32-bit code?

Pretty much any 64-bit container with non-regular contents, with the
suggested double / fp64 as the classic example?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to David Brown on Tue Sep 17 11:48:12 2024

David Brown wrote:

On 17/09/2024 08:07, Terje Mathisen wrote:

David Brown wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

This becomes much simpler in Rust where usize is the only legal
index type:

Yeah, you have to actually write it as

  y = p[x];
  x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is an
improvement for the programmer. (In general, I dislike trying to do
too much in a single expression or statement, but some C constructs
are common enough that I am happy with them. It would be hard to
formulate concrete rules here.)

And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

Is that true? I'll have to check godbolt myself if that is really the
case!

It is not true - or at least, it shouldn't be true. I had thought the
Rust code was using the equivalent of a C "unsigned int" here, which
would require extra code for wrapping semantics. But that was just my misunderstanding of Rust and its types - with a 64-bit unsigned type, it should give the same results as C. However, there's no harm in checking
it and letting us know.

No need to check this particular point, Rust's usize was obviously
designed to be an unsigned type large enough to index into the entire addressable memory range, so on a 64-bit platform it has to be 64 bits.

(I've previously shown how "y = p[x++];" in C is less efficient on
x86-64 if x is "unsigned int", compared to "int" or 64-bit types for x.)

That's actually surprising to me, I would have guessed any 32-bit index
would be less efficient than a full-width type, but if the idiom is
very, very common in C code, then it makes sense to make it fast.

Doing so would typically require either sign- or zero-extending all
32-bit variables when loaded into a 64-bit register, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Michael S@21:1/5 to Terje Mathisen on Tue Sep 17 12:52:48 2024

On Tue, 17 Sep 2024 11:42:47 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 17 Sep 2024 08:20:15 +0200
Terje Mathisen <[email protected]> wrote:

EricP wrote:

These double-width bit-field straddle operations show up at
32-bits. Various FP64 formats (DEC's middle-endian FP being the
worst example), Intel page table entries and segment/gate
descriptors, come to mind.

Lots of them in 32-bit code!

Lots of what in 32-bit code?

Pretty much any 64-bit container with non-regular contents, with the
suggested double / fp64 as the classic example?

Terje

You mean
struct { int a; double b; } where on 32-bit target we expect that b is
not padded?
And then mantissa of b crosses 64-bit boundary?
But mantissa of b is not accessed as bit field in a typical program.
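
For reference, the kind of code where the straddle does show up is software floating point or format conversion on a 32-bit machine, where a double has to be taken apart using 32-bit pieces. A sketch, assuming `double` is IEEE-754 binary64; `struct fp64_fields` and `fp64_unpack` are names invented here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct fp64_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 11 bits, biased by 1023 */
    uint64_t mantissa;   /* 52 bits: 20 from the high word, 32 from the low */
};

/* Unpack a binary64 value by way of its two 32-bit halves, as 32-bit
 * code would have to. Assumes double is IEEE-754 binary64. */
struct fp64_fields fp64_unpack(double d)
{
    uint64_t bits;
    uint32_t hi, lo;
    struct fp64_fields f;

    assert(sizeof d == sizeof bits);
    memcpy(&bits, &d, sizeof bits);     /* well-defined type punning */
    hi = (uint32_t)(bits >> 32);
    lo = (uint32_t)bits;

    f.sign     = hi >> 31;
    f.exponent = (hi >> 20) & 0x7FF;
    /* The mantissa straddles both 32-bit words. */
    f.mantissa = ((uint64_t)(hi & 0xFFFFF) << 32) | lo;
    return f;
}
```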


From Michael S@21:1/5 to David Brown on Tue Sep 17 13:15:36 2024

On Tue, 17 Sep 2024 11:29:15 +0200
David Brown <[email protected]> wrote:

On 17/09/2024 03:36, Waldek Hebisch wrote:

David Brown <[email protected]> wrote:

On 16/09/2024 19:51, MitchAlsup1 wrote:

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the
rather distant past could do these rather efficiently {360
SRDL,...}

Anyone who designs a data structure with a bit-field that spans
two 64-bit parts of a struct is probably ignorant of C
bit-fields and software in general. It is highly unlikely to be
necessary or even beneficial from the hardware viewpoint, but
really inconvenient on the software side (whether you use
bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..

The folks designing those register setups had a choice, and made a
bad choice from the viewpoint of software (whether it be C,
assembly, or any other language).

It's conceivable that it was the right choice on balance,
considering many factors. And it's certainly more believable that
it was an appropriate choice when sizes were smaller. It is less
believable that there is an overwhelming need to cross a 64-bit
boundary.

Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data
structures lead to better utilization of caches and busses, and
the effect due to this was larger than the cost of extra instructions.
So the need to cross a 64-bit boundary may be rare, but there will be
cases when it is the best choice.

It is possible, but I think it is rare.

Perhaps my perception is biased from working with microcontrollers,
where you often don't have caches and instruction speeds are not
nearly as much faster than ram access speeds as you see in modern x86 systems.

On the other hand, with MCUs it's quite common to be limited by size of
data storage (SRAM), while size of program storage (flash) is bigger
than one will ever want. Plus, quite often, speed is of less concern.
In such [common] situation densely packed [arrays of] structures could
be desirable.

The other thing I don't like about split bit-fields is that there is typically no way to do atomic updates, which can mean you need extra
care to keep things correct.

In the common case, on common ISAs atomic RMW update of bit field is
impossible even when the field does not cross a word boundary.

In case you mean write-only update (i.e. values of adjacent fields are
known in advance and not expected to change), what you say can be
correct or not, depending on availability of unaligned stores and on
what exactly one considers 'atomic'.
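
A minimal C sketch of the read-modify-write point, assuming a hypothetical 5-bit field at bits 3..7 of a 32-bit word (the layout and the names field_set_plain and field_set_atomic are made up for illustration). The plain version is a separate load, modify and store; the second version shows one common workaround, a C11 compare-exchange loop on the whole containing word:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: a 5-bit field at bits 3..7 of a 32-bit word. */
#define FIELD_SHIFT 3u
#define FIELD_MASK  (0x1Fu << FIELD_SHIFT)

/* Plain update: load, modify, store. A write to a neighbouring field
   that lands between the load and the store is silently lost. */
static void field_set_plain(uint32_t *word, uint32_t value)
{
    *word = (*word & ~FIELD_MASK) | ((value << FIELD_SHIFT) & FIELD_MASK);
}

/* Atomic update: retry a compare-exchange on the whole word until no
   other writer has changed it between our load and our store. */
static void field_set_atomic(_Atomic uint32_t *word, uint32_t value)
{
    uint32_t old = atomic_load(word);
    uint32_t new_word;
    do {
        new_word = (old & ~FIELD_MASK) | ((value << FIELD_SHIFT) & FIELD_MASK);
    } while (!atomic_compare_exchange_weak(word, &old, new_word));
}
```

The loop only works when the field sits inside a single word that the target can compare-exchange, which is exactly what a field split across two words does not offer.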

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Terje Mathisen on Tue Sep 17 13:27:16 2024

On Tue, 17 Sep 2024 11:48:12 +0200
Terje Mathisen <[email protected]> wrote:

David Brown wrote:

On 17/09/2024 08:07, Terje Mathisen wrote:

David Brown wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

This becomes much simpler in Rust where usize is the only legal
index type:

Yeah, you have to actually write it as

    y = p[x];
    x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is
an improvement for the programmer. (In general, I dislike
trying to do too much in a single expression or statement, but
some C constructs are common enough that I am happy with them.
It would be hard to formulate concrete rules here.)

And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

Is that true? I'll have to check godbolt myself if that is really
the case!

It is not true - or at least, it shouldn't be true. I had thought
the Rust code was using the equivalent of a C "unsigned int" here,
which would require extra code for wrapping semantics. But that
was just my misunderstanding of Rust and its types - with a 64-bit
unsigned type, it should give the same results as C. However,
there's no harm in checking it and letting us know.

No need to check this particular point, Rust's usize was obviously
designed to be an unsigned type large enough to index into the entire addressable memory range, so on a 64-bit platform it has to be 64
bits.

(I've previously shown how "y = p[x++];" in C is less efficient on
x86-64 if x is "unsigned int", compared to "int" or 64-bit types
for x.)

That's actually surprising to me, I would have guessed any 32-bit
index would be less efficient than a full-width type, but if the
idiom is very, very common in C code, then it makes sense to make it
fast.

Doing so would typically require either sign- or zero-extending all
32-bit variables when loaded into a 64-bit register, right?

Terje

Taken in isolation, on something like x86-64 or aarch64, where result
of 32-bit addition is by default zero-extended, there is no difference
between 32-bit and 64-bit unsigned x.
However when statement shown above is part of the sequence, even short
one, 64-bit x allows compiler optimizations that are impossible with
32-bit.
E.g.
y1 = p[x++]
y2 = p[x++]

On x86-64 with 64-bit x the second load can be implemented as
mov dstreg, [rcx+rdx*4+4]
On aarch64 with 64-bit x both loads can be folded into single 'load
pair' instruction.
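
For anyone who wants to compare the generated code, the two cases can be written as a pair of small C functions. This is only a sketch: the names sum2_wide and sum2_narrow are made up, and size_t is assumed to be 64 bits, as on the usual x86-64 and aarch64 ABIs.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit index (assuming a 64-bit size_t): the compiler is free to
   address p[x] and p[x+1] from one base, e.g. with a +4 displacement. */
static int sum2_wide(const int *p, size_t x)
{
    int y1 = p[x++];
    int y2 = p[x++];
    return y1 + y2;
}

/* 32-bit unsigned index: x++ is defined to wrap modulo 2^32, so the
   second index is (uint32_t)(x + 1), which need not be adjacent. */
static int sum2_narrow(const int *p, uint32_t x)
{
    int y1 = p[x++];
    int y2 = p[x++];
    return y1 + y2;
}
```

Both functions return the same value for any index that does not wrap; the difference shows up only in what the compiler is allowed to assume.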

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bill Findlay@21:1/5 to EricP on Tue Sep 17 15:13:43 2024

On 17 Sep 2024, EricP wrote
(in article <4J3GO.141734$[email protected]>):

Niklas Holsti wrote:

On 2024-09-16 18:58, Michael S wrote:

On Mon, 16 Sep 2024 11:39:55 -0400
EricP <[email protected]> wrote:

David Brown wrote:

On 16/09/2024 15:04, Michael S wrote:

With one exception that usize overflow panics under debug
build.

I'm quite happy with unsigned types that are not allowed to
overflow, as long as there is some other way to get efficient wrapping on the rare occasions when you need it.

But I am completely against the idea that you have different
defined semantics for different builds. Run-time errors in a debug/test build and undefined behaviour in release mode is fine - defining the behaviour of overflow in release mode (other than possibly to the same run-time checking) is wrong.

In the compilers that do checking which I have worked with
there was always a distinction between checked builds and debug
builds. In my C code I have Assert() and AssertDbg(). Assert() calls stay in the production code; AssertDbg() calls are only in the debug builds.

Debug builds disable optimizations and spill all variable updates
to memory to make life easier for the debugger.
One usually compiles debug builds with no-optimize and all checks enabled.

But debug, optimize, and checking are separate controls.

In the compilers for checking languages I've worked with,
checking and optimization are compatible.
For example, if the compiler uses an AddFaultOverflow x = x + 1 instruction to increment 'x' then it knows no overflow is possible
and then can make all the other optimizations that C assumes are true.
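
A rough software stand-in for that idea in C, as a sketch only: __builtin_add_overflow is a GCC/Clang extension rather than standard C, checked_inc and next_is_greater are hypothetical names, and abort() stands in for whatever fault a real trapping add instruction would raise.

```c
#include <stdlib.h>

/* Increment with a fault on overflow. */
static int checked_inc(int x)
{
    int r;
    if (__builtin_add_overflow(x, 1, &r))
        abort();                 /* stand-in for the hardware fault */
    return r;
}

/* Once checked_inc has returned, the result is known to exceed x, so
   this comparison can be folded to 1 without relying on undefined
   behaviour. */
static int next_is_greater(int x)
{
    return checked_inc(x) > x;
}
```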

And on those compilers checks can be controlled with quite fine resolution. Checks can be enabled/disabled based on kind of check,
eg scalar overflow, array bounds,
for a compilation unit, a routine, a section of code,
a particular data type, a particular object.

This was all standard on DEC Ada85 so if Rust compilers do not
do this now they may in the near future.

If the ability to control compiler checks was standard on DEC Ada then it made DEC Ada non-standard.

No, it means that DEC Ada could be used as a standard-conforming Ada compiler or as a non-conforming compiler, to a user-chosen extent.

The recommended approach today (for applications where it matters) is to use static analysis of the Ada code (e.g. SPARK or other tools) to prove that run-time errors cannot happen, which then makes it possible to omit the corresponding run-time checks while staying compliant.

DEC Ada did that too. This optimization seems to me to be a relatively straightforward "propagation of constants" type of problem.

Not just that, many language forms actually preclude the need for checks,
e.g.:

for i in this_array'Range loop
... this_array(i) ...
end loop;

cannot fail on access to this_array(i), and:

this_array := that_array;

cannot fail in any of the ways that are endlessly debated
here in relation to *mem* C routines.

--
Bill Findlay

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Terje Mathisen on Tue Sep 17 15:37:37 2024

On 17/09/2024 08:43, Terje Mathisen wrote:

Stephen Fuld wrote:

On 9/16/2024 4:12 AM, David Brown wrote:

snip

With all respect to the regulars here, most people in technical
Usenet groups are either old, unusually nerdy, or both.

I resemble that remark! :-)

Ditto, probably...

Of course my comment was not meant very seriously, though there is a lot
of truth in it. Most regulars in technical Usenet groups have been in
those groups for a long time - very few twenty year olds can hold a conversation about Fortran and S390 mainframes! And most of us are
fairly nerdy - this stuff is not just a job, it's also an interest. But
that does not mean any of us are /too/ old, or have only nerdy interests.

I'm 67 (but not yet retired), I taught myself the Trachtenberg
algorithms for mental arithmetic when I was around 12 (was reminded of
this last night when I watched Gifted on netflix), I mail ordered what
was probably the first Rubik's cube to get to Norway. (And developed
three different algorithms to solve it, but I only remember the last one
now which I had optimized for simplicity, not speed.)

I would have been about 9 or 10 when I got my first Rubik's cube. A mathematician colleague of my father's and I put together a solution
algorithm based on a few bits he had remembered from a lecture by David Singmaster. When the rest of the class played football at break, I
stood in the goals practising the Rubik's cube - I believe that counts
as nerdy!

Those, along with high school chess and orienteering mapping should
count as nerdy pursuits, right?

Orienteering is too physical to be nerdy, isn't it? I teach judo to
kids - so none of us are perfect :-)

Winning the County Yo-Yo championship would be less so?

It is still a /bit/ nerdy...

Regards to all the regulars here, I do consider many of you friends that
I just haven't met yet.

That is a fine attitude. I like to think that even with the people I
regularly disagree with in technical groups, if we were to sit down with
a coffee or a beer, rather than a screen and keyboard, we'd have a very pleasant evening.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Tue Sep 17 15:53:16 2024

On 17/09/2024 12:27, Michael S wrote:

On Tue, 17 Sep 2024 11:48:12 +0200
Terje Mathisen <[email protected]> wrote:

David Brown wrote:

On 17/09/2024 08:07, Terje Mathisen wrote:

David Brown wrote:

On 16/09/2024 10:37, Terje Mathisen wrote:

This becomes much simpler in Rust where usize is the only legal
index type:

Yeah, you have to actually write it as

    y = p[x];
    x += 1;

instead of a single line, but this makes zero difference to the
compiler, right?

I don't care much about the compiler - but I don't think this is
an improvement for the programmer. (In general, I dislike
trying to do too much in a single expression or statement, but
some C constructs are common enough that I am happy with them.
It would be hard to formulate concrete rules here.)

And the resulting object code is less efficient than you get with
signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

Is that true? I'll have to check godbolt myself if that is really
the case!

It is not true - or at least, it shouldn't be true. I had thought
the Rust code was using the equivalent of a C "unsigned int" here,
which would require extra code for wrapping semantics. But that
was just my misunderstanding of Rust and its types - with a 64-bit
unsigned type, it should give the same results as C. However,
there's no harm in checking it and letting us know.

No need to check this particular point, Rust's usize was obviously
designed to be an unsigned type large enough to index into the entire
addressable memory range, so on a 64-bit platform it has to be 64
bits.

(I've previously shown how "y = p[x++];" in C is less efficient on
x86-64 if x is "unsigned int", compared to "int" or 64-bit types
for x.)

That's actually surprising to me, I would have guessed any 32-bit
index would be less efficient than a full-width type, but if the
idiom is very, very common in C code, then it makes sense to make it
fast.

Doing so would typically require either sign- or zero-extending all
32-bit variables when loaded into a 64-bit register, right?

Terje

Taken in isolation, on something like x86-64 or aarch64, where result
of 32-bit addition is by default zero-extended, there is no difference between 32-bit and 64-bit unsigned x.
However when statement shown above is part of the sequence, even short
one, 64-bit x allows compiler optimizations that are impossible with
32-bit.
E.g.
y1 = p[x++]
y2 = p[x++]

On x86-64 with 64-bit x the second load can be implemented as
mov dstreg, [rcx+rdx*4+4]
On aarch64 with 64-bit x both loads can be folded into single 'load
pair' instruction.

That's it, yes. It's not the access that is slower for 32-bit x, it's
using it later after the increment because the increment has to be wrapped.

These things are always complicated by surrounding code, but consider
Michael's example here (which is the same as I discussed in another
post), assuming a 64-bit system with some common addressing modes :

y1 = p[x++];
y2 = p[x++];
...

When x is a 64-bit type, this can be implemented (where "r?" are general-purpose 64-bit registers) as :

r1 = p + x;
y1 = *r1++;
y2 = *r1++;
...

For a 32-bit x with defined wrapping, it might be implemented as :

r1 = x; // Zero or sign extend as appropriate
y1 = *(p + r1);
r1 += 1;
r1 &= 0xffffffff;
y2 = *(p + r1);
r1 += 1;
r1 &= 0xffffffff;
...

There might be a single instruction for adding 1 with 32-bit wrapping,
but it is still bigger.

For 32-bit x with undefined overflow, it will be :

r1 = x; // Zero or sign extend as appropriate
r2 = p + x;
y1 = *r2++;
y2 = *r2++;
...

So with a 32-bit index, you are probably going to have to have a sign or
zero extension somewhere. But key to the efficiency of signed int
compared to unsigned int is that the compiler can assume there is no
overflow, and does not need to implement wrapping.
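
The wrapping requirement can be seen by computing just the index of the second access. A small sketch, with made-up function names, covering the 64-bit case, the 32-bit unsigned case, and the explicit "add then mask" form from the pseudo-code above:

```c
#include <stdint.h>

/* Index used by the second access when x is a 64-bit unsigned type. */
static uint64_t second_index_u64(uint64_t x)
{
    x++;
    return x;
}

/* Same, with x a 32-bit unsigned type: the increment wraps modulo 2^32. */
static uint64_t second_index_u32(uint32_t x)
{
    x++;
    return x;                /* zero-extended to 64 bits on return */
}

/* The "r1 += 1; r1 &= 0xffffffff;" form, done in a 64-bit register. */
static uint64_t second_index_masked(uint64_t r1)
{
    r1 += 1;
    r1 &= 0xffffffffu;
    return r1;
}
```

Starting from 0xffffffff, the 64-bit version yields 0x100000000 while the two 32-bit versions yield 0, which is why the compiler cannot simply step a pointer when x is a 32-bit unsigned type.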

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Tue Sep 17 16:00:48 2024

On 17/09/2024 12:15, Michael S wrote:

On Tue, 17 Sep 2024 11:29:15 +0200
David Brown <[email protected]> wrote:

On 17/09/2024 03:36, Waldek Hebisch wrote:

David Brown <[email protected]> wrote:

On 16/09/2024 19:51, MitchAlsup1 wrote:

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the
rather distant past could do these rather efficiently {360
SRDL,...}

Anyone who designs a data structure with a bit-field that spans
two 64-bit parts of a struct is probably ignorant of C
bit-fields and software in general. It is highly unlikely to be
necessary or even beneficial from the hardware viewpoint, but
really inconvenient on the software side (whether you use
bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..

The folks designing those register setups had a choice, and made a
bad choice from the viewpoint of software (whether it be C,
assembly, or any other language).

It's conceivable that it was the right choice on balance,
considering many factors. And it's certainly more believable that
it was an appropriate choice when sizes were smaller. It is less
believable that there is an overwhelming need to cross a 64-bit
boundary.

Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data
structures lead to better utilization of caches and busses, and
the effect of this was larger than the cost of extra instructions. So
need to cross 64-bit boundary may be rare, but there will be cases
when it is best choice.

It is possible, but I think it is rare.

Perhaps my perception is biased from working with microcontrollers,
where you often don't have caches and instruction speeds are not
nearly as much faster than ram access speeds as you see in modern x86
systems.

On the other hand, with MCUs it's quite common to be limited by size of
data storage (SRAM), while size of program storage (flash) is bigger
than one will ever want. Plus, quite often, speed is of less concern.
In such [common] situation densely packed [arrays of] structures could
be desirable.

That can also be true. (The smallest device I ever used had 1 KB of
flash, and that was still plenty for the task I had!)

But in many embedded systems, speed is of some concern at least - if you
can do the task in fewer clock cycles, maybe you can use a slower device
(which might be cheaper, or have easier EMC requirements), or you can
spend more time in sleep modes for reduced average power. Run-time
efficiency isn't always about shorter wall-clock times.

The main thing about embedded development, however, is that the answer
is always "it depends". There are few hard and fast rules!

The other thing I don't like about split bit-fields is that there is
typically no way to do atomic updates, which can mean you need extra
care to keep things correct.

In the common case, on common ISAs atomic RMW update of bit field is impossible even when the field does not cross a word boundary.

In case you mean write-only update (i.e. values of adjacent fields are
known in advance and not expected to change), what you say can be
correct or not, depending on availability of unaligned stores and on
what exactly one considers 'atomic'.

Yes, these are all possibilities. But it is not uncommon that the key
thing is to avoid partial updates where you have changed one half of a
hardware register but not yet changed the other half.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Michael S on Tue Sep 17 10:57:49 2024

Michael S wrote:

On Tue, 17 Sep 2024 08:20:15 +0200
Terje Mathisen <[email protected]> wrote:

EricP wrote:

These double-width bit-field straddle operations show up at 32-bits.
Various FP64 formats (DEC's middle-endian FP being the worst
example), Intel page table entries and segment/gate descriptors,
come to mind.

Lots of them in 32-bit code!

Lots of what in 32-bit code?

On 32-bit cpus, bit-fields that straddle 32-bit boundaries inside
larger structures like a 64-bit FP or PTE.
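
As a concrete case, the 52-bit fraction of an IEEE 754 binary64 value occupies the low 20 bits of the upper 32-bit word and all of the lower word. A sketch of pulling out its top 32 bits using only 32-bit operations; the function names are made up, and the code assumes that double is IEEE 754 binary64:

```c
#include <stdint.h>
#include <string.h>

/* Split a double into its upper and lower 32-bit words. */
static void split_double(double d, uint32_t *hi, uint32_t *lo)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* well-defined type pun */
    *hi = (uint32_t)(bits >> 32);
    *lo = (uint32_t)bits;
}

/* Top 32 bits of the 52-bit fraction: 20 bits come from hi and 12 bits
   from lo, so the result straddles the 32-bit boundary. */
static uint32_t fraction_top32(uint32_t hi, uint32_t lo)
{
    return ((hi & 0x000FFFFFu) << 12) | (lo >> 20);
}
```

For 1.5 the upper word is 0x3FF80000 and the lower word is 0, so fraction_top32 gives 0x80000000.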

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Tim Rentsch on Tue Sep 17 16:23:40 2024

On Tue, 17 Sep 2024 2:34:44 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

I didn't see any content from you in this last posting
of yours.

I had started to make a comment after hitting quote, and
while re-reading what you wrote I had nothing to add and
nothing to modify or complain about. While thinking it all
over I ended hitting the Post Article button without any
text.

There was no way to retrieve the post, so I let it lie.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bill Findlay@21:1/5 to Stefan Monnier on Tue Sep 17 18:32:35 2024

On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):

With all respect to the regulars here, most people in technical Usenet groups are either old, unusually nerdy, or both.

I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).

Stefan

Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)

--
Bill Findlay

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Tue Sep 17 12:27:52 2024

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.

I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Tue Sep 17 16:34:10 2024

On Tue, 17 Sep 2024 8:12:16 +0000, Michael S wrote:

On Tue, 17 Sep 2024 01:35:17 +0000
[email protected] (MitchAlsup1) wrote:

On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:

Bill Findlay wrote:
I found the same 5% performance cost in my tests with DEC Ada85.
Most code was pretty optimal too.

The one thing I found DEC's compiler made a complete pigs breakfast
of the generated code was scanning a character string backwards:

Bacon, sausage, and ham.

Sounds yummy. Code not so much.

It seems that you and EricP give different (not to say an opposite)
meaning to the phrase "complete pigs breakfast".

I had never heard or seen the phrase before. So I just made that up
on the spot.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Sep 17 16:32:08 2024

On Tue, 17 Sep 2024 6:20:15 +0000, Terje Mathisen wrote:

EricP wrote:

I added a bunch of instructions for dealing with double-width
operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an extra
source register and two dest registers, which has a number of consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.

Very nice!

This means that you can do integer IMAC(), right?

(hi, lo) = imac(a, b, c); // == a*b+c

CARRY Rc,{{OI}}
MUL Rd,Ra,Rb
gives
{Rc,Rd} = product128(Ra,Rb)+Rc

where all registers are 64-bits.

The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.

CARRY Rc,{{OI}{OI}}
MUL Re,Ra,Rb
ADD Re,Re,Rd
gives
{Rc,Re} = product128(Ra,Rb) + Rc + Rd

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Bill Findlay on Tue Sep 17 18:18:51 2024

On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.

I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).

Stefan

Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)

At 71 real years old I still operate as if I were <let's say> 21.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to [email protected] on Tue Sep 17 19:52:38 2024

MitchAlsup1 <[email protected]> schrieb:

On Tue, 17 Sep 2024 2:34:44 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

I didn't see any content from you in this last posting
of yours.

I had started to make a comment after hitting quote, and
while re-reading what you wrote I had nothing to add and
nothing to modify or complain about. While thinking it all
over I ended hitting the Post Article button without any
text.

There was no way to retrieve the post, so I let it lie.

Same thing happens to me on occasion.

With slrn, it is possible to cancel the post, and Eternal September
will honor the cancel.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to [email protected] on Tue Sep 17 12:48:31 2024

[email protected] (MitchAlsup1) writes:

On Tue, 17 Sep 2024 2:34:44 +0000, Tim Rentsch wrote:

[email protected] (MitchAlsup1) writes:

On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

I didn't see any content from you in this last posting
of yours.

I had started to make a comment after hitting quote, and
while re-reading what you wrote I had nothing to add and
nothing to modify or complain about. While thinking it all
over I ended hitting the Post Article button without any
text.

There was no way to retrieve the post, so I let it lie.

Okay, thank you.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to BGB on Tue Sep 17 19:53:31 2024

BGB <[email protected]> schrieb:

Another option would be for adjacent _Bool values to merge similar to bitfields...

How would you manage a pointer to a _Bool?
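
The difficulty is that C does not allow the address of a bit-field to be taken, so a merged _Bool would have no ordinary pointer. One conceivable workaround, shown purely as a sketch with made-up names, is a "fat" reference that carries a byte address plus a bit number:

```c
#include <limits.h>

/* Hypothetical reference to one packed flag. */
typedef struct {
    unsigned char *byte;    /* byte that holds the flag */
    unsigned bit;           /* bit position, 0 .. CHAR_BIT-1 */
} bit_ref;

static bit_ref bit_ref_make(unsigned char *base, unsigned index)
{
    bit_ref r = { base + index / CHAR_BIT, index % CHAR_BIT };
    return r;
}

static int bit_ref_get(bit_ref r)
{
    return (*r.byte >> r.bit) & 1;
}

static void bit_ref_set(bit_ref r, int value)
{
    if (value)
        *r.byte |= (unsigned char)(1u << r.bit);
    else
        *r.byte &= (unsigned char)~(1u << r.bit);
}
```

Such a reference is twice the size of a plain pointer and cannot be passed to code that expects a _Bool *.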

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Brett@21:1/5 to David Brown on Tue Sep 17 20:00:13 2024

David Brown <[email protected]> wrote:

On 17/09/2024 03:36, Waldek Hebisch wrote:

David Brown <[email protected]> wrote:

On 16/09/2024 19:51, MitchAlsup1 wrote:

On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

On 15/09/2024 21:13, MitchAlsup1 wrote:

As to HW sadism:: this is not <realistically> any harder than
misaligned DW accesses from the cache. Many ISAs from the rather distant
past could do these rather efficiently {360 SRDL,...}

Anyone who designs a data structure with a bit-field that spans two
64-bit parts of a struct is probably ignorant of C bit-fields and
software in general. It is highly unlikely to be necessary or even
beneficial from the hardware viewpoint, but really inconvenient on the
software side (whether you use bit-fields or not).

Sometimes you don't have a choice::
x86-64 segment registers.
PCIe MMI/O registers,
..

The folks designing those register setups had a choice, and made a bad
choice from the viewpoint of software (whether it be C, assembly, or any
other language).

It's conceivable that it was the right choice on balance, considering
many factors. And it's certainly more believable that it was an
appropriate choice when sizes were smaller. It is less believable that
there is an overwhelming need to cross a 64-bit boundary.

Several pieces of software discovered that "bad" smaller data
structures lead to faster execution. Simply, smaller data structures
lead to better utilization of caches and busses, and the effect due to
this was larger than cost of extra instructions. So need to cross
64-bit boundary may be rare, but there will be cases when it is best
choice.

It is possible, but I think it is rare.

Perhaps my perception is biased from working with microcontrollers,
where you often don't have caches and instruction speeds are not nearly
as much faster than ram access speeds as you see in modern x86 systems.

I personally got lots of 20% speedups by restructuring data on PlayStation
2 code.

The C rules for data structure layout are stupid: a programmer would add an
int in front of a vector and fail to wonder why his structure grew by 16
bytes. Never mind that he used that 4-byte int to hold a value that had a
max of 15.

Had to annotate the data structures with 16 byte comment boundaries to stop endless stupidity.
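
The effect is easy to reproduce in standard C11. In this sketch vec4 is a hypothetical stand-in for a 16-byte-aligned vector type, and float is assumed to be 4 bytes:

```c
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical 16-byte vector with 16-byte alignment. */
typedef struct {
    alignas(16) float v[4];
} vec4;

struct without_count {
    vec4 pos;               /* offset 0, size 16 */
};

struct with_count {
    int count;              /* offset 0, then 12 bytes of padding */
    vec4 pos;               /* must start on a 16-byte boundary */
};

/* How many bytes the struct grew by when the int was added. */
static size_t growth(void)
{
    return sizeof(struct with_count) - sizeof(struct without_count);
}
```

Here growth() returns 16 even though the added member is 4 bytes, because pos has to start at offset 16.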

The other thing I don't like about split bit-fields is that there is typically no way to do atomic updates, which can mean you need extra
care to keep things correct.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Tue Sep 17 20:11:55 2024

On Tue, 17 Sep 2024 19:51:19 +0000, BGB wrote:

On 9/17/2024 4:39 AM, David Brown wrote:

On 16/09/2024 21:46, BGB wrote:

On 9/16/2024 4:27 AM, David Brown wrote:

Albeit, types like _Bool in my implementation are padded to a full
byte (it is treated as an "unsigned char" that is assumed to always
hold either 0 or 1).

That's the usual way to handle them.

Smallest C container is 1 byte
__BOOL can use as small a container as C can address

Another option would be for adjacent _Bool values to merge similar to bitfields...
Though, seems that simply turning it into a byte is the typical option.

One can do ATOMIC stuff on a __BOOL
one cannot do ATOMIC stuff on struct { unsigned __bool: 1};

This comes up as an issue in some Windows file formats, where one
can't just naively use a struct with 32-bit fields because some 32-bit
members only have 16-bit alignment.

Ah, the joys of using ancient formats with new systems!

I was around when this stuff was still newish.

Some are essentially frozen in time with their misaligned members.
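
One well-known instance (chosen here as an illustration; the post does not name a format) is the 14-byte BMP file header, where the 32-bit bfSize field sits at byte offset 2 and bfOffBits at offset 10. A portable way to read such fields is to assemble them from bytes instead of overlaying a struct; the function names below are made up:

```c
#include <stdint.h>

/* Read a little-endian 32-bit value from any byte offset. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* bfSize: 32-bit field at offset 2, so only 16-bit aligned. */
static uint32_t bmp_file_size(const unsigned char *hdr)
{
    return load_le32(hdr + 2);
}

/* bfOffBits: 32-bit field at offset 10. */
static uint32_t bmp_data_offset(const unsigned char *hdr)
{
    return load_le32(hdr + 10);
}
```

This avoids both the padding a compiler would insert into a naive struct and any dependence on unaligned loads or host byte order.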

In HW the packing and unpacking of multi-container single variables
is easy--it's just wires.

Still better than:
"Well, initial field wasn't big enough";
"Repurpose those bytes from over there, and glue them on".

Really NOT a problem in HW--understandably low efficiency in SW.

There would need to be a mechanism in the ISA to select between these
modes though (probably a "magic branch" scheme different from the one
used for Inter-ISA branches).

Modes make testing significantly harder. Each mode adds 1 to the
exponent of how many test cases it takes to adequately test a part.

This would likely include an RV64 encoding for "Branch to/from CoEx",
and an encoding within this ISA to jump between CoEx and "Native" mode.

Magic branches make sense mostly as any such mode switch is going to
require a pipeline flush.

This is assuming an implementation that would want to be able to support
both this ISA and also RV64GC.

One possibility could be (in native RV notation):
RV64 (Branches if supported, NOP if not):
LBU X0, Xs, Disp12s //Dest=RV64GC
LWU X0, Xs, Disp12s //Dest=CoEx
LHU X0, Xs, Disp12s //Dest=Native
New ISA:
LBU X0, Xs, Disp10s //Dest=RV64GC
LWU X0, Xs, Disp10s //Dest=CoEx
LHU X0, Xs, Disp10s //Dest=Native

This only gives 36 bits (top) or 30 bits (bottom) of range. What you are
going to want is 64 bits of range -- especially when switching modes --
you PROBABLY want to use an entirely different sub-tree of the
translation table trees.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Tue Sep 17 23:04:52 2024

On Tue, 17 Sep 2024 22:15:12 +0000, BGB wrote:

On 9/17/2024 3:11 PM, MitchAlsup1 wrote:

Modes make testing significantly harder. Each mode adds 1 to the
exponent of how many test cases it takes to adequately test a part.

Possibly.

But, modes are kinda unavoidable here:
CPU only runs RV64GC or similar:
Doomed to relative slowness;
CPU only does CoEx:
Closes off the ability to run binaries that assume RV64GC.
CPU only does new ISA:
Well, then it can't run RISC-V code, making all this kinda moot.

My 66000 does not have modes (at least yet); it even comes out of
RESET with the MMUs turned on.
-----------

This is assuming an implementation that would want to be able to support
both this ISA and also RV64GC.

One possibility could be (in native RV notation):
RV64 (Branches if supported, NOP if not):
   LBU X0, Xs, Disp12s //Dest=RV64GC
   LWU X0, Xs, Disp12s //Dest=CoEx
   LHU X0, Xs, Disp12s //Dest=Native
New ISA:
   LBU X0, Xs, Disp10s //Dest=RV64GC
   LWU X0, Xs, Disp10s //Dest=CoEx
   LHU X0, Xs, Disp10s //Dest=Native

This only gives 36 bits (top) or 30 bits (bottom) of range. What you are
going to want is 64 bits of range -- especially when switching modes --
you PROBABLY want to use an entirely different sub-tree of the
translation table trees.

Idea here is that 'Xs' will give the base address for the target.

On the RISC-V side, this would mean, say:
AUIPC X7, disp
LWU X0, X7, disp
Similar to a normal JALR.

Still limited to 32-bit displacement from IP.

How would you perform the following call::
current IP = 0x0000000000001234
target IP = 0x7FFFFFFF00001234

This is a single (2-word) instruction in my ISA, assuming GOT is
32-bit displaceable and 64-bit entries.

I could almost interpret X0 as PC, except that on a "standard" RISC-V
CPU, the non-supported case would be, likely: "program crashes trying to access a NULL pointer", which is less useful.

Branches in the new ISA would likely be encoded using jumbo prefixes.

Well, partly because the new ISA lacks AUIPC, but the new ISA can encode
it more directly as, essentially:
LWU X0, PC, Disp33s

AUIPC is (and remains) a crutch (like LUI from MIPS)
a) it consumes an instruction (space and time)
b) it consumes a register unnecessarily
c) it consumes power that direct delivery of the constant would not

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to All on Wed Sep 18 10:13:12 2024

On 17/09/2024 20:18, MitchAlsup1 wrote:

On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.

I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).

Stefan

Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)

At 71 real years old I still operate as if I were <let's say> 21.

You are not 71, you are merely 0x47 :-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to David Brown on Wed Sep 18 13:37:14 2024

On Wed, 18 Sep 2024 8:13:12 +0000, David Brown wrote:

On 17/09/2024 20:18, MitchAlsup1 wrote:

On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.

I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
true for more than 20 years).

Stefan

Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)

At 71 real years old I still operate as if I were <let's say> 21.

You are not 71, you are merely 0x47 :-)

It is only 27 in base 32.

From EricP@21:1/5 to Terje Mathisen on Wed Sep 18 10:10:21 2024

Terje Mathisen wrote:

EricP wrote:

Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.

Nothing likely about it: LZ4 is pretty much the only compression algorithm/lossless codec that never straddles, all the rest tend to
treat the source data as single bitstream of arbitrary length, except
for some built-in chunking mechanism which simplifies faster scanning.

The core of the algorithm always starts with knowing the endianness,
then picking up 32 or 64-bit chunks of input data (byte-flipping if
needed) and then extracting the next N bits either from the top or bottom
of the buffer register.

Almost by definition, this is not code that a compiler is set up to help
you get correct.

I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an extra
source register and two dest registers, which has a number of consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.

Very nice!

This means that you can do integer IMAC(), right?

(hi, lo) = imac(a, b, c); // == a*b+c

The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.

Terje

I thought about IMAC but it was a bit too much.
And unlike FMA there is no precision gain in IMAC, just convenience.
IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
care about overflow/carry on the accumulate.
2-wide = 2-wide + narrow * narrow
It needs 7 registers, 3 dest and 4 source if you want overflow/carry
on the accumulate.
3-wide = 2-wide + narrow * narrow

I wanted to support checked arithmetic which means full width multiplies.
And I was always bothered by the risc approach of MULL (low part) and
MULH (high part) where they do most of the multiply then toss half away
just because they won't have 2 dest registers.

So what else can I do with 2 dest registers? Wide add and sub.
Various wide Add,Sub solve the missing carry/overflow flags problems.

FMA already requires 3 source registers.
Besides Add,Sub,Mul what else can one do with 3 source and 2 dest registers?
Wide shifts and wide bit-field extract and insert.

I went with two (r_hi,r_lo) register specifiers because it gave programmers
more flexibility. I played a bit with even register pairs (R0, R2, R4...)
and found one had to do extra MOVs just to form a pair.
(r_hi,r_lo) costs a longer instruction format but I have a variable length
instruction so it's mostly wider fetch and decode pathways to handle
the worst case instruction size.

W = Wide = (hi,lo) register pair, N = Narrow = one register.

Add forms:
Add N = N + N // No carry out
Add3 N = N + N + N // No carry out
Addw2 W = N + N // Generate carry
Addw3 W = N + N + N // Generate + propagate carry
Addw1 W = W + N // Propagate carry

Same for subtract wide.
The three Add forms are chosen to make multi-precision integer
multiply easier. See below.

Muluw W = N * N
Mulsw W = N * N

Divuw (quo,rem) = N / N
Divsw (quo,rem) = N / N

Shllw W = W << size // Shift left logical
Shlaw W = W << size // Shift left arithmetic, fault on signed overflow
Shrlw W = W >> size // Shift right logical
Shraw W = W >> size // Shift right arithmetic, sign extend
Shrnw W = W >> size // Shift right numeric, round -1 to zero

Bfextu N = extract (W, size, position) // Bit-field extract, zero extend
Bfexts N = extract (W, size, position) // Bit-field extract, sign extend
Bfins W = insert (W, N, size, position) // Bit-field insert

=====================================
Example unsigned 128 * 128 => 256 multiply:

// Unsigned Multiply 128*128 => 256
// (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
// Uses r4,r5,r6,r7,r8 as temp registers
//
muluw r5,r4 = r3*r0
muluw r6,r0 = r2*r0
muluw r8,r7 = r2*r1
muluw r3,r2 = r3*r1
addw3 r4,r1 = r4+r6+r7
addw3 r5,r2 = r5+r8+r2
addw2 r4,r2 = r2+r4
add3 r3 = r3+r5+r4

The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
than the even number register pairs R0,R2,R4... is because the above
sequence would require extra moves to form the even numbered pairs.
With separate pairs one can select registers so that everything lands
in the right dest at the right time.
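
The eight-instruction listing above can be checked by transcribing it line by line. The following is only a sketch: it assumes 64-bit narrow registers, and the Python helpers muluw, addw3, addw2 and add3 are stand-ins written for this illustration from the W/N definitions in the post, not an actual simulator for the ISA.

```python
MASK = (1 << 64) - 1  # assumed width of one narrow register


def muluw(a, b):
    """W = N * N, unsigned: returns (hi, lo)."""
    p = a * b
    return p >> 64, p & MASK


def addw3(a, b, c):
    """W = N + N + N: returns (carry, sum)."""
    s = a + b + c
    return s >> 64, s & MASK


def addw2(a, b):
    """W = N + N: returns (carry, sum)."""
    s = a + b
    return s >> 64, s & MASK


def add3(a, b, c):
    """N = N + N + N: no carry out."""
    return (a + b + c) & MASK


def mul128(x, y):
    """(r3,r2)*(r1,r0) => (r3,r2,r1,r0), one statement per instruction."""
    r3, r2 = x >> 64, x & MASK
    r1, r0 = y >> 64, y & MASK
    r5, r4 = muluw(r3, r0)      # muluw r5,r4 = r3*r0
    r6, r0 = muluw(r2, r0)      # muluw r6,r0 = r2*r0
    r8, r7 = muluw(r2, r1)      # muluw r8,r7 = r2*r1
    r3, r2 = muluw(r3, r1)      # muluw r3,r2 = r3*r1
    r4, r1 = addw3(r4, r6, r7)  # addw3 r4,r1 = r4+r6+r7
    r5, r2 = addw3(r5, r8, r2)  # addw3 r5,r2 = r5+r8+r2
    r4, r2 = addw2(r2, r4)      # addw2 r4,r2 = r2+r4
    r3 = add3(r3, r5, r4)       # add3  r3 = r3+r5+r4
    return (r3 << 192) | (r2 << 128) | (r1 << 64) | r0
```

Comparing mul128(x, y) against Python's own x * y for 128-bit inputs exercises every carry path, including the all-ones case.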

From MitchAlsup1@21:1/5 to BGB on Wed Sep 18 14:27:28 2024

On Wed, 18 Sep 2024 4:00:43 +0000, BGB wrote:

On 9/17/2024 6:04 PM, MitchAlsup1 wrote:

Still limited to 32-bit displacement from IP.

How would you perform the following call::
current IP = 0x0000000000001234
target IP = 0x7FFFFFFF00001234

This is a single (2-word) instruction in my ISA, assuming GOT is
32-bit displaceable and 64-bit entries.

Granted, but in plain RISC-V, there is no real better option.

If one wants to generate 64-bit displacement, and doesn't want to load a constant from memory:
LUI X6, Disp20Hi //20 bits
ADDI X6, X6, Disp12Hi //12 bits
AUIPC X7, Disp20Lo
ADDI X7, X7, Disp12Lo
SLLI X6, X6, 32
ADD X7, X7, X6

How very much simpler is::

MEM Rd,[IP,Ri<<s,DISP64]

1 instruction, 3 words, 1 decode cycle, no forwarding, shorter latency.
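
For comparison, here is a sketch of what the assembler or linker has to do to fill in the four immediates of the six-instruction RV64 sequence quoted above (reading its fourth instruction as ADDI X7, X7, Disp12Lo). Because LUI, AUIPC and ADDI all sign-extend, the fields are not plain bit slices of the displacement. The function names are made up for this illustration, and pc is assumed to be the address of the AUIPC instruction itself.

```python
M64 = (1 << 64) - 1


def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def split_disp64(disp):
    """Return (hi20, hi12, lo20, lo12) for a 64-bit PC-relative displacement."""
    disp &= M64
    lo12 = sext(disp, 12)
    lo20 = ((disp - lo12) >> 12) & 0xFFFFF
    low = sext(lo20 << 12, 32) + lo12      # what AUIPC + ADDI add to PC
    high = ((disp - low) & M64) >> 32      # what X6 must hold before SLLI
    hi12 = sext(high, 12)
    hi20 = ((high - hi12) >> 12) & 0xFFFFF
    return hi20, hi12, lo20, lo12


def run_sequence(pc, hi20, hi12, lo20, lo12):
    """Model the six instructions with RV64 wraparound and sign extension."""
    x6 = sext(hi20 << 12, 32) & M64         # LUI   X6, hi20
    x6 = (x6 + hi12) & M64                  # ADDI  X6, X6, hi12
    x7 = (pc + sext(lo20 << 12, 32)) & M64  # AUIPC X7, lo20
    x7 = (x7 + lo12) & M64                  # ADDI  X7, X7, lo12
    x6 = (x6 << 32) & M64                   # SLLI  X6, X6, 32
    return (x7 + x6) & M64                  # ADD   X7, X7, X6
```

With pc = 0x1234 and target = 0x7FFFFFFF00001234 from the example earlier in the thread, split_disp64(target - pc) gives hi20 = 0x80000 and hi12 = -1, which shows the sign-extension compensation at work.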

Which is sort of the whole reason I am considering hacking around it
with an alternate encoding scheme.

Just put in real constants.

New encoding scheme can in theory do:
LEA X7, PC, Disp64
In a single 96-bit instruction.

Where is the indexing register?

------------

AUIPC is (and remains) a crutch (like LUI from MIPS)
a) it consumes an instruction (space and time)
b) it consumes a register unnecessarily
c) it consumes power that direct delivery of the constant would not

Yeah, pretty much.
LUI + AUIPC + JAL, eat nearly 27 bits of encoding space.
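
The "nearly 27 bits" figure can be reproduced with a quick count. This sketch relies only on the standard 32-bit RISC-V formats, where LUI, AUIPC and JAL each own a whole 7-bit major opcode and fill the rest of the word with a 20-bit immediate and a 5-bit rd; the variable names are illustrative.

```python
import math

OPCODE_BITS = 7                  # major opcode field of a 32-bit instruction
FREE_BITS = 32 - OPCODE_BITS     # 20-bit immediate + 5-bit rd = 25 bits

encodings = 3 * (1 << FREE_BITS)              # LUI + AUIPC + JAL
bits_consumed = math.log2(encodings)          # 25 + log2(3), about 26.58
fraction_of_space = encodings / (1 << 32)     # 3/128 of all 32-bit words

print(f"{bits_consumed:.2f} bits, {fraction_of_space:.2%} of the 32-bit space")
```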

From MitchAlsup1@21:1/5 to BGB on Wed Sep 18 18:42:19 2024

On Wed, 18 Sep 2024 17:55:34 +0000, BGB wrote:

On 9/18/2024 9:27 AM, MitchAlsup1 wrote:

On Wed, 18 Sep 2024 4:00:43 +0000, BGB wrote:

On 9/17/2024 6:04 PM, MitchAlsup1 wrote:

Still limited to 32-bit displacement from IP.

How would you perform the following call::
current IP = 0x0000000000001234
target IP = 0x7FFFFFFF00001234

This is a single (2-word) instruction in my ISA, assuming GOT is
32-bit displaceable and 64-bit entries.

Granted, but in plain RISC-V, there is no real better option.

If one wants to generate 64-bit displacement, and doesn't want to load a
constant from memory:
   LUI X6, Disp20Hi       //20 bits
   ADDI X6, X6, Disp12Hi //12 bits
   AUIPC X7, Disp20Lo
   ADDI X7, X7, Disp12Lo
   SLLI X6, X6, 32
   ADD X7, X7, X6

How very much simpler is::

    MEM    Rd,[IP,Ri<<s,DISP64]

1 instruction, 3 words, 1 decode cycle, no forwarding, shorter latency.

It is simpler, but N/E in RV64G...

This is the whole issue of the idea:
Remain backwards compatible with RV64G / RV64GC (in a binary sense).

So, you like sailing with an albatross tied around your neck:: Check.

*and* try to allow extending it in a way such that performance can be
less poor...

I should remind you that if you eliminate the compressed parts of
RISC-V you can fit the entire My 66000 ISA in the space remaining.
All the constants, all transcendentals, all the far-control transfers,
the efficient context switching, overhead free world switching,...
---------

Which is sort of the whole reason I am considering hacking around it
with an alternate encoding scheme.

Just put in real constants.

New encoding scheme can in theory do:
   LEA X7, PC, Disp64
In a single 96-bit instruction.

Where is the indexing register?

Generally the use of a displacement and index register are mutually
exclusive (and, cases that can make use of Disp AND Index are much less common than Disp OR Index).

COMMON /alpha/ a(100,100), b(300,300),

..

x = a(i,j)*b(j,i);

I see large displacements with indexing all the time from ASM out
of Brian's compiler.

I may still consider defining an encoding for this, but not yet. It is
in a similar boat as auto-increment. Both add resource cost with
relatively little benefit in terms of overall performance.
Auto-increment because if one has superscalar, the increment can usually
be co-executed. And, full [Rb+Ri*Sc+Disp], because it is just too
infrequent to really justify the extra cost of a 3-way adder even if
limited mostly to the low-order bits...

Myopathy--look it up.

From Brett@21:1/5 to EricP on Wed Sep 18 21:15:55 2024

EricP <[email protected]> wrote:

Terje Mathisen wrote:

EricP wrote:

Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.

Nothing likely about it: LZ4 is pretty much the only compression
algorithm/lossless codec that never straddles, all the rest tend to
treat the source data as single bitstream of arbitrary length, except
for some built-in chunking mechanism which simplifies faster scanning.

The core of the algorithm always starts with knowing the endianness,
then picking up 32 or 64-bit chunks of input data (byte-flipping if
needed) and then extracting the next N bits either from the top or bottom
of the buffer register.

Almost by definition, this is not code that a compiler is set up to help
you get correct.

I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an extra
source register and two dest registers, which has a number of consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.

Very nice!

This means that you can do integer IMAC(), right?

(hi, lo) = imac(a, b, c); // == a*b+c

The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.

Terje

I thought about IMAC but it was a bit too much.
And unlike FMA there is no precision gain in IMAC, just convenience.
IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
care about overflow/carry on the accumulate.
2-wide = 2-wide + narrow * narrow
It needs 7 registers, 3 dest and 4 source if you want overflow/carry
on the accumulate.
3-wide = 2-wide + narrow * narrow

I wanted to support checked arithmetic which means full width multiplies.
And I was always bothered by the risc approach of MULL (low part) and
MULH (high part) where they do most of the multiply then toss half away
just because they won't have 2 dest registers.

I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:

https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

They claim 5 cycles, should be six, five for the multiply and one more for
the second result, unless the next instruction does not need a write port,
and does not use the result. You can get a throughput of 5 cycles with
smart coding, but that rarely happens without effort.

So what else can I do with 2 dest registers? Wide add and sub.
Various wide Add,Sub solve the missing carry/overflow flags problems.

FMA already requires 3 source registers.
Besides Add,Sub,Mul what else can one do with 3 source and 2 dest registers?
Wide shifts and wide bit-field extract and insert.

I went with two (r_hi,r_lo) register specifiers because it gave programmers
more flexibility. I played a bit with even register pairs (R0, R2, R4...)
and found one had to do extra MOVs just to form a pair.
(r_hi,r_lo) costs a longer instruction format but I have a variable length
instruction so it's mostly wider fetch and decode pathways to handle
the worst case instruction size.

W = Wide = (hi,lo) register pair, N = Narrow = one register.

Add forms:
Add N = N + N // No carry out
Add3 N = N + N + N // No carry out
Addw2 W = N + N // Generate carry
Addw3 W = N + N + N // Generate + propagate carry
Addw1 W = W + N // Propagate carry

Same for subtract wide.
The three Add forms are chosen to make multi-precision integer
multiply easier. See below.

Muluw W = N * N
Mulsw W = N * N

Divuw (quo,rem) = N / N
Divsw (quo,rem) = N / N

Shllw W = W << size // Shift left logical
Shlaw W = W << size // Shift left arithmetic, fault on signed overflow
Shrlw W = W >> size // Shift right logical
Shraw W = W >> size // Shift right arithmetic, sign extend
Shrnw W = W >> size // Shift right numeric, round -1 to zero

Bfextu N = extract (W, size, position) // Bit-field extract, zero extend
Bfexts N = extract (W, size, position) // Bit-field extract, sign extend
Bfins W = insert (W, N, size, position) // Bit-field insert

=====================================
Example unsigned 128 * 128 => 256 multiply:

// Unsigned Multiply 128*128 => 256
// (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
// Uses r4,r5,r6,r7,r8 as temp registers
//
muluw r5,r4 = r3*r0
muluw r6,r0 = r2*r0
muluw r8,r7 = r2*r1
muluw r3,r2 = r3*r1
addw3 r4,r1 = r4+r6+r7
addw3 r5,r2 = r5+r8+r2
addw2 r4,r2 = r2+r4
add3 r3 = r3+r5+r4

The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
than the even number register pairs R0,R2,R4... is because the above
sequence would require extra moves to form the even numbered pairs.
With separate pairs one can select registers so that everything lands
in the right dest at the right time.

From MitchAlsup1@21:1/5 to Brett on Thu Sep 19 00:35:03 2024

On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:

EricP <[email protected]> wrote:

Terje Mathisen wrote:

EricP wrote:

I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:

https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write
port, and does not use the result. You can get a throughput of 5 cycles
with smart coding, but that rarely happens without effort.

It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}

From Terje Mathisen@21:1/5 to EricP on Thu Sep 19 08:34:19 2024

EricP wrote:

Terje Mathisen wrote:

Very nice!

This means that you can do integer IMAC(), right?

(hi, lo) = imac(a, b, c); // == a*b+c

The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.

I thought about IMAC but it was a bit too much.
And unlike FMA there is no precision gain in IMAC, just convenience.
IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
care about overflow/carry on the accumulate.
2-wide = 2-wide + narrow * narrow

No, no! IMAC is three in, two out, so in your syntax:

W = N*N+N

or

(rhi, rlo) = imac(r0,r1,r2)

It needs 7 registers, 3 dest and 4 source if you want overflow/carry
on the accumulate.
3-wide = 2-wide + narrow * narrow

Otoh, if you do have all the wide add forms you outlined below,
including the "full adder" with three inputs and a wide/pair output,
then the carry propagations do become easier, and just doing

(a,b) = muluw(e,f)
(a,b) = addw1(a,b,g)

would do the same as my suggested

(a,b) = imac(e,f,g)
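
Both points can be checked numerically. The sketch below assumes 64-bit narrow registers and models muluw, addw1 and imac as plain Python functions written for this illustration: the muluw + addw1 pair gives the same (hi, lo) as imac(e,f,g) = e*f+g, and the worst case of a*b+c+d lands exactly on the largest double-width value.

```python
N = 64
MASK = (1 << N) - 1


def muluw(e, f):
    """W = N * N, unsigned: returns (hi, lo)."""
    p = e * f
    return p >> N, p & MASK


def addw1(hi, lo, g):
    """W = W + N: add a narrow value to a pair, propagating the carry."""
    s = ((hi << N) | lo) + g
    return (s >> N) & MASK, s & MASK


def imac(e, f, g):
    """W = N*N + N as one operation: returns (hi, lo)."""
    s = e * f + g
    return s >> N, s & MASK


# Worst case for IMAA: every input is all ones.
worst_imaa = MASK * MASK + MASK + MASK
fits_exactly = worst_imaa == (1 << (2 * N)) - 1
```

Since (2^N - 1)^2 + 2*(2^N - 1) = 2^(2N) - 1, a fifth all-ones addend would overflow the pair, which matches the "largest combination" remark.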

Anyway, very nice!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From EricP@21:1/5 to Brett on Thu Sep 19 11:07:11 2024

Brett wrote:

EricP <[email protected]> wrote:

I wanted to support checked arithmetic which means full width multiplies.
And I was always bothered by the risc approach of MULL (low part) and
MULH (high part) where they do most of the multiply then toss half away
just because they won't have 2 dest registers.

I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:

https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

They claim 5 cycles, should be six, five for the multiply and one more for the second result, unless the next instruction does not need a write port, and does not use the result. You can get a throughput of 5 cycles with
smart coding, but that rarely happens without effort.

That article is ignoring multiplier pipelining.
If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.
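
The 5-versus-6 arithmetic, as a minimal sketch: with a fully pipelined unit that accepts one operation per cycle, an operation issued in slot k delivers its result at cycle k + latency. The function name and the latency of 5 are taken as assumptions from the sentence above.

```python
def completion_cycle(latency, issue_slot):
    """Cycle at which an op issued in issue_slot (0-based) delivers its result."""
    return issue_slot + latency


LATENCY = 5
mull_done = completion_cycle(LATENCY, 0)   # MULL alone
mulh_done = completion_cycle(LATENCY, 1)   # MULH issued one cycle behind MULL
unpipelined_pair = 2 * LATENCY             # if the second had to wait for the first
```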

But those two multiplies still are tossing away 50% of their work.
And if it does fuse them then the internal uArch cost is the same as if
you had designed it optimally from the start, except now you have
to pay for a fuser.

<sound of soap box being dragged out>
This idea that macro-op fusion is some magic solution is bullshit.
1) It's not free.
2) It only works where Decode can see *all* the required lookahead
instructions, which means you have to pay for an N-lane decoder
but only get 1 lane.
3) It's probabilistic as it depends on how the fetch buffers get loaded.
E.g. if the fetch buffer contains a valid instruction but does not have
a next instruction, do you stall Decode to see if a fuser might arrive
or dispatch it anyway?
4) It gets exponentially expensive if you start doing multiple instruction
lanes because decode has to deal with all the permutations of
fusion possibilities.
5) Any fused instructions leave (multiple) bubbles that should be
compacted out or there wasn't much point to doing the fusion.

In my opinion it is better to have an ISA that is optimal by design
rather than being patched up by fusion later.

Some of this inefficiency is caused by clinging to now 40 year old
risc design *guidelines* (ie not even rules) that:
- instructions have at most 1 dest and 2 source registers
- register specifier fields are either source or dest, never both
- instructions should take at most 1 clock (they never did)

These self imposed design restrictions cause ISA designers to miss
some possible more optimal solutions. The result is things like
RISC-V's memory reference linkage structures taking 6 instructions
to build a 64-bit PC-relative address. And I'm pretty sure we won't
see any 6 instruction fusers for quite some time.

<sound of soap box being dragged back to cupboard>

From MitchAlsup1@21:1/5 to EricP on Thu Sep 19 16:01:48 2024

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

Brett wrote:

EricP <[email protected]> wrote:

They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write
port, and does not use the result. You can get a throughput of 5 cycles
with smart coding, but that rarely happens without effort.

That article is ignoring multiplier pipelining.
If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.

But those two multiplies still are tossing away 50% of their work.
And if it does fuse them then the internal uArch cost is the same as if
you had designed it optimally from the start, except now you have
to pay for a fuser.

You failed to recognize the critical part of my comment on this::

When the IMUL function unit sees MULL and MULH back to back AND
when both operands are the same for both instructions; it KNOWS
that the second multiply has the same result as the first and
thereby that the second multiply can be suppressed and the first
multiply used twice. {{In pure CMOS, if you drop the same operands
twice into the multiplier tree, the multiplier tree burns no power
in any event, just the operand delivery power.}}

You may call this fusion, but it is the very lowest level of it
and was not called such when first used.

<sound of soap box being dragged out>
This idea that macro-op fusion is some magic solution is bullshit.

Agreed

1) It's not free.

Far from it.

2) It only works where Decode can see *all* the required lookahead
instructions, which means you have to pay for an N-lane decoder
but only get 1 lane.

I think it is but a crutch for a misdesigned ISA

3) It's probabilistic as it depends on how the fetch buffers get loaded.
E.g. if the fetch buffer contains a valid instruction but does not have
a next instruction, do you stall Decode to see if a fuser might arrive
or dispatch it anyway?

It can be worse than that

4) It gets exponentially expensive if you start doing multiple instruction
lanes because decode has to deal with all the permutations of
fusion possibilities.

All the more reason to have a better ISA

5) Any fused instructions leave (multiple) bubbles that should be
compacted out or there wasn't much point to doing the fusion.

One of the interesting things I have noticed with my ISA is that
when one has a properly designed higher level ISA, one gets rid
of so many of the "easy to schedule" instructions that one ends
up with 30 FMAC instructions in a row, with no other instruction
to occupy any of the other function units.

In my opinion it is better to have an ISA that is optimal by design
rather than being patched up by fusion later.

Indeed.

Some of this inefficiency is caused by clinging to now 40 year old
risc design *guidelines* (ie not even rules) that:
- instructions have at most 1 dest and 2 source registers

Makes FMAC hard

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

- instructions should take at most 1 clock (they never did)

This never worked for floating point anyway...and many consider
branches and memory references as not fitting that tenet either.

What is required is that each instruction can be decoded in a single
cycle and delivered to whichever function unit in one cycle.

These self imposed design restrictions cause ISA designers to miss
some possible more optimal solutions. The result is things like
RISC-V's memory reference linkage structures taking 6 instructions
to build a 64-bit PC-relative address. And I'm pretty sure we won't
see any 6 instruction fusers for quite some time.

And it is just "so unnecessary".

I suspect that RISC-V will end up choosing AUIPC-LD-JMP instead,
losing the PIC nature of flow control.

Doing it right the first time is so much easier for everyone now
and down the line.

<sound of soap box being dragged back to cupboard>

From EricP@21:1/5 to All on Thu Sep 19 11:29:08 2024

MitchAlsup1 wrote:

On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:

EricP <[email protected]> wrote:

Terje Mathisen wrote:

EricP wrote:

I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:

https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write
port, and does not use the result. You can get a throughput of 5 cycles
with smart coding, but that rarely happens without effort.

It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}

Yes, but then you *require* a macro-op fuser to function efficiently. Probably... assuming it works.

OR one can give up the cherished 1-dest,2-source self imposed ISA design limitation and have a 32-bit instruction with four 5-bit registers,
2 source, 2 dest, leaving 12 bits for opcode and function code
that you know will calculate multiply once, and can write back
the result in 1 clock if it has two write ports (which it needs
anyway if it wants any hope of catching up after a stall bubble).

Also in the case of Alpha they only had unsigned MUL,MULH and
for signed multiply it had to use branchy code (pre-CMOV) to
do the signed correction subtracts, so fusion would be too complex.
That design decision is as baffling as HP-PA originally leaving
a MUL instruction out entirely because "it violated the 1-clock per
instruction design philosophy". (HP quickly fixed it, but still...)

From Thomas Koenig@21:1/5 to EricP on Thu Sep 19 18:46:04 2024

EricP <[email protected]> schrieb:

And I'm pretty sure we won't
see any 6 instruction fusers for quite some time.

That would probably blow a fuse.

SCNR,

From Brett@21:1/5 to [email protected] on Thu Sep 19 19:12:41 2024

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

Brett wrote:

EricP <[email protected]> wrote:

They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write
port, and does not use the result. You can get a throughput of 5 cycles
with smart coding, but that rarely happens without effort.

That article is ignoring multiplier pipelining.
If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.

But those two multiplies still are tossing away 50% of their work.
And if it does fuse them then the internal uArch cost is the same as if
you had designed it optimally from the start, except now you have
to pay for a fuser.

You failed to recognize the critical part of my comment on this::

When the IMUL function unit sees MULL and MULH back to back AND
when both operands are the same for both instructions; it KNOWS
that the second multiply has the same result as the first and
thereby that the second multiply can be suppressed and the first
multiply used twice. {{In pure CMOS, if you drop the same operands
twice into the multiplier tree, the multiplier tree burns no power
in any event, just the operand delivery power.}}

You may call this fusion, but it is the very lowest level of it
and was not called such when first used.

<sound of soap box being dragged out>

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine that a register
field can be shared by loads and stores, and sometimes both, like x86?

Classic RISC says the loads are critical, but no one is one wide today, so stores matter for deconfliction…. And does stuff just fall out right to
allow both?

From Brett@21:1/5 to EricP on Thu Sep 19 19:12:42 2024

EricP <[email protected]> wrote:

MitchAlsup1 wrote:

On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:

EricP <[email protected]> wrote:

Terje Mathisen wrote:

EricP wrote:

I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:

https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

They claim 5 cycles, should be six, five for the multiply and one more
for the second result, unless the next instruction does not need a write
port, and does not use the result. You can get a throughput of 5 cycles
with smart coding, but that rarely happens without effort.

It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}

Yes, but then you *require* a macro-op fuser to function efficiently. Probably... assuming it works.

OR one can give up the cherished 1-dest,2-source self imposed ISA design limitation and have a 32-bit instruction with four 5-bit registers,
2 source, 2 dest, leaving 12 bits for opcode and function code
that you know will calculate multiply once, and can write back
the result in 1 clock if it has two write ports (which it needs
anyway if it wants any hope of catching up after a stall bubble).

You already have 2 source, 2 dest if you have load with address update.
A low end CPU is going to have a shared INT/FPU pipeline so you have the hardware to do three sources for MAC. You might as well do 3 source 2 dest
on the int side as well. And ARM does Add with Shift which is 3 sources,
though one is a constant if you want one cycle uncracked throughput in most designs.

Also in the case of Alpha they only had unsigned MUL,MULH and
for signed multiply it had to use branchy code (pre-CMOV) to
do the signed correction subtracts, so fusion would be too complex.
That design decision is as baffling as HP-PA originally leaving
a MUL instruction out entirely because "it violated the 1-clock per instruction design philosophy". (HP quickly fixed it, but still...)

From MitchAlsup1@21:1/5 to Brett on Thu Sep 19 20:21:20 2024

On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine a register field can be shared by loads and stores, and sometimes both like x86.

My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.

I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.

Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::

ADD SP,SP,#0x40

by specifying SP only once in the instruction.

So,

+------+-----+-----+----------------+
| major| Rd  | Rs1 |    whatever    |
+------+-----+-----+----------------+
| BC   | cnd | Rs1 |  label offset  |
+------+-----+-----+----------------+
| LD   | Rd  | Rb  |  displacement  |
+------+-----+-----+----------------+
| ST   | Rs0 | Rb  |  displacement  |
+------+-----+-----+----------------+

Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}

I disagree with things like::

+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+

Where Rds means the specifier is used as both a source and destination.

Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.
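
A minimal sketch of the "wire the fields straight to the ports" property, assuming a hypothetical 32-bit layout (6-bit major opcode, two 5-bit register fields, 16-bit immediate). The field widths and opcode values are made up for illustration and are not taken from the My 66000 or Mc88100 documentation. The register specifiers are sliced out by bit position alone; only the later "what role does this field play" signal depends on the opcode.

```python
# Hypothetical 32-bit layout (assumed for illustration only):
#   [31:26] major opcode, [25:21] field A, [20:16] field B, [15:0] imm16
LD, ST = 0x20, 0x28  # made-up opcode values


def encode(major, a, b, imm16):
    """Pack the four fields into one 32-bit instruction word."""
    return (major << 26) | (a << 21) | (b << 16) | (imm16 & 0xFFFF)


def slice_fields(word):
    """Extract both register specifiers by bit position; no opcode needed."""
    return (word >> 21) & 0x1F, (word >> 16) & 0x1F


def decode_role(word):
    """Decided later in decode: is field A a result or the store data?"""
    major = (word >> 26) & 0x3F
    return "store-data source" if major == ST else "destination"


ld = encode(LD, 7, 3, 0x40)
st = encode(ST, 7, 3, 0x40)
print(slice_fields(ld), decode_role(ld))
print(slice_fields(st), decode_role(st))
```

In both cases the same bits name register 7 and register 3; what differs is only the role assigned once the opcode has been decoded.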

Classic RISC says the loads are critical, but no one is one wide today,

SiFive disagrees with you.

so
stores matter for deconfliction…. And does stuff just fall out right to allow both?

Can you restate what you wanted to say using different words or perhaps
give an example ??


From MitchAlsup1@21:1/5 to Brett on Thu Sep 19 20:30:38 2024

On Thu, 19 Sep 2024 19:12:42 +0000, Brett wrote:

EricP <[email protected]> wrote:

MitchAlsup1 wrote:

It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}

Yes, but then you *require* a macro-op fuser to function efficiently.
Probably... assuming it works.

OR one can give up the cherished 1-dest,2-source self imposed ISA design
limitation and have a 32-bit instruction with four 5-bit registers,
2 source, 2 dest, leaving 12 bits for opcode and function code
that you know will calculate multiply once, and can write back
the result in 1 clock if it has two write ports (which it needs
anyway if it wants any hope of catching up after a stall bubble).

You already have 2 source, 2 dest if you have load with address update.
A low end CPU is going to have a shared INT/FPU pipeline so you have the hardware to do three sources for MAC. You might as well do 3 source 2
dest on the int side as well. And ARM does Add with Shift which is 3
sources, though one is a constant if you want one cycle uncracked
throughput in most designs.

Once you bite off on a shared INT/FP multiplier, and that the FP
multiplier has to do FMAC, you HAVE 3-operand busses leaving the
decoder stage.

Those 3 operand busses give you [Rbase,Rindex<<scale,#displacement]
memory reference address mode. You can say you only use it 2%
of the time, but every time you can't use it and need it; it costs
1-2 additional instructions--multiplying the 2% into the 5% range
making it worthwhile even if you only save ICache misses.

So, if you do FMAC, you have the bussing to do efficient Mem Refs.
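
A rough sketch of the trade-off just described, assuming 64-bit wrapping addresses. The function names are hypothetical. The first function models the three-input address calculation as one step; the second builds the same address from two-input operations and reports how many extra instructions that took.

```python
MASK64 = (1 << 64) - 1


def effective_address(rbase, rindex, scale, disp):
    """[Rbase + Rindex<<scale + disp] in one step, wrapped to 64 bits."""
    return (rbase + (rindex << scale) + disp) & MASK64


def effective_address_split(rbase, rindex, scale, disp):
    """Same address from two-input ops; returns (address, extra instructions)."""
    t = (rindex << scale) & MASK64   # extra instruction 1: shift
    t = (rbase + t) & MASK64         # extra instruction 2: add
    return (t + disp) & MASK64, 2    # the base+disp memory reference itself


print(hex(effective_address(0x1000, 5, 3, 16)))
print(effective_address_split(0x1000, 5, 3, 16))
```

Both paths produce the same address; the split version simply spends two more instruction slots getting there, which matches the "1-2 additional instructions" estimate above.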

In addition:: if you have a pipelined FMAC unit, why NOT use it
for integer Multiplication ??

Additionally:: if you have a high performance FDIV unit, you can
borrow it for integer division at little cost--no matter if it
is in the FMAC unit or if it is a separate unit from FMAC.

Given the 3-operand busses:: one can have 128/64 in the divisor
at virtually no cost of calculation.

THEREFORE: once you have 3-operand busses to support FMAC you
should get all the bang out of them that you paid for.


From Brett@21:1/5 to [email protected] on Fri Sep 20 00:12:48 2024

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.

My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.

I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.

Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::

ADD SP,SP,#0x40

by specifying SP only once in the instruction.

So,

+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+

Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}

I disagree with things like::

+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+

Where Rds means the specifier is used as both a source and destination.

Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.

Classic RISC says the loads are critical, but no one is one wide today,

SiFive disagrees with you.

so
stores matter for deconfliction…. And does stuff just fall out right to
allow both?

Can you restate what you wanted to say using different words or perhaps
give an example ??

A series of adds to the same register in a four wide design.

A = A + 1
A = A + B
A = A + C
A = A + D


From MitchAlsup1@21:1/5 to Brett on Fri Sep 20 01:05:34 2024

On Fri, 20 Sep 2024 0:12:48 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.

My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.

I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.

Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::

ADD SP,SP,#0x40

by specifying SP only once in the instruction.

So,

+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+

Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}

I disagree with things like::

+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+

Where Rds means the specifier is used as both a source and destination.

Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.

Classic RISC says the loads are critical, but no one is one wide today,

SiFive disagrees with you.

so
stores matter for deconfliction…. And does stuff just fall out right to
allow both?

Can you restate what you wanted to say using different words or perhaps
give an example ??

A series of adds to the same register in a four wide design.

A = A + 1
A = A + B
A = A + C
A = A + D

Which any good compiler should emit as::

T1 = A + B
T2 = C + D
A = LEA( T1, T2, #1 )

With a 2 cycle latency instead of 4.
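
A small sketch of the latency claim, assuming every operation (including the three-input LEA) takes one cycle and that issue width is not the limit. The helper name and the tuple format are illustrative only.

```python
def chain_depth(ops):
    """ops: list of (dest, sources) in program order.
    Returns the longest dependency chain in cycles, one cycle per op."""
    ready = {}  # register name -> cycle at which its value is available
    depth = 0
    for dest, sources in ops:
        start = max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = start + 1
        depth = max(depth, ready[dest])
    return depth


serial = [("A", ["A"]), ("A", ["A", "B"]), ("A", ["A", "C"]), ("A", ["A", "D"])]
reassociated = [("T1", ["A", "B"]), ("T2", ["C", "D"]), ("A", ["T1", "T2"])]
print(chain_depth(serial), chain_depth(reassociated))
```

The serial form is a chain of four dependent adds; the reassociated form has two independent adds feeding one LEA, so its critical path is two operations long.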


From Brett@21:1/5 to [email protected] on Fri Sep 20 03:31:36 2024

MitchAlsup1 <[email protected]> wrote:

On Fri, 20 Sep 2024 0:12:48 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.

My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.

I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.

Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::

ADD SP,SP,#0x40

by specifying SP only once in the instruction.

So,

+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+

Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}

I disagree with things like::

+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+

Where Rds means the specifier is used as both a source and destination.

Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.

Classic RISC says the loads are critical, but no one is one wide today,

SiFive disagrees with you.

so
stores matter for deconfliction…. And does stuff just fall out right to
allow both?

Can you restate what you wanted to say using different words or perhaps
give an example ??

A series of adds to the same register in a four wide design.

A = A + 1
A = A + B
A = A + C
A = A + D

Which any good compiler should emit as::

T1 = A + B
T2 = C + D
A = LEA( T1, T2, #1 )

With a 2 cycle latency instead of 4.

The point was that you have three renames of A, so you can’t just blindly load the first A for all instructions. This takes gate time to determine,
you can’t ignore the store field until later.
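
To make the rename point concrete, here is a hedged sketch of renaming a four-instruction group. The map-table representation and physical register names are assumptions for illustration. The sequential loop stands in for what hardware does with a comparator network across the group: each source has to check whether an earlier instruction in the same group writes it before falling back to the map table.

```python
def rename_group(group, map_table, next_phys):
    """group: list of (dest, sources) architectural names in program order.
    Returns (renamed ops, updated map). Sources see earlier in-group writers."""
    table = dict(map_table)
    renamed = []
    for dest, sources in group:
        # Look up sources in the mapping as updated by earlier instructions
        # of this group, not the mapping that was valid at group entry.
        phys_sources = [table[s] for s in sources]
        phys_dest = f"p{next_phys}"
        next_phys += 1
        table[dest] = phys_dest
        renamed.append((phys_dest, phys_sources))
    return renamed, table


initial = {"A": "p0", "B": "p1", "C": "p2", "D": "p3"}
group = [("A", ["A"]), ("A", ["A", "B"]), ("A", ["A", "C"]), ("A", ["A", "D"])]
renamed, final = rename_group(group, initial, 10)
for op in renamed:
    print(op)
```

Only the first add reads the A that was live at group entry (p0); the other three read the physical register allocated by the instruction just before them.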


From Terje Mathisen@21:1/5 to Chris M. Thomasson on Fri Sep 20 09:40:58 2024

Chris M. Thomasson wrote:

On 9/19/2024 12:15 PM, BGB wrote:

On 9/19/2024 2:04 AM, Robert Finch wrote:

On 2024-09-18 10:30 p.m., BGB wrote:

On 9/18/2024 2:29 PM, Chris M. Thomasson wrote:

On 9/18/2024 1:13 AM, David Brown wrote:

On 17/09/2024 20:18, MitchAlsup1 wrote:

On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

On 17 Sep 2024, Stefan Monnier wrote
(in article<[email protected]>):

With all respect to the regulars here, most people in
technical Usenet
groups are either old, unusually nerdy, or both.

I plead guilty to nerdy, but as for old, I'm still 27 (and
that's been
true for more than 20 years).

Stefan

Hi Stefan!
At least equally nerdy, I should think, but 50 years older.
(Older, not old!)

At 71 real years old I still operate as if I were <let's say> 21.

You are not 71, you are merely 0x47 :-)

LOL! :^)

Not going to say my exact age, but if I wrote my age in hex I could
almost try to pass myself off as an early Zoomer (rather than as a
millennial...).

...

I think I am early GenX. 59 and still learning loads of stuff.
Old enough to remember tube TVs and radios. Transistorized pocket
radio were a big thing.

In my case, my childhood was mostly in the era of Win 3.x and Win 9x
PCs, and early dial-up internet (unlike most Zoomers, I remember a
time before YouTube).

[...]

I remember way back wrt compuserve. :^)

BIX (Byte Information eXchange)!

I believe my id/mail was terjem (@bix.com), but it could have been tma
or terje.

I had some wonderful discussions with Mike Abrash and other x86 asm
programmers there.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to Brett on Fri Sep 20 10:02:55 2024

Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.

My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.

I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.

Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::

ADD SP,SP,#0x40

by specifying SP only once in the instruction.

So,

+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+

Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}

I disagree with things like::

+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+

Where Rds means the specifier is used as both a source and destination.

Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.

Classic RISC says the loads are critical, but no one is one wide today,

SiFive disagrees with you.

so
stores matter for deconfliction…. And does stuff just fall out right to
allow both?

Can you restate what you wanted to say using different words or perhaps
give an example ??

A series of adds to the same register in a four wide design.

A = A + 1
A = A + B
A = A + C
A = A + D

That's a compiler issue, not a HW architecture problem imho:

lea rega,[rega+regb+1]
lea temp,[regc+regd]

add rega,temp

is 2 cycles, using two ports for the first cycle.

If you have an add3 opcode, then you can do it in a single lane.

Please note that I'm assuming either -fwrapv wrapping signed or just
regular unsigned adds.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From EricP@21:1/5 to All on Fri Sep 20 09:52:32 2024

MitchAlsup1 wrote:

On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.

My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.

I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.

Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::

ADD SP,SP,#0x40

by specifying SP only once in the instruction.

So,

+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+

Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}

I disagree with things like::

+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+

Where Rds means the specifier is used as both a source and destination.

Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.

I assume in your examples that you want to start your register file
read access and or rename register lookup access in the decode stage,
and not wait to start at the end of the decode stage.
Effectively pipelining those accesses.
That's fine.

But that's my point - it doesn't make a difference because in both
cases you can wire the reg fields to the reg file or rename directly
and start the access ASAP.
In both cases the enable signal determining what to do shows up
later after decode has done its thing. And the critical path for
that decode enable signal is the same both ways.

And if you are not doing this early access start but the traditional
approach of latching the decode output THEN starting your RegRd or Rename
access, it makes no timing difference at all.

By allowing the opcode-Rds style instructions to be *CONSIDERED*
it opens an avenue to potential instructions that cost little or
nothing extra in terms of logic or performance.

And this is particularly useful with fixed width 32-bit instructions
where one is trying to pack as much function into a fixed size space as
possible. Even more so with 16-bit compact instructions.

For example, a 32-bit fixed format instruction with four 5-bit registers
could do a full width integer multiply wide-accumulate

IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2

with little more logic than the existing MULL,MULH approach.
It still only needs 2 read ports because Rs1,Rs2 are read first to start
the multiply, then (Rsd_hi,Rsd_lo) second as they aren't needed until
late in the multiply-accumulate.
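
A minimal behavioural sketch of the IMAC idea, assuming unsigned 64-bit registers and a 128-bit accumulator that wraps on overflow. Signedness and overflow behaviour are not specified above, so both are assumptions here, and the function name is illustrative.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def imac(rsd_hi, rsd_lo, rs1, rs2):
    """(Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2, unsigned, wrapping."""
    acc = (rsd_hi << 64) | rsd_lo            # accumulator read late
    acc = (acc + rs1 * rs2) & MASK128        # one multiply, one wide add
    return acc >> 64, acc & MASK64           # two results written back


print(imac(0, MASK64, 1, 1))
print([hex(v) for v in imac(0, 0, MASK64, MASK64)])
```

The two register pairs appear as both sources and destinations, which is exactly the Rsd-style reuse under discussion.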


From MitchAlsup1@21:1/5 to EricP on Fri Sep 20 17:39:34 2024

On Fri, 20 Sep 2024 13:52:32 +0000, EricP wrote:

MitchAlsup1 wrote:

On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

- register specifier fields are either source or dest, never both

I happen to be wishywashy on this

This is deeply interesting, can you expound on why it is fine a register
field can be shared by loads and stores, and sometimes both like x86.

My 66000 encodes store data register in the same field position as it
encodes "what kind of branch" is being performed, and the same position
as all calculation (and load) results.

I started doing this in 1982 with Mc88100 ISA, and never found a problem
with the encoding nor in the decoding nor with the pipelining of it.

Let me be clear, I do not support necessarily damaging a source operand
to fit in another destination as::

ADD SP,SP,#0x40

by specifying SP only once in the instruction.

So,

+------+-----+-----+----------------+
| major| Rd | Rs1 | whatever |
+------+-----+-----+----------------+
| BC | cnd | Rs1 | label offset |
+------+-----+-----+----------------+
| LD | Rd | Rb | displacement |
+------+-----+-----+----------------+
| ST | Rs0 | Rb | displacement |
+------+-----+-----+----------------+

Is:
a) no burden in encoding
b) no burden in decoding
c) no burden in pipelining
d) no burden in stealing the Store data port late in the pipeline
{in particular, this saves lots of flip-flops deferring store
data until after cache hit, TLB hit, and data has arrived at
cache.}

I disagree with things like::

+------+-----+-----+----------------+
| big OpCode | Rds | whatever |
+------+-----+-----+----------------+

Where Rds means the specifier is used as both a source and destination.

Notice in my encoding one can ALWAYS take the register specification
fields and wire them directly into the RF/renamer decoder ports.
You lose this property the other way around.

I assume in your examples that you want to start your register file
read access and or rename register lookup access in the decode stage,
and not wait to start at the end of the decode stage.
Effectively pipelining those accesses.
That's fine.

But that's my point - it doesn't make a difference because in both
cases you can wire the reg fields to the reg file or rename directly
and start the access ASAP.

Not when a source field and a destination field are the same
field sometimes but not always. Your thought train adds a
register specifier mux between the destination field and
the overused source field in front of the destination
rename port. It is not a BIG hindrance, but it is not
insignificant if you are doing a "balls to the walls"
design.

In both cases the enable signal determining what to do shows up
later after decode has done its thing. And the critical path for
that decode enable signal is the same both ways.

And if you are not doing this early access start but the traditional
approach of latching the decode output THEN starting your RegRd or Rename
access, it makes no timing difference at all.

By allowing the opcode-Rds style instructions to be *CONSIDERED*
it opens an avenue to potential instructions that cost little or
nothing extra in terms of logic or performance.

The actual calculations are easy, it is the routing of data
to and from the calculation that is hard.

And this is particularly useful with fixed width 32-bit instructions
where one is trying to pack as much function into a fixed size space as
possible. Even more so with 16-bit compact instructions.

RISC-V, because of where the various fields ARE, has a mux between
every source field and every register port--simply because their
positions move between non-compressed and compressed.

I agree with the position that if the mux is already there
that one should use it often and greatly.

Where I disagree is that the mux HAS to be there.

For example, a 32-bit fixed format instruction with four 5-bit registers could do a full width integer multiply wide-accumulate

IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2

This violates the RISC tenet where each calculation instruction
produces exactly 1 result. I get around this with the mechanical
definition of the CARRY instruction. The MUL instruction produces
its result, CARRY captures the other, and deposits it in RF when
possible.

with little more logic than the existing MULL,MULH approach.
It still only needs 2 read ports because Rs1,Rs2 are read first to start
the multiply, then (Rsd_hi,Rsd_lo) second as they aren't needed until
late in the multiply-accumulate.


From MitchAlsup1@21:1/5 to BGB on Fri Sep 20 20:34:01 2024

On Fri, 20 Sep 2024 2:09:35 +0000, BGB wrote:

On 9/18/2024 1:42 PM, MitchAlsup1 wrote:

One simple option would be to assume an instruction looks like:
[Prefix Bytes]
[REX byte]
OP_Byte | 0F+OP_Byte
Mod/RM + SIB + ...

No, the simple option is that an instruction looks like:

+------+-----+-----+----------------+
| major| Rd | Rs1 | imm16 |
+------+-----+-----+----------------+
| mem | Rd | Rb | disp16 |
+------+-----+-----+----------------+
| Bcnd | cnd | Rs1 | disp18 |
+------+-----+-----+----------------+
| 2OP | Rd | Rs1 |mods| 2op | Rs2 |
+------+-----+-----+----------------+
| 3OP | Rd | Rs1 | Rs3 | 3op| Rs2 |
+------+-----+-----+----------------+

And then use a heuristic to try to guess how to interpret the
instruction stream based on "looks better" (more likely to be aligned
with the instruction stream vs random unaligned garbage).

Though, such a "looks good" heuristic could itself risk skewing the
results.

I may still consider defining an encoding for this, but not yet. It is
in a similar boat as auto-increment. Both add resource cost with
relatively little benefit in terms of overall performance.
Auto-increment because if one has superscalar, the increment can usually
be co-executed. And, full [Rb+Ri*Sc+Disp], because it is just too
infrequent to really justify the extra cost of a 3-way adder even if
limited mostly to the low-order bits...

Myopathy--look it up.

OK.

Not sure how that is related (a medical condition involving muscle defects...).

Myopathy is NEAR SIGHTEDNESS.

You are not looking far enough into the future to avoid problems in your
ISA and architecture. {I did the same in my youth. almost everyone
does.}

Can also note that a worthwhile design goal is to not add significant
cost over what would be needed for a plain RV64GC implementation, but,
could define a [Rb+Ri*Sc+Disp] encoding or similar if it would likely be beneficial enough to justify its existence.

486 showed that "[Rbase+Rindex<<scale+displacement]:segment" could all
be performed in a single cycle at a frequency competitive with the RISC processors available at the time.


From Niklas Holsti@21:1/5 to All on Sat Sep 21 10:45:47 2024

On 2024-09-20 23:34, MitchAlsup1 wrote:

Myopathy is NEAR SIGHTEDNESS.

Perhaps you meant "myopia", https://en.wikipedia.org/wiki/Myopia.


From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Sep 22 22:19:15 2024

On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:

On 9/19/24 11:07 AM, EricP wrote:
[snip]

If the multiplier is pipelined with a latency of 5 and throughput
of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.

But those two multiplies still are tossing away 50% of their work.

I do not remember how multipliers are actually implemented -- and
am not motivated to refresh my memory at the moment -- but I
thought a multiply low would not need to generate the upper bits,
so I do not understand where your "50% of their work" is coming
from.

+-----------+ +------------+
\ mplier / \ mcand / Big input mux
+--------+ +--------+
| |
| +--------------+
| / /
| / /
+-- / /
/ Tree /
/ /--+
/ / |
/ / |
+---------------+-----------+
hi low Products

two n-bit operands are multiplied into a 2×n-bit result.
{{All the rest is HOW not what}}
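
As a behavioural illustration only (it says nothing about how the tree is built), here is a sketch of one unsigned multiply yielding both halves, assuming n-bit unsigned operands. The function name is hypothetical.

```python
def mul_wide(a, b, n=64):
    """Multiply two unsigned n-bit values once; return (hi, lo) halves."""
    mask = (1 << n) - 1
    product = (a & mask) * (b & mask)   # the full 2n-bit product
    return product >> n, product & mask


print(mul_wide(200, 200, 8))
print([hex(v) for v in mul_wide(0xFFFFFFFF, 0xFFFFFFFF, 32)])
```

A separate MUL and MULH each recompute this same product and keep only one element of the pair.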

The high result needs the low result carry-out but not the rest of
the result. (An approximate multiply high for multiply by
reciprocal might be useful, avoiding the low result work. There
might also be ways that a multiplier could be configured to also
provide bit mixing similar to middle result for generating a
hash?)

I seem to recall a PowerPC implementation did semi-pipelined
32-bit multiplication 16-bits at a time. This presumably saved area
and power

You save 1/2 of the tree area, but ultimately consume more power.

while also facilitating early out for small
multiplicands,

Dadda showed that doubling the size of the tree only adds one
4-2 compressor delay to the whole calculation.

at the cost of some latency and substantial
throughput compared to a fully pipelined multiplication.

Throughput that the rest of the engine could not use.

If I
remember correctly, this produced a result for 16-bit by 32-bit multiplication, which is different from generating a low or high
result.

And if it does fuse them then the internal uArch cost is the same
as if
you had designed it optimally from the start, except now you have
to pay for a fuser.

<sound of soap box being dragged out>
This idea that macro-op fusion is some magic solution is bullshit.

The argument is, at best, of Academic Quality, made by a student
at the time as a way to justify RISC-V not having certain easy
for HW to perform calculations.

1) It's not free.

Neither is increasing the number of opcodes or providing extender
prefixes. If one wants binary compatibility, non-fusing
implementations would work.

I did neither and avoided both.

(I tend to favor providing a translation layer between software
distribution format and instruction cache format, which reduces
the binary compatibility constraint.)

2) It only works where Decode can see *all* the required lookahead
   instructions, which means you have to pay for an N-lane decoder
   but only get 1 lane.

Most fusion is for two adjacent instructions, which significantly
limits the complexity.

To quadratic {BigO( instruction-OpCode-bits ** 2)}

The fusable patterns are also a subset of
all pairs of two instructions, so complete two-way decoding may
not be needed.

There may also be optimization opportunities from looking ahead.
Mitch Alsup proposed such for branch handling in a scalar
implementation.

I use this, to be clear, as a means to eliminate any need of the
branch delay slot in smaller narrow machines.

Apart from fusion, there might be advantages for
avoiding bank conflicts in a banked register file. I.e., the cost
of lookahead might be shared by multiple techniques/optimizations.

I tend to agree that fusion tends to be a workaround for
sub-optimal instruction encoding, but it seems that encoding involves
a lot of tradeoffs.

3) It's probabilistic as it depends on how the fetch buffers get loaded.
   Eg if the fetch buffer contains a valid instruction but does not have
   a next instruction, do you stall Decode to see if a fuser might arrive
   or dispatch it anyway.

This is also somewhat true for variable length encodings that
cross fetch boundaries.

In My 1-wide machine, the only time this comes up is when a
long instruction crosses into a new cache line (or page) and
the cache (or TLB) takes a miss.

In general a boundary-crossing instruction
would probably stall even if such was not strictly necessary
(e.g., if the missing information is opcode refinement -- not
related to instruction routing -- or an immediate or even a
register source identifier specifying a value that can have
delayed use (e.g., value of a store, addend of a FMADD)).

In my case, immediate data for a ST is not needed until the ST
has retired, so a) it is placed last, b) delay can be tolerated
as long as the pipeline depth.

This does seem a weakness, but fusion is not entirely negative
factors.

4) It gets exponentially expensive if you start doing multiple instruction
   lanes because decode has to deal with all the permutations of
   fusion possibilities.

Fusion in an already variable length RISC ISA is already exponential.

This is also a factor in mere superscalar decode/execute.
Detecting that an instruction is dependent on another would
normally stall the execution of that instruction.

(I feel that encoding some of the dependency information could
be useful to avoid some of this work. In theory, common
dependency detection could also be more broadly useful; e.g.,
operand availability detection and execution/operand routing.)

So useful that it is encoded directly in My 66000 ISA.

5) Any fused instructions leave (multiple) bubbles that should be
   compacted out or there wasn't much point to doing the fusion.

Even with reduced operations per cycle, fusion could still provide
a net energy benefit.

Here I disagree:: but for a different reason::

In order for RISC-V to use a 64-bit constant as an operand, it has
to execute either:: AUIPC-LD to an area of memory containing the
64-bit constant, or a 6-7 instruction stream to build the constant
inline. An ISA that directly supports 64-bit constants does not
execute any of those.

Thus, while fusion may save power when seen at the "it's my ISA"
level, when seen from the perspective of "it is directly supported
in my ISA" it wastes power.

There is NO less power-expensive way to deliver a constant into
execution than from the instruction stream directly to the function
unit performing the calculation.

In my opinion it is better to have an ISA that is optimal by design
rather than being patched up by fusion later.

Fusion is mostly presented for "patching up", but there are also
considerations of diverse microarchitectures. With pre-fused
instructions, an implementation might need to crack some of those
instructions. Software optimized for such an implementation might
also prefer more flexible compile-time scheduling of pre-cracked
operations.

Agreed:: there is a cost of implementing a means by which large
constants can be used in the instruction. I argue that this is
a) only apparent in the smallest implementations, b) is smaller
than the cost in cycles and power that fusion requires.

A load-op instruction is perhaps particularly difficult because
one needs frequent stalls, a skewed (or second chance) pipeline to
hide the load latency, out-of-order execution, or some other stall
avoidance mechanism.

There are also constraints in encoding granularity.

Some of this inefficiency is caused by clinging to now 40 year old
risc design *guidelines* (ie not even rules) that:
- instructions have at most 1 dest and 2 source registers

FMADD seems to have mostly killed the 2-source limit. AArch64's
paired load removes the 2 destination limit. (Paired destinations
were common for early double precision implementations.)

FMAD also provides the operand bussing to support the::
mem rd,[Rbase+Rindex<<scale+disp]
addressing mode.

But this was already possible since "disp" always comes from the
instruction, and only goes to the AGEN unit.

FMAD just got rid of all the other excuses not to do the right
thing.

- register specifier fields are either source or dest, never both

This seems mostly a code density consideration. I think using a
single name for both a source and a destination is not so
horrible, but I am not a hardware guy.

All we HW guys want is that, wherever a field is specified,
it is specified in exactly 1 field in the instruction. So, if
field<a..b> is used to specify Rd in one instruction, there is
no other field<!a..!b> that specifies the Rd register. RISC-V blew
this "requirement".

- instructions should take at most 1 clock (they never did)

That was clearly overconstraining.

These self imposed design restrictions cause ISA designers to miss
some possible more optimal solutions. The result is things like
RISC-V's memory reference linkage structures taking 6 instructions
to build a 64-bit PC-relative address. And I'm pretty sure we won't
see any 6 instruction fusers for quite some time.

I very much doubt a compiler would generate such outside of some
real-time application where the time constancy might justify the
code bloat.

<sound of soap box being dragged back to cupboard>

I do not mean my response to be heckling. Your points are very
true. However, I think fusion is a technique — like cracking —
that is a natural part of an architect's toolbox.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to David Brown on Mon Sep 23 11:45:16 2024

On 9/16/2024 4:12 AM, David Brown wrote:

big snip

With all respect to the regulars here, most people in technical Usenet
groups are either old, unusually nerdy, or both.

Of course, that is true, but it raises some questions.

Are there fewer younger people interested in computer architecture? I
guess this is possible, since the number of new architectures seems to
be declining, thus interest might be too.

Are the younger people discussing computer architecture in the way we
do, but are doing it in other places? If so, where? I know that web
based forums are more "user friendly" than Usenet, but does that explain
the difference? Does wherever they are going provide the same quality
of discussion that comp.arch does?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB-Alt on Tue Sep 24 00:12:43 2024

On Mon, 23 Sep 2024 23:16:08 +0000, BGB-Alt wrote:

On 9/22/2024 3:43 PM, Paul A. Clayton wrote:

On 9/19/24 11:07 AM, EricP wrote:

I tend to agree that fusion tends to be a workaround for sub-
optimal instruction encoding, but it seems that encoding involves
a lot of tradeoffs.

Yeah...

However, the cost of doing fusion is higher than having longer-form
variable-length instructions via prefixes...

If one wants a cheapish way to do prefixes on a 1-wide machine, they
could transpose the instruction words during fetch, and then only need
a single decoder.

So:
WordA
PrefixA WordB
PrefixA PrefixB WordC

Is presented to the decoder as:
WordA
WordB PrefixA
WordC PrefixB PrefixA

So, the decoder doesn't move...
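
A minimal sketch of the transposition described above, in C. It assumes
32-bit parcels and that fetch already knows how many parcels belong to
one instruction; both are assumptions made for illustration, and the
function name is hypothetical. Reversing the parcels puts the base word
first, so a single decoder always looks at position 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Reverse the parcels of ONE instruction in place, so that a stream
 * laid out as  PrefixA PrefixB WordC  reaches the decoder as
 * WordC PrefixB PrefixA.  `len` is the parcel count of this single
 * instruction (assumed to be known by the fetch stage).
 */
static void transpose_instruction(uint32_t *parcels, size_t len)
{
    assert(parcels != NULL || len == 0);
    if (len < 2)
        return;                      /* a bare word needs no shuffling */
    for (size_t lo = 0, hi = len - 1; lo < hi; lo++, hi--) {
        uint32_t tmp = parcels[lo];
        parcels[lo] = parcels[hi];
        parcels[hi] = tmp;
    }
}
```

Calling it on a three-parcel instruction {PrefixA, PrefixB, WordC}
leaves WordC in slot 0, which is the fixed position the decoder reads.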

Exactly my reasoning wrt constants

INSTA
INSTB DISP32 DISP64 SDATA32 SDATA64
INSTC SDATA32

Possibly, a similar trick could be used for 2-wide with limited variable-length, but would get more complicated.

<snip>

FMADD seems to have mostly killed the 2-source limit. AArch64's
paired load removes the 2 destination limit. (Paired destinations
were common for early double precision implementations.)

IMHO:
RISC-V not having register-index load/store, while having things like
FMADD, is kinda stupid. Having advanced features while taking a big hit
on the lack of cheap features is not ideal.

I had recently been working on getting BGBCC to target RISC-V (generated
code still not fully working, but the compiler is now able to do the
compiler thing at least).

However, with all of the limits that RISC-V imposes, BGBCC is currently
generating output that is around 43% bigger in RISC-V mode than BJX2-XG2
mode (or around 56% bigger than baseline mode).

My 66000 tends to use only 72% of the instructions needed by RISC-V;
1/0.72 = 1.39, i.e. 39% more instructions for RISC-V. Almost the same
number.
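
The arithmetic behind that figure, as a small C sketch. The helper name
is hypothetical; it only restates the conversion from "uses a fraction f
of the instructions" to "the other side needs (1/f - 1) more".

```c
#include <assert.h>

/*
 * If ISA A needs `fraction` (0 < fraction <= 1) of the instructions
 * that ISA B needs, return how many percent MORE instructions B needs
 * relative to A.
 */
static double percent_more(double fraction)
{
    assert(fraction > 0.0);
    return (1.0 / fraction - 1.0) * 100.0;
}
```

With fraction = 0.72 the result is a little under 39, matching the
rounded number in the post.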

This is kinda terrible...

"kinda" is unwarranted in that statement.
<snip>

So, say, 6 instructions for a 64-bit constant load, or around 4
instructions to load/store a global variable (relative to GP), 4
instructions whenever the 12-bit displacement fails, ...

0 instructions in My 66000. Constants are simply operands fed from
the instruction stream.
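
To make the instruction counts concrete, here is a hedged C sketch that
emulates building a 64-bit constant from LUI, ADDI/ADDIW and SLLI steps
and counts them. It follows the general "peel off 12 bits, recurse,
shift, add" approach that compilers commonly use; it is not taken from
any post in this thread, the names are hypothetical, and real toolchains
apply extra tricks, so counts for a given constant may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Emulated destination register plus a count of emitted instructions. */
typedef struct {
    uint64_t reg;
    int count;
} LiBuilder;

/* Sign-extend the low `bits` bits of v (1 <= bits <= 63). */
static int64_t sext(uint64_t v, int bits)
{
    uint64_t sign = 1ULL << (bits - 1);
    uint64_t mask = (sign << 1) - 1;
    return (int64_t)((v & mask) ^ sign) - (int64_t)sign;
}

static void emit_li(LiBuilder *b, int64_t val)
{
    int64_t lo12 = sext((uint64_t)val, 12);   /* ADDI immediate range */

    if (val >= INT32_MIN && val <= INT32_MAX) {
        /* The +0x800 compensates for a negative lo12. */
        uint64_t hi20 = (((uint64_t)val + 0x800) >> 12) & 0xFFFFF;
        uint64_t r = 0;                        /* x0 when LUI is skipped */
        if (hi20 != 0) {
            r = (uint64_t)sext(hi20 << 12, 32);          /* LUI   */
            b->count++;
        }
        if (lo12 != 0 || hi20 == 0) {
            r = (uint64_t)sext(r + (uint64_t)lo12, 32);  /* ADDI(W) */
            b->count++;
        }
        b->reg = r;
        return;
    }

    /* Wider than 32 bits: build the upper part recursively, then
       shift it into place and add the low 12 bits. */
    uint64_t hi52 = ((uint64_t)val + 0x800) >> 12;
    int shift = 12;
    while ((hi52 & 1) == 0) {      /* fold trailing zeros into the shift */
        hi52 >>= 1;
        shift++;
    }
    assert(shift >= 12 && shift <= 63);
    emit_li(b, sext(hi52, 64 - shift));
    b->reg <<= shift;                                    /* SLLI  */
    b->count++;
    if (lo12 != 0) {
        b->reg += (uint64_t)lo12;                        /* ADDI  */
        b->count++;
    }
}
```

Starting from a zeroed LiBuilder, emit_li leaves the requested value in
reg and the number of emulated instructions in count. Under this scheme
each recursion level retires at least 12 bits, so the count stays small
but nonzero, whereas an ISA that takes the constant straight from the
instruction stream spends no extra instructions on it.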

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to BGB-Alt on Tue Sep 24 05:23:08 2024

BGB-Alt <[email protected]> schrieb:

On 9/22/2024 3:43 PM, Paul A. Clayton wrote:

On 9/19/24 11:07 AM, EricP wrote:
[snip]

If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.

But those two multiplies still are tossing away 50% of their work.

I do not remember how multipliers are actually implemented — and
am not motivated to refresh my memory at the moment — but I
thought a multiply low would not need to generate the upper bits,
so I do not understand where your "50% of their work" is coming
from.

The high result needs the low result carry-out but not the rest of
the result. (An approximate multiply high for multiply by
reciprocal might be useful, avoiding the low result work. There
might also be ways that a multiplier could be configured to also
provide bit mixing similar to middle result for generating a
hash?)

I guess it might be interesting if one made a bigger multiplier out of
4-bit multipliers, in a way similar to a 4-bit shift-add.

If you look through the old TTL handbooks by TI, you will find how
people did multipliers in the bit-slice age. They had 4 bit *
4 bit->8 bit multipliers (74274) or Booth recoding with a 74261
and then summed up the partial products using the 74275.
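
The bit-slice idea can be sketched in C: split each operand into 4-bit
digits, multiply every pair of digits with a 4x4->8 slice, and sum the
shifted partial products. This is only an arithmetic model with
hypothetical function names; it says nothing about how the actual TTL
parts were wired or how the adder tree was arranged.

```c
#include <assert.h>
#include <stdint.h>

/* One 4-bit x 4-bit -> 8-bit multiplier slice. */
static uint8_t mul4x4(uint8_t a, uint8_t b)
{
    assert(a <= 0xF && b <= 0xF);
    return (uint8_t)(a * b);
}

/* 16-bit x 16-bit -> 32-bit product assembled from sixteen slices. */
static uint32_t mul16_from_slices(uint16_t x, uint16_t y)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            uint8_t xi = (uint8_t)((x >> (4 * i)) & 0xF);
            uint8_t yj = (uint8_t)((y >> (4 * j)) & 0xF);
            /* Each partial product is weighted by 16^(i+j). */
            sum += (uint32_t)mul4x4(xi, yj) << (4 * (i + j));
        }
    }
    return sum;
}
```

In hardware the running `sum` would be a tree of adders rather than a
loop; the loop just shows which partial products exist and where they
land.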

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Paul A. Clayton on Fri Sep 27 13:31:13 2024

Paul A. Clayton wrote:

On 9/22/24 6:19 PM, MitchAlsup1 wrote:

On 9/19/24 11:07 AM, EricP wrote:

<sound of soap box being dragged out>
This idea that macro-op fusion is some magic solution is bullshit.

The argument is, at best, of Academic Quality, made by a student
at the time as a way to justify RISC-V not having certain easy
for HW to perform calculations.

The RISC-V published argument for fusion is not great, but fusion
(and cracking/fission) seem natural architectural mechanisms *if*
one is stuck with binary compatibility.

As far as I know there are only 3 published articles on RV fusion.

The Renewed Case for the Reduced Instruction Set Computer:
Avoiding ISA Bloat with Macro-Op Fusion for RISC-V, 2016
http://people.eecs.berkeley.edu/~krste/papers/EECS-2016-130.pdf

is an academic paper that proposes some fusion and compares compiler
outputs but does not consider hardware cost.

Exploring Instruction Fusion Opportunities in
General Purpose Processors, 2022
https://webs.um.es/aros/papers/pdfs/ssingh-micro22.pdf

looks at a much more difficult fusion:
"In this paper, we propose and study techniques to increase the number of
fused memory instructions, notably nonconsecutive and non-contiguous
fusion. Non-ConSecutive Fusion (NCSF) is the operation of fusing two (or
more) μ-ops that are not consecutive in the dynamic execution stream of
the program. Non-ConTiguous Fusion (NCTF) is the operation of fusing two
(or more) memory μ-ops that access non-contiguous memory bytes."

There is a very recent paper that I have not read as it is paywalled.

[paywalled]
Evaluating and Enhancing Performance through Macro-Op Fusion Optimization
with RISC-V, 2024
https://dl.acm.org/doi/abs/10.1145/3677333.3678150

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Paul A. Clayton on Fri Sep 27 18:01:40 2024

On Wed, 25 Sep 2024 2:49:07 +0000, Paul A. Clayton wrote:

On 9/22/24 6:19 PM, MitchAlsup1 wrote:

On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:

On 9/19/24 11:07 AM, EricP wrote:
[snip]

If the multiplier is pipelined with a latency of 5 and throughput
of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.

But those two multiplies still are tossing away 50% of their work.

I do not remember how multipliers are actually implemented — and
am not motivated to refresh my memory at the moment — but I
thought a multiply low would not need to generate the upper bits,
so I do not understand where your "50% of their work" is coming
from.

    +-----------+   +------------+
    \ mplier /     \   mcand /        Big input mux
     +--------+       +--------+
          |                |
          |      +--------------+
          |     /               /
          |    /               /
          +-- /               /
             /     Tree      /
            /               /--+
           /               /   |
          /               /    |
         +---------------+-----------+
               hi             low        Products

two n-bit operands are multiplied into a 2×n-bit result.
{{All the rest is HOW not what}}

So are you saying the high bits come for free? This seems
contrary to the conception of sums of partial products, where
some of the partial products are only needed for the upper bits
and so could (it seems to me) be uncalculated if one only wanted
the lower bits.
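
The partial-product view in that question can be written down as a C
sketch: the low 32 bits of a 32x32 product depend only on the one-bit
terms a_i AND b_j with i + j < 32. The function name is hypothetical,
and this counts AND terms only; it does not model tree area or gate
delay.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Low 32 bits of a*b, built only from partial-product bits whose
 * weight is below 2^32.  *terms_used receives how many one-bit
 * terms were formed.
 */
static uint32_t mul_low_only(uint32_t a, uint32_t b, int *terms_used)
{
    uint32_t sum = 0;
    int terms = 0;

    assert(terms_used != 0);
    for (int i = 0; i < 32; i++) {
        for (int j = 0; i + j < 32; j++) {
            uint32_t bit = ((a >> i) & 1u) & ((b >> j) & 1u);
            sum += bit << (i + j);   /* unsigned add wraps mod 2^32 */
            terms++;
        }
    }
    *terms_used = terms;
    return sum;
}
```

The loop forms 32*33/2 = 528 terms, against 32*32 = 1024 for the full
64-bit product, which is the sense in which a low-only multiply could
skip part of the array.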

The high order bits are free WRT gates of delay, but consume as much
area as the lower order bits. I was answering the question of
"I do not remember how multipliers are actually implemented".

The high result needs the low result carry-out but not the rest of
the result. (An approximate multiply high for multiply by
reciprocal might be useful, avoiding the low result work. There
might also be ways that a multiplier could be configured to also
provide bit mixing similar to middle result for generating a
hash?)

I seem to recall a PowerPC implementation did semi-pipelined
32-bit multiplication 16 bits at a time. This presumably saved area
and power

You save 1/2 of the tree area, but ultimately consume more power.

The power consumption would seem to depend on how frequently both
multiplier and multiplicand are larger than 16 bits. (However, I
seem to recall that the mentioned implementation only checked one
operand.) I suspect that for a lot of code, small values are
common.

It is 100% of the time in FP codes, and generally unknowable in
integer codes.
<snip>

My 66000's CARRY and PRED are "extender prefixes", admittedly
included in the original architecture so compensating for encoding
constraints (e.g., not having 36-bit instruction parcels) rather
than microarchitectural or architectural variation.

Since they cast extra bits over a number of instructions, and
while they precede the instructions they modify, they are not
classical prefixes--so I use the term Instruction-modifier instead.

[snip]

(I feel that encoding some of the dependency information could
be useful to avoid some of this work. In theory, common
dependency detection could also be more broadly useful; e.g.,
operand availability detection and execution/operand routing.)

So useful that it is encoded directly in My 66000 ISA.

How so? My 66000 does not provide any explicit declaration what
operation will be using a result (or where an operand is being
sourced from). Register names express the dependencies so the
dataflow graph is implicit.

I was talking about how operand routing is explicitly described
in ISA--which is mainly about how constants override register
file reads by the time operands get to the calculation unit.

I was speculating that _knowing_ when an operand will be available
and where a result should be sent (rather than broadcasting) could
be useful information.

It is easier to record which FU will deliver a result, the when
part is simply a pipeline sequencer from the end of a FU to the
entries in the reservation station.

Even with reduced operations per cycle, fusion could still provide
a net energy benefit.

Here I disagree:: but for a different reason::

In order for RISC-V to use a 64-bit constant as an operand, it has
to execute either:: AUIPC-LD to an area of memory containing the
64-bit constant, or a 6-7 instruction stream to build the constant
inline. An ISA that directly supports 64-bit constants does not
execute any of those.

Thus, while fusion may save power when seen at the "it's my ISA"
level, when seen from the perspective of "it is directly supported
in my ISA" it wastes power.

Yes, but "computing" large immediates is obviously less efficient
(except for compression), the computation part is known to be
unnecessary. Fusing a comparison and a branch may be a consequence
of bad ISA design in not properly estimating how much work an
instruction can do (and be encoded in available space) and there
is excess decode overhead with separate instructions, but the
individual operations seem to be doing actual work.

I suspect there can be cases where different microarchitectures
would benefit from different amounts of instruction/operation
complexity such that cracking and/or fusion may be useful even in
an optimally designed generic ISA.

[snip]

- register specifier fields are either source or dest, never both

This seems mostly a code density consideration. I think using a
single name for both a source and a destination is not so
horrible, but I am not a hardware guy.

All we HW guys want is that, wherever a field is specified,
it is specified in exactly 1 field in the instruction. So, if
field<a..b> is used to specify Rd in one instruction, there is
no other field<!a..!b> that specifies the Rd register. RISC-V blew
this "requirement".

Only with the Compressed extension, I think. The Compressed
extension was somewhat rushed and, in my opinion, philosophically
flawed by being redundant (i.e., every C instruction can be
expanded to a non-C instruction). Things like My 66000's ENTER
provide code density benefits but are contrary to the simplicity
emphasis. Perhaps a Rho (density) extension would have been
better.☺ (The extension letter idea was interesting for an
academic ISA but has been clearly shown to be seriously flawed.)

The R in RISC-V does not represent REDUCED.

16-bit instructions could have kept the same register field
placements with masking/truncation for two-register-field
instructions.

The whole layout of the ISA is sloppy...

Even a non-destructive form might be provided by
different masking or bit inversion for the destination. However,
providing three register fields seems to require significant
irregularity in extracting register names. (Another technique
would be using opcode bits for specifying part or all of a
register name. Some special purpose registers or groups of
registers may not be horrible for compiler register allocation,
but such seems rather funky/clunky.)

It is interesting that RISC-V chose to split the immediate field
for store instructions so that source register names would be in
the same place for all (non-C) instructions.

Lipstick on a pig.

Comparing an ISA design to RISC-V is not exactly the same as
comparing to "best in class".

I don't even know if My 66000 can or should be termed RISC since
it is a bit closer to VAX but did not go so far as to allow all
operands to be constants--just one; the memory unit has a sequencer
to perform ENTER, EXIT, LDM, STM, MM, MS; the FPU has a sequencer
to do FDIV, SQRT, Log-family, exp-family, sin-family, arc-family
and pow, flow control unit has a sequencer to do PIC switch-case:
all while allowing other FUs to process instructions while those
sequencers run.

I postulate that My 66000 ISA is RISC because it actually IS a
Reduced instruction set computer--currently standing at 64
instructions including SIMD and vectors.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to BGB on Sun Oct 13 11:30:52 2024

On Thu, 5 Sep 2024 20:08:23 -0500
BGB <[email protected]> wrote:

On 9/3/2024 3:40 AM, Michael S wrote:

On Tue, 3 Sep 2024 05:55:14 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Tim Rentsch <[email protected]> schrieb:

My suggestion is not to implement a language extension, but to
implement a compiler conforming to C as it is now,

Sure, that was also what I was suggesting - define things that
are currently undefined behavior.

with
additional guarantees for what happens in cases that are
undefined behavior.

Guarantees or specifications - no difference there.

Moreover the additional guarantees are
always in effect unless explicitly and specifically requested
otherwise (most likely by means of a #pragma or _Pragma).
Documentation needs to be written for the #pragmas, but no other
documentation is required (it might be nice to describe the
additional guarantees but that is not required by the C
standard).

It's the other way around - you need to describe first what the
actual behavior in absence of any pragmas is, and this needs to be
a firm specification, so the programmer doesn't need to read your
mind (or the source code to the compiler) to find out what you
meant. "But it is clear that..." would not be a specification;
what is clear to you may absolutely not be clear to anybody else.

This is also the only chance you'll have of getting this
implemented in one of the current compilers (and let's face it, if
you want high-quality code, you would need that; both LLVM and GCC
have taken an enormous amount of effort up to now, and duplicating
that is probably not going to happen).

The point is to change the behavior of the compiler but
still conform to the existing ISO C standard.

I understood that - defining things that are currently undefined.
But without a specification, that falls down.

So, let's try something that causes some grief - what should
be the default behavior (in the absence of pragmas) for integer
overflow? More specifically, can the compiler set the condition
to false in

int a;

...

if (a > a + 1) {
}

and how would you specify this in an unambiguous manner?

I'd start much earlier, by declaration of "Homogeneity and
Exclusion". It would state that "more defined C" does not pretend
to cover all targets covered by existing C language.
Specifically, following target characteristics are required:
- byte-addressable machine with 8-bit bytes
- two's-complement integer types
- if float type is supported it has to be IEEE-754 binary32
- if double type is supported it has to be IEEE-754 binary64
- if long double type is supported it has to be IEEE-754 binary128
- storage order for multibyte types should be either LE or BE,
consistently for all built-in types
- flat address space
That part should be specified in a more formal manner.

I might add a few things.

ALU:
If integer types overflow, they wrap, with any internal sign or zero
extension consistent with the declared type;
If a multiply overflows, the result will contain the low-order bits
of the product, sign or zero extended according to the declared types;
If a variable is shifted left, it will behave as-if it were sign or
zero extended in a way consistent with the type;
If a signed value is shifted right, its high order bits will remain
consistent with the original sign bit.

So, in the above example, one could see:
if (a > a + 1) { }
As a hypothetical:
if (a > SignExtend32(a + 1)) { }
Where SignExtend32 returns the input value sign-extended from 32 bits
(a+1 always incrementing the value, but may conceptually either wrap
or go outside the allowed range for 'int', with the sign extension
always returning it to its canonical form, seen as twos complement).
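
A hedged C sketch of that hypothetical, written so that it is already
well defined in today's C: the increment is done in 64 bits and then
folded back to 32 bits by hand. The function names are illustrative
only and are not part of any proposal in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a wider value back into the 32-bit two's-complement range. */
static int32_t sign_extend32(int64_t v)
{
    uint32_t low = (uint32_t)v;                 /* keep the low 32 bits */
    return (int32_t)((int64_t)(low ^ 0x80000000u) - 0x80000000LL);
}

/* Models `a > a + 1` under the proposed wrapping rule. */
static int greater_than_successor(int32_t a)
{
    int32_t next = sign_extend32((int64_t)a + 1);
    assert(next == INT32_MIN || next == a + 1);
    return a > next;
}
```

Under this model the comparison is true only for a == INT32_MAX, where
the successor wraps to INT32_MIN; so a compiler following the rule
could not fold the condition to false.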

I will not define the behavior of shifts greater than or equal to the
modulo of the integer size, or of negative shifts, as there isn't a
consistent behavior here across targets.

However, will note for shifting in a constant expression, it does
seem to be the case, that the shift will behave as-if the width was
unbounded, and negative shifts as a shift in the opposite direction,
with the result then being sign or zero extended in accordance with
the type.

Say, for example, zigzag sign folding:
int32_t i, j, k;
i=somevalue;
j=(i<<1)^(i>>31); //fold sign into LSB
k=(j>>1)^((j<<31)>>31);
assert(k==i);
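
For comparison, a sketch of the same zigzag folding that does not rely
on any of the proposed guarantees: all shifting is done on unsigned
values. The function names are hypothetical. As far as I can tell, the
snippet above round-trips only while i fits in 31 bits if j>>1 is an
arithmetic shift; the unsigned decode here avoids that.

```c
#include <assert.h>
#include <stdint.h>

/* Fold the sign into bit 0: 0,-1,1,-2,... -> 0,1,2,3,... */
static uint32_t zigzag_encode32(int32_t i)
{
    uint32_t u = (uint32_t)i;
    uint32_t sign = (i < 0) ? 0xFFFFFFFFu : 0u;   /* like i >> 31 */
    return (u << 1) ^ sign;
}

/* Inverse of zigzag_encode32. */
static int32_t zigzag_decode32(uint32_t j)
{
    uint32_t u = (j >> 1) ^ (0u - (j & 1u));
    /* Convert back to int32_t without an out-of-range cast. */
    int64_t wide = (int64_t)(u ^ 0x80000000u) - 0x80000000LL;
    assert(wide >= INT32_MIN && wide <= INT32_MAX);
    return (int32_t)wide;
}
```

Encoding then decoding returns the original value for every int32_t,
including INT32_MIN and INT32_MAX.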

Memory:
One may freely cast pointers to different types and dereference them,
regardless of types or alignment of said pointers;
Pointers will behave as-if the memory space were a linear array of
bytes, with each value as one or more contiguous bytes in memory;
Structs are normally packed with each member stored sequentially in
memory, with each member padded to its natural alignment, and the
overall struct, if needed, padded to a multiple of the largest member
alignment;
The natural alignment for primitive types is equal to the
size of said primitive type;
The address taken of any variable will have an in-memory layout
consistent with the declared type;
...

Implicitly:
Any memory store may potentially alias with any other memory access,
unless:
One or both pointers has the restrict keyword;
It can be reasonably proven that the pointed-to memory locations do
not alias;
A compiler may assume an access is aligned if it can be verified that
no operation has caused the address to become misaligned (though, as
a reservation, may assume that if a variable is declared restrict, it
may also be assumed to be properly aligned for its type).

Granted, there are targets where pointers are assumed aligned by
default and declared unaligned, but there is no standard way in C to
declare an unaligned pointer, and there is code that assumes the
ability to freely de-reference pointers regardless of alignment.
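
For reference, the way to get an unaligned, type-punned load in
standard C as it exists now is to copy the bytes with memcpy. This is a
minimal sketch with a hypothetical helper name; it shows the portable
idiom that the proposed rule would make unnecessary, not the proposal
itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address of any alignment. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* byte copy, so no alignment requirement */
    return v;
}
```

Compilers for targets with cheap unaligned access typically turn this
into a single load, so the idiom usually costs nothing at run time.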

Though, a less conservative option would be to assume that any normal
pointer variable is aligned by default, but may become unaligned if
it accepts a value created by casting from a type of smaller
alignment (or is assigned a value from a pointer holding such a
value).

char *cs;
int *pi, *pj;
...
pi=(int *)cs; //taints pi with unaligned status.
..
pj=pi; //taints pj with unaligned status via pi

This would still leave it as UB to pass or return a misaligned
pointer across function boundaries (if the pointer is then
de-referenced), or similar for putting them in struct members.

May leave a partial exception for "void *", which may be cast to
another type without causing the result to become unaligned.

...

Misc:
A missing return value is required to still return as normal;
However, the nature and contents of the value returned will be
undefined (it will be "probably random garbage").

But, would make some reservations:
The relative location and alignment of global variables remains
undefined;
The relative location and alignment of automatic variables
remains undefined;
The nature or the storage of any global or automatic variable whose
address has not been taken, remains undefined;
The nature or identity of any temporary variables created within an
expression, remains undefined;
Calling a function with a missing prototype will remain undefined,
except if both the argument and return types are all primitive types,
the argument types are an exact match and either pointer or integer
types, and the return type is a small integer;
...

Similar, one likely can't (yet) require that targets be little
endian, but one can make a working assumption that the target is
probably little endian.

...

I agree with great majority of it.

Rules for shifts could be formulated better. I think, they are
formulated better in gcc manual, in section about
implementation-defined behaviors.

For functions without arguments, I'd prefer mandatory prototypes, even
at cost of breakage of existing code.
Also more draconian both about missing return type and about missing
return statement in non-void function.

About endianness, I think that my definition in post above is most
practical. I.e. BE allowed, but inconsistent byte orders are prohibited.
Plus, of course, standardized name of preprocessor built-in for easy
compile-time detection of endianness.
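
A sketch of what detection looks like today, for context. The run-time
probe is portable C; the compile-time branch uses the __BYTE_ORDER__
and __ORDER_LITTLE_ENDIAN__ macros that GCC and Clang predefine, which
are compiler-specific rather than a standardized name of the kind asked
for above. Function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Run-time probe: 1 if the lowest-addressed byte holds the LSB. */
static int is_little_endian_runtime(void)
{
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Compile-time answer where the compiler provides one; -1 if unknown. */
static int is_little_endian_compiletime(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
    return -1;
#endif
}

/* The two answers should never disagree when both are available. */
static void check_endianness_consistent(void)
{
    int ct = is_little_endian_compiletime();
    assert(ct == -1 || ct == is_little_endian_runtime());
}
```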

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
