• Computer architects leaving Intel...

    From Thomas Koenig@21:1/5 to All on Tue Aug 27 05:29:22 2024
    Just read that some architects are leaving Intel and doing their own
    startup, apparently aiming to develop RISC-V cores of all things.

    https://www.tomshardware.com/tech-industry/senior-intel-cpu-architects-splinter-to-develop-risc-v-processors-veterans-establish-aheadcomputing

    Maybe a good time to get some developers on board for development.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Tue Aug 27 12:02:40 2024
    On Tue, 27 Aug 2024 05:29:22 -0000 (UTC)
    Thomas Koenig wrote:

    Just read that some architects are leaving Intel and doing their own
    startup, apparently aiming to develop RISC-V cores of all things.

    https://www.tomshardware.com/tech-industry/senior-intel-cpu-architects-splinter-to-develop-risc-v-processors-veterans-establish-aheadcomputing

    Maybe a good time to get some developers on board for development.

    It looks like an exodus from Intel Hillsboro. Hillsboro was the #1, and then
    the #2 (after Haifa), Intel x86 development center in the relatively recent
    past, but it seems that by now this role firmly belongs to Austin.
    It's believable that the more ambitious among the Intel Hillsboro people are
    not happy with that.

  • From Stephen Fuld@21:1/5 to Thomas Koenig on Tue Aug 27 12:06:16 2024
    On 8/26/2024 10:29 PM, Thomas Koenig wrote:
    Just read that some architects are leaving Intel and doing their own
    startup, apparently aiming to develop RISC-V cores of all things.

    https://www.tomshardware.com/tech-industry/senior-intel-cpu-architects-splinter-to-develop-risc-v-processors-veterans-establish-aheadcomputing

    Maybe a good time to get some developers on board for development.

    Or suggest to them that, instead of RISC-V, they should look at My 66000.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From John Dallman@21:1/5 to Koenig on Tue Aug 27 20:59:00 2024
    Thomas Koenig wrote:

    Just read that some architects are leaving Intel and doing their own
    startup, apparently aiming to develop RISC-V cores of all things.

    They're presumably intending to develop high-performance cores, since
    they have substantial experience in doing that for x86-64. The question
    is if demand for those will develop.

    Android is apparently waiting for a new RISC-V instruction set extension;
    you can run various Linuxes, but I have not heard about anyone wanting to
    do so on a large scale.

    John

  • From Scott Lurndal@21:1/5 to John Dallman on Tue Aug 27 21:04:53 2024
    John Dallman writes:
    Thomas Koenig wrote:

    Just read that some architects are leaving Intel and doing their own
    startup, apparently aiming to develop RISC-V cores of all things.

    They're presumably intending to develop high-performance cores, since
    they have substantial experience in doing that for x86-64. The question
    is if demand for those will develop.

    Ask SiFive about demand for high-performance RISC-V cores.

  • From MitchAlsup1@21:1/5 to BGB on Tue Aug 27 23:50:56 2024
    On Tue, 27 Aug 2024 22:39:02 +0000, BGB wrote:

    On 8/27/2024 2:59 PM, John Dallman wrote:
    Thomas Koenig wrote:

    Just read that some architects are leaving Intel and doing their own
    startup, apparently aiming to develop RISC-V cores of all things.

    They're presumably intending to develop high-performance cores, since
    they have substantial experience in doing that for x86-64. The question
    is if demand for those will develop.


    Making RISC-V "not suck" in terms of performance will probably at least
    be easier than making x86-64 "not suck".

    Yet these people have decades of experience building complex things
    that made x86 (also) not suck. They should have the "drawing power" to get
    more people with similar experience.

    The drawback is that they are competing with "everyone else in
    RISC-V-land" and starting several years late.

    Android is apparently waiting for a new RISC-V instruction set
    extension; you can run various Linuxes, but I have not heard
    about anyone wanting to do so on a large scale.


    My thoughts on "major missing features" are still:
    Needs register-indexed load;
    Needs an intermediate-size constant load (such as 17-bit sign-extended)
    in a 32-bit op.

    Full access to constants.

    Where there is a sizeable chunk of constants between 12 and 17 bits,
    but not as many between 17 and 32 bits (and 32-64 bits is comparatively infrequent).

    Except in "math codes".

    But 64-bit memory reference displacements mean one does not even have
    to bother with a strategy for what to do when you need a single
    FORTRAN common block to be 74GB in size in order to run 5-decade-old
    FEM codes.
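Both bit-width questions above (how many constants fall in the 12-17 bit band, and how wide a displacement a 74GB common block needs) reduce to simple range checks. A minimal C sketch, assuming sign-extended immediate fields and byte offsets; the 17-bit class stands for the proposed 17-bit sign-extended constant, not an existing RISC-V format, and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v fits a sign-extended immediate field of `bits` bits. */
static bool fits_signed(int64_t v, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Smallest of the field widths discussed above that can hold v:
   12 (RISC-V I-type), 17 (the proposed, hypothetical Imm17s), 32, else 64. */
static unsigned imm_class(int64_t v)
{
    if (fits_signed(v, 12)) return 12;
    if (fits_signed(v, 17)) return 17;
    if (fits_signed(v, 32)) return 32;
    return 64;
}

/* Bits needed to express every byte offset 0 .. size-1 of an object
   (unsigned; a signed displacement field needs one more bit). */
static unsigned offset_bits(uint64_t size)
{
    assert(size > 0);
    uint64_t max = size - 1;
    unsigned bits = 0;
    while (max != 0) {
        bits++;
        max >>= 1;
    }
    return bits;
}
```

For example, `offset_bits(74000000000)` is 37 (74 GB lies between 2^36 and 2^37 bytes), so a signed displacement reaching the far end of such a block needs 38 bits, which no 32-bit displacement field can hold.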

    I could also make a case for an instruction to load a Binary16 value and
    convert it to Binary32 or Binary64 in an FPR, but this is arguably a bit
    niche (but it would still beat using a memory load).

    Most of these are covered by something like:

    CVTSD Rd,#1 // 32-bit instruction
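After fetching the 16 bits, the proposed Binary16 load would perform a widening conversion. A minimal C sketch of binary16-to-binary32 conversion on the raw bit pattern, assuming `float` is IEEE 754 binary32 (only its size is checked, via `static_assert`); `half_to_float` is a hypothetical helper name, not an existing API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "expects 32-bit float");

/* Widen an IEEE 754 binary16 bit pattern to binary32. Every binary16
   value is exactly representable in binary32, so no rounding occurs. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {
        /* Inf or NaN: maximum exponent, keep the payload bits. */
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {
        /* Normal: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Subnormal binary16 (mant * 2^-24): normalize it into a
           binary32 normal by shifting until the implicit bit appears. */
        unsigned shift = 0;
        while ((mant & 0x400u) == 0) {
            mant <<= 1;
            shift++;
        }
        mant &= 0x3FFu;
        bits = sign | ((113u - shift) << 23) | (mant << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```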


    The big annoying thing with it is that, to have any hope of adoption, one
    needs an "actually involved" party to add it. There doesn't seem to be
    any sort of aggregated list of "known in-use" opcodes, or any real
    mechanism for "informal" extensions.

    With the OpCode space already 98% filled there does not need to
    be such a list.
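How RISC-V assigns instruction lengths helps explain why its encoding space is so crowded. The low bits of the first 16-bit parcel select the length, and every pattern whose bits [1:0] are not 11 is a 16-bit (C extension) instruction, so compressed instructions occupy three quarters of the 32-bit code points. A minimal C sketch of that length decoding, following the length-encoding table in the RISC-V unprivileged specification; `rv_insn_length` is a hypothetical helper, and formats of 80 bits or more are reported as 0 (not handled):

```c
#include <assert.h>
#include <stdint.h>

/* Instruction length in bytes, decoded from the first 16-bit parcel
   using the RISC-V base length encoding. Returns 0 for formats this
   sketch does not handle (80 bits and longer, or reserved). */
static unsigned rv_insn_length(uint16_t parcel)
{
    if ((parcel & 0x03u) != 0x03u)
        return 2;   /* bits[1:0] != 11: 16-bit (C extension) */
    if ((parcel & 0x1Cu) != 0x1Cu)
        return 4;   /* bits[4:2] != 111: 32-bit */
    if ((parcel & 0x3Fu) == 0x1Fu)
        return 6;   /* bits[5:0] == 011111: 48-bit */
    if ((parcel & 0x7Fu) == 0x3Fu)
        return 8;   /* bits[6:0] == 0111111: 64-bit */
    return 0;
}

/* Fraction of all 2^16 first-parcel patterns that are 16-bit instructions. */
static double compressed_share(void)
{
    unsigned count = 0;
    for (uint32_t p = 0; p <= 0xFFFFu; p++)
        if (rv_insn_length((uint16_t)p) == 2)
            count++;
    return (double)count / 65536.0;
}
```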

    The closest we have on the latter point is the "Composable Extensions" extension by Jan Gray, which mostly seems to let part of the ISA's encoding space be banked out based on a CSR or similar.


    Though bigger immediate values and register-indexed loads arguably
    belong in the base ISA encoding space.

    Agreed, but there is so much more.

    FCMP Rt,#14,R19 // 32-bit instruction
    ENTER R16,R0,#400 // 32-bit instruction
    ..


    At present, I am still on the fence about whether or not to support the
    C extension in RISC-V mode in the BJX2 Core, mostly because the encoding scheme sucks badly enough that I don't really want to deal with it.


    Realistically, I can't expect anyone else to adopt BJX2, though.

    Captain Obvious strikes again.


    Though the bigger issue might be how to give it access to hardware
    devices (it seems part of the physical address space is used as a
    PCI config space, and I would need to figure out what sorts of devices the Linux kernel expects to be there in such a scenario).

    It is reasons like this that cause My 66000 to have four 64-bit address
    spaces {DRAM, MMI/O, configuration, ROM}. PCIe MMI/O space can easily
    exceed 42 bits before one throws MR-IOV at the problem. Configuration
    headers in My 66000 contain all the information CPUID has in x86-land.
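On PCIe systems that use ECAM (the Enhanced Configuration Access Mechanism), the "PCI config space" in the physical address map is a memory-mapped window in which each bus/device/function gets 4 KiB of registers at a fixed bit position. A minimal C sketch of that address computation; the window base is platform-specific, and the values in the usage below are arbitrary placeholders:

```c
#include <assert.h>
#include <stdint.h>

/* Physical address of a config register under PCIe ECAM:
   bus in bits 27:20, device in 19:15, function in 14:12,
   register offset in 11:0 (4 KiB per function, 256 MiB for 256 buses). */
static uint64_t ecam_address(uint64_t ecam_base, unsigned bus, unsigned dev,
                             unsigned fn, unsigned reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 4096);
    return ecam_base
         + ((uint64_t)bus << 20)
         + ((uint64_t)dev << 15)
         + ((uint64_t)fn  << 12)
         + reg;
}
```

Enumeration then reads the vendor ID at offset 0 of each function; a read of a function that does not exist returns all ones (0xFFFF).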

  • From MitchAlsup1@21:1/5 to BGB on Wed Aug 28 16:40:24 2024
    On Wed, 28 Aug 2024 3:33:40 +0000, BGB wrote:

    On 8/27/2024 6:50 PM, MitchAlsup1 wrote:
    On Tue, 27 Aug 2024 22:39:02 +0000, BGB wrote:

    On 8/27/2024 2:59 PM, John Dallman wrote:
    Thomas Koenig wrote:

    Just read that some architects are leaving Intel and doing their own
    startup, apparently aiming to develop RISC-V cores of all things.

    They're presumably intending to develop high-performance cores, since
    they have substantial experience in doing that for x86-64. The question
    is if demand for those will develop.


    Making RISC-V "not suck" in terms of performance will probably at least
    be easier than making x86-64 "not suck".

    Yet these people have decades of experience building complex things
    that made x86 (also) not suck. They should have the "drawing power" to get
    more people with similar experience.

    The drawback is that they are competing with "everyone else in
    RISC-V-land" and starting several years late.

    Though, if anything, they probably have the experience to know how to
    make things like the fabled "opcode fusion" work without burning too
    many resources.



    Android is apparently waiting for a new RISC-V instruction set
    extension; you can run various Linuxes, but I have not heard
    about anyone wanting to do so on a large scale.


    My thoughts on "major missing features" are still:
    Needs register-indexed load;
    Needs an intermediate-size constant load (such as 17-bit sign-extended)
    in a 32-bit op.

    Full access to constants.


    That would be better, but is unlikely within the existing encoding constraints.

    But, say, if one burned one of the remaining unused "OP Rd, Rs, Imm12s" encodings as an Imm17s, well then...

    Dropping compressed instructions gives enough OpCode room to put the
    entire My 66000 ISA in what remains.
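Consider what an Imm17s encoding would save. On RV32, a constant outside the 12-bit ADDI range is built with LUI (upper 20 bits) plus ADDI (sign-extended low 12 bits), so the upper part must be rounded up whenever bit 11 of the constant is set. A minimal C sketch of that split; `split_imm32` and `insns_needed` are hypothetical helper names, and RV64's ADDIW edge cases near INT32_MAX are ignored:

```c
#include <assert.h>
#include <stdint.h>

/* Immediates for an RV32 LUI/ADDI pair that materializes a constant. */
typedef struct {
    int64_t hi20;  /* value for LUI's 20-bit field (encoded modulo 2^20) */
    int64_t lo12;  /* ADDI's sign-extended immediate, in [-2048, 2047] */
} lui_addi_pair;

/* Split v so that hi20 * 4096 + lo12 == v. Adding 0x800 first rounds
   the upper part up whenever bit 11 is set, compensating for ADDI's
   sign extension; floor division avoids right-shifting negatives. */
static lui_addi_pair split_imm32(int32_t v)
{
    int64_t t = (int64_t)v + 0x800;
    int64_t hi = t / 4096;
    if (t % 4096 < 0)
        hi -= 1;                     /* turn truncation into floor */
    lui_addi_pair p = { hi, (int64_t)v - hi * 4096 };
    assert(p.lo12 >= -2048 && p.lo12 <= 2047);
    return p;
}

/* Instructions needed to build v in a register on RV32 today. */
static int insns_needed(int32_t v)
{
    if (v >= -2048 && v <= 2047)
        return 1;                                /* ADDI rd, x0, imm */
    return split_imm32(v).lo12 == 0 ? 1 : 2;     /* LUI alone, or LUI+ADDI */
}
```

For instance, `insns_needed(40000)` is 2 today, although 40000 fits comfortably in a 17-bit signed field.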



    With the OpCode space already 98% filled there does not need to
    be such a list.


    One would still need it if multiple parties want to be able to define an extension independently of each other and not step on the same
    encodings.

    And what kind of code compatibility would you have between different
    designs...

    The closest we have on the latter point is the "Composable Extensions"
    extension by Jan Gray, which mostly seems to let part of the ISA's
    encoding space be banked out based on a CSR or similar.


    Though bigger immediate values and register-indexed loads arguably
    belong in the base ISA encoding space.

    Agreed, but there is so much more.

        FCMP    Rt,#14,R19        // 32-bit instruction
        ENTER   R16,R0,#400       // 32-bit instruction
    ..


    These are likely a bit further down the priority list.


    Prolog/Epilog happens once per function, and often may be skipped for
    small leaf functions, so it seems like a lower priority, more so if one
    lacks a good way to optimize it much beyond the sequence of load/store
    ops it would be replacing (and maybe no way to do it much
    faster than however many registers can be moved in a single clock cycle with the
    available register ports).

    My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
    Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.



    At present, I am still on the fence about whether or not to support the
    C extension in RISC-V mode in the BJX2 Core, mostly because the encoding
    scheme sucks badly enough that I don't really want to deal with it.


    Realistically, I can't expect anyone else to adopt BJX2, though.

    Captain Obvious strikes again.


    This is likely the fate of nearly every hobby class ISA.

    Time to up your game to an industrial quality ISA.

  • From Scott Lurndal@21:1/5 to John Dallman on Wed Aug 28 18:28:44 2024
    John Dallman writes:
    Scott Lurndal wrote:

    John Dallman writes:
    They're presumably intending to develop high-performance cores,
    since they have substantial experience in doing that for x86-64.
    The question is if demand for those will develop.
    Ask SiFive about demand for high-performance RISC-V cores.

    SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.
    Ahead Computing are presumably not expecting to deliver IP cores for a
    year or two, so /maybe/ they have reasons to expect demand then.

    But it's also possible they just want to carry on being chip architects
    while being in charge of their own company. If so, adopting RISC-V is
    more credible in the short term than starting to design a new ISA as a
    commercial project. Intel won't sell them an x86 license at any
    reasonable price.

    Thinking a bit more, they may be trying to go the Nuvia route: design
    original cores for an existing ISA and get bought out. Nuvia were bought
    by Qualcomm for their ARMv9-A core IP well before they released anything.
    If Ahead were to successfully design a fast RISC-V core with
    power:performance that was competitive with ARM, /Intel/ might well buy
    them.

    Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for something to
    compete with ARM after having accepted you can't get power:performance to
    match ARM out of x86-64. Then it all went quiet, and Intel didn't
    manufacture the SiFive SoC ("Horse Creek") that was supposed to blaze the
    trail for RISC-V as a consumer and/or enterprise architecture.

    The problem with this is that RISC-V isn't currently comparable,
    feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which doesn't
    exist in the RISC-V design space yet.


    If you were a discontented Intel senior engineer, demonstrating that you
    could produce what Intel needed, getting your company bought and you
    brought back to Intel in a more senior position might seem worth trying.

    Perhaps, but the last few decades are littered with similar failed attempts.

    (the exceptions, starting with Amdahl, _are_ notable for not being
    re-absorbed, but rather for striking out solo successfully).

  • From John Dallman@21:1/5 to Lurndal on Wed Aug 28 19:17:00 2024
    Scott Lurndal wrote:

    John Dallman writes:
    They're presumably intending to develop high-performance cores,
    since they have substantial experience in doing that for x86-64.
    The question is if demand for those will develop.
    Ask SiFive about demand for high-performance RISC-V cores.

    SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.
    Ahead Computing are presumably not expecting to deliver IP cores for a
    year or two, so /maybe/ they have reasons to expect demand then.

    But it's also possible they just want to carry on being chip architects
    while being in charge of their own company. If so, adopting RISC-V is
    more credible in the short term than starting to design a new ISA as a commercial project. Intel won't sell them an x86 license at any
    reasonable price.

    Thinking a bit more, they may be trying to go the Nuvia route: design
    original cores for an existing ISA and get bought out. Nuvia were bought
    by Qualcomm for their ARMv9-A core IP well before they released anything.
    If Ahead were to successfully design a fast RISC-V core with
    power:performance that was competitive with ARM, /Intel/ might well buy
    them.

    Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for something to compete with ARM after having accepted you can't get power:performance to
    match ARM out of x86-64. Then it all went quiet, and Intel didn't
    manufacture the SiFive SoC ("Horse Creek") that was supposed to blaze the
    trail for RISC-V as a consumer and/or enterprise architecture.

    If you were a discontented Intel senior engineer, demonstrating that you
    could produce what Intel needed, getting your company bought and you
    brought back to Intel in a more senior position might seem worth trying.

    John

  • From John Dallman@21:1/5 to Lurndal on Wed Aug 28 19:49:00 2024
    Scott Lurndal wrote:

    Intel were all over RISC-V in 4Q2022 and 1Q2023, looking for
    something to compete with ARM after having accepted you can't
    get power:performance to match ARM out of x86-64. Then it all
    went quiet, and Intel didn't manufacture the SiFive SoC
    ("Horse Creek") that was supposed to blaze the trail for
    RISC-V as a consumer and/or enterprise architecture.

    The problem with this is that RISC-V isn't currently comparable, feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which
    doesn't exist in the RISC-V design space yet.

    Open-source design of the ISA has delivered an architecture suitable for teaching, its original purpose, but has failed to promptly deliver the dull-but-necessary features for large-scale systems? I'm shocked!

    Surely SiFive should have done this work, if they'd known what they were
    doing in competing with ARM?

    John

  • From Thomas Koenig@21:1/5 to Scott Lurndal on Wed Aug 28 18:55:14 2024
    Scott Lurndal wrote:

    The problem with this is that RISC-V isn't currently comparable, feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which doesn't
    exist in the RISC-V design space yet.

    What is missing (in broad terms)?

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Wed Aug 28 20:46:08 2024
    Thomas Koenig writes:
    Scott Lurndal wrote:

    The problem with this is that RISC-V isn't currently comparable,
    feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which doesn't
    exist in the RISC-V design space yet.

    What is missing (in broad terms)?

    Neoverse N3 is ARMv9.2. The list of ISA features from v8.0 to v9.2 is
    quite extensive. Many of them are related to supporting server-grade
    RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
    or accelerator interfaces (ST64B, LD64B).

    Moreover, they have a mature SoC ecosystem including a well-
    defined and highly capable interrupt controller, an I/O
    MMU, a high-speed processor interconnect (CHI), a standard debug
    infrastructure (coresight), embedded logic analyzer (ELA),
    network on chip (NIC-700), et alia.

    https://developer.arm.com/documentation/107997/latest
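For scale, the address widths mentioned in this thread, plus the common 48-bit virtual-address width for comparison, work out as follows (binary prefixes, byte addressing assumed):

$$
\begin{aligned}
2^{42}\ \text{B} &= 2^{2}\cdot 2^{40}\ \text{B} = 4\ \text{TiB}\\
2^{48}\ \text{B} &= 2^{8}\cdot 2^{40}\ \text{B} = 256\ \text{TiB}\\
2^{52}\ \text{B} &= 2^{2}\cdot 2^{50}\ \text{B} = 4\ \text{PiB}\\
2^{64}\ \text{B} &= 2^{4}\cdot 2^{60}\ \text{B} = 16\ \text{EiB}
\end{aligned}
$$

Going from a 42-bit to a 52-bit physical address therefore multiplies the reachable space by $2^{10} = 1024$.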

  • From Anton Ertl@21:1/5 to John Dallman on Thu Aug 29 11:51:24 2024
    John Dallman writes:
    Scott Lurndal wrote:

    John Dallman writes:
    They're presumably intending to develop high-performance cores,
    since they have substantial experience in doing that for x86-64.
    The question is if demand for those will develop.
    Ask SiFive about demand for high-performance RISC-V cores.

    SiFive were pretty sure there wasn't near-term demand for them in 4Q2023.

    Or maybe there was some other reason that the investor money did not
    flow as plentiful as it used to, and so SiFive put the most far-out
    projects on the back-burner.

    Concerning the demand, RISC-V has the advantage of no ARM tax (and
    legal costs like those between ARM and Qualcomm over the developments
    started at NUVIA) or the question of AMD64 licensing to third parties.

    Another RISC-V advantage is that the government of the USA puts
    restrictions on ARM that should not apply to the free RISC-V
    architecture.

    It would apply to implementations designed in the USA (such as those
    by Ahead), but the point is that on the ISA level, and thus the buy-in
    into the ecosystem (e.g., from ISVs), RISC-V has an advantage.

    RISC-V also has a technical advantage over ARM: It has Ztso (total
    store order) as an optional extension, which helps porting of
    multi-threaded software from AMD64 (and emulation of AMD64 software).
    No such thing on ARMv8 or ARMv9 yet, although implementations like the
    Apple M1 and Fujitsu A64FX provide this feature.
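Ztso addresses a classic porting problem: message passing, where x86 code often relies on TSO to keep a data store ordered before a flag store. Portable C11 code states that requirement explicitly with release/acquire. A TSO machine can implement those with ordinary loads and stores, whereas under RVWMO the commonly documented mapping adds fences (`fence rw,w` before the releasing store, `fence r,rw` after the acquiring load). A minimal sketch, assuming a C11 compiler with `<stdatomic.h>`; `producer` and `consumer` are illustrative names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary data written before the flag */
static atomic_bool ready;    /* publication flag; static, so starts false */

/* Writer: store the payload, then release the flag. The release store
   keeps the payload store from becoming visible after the flag. */
void producer(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: if the acquire load sees the flag, the payload is visible too.
   With memory_order_relaxed instead, the C11 model permits a stale
   payload; RVWMO and ARM hardware can actually produce one, while TSO
   hardware keeps the stores in order (the compiler still might not). */
bool consumer(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```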

    Ahead Computing are presumably not expecting to deliver IP cores for a
    year or two

    Three years sounds overly optimistic. Nuvia was founded in 2019,
    acquired in 2021, and hardware has been delivered in 2024, very much
    in line with the often-read number of 5 years for CPU design projects.

    But it's also possible they just want to carry on being chip architects
    while being in charge of their own company.

    Sure. But what are the investors seeing in the company?

    If so, adopting RISC-V is
    more credible in the short term than starting to design a new ISA as a
    commercial project.

    Certainly. Establishing another ISA is hard, because it requires
    buy-in from many forces for lasting success. Even if an architecture
    has a long track record, like MIPS, that's not enough, as the switch
    from the MIPS ISA to RISC-V shows.

    RISC-V has quite a bit of mindshare, it lacks the ARM tax, and with
    the government of the USA hampering ARM, the RISC-V future looks even
    brighter. They still have quite a way to go.

    Thinking a bit more, they may be trying to go the Nuvia route: design
    original cores for an existing ISA and get bought out.

    Probably. Getting bought is a common outcome of a successful startup.

    Nuvia were bought
    by Qualcomm for their ARMv9-A core IP well before they released anything.

    What I read is that the Snapdragon X implements ARM v8.7.

    If Ahead were to successfully design a fast RISC-V core with
    power:performance that was competitive with ARM, /Intel/ might well buy
    them.

    Yes, or somebody else, as happened with Nuvia.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup

  • From Anton Ertl@21:1/5 to John Dallman on Thu Aug 29 13:17:55 2024
    John Dallman writes:
    Android is apparently waiting for a new RISC-V instruction set extension;

    Which one?

    you can run various Linuxes, but I have not heard about anyone wanting to
    do so on a large scale.

    You may not consider it large-scale, but we wanted to have two RISC-V
    servers for teaching (in particular, for the compiler course). Some
    years earlier we had written that into a "future plans" document, and
    in 2022 we got the request to buy them now, because the period that
    was covered in that document was coming to an end. Of course at the
    time the best RISC-V thing to be had was the VisionFive V1, which was
    cheap, but too weak for our purposes (cross-compiling would have been
    possible, but we did not want to go there).

    So we eventually settled on two servers based on the Rocket Lake,
    which at least gave us AVX-512 (the deadline was too early for Zen4).

    Now it's two years later, and the RISC-V servers are still not showing
    up. We'll see how things look when it's time to retire the Rocket
    Lakes (their predecessors were good for a decade).

    - anton

  • From Anton Ertl@21:1/5 to Scott Lurndal on Thu Aug 29 13:47:58 2024
    Scott Lurndal writes:
    Thomas Koenig writes:
    Scott Lurndal wrote:

    The problem with this is that RISC-V isn't currently comparable,
    feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which doesn't
    exist in the RISC-V design space yet.

    What is missing (in broad terms)?

    Neoverse N3 is ARMv9.2. The list of ISA features from v8.0 to v9.2 is
    quite extensive.

    I think the lack of "extensive" features is a feature of RISC-V. Last
    I heard, the ARM manual was >10000 pages.

    The RISC-V user manual has put on a lot of weight since Volume I (unprivileged) Version 2.2 (145 pages) and Volume II (privileged)
    20211203 (155 pages). The 20240411 draft of Volume I weighs in at 670
    pages, and the 20240411 draft of Volume II at 172 pages, but that's
    still quite a long way from 10000.

    One interesting case here is that the 236-page version
    20190608-Base-Ratified of Volume I spends 12 pages on Chapter 14
    "RVWMO Memory Consistency Model, Version 0.1" plus 30 pages for
    "Appendix A RVWMO Explanatory Material, Version 0.1" plus 27 pages on
    "Appendix B Formal Memory Model Specifications, Version 0.1"
    (apparently not grown further in 20240411; the number of pages is a
    little smaller for each of the parts).

    If the goal of RISC-V was a really simple ISA (as in "simple to
    specify"), they would have gone for sequential consistency, but
    obviously the lure of implementation simplicity won out here.

    Many of them are related to supporting server-grade
    RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
    or accelerator interfaces (ST64B, LD64B).

    Can't say I ever missed such instructions.

    Are RAS instructions like memory-ordering instructions? The hardware
    does not provide the feature, but it provides instructions for
    throwing the problem over to software, which is then supposed to use
    those instructions (but not too often) to provide the feature that
    hardware does not provide?

    Moreover, they have a mature SoC ecosystem

    ARM certainly has that. However, a lot of the SoC ecosystem is only
    accessed through drivers that are specific to one kernel and that
    nobody maintains, and that's why many smartphones don't get any
    updates after a few years. Let's hope it's better for servers.

    One hope is that the openness of RISC-V will also create a more open
    ecosystem that will result in drivers in mainline Linux. But my guess
    is that for smartphones, the economic incentives are in the other
    direction. For servers things may be better, though (even on ARM).

    - anton

  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 29 15:06:45 2024
    Anton Ertl writes:
    Scott Lurndal writes:
    Thomas Koenig writes:
    Scott Lurndal wrote:

    The problem with this is that RISC-V isn't currently comparable,
    feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which doesn't
    exist in the RISC-V design space yet.

    What is missing (in broad terms)?

    Neoverse N3 is ARMv9.2. The list of ISA features from v8.0 to v9.2 is
    quite extensive.

    I think the lack of "extensive" features is a feature of RISC-V. Last
    I heard, the ARM manual was >10000 pages.

    Actually considerably more if you consider all the related IP
    such as the GIC, SMMU and others.

    The RISC-V user manual has put on a lot of weight since Volume I (unprivileged) Version 2.2 (145 pages) and Volume II (privileged)
    20211203 (155 pages). The 20240411 draft of Volume I weighs in at 670 pages, and the 20240411 draft of Volume II at 172 pages, but that's
    still quite a long way from 10000.

    I think comparing manual pages is somewhat pointless.


    <snip>


    Many of them are related to supporting server-grade
    RAS, Memory Partitioning, address translation (e.g. 52-bit PA, 52-bit VA)
    or accelerator interfaces (ST64B, LD64B).

    Can't say I ever missed such instructions.

    They are architectural features. They may, or may not, require
    additional instructions.

    The RAS feature is a framework that software can rely on for
    any implementation of an ARM SoC regardless of vendor.


    Are RAS instructions like memory-ordering instructions?

    There is one instruction specific to RAS: ESB, a
    barrier instruction that synchronizes error events.



    Moreover, they have a mature SoC ecosystem

    ARM certainly has that. However, a lot of the SoC ecosystem is only
    accessed through drivers that are specific to one kernel and that
    nobody maintains, and that's why many smartphones don't get any
    updates after a few years. Let's hope it's better for servers.

    It is far better for servers. The SBSA specification, for example,
    is designed specifically to support standard software interfaces to
    the hardware/firmware. Microsoft, Ubuntu, Red Hat et alia are
    all involved in the creation and maintenance of that and related
    specifications along with the ARM processor vendors.

  • From MitchAlsup1@21:1/5 to BGB on Thu Aug 29 16:23:19 2024
    On Thu, 29 Aug 2024 3:36:44 +0000, BGB wrote:

    On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
    On Wed, 28 Aug 2024 3:33:40 +0000, BGB wrote:

    And what kind of code compatibility would you have between different
    designs...


    If people can agree as to the encodings, then implementations are more
    free to pick which extensions they want or don't want.

    If the encodings conflict with each other, no such free choice is
    possible.

    With differing instructions, how does a software vendor write software
    such that it can run near optimally on any implementation ??

    Prolog/Epilog happens once per function, and often may be skipped for
    small leaf functions, so it seems like a lower priority, more so if one
    lacks a good way to optimize it much beyond the sequence of load/store
    ops it would be replacing (and maybe no way to do it much
    faster than however many registers can be moved in a single clock cycle with the
    available register ports).

    My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
    Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.


    It likely isn't going to happen because a 1-wide machine isn't going to
    have the needed register ports.

    3R1W most of the time converts to 4R or 4W for the *logues.

    But, if one doesn't have the register ports, there is likely no viable
    way to move 4 registers/cycle to/from memory (and it wouldn't make sense
    for the register file to have a path to memory that is wider than what
    the pipeline has).
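The register-port argument can be put in numbers. Assume the register file can deliver (or accept) `ports` registers per cycle on the save/restore path, and ignore memory bandwidth and pipeline startup. Moving n registers then takes ceil(n / ports) cycles. A 3R1W file repurposed as 4R (prologue) or 4W (epilogue) gives ports = 4, compared with 1 for a stream of single-register LD/ST on a 1-wide machine. A minimal C sketch with a hypothetical helper name:

```c
#include <assert.h>

/* Cycles to move nregs registers between the register file and memory
   when `ports` registers can cross per cycle (memory side assumed wide
   enough; pipeline startup ignored). */
static unsigned xfer_cycles(unsigned nregs, unsigned ports)
{
    assert(ports > 0);
    return (nregs + ports - 1) / ports;   /* ceiling division */
}
```

For example, saving 16 callee-saved registers takes 4 cycles at 4 per cycle, compared with 16 single-register stores.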
    ---------------
    This is likely the fate of nearly every hobby class ISA.

    Time to up your game to an industrial quality ISA.

    Open question of what an "industrial quality" ISA has that BJX2 lacks...
    Limiting the scope to things that RISC-V and ARM have.

    Proper handling of exceptions (ignoring them is not proper)
    Proper IEEE 754-2019 handling of FMAC (compute all the bits)
    Floating Point Transcendentals
    HyperVisors/Secure Monitors
    Write Interrupt service routines entirely in HLL
    Proper Privileges and Priorities
    Multi-location ATOMIC events
    ..


  • From Thomas Koenig@21:1/5 to Scott Lurndal on Fri Aug 30 06:12:02 2024
    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:

    The problem with this is that RISC-V isn't currently comparable,
    feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which doesn't
    exist in the RISC-V design space yet.

    What is missing (in broad terms)?

    NeoverseN3 is ARMv9.2. The list of ISA features from v8.0 to v9.2 is
    quite extensive.

    Is there any way to get that list? I've looked, but I only got rough
    overview articles and links to the full documentation, which is fairly overwhelming.

  • From John Dallman@21:1/5 to BGB on Fri Aug 30 09:05:00 2024
    In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:

    On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

    With differing instructions, how does a software vendor write
    software such that it can run near optimally on any implementation?

    They presumably target whatever is common, or the least common
    denominator (such as RV64G or RV64GC), and settle with "probably
    good enough"...

    ISVs can be proactive or passive about adopting a new ISA. Anyone
    promoting a new ISA wants to motivate them to be proactive, but faces
    problems with prerequisites:

    * Who can work with simulators, and who needs hardware?
    * Different kinds of software need more or less powerful hardware.
    * Application people need an OS and development tools at minimum.
    * Quite often they need other software: math libraries, databases, etc.

    But, probably not too much different from other ISAs, just with a
    lot more parties involved.

    Variant ISAs create fear, uncertainty and doubt, and that means delay.
    ISA promoters fear delay, because their investors will run out of
    patience.

    The alternative is that one expects that all the software be
    rebuilt for the specific configuration being used,

    ISVs /really/ don't like that. It multiplies their testing and QA and
    those are expensive. It rarely shows up problems, but convincing
    themselves to do without it is hard for them.

    or recompiled from source or some other distribution format on
    the local machine which it is to be run (with binaries distributed
    as some form of "portable IR").

    ISVs get sceptical about that, because it's generating code they have not tested.

    John

  • From Thomas Koenig@21:1/5 to John Dallman on Fri Aug 30 09:38:02 2024
    John Dallman <[email protected]> schrieb:
    In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:

    On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

    With differing instructions, how does a software vendor write
    software such that it can run near optimally on any implementation?

    They presumably target whatever is common, or the least common
    denominator (such as RV64G or RV64GC), and settle with "probably
    good enough"...

    ISVs can be proactive or passive about adopting a new ISA.

    What is an ISV? I assume "SV" is for "software vendor", but what
    does the I stand for?

    [...]

    Variant ISAs create fear, uncertainty and doubt, and that means delay.
    ISA promoters fear delay, because their investors will run out of
    patience.

    Which makes me wonder why companies such as Intel introduce new
    instructions all the time. For people who compile their own code
    (scientists and engineers) that can be OK, they can just use
    -march=native (or equivalent), and it can even make sense to have architecture-optimized core libraries such as BLAS, or switch on
    availability of features such as AVX512 (but that again has many
    sub-features and highly different performance characteristics,
    depending on the micro-arch).
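    Switching on feature availability can be sketched with the GCC/Clang x86 builtins. The idea is to compile one copy of a function with the extension enabled and a baseline copy without it, then choose between them at run time. This minimal sketch assumes GCC or Clang on x86-64, and other targets simply get the baseline. The function names are illustrative. AVX2 stands in for AVX512 here because more machines have it.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline version: builds and runs on any target. */
static uint64_t sum_u32_generic(const uint32_t *x, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same loop, but the compiler may use AVX2 (VEX-encoded) instructions for
   this function only, so it must never be called on a CPU without AVX2. */
__attribute__((target("avx2")))
static uint64_t sum_u32_avx2(const uint32_t *x, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}
#endif

typedef uint64_t (*sum_u32_fn)(const uint32_t *, size_t);

/* Choose an implementation based on the CPU actually running the code. */
static sum_u32_fn select_sum_u32(void) {
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_u32_avx2;
#endif
    return sum_u32_generic;
}

uint64_t sum_u32(const uint32_t *x, size_t n) {
    /* Lazy selection on first call; not thread-safe as written. */
    static sum_u32_fn impl;
    if (impl == NULL)
        impl = select_sum_u32();
    return impl(x, n);
}
```

    Libraries such as glibc make this choice once, at load time, through IFUNC resolvers rather than on the first call.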

    But standard software (office applications, browsers...) should
    just run everywhere, and there it gets hard to justify.

  • From Michael S@21:1/5 to Thomas Koenig on Fri Aug 30 13:48:25 2024
    On Fri, 30 Aug 2024 09:38:02 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    John Dallman <[email protected]> schrieb:
    In article <vaqgtl$3526$[email protected]>, [email protected] (BGB)
    wrote:
    On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

    ISVs can be proactive or passive about adopting a new ISA.

    What is an ISV? I assume "SV" is for "software vendor", but what
    does the I stand for?


    https://en.wikipedia.org/wiki/Independent_software_vendor

  • From Anton Ertl@21:1/5 to Thomas Koenig on Fri Aug 30 10:26:38 2024
    Thomas Koenig <[email protected]> writes:
    John Dallman <[email protected]> schrieb:
    [...]
    What is an ISV? I assume "SV" is for "software vendor", but what
    does the I stand for?

    <https://en.wikipedia.org/wiki/Independent_software_vendor>

    Variant ISAs create fear, uncertainty and doubt, and that means delay.
    ISA promoters fear delay, because their investors will run out of
    patience.

    Which makes me wonder why companies such as Intel introduce new
    instructions all the time.

    AMD64 already has the buy-in of application vendors for desktops and
    servers, so it does not have the problem that extensions create
    uncertainty among application vendors.

    My guess is that there are the following motivations:

    1) The new instructions make technical sense (for certain
    applications).

    2) Even if the applications that the users use don't benefit from the extensions, the users think (thanks also to Intel's marketing) that
    they might (because of 1); maybe not today, but maybe the next version
    or maybe the application that the user will run in a year or two. And
    I certainly have seen reports that this or that game does not work on
    K10 or whatever because the game uses some SSE4.2 instruction that the
    K10 does not have. Intel could have increased this kind of
    obsolescence (and the resulting new sales) through instruction set
    extensions by supporting AVX across the board early on (as AMD did),
    and later by supporting AVX512 across the board, but Intel marketing
    apparently thinks it's better to get people to buy Core-branded rather
    than Pentium-branded CPUs by disabling AVX for a long time on the
    latter.

    3) I expect that Intel patents the extensions. So these days
    everybody could build an AMD64 CPU, because the patent has expired,
    but nobody wants to buy such a CPU without the extensions (because of
    2), and the extensions are patented.

    and it can even make sense to have
    architecture-optimized core libraries such as BLAS, or switch on
    availability of features such as AVX512

    Yes. And given that a lot of software uses some library or other, a
    lot of software may benefit from the extensions. Of course, the
    question is how big the benefit is.

    E.g., glibc has many different versions of memcpy() and memmove() and
    selects among them based on the actual CPU used in the run, thanks to


    But standard software (office applications, browsers...) should
    just run everywhere, and there it gets hard to justify.

    That will also benefit from libraries.

    For browsers the JavaScript and WASM JIT compiler can generate code
    specific to the extensions present in the hardware; however, no ISA
    extension comes to my mind that a JavaScript or current WASM JIT
    compiler will benefit from; IIRC there is discussion about explicit
    vector stuff in WASM, and there the extensions may make a difference.

    Also, a friend who works on a JavaVM JIT told me he is working on auto-vectorization, but I don't know if they really went for that; auto-vectorization is not just the wrong approach, it also seems
    particularly inappropriate for JIT compilers, because it requires a
    lot of analysis, i.e., compile time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Fri Aug 30 14:52:46 2024
    On Fri, 30 Aug 2024 10:26:38 GMT
    [email protected] (Anton Ertl) wrote:

    Thomas Koenig <[email protected]> writes:
    John Dallman <[email protected]> schrieb:
    [...]
    What is an ISV? I assume "SV" is for "software vendor", but what
    does the I stand for?

    <https://en.wikipedia.org/wiki/Independent_software_vendor>

    Variant ISAs create fear, uncertainty and doubt, and that means
    delay. ISA promoters fear delay, because their investors will run
    out of patience.

    Which makes me wonder why companies such as Intel introduce new instructions all the time.

    AMD64 already has the buy-in of application vendors for desktops and
    servers, so it does not have the problem that extensions create
    uncertainty among application vendors.

    My guess is that there are the following motivations:

    1) The new instructions make technical sense (for certain
    applications).

    2) Even if the applications that the users use don't benefit from the extensions, the users think (thanks also to Intel's marketing) that
    they might (because of 1); maybe not today, but maybe the next version
    or maybe the application that the user will run in a year or two. And
    I certainly have seen reports that this or that game does not work on
    K10 or whatever because the game uses some SSE4.2 instruction that the
    K10 does not have. Intel could have increased this kind of
    obsolescence (and the resulting new sales) through instruction set
    extensions by supporting AVX across the board early on (as AMD did),
    and later by supporting AVX512 across the board, but Intel marketing apparently thinks it's better to get people to buy Core-branded rather
    than Pentium-branded CPUs by disabling AVX for a long time on the
    latter.


    I wish it were only marketing, i.e., only fuses in big-core-derived
    Pentiums and Celerons.
    Unfortunately, the bigger problem was poor work (laziness) by Intel's
    engineering, which didn't have AVX, or any VEX decoding at all, in their
    Atom line until Gracemont.
    It's not marketing, it's the engineers, who produced quite a capable core
    like Tremont with a level of ISA support 10 years behind its time.

    3) I expect that Intel patents the extensions. So these days
    everybody could build an AMD64 CPU, because the patent has expired,
    but nobody wants to buy such a CPU without the extensions (because of
    2), and the extensions are patented.

    and it can even make sense to have
    architecture-optimized core libraries such as BLAS, or switch on availability of features such as AVX512

    Yes. And given that a lot of software uses some library or other, a
    lot of software may benefit from the extensions. Of course, the
    question is how big the benefit is.

    E.g., glibc has many different versions of memcpy() and memmove() and
    selects among them based on the actual CPU used in the run, thanks to


    But standard software (office applications, browsers...) should
    just run everywhere, and there it gets hard to justify.

    That will also benefit from libraries.

    For browsers the JavaScript and WASM JIT compiler can generate code
    specific to the extensions present in the hardware; however, no ISA
    extension comes to my mind that a JavaScript or current WASM JIT
    compiler will benefit from;

    More convenient FP->Int conversion than what is available in SSE3.
    Also, I'd guess, due to non-destructive ops scalar DPFP code could be
    sometimes more compact with AVX encoding than with SSE2 encoding.
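    Some context on why FP->Int conversion matters to a JS JIT. JavaScript's ToInt32, used by the bitwise operators, maps NaN and infinities to 0. Otherwise it truncates toward zero and wraps modulo 2^32. x86 conversion instructions such as CVTTSD2SI behave differently: for NaN and out-of-range inputs they return the "integer indefinite" value (0x80000000 for a 32-bit result), so a JIT has to add fix-up code. ARMv8.3 added FJCVTZS specifically for this case. Below is a portable sketch of the semantics that works on the IEEE 754 bit pattern. It illustrates the rule and is not the code any particular JIT uses.

```c
#include <stdint.h>
#include <string.h>

/* ECMAScript ToInt32: NaN/Inf -> 0, otherwise truncate toward zero and
   wrap modulo 2^32 into the signed 32-bit range. */
int32_t js_to_int32(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);            /* IEEE 754 binary64 pattern */
    int biased = (int)((bits >> 52) & 0x7FF);
    if (biased == 0x7FF)
        return 0;                               /* NaN or +/-Infinity */
    uint64_t mant = bits & 0x000FFFFFFFFFFFFFull;
    int e;                                      /* |d| == mant * 2^e */
    if (biased == 0) {
        e = -1074;                              /* zero or subnormal */
    } else {
        mant |= 1ull << 52;                     /* implicit leading 1 */
        e = biased - 1075;
    }
    uint32_t low;                               /* trunc(|d|) mod 2^32 */
    if (e <= -53)
        low = 0;                                /* |d| < 1 */
    else if (e < 0)
        low = (uint32_t)(mant >> -e);           /* drop fraction bits */
    else if (e < 32)
        low = (uint32_t)(mant << e);            /* keep low 32 bits */
    else
        low = 0;                                /* a multiple of 2^32 */
    if (bits >> 63)
        low = 0u - low;                         /* negate modulo 2^32 */
    /* Map [2^31, 2^32) to negative values without implementation-defined casts. */
    return low >= 0x80000000u ? (int32_t)((int64_t)low - 4294967296LL)
                              : (int32_t)low;
}
```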

    IIRC there is discussion about explicit
    vector stuff in WASM, and there the extensions may make a difference.

    Also, a friend who works on a JavaVM JIT told me he is working on auto-vectorization, but I don't know if they really went for that; auto-vectorization is not just the wrong approach, it also seems
    particularly inappropriate for JIT compilers, because it requires a
    lot of analysis, i.e., compile time.

    - anton

    I agree for the case of JS. Not so much for the case of Enterprise Java.
    OTOH, personally I care about the performance of JS and don't care at all
    about Enterprise Java. I would think that the great majority of the world
    is like me in that regard, but maybe not so great a majority among those
    who sign the checks.

  • From Anton Ertl@21:1/5 to Michael S on Fri Aug 30 12:07:04 2024
    Michael S <[email protected]> writes:
    On Fri, 30 Aug 2024 10:26:38 GMT
    [email protected] (Anton Ertl) wrote:
    Intel could have increased this kind of
    obsolescence (and the resulting new sales) through instruction set
    extensions by supporting AVX across the board early on (as AMD did),
    and later by supporting AVX512 across the board, but Intel marketing
    apparently thinks it's better to get people to buy Core-branded rather
    than Pentium-branded CPUs by disabling AVX for a long time on the
    latter.


    I wish it were only marketing, i.e., only fuses in big-core-derived
    Pentiums and Celerons.
    Unfortunately, the bigger problem was poor work (laziness) by Intel's
    engineering, which didn't have AVX, or any VEX decoding at all, in their
    Atom line until Gracemont.

    Intel has certainly disabled AVX in Pentiums and Celerons that used
    the P-cores (e.g., Skylake-based Pentiums). That's purely marketing.

    Concerning the "Atom"-based processors, it seems to me that they were
    not lazy, they did what they were told, and they were told not to
    implement AVX. Admittedly, this saves a little area and maybe a
    little power, but the AMD Jaguar (2013) included AVX and went for the
    same market segment as the Intel Silvermont (2013). And not just
    Silvermont excluded AVX, so did Goldmont (2016), Goldmont+ (2017), and
    Tremont (2020), and also the contemporaneous P-core-based Pentiums and Celerons. Apparently the idea was that AVX/AVX2 and AVX-512 are
    premium features.

    One interesting case is the Xeon E-2400 line. On these CPUs only the
    P-Cores are enabled, they are server processors, and yet Intel
    disabled AVX-512 (which the Xeon E-2300 line has). I wonder what the
    reasoning behind that decision was.

    It's not marketing, it's the engineers, who produced quite a capable core
    like Tremont with a level of ISA support 10 years behind its time.

    If their bosses tell them to create a core without AVX, what should
    they do? (Answer: Found Ahead! :-) If their bosses had asked them to
    create a core with AVX, would they have rebelled out of laziness? I
    doubt it.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Fri Aug 30 14:30:22 2024
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:

    The problem with this is that RISC-V isn't currently comparable,
    feature-wise, with ARMv8.0. To compete with Neoverse-N2 cores,
    they'll need to support a similar feature set - most of which doesn't
    exist in the RISC-V design space yet.

    What is missing (in broad terms)?

    NeoverseN3 is ARMv9.2. The list of ISA features from v8.0 to v9.2 is
    quite extensive.

    Is there any way to get that list? I've looked, but I only got rough overview articles and links to the full documentation, which is fairly overwhelming.

    Chapter A2 (A-Profile Extensions) of DDI0487 (ARM ARM) gives a nice list
    for each architecture version.

    https://developer.arm.com/documentation/ddi0487/latest/

  • From John Dallman@21:1/5 to Anton Ertl on Fri Aug 30 15:48:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Concerning the demand, RISC-V has the advantage of no ARM tax (and
    legal costs like those between ARM and Qualcomm over the
    developments started at NUVIA)

    True, although the market for high-performance application cores is less price-sensitive than the market for low-performance embedded ones.

    Another RISC-V advantage is that the government of the USA puts
    restrictions on ARM that should not apply to the free RISC-V
    architecture.

    It would apply to implementations designed in the USA (such as those
    by Ahead), but the point is that on the ISA level, and thus the
    buy-in into the ecosystem (e.g., from ISVs), RISC-V has an advantage.

    As someone who does porting and platforms for an ISV, I'm seeing no
    customer demand whatsoever. I'm pretty sure that's because of the lack of high-performance implementations. I'd like to do RISC-V, because new architectures are fun, but I can't get hardware at present that's up to
    the job, and so I can't justify spending time on it.

    RISC-V also has a technical advantage over ARM: It has Ztso (total
    store order) as an optional extension, which helps porting of
    multi-threaded software from AMD64 (and emulation of AMD64
    software). No such thing on ARMv8 or ARMv9 yet, although
    implementations like the Apple M1 and Fujitsu A64FX provide
    this feature.
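    The porting issue can be illustrated with the classic message-passing pattern. On AMD64 (TSO), two ordinary stores become visible to other cores in program order. Code that writes a data field and then sets a flag therefore often "just works" on AMD64 without annotations. On ARMv8, or on RISC-V without Ztso, the hardware may reorder those stores. The portable fix is to state the ordering explicitly with C11 atomics. This is a sketch that assumes POSIX threads are available. The names are illustrative, and error checks are omitted.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* ordinary, non-atomic data */
static atomic_int ready;   /* publication flag */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;
    /* Release: the payload write is visible to any thread that sees
       ready == 1 via an acquire load. If ready were a plain int, the
       compiler could reorder the two stores even on AMD64, and weakly
       ordered hardware (ARMv8, RISC-V without Ztso) could as well. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    int *out = arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                  /* spin until published */
    *out = payload;        /* guaranteed to observe 42 */
    return NULL;
}

/* Runs one producer and one consumer; returns what the consumer read. */
int run_message_passing(void) {
    int result = 0;
    pthread_t prod, cons;
    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    pthread_create(&cons, NULL, consumer, &result);
    pthread_create(&prod, NULL, producer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return result;
}
```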

    Yup, that's an advantage. I have not had trouble with the lack of it on multi-threaded ARM Linux or ARM Windows, but the threading framework I
    use was originally developed on SPARC and does its mutexes properly.

    But it's also possible they just want to carry on being chip
    architects while being in charge of their own company.
    Sure. But what are the investors seeing in the company?

    Hard to say, given the things venture capitalists are prepared to throw
    money at these days.

    Even if an architecture has a long track record, like MIPS, that's
    not enough, as the switch from the MIPS ISA to RISC-V shows.

    In my market sector, so far, that's "the death of MIPS." That happened in
    2008, simply because it wasn't remotely performance-competitive.

    What I read is that the Snapdragon X implements ARM v8.7.

    You're right, I mis-remembered.

    John

  • From Anton Ertl@21:1/5 to John Dallman on Fri Aug 30 14:12:04 2024
    [email protected] (John Dallman) writes:
    In article <vaqgtl$3526$[email protected]>, [email protected] (BGB) wrote:
    The alternative is that one expects that all the software be
    rebuilt for the specific configuration being used,

    ISVs /really/ don't like that. It multiplies their testing and QA and
    those are expensive. It rarely shows up problems, but convincing
    themselves to do without it is hard for them.

    You actually don't need different extensions for such problems, if you
    have library providers like the glibc people who use different implementations with different behaviours (in ways that resulted in
    breakage) depending on the processor (not architectural extensions).

    In particular, apparently around 2010 or shortly before, glibc
    started to implement memcpy() with backwards stride on some (not all)
    AMD64 hardware, and on some software this led to breakage. The cool
    feature is that you could test the software on your hardware and it
    would behave as expected, while on some other, hardware-level 100%
    compatible hardware it would misbehave. And if the user on that
    system reported the problem, you would be unable to reproduce it. I
    am not sure if static linking protects against this. Containerization
    does not.
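    This is why a hardware-dependent implementation choice could break programs. For non-overlapping buffers, a forward copy and a backward copy are equally valid memcpy() implementations. Once the buffers overlap, the two give different results. The sketch below uses two hypothetical byte-wise copy loops, not glibc's actual code, to shift a string left by one position. That is exactly the kind of call only memmove() defines.

```c
#include <stddef.h>
#include <string.h>   /* memmove/strcmp, used when comparing results */

/* Copy lowest addresses first. */
void copy_forward(char *dst, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copy highest addresses first. */
void copy_backward(char *dst, const char *src, size_t n) {
    for (size_t i = n; i > 0; i--)
        dst[i - 1] = src[i - 1];
}
```

    Start from "abcdef" and copy 4 bytes from buf+1 to buf. copy_forward produces "bcdeef", the same as memmove(). copy_backward produces "eeeeef". For a right shift (dst > src), the roles reverse. A program that made this call with memcpy() would work on machines whose memcpy() copied forward and break on those that copied backward.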

    Anyway, Ulrich Drepper (glibc maintainer at the time) made the usual C undefined behaviour argument and blamed the application, which
    resulted in a huge flame war. The resolution was that glibc was
    modified to behave as expected for binaries linked against older
    versions of glibc, but would still misbehave for binaries that are
    linked against more recent glibc versions. The idea was apparently
    that this avoids breakage of the existing binaries, and that new
    binaries would be built from source code that avoids the problem
    (probably by using memmove() instead of memcpy()).

    There was still no easy way to determine whether your software that
    calls memcpy() actually works as expected on all hardware, but there
    is a way to avoid this particular problem if you are aware of it:

    #define memcpy(dest,src,n) memmove(dest,src,n)
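    The #define above hides the problem. During testing, it can be more useful to detect it. Below is a minimal debug wrapper, assuming you are willing to abort on misuse. checked_memcpy and ranges_overlap are hypothetical names. Tools such as Valgrind's Memcheck and AddressSanitizer already report overlapping memcpy() arguments.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Nonzero if [a, a+n) and [b, b+n) overlap. Compares uintptr_t values,
   because relational comparison of pointers into different objects is
   undefined in C. */
static int ranges_overlap(const void *a, const void *b, size_t n) {
    uintptr_t x = (uintptr_t)a, y = (uintptr_t)b;
    return n != 0 && (x < y ? y - x < n : x - y < n);
}

/* memcpy() that aborts instead of silently depending on copy direction. */
static void *checked_memcpy(void *dest, const void *src, size_t n) {
    if (ranges_overlap(dest, src, n)) {
        fprintf(stderr, "checked_memcpy: overlapping ranges, n=%zu\n", n);
        abort();
    }
    return memcpy(dest, src, n);
}
```

    In a debug build, `#define memcpy checked_memcpy` placed after these definitions routes every call through the check.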

    or recompiled from source or some other distribution format on
    the local machine which it is to be run (with binaries distributed
    as some form of "portable IR").

    ISVs get sceptical about that, because it's generating code they have not tested.

    Yes, that thinking seems to be a result of C/C++ compiler shenanigans.
    People advocating "optimization" based on the assumption that
    undefined behaviour does not happen have suggested that I should keep
    compiler versions around that compile my source code as I expect it.
    Of course that does not help, because I distribute (GNU) software in
    source code. And, as the glibc issue discussed earlier shows, even
    testing code with a specific compiler and library version does not
    necessarily help.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Dallman@21:1/5 to Anton Ertl on Fri Aug 30 16:42:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    ISVs get sceptical about that, because it's generating code they
    have not tested.

    Yes, that thinking seems to be a result of C/C++ compiler
    shenanigans. People advocating "optimization" based on the
    assumption that undefined behaviour does not happen have
    suggested that I should keep compiler versions around that
    compile my source code as I expect it.

    Plain old compiler bugs, introduced while fixing other ones, are quite
    enough to make me assume that I'll find problems on each change of
    compiler. I have had a manager in a very large software company assure me
    that it was impossible for them to add bugs while making fixes. His
    technical people corrected him immediately, because I'd just laughed.

    John

  • From Scott Lurndal@21:1/5 to John Dallman on Fri Aug 30 15:44:56 2024
    [email protected] (John Dallman) writes:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:
    [email protected] (John Dallman) writes:
    Android is apparently waiting for a new RISC-V instruction set
    extension;
    Which one?

    I don't know what its name is. It was proposed by Hans Boehm, and the
    Android team pointed me to this discussion on a RISC-V mailing list:

    https://lists.riscv.org/g/tech-unprivileged/topic/92916241

    Searching with various terms suggests it might well be the Zabha
    extension, ratified in April this year, but that is deduction.
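    Zabha adds byte and halfword atomic memory operations, for example AMOADD.B and AMOSWAP.H. Without it, compilers typically implement a byte-sized atomic as an LR/SC or compare-and-swap loop on the enclosing aligned 32-bit word, masking out the neighbouring bytes. Below is a C11 sketch of that technique. It models the word explicitly as four byte lanes to stay within C's aliasing rules. fetch_add_byte_lane is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add v to byte lane 'lane' (0..3, lane 0 = least significant
   byte) of *word without disturbing the other lanes; returns the old byte. */
static uint8_t fetch_add_byte_lane(_Atomic uint32_t *word, unsigned lane, uint8_t v) {
    unsigned shift = lane * 8u;
    uint32_t mask = 0xFFu << shift;
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    for (;;) {
        uint8_t old_byte = (uint8_t)((old & mask) >> shift);
        uint8_t new_byte = (uint8_t)(old_byte + v);        /* wraps mod 256 */
        uint32_t desired = (old & ~mask) | ((uint32_t)new_byte << shift);
        if (atomic_compare_exchange_weak_explicit(word, &old, desired,
                memory_order_seq_cst, memory_order_relaxed))
            return old_byte;
        /* On failure 'old' now holds the current word; retry. */
    }
}
```

    Each retry can be caused by a write to any of the four bytes, not only the one being updated. That is one reason a native byte AMO is attractive.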

    You may not consider it large-scale, but we wanted to have two
    RISC-V servers for teaching (in particular, for the compiler
    course).

    Makes sense. It is not in itself "large-scale," but suitable hardware is
    only going to be available if someone wants a lot of it, enough to make building it worthwhile.

    Now it's two years later, and the RISC-V servers are still not
    showing up.

    Yup. RISC-V established a lot of awareness, and some expectations, but
    there hasn't been the equipment to let people start using it.

    I expect RISC-V to gradually encroach on the embedded market and as microcontroller IP that can be included in SoC accelerators (primarily
    to avoid license fees for the alternatives such as Cortex-M7).

    I don't see it replacing ARM64, X86_64/AMD64 or other server-grade
    processors.

  • From Anton Ertl@21:1/5 to John Dallman on Fri Aug 30 15:10:02 2024
    [email protected] (John Dallman) writes:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:
    [email protected] (John Dallman) writes:
    Android is apparently waiting for a new RISC-V instruction set
    extension;
    Which one?

    I don't know what its name is. It was proposed by Hans Boehm, and the
    Android team pointed me to this discussion on a RISC-V mailing list:

    https://lists.riscv.org/g/tech-unprivileged/topic/92916241

    Thanks.

    Searching with various terms suggests it might well be the Zabha
    extension, ratified in April this year, but that is deduction.

    Yes.

    Now it's two years later, and the RISC-V servers are still not
    showing up.

    Yup. RISC-V established a lot of awareness, and some expectations, but
    there hasn't been the equipment to let people start using it.

    There is equipment, but only at the small-system end for now, with
    Raspi-like SBCs being the top of the line for now.

    The Visionfive V2 is one of them, and is roughly comparable to a Raspi
    3 (1.5GHz in-order core). We have the V1, and it runs Fedora just
    fine, albeit slowly.

    The BeagleV-Ahead has 4 Xuantie C910 cores (2GHz out-of-order multiple
    issue), but only 4GB RAM. It's harder to find, but there seems to be
    an Ubuntu image for it: <https://community.element14.com/products/devtools/single-board-computers/next-genbeaglebone/b/blog/posts/beaglev-ahead-getting-started-1>

    I find it funny to see this on an Element14 page (the company
    formerly known as Acorn, the original A in ARM); Element14 has long
    since been bought by Broadcom, but apparently some web presence still
    exists.

    But making the jump from embedded systems and SBCs to servers has not
    happened for RISC-V yet, and looking how long it took to establish ARM
    in servers, I expect that RISC-V will take quite a while. I guess
    that high-performance cores like those that Ahead is probably working
    on are one component along the way.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From jseigh@21:1/5 to John Dallman on Fri Aug 30 12:04:22 2024
    On 8/30/24 10:48, John Dallman wrote:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:
    [email protected] (John Dallman) writes:
    Android is apparently waiting for a new RISC-V instruction set
    extension;
    Which one?

    I don't know what its name is. It was proposed by Hans Boehm, and the
    Android team pointed me to this discussion on a RISC-V mailing list:

    https://lists.riscv.org/g/tech-unprivileged/topic/92916241


    The RV64A stuff? I don't know about Android but I would find
    it limiting. Kind of like having to work with only C/C++17 concurrency
    support, without being able to resort to inline assembly on x64. I
    know RISC-V thinks they solved the ABA problem with LR/SC but
    they haven't in all cases.
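    For readers unfamiliar with ABA: a compare-and-swap succeeds if the value looks unchanged, even if another thread changed it from A to B and back to A in the meantime. One common software mitigation pairs the value with a version counter that every successful update increments. This works with any CAS, including one built from LR/SC. Below is a C11 sketch that assumes 32-bit values, for example indices into a node pool, so that value and tag fit in one 64-bit atomic. All names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Low 32 bits: the value; high 32 bits: a version tag. */
static inline uint64_t tv_pack(uint32_t value, uint32_t tag) {
    return ((uint64_t)tag << 32) | value;
}
static inline uint32_t tv_value(uint64_t w) { return (uint32_t)w; }
static inline uint32_t tv_tag(uint64_t w)   { return (uint32_t)(w >> 32); }

/* Store new_value only if *slot still holds exactly 'expected' (same value
   AND same tag). Each success bumps the tag, so an A->B->A sequence by
   another thread makes an older snapshot fail. The tag wraps after 2^32
   updates, which narrows the ABA window rather than closing it entirely. */
static bool tv_cas(_Atomic uint64_t *slot, uint64_t expected, uint32_t new_value) {
    uint64_t desired = tv_pack(new_value, tv_tag(expected) + 1u);
    return atomic_compare_exchange_strong(slot, &expected, desired);
}
```

    Suppose a thread takes a snapshot while the slot holds 5, and the slot then goes 5 -> 7 -> 5. A CAS on the value alone would accept the stale snapshot. tv_cas rejects it because the tag has moved on.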

    Joe Seigh

  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Aug 30 16:15:14 2024
    [email protected] (Anton Ertl) writes:
    [email protected] (John Dallman) writes:


    But making the jump from embedded systems and SBCs to servers has not happened for RISC-V yet, and looking how long it took to establish ARM
    in servers, I expect that RISC-V will take quite a while. I guess
    that high-performance cores like those that Ahead is probably working
    on are one component along the way.

    It takes a whole ecosystem, from the OS vendors to the Lauterbachs et alia
    to support a new architecture. RISC-V may get there eventually, but
    I don't see it happening quickly.

  • From David Brown@21:1/5 to John Dallman on Fri Aug 30 18:28:08 2024
    On 30/08/2024 17:42, John Dallman wrote:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    ISVs get sceptical about that, because it's generating code they
    have not tested.

    Yes, that thinking seems to be a result of C/C++ compiler
    shenanigans. People advocating "optimization" based on the
    assumption that undefined behaviour does not happen have
    suggested that I should keep compiler versions around that
    compile my source code as I expect it.

    Plain old compiler bugs, introduced while fixing other ones, are quite
    enough to make me assume that I'll find problems on each change of
    compiler. I have had a manager in a very large software company assure me that it was impossible for them to add bugs while making fixes. His
    technical people corrected him immediately, because I'd just laughed.


    I always keep old versions of compilers around, and don't change
    compilers (or libraries) in the middle of a project. Since I work with
    embedded systems, the compilers I use have significantly fewer users than,
    say, x86-targeted compilers. Thus there is a higher risk of bugs being missed
    in beta testing and going unreported for longer. (IME bugs are far more
    likely in vendor SDKs than in gcc or newlib, but I keep everything
    archived just in case.) I also like to have reproducible builds -
    something that many Linux distributions are aiming for these days -
    which requires archiving the toolchain.

    If you want to write reliable code that can be distributed as source and compiled by any conforming C/C++ compiler, you need to be very sure that
    you avoid relying on behaviour that is not specified and documented.
    You need to write correct code. That means if you want to copy some
    memory with overlapping source and destination arrays, you use "memmove"
    - the function for that purpose. You don't use "memcpy", since it is
    specified explicitly as requiring non-overlapping arrays.

    If you want to write software that is "correct because it passed its
    tests", you can only expect it to be reliable when it is run exactly as
    tested. That means it must be compiled as it was during tests (same
    compiler, same options, same library), and arguably even run only on the
    same hardware (if you only test on one particular cpu, OS, etc., you can
    only be sure it works on that cpu, OS, etc.).

    It is, of course, a lot easier to write software that appears roughly
    correct in the source code and passes its tests, than software that is
    rigidly accurate.

    That's why a lot of pre-compiled commercial software gives particular
    versions of particular OS's or Linux distributions in their lists of requirements - even though the software would probably work fine on a
    much wider range.


    I see nothing wrong in blaming programmers for using "memcpy" when they
    should have used "memmove" - it was those programmers that made the
    error. And there is nothing wrong with toolchain developers wanting to
    give the most efficient results possible to those that code correctly,
    rather than punishing accurate programmers for the mistakes of less
    accurate programmers. But it is also important for toolchain developers
    to remember that programmers are all fallible humans, and sometimes they
    could do a better job of minimising the consequences of other people's
    errors, or at least informing about these issues - especially for errors
    that might be fairly common.
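As an illustration of how a library could limit the damage from this particular mistake, here is a sketch that checks for overlap and falls back to `memmove`. The name `tolerant_memcpy` is hypothetical and is not a glibc or newlib API. The addresses are compared as `uintptr_t` to avoid relational comparison of unrelated pointers, though the pointer-to-integer mapping is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical overlap-tolerant copy: behaves like memcpy for disjoint
   buffers, but switches to memmove if the two ranges overlap. */
void *tolerant_memcpy(void *dst, const void *src, size_t n)
{
    /* memcpy/memmove need valid pointers even when n == 0 (C11/C17). */
    assert(dst != NULL && src != NULL);
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    /* [d, d+n) and [s, s+n) overlap iff each starts before the other ends. */
    int overlap = n != 0 && d < s + n && s < d + n;
    return overlap ? memmove(dst, src, n) : memcpy(dst, src, n);
}
```

The check costs a couple of comparisons per call, which is the trade-off being argued about: it protects incorrect callers at a small cost to correct ones.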

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Fri Aug 30 18:33:40 2024
    On 30/08/2024 17:44, Scott Lurndal wrote:
    [email protected] (John Dallman) writes:
    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:
    [email protected] (John Dallman) writes:
    Android is apparently waiting for a new RISC-V instruction set
    extension;
    Which one?

    I don't know what its name is. It was proposed by Hans Boehm, and the
    Android team pointed me to this discussion on a RISC-V mailing list:

    https://lists.riscv.org/g/tech-unprivileged/topic/92916241

    Searching with various terms suggests it might well be the Zabha
    extension, ratified in April this year, but that is deduction.
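For context, Zabha adds atomic memory operations on bytes and halfwords. The base A extension only has word and doubleword AMOs, so without Zabha a compiler has to emulate something like the following, for example with a masked LR/SC loop on the enclosing aligned word or a library call. Whether this is the extension Android is waiting for remains a deduction. Here is a portable C11 sketch of the kind of code affected (names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A byte-sized counter shared between threads. On RISC-V without Zabha
   there is no single AMO instruction for an 8-bit operand, so the
   compiler has to synthesise this operation. */
static _Atomic uint8_t events;

/* Returns the previous value; the unsigned type wraps modulo 256. */
uint8_t record_event(void)
{
    return atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
}

uint8_t events_seen(void)
{
    return atomic_load_explicit(&events, memory_order_relaxed);
}
```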

    You may not consider it large-scale, but we wanted to have two
    RISC-V servers for teaching (in particular, for the compiler
    course).

    Makes sense. It is not in itself "large-scale," but suitable hardware is
    only going to be available if someone wants a lot of it, enough to make
    building it worthwhile.

    Now it's two years later, and the RISC-V servers are still not
    showing up.

    Yup. RISC-V established a lot of awareness, and some expectations, but
    there hasn't been the equipment to let people start using it.

    I expect RISC-V to gradually encroach on the embedded market and as microcontroller IP that can be included in SoC accelerators (primarily
    to avoid license fees for the alternatives such as Cortex-M7).


    That's where I expect to see it, and I hope to see more of it. At the
    very least, decent competition will help push ARM forward.

    What I personally would like to see is RISC-V extensions aimed at
    real-time and deterministic systems - RTOS acceleration, hardware
    semaphores, and the like.

    I don't see it replacing ARM64, X86_64/AMD64 or other server-grade processors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Fri Aug 30 18:52:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    I find it funny to find this on an Element14 page (the company
    formerly known as Acorn, the original A in ARM); Element14 has long
    since been bought by Broadcom, but apparently some web presence
    still exists.

    A bit of exploration of the website reveals it's a promotional website
    for Farnells, an electronics distributor, and doesn't seem to have
    anything to do with ex-Acorn or Broadcom.

    But making the jump from embedded systems and SBCs to servers has
    not happened for RISC-V yet, and looking how long it took to
    establish ARM in servers, I expect that RISC-V will take quite a
    while. I guess that high-performance cores like those that Ahead
    is probably working on are one component along the way.

    A necessary step, but there are many more.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to John Dallman on Fri Aug 30 17:59:42 2024
    John Dallman <[email protected]> wrote:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    AMD64 already has the buy-in of application vendors for desktops and
    servers, so it does not have the problem that extensions create
    uncertainty among application vendors.

    My guess is that there are the following motivations:

    1) The new instructions make technical sense (for certain
    applications).

    This is sometimes true, but manufacturers tend to over-promote them,
    claiming wider applicability and bigger effects than show up in real application code. After a few disappointments, ISVs tend to become less
    keen on doing work on marketing advice.

    Some manufacturers pay bonuses to their technical marketing people for getting ISVs to adopt new ISA extensions. This is counterproductive,
    because it means the ISVs are sure that the marketing advice will take no account of their interests.

    They prefer to wait until an extension has been out for several years
    before supporting it, so that it's available in pretty well all the
    end-user hardware that hasn't finished its depreciation yet. That's
    driven by a facet of the application software industry that most hardware manufacturers don't seem to understand. They appear to assume that
    computers are set up with an initial software load and carry on running
    that for their entire lives.

    In fact, organisations replace about a quarter of their machines each
    year, always buying up-to-date ones, and want to run the /same/ version
    of software on all of them. They want common software versions for data compatibility, ease of training and so on. That means that a new release
    of an application has to run on all the machines sold in the last four
    years, sometimes longer.

    I assume you work in the high end, as the average desktop PC is replaced
    every 8 years on a “use it until it breaks” policy.

    Dell will tell you 5 years, and Google is paid to say the same.
    And that actually might be true for laptops, but not desktops.

    The bulk of the PC’s and servers where I work are a dozen years old.
    A smattering of new PC’s bring the average down to 9 years.

    Some manufacturers expect ISVs to produce multiple versions of software
    for different sets of ISA extensions. They'll do that if the gains are
    large enough, but they have to be quite large: for my employer, 25% is enough, but 10% isn't. We haven't had to make a decision in between those numbers yet. We've had one 25% case, for Intel SSE2, and many of 10% or
    less.
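Producing "multiple versions" usually means dispatching at run time on the CPU's capabilities. The sketch below targets x86 with GCC or Clang and uses the `__builtin_cpu_supports` builtin and the `target` function attribute. The AVX2 variant is simply the same loop compiled with AVX2 enabled, so the compiler may vectorise it. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#endif

/* Baseline version: runs on any CPU the binary supports. */
static long sum_generic(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#ifdef HAVE_X86_DISPATCH
/* Same source; the compiler is allowed to use AVX2 instructions here. */
__attribute__((target("avx2")))
static long sum_avx2(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
#endif

/* Choose an implementation based on the CPU actually running the code. */
long sum_ints(const int *v, size_t n)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2(v, n);
#endif
    return sum_generic(v, n);
}
```

In real code the choice would normally be made once and cached in a function pointer. Where ifunc is available, GCC's `target_clones` attribute can automate the same idea. Either way, every variant has to be tested, which is part of why the gain must be large to justify it.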

    2) Even if the applications that the users use don't benefit from
    the extensions, the users think (thanks also to Intel's marketing)

    The sheer flood of extensions from Intel means most end-user
    organisations have stopped trying to keep track these days.

    John


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Aug 30 18:11:48 2024
    On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:

    On 8/29/2024 11:23 AM, MitchAlsup1 wrote:

    Time to up your game to an industrial quality ISA.

    Open question of what an "industrial quality" ISA has that BJX2 lacks...
    Limiting the scope to things that RISC-V and ARM have.

    Proper handling of exceptions (ignoring them is not proper)

    If you mean FPU exceptions, maybe.

    As far as general interrupt handling, mechanism isn't too far off from
    what SH-4 had used, and apparently also RISC-V's CLINT and MIPS work in
    a similar way.

    Though, with differences as to how they divide up exceptions.
    In my case:
    Reset;
    General Fault;
    External Interrupt;
    TLB/MMU;
    Syscall.


    Integer Overflow
    Bad Instruction encoding--OpCode exists but not as this
    instruction uses it. Random code generation can use
    every instruction without privilege.
    Bad address--address exists but you are not allowed to touch it
    with LD or ST instruction or to attempt to execute it.
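Without an overflow trap, a language that requires overflow detection has to check explicitly on every operation. The sketch below uses the GCC/Clang `__builtin_add_overflow` builtin, and the wrapper name is illustrative. An ISA with an integer-overflow exception could perform this check as a side effect of the add itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Returns true and stores a + b in *sum if the exact sum fits in int.
   On overflow it returns false; the builtin still stores the wrapped
   two's-complement result in *sum. */
bool checked_add(int a, int b, int *sum)
{
    return !__builtin_add_overflow(a, b, sum);
}
```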

    Proper IEEE 754-2018 handling of FMAC (compute all the bits)

    Possibly true.
    My FPU can more-or-less pass the 1985 spec, but not the 2018 spec.

    As I understand it, you don't even get FMUL correctly rounded.
    To get it properly rounded you have to compute the full 53*53
    product.
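The reason the full product is needed is that two 53-bit significands multiply to a result of up to 106 bits. Round-to-nearest-even depends on the rounding bit and also on whether *any* lower bit is set (the sticky bit). The sketch below handles positive normal doubles in [1, 2) only and shows only the round-to-nearest-even mode. It uses the GCC/Clang `unsigned __int128` extension to hold the exact product.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a and b (both assumed in [1, 2)) by forming the exact 106-bit
   product of their significands and rounding it once to 53 bits,
   round to nearest, ties to even. */
double mul_rne(double a, double b)
{
    assert(a >= 1.0 && a < 2.0 && b >= 1.0 && b < 2.0);
    const double two52 = 4503599627370496.0;           /* 2^52 */
    uint64_t ma = (uint64_t)(a * two52);                /* exact 53-bit significand */
    uint64_t mb = (uint64_t)(b * two52);
    unsigned __int128 p = (unsigned __int128)ma * mb;   /* exact, in [2^104, 2^106) */

    /* Keep the top 53 bits; the low 'shift' bits are discarded. */
    int shift = (p >> 105) ? 53 : 52;
    uint64_t q = (uint64_t)(p >> shift);
    unsigned __int128 rem  = p & (((unsigned __int128)1 << shift) - 1);
    unsigned __int128 half = (unsigned __int128)1 << (shift - 1);

    /* Round up if above half, or exactly half and q is odd (ties to even).
       Deciding 'rem > half' needs every discarded bit: the sticky bit. */
    if (rem > half || (rem == half && (q & 1)))
        q++;
    if (q >> 53) {            /* rounding carried out to 2^53 */
        q >>= 1;
        shift++;
    }
    /* Value is q * 2^(shift - 104). q < 2^53 converts exactly, and
       dividing by a power of two is exact. */
    return (double)q / (double)(UINT64_C(1) << (104 - shift));
}
```

If the discarded low bits were truncated instead of examined, cases just above a tie would round the wrong way. That is the kind of error a reduced-width multiplier makes.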

    Floating Point Transcendentals

    Not present in many/most ISA's I have looked at.

    Its time has come.

    HyperVisors/Secure Monitors

    Possible. I had considered doing it essentially with emulators, but
    granted, this is not quite the same thing.

    How can something of lesser privilege emulate something of greater
    privilege ??


    Seems many of the extant RV implementations don't have this either.

    Then not of Industrial quality !!

    Write Interrupt service routines entirely in HLL

    If you mean C... I do have this.

    #ifdef TK_REGSAVE_TBR
    __interrupt_tbrsave void __isr_syscall(void)
    #else
    __interrupt void __isr_syscall(void)
    #endif
    {
    ....
    }

    So there are NO (nada == 0) ASM instructions between "Core takes
    interrupt" and control arriving at __isr_syscall() ??


    AKA: What exactly is the '__interrupt' for?...

    However, the ISR's can't access virtual memory apart from manually translating the pointers.

    The various architectural CR's can be accessed from C as well, such as "__arch_tbr" to access TBR, etc.


    proper Privileges and Priorities

    ?...

    OS cannot access Hypervisor data/code
    Hypervisor cannot access Secure Monitor data/code

    Every thread runs at its proper priority at all cycles that it has
    control.
    Thus, you cannot receive interrupt control and then set priority;
    priority needs to be part of delivering control.

    Threads are always re-entrant eave the instant they receive control.

    Application can call OS
    OS can call Hypervisor
    Hypervisor can call secure Monitor
    as easily as thread can call itself.

    Interrupts need no maintenance when Hypervisor changes OS[k] to OS[j]
    Interrupts need no maintenance when Secure Monitor changes
    Hypervisor[k] to Hypervisor[j]

    System has a means to detect DRAM failures and map-out affected
    pages.

    System has a means to detect Device failure and restart device
    or change mapping to device.

    Multi-location ATOMIC events

    Possibly true.
    Maybe the "volatile" mechanism is weak.
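Without multi-location atomics, software can update several locations atomically only if they happen to fit in one atomically accessible word, or else by taking a lock. The C11 sketch below shows the first workaround. It packs two 32-bit balances into one 64-bit atomic and moves a value between them with a compare-and-swap loop. The names are illustrative. This does not generalise to arbitrary, non-adjacent locations, which is what hardware multi-location ATOMIC events would provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Two 32-bit balances packed into one 64-bit word: high half A, low half B. */
static _Atomic uint64_t accounts;

static uint64_t pack(uint32_t a, uint32_t b) { return ((uint64_t)a << 32) | b; }
static uint32_t hi(uint64_t w) { return (uint32_t)(w >> 32); }
static uint32_t lo(uint64_t w) { return (uint32_t)w; }

void set_accounts(uint32_t a, uint32_t b) { atomic_store(&accounts, pack(a, b)); }

/* Move amt from A to B so no observer ever sees an intermediate state.
   Returns false if A holds less than amt. */
bool transfer(uint32_t amt)
{
    uint64_t old = atomic_load(&accounts);
    for (;;) {
        if (hi(old) < amt)
            return false;
        uint64_t new_word = pack(hi(old) - amt, lo(old) + amt);
        /* On failure, 'old' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&accounts, &old, new_word))
            return true;
    }
}

uint32_t balance_a(void) { return hi(atomic_load(&accounts)); }
uint32_t balance_b(void) { return lo(atomic_load(&accounts)); }
```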

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Fri Aug 30 17:58:31 2024
    David Brown <[email protected]> writes:
    If you want to write reliable code that can be distributed as source and
    compiled by any conforming C/C++ compiler, you need to be very sure that
    you avoid relying on behaviour that is not specified and documented.

    GCC and Clang/LLVM are distributed in source code, and given that
    their maintainers find it ok to compile programs to arbitrary code if
    they do not meet your expectations, one should expect that they do not
    rely on behaviour that is not specified and documented, and never have
    (at least not since adopting this attitude). But even they are not up
    to the task. As John Regehr writes
    <https://blog.regehr.org/archives/761>:

    |LLVM/Clang 3.1 and GCC (SVN head from July 14 2012) [...] execute
    |undefined behaviors even when compiling an empty C or C++ program with
    |optimizations turned off.

    I am not surprised that nobody has risen to my challenge <[email protected]>:

    |Write a proof-of-concept Forth interpreter in the language you
    |advocate that runs at least one of bubble-sort, matrix-mult or sieve
    |from bench/forth in
    |<http://www.complang.tuwien.ac.at/forth/bench.zip>

    in the last 7 years.

    It is, of course, a lot easier to write software that appears roughly
    correct in the source code and passes its tests, than software that is
    rigidly accurate.

    I never heard about "rigidly accurate" as a property of software
    (except maybe numeric software).

    The practice is that software is either tested (the usual case) or
    formally proved correct. For a C program to be formally proved
    correct would, first and foremost, require a formal specification of C.

    I see nothing wrong in blaming programmers for using "memcpy" when they
    should have used "memmove" - it was those programmers that made the
    error.

    I did not expect *you* to see what's wrong. But I hope that I never
    have anything to do with anything that you programmed.

    What's wrong with blaming the application programmers is that it does
    not help the users of the binary that misbehaved after glibc was
    "up"graded. It also does not help users who have a no-longer
    maintained piece of source code that used to work with earlier
    versions of glibc, but now acts up on some hardware. Sure, there are workarounds, but first the user would have to understand the problem.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Fri Aug 30 18:20:54 2024
    On Fri, 30 Aug 2024 16:28:08 +0000, David Brown wrote:

    On 30/08/2024 17:42, John Dallman wrote:

    I always keep old versions of compilers around, and don't change
    compilers (or libraries) in the middle of a project. Since I work with embedded systems, there are significantly fewer users compared to, say,
    x86 target compilers. Thus there is a higher risk of bugs being missed
    in beta testing and going unreported for longer. (IME bugs are far more likely in vendor SDK's than in gcc or newlib, but I keep everything
    archived just in case.) I also like to have reproducible builds -
    something that many Linux distributions are aiming for these days -
    which requires archiving the toolchain.

    There was once a software CAD vendor that made the transition from
    SunOS to Solaris, and we, as a major purchaser, could not follow due
    to several OS differences: SunOS had a license server that counted
    licenses, while Solaris had a license server that counted the cross
    product of licenses*cores. We as a small company could not afford to
    upgrade to Solaris. Then their new product simply had different bugs.
    We chose to stay with the old SW because we knew where all the bugs
    were and how not to stimulate them into nasal demons. Ultimately
    they got bought out and disappeared...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Aug 30 14:34:23 2024
    If you want to write reliable code that can be distributed as source and compiled by any conforming C/C++ compiler, you need to be very sure that you avoid relying on behaviour that is not specified and documented. You need to write correct code. That means if you want to copy some memory with overlapping source and destination arrays, you use "memmove" - the function for that purpose. You don't use "memcpy", since it is specified explicitly as requiring non-overlapping arrays.

    The difficulty here is that the tools provide very little help for that, because all too often it's virtually impossible for the tools to
    understand that this particular code can/will hit UB.

    So it's all up to the programmer, who often doesn't know either.
    Other than using CompCert, I don't know of any reliable way for
    a programmer to make sure his C code does not suffer from UB.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bernd Linsel@21:1/5 to Anton Ertl on Fri Aug 30 21:08:09 2024
    The clang/gcc maintainers' POV violates the first part of Postel's Law:

    Be liberal in what you accept, and conservative in what you send.

    Life would be a lot easier if they just provided a -WUB option that
    warns and explains *any* construct that the compiler may regard as UB.

    (The various existing options, e.g. -Wnull-dereference etc., and
    the most deviant outgrowth, -fsanitize=..., are *not* reliable; the
    compiler happily optimizes whole execution paths away and does not say
    a single word about it.)
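A classic case of a path disappearing silently is an overflow check written in terms of the overflow itself. Because signed overflow is undefined, a compiler may assume `a + b < a` is never true for `b > 0` and fold the check away, often without a warning at default settings. In the sketch below, the UB-dependent form appears only in a comment, followed by a well-defined replacement. The function names are illustrative.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* UB-dependent form (do not use): when the sum overflows, the behaviour
   is undefined, so an optimizer may reduce the expression to 'false':

       bool add_overflows_bad(int a, int b)
       { return b > 0 ? a + b < a : a + b > a; }
*/

/* Well-defined form: compare against the limits before adding.
   Neither INT_MAX - b (b > 0) nor INT_MIN - b (b <= 0) can overflow. */
bool add_overflows(int a, int b)
{
    return (b > 0) ? (a > INT_MAX - b) : (a < INT_MIN - b);
}
```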


    On 30.08.24 19:58, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    If you want to write reliable code that can be distributed as source and
    compiled by any conforming C/C++ compiler, you need to be very sure that
    you avoid relying on behaviour that is not specified and documented.

    GCC and Clang/LLVM are distributed in source code, and given that
    their maintainers find it ok to compile programs to arbitrary code if
    they do not meet your expectations, one should expect that they do not
    rely on behaviour that is not specified and documented, and never have
    (at least not since adopting this attitude). But even they are not up
    to the task. As John Regehr writes
    <https://blog.regehr.org/archives/761>:

    |LLVM/Clang 3.1 and GCC (SVN head from July 14 2012) [...] execute
    |undefined behaviors even when compiling an empty C or C++ program with
    |optimizations turned off.

    I am not surprised that nobody has risen to my challenge <[email protected]>:

    |Write a proof-of-concept Forth interpreter in the language you
    |advocate that runs at least one of bubble-sort, matrix-mult or sieve
    |from bench/forth in
    |<http://www.complang.tuwien.ac.at/forth/bench.zip>

    in the last 7 years.

    It is, of course, a lot easier to write software that appears roughly
    correct in the source code and passes its tests, than software that is
    rigidly accurate.

    I never heard about "rigidly accurate" as a property of software
    (except maybe numeric software).

    The practice is that software is either tested (the usual case) or
    formally proved correct. For a C program to be formally proved
    correct would, first and foremost, require a formal specification of C.

    I see nothing wrong in blaming programmers for using "memcpy" when they
    should have used "memmove" - it was those programmers that made the
    error.

    I did not expect *you* to see what's wrong. But I hope that I never
    have anything to do with anything that you programmed.

    What's wrong with blaming the application programmers is that it does
    not help the users of the binary that misbehaved after glibc was
    "up"graded. It also does not help users who have a no-longer
    maintained piece of source code that used to work with earlier
    versions of glibc, but now acts up on some hardware. Sure, there are workarounds, but first the user would have to understand the problem.

    - anton


    --
    Bernd Linsel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Stefan Monnier on Fri Aug 30 20:38:00 2024
    In article <[email protected]>, [email protected] (Stefan Monnier) wrote:

    Other than using CompCert, I don't know of any reliable way for
    a programmer to make sure his C code does not suffer from UB.

    That looked very interesting for a few minutes. If CompCert could warn
    about undefined behaviour reasonably reliably, I'd be very interested in
    using it as a specialised lint program.

    As far as I can see from the documentation, the C interpreter that comes
    with it can do that, but that's not very practical with millions of lines
    of source.

    because all too often it's virtually impossible for the tools to
    understand that this particular code can/will hit UB.

    Presumably this is often impractical for a compiler, and run-time
    checking is required? I gave Clang's Undefined Behaviour Sanitizer a try
    a few weeks ago, and must get back to it.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Brown on Fri Aug 30 21:03:00 2024
    In article <vasruo$id3b$[email protected]>, [email protected] (David Brown) wrote:

    On 30/08/2024 17:42, John Dallman wrote:
    Plain old compiler bugs, introduced while fixing other ones, are
    quite enough to make me assume that I'll find problems on each
    change of compiler.

    I always keep old versions of compilers around, and don't change
    compilers (or libraries) in the middle of a project.

    I always have at least a couple of machines at the previous build
    standard of any platform, often more machines and/or older build
    standards.

    Changing compilers or libraries is done at new major releases.

    If you want to write software that is "correct because it passed
    its tests", you can only expect it to be reliable when it is run
    exactly as tested. That means it must be compiled as it was during
    tests (same compiler, same options, same library), and arguably
    even run only on the same hardware (if you only test on one
    particular cpu, OS, etc., you can only be sure it works on that
    cpu, OS, etc.).

    This is simpler when you produce closed-source binary software. We only
    ship builds we've tested. That means the /same binaries/ as we tested,
    not rebuilt or modified. This requires a separate test harness, rather
    than testing code compiled into the binaries.

    We test on a wide variety of hardware for the most-used platforms, by
    putting it into the distributed testing pools and always knowing which
    machine an individual test case ran on, because it's recorded in the test results.

    That's why a lot of pre-compiled commercial software gives
    particular versions of particular OS's or Linux distributions in
    their lists of requirements - even though the software would
    probably work fine on a much wider range.

    We specify what we specifically support, because we've tested that, plus
    the much broader requirements that it should work on. For Linux those are
    a GCC runtimes version (currently 8.x) or later and a glibc version
    (currently 2.28) or later. We don't seem to have problems with
    compatibility since we understood how the compatibility works with those libraries, and started doing it that way.
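A program can check which glibc it is actually running against with glibc's `gnu_get_libc_version()`. The sketch below implements a minimum-version check of the kind implied by a "glibc 2.28 or later" requirement. The comparison helper is illustrative. The glibc call is guarded so the file still compiles on other C libraries, and the version string is parsed with `strtol` rather than `sscanf` to stay clear of out-of-range UB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#ifdef __GLIBC__
#include <gnu/libc-version.h>
#endif

/* True if a "major.minor..." version string is >= want_major.want_minor. */
bool version_at_least(const char *ver, long want_major, long want_minor)
{
    char *end;
    long major = strtol(ver, &end, 10);
    if (end == ver || *end != '.')
        return false;                     /* unparseable: be conservative */
    const char *p = end + 1;
    long minor = strtol(p, &end, 10);
    if (end == p)
        return false;
    return major > want_major || (major == want_major && minor >= want_minor);
}

/* Check the C library actually loaded at run time (glibc only). */
bool running_glibc_at_least(long want_major, long want_minor)
{
#ifdef __GLIBC__
    return version_at_least(gnu_get_libc_version(), want_major, want_minor);
#else
    (void)want_major;
    (void)want_minor;
    return false;                         /* not glibc */
#endif
}
```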

    If there's a problem on a specifically supported Linux, we'll fix it
    unless that's impossible. If there's a problem on one where it should
    work, we'll investigate it, and fix it if we can, which may cause a distribution to be added to the specifically supported list. If we can't
    fix a problem, we'll explain why not, and normally add the problem to the documentation. We can't do miracles, but we do pretty well.

    Yes, doing good support is expensive, but it pays off in customer loyalty, which means money.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Brett on Fri Aug 30 20:38:00 2024
    In article <vat1ad$jeb4$[email protected]>, [email protected] (Brett) wrote:

    I assume you work in the high end, as the average desktop PC is
    replaced every 8 years on a _use it until it breaks_ policy.

    Yes: we supply software components for stuff where end-users understand
    they need powerful machines, and generally have them.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to [email protected] on Fri Aug 30 21:24:10 2024
    [email protected] (MitchAlsup1) writes:
    On Fri, 30 Aug 2024 16:28:08 +0000, David Brown wrote:

    On 30/08/2024 17:42, John Dallman wrote:

    I always keep old versions of compilers around, and don't change
    compilers (or libraries) in the middle of a project. Since I work with
    embedded systems, there are significantly fewer users compared to, say,
    x86 target compilers. Thus there is a higher risk of bugs being missed
    in beta testing and going unreported for longer. (IME bugs are far more
    likely in vendor SDK's than in gcc or newlib, but I keep everything
    archived just in case.) I also like to have reproducible builds -
    something that many Linux distributions are aiming for these days -
    which requires archiving the toolchain.

    There was once a software CAD vendor that made the transition from
    SunOS to Solaris, and we, as a major purchaser, could not follow due

    Solbourne?

    https://en.wikipedia.org/wiki/Solbourne_Computer

    to several OS differences: SunOS had a license server that counted
    licenses, while Solaris had a license server that counted the cross
    product of licenses*cores. We as a small company could not afford to
    upgrade to Solaris. Then their new product simply had different bugs.
    We chose to stay with the old SW because we knew where all the bugs
    were and how not to stimulate them into nasal demons. Ultimately
    they got bought out and disappeared...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Aug 31 02:04:23 2024
    On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:

    On 8/30/2024 1:11 PM, MitchAlsup1 wrote:
    On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:
    Integer Overflow

    Not usually a thing. Pretty much everything seems to treat integer
    overflow as silently wrapping.

    Ada wants these.



    Bad Instruction encoding--OpCode exists but not as this
       instruction uses it. Random code generation can use
       every instruction without privilege.

    Hit or miss.

    Will usually fault on invalid instructions.

    Must be 100% to guarantee upwards compatibility.

    There is logic in place to reject privileged instructions in user-mode,
    if the CPU is actually run in user-mode. Some of this is still TODO (currently, TestKern is still running everything in Supervisor Mode).

    Yes, it is a pain--but a pain that is absolutely worth it.


    The alternative is to treat them as UB, so they may be one of:
    Trap;
    Do something else (like, if an instruction was added);
    Do something wonky / unintended.

    In practice, this seems to be more how it works.

    Bad practice == not industrial quality.
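The difference between "trap on every invalid encoding" and "treat it as UB" can be shown with a toy decoder. Every unassigned opcode, and every defined opcode used with reserved bits set, takes the illegal-instruction path instead of doing whatever the hardware happens to do. The ISA, opcodes, and field layout below are entirely hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { EXEC_OK, TRAP_ILLEGAL } exec_result;

/* Hypothetical 32-bit encoding: bits 31..24 are the opcode.
   0x00 NOP: all other bits are reserved and must be zero.
   0x01 ADD: bits 7..0 are reserved and must be zero. */
exec_result decode(uint32_t insn)
{
    uint8_t op = (uint8_t)(insn >> 24);
    switch (op) {
    case 0x00:
        return (insn & 0x00FFFFFFu) ? TRAP_ILLEGAL : EXEC_OK;
    case 0x01:
        return (insn & 0xFFu) ? TRAP_ILLEGAL : EXEC_OK;
    default:                              /* every unassigned opcode traps */
        return TRAP_ILLEGAL;
    }
}
```

Because reserved bits trap today, a later revision can assign them meaning without old binaries silently changing behaviour, which is the upward-compatibility argument.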


    Bad address--address exists but you are not allowed to touch it
    with LD or ST instruction or to attempt to execute it.

    If the MMU is enabled, it should fault on bad memory accesses.

    In physical addressing mode, it does not trap.

    YOU FAIL TO UNDERSTAND--there is an area in memory where the
    preserved registers are stored--stored in a way that only 3
    instructions can access--and the PTE is marked RWE=000
    This prevents damaging the contract between callee and caller.
    3 instructions can access these pages ENTER, EXIT and RET
    nothing else.


    IIRC, there was a mechanism on the bus to deal with accesses to bad
    physical addresses (returning all zeroes). Otherwise, trying to access
    an invalid address would cause the CPU to deadlock.

    It is NOT a BAD address--it is a good but inaccessible address
    outside those 3 instructions.



    As I understand it, you don't even get FMUL correctly rounded.
    To get it properly rounded you have to compute the full 53*53
    product.

    AFAICT, this wasn't required for the 1985 spec...

    You Cannot get rounding correct unless you "compute as if to
    infinite precision" and then follow the rules of rounding
    (all modes).

    Things like "optional trap on denormal" seems like it should be OK (this
    is what MIPS and friends did at the time).

    I am talking about FMUL and getting the proper result--no
    denorms needed.

    For the most part, seems like the '85 spec was more "uses these formats
    and gets more or less the same values, good enough". A lot of the
    pedantic rounding stuff, etc, seemed to be more something for the 2008
    spec.

    Then you fail to grasp the spirit of the spec.



    The lack of single-rounded FMA shouldn't matter, since this wasn't added until later.

    It was in the 19985 spec.

    Support for Binary16 is a bonus feature (since 85 spec only gave Single
    / Double / Extended), but Binary16 is useful...

    So is a dildo for some people. Irrelevant to the issues at hand.



    Floating Point Transcendentals

    Not present in many/most ISA's I have looked at.

    Its time has come.


    Then who has done it, besides x87 and similar?...

    I am talking about transcendentals that take FDIV number of cycles
    Not FADD taking 200 cycles.

    Not going to put much weight in something if:
    The only real known example is the legacy x87 ISA;
    Pretty much everyone else (including on x86-64) is using unrolled Taylor-series expansion and similar.

    At least spell it Chebychev.
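For reference, the software approach under discussion evaluates a fixed polynomial after range reduction. The minimal sketch below uses Taylor coefficients and Horner's rule and is valid only for small |x|; range reduction is not shown, and the function name is illustrative. Production libraries typically use minimax or Chebyshev-derived coefficients instead, which spread the error more evenly across the interval for the same degree.

```c
#include <assert.h>

/* sin(x) ~= x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, evaluated with
   Horner's rule in x^2. Intended for |x| <= pi/4 after range reduction. */
double sin_poly(double x)
{
    const double c3 = -1.0 / 6.0;
    const double c5 =  1.0 / 120.0;
    const double c7 = -1.0 / 5040.0;
    const double c9 =  1.0 / 362880.0;
    double x2 = x * x;
    return x + x * x2 * (c3 + x2 * (c5 + x2 * (c7 + x2 * c9)));
}
```

Each step is an FMUL plus an FADD, so even a short polynomial plus range reduction costs several dependent latencies. That is the gap a hardware transcendental at FDIV-like cost would close.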



    HyperVisors/Secure Monitors

    Possible. I had considered doing it essentially with emulators, but
    granted, this is not quite the same thing.

    How can something of lesser privilege emulate something of greater
    privilege ??


    Top level OS (or hypervisor layer) runs an emulator, which runs any VMs holding guest OS instances.

    But if the most you have is Supervisor how do you emulate something
    of higher privilege efficiently ??


    Granted, running the main OS in an emulator wouldn't be great for performance. But, in most contexts, this isn't really a thing.

    Quit acting stupid. You are better than that.

    Like, pretty sure Windows and Linux still tend to run bare-metal on most systems, ... (or, if a VM layer exists, it is unclear what if-any
    purpose it would serve).

    You can run both windows and linux at the same time.
    Windows for games and documents, linux for CAD.


    But, in any case, one doesn't need any special ISA level support to make things like QEMU and DOSBox work.

    Quit acting stupid. You are better than that.

    And, if a person wants to essentially use something like QEMU to run the whole OS, nothing really is stopping them.

    Quit acting stupid. You are better than that.

    Well, except maybe how slow that QEMU and DOSBox tend to be on something
    like a RasPi (on a 50MHz CPU, one would likely be hard-pressed to even
    run something like SimCity at acceptable speeds).

    Quit acting stupid. You are better than that.


    Not yet tried porting something like DOSBox to my stuff though...


    But, a more clever emulator could likely leverage things like hardware address translation and maybe only JIT parts of the target system (vs,
    say, fully emulating the memory access and using JIT compilation or interpretation for "pretty much everything").

    You need efficient 2-level (or more) translation.
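The following toy model shows what 2-level translation means. A guest virtual address is translated by the guest's page table to a guest-physical address, and the hypervisor's second-stage table then maps that to a host-physical address. The single-level tables, 4 KiB pages, and tiny sizes are simplifying assumptions. Real hardware walks multi-level tables at both stages, and every guest page-table access itself needs a second-stage translation, which is why doing this efficiently in hardware matters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES     16                /* toy address spaces: 16 pages each */
#define INVALID    UINT32_MAX        /* marks an unmapped page */

typedef struct { uint32_t frame[NPAGES]; } page_table;

/* One translation stage: page number lookup, offset carried through. */
static bool translate(const page_table *pt, uint64_t addr, uint64_t *out)
{
    uint64_t page = addr >> PAGE_SHIFT;
    if (page >= NPAGES || pt->frame[page] == INVALID)
        return false;                /* page fault at this stage */
    *out = ((uint64_t)pt->frame[page] << PAGE_SHIFT)
         | (addr & ((1u << PAGE_SHIFT) - 1));
    return true;
}

/* Stage 1: guest virtual -> guest physical (guest OS's table).
   Stage 2: guest physical -> host physical (hypervisor's table). */
bool translate_2level(const page_table *guest, const page_table *host,
                      uint64_t gva, uint64_t *hpa)
{
    uint64_t gpa;
    return translate(guest, gva, &gpa) && translate(host, gpa, hpa);
}
```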

    Say, for example, if the host system and guest OS are running the same
    ISA (vs, say, the guest OS running x86 or x86-64; on a host running a different ISA).

    What if one thread wants 386, another wants 486, another x86-64
    AND all three get the proper undefined instruction trapping.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Dallman on Sat Aug 31 08:59:16 2024
    John Dallman <[email protected]> schrieb:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Concerning the demand, RISC-V has the advantage of no ARM tax (and
    legal costs like those between ARM and Qualcomm over the
    developments started at NUVIA)

    True, although the market for high-performance application cores is less price-sensitive than the market for low-performance embedded ones.

    Definitely - if you have 512 GB DDR5 memory in your workstation, the
    cost of the CPU itself is a relatively small fraction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 08:45:16 2024
    Bernd Linsel <[email protected]> schrieb:
    The clang/gcc maintainers' POV violates the first part of Postel's Law:

    Be liberal in what you accept, and conservative in what you send.

    Life would be a lot easier if they just provided a -WUB option that
    warns and explains *any* construct that the compiler may regard as UB.

    Patches welcome.

  • From Thomas Koenig@21:1/5 to Thomas Koenig on Sat Aug 31 09:29:46 2024
    Thomas Koenig <[email protected]> schrieb:
    Bernd Linsel <[email protected]> schrieb:
    The clang/gcc maintainers' POV violates the first part of Postel's Law:

    Be liberal in what you accept, and conservative in what you send.

    Life would be a lot easier if they just provided a -WUB option that
    warns and explains *any* construct that the compiler may regard as UB.

    Maybe a bit more elaborate:

    #include <stdio.h>

    int main()
    {
    int i;
    sscanf("%d", &i);

    Should be "scanf", of course.

  • From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 09:24:59 2024
    Bernd Linsel <[email protected]> schrieb:
    The clang/gcc maintainers' POV violates the first part of Postel's Law:

    Be liberal in what you accept, and conservative in what you send.

    Life would be a lot easier if they just provided a -WUB option that
    warns and explains *any* construct that the compiler may regard as UB.

    Maybe a bit more elaborate:

    #include <stdio.h>

    int main()
    {
    int i;
    sscanf("%d", &i);
    return 0;
    }

    Should this be warned about?

    Or what about

    void foo(int *a)
    {
    *a ++;
    }

    Two possible cases of undefined behavior here: a could be an
    invalid pointer, and the arithmetic operation could overflow.

  • From Bernd Linsel@21:1/5 to Thomas Koenig on Sat Aug 31 13:10:22 2024
    On 31.08.24 11:24, Thomas Koenig wrote:
    Bernd Linsel <[email protected]> schrieb:
    The clang/gcc maintainers' POV violates the first part of Postel's Law:

    Be liberal in what you accept, and conservative in what you send.

    Life would be a lot easier if they just provided a -WUB option that
    warns and explains *any* construct that the compiler may regard as UB.

    Maybe a bit more elaborate:

    #include <stdio.h>

    int main()
    {
    int i;
    scanf("%d", &i);
    return 0;
    }

    Should this be warned about?

    [corrected sscanf -> scanf]
    Why? This "program" is meant to read one line, presumably
    containing an integer number, from stdin and ignore it. No UB anywhere.

    It does accept an empty line as well as 3432 MB of garbage, and even an
    integer without leading space, but always returns true.
    Scanf's man page does not state anything about warn-unused-result, and
    its input parsing is clearly described.

    I would not complain if the compiler would deliver something that's
    roughly equivalent to

    int main(void)
    {
    (void)scanf("%*s");
    return 0;
    }

    while

    int main(void)
    {
    return 0;
    }

    would be unacceptable.


    Or what about

    void foo(int *a)
    {
    *a ++;
    }

    Two possible cases of undefined behavior here: a could be an
    invalid pointer, and the arithmetic operation could overflow.

    The result of the pointer increment is never used, so the compiler will
    warn and not emit any increment instruction anyway.

    Furthermore, as *a is not declared volatile, the read operation is
    superfluous. A call to foo() may thus legally result in:

    <nothing>,

    but a still better result would be:

    foo(x) -> assert(__builtin_expect(x != NULL, 1)).

    Additionally, I'd expect at least 2 warnings:
    - result of pointer increment `a++` never used
    - result of variable access `*a` never used.
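    A detail both posts touch on: `*a ++` parses as `*(a++)`, because postfix
    `++` binds tighter than unary `*`. The function therefore bumps its local
    copy of the pointer and discards the loaded value; no signed-integer
    overflow can happen here, only the invalid-pointer case remains. A minimal
    sketch contrasting it with `(*a)++`, which is presumably what incrementing
    the pointee would look like (both function names are hypothetical):

```c
/* As written in the example: parses as *(a++).  It advances the local
   copy of the pointer and throws away the loaded value, so the caller's
   int is never modified.  The (void) cast only silences the
   "value computed is not used" warning. */
void foo_as_written(int *a)
{
    (void)*a++;
}

/* Increments the int that a points to.  This is the variant where
   signed overflow (INT_MAX + 1) would be undefined behavior. */
void foo_increment_pointee(int *a)
{
    (*a)++;
}
```

    Calling `foo_as_written(&x)` leaves `x` unchanged, while
    `foo_increment_pointee(&x)` adds one to it.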

    GCC provides means such as the nonnull() attribute, and even if that
    were not available, it is good practice to assert() pointer arguments --
    or check and return an error code -- at the beginning of the function
    body, if you expect to be called from arbitrary (library user) code.
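    The two approaches can be sketched as follows, assuming GCC or Clang for
    the attribute syntax; `read_value` and `read_value_checked` are
    hypothetical names. Keeping them separate is deliberate: with nonnull in
    effect the compiler may assume the argument is never NULL, so a NULL check
    inside such a function is not reliable.

```c
#include <errno.h>
#include <stddef.h>

/* Contract style: tells GCC/Clang that argument 1 must not be NULL.
   The compiler can warn when a literal NULL is passed, and may also
   optimize on the assumption that a != NULL. */
__attribute__((nonnull(1)))
int read_value(const int *a)
{
    return *a;
}

/* Library-facing style: check the argument at the start of the body
   and report an error code instead of trusting the caller. */
int read_value_checked(const int *a, int *out)
{
    if (a == NULL || out == NULL)
        return -EINVAL;
    *out = *a;
    return 0;
}
```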

    Furthermore, to provide hints to the compiler, you can always write
    something like:

    if (a == NULL) __builtin_unreachable();

    Commonly, one instruments that as:

    #define ASSUME(cond) \
    do { \
    if (__builtin_expect(!(cond), 0)) \
    __builtin_unreachable(); \
    } while (0)
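    A self-contained sketch of such a macro in use (GCC/Clang builtins only).
    The unreachable branch must be taken only when `cond` is false, so the
    macro tests `!(cond)` and marks that path as unlikely. `deref_or_zero` is
    a hypothetical example, not from the post:

```c
#include <stddef.h>

/* Tell the compiler that cond always holds.  If cond were ever false at
   run time, the program would reach __builtin_unreachable(), which is
   undefined behavior, so only assert things that really are true. */
#define ASSUME(cond)                          \
    do {                                      \
        if (__builtin_expect(!(cond), 0))     \
            __builtin_unreachable();          \
    } while (0)

/* Because of ASSUME(p != NULL), the compiler may drop the NULL branch
   of the conditional below and emit a plain load. */
int deref_or_zero(const int *p)
{
    ASSUME(p != NULL);
    return p != NULL ? *p : 0;
}
```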


    Maybe my previous post was not clear enough: It's not a general UB
    detector that I'd like to have integrated into the compiler (there are
    static checker tools available that can nearly perfectly do that);
    instead, I'd like to get a warning when the compiler does something
    other than you would expect when reading the code in a "do what I mean"
    manner.

    --
    Bernd Linsel

  • From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 11:18:15 2024
    Bernd Linsel <[email protected]> schrieb:
    On 31.08.24 11:24, Thomas Koenig wrote:
    Bernd Linsel <[email protected]> schrieb:
    The clang/gcc maintainers' POV violates the first part of Postel's Law:

    Be liberal in what you accept, and conservative in what you send.

    Life would be a lot easier if they just provided a -WUB option that
    warns and explains *any* construct that the compiler may regard as UB.

    Maybe a bit more elaborate:

    #include <stdio.h>

    int main()
    {
    int i;
    scanf("%d", &i);
    return 0;
    }

    Should this be warned about?

    [corrected sscanf -> scanf]
    Why? This "program" is meant to read one line, presumably
    containing an integer number, from stdin and ignore it. No UB anywhere.

    What happens on overflow on input? That's undefined behavior, IIRC.
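    For comparison, a sketch of reading an int where out-of-range input is
    reported instead of being undefined: `strtol` sets `errno` to `ERANGE` when
    the value does not fit in a `long`, and a separate comparison catches
    values that fit in a `long` but not in an `int`. The helper name
    `parse_int` and its 1/0 return convention are assumptions for
    illustration:

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a decimal int from s.  Returns 1 on success (value in *out),
   0 on failure.  Leading whitespace is skipped by strtol; trailing
   whitespace such as the newline left by fgets is accepted. */
int parse_int(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)
        return 0;                            /* no digits at all */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;                            /* out of range: reported, not UB */
    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0')
        return 0;                            /* trailing garbage */
    *out = (int)v;
    return 1;
}
```

    In a program this would typically be fed a line read with
    `fgets(buf, sizeof buf, stdin)`.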

  • From Bernd Linsel@21:1/5 to Thomas Koenig on Sat Aug 31 13:58:46 2024
    On 31.08.24 13:26, Thomas Koenig wrote:
    So, sorry for the too-quick examples earlier...

    What about

    int foo (int a)
    {
    return a + 1;
    }

    or

    int foo(int *a)
    {
    return *a;
    }

    Both can exhibit undefined behavior, and for both it
    is impossible for the compiler to tell at compile-time.

    So the compiler should just compile both functions (gcc 12.2.0 with -O3
    does):

    $ gcc -Wall -Wextra -Wpedantic -O3 -xc -std=gnu11 -c - -o foo.o
    int foo(int a)
    {
    return a + 1;
    }

    int bar(int *a)
    {
    return *a;
    }
    ^D

    $ objdump -d foo.o

    foo.o: file format elf64-x86-64


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 8d 47 01 lea 0x1(%rdi),%eax
    3: c3 ret
    4: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
    b: 00 00 00 00
    f: 90 nop

    0000000000000010 <bar>:
    10: 8b 07 mov (%rdi),%eax
    12: c3 ret

    All as expected.

    What I don't want is that the compiler makes assumptions, concludes UB,
    feels entitled to compile whatever it wants and deliver rubbish without
    telling about it.

    --
    Bernd Linsel

  • From Thomas Koenig@21:1/5 to All on Sat Aug 31 11:26:58 2024
    So, sorry for the too-quick examples earlier...

    What about

    int foo (int a)
    {
    return a + 1;
    }

    or

    int foo(int *a)
    {
    return *a;
    }

    Both can exhibit undefined behavior, and for both it
    is impossible for the compiler to tell at compile-time.
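    If the caller should get an error flag rather than undefined behavior, the
    checks have to be written out explicitly. A minimal sketch with
    hypothetical names; note that the NULL test catches only NULL, since a
    dangling or misaligned pointer still cannot be detected this way:

```c
#include <limits.h>
#include <stddef.h>

/* Checked a + 1: returns 1 and stores the sum, or returns 0 when the
   addition would overflow int. */
int foo_checked(int a, int *result)
{
    if (a == INT_MAX)
        return 0;
    *result = a + 1;
    return 1;
}

/* Checked *a: returns 0 for a NULL pointer instead of dereferencing it. */
int bar_checked(const int *a, int *result)
{
    if (a == NULL)
        return 0;
    *result = *a;
    return 1;
}
```

    GCC and Clang also offer `__builtin_add_overflow(a, 1, &r)`, which
    generalizes the first check to arbitrary operands.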

  • From David Brown@21:1/5 to Stefan Monnier on Sat Aug 31 15:55:56 2024
    On 30/08/2024 20:34, Stefan Monnier wrote:
    If you want to write reliable code that can be distributed as source and
    compiled by any conforming C/C++ compiler, you need to be very sure that you
    avoid relying on behaviour that is not specified and documented. You need to
    write correct code. That means if you want to copy some memory with
    overlapping source and destination arrays, you use "memmove" - the function
    for that purpose. You don't use "memcpy", since it is specified explicitly
    as requiring non-overlapping arrays.
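    A typical overlapping case is deleting an element by shifting the tail of
    an array down one slot. A minimal sketch; `remove_at` is a hypothetical
    helper:

```c
#include <stddef.h>
#include <string.h>

/* Remove element idx from an int array of length n and return the new
   length.  Source and destination overlap, so memmove is required;
   memcpy here would be undefined behavior. */
size_t remove_at(int *arr, size_t n, size_t idx)
{
    if (idx >= n)
        return n;                /* nothing to remove */
    memmove(&arr[idx], &arr[idx + 1], (n - idx - 1) * sizeof arr[0]);
    return n - 1;
}
```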

    The difficulty here is that the tools provide very little help for that, because all too often it's virtually impossible for the tools to
    understand that this particular code can/will hit UB.


    Yes, that is true. And in such cases there is no way for a compiler to "optimise on the assumption of no UB", since it does not know that there
    will be, or could be, UB. So Anton has nothing to fear there. Bernd,
    on the other hand, might be disappointed - there is also no way for the compiler to warn that the code might have errors or UB.

    So it's all up to the programmer, who often doesn't know either.
    Other than using CompCert, I don't know of any reliable way for
    a programmer to make sure his C code does not suffer from UB.


    There is no foolproof or complete method for C. There are other
    languages for which formal methods can come closer to proving the
    correctness of the code, but for most practical cases this is infeasible.

    The best you can do, as a programmer, is to learn the language as well
    as you can, write code carefully, and use whatever help you can get that
    is within budget - including linter tools, code reviews, test setups,
    and so on. You can come a long way using good free tools such as gcc
    and clang, including their extensive compiler warnings and their
    sanitizers for run-time checking and testing.

    No one claims that writing good, working code is easy.

  • From David Brown@21:1/5 to John Dallman on Sat Aug 31 16:00:35 2024
    On 30/08/2024 22:03, John Dallman wrote:
    In article <vasruo$id3b$[email protected]>, [email protected] (David Brown) wrote:

    On 30/08/2024 17:42, John Dallman wrote:
    Plain old compiler bugs, introduced while fixing other ones, are
    quite enough to make me assume that I'll find problems on each
    change of compiler.

    I always keep old versions of compilers around, and don't change
    compilers (or libraries) in the middle of a project.

    I always have at least a couple of machines at the previous build
    standard of any platform, often more machines and/or older build
    standards.

    Changing compilers or libraries is done at new major releases.

    If you want to write software that is "correct because it passed
    its tests", you can only expect it to be reliable when it is run
    exactly as tested. That means it must be compiled as it was during
    tests (same compiler, same options, same library), and arguably
    even run only on the same hardware (if you only test on one
    particular cpu, OS, etc., you can only be sure it works on that
    cpu, OS, etc.).

    This is simpler when you produce closed-source binary software. We only
    ship builds we've tested. That means the /same binaries/ as we tested,
    not rebuilt or modified. This requires a separate test harness, rather
    than testing code compiled into the binaries.


    It is indeed simpler when you produce binaries. (We make embedded
    systems - for many products, we have full control of the software and
    the hardware, which makes it a lot easier to have a consistent test environment.)

    We test on a wide variety of hardware for the most-used platforms, by
    putting it into the distributed testing pools and always knowing which machine an individual test case ran on, because it's recorded in the test results.

    That's why a lot of pre-compiled commercial software gives
    particular versions of particular OS's or Linux distributions in
    their lists of requirements - even though the software would
    probably work fine on a much wider range.

    We specify what we specifically support, because we've tested that, plus
    the much broader requirements that it should work on. For Linux those are
    a GCC runtimes version (currently 8.x) or later and a glibc version (currently 2.28) or later. We don't seem to have problems with
    compatibility since we understood how the compatibility works with those libraries, and started doing it that way.


    That is a good compromise.

    If there's a problem on a specifically supported Linux, we'll fix it
    unless that's impossible. If there's a problem on one where it should
    work, we'll investigate it, and fix it if we can, which may cause a distribution to be added to the specifically supported list. If we can't
    fix a problem, we'll explain why not, and normally add the problem to the documentation. We can't do miracles, but we do pretty well.

    Yes, doing good support is expensive, but it pays off in customer loyalty, which means money.


    Agreed. For a lot of businesses, customer loyalty comes not from making working products (lots of people can do that), but how you handle things
    when something goes wrong.

  • From MitchAlsup1@21:1/5 to All on Sat Aug 31 14:33:47 2024
    On Sat, 31 Aug 2024 2:04:23 +0000, MitchAlsup1 wrote:

    On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:

    For example::

    You CAN buy an Ultima GTR, an engine and a transmission, and assemble
    a sports car you can register as a street car in your state.

    You CANNOT form a company to buy and assemble 1,000 of those and
    sell them to the general public.

    The former is hobby level, the latter is industrial grade.

    The difference is standards and regulations and expectations::
    emission regulations
    crash structure regulations
    pedestrian impact regulations
    lighting standards
    licensing criteria
    infotainment system
    air conditioning
    ..

    ALL of which can be ignored for a hobby, none of which can
    be ignored for industrial grade.

  • From Anton Ertl@21:1/5 to Bernd Linsel on Sat Aug 31 15:03:47 2024
    Bernd Linsel <[email protected]> writes:
    Maybe my previous post was not clear enough: It's not a general UB
    detector that I'd like to have integrated into the compiler (there are
    static checker tools available that can nearly perfectly do that);

    Undefined behaviour is something that is exercised at run-time.
    That's why the "undefined behaviour sanitizers" insert run-time
    checks. And of course they only detect the behaviour when it is
    actually exercised. I.e., they usually will not detect overflowable
    buffers, because your usual test inputs don't exercise those.

    What do you mean by the static checker tools you mention?

    instead, I'd like to get a warning when the compiler does something
    other than you would expect when reading the code in a "do what I mean"
    manner.

    Of course the fans of compilers that do what nobody means found a counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean. So one way to specify what I guess you
    mean with 'read the code in a "do what I mean" manner' is the
    behaviour that the compiler exhibits without "knowledge" coming
    from the assumption that there is no undefined behaviour in the
    program. For a longer discussion read <https://www.complang.tuwien.ac.at/papers/ertl17kps.pdf>.

    And yes, compilers could actually produce information about
    differences between such a compilation and a compilation where the
    compiler assumes that undefined behaviour does not happen.

    One way to use such information is if you then intend to run the
    compiler in "Assume That Undefined Behaviour Does Not Happen" mode for production code: check *every* case where the resulting code behaves differently. If the behaviour of the ATUBDNH compiler is not
    according to your intentions, change the source code to avoid
    undefined behaviour in such cases, forcing the ATUBDNH compiler to
    behave as you intend. If the behaviour of the ATUBDNH compiler is as
    you intended, you can keep the source code as-is (but then you get the
    same warning the next time 'round). Or you can change the source code
    in a way that results in the compiler not needing to ATUBDNH in order
    to produce the code you would like (see below for examples).

    Another way to use such information is if you intend to run the
    compiler in don't-ATUBDNH mode for production code. In that case you
    only need to look at a few cases: those occurring in the
    most-frequently executed code. Again, for each difference there are
    two cases: If your intention is only reflected in the don't-ATUBDNH
    code, you don't have to do anything, or change the code such that the
    warning goes away in the future (without changing the generated code). If your
    intention is also covered by the ATUBDNH case, you can change the code
    to actually perform the optimization also in the don't-ATUBDNH
    compiler.

    Here are examples: Wang et al. [Section 3.3 of wang+12], found that in
    all of SPECint 2006 there were only two places where the ATUBDNH made
    a measurable difference to performance. These were two inner loops.

    In one case the code is

    int k;
    int *ic, *is;
    ...
    for (k = 1; k <= M; k++) {
    ...
    ic[k] += is[k];
    ...
    }

    and the don't-ATUBDNH variant has a sign extension after the "k++"
    that the ATUBDNH variant does not have. Wang et al. suggest changing the type
    of k to size_t to avoid this sign-extension operation. After that
    change ATUBDNH makes no difference to this loop.
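    A compilable sketch of that first loop with `k` widened to `size_t`. The
    array names follow the quoted fragment; the function wrapper and the
    parameter `m` standing in for `M` are assumptions. Whether the sign
    extension actually disappears depends on the target and compiler, so
    check the generated assembly (e.g. `gcc -O2 -S`) rather than taking it on
    faith:

```c
#include <stddef.h>

/* Indices 1..m are used, as in the original loop, so both arrays need
   at least m + 1 elements.  With a size_t induction variable there is
   no int-to-64-bit sign extension needed for the address computation
   on LP64 targets. */
void accumulate(int *ic, const int *is, size_t m)
{
    for (size_t k = 1; k <= m; k++)
        ic[k] += is[k];
}
```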

    The other loop is

    quantum_reg *reg;
    ...
    // reg->size: int
    // reg->node[i].state: unsigned long long
    for (i = 0; i < reg->size; i++)
    reg->node[i].state = ...;

    Here ATUBDNH pulls the load of reg->size out of the loop (it assumes
    that reg->size does not alias with reg->node[i].state). Wang et
    al. solved that by assigning reg->size to a variable outside the loop,
    i.e., something like:

    quantum_reg *reg;
    ...
    long reg_size = reg->size;
    for (i = 0; i < reg_size; i++)
    reg->node[i].state = ...;

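    A self-contained version of the hoisted loop, for experimenting with the
    generated code. The struct layout and the stored `value` are hypothetical
    stand-ins for the elided parts of the quoted fragment:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the types in the quoted fragment. */
struct node { unsigned long long state; };

typedef struct {
    int size;
    struct node *node;
} quantum_reg;

/* reg->size is read once into a local, so the loop bound no longer
   depends on whether the stores to reg->node[i].state could alias
   reg->size; no type-based aliasing assumption is needed. */
void set_states(quantum_reg *reg, unsigned long long value)
{
    long reg_size = reg->size;
    for (long i = 0; i < reg_size; i++)
        reg->node[i].state = value;
}
```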

    But once we are at that, why stop at optimizations suggestions coming
    from ATUBDNH. E.g., consider a loop similar to the second loop:

    quantum_reg *reg;
    ...
    // reg->size: int
    // reg->node[i].state: int <==== HERE'S THE DIFFERENCE
    for (i = 0; i < reg->size; i++)
    reg->node[i].state = ...;

    In this case ATUBDNH would not allow pulling reg->size out of the
    loop, yet you don't intend to ever alias reg->size with
    reg->node[i].state. A compiler could actually guess your intention,
    and suggest that you may want to pull reg->size out (plus also mention
    the caveats about possible aliasing).

    So once we are there, we no longer need ATUBDNH, we just need
    don't-ATUBDNH and a compiler option that produces manual-optimization suggestions, ordered by the expected payoff (probably it's a good idea
    to use profile data for this ordering).

    I personally try to turn GCC into don't-ATUBDNH as far as possible
    with options like "-fno-delete-null-pointer-checks
    -fno-strict-aliasing -fno-strict-overflow".

    @InProceedings{wang+12,
    author = {Xi Wang and Haogang Chen and Alvin Cheung and Zhihao Jia and Nickolai Zeldovich and M. Frans Kaashoek},
    title = {Undefined Behavior: What Happened to My Code?},
    booktitle = {Asia-Pacific Workshop on Systems (APSYS'12)},
    OPTpages = {},
    year = {2012},
    url1 = {http://homes.cs.washington.edu/~akcheung/getFile.php?file=apsys12.pdf},
    url2 = {http://people.csail.mit.edu/nickolai/papers/wang-undef-2012-08-21.pdf},
    OPTannote = {}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Aug 31 17:10:29 2024
    Thomas Koenig <[email protected]> writes:
    Definitely - if you have 512 GB DDR5 memory in your workstation, the
    cost of the CPU itself is a relatively small fraction.

    Reality check:
    EUR
    2400 =8*300 8*64GB MTC40F2046S1RC48BA1R Micron RDIMM 64GB, DDR5-4800
    9300 AMD Ryzen Threadripper PRO 7995WX 96C boxed

    The Intel side is a little cheaper, but also offers fewer cores:

    4100 Intel Xeon w9-3475X, 36C boxed
    6800 Intel Xeon w9-3495X, 56C tray

    In any case, all three CPUs are significantly more expensive than
    512GB of RAM.
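    Putting the quoted list prices side by side (EUR, figures as given above,
    nothing else assumed), each CPU costs roughly this multiple of the 512 GB
    of RAM:

```latex
\begin{aligned}
\text{RAM: } 8 \times 300 &= 2400 \\
\text{Threadripper PRO 7995WX: } 9300 / 2400 &\approx 3.9 \\
\text{Xeon w9-3495X: } 6800 / 2400 &\approx 2.8 \\
\text{Xeon w9-3475X: } 4100 / 2400 &\approx 1.7
\end{aligned}
```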

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Aug 31 18:55:25 2024
    Anton Ertl <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Definitely - if you have 512 GB DDR5 memory in your workstation, the
    cost of the CPU itself is a relatively small fraction.

    Reality check:
    EUR
    2400 =8*300 8*64GB MTC40F2046S1RC48BA1R Micron RDIMM 64GB, DDR5-4800
    9300 AMD Ryzen Threadripper PRO 7995WX 96C boxed

    The Intel side is a little cheaper, but also offers fewer cores:

    4100 Intel Xeon w9-3475X, 36C boxed
    6800 Intel Xeon w9-3495X, 56C tray

    In any case, all three CPUs are significantly more expensive than
    512GB of RAM.

    Let's just say those prices are not representative of what I have
    in my workstation. First, the CPUs are different, and second,
    the deals that a large corporation gets on hardware can be quite
    surprising to somebody who is not familiar with them.

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Aug 31 19:08:33 2024
    Anton Ertl <[email protected]> schrieb:

    Of course the fans of compilers that do what nobody means found a counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

  • From Bernd Linsel@21:1/5 to Thomas Koenig on Sat Aug 31 23:01:54 2024
    On 31.08.24 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> schrieb:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by meeting requirements such as vessel wall thickness, geometry etc.,
    while compiler writers are free in their search for optimization tricks
    that let them shine at SPEC benchmarks.

    I personally write most code as in the days I learned C, where compilers
    were literally too dumb to remember what they did 2 source lines ago,
    so you could not rely on the compiler doing the "right thing" -- same as nowadays, but because of other reasons.

    So the things that Anton mentioned -- using size_t (or suitable other
    unsigned types) for iteration variables, pulling invariants out of
    loops, and many more common optimizations -- can still be found in my
    source codes.

    PS: I find -fno-strict-overflow and -fno-strict-aliasing of value, too,
    while I found that -fdelete-null-pointer-checks together with -Wnull-dereference has some utility.

    --
    Bernd Linsel

  • From MitchAlsup1@21:1/5 to Bernd Linsel on Sat Aug 31 21:14:53 2024
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

    On 31.08.24 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> schrieb:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by meeting requirements such as vessel wall thickness, geometry etc., while compiler writers are free in their search for optimization tricks
    that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code--
    If there were perhaps we would have better code today...

    I personally write most code as in the days I learned C, where compilers were literally too dumb to remember what they did 2 source lines ago,
    so you could not rely on the compiler doing the "right thing" -- same as nowadays, but because of other reasons.

    I do too.

    So the things that Anton mentioned -- using size_t (or suitable other unsigned types) for iteration variables, pulling invariants out of
    loops, and many more common optimizations -- can still be found in my
    source codes.

    The modern change is that "int" is no longer the fastest integral
    type {which it was guaranteed to be in the days I learned C}.

    PS: I find -fno-strict-overflow and -fno-strict-aliasing of value, too,
    while I found that -fdelete-null-pointer-checks together with -Wnull-dereference has some utility.

  • From MitchAlsup1@21:1/5 to BGB on Sat Aug 31 21:25:16 2024
    On Sat, 31 Aug 2024 20:56:56 +0000, BGB wrote:

    On 8/30/2024 7:11 PM, Paul A. Clayton wrote:
    On 8/28/24 11:36 PM, BGB wrote:
    On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
    [snip]
    My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
    Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.


    It likely isn't going to happen because a 1-wide machine isn't going
    to have the needed register ports.

    For an in-order implementation, banking could be used for saving
    a contiguous range of registers with no bank conflicts.

    Mitch Alsup chose to provide four read/write ports with the
    typical use being three read, one write instructions. This not
    only facilitates faster register save/restore for function calls
    (and context switches/interrupts) but presents the opportunity of
    limited dual issue ("CoIssue").


    I was mostly doing dual-issue with a 4R2W design.

    Initially, 6R3W won out mostly because 4R2W disallows an indexed store
    to be run in parallel with another op; but 6R3W did allow this. This
    scenario made enough of a difference to seemingly justify the added cost
    of a 3-wide design with a 3rd lane that goes mostly unused (and is
    mostly limited to register MOV's and basic ALU ops and similar).


    But, then this leads to an annoyance:
    As is, I will need to generate different code for 1W, 2W, and 3W configurations;
    It is starting to become tempting to generate code resembling that for
    the 1W case (albeit still using the shuffling that would be used when bundling), and then rely on superscalar (since, it turns out, it is not
    quite as expensive as I had thought).

    You are falling for the VLIW thought train trap...

    With superscalar, I wouldn't have the issue of 2W and 3W cores having
    issues running code built for the other.

    Such is the advantage of configurable register file ports.

    Also, on both 2W and 3W configurations, I can have a 128-bit MOV.X (load/store pair) instruction, so if one assumes 2-wide as the minimum,
    this instruction can be safely assumed to exist.

    VLIW trap again.

    ENTER and EXIT have no such trap as they are not tied to the number of
    file ports in any given implementation. They work even when the file
    is not configurable and especially when it is. Different timing, though,
    because RF configuration determines throughput (as it does OH SO often).

    I can mostly ignore 1-wide scenarios (2R1W and 3R1W), mostly as I have
    ended up mostly deciding to relegate these to RISC-V.

    Tisc..

    By the time I have stripped down BJX2 enough to fit into a small FPGA,
    it essentially has almost nothing to offer that RV wouldn't offer
    already (and it makes more practical sense to use something like RV32IM
    or similar).



    I am not sure how one would efficiently pull off a 4W write operation.



    Can note that generally, the GPR part of the register file can be built
    with LUTRAMs, which on Xilinx chips have the property:
    1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
    1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.


    This means that the number of LUTRAMs needed for NxM with G registers can be calculated:
    2R1W, 32, Cost=44
    3R1W, 32, Cost=66
    4R2W, 32, Cost=176
    6R3W, 32, Cost=396
    4R4W, 32, Cost=352
    6R4W, 32, Cost=528

    2R1W, 64, Cost=64
    3R1W, 64, Cost=96
    4R2W, 64, Cost=256
    6R3W, 64, Cost=576
    4R4W, 64, Cost=512
    6R4W, 64, Cost=768

    10R5W, 64, Cost=1600

    An accurate but slight underestimate.
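    The figures in the table appear to follow a simple rule. Each write port gets its own bank, and each bank is copied once per read port. Each copy needs ceil(64 / data bits) LUTRAMs for a 64-bit register, which gives 22 with 32 registers and 32 with 64 registers. That rule is inferred from the numbers; the poster does not state it. The Python sketch below assumes 64-bit registers and reproduces the table:

```python
import math

# Data bits per LUTRAM for a given register count (Xilinx figures quoted above):
#   32 registers -> 5-bit address, 3-bit data
#   64 registers -> 6-bit address, 2-bit data
LUTRAM_DATA_BITS = {32: 3, 64: 2}


def lutram_cost(reads: int, writes: int, regs: int, reg_width: int = 64) -> int:
    """Estimate LUTRAMs for a register file with `reads` R and `writes` W ports.

    Assumption: one bank per write port, each bank duplicated once per read
    port, 64-bit registers by default. MUXes and bank-select bits are ignored.
    """
    per_copy = math.ceil(reg_width / LUTRAM_DATA_BITS[regs])
    return reads * writes * per_copy


if __name__ == "__main__":
    for regs in (32, 64):
        for r, w in ((2, 1), (3, 1), (4, 2), (6, 3), (4, 4), (6, 4)):
            print(f"{r}R{w}W, {regs}, Cost={lutram_cost(r, w, regs)}")
    print(f"10R5W, 64, Cost={lutram_cost(10, 5, 64)}")
```

    The sketch counts only the storage LUTRAMs. The read-side MUXes and the per-register bank-select bits come on top of that.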


    I am not sure about ASIC.

    Depends on who implemented the SRAM and RF technology.

    For FPGA, pretty sure that bidirectional ports would gain little or
    nothing over fixed-direction ports (since bidirectional IO is not a
    thing, and the internal logic is almost entirely different between a
    read and write port).

    It is even easier when you have access to individual transistors
    and wires...

  • From Thomas Koenig@21:1/5 to Bernd Linsel on Sat Aug 31 21:42:31 2024
    Bernd Linsel <[email protected]> wrote:
    On 31.08.24 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> wrote:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    You are comparing apples and oranges. Technical specifications for your
    pressure vessel result from the physical properties of the chosen
    material, taking into account requirements such as vessel wall thickness,
    geometry, etc., while compiler writers are free to search for optimization
    tricks that let them shine at SPEC benchmarks.

    A specification is a specification, but it seems you do not grasp
    the concept. It seems a curious mental gap in some people who
    think that it means fundamentally different things in different fields.

    But if you insist on putting some extra constraints on compiler
    writers, apart from the official standards, feel free to write them
    down (but please in a concise manner) and try to get them accepted,
    preferably by the relevant standards committees. But you should know
    that writing a specification that is unambiguous and clear is
    hard work, and needs a lot of discussion and reviews.

    Or fork either gcc or LLVM (or both) and implement whatever
    restrictions you want, and if you can convince the maintainers
    of these compilers that it is a good idea to fold in your changes,
    they may do so.

    If you can make your case to enough people (or companies),
    then you will find enough volunteers and/or funding to do so.
    Snide remarks about compiler writers on comp.arch aren't going
    to have any meaningful impact, I'm afraid; if anything, they will
    lower your chance of success.

    But of course that depends on your definition of success - do
    you want to achieve anything, or do you want to aggravate people?
    If it is the latter, then your chance of success might be a
    bit higher.

    I personally write most code as in the days I learned C, where compilers
    were literally too dumb to remember what they did 2 source lines ago,
    so you could not rely on the compiler doing the "right thing" -- same as
    nowadays, but because of other reasons.

    So you learned programming by ignoring the specifications that
    were available. Well, sometimes making progress means unlearning
    something.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Aug 31 23:37:28 2024
    On Sat, 31 Aug 2024 21:42:31 +0000, Thomas Koenig wrote:

    Bernd Linsel <[email protected]> wrote:
    On 31.08.24 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> wrote:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    You are comparing apples and oranges. Technical specifications for your
    pressure vessel result from the physical properties of the chosen
    material, taking into account requirements such as vessel wall thickness,
    geometry, etc., while compiler writers are free to search for optimization
    tricks that let them shine at SPEC benchmarks.

    A specification is a specification, but it seems you do not grasp
    the concept. It seems a curious mental gap in some people who
    think that it means fundamentally different things in different fields.

    But if you insist on putting some extra constraints on compiler
    writers, apart from the official standards, feel free to write them
    down (but please in a concise manner) and try to get them accepted,
    preferably by the relevant standards committees. But you should know
    that writing a specification that is unambiguous and clear is
    hard work, and needs a lot of discussion and reviews.

    Convincing the random code exercisers not to try the ATOMIC
    parts of the ISA is even harder.


    Or fork either gcc or LLVM (or both) and implement whatever
    restrictions you want, and if you can convince the maintainers
    of these compilers that it is a good idea to fold in your changes,
    they may do so.

    If you can make your case to enough people (or companies),
    then you will find enough volunteers and/or funding to do so.
    Snide remarks about compiler writers on comp.arch aren't going
    to have any meaningful impact, I'm afraid; if anything, they will
    lower your chance of success.

    But of course that depends on your definition of success - do
    you want to achieve anything, or do you want to aggravate people?
    If it is the latter, then your chance of success might be a
    bit higher.

    I personally write most code as in the days I learned C, where compilers
    were literally too dumb to remember what they did 2 source lines ago,
    so you could not rely on the compiler doing the "right thing" -- same as
    nowadays, but because of other reasons.

    So you learned programming by ignoring the specifications that
    were available. Well, sometimes making progress means unlearning
    something.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Aug 31 23:35:33 2024
    On Sat, 31 Aug 2024 21:42:31 +0000, Thomas Koenig wrote:

    Bernd Linsel <[email protected]> wrote:
    On 31.08.24 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> wrote:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    You are comparing apples and oranges. Technical specifications for your
    pressure vessel result from the physical properties of the chosen
    material, taking into account requirements such as vessel wall thickness,
    geometry, etc., while compiler writers are free to search for optimization
    tricks that let them shine at SPEC benchmarks.

    A specification is a specification, but it seems you do not grasp
    the concept. It seems a curious mental gap in some people who
    think that it means fundamentally different things in different fields.

    But if you insist on putting some extra constraints on compiler
    writers, apart from the official standards, feel free to write them
    down (but please in a concise manner) and try to get them accepted,
    preferably by the relevant standards committees. But you should know
    that writing a specification that is unambiguous and clear is
    hard work, and needs a lot of discussion and reviews.

    Convincing the random code exercisers to obey the "that is not
    an instruction" part of the specification is vastly harder.

    Or fork either gcc or LLVM (or both) and implement whatever
    restrictions you want, and if you can convince the maintainers
    of these compilers that it is a good idea to fold in your changes,
    they may do so.

    If you can make your case to enough people (or companies),
    then you will find enough volunteers and/or funding to do so.
    Snide remarks about compiler writers on comp.arch aren't going
    to have any meaningful impact, I'm afraid; if anything, they will
    lower your chance of success.

    But of course that depends on your definition of success - do
    you want to achieve anything, or do you want to aggravate people?
    If it is the latter, then your chance of success might be a
    bit higher.

    I personally write most code as in the days I learned C, where compilers
    were literally too dumb to remember what they did 2 source lines ago,
    so you could not rely on the compiler doing the "right thing" -- same as
    nowadays, but because of other reasons.

    So you learned programming by ignoring the specifications that
    were available. Well, sometimes making progress means unlearning
    something.

  • From Stephen Fuld@21:1/5 to All on Sat Aug 31 19:45:54 2024
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:

    On 31.08.24 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> wrote:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry.  If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways:  The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    You are comparing apples and oranges. Technical specifications for your
    pressure vessel result from the physical properties of the chosen
    material, taking into account requirements such as vessel wall thickness,
    geometry, etc., while compiler writers are free to search for optimization
    tricks that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service, due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code;
    if there were, perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written in
    C, versus code written in something like Ada or Rust. Backing away now...
    :-)


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Terje Mathisen@21:1/5 to All on Sun Sep 1 08:34:11 2024
    MitchAlsup1 wrote:
    On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:

    On 8/30/2024 1:11 PM, MitchAlsup1 wrote:
    On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:
    Integer Overflow

    Not usually a thing. Pretty much everything seems to treat integer
    overflow as silently wrapping.

    Ada wants these.
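    The Python sketch below shows the two behaviours side by side for 64-bit two's-complement addition. The first silently wraps, which BGB describes as the usual case. The second traps on overflow, which is closer to Ada, where an overflow raises Constraint_Error unless checks are suppressed. The function names are only illustrative. Python integers are unbounded, so the sketch emulates the 64-bit width.

```python
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def add_wrap64(a: int, b: int) -> int:
    """Two's-complement 64-bit add that silently wraps on overflow."""
    s = (a + b) & ((1 << 64) - 1)           # keep only the low 64 bits
    return s - (1 << 64) if s >> 63 else s  # reinterpret bit 63 as the sign


def add_trap64(a: int, b: int) -> int:
    """64-bit add that raises instead of wrapping (like a trapping ADD)."""
    s = a + b
    if not INT64_MIN <= s <= INT64_MAX:
        raise OverflowError(f"{a} + {b} overflows int64")
    return s


if __name__ == "__main__":
    print(add_wrap64(INT64_MAX, 1))  # wraps around to INT64_MIN
    try:
        add_trap64(INT64_MAX, 1)
    except OverflowError as exc:
        print("trapped:", exc)
```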



    Bad Instruction encoding--OpCode exists but not as this
    instruction uses it. Random code generation can use
    every instruction without privilege.

    Hit or miss.

    Will usually fault on invalid instructions.

    Must be 100% to guarantee upwards compatibility.

    There is logic in place to reject privileged instructions in user-mode,
    if the CPU is actually run in user-mode. Some of this is still TODO
    (currently, TestKern is still running everything in Supervisor Mode).

    Yes, it is a pain--but a pain that is absolutely worth it.


    The alternative is to treat them as UB, so they may be one of:
       Trap;
       Do something else (like, if an instruction was added);
       Do something wonky / unintended.

    In practice, this seems to be more how it works.

    Bad practice == not industrial quality.


    Bad address--address exists but you are not allowed to touch it
    with LD or ST instruction or to attempt to execute it.

    If the MMU is enabled, it should fault on bad memory accesses.

    In physical addressing mode, it does not trap.

    YOU FAIL TO UNDERSTAND--there is an area in memory where the
    preserved registers are stored--stored in a way that only 3
    instructions can access--and the PTE is marked RWE=000
    This prevents damaging the contract between callee and caller.
    Only 3 instructions can access these pages: ENTER, EXIT and RET;
    nothing else.


    IIRC, there was a mechanism on the bus to deal with accesses to bad
    physical addresses (returning all zeroes). Otherwise, trying to access
    an invalid address would cause the CPU to deadlock.

    It is NOT a BAD address--it is a good but inaccessible address
    outside those 3 instructions.



    As I understand it, you don't even get FMUL correctly rounded.
    To get it properly rounded you have to compute the full 53*53
    product.

    AFAICT, this wasn't required for the 1985 spec...

    You Cannot get rounding correct unless you "compute as if to
    infinite precision" and then follow the rules of rounding
    (all modes).

    This rule is in fact really simple:

    In all versions of the standard, from the very first up to the upcoming
    2029, the core instructions (FADD/FSUB/FMUL/FDIV/FSQRT) MUST result in
    the correctly rounded result, according to whatever the current rounding
    mode is/was.

    This does mean that you have to act as if you did the calculation to arbitrary/infinite precision, which really means "enough bits so that
    any following bits do not matter for the rounding result".

    It was a revelation to me when I wrote my first fp emulation code and
    grok'ed how having a single guard bit followed by a sticky bit was
    sufficient to do this for all rounding modes.

    At that point I only needed to maintain enough intermediate bits to
    guarantee I would still have those rounding bits after normalization.

    This doesn't mean that I could skip calculating all the bits of the full NxN->2N mantissa product, only that I didn't need to keep them all
    around after normalization.
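    The Python sketch below checks the guard-plus-sticky idea. It rounds the full 2N-bit product of two N-bit significands back to N bits. It uses only the first discarded bit (guard) and an OR of everything below it (sticky). It then compares the result against rounding the exact value, held as a `Fraction`.

    The sketch models only positive values and three modes: nearest-even, toward zero, and toward +infinity. It leaves out signs, exponent range and subnormals. The names `round_gs` and `round_exact` are hypothetical.

```python
from fractions import Fraction
import math


def round_gs(value: int, n: int, mode: str = "rne") -> tuple[int, int]:
    """Round positive integer `value` to n significant bits using guard+sticky.

    Returns (significand, shift) such that the result is significand * 2**shift.
    """
    shift = value.bit_length() - n
    if shift <= 0:                        # already fits: exact
        return value << -shift, shift
    kept = value >> shift                 # top n bits
    guard = (value >> (shift - 1)) & 1    # first discarded bit
    sticky = (value & ((1 << (shift - 1)) - 1)) != 0  # OR of all the rest
    if mode == "rne":                     # nearest, ties to even
        up = guard and (sticky or (kept & 1))
    elif mode == "rtz":                   # toward zero (truncate)
        up = False
    elif mode == "rup":                   # toward +infinity (positive values)
        up = guard or sticky
    else:
        raise ValueError(mode)
    if up:
        kept += 1
        if kept.bit_length() > n:         # carry out of the top bit
            kept >>= 1
            shift += 1
    return kept, shift


def round_exact(value: int, n: int, mode: str = "rne") -> tuple[int, int]:
    """Reference: round the exact rational value, as if to infinite precision."""
    shift = value.bit_length() - n
    if shift <= 0:
        return value << -shift, shift
    q = Fraction(value, 1 << shift)
    lo = math.floor(q)
    frac = q - lo
    if mode == "rtz":
        r = lo
    elif mode == "rup":
        r = lo + (frac > 0)
    else:
        half = Fraction(1, 2)
        r = lo + (frac > half or (frac == half and lo % 2 == 1))
    if r.bit_length() > n:
        r >>= 1
        shift += 1
    return r, shift


if __name__ == "__main__":
    n = 5
    for a in range(1 << (n - 1), 1 << n):   # all normalized 5-bit significands
        for b in range(1 << (n - 1), 1 << n):
            for mode in ("rne", "rtz", "rup"):
                assert round_gs(a * b, n, mode) == round_exact(a * b, n, mode)
    print("guard + sticky matches exact rounding for all 5-bit products")
```

    As Terje notes, the full product is still computed. Only the bits below the guard bit get collapsed into the single sticky bit.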

    With FMAC (single rounding, which is the interesting one) you can of
    course get catastrophic cancellation, so you need all the 2N mantissa
    bits of the multiplication plus the N bits from the addend. Then you
    either need a normalizer wide enough to take in any possible alignment
    of the two parts, or you must have separate logic for each of the major
    cases.
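    The Python example below shows why single-rounding FMA needs the full product. Take a = 1 + 2^-30 and b = 1 - 2^-30. The exact product is 1 - 2^-60, which rounds to 1.0 in double precision. A separately rounded multiply followed by adding -1 therefore cancels to 0.0. Rounding the exact a*b + c only once keeps the -2^-60.

    The fused result is emulated with `fractions.Fraction`, because converting a Fraction to float rounds correctly. `math.fma` only exists in Python 3.13 and later.

```python
from fractions import Fraction


def fma_exact(a: float, b: float, c: float) -> float:
    """a*b + c computed exactly, then rounded once (emulated single-rounding FMA)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = a * b + c        # product rounded to 1.0 first, then cancels
fused = fma_exact(a, b, c)  # all product bits kept until the final rounding

print(separate)              # 0.0
print(fused == -2.0**-60)    # True
```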

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From John Dallman@21:1/5 to Anton Ertl on Sun Sep 1 11:21:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Undefined behaviour is something that is exercised at run-time.
    That's why the "undefined behaviour sanitizers" insert run-time
    checks. And of course they only detect the behaviour when it is
    actually exercised. I.e., they usually will not detect overflowable
    buffers, because your usual test inputs don't exercise those.

    That's among the many reasons why there is no single way "to make code
    secure." For string buffers, you turn on the compiler run-time checks,
    and use the length-checking versions of string handling functions. Then
    you write tests to check both of those are actually working.

    Then you discover that the C++ string[] operator is not bounds-checked,
    as per the C++ standard, but string.at() is bounds-checked, and curse a
    bit.

    John

  • From David Brown@21:1/5 to John Dallman on Sun Sep 1 22:12:34 2024
    On 01/09/2024 12:21, John Dallman wrote:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Undefined behaviour is something that is exercised at run-time.
    That's why the "undefined behaviour sanitizers" insert run-time
    checks. And of course they only detect the behaviour when it is
    actually exercised. I.e., they usually will not detect overflowable
    buffers, because your usual test inputs don't exercise those.

    That's among the many reasons why there is no single way "to make code secure." For string buffers, you turn on the compiler run-time checks,
    and use the length-checking versions of string handling functions. Then
    you write tests to check both of those are actually working.

    Then you discover that the C++ string[] operator is not bounds-checked,
    as per the C++ standard, but string.at() is bounds-checked, and curse a
    bit.


    But surely you would discover that before using the std::string type? I
    might do some quick test code using "stuff copied off the internet", but
    for any serious programming I would want to read the specifications of a
    type or function before using it. That's the only way to be sure you
    are writing correct code.

  • From David Brown@21:1/5 to Thomas Koenig on Sun Sep 1 22:07:53 2024
    On 31/08/2024 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> wrote:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    That is very well put.

    Specifications are an agreement between the supplier and the client.
    The supplier promises particular functionality if the client stays
    within those specifications. It is how things work in a huge range of
    aspects of life. Sometimes there are agreements in place for what
    happens if the specifications are broken (fine if you fail to deliver as promised, jail sentence if you break the law, etc.), but these are
    really just extensions of the agreement and specification.

    If we think about computing, we can start with mathematics for examples.
    A mathematical function maps one set onto another - it specifies what
    value in the output set is produced from each value in the input set.
    It does not specify the result for values that are not in the input set,
    even if they are in a "reasonable" superset. So the real square root
    function specifies an output for all non-negative real numbers - it does
    not specify the result for negative real numbers. Attempting to find
    the square root of a negative number is undefined behaviour.

    Functions in computing are the same. You have a specification - a pre-condition, and a post-condition. The inputs (including the
    environment, if that is relevant) have to satisfy the pre-condition, and
    then the function guarantees that the post-condition will hold after the function call. Try to put anything else into the function without
    satisfying the pre-condition, and it's garbage in, garbage out. If you
    don't understand "garbage in, garbage out", you really don't understand
    the first thing about software development. This has been understood
    since the beginning of the programmable computer:

    """
    On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into
    the machine wrong figures, will the right answers come out?' I am not
    able rightly to apprehend the kind of confusion of ideas that could
    provoke such a question.
    """


    In the context of compilers, the specification is the language standard
    in use at the time, combined with the specifications of any library
    functions or other code being used. If you don't follow those
    specifications - your input code does not meet the pre-conditions, or
    the pre-conditions are not met when your code is run - you get undefined behaviour. There is no rational way to expect any particular result
    when the input is in essence meaningless.


    So if there is a function (or operator, or other feature) specified by
    the language or by library or function documentation, and you pass it
    something that is not documented as fulfilling the pre-conditions, it's
    garbage in, garbage out - your code is wrong. If your code makes
    assumptions about the workings of a function that are not specified in
    its post-condition, the code is wrong. It might work during testing,
    but it is not guaranteed to work. If you try to use a function outside
    its specifications, then your code is wrong.


    Of course it is not always easy to make sure everything is correct
    within specifications. Programming languages and libraries are
    complicated, and people make mistakes. And where practical, it can be
    good to take that into consideration - if it is possible to give error
    messages or help in the case of bad inputs, then that can be very
    helpful to people. But it doesn't make sense to try to give the "right"
    output for wrong input. And it doesn't make sense to do this to the significant detriment of efficiency with correct inputs.

    To compare this to specifications in other walks of life, imagine an electricity company. The specification they provide to you, the
    customer, has the pre-condition that you pay your bills. The
    post-condition is that you get electricity. If you break the
    specification - you stop paying your bills - it's perfectly reasonable
    that they cut off your electricity. But it is /nice/ if they first send
    you warning letters, and offers to re-arrange your debt. But if you are following the specifications and paying your bills, you would not want
    the electricity company to keep providing electricity to those who don't
    pay, because that would mean /you/ would have to pay more.

    In the same way, I want my compiler to warn about potential problems or
    undefined behaviour when it reasonably can, rather than jumping straight
    to nasal demons. But I don't want it to generate slower code than it
    otherwise could, just because some people might write incorrect code. I
    should not have to pay (in run-time efficiency losses) for other
    people's potential failure to follow specifications.

    But I am quite happy to have compiler options to control the balance and behaviour. Compilers generally do little optimisation without flags
    explicitly enabling them. And some compilers have flags to change the
    language specifications (such as making signed integer arithmetic wrap).
    There's not a lot they could do better to satisfy people who want the
    tools to conform to their imagined specification rather than the actual specifications.

    I suppose one thing they could do is that when a new compiler version
    comes out with new optimisations, they could have a flag that turns
    these off even if you have enabled others. Maybe you could have
    "-olimit=8" to say "limit optimisations to those in gcc 8". That might
    give fewer surprises to people who have got their code wrong.

  • From John Dallman@21:1/5 to David Brown on Sun Sep 1 21:43:00 2024
    In article <vb2hri$1jub9$[email protected]>, [email protected]
    (David Brown) wrote:
    On 01/09/2024 12:21, John Dallman wrote:
    Then you discover that the C++ string[] operator is not
    bounds-checked, as per the C++ standard, but string.at()
    is bounds-checked, and curse a bit.

    But surely you would discover that before using the std::string
    type? I might do some quick test code using "stuff copied off the
    internet", but for any serious programming I would want to read the specifications of a type or function before using it. That's the
    only way to be sure you are writing correct code.

    I didn't write that code, and I don't have the power to demand it be re-written. My group is somewhat pickier about correctness and security
    than the group who created it.

    John

  • From EricP@21:1/5 to EricP on Sun Sep 1 17:47:06 2024
    EricP wrote:
    BGB wrote:

    I am not sure how one would efficiently pull off a 4W write operation.



    Can note that generally, the GPR part of the register file can be
    built with LUTRAMs, which on Xilinx chips have the property:
    1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
    1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.


    This means that the number of LUTRAMs needed for NxM with G registers can
    be calculated:
    2R1W, 32, Cost=44
    3R1W, 32, Cost=66
    4R2W, 32, Cost=176
    6R3W, 32, Cost=396
    4R4W, 32, Cost=352
    6R4W, 32, Cost=528

    2R1W, 64, Cost=64
    3R1W, 64, Cost=96
    4R2W, 64, Cost=256
    6R3W, 64, Cost=576
    4R4W, 64, Cost=512
    6R4W, 64, Cost=768

    10R5W, 64, Cost=1600


    There is also the MUX logic and similar, but it should follow the same
    pattern.

    There is a bit-array (2b per register) to indicate which of the arrays
    holds each register. This ends up turning into FFs, but doesn't matter
    as much.

    In the Verilog, one can write it as if there were only 1 array per
    write port, with the duplication (for the read ports) handled
    transparently by the synthesis stage (convenient), although it still
    has a steep resource cost.
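    BGB describes one LUTRAM bank per write port, plus a small table (2 bits per register) recording which bank holds the latest value. That scheme is commonly called a live value table (LVT) in FPGA multi-port memory work.

    The Python sketch below is a behavioural model of a register file with that structure. The class name and sizes are hypothetical. The per-read-port duplication of each bank is not modelled, since it only replicates storage.

```python
class LvtRegisterFile:
    """Behavioural model: one bank per write port plus a live value table."""

    def __init__(self, num_regs: int = 64, write_ports: int = 3):
        # One storage bank per write port. In hardware each bank would also be
        # copied once per read port; the model does not need those copies.
        self.banks = [[0] * num_regs for _ in range(write_ports)]
        # Live value table: for each register, which bank was written last.
        # With up to 4 write ports this fits in 2 bits per register.
        self.lvt = [0] * num_regs

    def write(self, writes: list[tuple[int, int]]) -> None:
        """Apply one cycle of writes: a list of (register, value), one per port."""
        assert len(writes) <= len(self.banks)
        for port, (reg, value) in enumerate(writes):
            self.banks[port][reg] = value  # each port only touches its own bank
            self.lvt[reg] = port           # remember where the live value is

    def read(self, reg: int) -> int:
        """A read port selects the bank named by the LVT entry (the read MUX)."""
        return self.banks[self.lvt[reg]][reg]


rf = LvtRegisterFile(num_regs=64, write_ports=3)
rf.write([(5, 111), (6, 222)])         # ports 0 and 1 in the same cycle
rf.write([(7, 0), (9, 1), (5, 333)])   # port 2 now holds the live R5
print(rf.read(5), rf.read(6))          # 333 222
```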

    Since you are targeting 50 MHz, 20 ns per stage, and those LUTRAMs
    possibly run at 500 MHz, and assuming the read port numbers are
    ready at the start of the cycle, one might multi-pump the register
    file read port access and save a pile on read banks and muxes.

    For example, you could 4-pump the read port at 5 ns per read,
    the LUTRAM read access taking 2 ns and 3 ns for muxing and routing.
    That should divide your numbers above by more than 4 because some
    muxing becomes simpler too (fewer sources).

    You can't multi-pump the write access as the write port data usually
    isn't ready until the end of the cycle.

    Oh wait, the write-back data output from the MEM-LD stage is ready
    at the start of the WB cycle so you could multi-pump the write too.
    The normal forwarding logic would pick off if a read register number
    matches a write register number so you shouldn't have to worry about
    the order of reads and writes to the same register.

    That would cut the cost of multiple write ports way down.

  • From EricP@21:1/5 to BGB on Sun Sep 1 17:17:04 2024
    BGB wrote:

    I am not sure how one would efficiently pull off a 4W write operation.



    Can note that generally, the GPR part of the register file can be built
    with LUTRAMs, which on Xilinx chips have the property:
    1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
    1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.


    This means the number of LUTRAMs needed for NxM with G registers can be calculated as follows:
    2R1W, 32, Cost=44
    3R1W, 32, Cost=66
    4R2W, 32, Cost=176
    6R3W, 32, Cost=396
    4R4W, 32, Cost=352
    6R4W, 32, Cost=528

    2R1W, 64, Cost=64
    3R1W, 64, Cost=96
    4R2W, 64, Cost=256
    6R3W, 64, Cost=576
    4R4W, 64, Cost=512
    6R4W, 64, Cost=768

    10R5W, 64, Cost=1600
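    The rows above follow a simple rule, assuming 64-bit registers (which
    is what the numbers imply: ceil(64/3) = 22 LUTRAMs per copy for 32
    registers, 64/2 = 32 for 64 registers): one bank per write port, each
    replicated once per read port. A quick C sketch (the function names
    are made up for illustration) that reproduces them:

```c
/* Bits per LUTRAM primitive for the two Xilinx configurations quoted above:
   32 entries x 3 bits (5-bit address) or 64 entries x 2 bits (6-bit address).
   Register counts above 64 are not modeled here. */
static unsigned bits_per_lutram(unsigned nregs) {
    return nregs <= 32 ? 3u : 2u;
}

/* One bank per write port, each bank replicated once per read port:
   Cost = R * W * ceil(width / bits_per_lutram). */
static unsigned lutram_cost(unsigned rports, unsigned wports,
                            unsigned nregs, unsigned width) {
    unsigned bits = bits_per_lutram(nregs);
    unsigned per_copy = (width + bits - 1) / bits;   /* ceiling division */
    return rports * wports * per_copy;
}
```

    For example, lutram_cost(6, 3, 32, 64) gives 396 and
    lutram_cost(10, 5, 64, 64) gives 1600, matching the table.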


    There is also the MUX logic and similar, but it should follow the same
    pattern.

    There is a bit-array (2b per register) to indicate which of the arrays
    holds each register. This ends up turning into FFs, but doesn't matter
    as much.
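    The 2-bit-per-register table is what the FPGA literature usually calls
    a live value table (LVT). A minimal C behavioral model, assuming at
    most four write ports and leaving out the per-read-port replication
    (which affects cost, not behavior); mw_regfile_t, mw_write and mw_read
    are hypothetical names:

```c
#include <stdint.h>

#define NREGS  64
#define NWRITE 4      /* 2-bit table entries can name at most 4 write banks */

typedef struct {
    uint64_t bank[NWRITE][NREGS];  /* one bank per write port */
    uint8_t  lvt[NREGS];           /* which bank holds the live copy (0..3) */
} mw_regfile_t;

/* Each write port writes only its own bank and records itself in the LVT.
   Assumes no two ports write the same register in the same cycle. */
static void mw_write(mw_regfile_t *rf, unsigned port, unsigned reg, uint64_t val) {
    rf->bank[port][reg] = val;
    rf->lvt[reg] = (uint8_t)port;
}

/* Each read port selects among the banks, using the LVT entry as mux select. */
static uint64_t mw_read(const mw_regfile_t *rf, unsigned reg) {
    return rf->bank[rf->lvt[reg]][reg];
}
```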

    In the Verilog, one can write it as if there were only 1 array per
    write port, with the duplication (for the read ports) handled
    transparently by the synthesis stage (convenient), although it still
    has a steep resource cost.

    Since you are targeting 50 MHz, 20 ns per stage, and those LUTRAMs
    possibly run at 500 MHz, and assuming the read port numbers are
    ready at the start of the cycle, one might multi-pump the register
    file read port access and save a pile on read banks and muxes.

    For example, you could 4-pump the read port at 5 ns per read,
    the LUTRAM read access taking 2 ns and 3 ns for muxing and routing.
    That should divide your numbers above by more than 4 because some
    muxing becomes simpler too (fewer sources).
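    Rough arithmetic for the multi-pumping suggestion, as a C sketch
    (hypothetical helper names): with a pump factor P, each physical read
    copy serves P logical read ports per core cycle, so only ceil(R/P)
    copies per write bank are needed, and the time budget per pumped
    access is the clock period divided by P.

```c
/* With pump factor P, one physical read copy serves P logical read ports
   per core cycle, so only ceil(R / P) copies per write bank are needed. */
static unsigned pumped_read_copies(unsigned rports, unsigned pump) {
    return (rports + pump - 1) / pump;
}

/* LUTRAM count: read copies x write banks x LUTRAMs per copy
   (per_copy = 32 for 64 x 64-bit registers in the table above). */
static unsigned pumped_cost(unsigned rports, unsigned wports,
                            unsigned per_copy, unsigned pump) {
    return pumped_read_copies(rports, pump) * wports * per_copy;
}

/* Time available per pumped access: the core clock period divided by P. */
static double subcycle_ns(double core_mhz, unsigned pump) {
    return 1000.0 / core_mhz / pump;
}
```

    This counts LUTRAMs only; the extra mux savings mentioned above are not
    modeled, and when R is not a multiple of P the saving is less than a
    factor of P (6 read ports at P = 4 still need 2 copies, so 6R3W drops
    from 576 to 192).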

    You can't multi-pump the write access as the write port data usually
    isn't ready until the end of the cycle.

  • From MitchAlsup1@21:1/5 to BGB on Sun Sep 1 23:32:47 2024
    On Sun, 1 Sep 2024 21:21:38 +0000, BGB wrote:

    On 9/1/2024 1:34 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:

    It was a revelation to me when I wrote my first fp emulation code and
    grok'ed how having a single guard bit followed by a sticky bit was
    sufficient to do this for all rounding modes.

    At that point I only needed to maintain enough intermediate bits to
    guarantee I would still have those rounding bits after normalization.

    This doesn't mean that I could skip calculating all the bits of the full
    NxN->2N mantissa product, only that I didn't need to keep them all
    around after normalization.
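    A C sketch of the guard/sticky idea (a generic illustration, not
    Terje's emulation code; round_shift is a hypothetical name): after
    shifting the wide intermediate right by k bits, the first bit shifted
    out is the guard bit and the OR of everything below it is the sticky
    bit. Those two plus the result LSB decide rounding for all four IEEE
    754-1985 rounding modes, with the value held in sign-magnitude form as
    in IEEE formats.

```c
#include <stdint.h>

/* The four IEEE 754-1985 rounding directions. */
typedef enum { RNE, RTZ, RUP, RDN } rmode_t;  /* nearest-even, toward 0, toward +inf, toward -inf */

/* Shift a magnitude right by k bits (1 <= k <= 63) and round it, using only
   the guard bit (first bit shifted out) and the sticky bit (OR of all lower
   bits). negative selects the sign for the directed modes. If q carries into
   a new bit position the caller must renormalize. */
static uint64_t round_shift(uint64_t mag, unsigned k, int negative, rmode_t mode) {
    uint64_t q = mag >> k;
    unsigned guard   = (unsigned)((mag >> (k - 1)) & 1u);
    unsigned sticky  = (mag & ((1ULL << (k - 1)) - 1)) != 0;
    unsigned inexact = guard | sticky;

    switch (mode) {
    case RNE: if (guard && (sticky || (q & 1u))) q++; break;  /* ties to even */
    case RTZ: break;                                          /* truncate */
    case RUP: if (inexact && !negative) q++; break;           /* magnitude up only if positive */
    case RDN: if (inexact && negative) q++; break;            /* magnitude up only if negative */
    }
    return q;
}
```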


    OK.

    It seemed like when I looked over the 1985 spec initially, it only
    required that the intermediate result be wider than the destination
    (I seemingly missed the point that it also requires infinite precision).

    Say, 54*54 => 68 bits, where 68 > 52, under this interpretation, it
    would have worked. Granted, this does turn it into a probability game
    whether the result is correct or off by 1.

    It is 53×53->106 bits to get correct rounding in 1 step.
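    A scaled-down illustration of why the full product matters, using
    8-bit significands instead of 53-bit ones. This is a toy model: it
    simply masks the low bits of the exact product, whereas a real
    truncated multiplier drops partial products, so its errors may differ.
    Whenever the retained bits look like an exact tie, the discarded bits
    decide the sticky bit, and dropping them can flip round-to-nearest-even
    (mul_round8 is a hypothetical name):

```c
#include <stdint.h>

/* a, b: 8-bit significands with the top bit set (128..255).
   The exact 16-bit product is rounded to 8 bits, round-to-nearest-even.
   drop_bits models a multiplier that never computes the lowest product bits. */
static uint32_t mul_round8(uint32_t a, uint32_t b, unsigned drop_bits) {
    uint32_t p = a * b;                            /* exact product */
    p &= ~((1u << drop_bits) - 1);                 /* model: low bits discarded */
    unsigned k = (p >> 15) ? 8 : 7;                /* normalize to an 8-bit result */
    uint32_t q = p >> k;
    unsigned guard  = (p >> (k - 1)) & 1u;
    unsigned sticky = (p & ((1u << (k - 1)) - 1)) != 0;
    if (guard && (sticky || (q & 1u)))             /* RNE; a carry to 256 needs renormalizing */
        q++;
    return q;
}
```

    Here 187 × 187 = 0x8899: with the low 6 bits dropped, the retained bits
    look like an exact tie and round down to 136, while the exact product
    rounds up to 137.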

    But I have since noticed that it does specify computing to infinite precision (in this version of the standard), which my FPU does not do.

    My point exactly,


    There was mention of some operations that I have generally not seen in
    the ISA in real-world FPUs:
    An FP remainder operator;

    Something IEEE specifies, but it would require an intermediate of 2045
    bits to get correct in all circumstances. This is easier to do in
    SW! The MC68881 did it in nearly 2300 cycles!!

    Converters to/from ASCII strings;

    Easier and better in SW.

    An FP->Int truncate operator with the result still in FP format;

    RND (round) instruction.

    Usually, one goes round-trip FP->Int->FP;

    This has underflow and overflow problems: 2^1022 -> int => overflow, ...
    ...
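    A C sketch of FP-to-FP truncation done by masking fraction bits, so no
    integer conversion (and no overflow) is involved. This is a common
    software technique, shown here as an assumption of what a
    round-toward-zero RND computes, not a description of any particular
    ISA's instruction (soft_trunc is a hypothetical name):

```c
#include <stdint.h>
#include <string.h>

/* Round toward zero; the result stays in double format. */
static double soft_trunc(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    int e = (int)((u >> 52) & 0x7FF) - 1023;    /* unbiased exponent */
    if (e >= 52)
        return x;                               /* already an integer, or Inf/NaN */
    if (e < 0)
        u &= 1ULL << 63;                        /* |x| < 1: keep only the sign, giving +/-0 */
    else
        u &= ~((1ULL << (52 - e)) - 1);         /* clear fraction bits below the binary point */
    memcpy(&x, &u, sizeof x);
    return x;
}
```

    By contrast, the FP->Int->FP round trip through int64_t is only defined
    for values whose integer part fits; converting 0x1p1022 to an integer
    type is undefined behavior in C.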

    Seems like pretty much everyone offloaded these tasks to the C library.

    More modern machines have RND; nobody will ever have REM.


    I had ended up with coverage of most of the rest, albeit still lacking a "trap on denormal" handler (seemingly worked for MIPS and friends, *).

    So, it seemed like it was getting pretty close to "could maybe pass the
    1985 spec if one lawyers it...". Maybe not so much it seems, unless I
    fix the FMUL issue (TBD if it can be done without significantly
    increasing adder-chain latency).

    You could check for "inability to correctly round" and trap on that.
    {I have a patent on doing this in transcendental instructions}


    It is possible I could also add a check to detect and trap multiplies
    for cases where both values have non-zero low-order bits (allowing these
    to also be emulated in software).

    So, went and added a flag for "Trap as needed to emulate full IEEE
    semantics" to FPSCR, where the idea is that enabling this will cause it
    to trap in cases where the FPU detects that the results would likely not match the IEEE standard (if using FADDG/FSUBG/FMULG/..., generally if fenv_access is enabled).

    Might make sense to have a compiler option to assume fenv_access is
    always enabled.



    *: Though, from what I can gather, most of the N64 games and similar had operated with this disabled (giving DAZ/FTZ semantics) which apparently
    posed an annoyance for later emulators (things like moving platforms in
    games like SMB64 would apparently slowly drift upwards or away from the origin if the map was left running for long enough, etc; due to SSE and similar tending to operate with denormals enabled).

    GPUs started out without even IEEE 754 formats and over many generations
    did more and more of 754, then 754-2008, and are now closing in on 754-2019.


    With FMAC (with single rounding, which is the interesting one) you can
    of course get catastrophic cancellation, so you need all the 2N mantissa
    bits of the multiplication plus the N bits from the addend, and then you
    either need a normalizer wide enough to take in any possible alignment
    of the two parts, or you must have separate logic for each of the major
    cases.
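    A concrete case of that cancellation in C (operands picked for
    illustration): with a = 1 + 2^-52, b = 1 - 2^-52 and c = -1, the exact
    product is 1 - 2^-104, so the whole answer lives in the lowest bits of
    the 106-bit product. A separate multiply and add round the product to
    1.0 and return 0.0; a single-rounded FMAC must return -2^-104. The
    exact value is checked here with the GCC/Clang __int128 extension
    rather than fma() from <math.h>, which on a conforming implementation
    returns the same -2^-104 (mul_then_add and exact_for_example are
    hypothetical names).

```c
#include <stdint.h>

static const double A = 1.0 + 0x1p-52, B = 1.0 - 0x1p-52, C = -1.0;

/* Two roundings. volatile stops the compiler from contracting a*b + c
   into a fused multiply-add. */
static double mul_then_add(double a, double b, double c) {
    volatile double p = a * b;     /* exact 1 - 2^-104 rounds to 1.0 */
    return p + c;                  /* 1.0 - 1.0 = 0.0: the answer is lost */
}

/* Exact A*B + C using integers scaled by 2^104 (GCC/Clang __int128). */
static double exact_for_example(void) {
    __int128 a = ((__int128)1 << 52) + 1;      /* A * 2^52 */
    __int128 b = ((__int128)1 << 52) - 1;      /* B * 2^52 */
    __int128 c = -((__int128)1 << 104);        /* C * 2^104 */
    __int128 s = a * b + c;                    /* (2^104 - 1) - 2^104 = -1 */
    return (double)s * 0x1p-104;               /* -2^-104, exactly representable */
}
```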


    Yeah, for the 2008 spec onward, would also need this...

    It is possible to provide it as a library call, but granted this makes
    it slower.


    There are FMAC instructions, but they are currently both slow and double-rounded (so, not so useful). Well, except for Binary16 and
    Binary32 which appear single-rounded mostly because they happen to be performed internally as Binary64 (but are still slow).


  • From George Neuner@21:1/5 to [email protected] on Mon Sep 2 00:08:21 2024
    On Sun, 1 Sep 2024 22:07:53 +0200, David Brown
    <[email protected]> wrote:

    On 31/08/2024 21:08, Thomas Koenig wrote:
    Anton Ertl <[email protected]> schrieb:

    Of course the fans of compilers that do what nobody means found a
    counterargument long ago: They claim that compilers would need psychic
    powers to know what you mean.

    Of course, different compiler writers have different opinions, but
    what you write is very close to a straw man argument.

    What compiler writers generally agree upon is that specifications
    matter (either in the language standard or in documented behavior
    of the compiler). However, the concept of a specification is
    something that you do not appear to understand, and maybe never
    will.

    An example: I work in the chemical industry. If a pressure vessel
    is rated for 16 bar overpressure, we are not allowed to run it at
    32 bar. If the supplier happens to have sold vessels which can
    actually withstand 32 bar, and then makes modifications which
    lower the actual pressure the vessel can withstand to only 16 bar,
    the customer has no cause for complaint.

    As usual, the specification goes both ways: The supplier
    guarantees the pressure rating, and the customer is obliged
    (by law, in this case) to never operate the vessel above its
    pressure rating. Hence, safety valves and rupture discs.

    That is very well put.

    Specifications are an agreement between the supplier and the client.
    The supplier promises particular functionality if the client stays
    within those specifications. It is how things work in a huge range of
    aspects of life. Sometimes there are agreements in place for what
    happens if the specifications are broken (fine if you fail to deliver as
    promised, jail sentence if you break the law, etc.), but these are
    really just extensions of the agreement and specification.

    If we think about computing, we can start with mathematics for examples.
    A mathematical function maps one set onto another - it specifies what
    value in the output set is produced from each value in the input set.
    It does not specify the result for values that are not in the input set,
    even if they are in a "reasonable" superset. So the real square root
    function specifies an output for all non-negative real numbers - it does
    not specify the result for negative real numbers. Attempting to find
    the square root of a negative number is undefined behaviour.
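    The same idea written as a C function contract (a minimal sketch;
    integer square root stands in for the real one so the example needs no
    libm, and isqrt64 is a hypothetical name): the pre-condition is the
    caller's obligation, and the post-condition is what the function
    promises in return. The asserts only document and spot-check the
    contract; they vanish under NDEBUG.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-condition:  n >= 0 (the caller's obligation).
   Post-condition: r*r <= n < (r+1)*(r+1) (the function's promise). */
static uint32_t isqrt64(int64_t n) {
    assert(n >= 0);                              /* check the pre-condition */
    uint64_t x = (uint64_t)n, r = 0;
    /* digit-by-digit (base 4) square root */
    for (uint64_t bit = 1ULL << 62; bit != 0; bit >>= 2) {
        if (x >= r + bit) {
            x -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
    }
    assert(r * r <= (uint64_t)n && (r + 1) * (r + 1) > (uint64_t)n);  /* post-condition */
    return (uint32_t)r;
}
```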

    Functions in computing are the same. You have a specification - a
    pre-condition, and a post-condition. The inputs (including the
    environment, if that is relevant) have to satisfy the pre-condition, and
    then the function guarantees that the post-condition will hold after the
    function call. Try to put anything else into the function without
    satisfying the pre-condition, and it's garbage in, garbage out. If you
    don't understand "garbage in, garbage out", you really don't understand
    the first thing about software development. This has been understood
    since the beginning of the programmable computer:

    """
    On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into
    the machine wrong figures, will the right answers come out?' I am not
    able rightly to apprehend the kind of confusion of ideas that could
    provoke such a question.
    """


    In the context of compilers, the specification is the language standard
    in use at the time, combined with the specifications of any library
    functions or other code being used. If you don't follow those
    specifications - your input code does not meet the pre-conditions, or
    the pre-conditions are not met when your code is run - you get undefined >behaviour. There is no rational way to expect any particular result
    when the input is in essence meaningless.


    So if there is a function (or operator, or other feature) specified by
    the language or by library or function documentation, and you pass it
    something that is not documented as fulfilling the pre-conditions, it's
    garbage in, garbage out - your code is wrong. If your code makes
    assumptions about the workings of a function that are not specified in
    its post-condition, the code is wrong. It might work during testing,
    but it is not guaranteed to work. If you try to use a function outside
    its specifications, then your code is wrong.


    Of course it is not always easy to make sure everything is correct
    within specifications. Programming languages and libraries are
    complicated, and people make mistakes. And where practical, it can be
    good to take that into consideration - if it is possible to give error
    messages or help in the case of bad inputs, then that can be very
    helpful to people. But it doesn't make sense to try to give the "right"
    output for wrong input. And it doesn't make sense to do this to the
    significant detriment of efficiency with correct inputs.

    To compare this to specifications in other walks of life, imagine an
    electricity company. The specification they provide to you, the
    customer, has the pre-condition that you pay your bills. The
    post-condition is that you get electricity. If you break the
    specification - you stop paying your bills - it's perfectly reasonable
    that they cut off your electricity. But it is /nice/ if they first send
    you warning letters, and offers to re-arrange your debt. But if you are
    following the specifications and paying your bills, you would not want
    the electricity company to keep providing electricity to those who don't
    pay, because that would mean /you/ would have to pay more.

    In the same way, I want my compiler to warn about potential problems or
    undefined behaviour when it reasonably can, rather than jumping straight
    to nasal daemons. But I don't want it to generate slower code than it
    otherwise could, just because some people might write incorrect code. I
    should not have to pay (in run-time efficiency losses) for other
    people's potential failure to follow specifications.

    But I am quite happy to have compiler options to control the balance and
    behaviour. Compilers generally do little optimisation without flags
    explicitly enabling them. And some compilers have flags to change the
    language specifications (such as making signed integer arithmetic wrap).
    There's not a lot they could do better to satisfy people who want the
    tools to conform to their imagined specification rather than the actual
    specifications.

    I suppose one thing they could do is that when a new compiler version
    comes out with new optimisations, they could have a flag that turns
    these off even if you have enabled others. Maybe you could have
    "-olimit=8" to say "limit optimisations to those in gcc 8". That might
    give fewer surprises to people who have got their code wrong.


    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly is
    mentioned as UB in some standard N, but was not addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became UB?


    It does seem to me that as the C standard evolved, and as more things
    have *explicitly* become documented as UB, compiler developers have
    responded largely by dropping whatever the compiler did previously -
    sometimes breaking code that relied on it.

    I have moved on from C (mostly), and I learned long ago to archive
    toolchains and to expect that any new version of a tool might break
    something that worked previously. I don't like it, but it generally
    doesn't annoy me that much.


    YMMV. Certainly Anton's does. ;-)

    Similar to you (David), I came from a - not embedded per se - but
    kiosk background: HRT industrial QA/QC systems. I know well the
    attraction of a new compiler yielding better performing code. I also
    know a large amount of my code was hardware and OS specific, that
    those are the things beyond the scope of the compiler, but they also
    are things that I don't want to have to revisit every time a new
    version of the compiler is released.

    13 of one, baker's dozen of the other.

  • From Thomas Koenig@21:1/5 to George Neuner on Mon Sep 2 05:55:34 2024
    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly is mentioned as UB in some standard N, but was not addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became UB?

    Can you give an example?

    I would say this really depends on the circumstances. If it is
    something left unspecified by earlier standards, and put into the
    list of undefined behavior as a clarification, that is one thing.

    If it is something that was previously well-defined and then made
    into undefined behavior, that is another thing; I would then
    likely consider it a bug in the standard (but again, depending
    on the circumstances).


    It does seem to me that as the C standard evolved, and as more things
    have *explicitly* become documented as UB, compiler developers have
    responded largely by dropping whatever the compiler did previously - sometimes breaking code that relied on it.

    There's a reason that there is a "porting to" file for each release
    of gcc; in a way, each release can be considered a new compiler.

    As an example, here's an entry from
    https://gcc.gnu.org/gcc-13/porting_to.html :

    # Fortran language issues

    # Behavior on integer overflow

    # GCC 13 includes new optimizations which may change behavior on
    # integer overflow. Traditional code, like linear congruential
    # pseudo-random number generators in old programs and relying on
    # a specific, non-standard behavior may now generate unexpected
    # results. The option -fsanitize=undefined can be used to detect
    # such code at runtime.
    #
    # It is recommended to use the intrinsic subroutine RANDOM_NUMBER for
    # random number generators or, if the old behavior is desired, to use
    # the -fwrapv option. Note that this option can impact performance.

    Integer overflow on multiplication had always been illegal in
    Fortran (prohibited by "shall not"), but it had widely been used
    anyway. That was a tough one...
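    For the linear congruential case, the usual fix in C (the thread's
    other language) is to do the arithmetic in an unsigned type, where
    wraparound is defined modulo 2^N. A sketch using the constants of the
    sample rand() implementation shown in the C standard, with hypothetical
    function names:

```c
#include <stdint.h>

/* Unsigned state: the multiply wraps modulo 2^32 by definition.
   A signed int32_t state would overflow here, which is UB in C. */
static uint32_t lcg_state = 1;

static int sample_rand(void) {
    lcg_state = lcg_state * 1103515245u + 12345u;
    return (int)((lcg_state / 65536u) % 32768u);
}

static void sample_srand(uint32_t seed) { lcg_state = seed; }
```

    With seed 1 the first value is 16838. Declaring the state as int32_t
    instead would turn this into the C analogue of the code the GCC 13
    note warns about.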

  • From Terje Mathisen@21:1/5 to Brett on Mon Sep 2 10:01:43 2024
    Brett wrote:
    John Dallman <[email protected]> wrote:
    In fact, organisations replace about a quarter of their machines each
    year, always buying up-to-date ones, and want to run the /same/ version
    of software on all of them. They want common software versions for data
    compatibility, ease of training and so on. That means that a new release
    of an application has to run on all the machines sold in the last four
    years, sometimes longer.

    I assume you work in the high end, as the average desktop PC is replaced every 8 years on a “use it until it breaks” policy.

    Dell will tell you 5 years, and Google is paid to say the same.
    And that actually might be true for laptops, but not desktops.

    The bulk of the PCs and servers where I work are a dozen years old.
    A smattering of new PCs brings the average down to 9 years.

    Organizations that rely on commercial licenced software have a much
    easier calculation to make:

    "I pay 10-100K dollars every year per CPU for my 3D
    CAD/modelling/whatever software; if I can buy a new system in 2-4 years'
    time which is 50% faster (more cores/faster threads), then it could make
    sense to upgrade every year, except for the hassle of installing
    everything."

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Tim Rentsch@21:1/5 to George Neuner on Mon Sep 2 01:40:46 2024
    George Neuner <[email protected]> writes:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly
    is mentioned as UB in some standard N, but was not addressed in
    previous standards.

    Was it always UB? Or should it be considered ID until it became
    UB?

    It does seem to me that as the C standard evolved, and as more
    things have *explicitly* become documented as UB, compiler
    developers have responded largely by dropping whatever the
    compiler did previously - sometimes breaking code that relied on
    it.

    For the most part the circumstances you describe simply don't
    occur. I know of one case where a rule introduced in C11
    identified a specific situation as undefined behavior whereas
    in C99 and before it was arguably not undefined behavior (but
    never behavior that should be relied on). I don't remember
    any others; if you have any specific examples please mention
    them.

    Something that does happen is a rule is given fuzzily in one
    version of the C standard and then made more precise in a later
    version. A good example of that is evaluation sequencing.
    Before C11 the rules about what evaluations must be done before
    other evaluations were not as clear as they should be. C11 fixed
    that. However in that case I don't think anything went from
    certainly defined (or certainly unspecified) to undefined, but
    rather changed in the other direction, from possibly undefined
    to certainly defined. Offhand I don't remember any other
    examples, although surely there must be some.

    Sometimes it happens that there is a change in the C language not
    because wording in the Standard changes but because how the
    wording in the Standard is interpreted, usually through a
    response to a Defect Report. A good example of this kind of
    change is "wobbly bits" - the idea that when a variable has not
    been initialized then the bits of the variable are allowed to
    change at any time. (By the way, IMO this idea is completely
    stupid.) As far as I am aware this principle is not stated
    anywhere in the C standard itself, but has crept into how the C
    standard is interpreted by way of responses to Defect Reports.
    It could be that changes of this kind are what you are thinking
    about.

    Overall though, I think the greatest changes in compiler behavior
    are a result not of changes in the C standard but of optimization
    techniques becoming more aggressive. To make things worse, it
    isn't always clear whether a changed behavior is the result of a
    more aggressive advantage-taking of a true UB situation, or if
    the optimizer is buggy. I encountered an interesting situation
    recently where a given piece of code worked just fine under both
    gcc and clang, *except* under gcc at level O3 (clang at O3 had no
    problems). It's been more than a decade since C11 was ratified
    (and nearly a quarter of a century since C99). Compilations
    should always be done with an explicit -std=c99 or -std=c11. If
    you have been compiling with -std=c99 all this time, or even
    using -std=c11 over the shorter time frame, and you see changes
    between different versions of the compiler, it's not the C
    standard changing that's causing the problem, but how the
    compiler is choosing to act on what should be a fixed set of
    rules.
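    One way to make the chosen standard visible in a build (a sketch;
    c_standard_version is a hypothetical helper): refuse to compile
    without at least C99 and report __STDC_VERSION__ at run time. GCC also
    defines __STRICT_ANSI__ under -std=c99/-std=c11 but not under the
    gnu99/gnu11 dialects, if that distinction matters.

```c
/* Refuse to build under pre-C99 settings; -std=c99, -std=c11, ... all pass. */
#if !defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L
#error "compile with an explicit -std=c99 (or later)"
#endif

/* Reports the standard this translation unit was compiled for:
   199901L = C99, 201112L = C11, 201710L = C17. */
static long c_standard_version(void) {
    return __STDC_VERSION__;
}
```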

    Completely coincidentally, I happened to see a couple of
    videos recently

    https://www.youtube.com/watch?v=si9iqF5uTFk Grace M Hopper I
    https://www.youtube.com/watch?v=AW7ZHpKuqZg Grace M Hopper II

    that I think folks in comp.arch might be interested to watch.
    The second one deals with language versions and compiler
    verification (among other topics). A bit on the long side
    but I enjoyed watching them.

  • From Terje Mathisen@21:1/5 to Stephen Fuld on Mon Sep 2 10:23:43 2024
    Stephen Fuld wrote:
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by keeping requirements such as vessel border width, geometry etc.,
    while compiler writers are free in their search for optimization tricks
    that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service, due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code--
    If there were perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written in
    C versus code written in something like Ada or Rust. Backing away now
    . . . :-)

    IMNSHO, code written in asm is generally more safe than code written in
    C, because the author knows exactly what each line of code is going to do.

    The problem is of course that it is harder to get 10x lines of correct
    asm than to get 1x lines of correct C.

    BTW, I am also solidly in the grey hair group here, writing C code that
    is very low-level, using explicit local variables for any loop
    invariant, copying other stuff into temp vars in order to make it really obvious that they cannot alias any globals or input/output parameters.
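    A small C illustration of why that habit helps (function names are
    hypothetical): if the output buffer may alias the factor, the compiler
    has to re-load *factor after every store, and the two versions are not
    even equivalent when the aliasing actually happens. Copying the
    invariant into a local removes both the reload and the ambiguity.

```c
#include <stddef.h>

/* Reads *factor on every iteration: a store to dst[i] might change it,
   so the load cannot be hoisted out of the loop. */
static void scale_reload(float *dst, const float *src, size_t n, const float *factor) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *factor;
}

/* Loop invariant copied into a local first: the local cannot alias dst,
   and the intent (use the value at entry) is explicit. */
static void scale_local(float *dst, const float *src, size_t n, const float *factor) {
    const float f = *factor;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * f;
}
```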

    Anyway, that is all mostly moot since I'm using Rust for this kind of programming now. :-)

    Terje


  • From David Brown@21:1/5 to George Neuner on Mon Sep 2 14:22:51 2024
    On 02/09/2024 06:08, George Neuner wrote:
    On Sun, 1 Sep 2024 22:07:53 +0200, David Brown

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly is mentioned as UB in some standard N, but was not addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became UB?

    I can't answer for languages other than C and C++ (others might be able
    to compare usefully to, for example, Ada or Fortran). But the C
    standards explicitly state that behaviours that are not defined in the standards are undefined behaviour in exactly the same way as cases that
    are labelled as undefined behaviour, and also cases where the program
    violates a "shall" or "shall not" requirement.

    To be clear - the meaning of "undefined behaviour" is simply that no
    behaviour has been defined. The C standards can say that something is "undefined behaviour" (or just fail to give a definition of the
    behaviour) and then the implementation can give a definition of it. An
    example here would be that the C standards say that signed integer
    arithmetic overflow is undefined behaviour - if you have a signed
    integer operation and the mathematically correct results can't be
    represented in the type, then there is no possible way for the generated
    code to give the correct result. The C standards therefore leave this
    as "undefined behaviour". However, if you use "gcc -fwrapv" then the
    behaviour /is/ defined - it is defined as two's complement wrapping.
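    A C sketch of the difference (helper names are hypothetical): plain
    int32_t multiplication is UB on overflow, while the two functions below
    are defined without -fwrapv. The wrapping version relies on unsigned
    arithmetic plus an implementation-defined conversion back to int32_t,
    which GCC documents as reduction modulo 2^32; the checked version uses
    the GCC/Clang __builtin_mul_overflow builtin.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrapping multiply: unsigned arithmetic is defined modulo 2^32 (assumes the
   usual 32-bit int, so uint32_t does not promote to signed int). The
   conversion back to int32_t is implementation-defined, not undefined. */
static int32_t wrap_mul(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a * (uint32_t)b);
}

/* Checked multiply: returns true and stores the product only if it fits. */
static bool checked_mul(int32_t a, int32_t b, int32_t *out) {
    return !__builtin_mul_overflow(a, b, out);
}
```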

    So if you write C code that overflows signed integer arithmetic and
    relies on given behaviour and results, the code is wrong because it has undefined behaviour - you are, at best, relying on luck. But if you
    write C code with such demands and specify that it is only suitable for
    use with the gcc "-fwrapv" flag, then it is not wrong and there is no
    undefined behaviour because the compiler implementation has given a
    definition of the behaviour. However, if you use the same code with,
    say, old versions of MSVC then you are back to luck and UB even if that compiler does not have optimisations based on knowing that signed
    integer arithmetic overflow is UB. And it is /your/ fault when the code
    fails on newer versions of MSVC that /do/ have such optimisations.

    This is all very different from what the C standards call
    "implementation-defined behaviour". Such things as how out-of-range
    unsigned values are converted to signed integers are explicitly IB in
    the C standards - implementations must define and document the behaviour.



    It does seem to me that as the C standard evolved, and as more things
    have *explicitly* become documented as UB, compiler developers have
    responded largely by dropping whatever the compiler did previously - sometimes breaking code that relied on it.


    I think that is perhaps partly true, partly a myth, and partly simply a side-effect of compilers gaining more optimisations as they are able to
    analyse more code at a time and do more advanced transforms. The C
    standards have clarified some of the text over time (most people would
    agree there is still plenty of scope for improvement there!). That can
    include changing some things that were previously undefined by omission
    to being explicitly labelled UB. I can't think of any examples
    off-hand. But note that this would not in any way change the meaning of
    the code - UB by omission is the same as explicit UB as far as the C
    language is concerned. There are very few cases where code was correct
    for original standard C90 (i.e., independent of any IB and independent
    of particular compilers) and is not correct C23 with identical defined behaviour. There were a few things changed between C90 and C99, but I
    don't know of any since then other than a few added keywords that could conflict with user identifiers.


    It is an unfortunate truth that older C compilers did not do as good a
    job at optimisation as newer ones. And this meant that many tricks were
    used in order to get efficient results, even though some of these relied
    on UB. Such code can have different results on different compilers, or different sets of options, because there is no definition of what the
    "correct" result should be. The programmer will have a clear idea of
    what they think is "correct", but it is not defined or specified
    anywhere. Usually the programmer feels it is "obvious" what the
    intended behaviour is - but "obvious" to a programmer does not mean
    "obvious" to a compiler. Thus you end up with code that works (as
    intended by the programmer) by testing and good luck with some compilers
    and options, and fails by bad luck on other compilers or options. The
    compiler didn't "break" the code - the code was broken to start with.
    But it is entirely reasonable and understandable why the programmer
    wrote the "broken" code in the first place, and why it did a useful job
    despite having UB.


    So I appreciate when people get frustrated that changes to a tool change
    the apparent behaviour of their code. But it is important to understand
    that the compiler is not wrong here - it is doing the best job it can for
    people writing correct code. A development tool should prioritise the people
    using it /now/ - and while there is C code in use today that was written
    many decades ago, the majority of C code (and even more so for C++) is
    much more recent. It would be wrong to limit modern programmers because
    of code written long ago - even more so when there is no clear
    specification of how that old code was supposed to work.


    I have moved on from C (mostly), and I learned long ago to archive
    toolchains and to expect that any new version of a tool might break
    something that worked previously. I don't like it, but it generally
    doesn't annoy me that much.

    This all depends on the kind of code you write, and the kind of system
    you target. On my embedded targets, most of my code can be written in
    standard C. But a lot of it also uses at least some gcc extensions to
    improve the code - enhancing static error checking, making it more
    efficient, or making it easier and clearer to write. I am quite clear
    there that the code is dependent on gcc (it would probably also be fine
    for clang, but I have not checked that). For all such code, I do my
    utmost to make sure it is correct and safe, with no UB and no IB beyond
    what is obvious and necessary. Most programs will also contain code
    that is more specifically toolchain-dependent, perhaps with snippets of
    inline assembly, or target-specific features that are needed. This was
    more of an issue before, when I was using a wider range of compilers.

    But for any given project, I stick to a single compiler version and
    usually one set of compiler flags. For my work, code without C-level UB
    is not enough - I sometimes also need to test for things like run-time
    speed and code size, or interaction with external tools of various
    sorts, or stack usage limits - all things that are outside the scope of C.

    However, I don't remember when I last found that portable code that I
    wrote and was working on one compiler failed to have correct C-level functionality when compiled with a newer compiler (or flags) due to
    undefined behaviour, new optimisations, or changes in the C standard.
    I've had portability issues with older code due to IB such as writing
    code for a microcontroller with a different size of "int". I've seen
    issues with third-party code - I've had to compile such code with
    "-fwrapv -fno-strict-aliasing" on occasion. I've made other mistakes in
    my code. And I got UB things wrong in my early days when new to C programming. But truly, I am at a loss to understand why some people
    are so worried about UB in C - you simply need to know the rules and specifications for the language features you use, and follow those rules.
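    To make the point about a microcontroller with a different size of "int"
    concrete, here is a minimal illustration (not code from the post; the
    function names are hypothetical). The same expression gives different
    answers depending on the implementation-defined width of int, because of
    integer promotion:

```c
#include <stdint.h>

/* 'a + 1' is evaluated after integer promotion.
   With a 32-bit int, uint16_t promotes to (signed) int and 65535 + 1 == 65536.
   With a 16-bit int, uint16_t promotes to unsigned int and 65535 + 1 wraps to 0. */
long promoted_increment(uint16_t a)
{
    return a + 1;
}

/* Portable alternative: do the arithmetic in an explicitly sized type,
   so the result no longer depends on the width of int. */
uint32_t portable_increment(uint16_t a)
{
    return (uint32_t)a + 1u;
}
```

    On a typical hosted target with 32-bit int, promoted_increment(65535)
    returns 65536; on a target with 16-bit int it would return 0.
    portable_increment(65535) returns 65536 on both.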



    YMMV. Certainly Anton's does. ;-)

    Anton writes code that seriously pushes the boundary of what can be
    achieved. For at least some of the things he does (such as GForth) he
    is trying to squeeze every last drop of speed out of the target. And he
    is /really/ good at it. But that means he is forever relying on nuances
    about code generation. His code, at least for efficiency if not for correctness, is dependent on details far beyond what is specified and documented for C and for the gcc compiler. He might spend a long time
    working with his code and a version of gcc, fine-tuning the details of
    his source code to get out exactly the assembly he wants from the
    compiler. Of course it is frustrating for him when the next version of
    gcc generates very different assembly from that same source, but he is
    not really programming at the level of C, and he should not expect
    consistency from C compilers like he does.


    Similar to you (David), I came from a - not embedded per se - but
    kiosk background: HRT industrial QA/QC systems. I know well the
    attraction of a new compiler yielding better performing code. I also
    know a large amount of my code was hardware and OS specific, that
    those are the things beyond the scope of the compiler, but they also
    are things that I don't want to have to revisit every time a new
    version of the compiler is released.


    Yes. For this kind of work, you want to keep your build environment
    consistent - no matter how careful you are to write correct code without UB.

    13 of one, baker's dozen of the other.

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Mon Sep 2 13:13:20 2024
    Terje Mathisen <[email protected]> schrieb:
    Brett wrote:
    John Dallman <[email protected]> wrote:
    In fact, organisations replace about a quarter of their machines each
    year, always buying up-to-date ones, and want to run the /same/ version
    of software on all of them. They want common software versions for data
    compatibility, ease of training and so on. That means that a new release
    of an application has to run on all the machines sold in the last four
    years, sometimes longer.

    I assume you work in the high end, as the average desktop PC is replaced
    every 8 years on a “use it until it breaks” policy.

    Dell will tell you 5 years, and Google is paid to say the same.
    And that actually might be true for laptops, but not desktops.

    The bulk of the PC’s and servers where I work are a dozen years old.
    A smattering of new PC’s bring the average down to 9 years.

    Organizations that rely on commercial licenced software have a much
    easier calculation to make:

    "I pay 10-100K dollars every year per CPU for my 3D
    CAD/modelling/whatever software, if I can buy a new system in 2-4 years
    time which is 50% faster (more cores/faster threads), then it could make sense to upgrade every year, except for the hassle of installing
    everything."

    Made more complicated by wildly different license schemes.
    Some vendors give the victim^H^H^H^H^H^Hcustomer a number of
    licenses for interactive use (up to four cores, for example),
    and you have to purchase extra for "HPC" use (which is ridiculous
    today). With others, you need a "network license" to even connect
    remotely, but you can run a single calculation on as many parallel
    cores and CPUs, on a cluster, as you want.

  • From MitchAlsup1@21:1/5 to Robert Finch on Mon Sep 2 13:33:23 2024
    On Mon, 2 Sep 2024 5:06:34 +0000, Robert Finch wrote:

    ENTER, LEAVE, and RET as the only instructions capable of accessing the
    safe stack is fascinating me. I would like to try implementing this sort
    of thing in my design. Pondering why the PTE is specially marked
    RWE=000? One would think that some other OS available bits could be
    used. Does it make the MMU software easier to implement? Assuming that
    faults processed during ENTER, LEAVE, and RET are processed at a higher privilege level, could it not just check some other internal tables?

    a) I did not want to consume another bit in PTE
    b) I did not want to compare CSP with another base register
    So, RWE=000 was the ticket.
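    A rough model of the check described above, assuming a three-bit RWE
    field and treating ENTER/LEAVE/RET as a single "safe-stack" access kind.
    This is a sketch for discussion, not Mitch's actual MMU logic; the
    encoding and all names are made up:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the three permission bits in a PTE. */
enum { PTE_R = 4, PTE_W = 2, PTE_E = 1 };

/* Hypothetical access kinds; ACC_SAFE_STACK stands in for the
   accesses made by ENTER, LEAVE and RET. */
typedef enum { ACC_LOAD, ACC_STORE, ACC_FETCH, ACC_SAFE_STACK } access_kind;

/* Returns true if an access of the given kind may touch a page whose
   permission field is 'rwe'. A field of 000 marks safe-stack pages, so
   no extra PTE bit and no extra base-register comparison are needed. */
bool access_allowed(uint32_t rwe, access_kind kind)
{
    rwe &= PTE_R | PTE_W | PTE_E;

    if (rwe == 0)                       /* RWE=000: safe-stack page */
        return kind == ACC_SAFE_STACK;

    switch (kind) {
    case ACC_LOAD:  return (rwe & PTE_R) != 0;
    case ACC_STORE: return (rwe & PTE_W) != 0;
    case ACC_FETCH: return (rwe & PTE_E) != 0;
    case ACC_SAFE_STACK:
        /* Assumption of this sketch: safe-stack accesses are confined
           to RWE=000 pages. */
        return false;
    }
    return false;
}
```

    Ordinary loads, stores and fetches fault on an RWE=000 page, while the
    safe-stack kind is the only one allowed there.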

    This ends up very similar to MILL in the Safe-Stack stuff. I tried to
    do it without a separate stack and failed.


    Decided to try implementing a capabilities machine in the current
    design. Modeled it after the RISC-V capabilities instructions in the
    CHERI document. It was either that or a segmentation system. Got to keep
    the ole brain working.

    Going with an OoO design for Bigfoot.

    The rf386 takes an average of about 8 clocks per instruction. Helped out
    by the presence of a data cache. IPC of 0.125 is nothing to write about. About 5 MIPS at 50 MHz. Stores are fast (2-3 cycles), but loads are
    another story (14 ish cycles).

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Sep 2 13:36:49 2024
    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly is
    mentioned as UB in some standard N, but was not addressed in previous
    standards.

    Was it always UB? Or should it be considered ID until it became UB?

    Can you give an example?

    memcpy() with overlapping pointers.

  • From Tim Rentsch@21:1/5 to [email protected] on Mon Sep 2 06:59:32 2024
    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly is
    mentioned as UB in some standard N, but was not addressed in previous
    standards.

    Was it always UB? Or should it be considered ID until it became UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Calling memcpy() between objects that overlap has always been
    explicitly and specifically undefined behavior, going back to
    the original ANSI C standard.
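    For completeness: the standard function for overlapping copies is
    memmove(), which ISO C specifies to behave as if the source bytes were
    first copied into a temporary buffer. A small example (the helper name
    remove_at is hypothetical):

```c
#include <string.h>

/* Remove the element at index 'pos' from an array of 'n' ints by shifting
   the tail down one slot. Source and destination overlap, so memcpy()
   would be undefined behaviour here; memmove() is defined for this case. */
size_t remove_at(int *arr, size_t n, size_t pos)
{
    if (pos >= n)
        return n;                       /* nothing to remove */
    memmove(&arr[pos], &arr[pos + 1], (n - pos - 1) * sizeof arr[0]);
    return n - 1;                       /* new element count */
}
```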

  • From Michael S@21:1/5 to Tim Rentsch on Mon Sep 2 18:09:03 2024
    On Mon, 02 Sep 2024 06:59:32 -0700
    Tim Rentsch <[email protected]> wrote:

    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that
    explicitly is mentioned as UB in some standard N, but was not
    addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became
    UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Calling memcpy() between objects that overlap has always been
    explicitly and specifically undefined behavior, going back to
    the original ANSI C standard.

    3 years ago Terje Mathisen wrote that many years ago he had read that the
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different ages living in different parts of the world
    are telling the same story. Maybe there exists an old popular book that
    said that it was defined?

  • From Stephen Fuld@21:1/5 to Terje Mathisen on Mon Sep 2 09:46:47 2024
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by keeping requirements as vessel border width, geometry
    etc.,
    while compiler writers are free in their search for optimization tricks
    that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code--
    If there were perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written
    in C versus code written in something like Ada or Rust.  Backing away
    now . . . :-)

    IMNSHO, code written in asm is generally more safe than code written in
    C, because the author knows exactly what each line of code is going to do.

    The problem is of course that it is harder to get 10x lines of correct
    asm than to get 1x lines of correct C.

    BTW, I am also solidly in the grey hair group here, writing C code that
    is very low-level, using explicit local variables for any loop
    invariant, copying other stuff into temp vars in order to make it really obvious that they cannot alias any globals or input/output parameters.
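    A minimal sketch of that style: copying a value into a local so it is
    visibly loop-invariant and cannot alias the output buffer. The global
    and function names are illustrative only:

```c
#include <stddef.h>

/* Illustrative global, standing in for any state the compiler cannot
   prove is distinct from the output buffer. */
int scale_factor = 3;

/* 'out' and 'scale_factor' are both int lvalues, so a store to out[i]
   might modify scale_factor; the compiler must reload it each iteration. */
void scale_naive(int *out, const int *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * scale_factor;
}

/* Copying it into a local makes the invariant explicit to both the
   reader and the compiler. */
void scale_local(int *out, const int *in, size_t n)
{
    const int k = scale_factor;         /* explicit loop invariant */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}
```

    The two functions are only equivalent when out does not overlap
    scale_factor; that possibility is exactly why the compiler has to be
    conservative with the first version.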

    Anyway, that is all mostly moot since I'm using Rust for this kind of programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Tim Rentsch@21:1/5 to Michael S on Mon Sep 2 10:21:17 2024
    Michael S <[email protected]> writes:

    On Mon, 02 Sep 2024 06:59:32 -0700
    Tim Rentsch <[email protected]> wrote:

    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that
    explicitly is mentioned as UB in some standard N, but was not
    addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became
    UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Calling memcpy() between objects that overlap has always been
    explicitly and specifically undefined behavior, going back to
    the original ANSI C standard.

    3 years ago Terje Mathisen wrote that many years ago he had read that the
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different ages living in different parts of the world
    are telling the same story. Maybe there exists an old popular book that
    said that it was defined?

    My first answer is that the question asked was about standards, and
    that is the question I was answering. There were no C standards
    before 1989.

    My second answer is, if I wanted to research the issue for the time
    before there were any C standards, I would start with these
    references, in more or less this order:

    K&R original edition (1978)
    PJ Plauger's book on implementing the C standard library
    Harbison and Steele
    K&R 2nd edition (1988?)

    Probably there are others but these are what I thought of off the
    top of my head.

    My third answer is, it wouldn't surprise me if there were a book or
    some sort of reference document that makes such a claim about how
    memcpy behaves, but I'm not aware of any (which doesn't mean
    anything), and nothing comes to mind in the general domain of near-authoritative books on C (other than the four listed above).
    So, assuming there is such a book or document, I expect it would
    be one of two things:

    Reference documentation for some specific C implementation (as
    for example from Sun Microsystems); or

    A book (or document) that purports to be authoritative (or maybe
    appears to be authoritative) but in reality is not.

    Obviously I can't disprove the existence of something that Terje
    said he read many years ago (perhaps with more information this
    could be done, but for sure I don't have such information). For the
    sake of discussion I'm willing to stipulate that Terje did read
    something and that what he read did say something about memcpy
    working for overlapping arguments. The question then becomes, What
    is it that he read, and what exactly did it say? I'm not in a
    position to answer those questions but maybe Terje or someone else
    remembers and can fill us in.

    (My aversion to using google groups stops me from following the
    reference you nicely provided.)

    To all this I should add that it certainly is feasible to implement
    memcpy so that it works with overlapping arguments, and I have no
    doubt (strictly speaking, less than epsilon doubt) that some library implementer somewhere (and probably more than one) has done this.
    Also it goes without saying that the C standard allows such a choice
    even today, and an implementation could choose to document that
    memcpy is well-behaved in that implementation. Undefined behavior
    doesn't mean that what will happen must be bad, only that what does
    happen is completely up to the implementation. Unfortunately more
    and more compiler writers are taking the attitude that any tiny bit
    of freedom in the direction of undefined behavior should be taken
    advantage of in pursuit of even the most trivial possible gain in
    performance, at the cost of ripping the code to shreds and making C
    less reliable than it could be (and should be). In some sense I am
    agreeing that the problem here is caused by the C standard, not by
    it changing in different versions but by it giving too much freedom
    to implementors for so-called "undefined behavior". Sadly the
    standardization process seems to have been taken over by compiler
    writers, so the best advice I can offer is to join the ISO C
    committee and start voting out the lunacy. Alternatively I suppose
    one could start up a competitive effort to gcc and clang, and offer
    a compiler that doesn't engage in such shenanigans unless told to do
    so (and told specifically), and then try to get developers to switch
    to sane C in preference to the ever-increasingly insane C that is
    most commonly used today.
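    As an illustration of how an implementation could make overlapping
    copies well-behaved: a sketch that picks the copy direction from the
    relative position of the buffers. This is essentially what memmove()
    does; the function name is hypothetical, and the uintptr_t comparison
    is an implementation-specific assumption:

```c
#include <stddef.h>
#include <stdint.h>

/* Overlap-tolerant byte copy. Comparing unrelated pointers with < is
   undefined in ISO C, so the comparison goes through uintptr_t, whose
   mapping is implementation-defined; a library written for one
   implementation can rely on that. */
void *overlap_safe_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination below source: copy forwards. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination above source: copy backwards so each byte is
           read before it is overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    /* d == s: nothing to do. */
    return dst;
}
```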

  • From Thomas Koenig@21:1/5 to [email protected] on Mon Sep 2 17:59:16 2024
    MitchAlsup1 <[email protected]> schrieb:
    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly is
    mentioned as UB in some standard N, but was not addressed in previous
    standards.

    Was it always UB? Or should it be considered ID until it became UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Does anybody have the first edition of K&R around to check what is
    explicitly stated there?

    If both were intended to have the same functionality, it would have
    been strange to define both.

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Sep 2 18:32:54 2024
    Tim Rentsch <[email protected]> schrieb:
    In some sense I am
    agreeing that the problem here is caused by the C standard, not by
    it changing in different versions but by it giving too much freedom
    to implementors for so-called "undefined behavior". Sadly the standardization process seems to have been taken over by compiler
    writers, so the best advice I can offer is to join the ISO C
    committee and start voting out the lunacy.

    The standard could always define previously undefined behavior in
    subsequent versions. (Adding a new feature is mostly that).

    However, the main problem I see is that of defining that subset
    or version or whatever you want to call it of C that you (generic
    you) want implemented. It could be defined as an extension (or
    restriction, if you will) of the C standard, with additional
    rules.

    Alternatively I suppose
    one could start up a competitive effort to gcc and clang, and offer
    a compiler that doesn't engage in such shenanigans unless told to do
    so (and told specifically), and then try to get developers to switch
    to sane C in preference to the ever-increasingly insane C that is
    most commonly used today.

    The specification needs to come first! Right now, compiler writers
    have a specification, the standard, which they generally follow
    (modulo bugs and extensions). You have to give them another,
    supplemental specification to follow if you want any chance
    of success.

    But writing such a specification is a lot of work, very hard work,
    and needs a lot of discussion.

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    If it gets accepted by a wide community, then a branch trying to
    implement that particular version in either gcc or clang (or
    both) could have a certain chance of being implemented by the
    main compilers.

  • From Scott Lurndal@21:1/5 to Tim Rentsch on Mon Sep 2 19:31:18 2024
    Tim Rentsch <[email protected]> writes:
    Michael S <[email protected]> writes:

    On Mon, 02 Sep 2024 06:59:32 -0700
    Tim Rentsch <[email protected]> wrote:

    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that
    explicitly is mentioned as UB in some standard N, but was not
    addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became
    UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Calling memcpy() between objects that overlap has always been
    explicitly and specifically undefined behavior, going back to
    the original ANSI C standard.

    3 years ago Terje Mathisen wrote that many years ago he had read that the
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different ages living in different parts of the world
    are telling the same story. Maybe there exists an old popular book that
    said that it was defined?

    My first answer is that the question asked was about standards, and
    that is the question I was answering. There were no C standards
    before 1989.

    Third edition of the SVID (8/89) has on pg. 7-83:

    USAGE:
    Character movement is performed differently in different
    implementations. Thus overlapping moves may be unpredictable.
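    A small demonstration of why overlapping moves were called
    unpredictable: two copy loops that agree whenever the buffers do not
    overlap, but give different results when they do. They stand in for two
    hypothetical library implementations, not code from any real libc:

```c
#include <stddef.h>

/* Copies from the first byte to the last. */
void copy_forward(char *d, const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Copies from the last byte to the first. */
void copy_backward(char *d, const char *s, size_t n)
{
    for (size_t i = n; i > 0; i--)
        d[i - 1] = s[i - 1];
}
```

    Moving bytes 0..3 of "abcdef" up by one position gives "aaaaaf" with
    copy_forward (each byte it reads has already been overwritten) and
    "aabcdf" with copy_backward.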

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Sep 2 19:32:56 2024
    Thomas Koenig <[email protected]> writes:
    MitchAlsup1 <[email protected]> schrieb:
    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that explicitly is
    mentioned as UB in some standard N, but was not addressed in previous
    standards.

    Was it always UB? Or should it be considered ID until it became UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Does anybody have the first edition of K&R around to check what is
    explicitly stated there?

    The System V Interface Definition, third edition, August 1989, states
    that overlapping moves are unpredictable specifically due to differences
    in implementation.

  • From Thomas Koenig@21:1/5 to Thomas Koenig on Mon Sep 2 20:52:21 2024
    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, a rule such as "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).
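    One way to picture an "observable error" rule is as a check the compiler
    would be obliged to emit at each dereference, with a visible effect so
    it cannot be optimised away. The sketch below is hypothetical
    throughout (macro and handler names are invented). It returns a
    fallback value so the example can keep running, whereas a real rule
    would more likely require termination. It also covers the struct-member
    case by testing the struct pointer itself:

```c
#include <stdio.h>

/* Hypothetical diagnostic hook: counts and reports each null
   dereference caught by the checks below. */
static int null_deref_count = 0;

static void report_null_deref(const char *file, int line)
{
    null_deref_count++;
    fprintf(stderr, "%s:%d: observable error: null pointer dereference\n",
            file, line);
}

/* Checked load through p. The failure path has a visible side effect,
   which is the property an "observable error" rule would demand.
   Note: p is evaluated twice, as usual for a macro of this shape. */
#define CHECKED_LOAD(p, fallback) \
    ((p) != NULL ? *(p) : (report_null_deref(__FILE__, __LINE__), (fallback)))

/* Checked member access: accessing s->field through a null s is already
   undefined in ISO C, so the test has to be on s itself. */
#define CHECKED_MEMBER(s, field, fallback) \
    ((s) != NULL ? (s)->field : (report_null_deref(__FILE__, __LINE__), (fallback)))
```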

  • From David Schultz@21:1/5 to Thomas Koenig on Mon Sep 2 16:42:07 2024
    On 9/2/24 12:59 PM, Thomas Koenig wrote:

    memcpy() with overlapping pointers.

    Does anybody have the first edition of K&R around to check what is
    explicitly stated there?

    memcpy() doesn't appear in the index.

    --
    http://davesrocketworks.com
    David Schultz

  • From MitchAlsup1@21:1/5 to BGB on Tue Sep 3 01:47:22 2024
    On Mon, 2 Sep 2024 19:32:23 +0000, BGB wrote:

    On 9/1/2024 6:32 PM, MitchAlsup1 wrote:

    More modern machines have RND nobody will ever have REM.

    Which is probably not a lot, as off-hand I am not aware of many ISA's
    that have floor/ceil/round in the ISA itself, rather than doing it via conversion to an integer type.

    VAX has round float* to float* 1978.....

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Sep 3 01:50:52 2024
    On Mon, 2 Sep 2024 20:52:21 +0000, Thomas Koenig wrote:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, a rule such as "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    It depends::

    Let
    Base = NULL;
    Index = &array / sizeof( array[0] );

    is::

    x = [base+index<<scale+small_offset]

    undefined ??

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

  • From MitchAlsup1@21:1/5 to David Schultz on Tue Sep 3 01:52:06 2024
    On Mon, 2 Sep 2024 21:42:07 +0000, David Schultz wrote:

    On 9/2/24 12:59 PM, Thomas Koenig wrote:

    memcpy() with overlapping pointers.

    Does anybody have the first edition of K&R around to check what is
    explicitly stated there?

    memcpy() doesn't appear in the index.

    It was in the library I used in 1980, on BSD Unix on a PDP-11/70.

  • From Tim Rentsch@21:1/5 to [email protected] on Mon Sep 2 20:40:39 2024
    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 20:52:21 +0000, Thomas Koenig wrote:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, a rule such as "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    It depends::

    Let
    Base = NULL;
    Index = &array / sizeof( array[0] );

    is::

    x = [base+index<<scale+small_offset]

    undefined ??

    These lines aren't even close to being meaningful C source.
    What question are you trying to ask?

  • From Tim Rentsch@21:1/5 to All on Mon Sep 2 20:37:31 2024
    Thomas Koenig <[email protected]> writes:

    I'm responding here to one part of your posting. I
    may respond to the other part at a later time.

    Tim Rentsch <[email protected]> schrieb:

    In some sense I am
    agreeing that the problem here is caused by the C standard, not by
    it changing in different versions but by it giving too much freedom
    to implementors for so-called "undefined behavior". Sadly the
    standardization process seems to have been taken over by compiler
    writers, so the best advice I can offer is to join the ISO C
    committee and start voting out the lunacy.

    Alternatively I suppose
    one could start up a competitive effort to gcc and clang, and offer
    a compiler that doesn't engage in such shenanigans unless told to do
    so (and told specifically), and then try to get developers to switch
    to sane C in preference to the ever-increasingly insane C that is
    most commonly used today.

    The specification needs to come first! Right now, compiler writers
    have a specification, the standard, which they generally follow
    (modulo bugs and extensions). You have to give them another,
    supplemental specification to follow if you want any chance
    of success.

    But writing such a specification is a lot of work, very hard work,
    and needs a lot of discussion.

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    If it gets accepted by a wide community, then a branch trying to
    implement that particular version in either gcc or clang (or
    both) could have a certain chance of being implemented by the
    main compilers.

    My suggestion is not to implement a language extension, but to
    implement a compiler conforming to C as it is now, with
    additional guarantees for what happens in cases that are
    undefined behavior. Moreover the additional guarantees are
    always in effect unless explicitly and specifically requested
    otherwise (most likely by means of a #pragma or _Pragma).
    Documentation needs to be written for the #pragmas, but no other
    documentation is required (it might be nice to describe the
    additional guarantees but that is not required by the C
    standard).

    The point is to change the behavior of the compiler but
    still conform to the existing ISO C standard.

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Tue Sep 3 05:55:14 2024
    Tim Rentsch <[email protected]> schrieb:

    My suggestion is not to implement a language extension, but to
    implement a compiler conforming to C as it is now,

    Sure, that was also what I was suggesting - define things that
    are currently undefined behavior.

    with
    additional guarantees for what happens in cases that are
    undefined behavior.

    Guarantees or specifications - no difference there.

    Moreover the additional guarantees are
    always in effect unless explicitly and specifically requested
    otherwise (most likely by means of a #pragma or _Pragma).
    Documentation needs to be written for the #pragmas, but no other documentation is required (it might be nice to describe the
    additional guarantees but that is not required by the C
    standard).

    It's the other way around - you need to describe first what the
    actual behavior in absence of any pragmas is, and this needs to be a
    firm specification, so the programmer doesn't need to read your mind
    (or the source code to the compiler) to find out what you meant.
    "But it is clear that..." would not be a specification; what is
    clear to you may absolutely not be clear to anybody else.

    This is also the only chance you'll have of getting this implemented
    in one of the current compilers (and let's face it, if you want
    high-quality code, you would need that; both LLVM and GCC
    have taken an enormous amount of effort up to now, and duplicating
    that is probably not going to happen).

    The point is to change the behavior of the compiler but
    still conform to the existing ISO C standard.

    I understood that - defining things that are currently undefined.
    But without a specification, that falls down.

    So, let's try something that causes some grief - what should
    be the default behavior (in the absence of pragmas) for integer
    overflow? More specifically, can the compiler set the condition
    to false in

    int a;

    ...

    if (a > a + 1) {
    }

    and how would you specify this in an unambiguous manner?
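Here is a minimal C sketch that makes the question concrete. The helper names are hypothetical and do not come from the thread. Under ISO C, a compiler may assume that `a + 1` does not overflow, so it may fold `a > a + 1` to false. If signed overflow were instead defined to wrap, as GCC does under `-fwrapv`, the condition would be true for exactly one value: `INT_MAX`. Neither helper overflows itself, so both are well defined today.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: the value "a > a + 1" would have if signed
   overflow wrapped (two's complement, as with gcc -fwrapv).
   a + 1 can only be smaller than a when it wraps, i.e. for INT_MAX. */
bool gt_successor_if_wrapping(int a)
{
    return a == INT_MAX;
}

/* Hypothetical helper: UB-free test of whether a + b would overflow int,
   usable under ISO C as it stands. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* INT_MAX - b cannot overflow for b > 0 */
    if (b < 0)
        return a < INT_MIN - b;   /* INT_MIN - b cannot overflow for b < 0 */
    return false;
}
```

The first function is one candidate answer that a specification would have to commit to. The second is the kind of explicit check C programmers write now, so that they rely on neither behaviour.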

  • From David Brown@21:1/5 to Stephen Fuld on Tue Sep 3 08:23:33 2024
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by keeping requirements as vessel border width, geometry
    etc.,
    while compiler writers are free in their search for optimization
    tricks
    that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code--
    If there were perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written
    in C versus code written in something like Ada or Rust.  Backing away
    now . . . :-)

    IMNSHO, code written in asm is generally more safe than code written
    in C, because the author knows exactly what each line of code is going
    to do.

    The problem is of course that it is harder to get 10x lines of correct
    asm than to get 1x lines of correct C.

    BTW, I am also solidly in the grey hair group here, writing C code
    that is very low-level, using explicit local variables for any loop
    invariant, copying other stuff into temp vars in order to make it
    really obvious that they cannot alias any globals or input/output
    parameters.

    Anyway, that is all mostly moot since I'm using Rust for this kind of
    programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?


    And also for Rust versus C++ ?

    My impression - based on hearsay for Rust as I have no experience - is
    that the key point of Rust is memory "safety". I use scare-quotes here,
    since it is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C, but it is
    also very easy to get it wrong - especially if the developer doesn't use available tools for static and run-time checks. Modern C++, on the
    other hand, makes it much easier to get right. You can cause yourself
    extra work and risk by using more old-fashioned C++, but follow
    modern design guides using smart pointers and containers, along with
    easily available tools, and you get a lot of the management of memory
    handled automatically for very little cost.

    C++ provides a huge amount more than Rust - when I have looked at Rust,
    it is (still) too limited for some of what I want to do. Of course,
    "with great power comes great responsibility" - C++ provides many
    exciting ways to write a complete mess :-)


    Most of the "Rust vs C++" comparisons I see are complete rubbish in
    regards to C++ - they tend to see it as "C with a couple of OOP bits
    added", and are usually strongly biased towards the Rust fad. For example:

    <https://www.geeksforgeeks.org/rust-vs-c/>

    This says Rust is "Multi-paradigm (functional, imperative)" while C++ is "Object-oriented". C++ is as "multi-paradigm" as you can get in a
    programming language - object-oriented /and/ functional /and/ imperative
    /and/ generic /and/ lots of other "paradigms". And it says C++ has
    "manual memory management", while omitting that it /also/ has extensive automatic memory management.


    To my mind, the important question is not "Should we move from C to
    Rust?", but "Should we move from bad C to C++, Rust, or simply to good C practices?".

  • From Michael S@21:1/5 to Thomas Koenig on Tue Sep 3 10:44:21 2024
    On Tue, 3 Sep 2024 05:55:14 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Tim Rentsch <[email protected]> schrieb:

    My suggestion is not to implement a language extension, but to
    implement a compiler conforming to C as it is now,

    Sure, that was also what I was suggesting - define things that
    are currently undefined behavior.

    with
    additional guarantees for what happens in cases that are
    undefined behavior.

    Guarantees or specifications - no difference there.

    Moreover the additional guarantees are
    always in effect unless explicitly and specifically requested
    otherwise (most likely by means of a #pragma or _Pragma).
    Documentation needs to be written for the #pragmas, but no other documentation is required (it might be nice to describe the
    additional guarantees but that is not required by the C
    standard).

    It's the other way around - you need to describe first what the
    actual behavior in absence of any pragmas is, and this needs to be a
    firm specification, so the programmer doesn't need to read your mind
    (or the source code to the compiler) to find out what you meant.
    "But it is clear that..." would not be a specification; what is
    clear to you may absolutely not be clear to anybody else.

    This is also the only chance you'll have of getting this implemented
    in one of the current compilers (and let's face it, if you want
    high-quality code, you would need that; both LLVM and GCC
    have taken an enormous amount of effort up to now, and duplicating
    that is probably not going to happen).

    The point is to change the behavior of the compiler but
    still conform to the existing ISO C standard.

    I understood that - defining things that are currently undefined.
    But without a specification, that falls down.

    So, let's try something that causes some grief - what should
    be the default behavior (in the absence of pragmas) for integer
    overflow? More specifically, can the compiler set the condition
    to false in

    int a;

    ...

    if (a > a + 1) {
    }

    and how would you specify this in an unambiguous manner?

  • From David Brown@21:1/5 to Thomas Koenig on Tue Sep 3 09:29:22 2024
    On 02/09/2024 22:52, Thomas Koenig wrote:
    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.


    That sounds a lot like adding a new type of run-time error handling to
    the language. That's not necessarily a bad idea, but it would likely be
    a very big change with significant ramifications for existing code.

    So, putting "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.


    The kind of null pointer checks that are thrown away by some compilers
    are those that come /after/ a dereference :

    #include <stdio.h>

    int foo(int * p) {
        int x = *p;
        if (!p) {
            printf("I shouldn't have done that...\n");
        }
        return x;
    }

    If dereferencing a null pointer is an "observable error", it needs to be observed at the "int x = *p;" line, and has no influence on the deletion
    of the later pointer check.
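Here is a minimal sketch of the ordering that keeps such a check meaningful: test the pointer before dereferencing it. The name `foo_checked` and its `fallback` parameter are hypothetical. Nothing before the test accesses memory through `p`, so the compiler has no earlier dereference from which to infer that `p` is non-null, and no grounds to delete the check.

```c
#include <stdio.h>

/* Check first, then dereference: the test comes before any access
   through p, so it cannot be optimised away on the basis of one. */
int foo_checked(const int *p, int fallback)
{
    if (!p) {
        fprintf(stderr, "foo_checked: null pointer\n");
        return fallback;
    }
    return *p;
}
```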

    Making dereferencing a null pointer an "observable error" would mean
    requiring compilers to insert an explicit check in a large number of
    cases, with a jump to some kind of run-time error-handling code when it
    is zero. That is a very significant cost, to be paid by all users of
    pointers in C - even those that are careful to ensure that their
    pointers are not null before calling "foo". (There's also the
    definition complications - a pointer that happens to contain the value
    0, or point to address 0, is not necessarily a NULL pointer, and on some targets there are lots of different values that are all null pointers.
    And there are endless possibilities for invalid pointers that are not null.)
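To illustrate the cost argument, here is a rough sketch of what an "observable error" rule might oblige a compiler to emit at every dereference it cannot prove safe. All names are hypothetical. A real implementation would generate the check inside the compiler rather than through a macro. As the post notes, null is a property of the pointer, not of address 0, so the test uses the pointer's truth value.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical run-time error handler the compiler would call. */
static _Noreturn void null_deref_error(const char *file, int line)
{
    fprintf(stderr, "%s:%d: null pointer dereference\n", file, line);
    abort();
}

/* Hypothetical: an int load through p with the mandated check.
   Every such load pays a compare and a (normally untaken) branch. */
static inline int checked_load_int(const int *p, const char *file, int line)
{
    if (!p)
        null_deref_error(file, line);
    return *p;
}

#define CHECKED_LOAD(p) checked_load_int((p), __FILE__, __LINE__)

/* What a plain "int x = *p;" would effectively turn into. */
int load_checked(const int *p)
{
    return CHECKED_LOAD(p);
}
```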

    C is a language where the programmer takes the responsibility to get the
    code right - not the language or run-time. It insists on manual and
    explicit control of this kind of thing, so that you don't have to pay
    for checks you don't want.

    Leaving the dereferencing of invalid pointers as undefined behaviour
    means that code that does not have invalid pointers does not have extra
    hidden checks and costs, along with hidden jumps to error handlers. It
    also means that development tools can run in modes that add whatever
    they like of extra checks and handling of invalid pointers -
    "sanitizers" and other run-time checkers. And static error checkers can
    warn if they see code paths with bad dereferences.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

  • From David Brown@21:1/5 to Thomas Koenig on Tue Sep 3 10:10:20 2024
    On 03/09/2024 07:55, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:

    My suggestion is not to implement a language extension, but to
    implement a compiler conforming to C as it is now,

    Sure, that was also what I was suggesting - define things that
    are currently undefined behavior.

    with
    additional guarantees for what happens in cases that are
    undefined behavior.

    Guarantees or specifications - no difference there.


    I personally think that - for the most part - that would be a really bad
    idea. I am not in favour of arbitrarily defining the behaviour of
    something that has no sensible correct behaviour. If the code flow
    reaches something that is run-time UB, the code is wrong or has been
    used incorrectly (i.e., the calling code, or user, or something else has
    made a mistake). No possible handling of the UB will result in correct results.

    It is sometimes possible to have damage limitation, such as exiting the
    program quickly with an error message rather than corrupting files,
    opening security breaches, etc. But that is always context specific -
    stopping the program with an error message is fine for many PC programs,
    but less ideal for a flight control system.


    There are some languages that have integrated error handling, and can
    sensibly have checks as a natural part of the language and the code. C
    is not such a language. Let C remain a language where the programmer
    has control, and where checks are done manually or they are not done at
    all. People who don't want that, should use other languages that give
    them what they want. UB in C is a /feature/, it is not a problem.
    Trying to remove UB (by specifying more behaviour) reduces the power of
    the language, and reduces the power of tools for the language, often for downright silly results (like wrapping integer overflow).

    But if people want a compiler that has extra guarantees and
    specifications for behaviour in cases of UB, then those already exist -
    "gcc -fsanitize=undefined" would be a good example. Of course such
    tools could be improved in a variety of ways.
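In C as it stands, a check of this kind is written by hand. Here is a minimal sketch for a common case, computing an allocation size. It uses the GCC/Clang builtin `__builtin_mul_overflow`, which is a compiler extension and not ISO C. The wrapper `array_bytes` is hypothetical. Unsigned wraparound is well defined in C, so this is not a UB question at all. The explicit check matters because a wrapped size would make a later allocation too small.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: n * elem_size with an explicit overflow check.
   __builtin_mul_overflow stores the (possibly wrapped) product in *out
   and returns true if the mathematical result did not fit.
   Returns true on success. */
bool array_bytes(size_t n, size_t elem_size, size_t *out)
{
    return !__builtin_mul_overflow(n, elem_size, out);
}
```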


    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid imposing
    too much work on early compilers. Where possible, UB that can be caught
    at compile time, could usefully be turned into constraint violations that
    must be diagnosed.)

  • From Michael S@21:1/5 to Thomas Koenig on Tue Sep 3 11:40:42 2024
    On Tue, 3 Sep 2024 05:55:14 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Tim Rentsch <[email protected]> schrieb:

    My suggestion is not to implement a language extension, but to
    implement a compiler conforming to C as it is now,

    Sure, that was also what I was suggesting - define things that
    are currently undefined behavior.

    with
    additional guarantees for what happens in cases that are
    undefined behavior.

    Guarantees or specifications - no difference there.

    Moreover the additional guarantees are
    always in effect unless explicitly and specifically requested
    otherwise (most likely by means of a #pragma or _Pragma).
    Documentation needs to be written for the #pragmas, but no other documentation is required (it might be nice to describe the
    additional guarantees but that is not required by the C
    standard).

    It's the other way around - you need to describe first what the
    actual behavior in absence of any pragmas is, and this needs to be a
    firm specification, so the programmer doesn't need to read your mind
    (or the source code to the compiler) to find out what you meant.
    "But it is clear that..." would not be a specification; what is
    clear to you may absolutely not be clear to anybody else.

    This is also the only chance you'll have of getting this implemented
    in one of the current compilers (and let's face it, if you want
    high-quality code, you would need that; both LLVM and GCC
    have taken an enormous amount of effort up to now, and duplicating
    that is probably not going to happen).

    The point is to change the behavior of the compiler but
    still conform to the existing ISO C standard.

    I understood that - defining things that are currently undefined.
    But without a specification, that falls down.

    So, let's try something that causes some grief - what should
    be the default behavior (in the absence of pragmas) for integer
    overflow? More specifically, can the compiler set the condition
    to false in

    int a;

    ...

    if (a > a + 1) {
    }

    and how would you specify this in an unambiguous manner?

    I'd start much earlier, by declaration of "Homogeneity and Exclusion".
    It would state that "more defined C" does not pretend to cover all
    targets covered by existing C language.
    Specifically, following target characteristics are required:
    - byte-addressable machine with 8-bit bytes
    - two's-complement integer types
    - if float type is supported it has to be IEEE-754 binary32
    - if double type is supported it has to be IEEE-754 binary64
    - if long double type is supported it has to be IEEE-754 binary128
    - storage order for multibyte types should be either LE or BE,
    consistently for all built-in types
    - flat address space

    That part should be specified in a more formal manner.
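A C11 compiler can already check several of these requirements. Here is a minimal sketch, assuming C11. The function names are hypothetical. `FLT_MANT_DIG`/`DBL_MANT_DIG` only confirm binary32/binary64-sized significands; the optional `__STDC_IEC_559__` macro is the stronger signal. Likewise, `INT_MIN == -INT_MAX - 1` is a common proxy for two's complement rather than a proof. The long double and byte-order checks run at run time instead. For example, x86-64 GCC uses the 80-bit x87 format for long double (`LDBL_MANT_DIG == 64`), so that target would fail the binary128 item as written.

```c
#include <float.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Compile-time checks: compilation fails on a target that does not qualify. */
_Static_assert(CHAR_BIT == 8, "8-bit bytes required");
_Static_assert(INT_MIN == -INT_MAX - 1, "two's-complement int expected");
_Static_assert(FLT_RADIX == 2 && FLT_MANT_DIG == 24, "float must look like binary32");
_Static_assert(DBL_MANT_DIG == 53, "double must look like binary64");

/* Run-time query: IEEE-754 binary128 has a 113-bit significand. */
bool long_double_is_binary128(void)
{
    return LDBL_MANT_DIG == 113;
}

/* Returns 1 for little-endian, 2 for big-endian, 0 for any other order.
   Only probes a 32-bit type; "consistent for all built-in types" would
   need more probes. */
int byte_order(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char b[sizeof probe];
    memcpy(b, &probe, sizeof probe);
    if (b[0] == 4 && b[1] == 3 && b[2] == 2 && b[3] == 1)
        return 1;
    if (b[0] == 1 && b[1] == 2 && b[2] == 3 && b[3] == 4)
        return 2;
    return 0;
}
```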

  • From Terje Mathisen@21:1/5 to Michael S on Tue Sep 3 17:41:40 2024
    Michael S wrote:
    On Mon, 02 Sep 2024 06:59:32 -0700
    Tim Rentsch <[email protected]> wrote:

    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that
    explicitly is mentioned as UB in some standard N, but was not
    addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became
    UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Calling memcpy() between objects that overlap has always been
    explicitly and specifically undefined behavior, going back to
    the original ANSI C standard.

    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different age living in different parts of the world
    are telling the same story. Maybe there exists an old popular book that
    said that it was defined?


    It probably wasn't written in the official C standard, which I couldn't
    have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the source buffer, in increasing address order meant that it was guaranteed safe
    when used to compact buffers.

    Code that depended on this was fine for decades, until the first library/compiler implementation discovered that in some circumstances it
    could be faster to go in reverse order.
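Here is a small sketch of why that change broke code. `copy_ascending` and `copy_descending` are hypothetical stand-ins for two library strategies. Both are fine for non-overlapping buffers. When the destination lies below the source, however, only the ascending order gives the intended result. ISO C's answer for overlapping regions is `memmove`, which must behave as if the bytes were first copied to a separate temporary buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: copies bytes strictly from lowest to highest address. */
void copy_ascending(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical: same copy, highest address first (a legal memcpy strategy). */
void copy_descending(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n > 0) {
        n--;
        dst[n] = src[n];
    }
}

/* Portable choice for overlapping regions: memmove handles either direction. */
void copy_overlapping(unsigned char *dst, const unsigned char *src, size_t n)
{
    memmove(dst, src, n);
}
```

Take a buffer holding `abcdef` and copy four bytes from offset 2 down to offset 0. With `copy_ascending` or `copy_overlapping` the buffer becomes `cdefef`; with `copy_descending` it becomes `efefef`.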

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Tue Sep 3 19:09:28 2024
    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Mon, 02 Sep 2024 06:59:32 -0700
    Tim Rentsch <[email protected]> wrote:

    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that
    explicitly is mentioned as UB in some standard N, but was not
    addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became
    UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Calling memcpy() between objects that overlap has always been
    explicitly and specifically undefined behavior, going back to
    the original ANSI C standard.

    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old popular
    book that said that it was defined?


    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.


    What are "compact buffers"?

    Code that depended on this was fine for decades, until the first library/compiler implementation discovered that in some circumstances
    it could be faster to go in reverse order.

    Terje



  • From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Sep 3 17:46:38 2024
    Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by keeping requirements as vessel border width, geometry
    etc.,
    while compiler writers are free in their search for optimization
    tricks
    that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code--
    If there were perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written
    in C versus code written in something like Ada or Rust.  Backing
    away now . . . :-)

    IMNSHO, code written in asm is generally more safe than code written
    in C, because the author knows exactly what each line of code is going
    to do.

    The problem is of course that it is harder to get 10x lines of correct
    asm than to get 1x lines of correct C.

    BTW, I am also solidly in the grey hair group here, writing C code
    that is very low-level, using explicit local variables for any loop
    invariant, copying other stuff into temp vars in order to make it
    really obvious that they cannot alias any globals or input/output
    parameters.

    Anyway, that is all mostly moot since I'm using Rust for this kind of
    programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?

    Q&D programming is still far faster for me in C, but using Rust I don't
    have to worry about how well the compiler will be able to optimize my
    code; it is pretty much always close to the speed of light, since the entire aliasing issue goes away.
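Here is a minimal C sketch of the aliasing issue, and of the temp-variable habit described in the quoted paragraph above. The function names are hypothetical. In `scale_naive`, the compiler must reload `*factor` on every iteration, because a store to `out[i]` might modify it. Copying the value into a local, or promising non-overlap with C99 `restrict`, removes that constraint. The versions are not interchangeable: if the objects really do overlap, the results differ.

```c
#include <stddef.h>

/* *factor may alias out[i], so it must be re-read after every store. */
void scale_naive(float *out, const float *in, const float *factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *factor;
}

/* Loop invariant copied to a local: it visibly cannot alias anything. */
void scale_local(float *out, const float *in, const float *factor, size_t n)
{
    const float f = *factor;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * f;
}

/* C99 restrict: the caller promises the three objects do not overlap. */
void scale_restrict(float *restrict out, const float *restrict in,
                    const float *restrict factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *factor;
}
```

Suppose `factor` points at `out[0]` (initially 2) and `in` is {3, 1, 1}. Then `scale_naive` produces {6, 6, 6}, while `scale_local` produces {6, 2, 2}. Calling `scale_restrict` with those arguments would be undefined behaviour.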

    Rust also gets rid of the horrible external library/configure/cmake mess
    that kept me from successfully compiling the reference LAStools lidar
    code for nearly 10 years.

    Using the Rust port I just tell cargo to add it to my project and that's it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to Michael S on Tue Sep 3 16:17:49 2024
    Michael S <[email protected]> writes:
    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Mon, 02 Sep 2024 06:59:32 -0700
    Tim Rentsch <[email protected]> wrote:

    [email protected] (MitchAlsup1) writes:

    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong. The
    question I have concerns what to do with something that
    explicitly is mentioned as UB in some standard N, but was not
    addressed in previous standards.

    Was it always UB? Or should it be considered ID until it became
    UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    Calling memcpy() between objects that overlap has always been
    explicitly and specifically undefined behavior, going back to
    the original ANSI C standard.

    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old popular
    book that said that it was defined?


    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.


    What are "compact buffers"?

    In this case, 'compact' was used as a verb. Perhaps by removing
    extraneous whitespace.

  • From Stephen Fuld@21:1/5 to David Brown on Tue Sep 3 09:54:11 2024
    On 9/2/2024 11:23 PM, David Brown wrote:
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by keeping requirements as vessel border width, geometry
    etc.,
    while compiler writers are free in their search for optimization
    tricks
    that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code--
    If there were perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written
    in C versus code written in something like Ada or Rust.  Backing
    away now . . . :-)

    IMNSHO, code written in asm is generally more safe than code written
    in C, because the author knows exactly what each line of code is
    going to do.

    The problem is of course that it is harder to get 10x lines of
    correct asm than to get 1x lines of correct C.

    BTW, I am also solidly in the grey hair group here, writing C code
    that is very low-level, using explicit local variables for any loop
    invariant, copying other stuff into temp vars in order to make it
    really obvious that they cannot alias any globals or input/output
    parameters.

    Anyway, that is all mostly moot since I'm using Rust for this kind of
    programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?


    And also for Rust versus C++ ?

    I asked about C versus Rust as Terje explicitly mentioned those two
    languages, but you make a good point in general.



    My impression - based on hearsay for Rust as I have no experience - is
    that the key point of Rust is memory "safety".  I use scare-quotes here, since it is simply about correct use of dynamic memory and buffers.

    I agree that memory safety is the key point, although I gather that it
    has other features that many programmers like.


    It is entirely possible to have correct use of memory in C, but it is
    also very easy to get it wrong - especially if the developer doesn't use available tools for static and run-time checks.  Modern C++, on the
    other hand, makes it much easier to get right.  You can cause yourself
    extra work and risk by using more old-fashioned C++, but follow
    modern design guides using smart pointers and containers, along with
    easily available tools, and you get a lot of the management of memory
    handled automatically for very little cost.

    Is it fair to say then that Rust makes it harder to get memory
    management "wrong"?




    C++ provides a huge amount more than Rust - when I have looked at Rust,
    it is (still) too limited for some of what I want to do.


    Can you give a few examples?


    Of course,
    "with great power comes great responsibility" - C++ provides many
    exciting ways to write a complete mess :-)

    Sure. I gather that templates are very powerful and potentially very
    useful. On the other hand, I gather that multiple inheritance is very powerful, but difficult to use and potentially very ugly, and has not
    been carried forward in the same way into newer languages.




    snip stuff about the inadequacy of existing Rust versus C++ comparisons.


    To my mind, the important question is not "Should we move from C to
    Rust?", but "Should we move from bad C to C++, Rust, or simply to good C practices?".

    I understand. This brings up an important issue, that of older versus
    newer languages.

    A newer language has several advantages. One is it can take advantage
    of what we have learned about language design and usage since the older language was designed. I can't overstate this. While many
    new language features turn out to be not useful, many are.

    Another is that it doesn't have to worry about support for "dusty
    decks", i.e. the existing base which may conform to an older version of
    the language, nor for "dusty brains", that is programmers who learned
    the older (i.e. worse) ways and keep generating new code using those
    ways. You mention this issue in your comments.

    Of course, the counter to that is that new languages have to overcome
    the huge "installed base" advantage of existing languages.

    Let me be clear. I am not a Rust evangelist. I am just looking for a
    way forward that will help us make programming easier and avoid making
    some of the same mistakes we have made in the past. Is Rust that? Some
    people think so. I just want to understand more.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Bernd Linsel@21:1/5 to All on Tue Sep 3 19:19:31 2024
    On 03.09.24 10:10, David Brown wrote:

    snip 8< - - - - - - - -

    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid imposing
    too much work on early compilers. Where possible, UB that can be caught
    at compile time, could usefully be turned into constraint violations that
    must be diagnosed.)

    And exactly these are the situations that I'd like to be warned about,
    rather than the compiler making up something without telling.

    --
    Bernd Linsel

  • From Terje Mathisen@21:1/5 to Michael S on Tue Sep 3 19:52:49 2024
    Michael S wrote:
    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old popular
    book that said that it was defined?


    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.


    What are "compact buffers"?

    Assume a buffer consisting of records of some type, some of them marked
    as deleted. Iterating over them while removing the gaps means that you
    are always copying to a destination lower in memory, right?
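Here is a minimal sketch of that compaction, assuming a hypothetical `struct record` with a `deleted` flag. Each run of live records moves to an address at or below where it started, so source and destination can overlap. Under the old ascending-order guarantee, a plain `memcpy` would have worked here; under ISO C, these moves need `memmove`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout, for illustration only. */
struct record {
    int  id;
    bool deleted;
};

/* Removes deleted records in place, keeping survivors in order, and
   returns the new count. Runs of live records are moved down with
   memmove, because a run may overlap the slot it is moved into. */
size_t compact_records(struct record *recs, size_t n)
{
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        if (recs[i].deleted) {
            i++;
            continue;
        }
        size_t j = i;                      /* find the live run [i, j) */
        while (j < n && !recs[j].deleted)
            j++;
        if (out != i)
            memmove(&recs[out], &recs[i], (j - i) * sizeof recs[0]);
        out += j - i;
        i = j;
    }
    return out;
}
```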

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Niklas Holsti@21:1/5 to David Brown on Tue Sep 3 21:08:31 2024
    On 2024-09-03 11:10, David Brown wrote:

    [snip]

    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid imposing
    too much work on early compilers.  Where possible, UB that can be caught
    at compile time, could usefully be turned into constraint violations that
    must be diagnosed.)


    The problem, as you of course know, is that the "can" in "can be caught
    at compile time" depends on the amount and kind of analysis that is done
    at compile time -- some cases of UB "can" be caught at compile time but
    only by advanced and costly analysis. If the language standard requires
    that such things /must/ be detected by the compiler, it can place quite
    a burden on the developers of conforming compilers.

    As I understand it, current C compilers detect UB mostly as a side
    effect of the analyses they do for code optimization purposes, which
    vary widely between compilers, and so the UB-detections also vary.

    This issue (compile-time detection) has now and then been discussed in
    the Ada standards group. Given the currently low market penetration of
    Ada, the group has been reluctant to require too much of the compilers,
    and so the more advanced UB-detecting tools are stand-alone, such as the
    SPARK tools.

  • From Niklas Holsti@21:1/5 to Terje Mathisen on Tue Sep 3 21:10:00 2024
    On 2024-09-03 20:52, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different ages living in different parts of the
    world are telling the same story. Maybe there exists an old, popular
    book that said it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.


    What does "compact buffers" mean?

    Assume a buffer consisting of records of some type, some of them marked
    as deleted. Iterating over them while removing the gaps means that you
    are always copying to a destination lower in memory, right?


    Only if you iterate in order of increasing memory address, which is not
    the only possibility.

  • From Stefan Monnier@21:1/5 to All on Tue Sep 3 15:06:38 2024
    Undefined behaviour is something that is exercised at run-time.
    That's why the "undefined behaviour sanitizers" insert run-time
    checks. And of course they only detect the behaviour when it is
    actually exercised.

    IIUC, the run-time checks need to *prevent* undefined behavior
    rather than merely detect it, because if you do

    if (would_UB_here())
        fprintf (stderr, ...);
    maybe_do_UB_here();

    the compiler is allowed to skip the `fprintf` if `maybe_do_UB_here`
    does UB. IOW the UB effect can be "retroactive".
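A minimal sketch of what *preventing* rather than detecting looks like for signed integer overflow: the check looks only at the operands, and the overflowing addition is never executed, so there is no UB for the compiler to reason backwards from. `checked_add` is a hypothetical name.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Store a + b in *out unless the sum would overflow int. The test looks
   only at the operands, so the overflowing addition is never evaluated
   and there is no undefined behaviour for the optimizer to exploit. */
static bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        fprintf(stderr, "checked_add: %d + %d would overflow\n", a, b);
        return false;               /* report and skip the addition */
    }
    *out = a + b;                   /* known to be in range here */
    return true;
}
```

GCC and Clang also provide `__builtin_add_overflow(a, b, &r)`, which performs the same check and returns true when the mathematical result does not fit.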


    Stefan

  • From Terje Mathisen@21:1/5 to Niklas Holsti on Tue Sep 3 21:08:50 2024
    Niklas Holsti wrote:
    On 2024-09-03 20:52, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different ages living in different parts of the
    world are telling the same story. Maybe there exists an old, popular
    book that said it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.


    What does "compact buffers" mean?

    Assume a buffer consisting of records of some type, some of them
    marked as deleted. Iterating over them while removing the gaps means
    that you are always copying to a destination lower in memory, right?


    Only if you iterate in order of increasing memory address, which is not
    the only possibility.

    Obviously so, I really didn't think that needed to be stated. :-(

    uint8_t buffer[1000];

    memcpy(buffer + 0, buffer + 10, 100);  /* overlapping: dst below src */

    OK?

    This is the memcpy() version which the original 8086 REP MOVSB was
    designed for, long before alternative code turned out to be faster in
    some circumstances.
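To make the direction argument concrete, here is a sketch of a strictly ascending byte copy, the order in which 8086 `REP MOVSB` runs with the direction flag clear. For overlapping buffers it matches the `memmove` result only when the destination is at or below the source. `copy_up` is a hypothetical name, and `memmove` is what ISO C actually guarantees for overlap in either direction.

```c
#include <stddef.h>
#include <string.h>

/* Byte copy in strictly increasing address order, the order of
   REP MOVSB with the direction flag clear. For overlapping buffers it
   gives the memmove result only when dst <= src, because each source
   byte is then read before any store can overwrite it. */
static void copy_up(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Starting from `{0,1,2,3,4,5,6,7}`, `copy_up(a, a + 2, 6)` yields `{2,3,4,5,6,7,6,7}` as intended, while `copy_up(a + 2, a, 6)` smears the first two bytes into `{0,1,0,1,0,1,0,1}`; `memmove(a + 2, a, 6)` gives the intended `{0,1,0,1,2,3,4,5}`. This is why the iteration order matters.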

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stefan Monnier@21:1/5 to All on Tue Sep 3 15:28:03 2024
    My impression - based on hearsay for Rust as I have no experience - is that the key point of Rust is memory "safety". I use scare-quotes here, since it is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Tue Sep 3 15:30:21 2024
    Specifications are an agreement between the supplier and the client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.


    Stefan

  • From Niklas Holsti@21:1/5 to Terje Mathisen on Tue Sep 3 22:34:42 2024
    On 2024-09-03 22:08, Terje Mathisen wrote:
    Niklas Holsti wrote:
    On 2024-09-03 20:52, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different ages living in different parts of the
    world are telling the same story. Maybe there exists an old, popular
    book that said it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.


    What does "compact buffers" mean?

    Assume a buffer consisting of records of some type, some of them
    marked as deleted. Iterating over them while removing the gaps means
    that you are always copying to a destination lower in memory, right?


    Only if you iterate in order of increasing memory address, which is
    not the only possibility.

    Obviously so, I really didn't think that needed to be stated. :-(

    I admit my comment was partly tongue-in-cheek, but if the issue is when
    and whether a memcpy() that always copies in increasing address order is useful, it seems that a statement about "iterating over" an array should
    also specify the iteration order. Ok, ;-)

  • From Thomas Koenig@21:1/5 to Stefan Monnier on Tue Sep 3 20:05:14 2024
    Stefan Monnier <[email protected]> schrieb:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety". I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.

    Really?

    I thought Fortran was higher level than C, and you can do a lot
    more things in Fortran than in C.

    Or rather, Fortran allows you to do things which are possible,
    but very cumbersome, in C. Both are Turing complete, after all.

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Sep 3 20:20:41 2024
    On Tue, 3 Sep 2024 19:28:03 +0000, Stefan Monnier wrote:

    My impression - based on hearsay for Rust as I have no experience - is
    that the key point of Rust is memory "safety". I use scare-quotes here,
    since it is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.

    A higher level language simply makes it HARDER to shoot yourself in the
    foot, not easier to express this-crap or that-crap.



  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Sep 3 20:25:22 2024
    On Tue, 3 Sep 2024 20:05:14 +0000, Thomas Koenig wrote:

    Stefan Monnier <[email protected]> schrieb:
    My impression - based on hearsay for Rust as I have no experience - is
    that the key point of Rust is memory "safety". I use scare-quotes here,
    since it is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.

    Really?

    I thought Fortran was higher level than C, and you can do a lot
    more things in Fortran than in C.

    Fortran has a memory model where if address aliasing occurs it is
    the programmer's fault, C has the contrapositive.
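The contrast can be made concrete: since C99 a programmer can opt in, per function, to the Fortran-style rule with the `restrict` qualifier. The compiler may then assume the arrays do not overlap, and passing overlapping arrays becomes the caller's bug (undefined behaviour). A minimal sketch; `saxpy` here is just an illustrative function, not a library call.

```c
#include <stddef.h>

/* y[i] += a * x[i]. With restrict, the compiler may assume x and y do
   not overlap (much as a Fortran compiler may for dummy arguments), so
   it can keep values in registers and vectorise without having to
   allow for a store to y changing a later x[i].
   Passing overlapping x and y is undefined behaviour. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```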

    Given the Fortran library, it is easy to write in C what could be
    written in Fortran--mostly because Fortran programmers use their
    library instead of trying to circumvent it at every step.

    Or rather, Fortran allows you to do things which are possible,
    but very cumbersome, in C. Both are Turing complete, after all.

    Turing complete does not take memory order into account.

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Sep 3 20:22:03 2024
    On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

    Specifications are an agreement between the supplier and the client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    # define int int64_t
    ..

    makes it easier.



  • From David Brown@21:1/5 to Stefan Monnier on Wed Sep 4 01:15:14 2024
    On 03/09/2024 21:28, Stefan Monnier wrote:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety". I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Agreed.

    I've heard it said that the power of a programming language comes not
    from what you can do with the language, but from what you cannot do.

  • From David Brown@21:1/5 to Stephen Fuld on Wed Sep 4 01:14:14 2024
    On 03/09/2024 18:54, Stephen Fuld wrote:
    On 9/2/2024 11:23 PM, David Brown wrote:
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:

    Anyway, that is all mostly moot since I'm using Rust for this kind
    of programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?


    And also for Rust versus C++ ?

    I asked about C versus Rust as Terje explicitly mentioned those two languages, but you make a good point in general.


    I want to know about both :-)

    In my field, small-systems embedded development, C has been dominant for
    a long time, but C++ use is increasing. Most of my new stuff in recent
    times has been C++. There are some in the field who are trying out
    Rust, so I need to look into it myself - either because it is a better
    choice than C++, or because customers might want it.



    My impression - based on hearsay for Rust as I have no experience - is
    that the key point of Rust is memory "safety".  I use scare-quotes
    here, since it is simply about correct use of dynamic memory and buffers.

    I agree that memory safety is the key point, although I gather that it
    has other features that many programmers like.


    Sure. There are certainly plenty of things that I think are a better
    idea in a modern programming language and that make it a good step up
    compared to C. My key interest is in comparison to C++ - it is a step
    up in some ways, a step down in others, and a step sideways in many
    features. But is it overall up or down, for /my/ uses?

    Examples of things that I think are good in Rust are making variables
    immutable by default and pattern matching. Steps down include lack of
    function overloading and limited object oriented support.

    There are some things that some people really like about Rust, that I am
    far from convinced about - such as package management. I could be misunderstanding (since I don't have the experience), but for /my/ work,
    I am very much against anything that encourages an "always get the
    latest version" attitude. Stability is much more important to me. (I
    dislike the rate at which Rust changes - every two weeks or so for small things, and every couple of years for breaking changes.)

    And there are some things that Rust simply gets wrong - such as the
    handling of signed integer overflows.


    It is entirely possible to have correct use of memory in C, but it is
    also very easy to get it wrong - especially if the developer doesn't
    use available tools for static and run-time checks.  Modern C++, on
    the other hand, makes it much easier to get right.  You can cause
    yourself extra work and risk by using more old-fashioned C++, but
    follow modern design guides using smart pointers and containers,
    along with easily available tools, and you get a lot of the management
    of memory handled automatically for very little cost.

    Is it fair to say then that Rust makes it harder to get memory
    management "wrong"?


    I don't know about reality, but that's what the salesmen say.

    In modern C++ it's not hard to write code that doesn't leak and doesn't
    have out of bounds accesses, but you need to put a bit more effort into
    coding to track ownership properly, and not everything is as well
    diagnosed at compile time as it could be. There has been progress
    towards the equivalent to the Rust borrow checker for C++, I hear.




    C++ provides a huge amount more than Rust - when I have looked at
    Rust, it is (still) too limited for some of what I want to do.


    Can you give a few examples?

    As an example, in C++, you can make your own types that are, as far as
    I can see, much more expressive and flexible than in Rust, while also
    being safe to use. This requires object syntax with support for
    multiple constructors, operator overloading, function overloading,
    public/private separation, and multiple inheritance (at least of methods).




    Of course, "with great power comes great responsibility" - C++
    provides many exciting ways to write a complete mess :-)

    Sure.  I gather that templates are very powerful and potentially very useful.  On the other hand, I gather that multiple inheritance is very powerful, but difficult to use and potentially very ugly, and has not
    been carried forward in the same way into newer languages.


    Multiple inheritance can easily get really messy, especially with
    polymorphic types where data fields come from different ancestors, and
    it gets even messier with virtual inheritance. I don't think that is
    a good solution in more than a very few niche situations.

    But multiple inheritance from bases with no data (just methods, types,
    static data, constexpr data, etc.) is fine and can be very handy. Non-polymorphic inheritance with data fields is also fine.




    snip stuff about the inadequacy of existing Rust versus C++ comparisons.


    To my mind, the important question is not "Should we move from C to
    Rust?", but "Should we move from bad C to C++, Rust, or simply to good
    C practices?".

    I understand.  This brings up an important issue, that of older versus
    newer languages.

    A newer language has several advantages.  One is it can take advantage
    of what we have learned about language design and usage since the older language was designed. I can't overstate this.  While many
    new language features turn out to be not useful, many are.

    Absolutely. There's things about newer languages, like Rust, Go, and
    Swift that I like. For example, they are designed with concurrency and multi-threading from the start, rather than an add-on. C++, as we know
    it today, has grown gradually, and a lot of its complexity is because of features added on rather than having been part of the original design.

    But it seems to me that Rust could have taken more from C++ and been a
    more complete rival. That is, it could have taken more of what can be
    done in C++, and found more elegant ways to achieve the same effects from
    the start.


    Another is that it doesn't have to worry about support for "dusty
    decks", i.e. the existing base which may conform to an older version of
    the language, nor for "dusty brains", that is programmers who learned
    the older (i.e. worse) ways and keep generating new code using those
    ways.  You mention this issue in your comments.

    Of course, the counter to that is that new languages have to overcome
    the huge "installed base" advantage of existing languages.

    Let me be clear.  I am not a Rust evangelist.  I am just looking for a
    way forward that will help us make programming easier and not to make
    some of the same mistakes we have made in the past.  Is Rust that?  Some people think so.  I just want to understand more.


    I am in the same boat. While I like C++ and find it a lot better than
    C, I'd be quite happy to drop it for Rust or anything else if I found
    they were better.

  • From David Brown@21:1/5 to Bernd Linsel on Wed Sep 4 01:19:36 2024
    On 03/09/2024 19:19, Bernd Linsel wrote:
    On 03.09.24 10:10, David Brown wrote:

    snip 8< - - - - - - - -

    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid
    imposing too much work on early compilers.  Where possible, UB that
    can be caught at compile time, could usefully be turned into constraint
    violations that must be diagnosed.)

    And exactly these are the situations that I'd like to be warned about,
    rather than the compiler making up something without telling.


    Some of those /are/ warned about by compilers (but I'd rather the
    standards said that they were errors). But in general, many can be
    handled by good development practice and compiler warnings. Still,
    compilers could always get better!

    One thing that could make a big difference, I think, is to drop the
    compilation model of each translation unit being compiled to a binary
    object independently, with only a minimal amount of information for
    linking. Link-time optimisation allows for many extra checks, not all
    of which are currently implemented AFAIK. For example, it should be
    possible to check that external declarations and definitions match up
    correctly across modules - that's currently UB and rarely checked.

  • From David Brown@21:1/5 to Niklas Holsti on Wed Sep 4 01:22:27 2024
    On 03/09/2024 20:08, Niklas Holsti wrote:
    On 2024-09-03 11:10, David Brown wrote:

       [snip]

    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid
    imposing too much work on early compilers.  Where possible, UB that
    can be caught at compile time, could usefully be turned into constraint
    violations that must be diagnosed.)


    The problem, as you of course know, is that the "can" in "can be caught
    at compile time" depends on the amount and kind of analysis that is done
    at compile time -- some cases of UB "can" be caught at compile time but
    only by advanced and costly analysis. If the language standard requires
    that such things /must/ be detected by the compiler, it can place quite
    a burden on the developers of conforming compilers.


    Yes. But I am happy to place a bigger burden on compilers if it reduces
    the risk of errors for developers.

    Of course there must be some balance. But many of the rules are based
    on the kind of compiler that could run on a PDP-11 - it's reasonable to
    expect more these days.

    As I understand it, current C compilers detect UB mostly as a side
    effect of the analyses they do for code optimization purposes, which
    vary widely between compilers, and so the UB-detections also vary.

    This issue (compile-time detection) has now and then been discussed in
    the Ada standards group. Given the currently low market penetration of
    Ada, the group has been reluctant to require too much of the compilers,
    and so the more advanced UB-detecting tools are stand-alone, such as the SPARK tools.


    That makes sense for Ada. Given the high market penetration of C and
    C++, the balance is different.

    And of course if a future C26 (or whatever) standard required more UB
    detection for conformity, that would not affect existing C23 or earlier compilers.

  • From Stephen Fuld@21:1/5 to Terje Mathisen on Tue Sep 3 16:27:32 2024
    On 9/3/2024 8:46 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, by keeping requirements as vessel border width, geometry
    etc., while compiler writers are free in their search for optimization
    tricks that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service due to stress
    and strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code--
    If there were perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written
    in C versus code written in something like Ada or Rust. Backing
    away now . . . :-)

    IMNSHO, code written in asm is generally more safe than code written
    in C, because the author knows exactly what each line of code is
    going to do.

    The problem is of course that it is harder to get 10x lines of
    correct asm than to get 1x lines of correct C.

    BTW, I am also solidly in the grey hair group here, writing C code
    that is very low-level, using explicit local variables for any loop
    invariant, copying other stuff into temp vars in order to make it
    really obvious that they cannot alias any globals or input/output
    parameters.
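A small C sketch of the style described here, with hypothetical names: the loop bound and scale factor are read from a struct through a pointer, and since a store through `out` could legally modify those `int` fields, a compiler may have to reload them after every store. Copying them into locals first makes the invariants explicit and impossible to alias.

```c
struct params {          /* hypothetical parameter block */
    int count;
    int scale;
};

void scale_all(int *out, const int *in, const struct params *p)
{
    /* out[i], p->count and p->scale are all int, so a store through out
       could legally modify the struct fields; reading them inside the
       loop might force a reload after every store. Locals cannot be
       aliased, so these copies are loop-invariant by construction. */
    const int count = p->count;
    const int scale = p->scale;

    for (int i = 0; i < count; i++)
        out[i] = in[i] * scale;
}
```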

    Anyway, that is all mostly moot since I'm using Rust for this kind of
    programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?

    Q&D programming is still far faster for me in C, but using Rust I don't
    have to worry about how well the compiler will be able to optimize my
    code, it is pretty much always close to speed of light since the entire aliasing issue goes away.

    Rust also gets rid of the horrible external library/configure/cmake mess that kept me from successfully compiling the reference LAStools lidar
    code for nearly 10 years.

    Using the Rust port I just tell cargo to add it to my project and that's
    it.

    Thank you. I find it interesting that the main advantage of Rust as
    touted by its evangelists, memory safety, didn't make your list.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to David Brown on Wed Sep 4 01:54:52 2024
    On Tue, 3 Sep 2024 23:19:36 +0000, David Brown wrote:

    On 03/09/2024 19:19, Bernd Linsel wrote:
    On 03.09.24 10:10, David Brown wrote:

    snip 8< - - - - - - - -

    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid
    imposing too much work on early compilers.  Where possible, UB that
    can be caught at compile time, could usefully be turned into constraint
    violations that must be diagnosed.)

    And exactly these are the situations that I'd like to be warned about,
    rather than the compiler making up something without telling.


    Some of those /are/ warned about by compilers (but I'd rather the
    standards said that they were errors). But in general, many can be
    handled by good development practice and compiler warnings. Still,
    compilers could always get better!

    Something that might be an error in a 32-bit machine may not be
    an error in a 36-bit {48, 64, 72} machine.
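A small C illustration of this point: 100000 * 100000 = 10^10 exceeds INT_MAX = 2^31 - 1 (about 2.1 × 10^9) where int is 32 bits, so evaluating it in int is undefined behaviour there, but it fits below 2^35 - 1 (about 3.4 × 10^10) if int is 36 bits. A portable check compares against `INT_MAX` rather than assuming a width; `fits_int_product` is a hypothetical helper.

```c
#include <limits.h>
#include <stdbool.h>

/* True if a * b is representable in this implementation's int.
   The product is formed in long long, assumed here to be at least twice
   as wide as int (true where int is 32 bits and long long is 64). */
static bool fits_int_product(int a, int b)
{
    long long p = (long long)a * b;
    return p >= INT_MIN && p <= INT_MAX;
}
```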

    One thing that could make a big difference, I think, is to drop the compilation model of each translation unit being compiled to a binary
    object independently, with only a minimal amount of information for
    linking. Link-time optimisation allows for many extra checks, not all
    of which are currently implemented AFAIK. For example, it should be
    possible to check that external declarations and definitions match up correctly across modules - that's currently UB and rarely checked.

    How does one call fprintf() under those rules ??

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Sep 4 01:49:12 2024
    On Sun, 1 Sep 2024 21:02:16 +0000, Paul A. Clayton wrote:

    On 8/31/24 4:56 PM, BGB wrote:
    [snip]
    I was mostly doing dual-issue with a 4R2W design.

    Initially, 6R3W won out mostly because 4R2W disallows an indexed
    store to be run in parallel with another op; but 6R3W did allow
    this.

    Stores and MADD allow one register read to be delayed by at least
    one cycle. If the following cycle had a free read port, that could
    be stolen to complete the store/MADD. This could be viewed as
    cracking a three-source operation into a two-source operation and
    a one-source operation that reads source operands in a following
    cycle except that this operation never uses a result from the
    previous cycle.

    Stores are allowed to delay the St.Data read until after retirement.
    Thus, you are guaranteed that the cache line is present, that the
    cache is in a hit state, and that the TLB has translated the address.
    And finally, you need no forwarding on that read.
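The read-port argument can be sketched with a toy cycle count. This is a hypothetical model, not any real pipeline from the thread: in-order, up to two ops issue per cycle, four register read ports, ALU ops read two registers and indexed stores read three (base, index, data). With deferral enabled, a store reads only base and index at issue and takes its data read from the next cycle's ports, which is the cracking of a three-source op described above.

```c
#include <stdbool.h>

/* Hypothetical toy model: in-order, two-wide issue, `ports` register
   read ports per cycle. Every op is assumed to fit in an empty cycle. */
typedef struct {
    int reads;      /* register reads: 2 for an ALU op, 3 for an indexed store */
    bool is_store;  /* a store may defer its data read by one cycle */
} Op;

/* Return the number of cycles needed to issue ops[0..n). With
   defer_store_data set, a store reads base and index at issue and takes
   its data read from the next cycle's ports, before any new op. A read
   deferred by the last op is not counted; it overlaps later work. */
static int issue_cycles(const Op *ops, int n, int ports, bool defer_store_data)
{
    int cycles = 0, i = 0;
    int carried = 0;                 /* data reads deferred from last cycle */

    while (i < n) {
        int free_ports = ports - carried;
        carried = 0;
        for (int slot = 0; slot < 2 && i < n; slot++) {
            bool defer = defer_store_data && ops[i].is_store;
            int need = ops[i].reads - (defer ? 1 : 0);
            if (need > free_ports)
                break;               /* in order: a stalled op blocks younger ones */
            free_ports -= need;
            if (defer)
                carried++;
            i++;
        }
        cycles++;
    }
    return cycles;
}
```

For eight ops alternating ALU, store, ALU, store, ..., this model gives 8 cycles without deferral and 6 with it. The numbers only illustrate the mechanism under these assumptions.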

  • From MitchAlsup1@21:1/5 to BGB on Wed Sep 4 01:57:24 2024
    On Tue, 3 Sep 2024 23:23:50 +0000, BGB wrote:

    On 9/1/2024 4:02 PM, Paul A. Clayton wrote:
    On 8/31/24 4:56 PM, BGB wrote:
    [snip]
    I was mostly doing dual-issue with a 4R2W design.

    Initially, 6R3W won out mostly because 4R2W disallows an indexed store
    to be run in parallel with another op; but 6R3W did allow this.

    Stores and MADD allow one register read to be delayed by at least
    one cycle. If the following cycle had a free read port, that could
    be stolen to complete the store/MADD. This could be viewed as
    cracking a three-source operation into a two-source operation and
    a one-source operation that reads source operands in a following
    cycle except that this operation never uses a result from the
    previous cycle.


    This wouldn't map well to my existing decoder/pipeline, which requires
    all the ports (and all the registers) to be available at the time an instruction enters EX1, and currently has no support for "cracking" an instruction over multiple cycles, but may spread a single instruction
    across multiple lanes.

    Your pipeline is amateur at best.
    --------------
    But, yeah, if the restriction only applied to indexed store (in the
    current implementation, it applies to all stores), it would still be
    around 4% of the total instruction stream.

    As-is, it is closer to 12%, and causing an extra penalty for 12% of the total-executed instructions was undesirable (but, IMHO, still better
    than needing to use multiple instructions).

    Delaying ST.data only delays LDs which alias that ST.

  • From David Brown@21:1/5 to All on Wed Sep 4 08:53:28 2024
    On 04/09/2024 03:54, MitchAlsup1 wrote:
    On Tue, 3 Sep 2024 23:19:36 +0000, David Brown wrote:

    On 03/09/2024 19:19, Bernd Linsel wrote:
    On 03.09.24 10:10, David Brown wrote:

    snip 8< - - - - - - - -

    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid
    imposing too much work on early compilers.  Where possible, UB that
    can be caught at compile time, could usefully be turned into constraint
    violations that must be diagnosed.)

    And exactly these are the situations that I'd like to be warned about,
    rather than the compiler making up something without telling.


    Some of those /are/ warned about by compilers (but I'd rather the
    standards said that they were errors).  But in general, many can be
    handled by good development practice and compiler warnings.  Still,
    compilers could always get better!

    Something that might be an error in a 32-bit machine may not be
    an error in a 36-bit {48, 64, 72} machine.

    One thing that could make a big difference, I think, is to drop the
    compilation model of each translation unit being compiled to a binary
    object independently, with only a minimal amount of information for
    linking.  Link-time optimisation allows for many extra checks, not all
    of which are currently implemented AFAIK.  For example, it should be
    possible to check that external declarations and definitions match up
    correctly across modules - that's currently UB and rarely checked.

    How does one call fprintf() under those rules ??

    Untyped vararg functions are a big risk factor for programming and are
    always difficult for static (or run-time) checking. The best you can do
    is limit them to the standard ones (the printf family is very useful),
    make sure you are always using declarations from common headers rather
    than "home-made" declarations, and use the tools you can (such as gcc
    and clang's format attribute checks).
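    A minimal sketch of the gcc/clang `format` attribute, applied to a hypothetical printf-style helper `log_msg` that formats into a caller-supplied buffer. With the attribute, `-Wformat` (enabled by `-Wall`) checks the arguments at every call site just as it would for `printf` itself.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Argument 3 is the printf-style format string and the variadic
   arguments start at argument 4, so a call such as
   log_msg(buf, sizeof buf, "%d", "text") draws a -Wformat warning. */
static int log_msg(char *buf, size_t size, const char *fmt, ...)
    __attribute__((format(printf, 3, 4)));

static int log_msg(char *buf, size_t size, const char *fmt, ...)
{
    assert(buf != NULL || size == 0); /* vsnprintf allows NULL only with size 0 */
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, size, fmt, ap); /* length the full output needs */
    va_end(ap);
    return n;
}
```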

    There will never be a way to do full automatic checking of code
    correctness. But the more mistakes that can be caught automatically,
    the better. Modern tools can catch more than older tools, and there is
    scope for them to catch even more. (Though it can sometimes be
    surprising how difficult it can be to add seemingly obvious warnings to compilers - the way the different analysis and optimisation passes are
    divided can mean critical information is lost or too inefficient to track.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to BGB on Wed Sep 4 09:04:40 2024
    On 03/09/2024 20:39, BGB wrote:
    On 9/2/2024 8:36 AM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong.  The
    question I have concerns what to do with something that explicitly is
    mentioned as UB in some standard N, but was not addressed in previous
    standards.

    Was it always UB?  Or should it be considered ID until it became UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    I had just recently discovered that newer versions of GCC will cause
    code to break if it is missing a return value in C++ mode.


    No, the error in the code caused the code to break. You don't get to
    blame the compiler if you write rubbish. You get to /thank/ the
    compiler if it has helpfully added an instruction to cause the program
    to stop abruptly with a UD2 instruction.

    Note that in C, falling off the end of Foo here is fine - it is only if
    the caller attempts to use the non-existent return value that there is
    UB. Thus in C mode, gcc implements Foo as "ret" (when optimised), and
    will only warn you if you enable warnings.

    In C++, it is the act of falling off the end of Foo that is UB, thus the compiler will generate a UD2 (for -O0) or no code at all (when
    optimised), and will warn you without requiring options.
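    Given that C/C++ difference, a sketch of the safe pattern for a hypothetical init function: every path returns explicitly, which C++ requires and which spares C callers from reading an indeterminate value. gcc's `-Wreturn-type` (part of `-Wall`) reports the missing-return case in C; in C++ it is on by default.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical init function. Leaving out the returns would be UB in C++
   as soon as control reached the closing brace, and UB in C if the
   caller used the result. If there is nothing to report, declare the
   function void instead. */
static int init_something(bool hw_present)
{
    if (!hw_present)
        return -1; /* report failure explicitly */
    /* ... set up state ... */
    return 0;      /* success */
}
```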

    So:
    int Foo() { }

    Will (in theory) cause the program to crash when called (emitting a
    'UD2' instruction), except in WSL it seems this doesn't quite work
    correctly (the UD2 doesn't result in an immediate crash), and the
    program seemingly instead "goes off the rails and crashes at a later
    point" (GCC omits the epilog when it does this, and seemingly control
    flow then goes into whatever function follows in the binary, crashing
    when that function tries to return seemingly by branching to an invalid address or similar).

    This was mostly affecting "init" functions in my Verilator test benches...


    Well, that, and a more inconsistent variant, where if one declares
    struct fields as 8 and 3 bytes and then strncpy's 11 bytes into the
    combined field, it may also insert a UD2 and skip emitting the following code.

    ...


    But, yeah, that was annoying...


    If your compiler tells you you are doing something stupid, and you
    ignore it, I really don't think you can claim "the compiler broke my code".
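    The strncpy() case, sketched with a hypothetical 8.3-style record made of two adjacent `char` fields of 8 and 3 bytes. Copying 11 bytes through the first field overruns it, which is UB even though the second field follows it in memory, so an optimising compiler may emit a trap (such as the UD2 described above) or drop the code that follows. Copying each field on its own avoids the overrun.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record with adjacent 8- and 3-byte fields. */
struct name83 {
    char base[8];
    char ext[3];
};

/* strncpy(r->base, src, 11) would write past the end of `base`.
   Copy each field separately; `src` must hold at least 11 bytes. */
static void set_name83(struct name83 *r, const char src[11])
{
    assert(r != NULL && src != NULL);
    memcpy(r->base, src, sizeof r->base);
    memcpy(r->ext, src + sizeof r->base, sizeof r->ext);
}
```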

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Wed Sep 4 09:15:19 2024
    David Brown wrote:
    On 03/09/2024 18:54, Stephen Fuld wrote:
    On 9/2/2024 11:23 PM, David Brown wrote:
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:

    Anyway, that is all mostly moot since I'm using Rust for this kind
    of programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?


    And also for Rust versus C++ ?

    I asked about C versus Rust as Terje explicitly mentioned those two
    languages, but you make a good point in general.


    I want to know about both :-)

    In my field, small-systems embedded development, C has been dominant for
    a long time, but C++ use is increasing.  Most of my new stuff in recent times has been C++.  There are some in the field who are trying out
    Rust, so I need to look into it myself - either because it is a better
    choice than C++, or because customers might want it.



    My impression - based on hearsay for Rust as I have no experience -
    is that the key point of Rust is memory "safety".  I use
    scare-quotes here, since it is simply about correct use of dynamic
    memory and buffers.

    I agree that memory safety is the key point, although I gather that it
    has other features that many programmers like.


    Sure.  There are certainly plenty of things that I think are a better
    idea in a modern programming language and that make it a good step up compared to C.  My key interest is in comparison to C++ - it is a step
    up in some ways, a step down in others, and a step sideways in many features.  But is it overall up or down, for /my/ uses?

    Examples of things that I think are good in Rust are making variables immutable by default and pattern matching.  Steps down include lack of function overloading and limited object oriented support.

    There are some things that some people really like about Rust, that I am
    far from convinced about - such as package management.  I could be misunderstanding (since I don't have the experience), but for /my/ work,
    I am very much against anything that encourages an "always get the
    latest version" attitude.  Stability is much more important to me.  (I dislike the rate at which Rust changes - every two weeks or so for small things, and every couple of years for breaking changes.)

    That's yet another of the things cargo (the rust package manager, as
    well as lots of other stuff) gets right:

    Yes, by default you'll pick up the latest of every package/module you
    "cargo add foo" to your project, but then you can edit the resulting text-format configuration file, and lock down exact versions of some or
    all of those packages.

    This is similar to how we always freeze Python packages: any changes are something we decide to employ.



    And there are some things that Rust simply gets wrong - such as the
    handling of signed integer overflows.

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back
    to standard CPU behavior, i.e. wrapping around with no panics.

    You can argue both pro and con here, personally I like the Rust setup
    much more than C(++), which will use code that could overflow as an
    excuse to elide it, along with all surrounding/dependent code.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Stefan Monnier on Wed Sep 4 09:20:10 2024
    On 03/09/2024 21:30, Stefan Monnier wrote:
    Specifications are an agreement between the supplier and the client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.


    That's what I do for a living. And I'm not exactly unique here. If we
    charge money for a product with code, and there's a bug in the code,
    that is covered by the product's guarantee, just like design faults in
    the hardware.

    Basically, hitting UB at run-time means a bug in the code because the
    program does not do what you intended. And if you hit a bug in the
    code, then the behaviour is not what you defined in the code
    specifications - it is UB.

    As I see it, the task of avoiding UB in general is simply the task of
    writing bug-free code. That can definitely be hard, regardless of the language.


    But if you are thinking specifically of "popular" UB in C, such as dereferencing null pointers, overflowing signed arithmetic, using
    pointers after "free", or accessing arrays out of bounds, then no, I
    don't think it is hard at all. Seriously, it is extremely rare that I
    have bugs in my code from such UB, even during early development. Maybe
    it is the type of code I write (it's a somewhat niche field), or the way
    I do my development, but it just is not a problem. (I can have plenty
    of other kinds of bugs, of course!)


    What /definitely/ does not help is for a language to define incorrect
    behaviour in order to say it doesn't have undefined behaviour. A
    classic example is defining signed integer overflow as two's complement wrapping. That does not fix any errors in the code - it just guarantees
    that the code will produce incorrect answers which can later lead to
    nasal daemons, but that it won't launch the nasal daemons immediately.
    So your tools can't do as much to help catch the errors (from static
    error checking, debuggers or sanitizers), and the compiler can't
    generate as efficient results for correct code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Wed Sep 4 09:29:01 2024
    On 03/09/2024 22:22, MitchAlsup1 wrote:
    On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

    Specifications are an agreement between the supplier and the client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    # define int int64_t
    ..

    makes it easier.

    That's UB, I believe :-) And it will certainly be confusing.

    But good use of size-specific types is helpful to writing correct code.
    If your calculations could conceivably overflow 32 bits, int64_t is a
    good choice.

    For smaller numbers and portable code, you might want int_fast32_t or int_fast16_t, which on most 64-bit systems will be faster than "int".

    You can call it /ugly/, but it's not /hard/.
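    To illustrate the point about size-specific types, a sketch with a hypothetical `area_mm2` helper: two values that each fit comfortably in 32 bits can have a product that does not, so one operand is widened to `int64_t` before the multiply rather than after.

```c
#include <assert.h>
#include <stdint.h>

/* Widen before multiplying: (int64_t)(w * h) would be too late, since
   the multiplication itself would be done in 32-bit arithmetic and
   could overflow. */
static int64_t area_mm2(int32_t width_mm, int32_t height_mm)
{
    assert(width_mm >= 0 && height_mm >= 0); /* hypothetical precondition */
    return (int64_t)width_mm * height_mm;    /* 64-bit multiply */
}
```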

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Wed Sep 4 09:26:16 2024
    Stephen Fuld wrote:
    On 9/3/2024 8:46 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 8/31/2024 2:14 PM, MitchAlsup1 wrote:
    On Sat, 31 Aug 2024 21:01:54 +0000, Bernd Linsel wrote:
    You compare apples and peaches. Technical specifications for your
    pressure vessel result from the physical abilities of the chosen
    material, subject to requirements such as vessel wall thickness,
    geometry, etc., while compiler writers are free in their search for
    optimization tricks that let them shine at SPEC benchmarks.

    A pressure vessel may actually be able to contain 2× the pressure it
    will be able to contain after 20 years of service, due to stress and
    strain acting on the base materials.

    Then there are 3 kinds of metals {grey, white, yellow} with different
    responses to stress and induced strain. There is no analogy in code;
    if there were, perhaps we would have better code today...

    Perhaps an analogy is code written in assembler, versus code written
    in C versus code written in something like Ada or Rust.  Backing
    away now . . . :-)

    IMNSHO, code written in asm is generally more safe than code written
    in C, because the author knows exactly what each line of code is
    going to do.

    The problem is of course that it is harder to get 10x lines of
    correct asm than to get 1x lines of correct C.

    BTW, I am also solidly in the grey hair group here, writing C code
    that is very low-level, using explicit local variables for any loop
    invariant, copying other stuff into temp vars in order to make it
    really obvious that they cannot alias any globals or input/output
    parameters.
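    A small sketch of that style, with a hypothetical global `g_scale`: because a store through `out` (an `int *`) could legally modify an `int` global, a naive loop has to reload `g_scale` after every store. Copying it into a local whose address is never taken makes it obvious, to both the compiler and the reader, that it cannot change. (C99 `restrict` is another way to make the same promise.)

```c
#include <assert.h>
#include <stddef.h>

int g_scale = 3; /* hypothetical global with external linkage */

static void scale_into(int *out, const int *in, size_t count)
{
    assert(out != NULL && in != NULL);
    const int scale = g_scale; /* loaded once; stores to out[] cannot change it */
    for (size_t i = 0; i < count; i++)
        out[i] = in[i] * scale;
}
```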

    Anyway, that is all mostly moot since I'm using Rust for this kind of
    programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?

    Q&D programming is still far faster for me in C, but using Rust I don't have to worry about how well the compiler will be able to optimize my code, it is pretty much always close to speed of light since the entire aliasing issue goes away.

    Rust also gets rid of the horrible external library/configure/cmake mess that kept me from successfully compiling the reference LAStools lidar
    code for nearly 10 years.

    Using the Rust port I just tell cargo to add it to my project and that's it.

    Thank you.  I find it interesting that the main advantage of Rust as
    touted by its evangelists, memory safety, didn't make your list.

    Possibly because, due to the way I've been writing C(++) code for the
    last 40 years, I have almost never been hit by those problems myself?

    OTOH, in retrospect I know I have written a lot of code that would not
    have survived an experienced attacker, i.e. strcpy()/memcpy()/etc
    without explicit checks that the target buffer is large enough.

    This is of course fine in the classic "everyone is a friend, all code is
    open source, and nobody wants to actively attack us" environment, but
    not so much for anything exposed to the Internet.
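    For contrast with an unchecked `strcpy()`, a sketch of a copy that checks the destination size first (the name `copy_checked` is hypothetical). It refuses to copy, rather than silently truncating, when the source does not fit; whether refusal or truncation is the right policy depends on the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy a NUL-terminated string only if it fits in dst[dst_size].
   On failure, dst is left as an empty string (when dst_size > 0). */
static bool copy_checked(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return false;
    size_t len = strlen(src);
    if (len >= dst_size) { /* no room for the string plus its NUL */
        dst[0] = '\0';
        return false;
    }
    memcpy(dst, src, len + 1); /* includes the terminating NUL */
    return true;
}
```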

    During my NTP Hackers time we never had memory overruns AFAIR, but we
    did get a lot of abuse when DoS attacks were using our open-by-default debug/monitoring interface to amplify attacks on other systems. This was similar to the classic DNS abuse for the same purpose.

    Yes, I do like the Rust memory safety, but it does nothing to prevent
    attacks of that type: We had to switch from UDP to TCP for all requests
    that could produce outputs larger than the input size.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Wed Sep 4 08:05:51 2024
    BGB <[email protected]> writes:
    Otherwise, annoying:
    Despite configuring GCC to use RV64G, it builds its C library as RV64GC
    and is like "well, close enough".

    This may be an artifact of bootstrapping. At some point I built a new
    version of gcc for our Alphas. We had machines without the BWX
    extensions and machines with the BWX extension.

    Of course I built gcc on the fastest machine we had, one with BWX.
    And then I found out that the resulting compiler binary would not run
    on the machines without BWX.

    Ok, so build it again, taking care to configure it to not use BWX in bootstrapping itself. However, somehow libgcc got inherited from the
    previous build, so the resulting compiler would run on machines
    without BWX, but the binaries it produced would not. My guess is that something similar happened for libgcc in your case.

    I did another round of rebuilding, making sure that libgcc was rebuilt
    from scratch without BWX. I don't remember all that was involved;
    maybe I just did this build on a machine that does not have BWX.

    [Risc-V compressed instructions]
    Which is annoying, because seemingly nearly every instruction has its own encoding scheme for the immediate fields.

    It's designed for easy hardware decoding, so maybe you just need to
    discover the ideas behind that and put them into your decoder.
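    As an illustration, a sketch that decodes the immediates of two RVC formats from the RISC-V "C" extension: CI (used by C.ADDI, a 6-bit signed immediate) and CJ (used by C.J and C.JAL, a 12-bit signed offset). The helper names are hypothetical. One of the ideas behind the layouts: in the signed RVC immediates the sign bit is always instruction bit 12, so sign extension does not depend on the format.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of v (1 <= bits <= 31, v < 2^bits). */
static int32_t sext(uint32_t v, unsigned bits)
{
    uint32_t m = UINT32_C(1) << (bits - 1);
    return (int32_t)(v ^ m) - (int32_t)m;
}

/* CI format (e.g. C.ADDI): imm[5] = inst[12], imm[4:0] = inst[6:2]. */
static int32_t rvc_ci_imm(uint16_t inst)
{
    uint32_t v = (((inst >> 12) & 1u) << 5) | ((inst >> 2) & 0x1Fu);
    return sext(v, 6);
}

/* CJ format (C.J, C.JAL): inst[12:2] = offset[11|4|9:8|10|6|7|3:1|5]. */
static int32_t rvc_cj_imm(uint16_t inst)
{
    assert((inst & 3u) == 1u); /* C.J and C.JAL live in quadrant 1 */
    uint32_t v = 0;
    v |= ((inst >> 12) & 1u) << 11;
    v |= ((inst >> 11) & 1u) << 4;
    v |= ((inst >> 9) & 3u) << 8;
    v |= ((inst >> 8) & 1u) << 10;
    v |= ((inst >> 7) & 1u) << 6;
    v |= ((inst >> 6) & 1u) << 7;
    v |= ((inst >> 3) & 7u) << 1;
    v |= ((inst >> 2) & 1u) << 5;
    return sext(v, 12);
}
```

    For example, 0x10FD encodes c.addi x1, -1, and 0xA001 and 0xBFFD encode c.j with offsets 0 and -2.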

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Wed Sep 4 12:57:21 2024
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:
    On 03/09/2024 18:54, Stephen Fuld wrote:
    On 9/2/2024 11:23 PM, David Brown wrote:
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:

    Anyway, that is all mostly moot since I'm using Rust for this kind of programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?

    And also for Rust versus C++ ?

    I asked about C versus Rust as Terje explicitly mentioned those two
    languages, but you make a good point in general.


    I want to know about both :-)

    In my field, small-systems embedded development, C has been dominant
    for a long time, but C++ use is increasing.  Most of my new stuff in
    recent times has been C++.  There are some in the field who are trying
    out Rust, so I need to look into it myself - either because it is a
    better choice than C++, or because customers might want it.



    My impression - based on hearsay for Rust as I have no experience -
    is that the key point of Rust is memory "safety".  I use
    scare-quotes here, since it is simply about correct use of dynamic
    memory and buffers.

    I agree that memory safety is the key point, although I gather that
    it has other features that many programmers like.


    Sure.  There are certainly plenty of things that I think are a better
    idea in a modern programming language and that make it a good step up
    compared to C.  My key interest is in comparison to C++ - it is a step
    up in some ways, a step down in others, and a step sideways in many
    features.  But is it overall up or down, for /my/ uses?

    Examples of things that I think are good in Rust are making variables
    immutable by default and pattern matching.  Steps down include lack of
    function overloading and limited object oriented support.

    There are some things that some people really like about Rust, that I
    am far from convinced about - such as package management.  I could be
    misunderstanding (since I don't have the experience), but for /my/
    work, I am very much against anything that encourages an "always get
    the latest version" attitude.  Stability is much more important to
    me.  (I dislike the rate at which Rust changes - every two weeks or so
    for small things, and every couple of years for breaking changes.)

    That's yet another of the things cargo (the rust package manager, as
    well as lots of other stuff) gets right:

    Yes, by default you'll pick up the latest of every package/module you
    "cargo add foo" to your project, but then you can edit the resulting text-format configuration file, and lock down exact versions of some or
    all of those packages.

    OK, that's good. And I presume there is no problem keeping these
    versions locally in your git (or other source code system), for when the
    old versions are removed from their internet sources.


    This is similar to how we always freeze Python packages: any changes are something we decide to employ.



    And there are some things that Rust simply gets wrong - such as the
    handling of signed integer overflows.

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back to standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging), then
    your code never makes use of the results of an overflow - thus why is it defined behaviour? It makes no sense. The only time when you would
    actually see wrapping in final code is if you hadn't tested it properly,
    and then you can be pretty confident that the whole thing will end in
    tears when signs change unexpectedly. It would be much more sensible to
    leave signed overflow undefined, and let the compiler optimise on that
    basis.

    I'm all in favour of temporarily having checks for overflow (and other
    errors) during debugging, but I am sceptical of having distinct
    debug/release builds. It encourages people to use debug builds during development, bug hunting and testing, then when all looks good they
    switch to release build and deploy it. I prefer a single build, and
    enable run-time checks on parts of it if and when necessary.


    You can argue both pro and con here, personally I like the Rust setup
    much more than C(++), which will use code that could overflow as an
    excuse to elide it, along with all surrounding/dependent code.


    If the compiler can see that code is never run, or that it will have all
    gone horribly wrong before the code is reached, I am happy to see it
    removed by the compiler. (Where possible - and there are unfortunately
    limits to warning abilities - I like the compiler to tell me about it.)
    I see no benefit in keeping code in place if it can't be run.

    (But I agree that there are pros and cons to many of these things.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to BGB on Wed Sep 4 13:12:08 2024
    On 04/09/2024 10:06, BGB wrote:
    On 9/4/2024 2:04 AM, David Brown wrote:
    On 03/09/2024 20:39, BGB wrote:
    On 9/2/2024 8:36 AM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 5:55:34 +0000, Thomas Koenig wrote:

    George Neuner <[email protected]> schrieb:

    I'm not going to argue about whether UB in code is wrong.  The
    question I have concerns what to do with something that explicitly is
    mentioned as UB in some standard N, but was not addressed in previous
    standards.

    Was it always UB?  Or should it be considered ID until it became UB?

    Can you give an example?

    memcpy() with overlapping pointers.

    I had just recently discovered that newer versions of GCC will cause
    code to break if it is missing a return value in C++ mode.


    No, the error in the code caused the code to break.  You don't get to
    blame the compiler if you write rubbish.  You get to /thank/ the
    compiler if it has helpfully added an instruction to cause the program
    to stop abruptly with a UD2 instruction.


    Usually the role of the compiler is to make existing code work as it did before, not to cause it to break, even in the face of UB.

    No, it is not.

    The role of a compiler is to take correct input code and generate
    correct output code. And if the compiler is helpful, then as a bonus it
    can tell you when you have made mistakes.

    It is most certainly /not/ the aim of a compiler to generate exactly the
    same garbage out as some other compiler did for some garbage in.


    I would have more readily accepted it if it had been turned into a
    compiler error or similar, though (rather than a runtime crash).


    The compiler /does/ generate an error - /if/ you use the compiler properly.

    It is an unfortunate thing, IMHO, that gcc is far too lenient to random
    crap input by default. The world of C and C++ programming would have
    seen vastly fewer bugs if "gcc -Wpedantic -Wall -Wextra -Werror" had
    been the default, and expert users turned off specific warnings if they
    wanted.

    If you choose to use your shiny new power saw without guards, holding
    your wood by hand, without reading any instructions, and you lose a hand
    - who would you blame? Do you blame the power saw for not being as slow
    and weak as your old rusty handsaw that merely scratched you?

    Your code was wrong. You should have known it was wrong when you wrote
    it. You should have used standard, common free tools that would have
    told you it was wrong. You should not have ignored the warnings these
    tools gave you even when you didn't use them appropriately. It is not
    the fault of the tool.



    Note that in C, falling off the end of Foo here is fine - it is only
    if the caller attempts to use the non-existent return value that there
    is UB.  Thus in C mode, gcc implements Foo as "ret" (when optimised),
    and will only warn you if you enable warnings.

    In C++, it is the act of falling off the end of Foo that is UB, thus
    the compiler will generate a UD2 (for -O0) or no code at all (when
    optimised), and will warn you without requiring options.


    It worked fine in the older instance of WSL running GCC 4.8.0 ("Ubuntu
    14"), but sorta exploded when switching to a newer instance of WSL (with "Ubuntu 22")...

    But, sometimes got lazy, and did:
    int InitSomething()
    {
       ...
    }


    Without a return, but was an issue when it was unexpectedly crashing
    (and the cause was not immediately obvious, and I had not heard that
    there had been a behavioral change here).


    There have been no behaviour changes in the language or the compiler.
    Your code had no defined behaviour before, it has no defined behaviour
    now, and that is completely independent of the compiler or version.

    Well, also partly because it is traditional to always return 'int' even
    when 'void' is technically more correct.

    Don't be absurd. That C tradition was outdated before you were born,
    and has never existed in C++.



    But, in general, coding practices in my Verilator testbenches tend to
    be more lax (mostly code thrown together so the Verilog can do its thing
    and display its output to a window, and accept user input as needed).


    So:
    int Foo() { }

    Will (in theory) cause the program to crash when called (emitting a
    'UD2' instruction), except in WSL it seems this doesn't quite work
    correctly (the UD2 doesn't result in an immediate crash), and the
    program seemingly instead "goes off the rails and crashes at a later
    point" (GCC omits the epilog when it does this, and seemingly control
    flow then goes into whatever function follows in the binary, crashing
    when that function tries to return seemingly by branching to an
    invalid address or similar).

    This was mostly affecting "init" functions in my Verilator test
    benches...


    Well, that, and a more inconsistent variant, where if one declares
    struct fields as 8 and 3 bytes and then strncpy's 11 bytes into the
    combined field, it may also insert a UD2 and skip emitting the
    following code.

    ...


    But, yeah, that was annoying...


    If your compiler tells you you are doing something stupid, and you
    ignore it, I really don't think you can claim "the compiler broke my
    code".


    It would have been nicer if it crashed in a way where GDB could show me
    the point at which the crash was triggered...


    Finding bugs is always cheaper (in time, effort and money) the earlier
    you find it. Get yourself an editor or IDE that will spot such mistakes
    as you type them. Failing that, at least use a compiler with good
    warnings, use those warnings, and pay attention to them. Then learn to
    use the sanitizer options as the next step. It is a waste of time to
    wait until debugging to find such obvious and simple mistakes.

    as opposed to just showing "??" followed by a random address (followed
    by "can't read from address" or something to this effect), even with
    the "-g" option. And "bt" and similar didn't work either.



    I could tell it wasn't crashing immediately, because if it crashed immediately it would fail at the point the UD2 occurred.

    However, in a lot of cases it was carrying on and triggering a storm of
    debug prints for a while with often impossible values, before then
    crashing (in a way that looked more like a possible stack corruption).

    I suspect the latter was due to some weirdness in WSL (I figured out
    the "UD2" mostly by trying to recreate the scenarios in "Compiler
    Explorer" / "godbolt.org").


    Luckily stuff mostly worked after this point, as the missing return
    values were mostly limited to initialization functions.


    Oddly though, "Compiler Explorer" was showing warnings for the missing
    return values, but not in GCC in WSL.


    Though, have noted that generally MSVC will warn about them, and in this
    case I had usually fixed them, as granted it is still good practice to
    return a value (more so if actually used, because "random garbage" isn't usually a particularly useful return value).

    But, generally, MSVC will not unexpectedly break things.


    gcc did not break anything.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to David Brown on Wed Sep 4 08:53:08 2024
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back
    to standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging), then
    your code never makes use of the results of an overflow - thus why is it defined behaviour?  It makes no sense.  The only time when you would actually see wrapping in final code is if you hadn't tested it properly,
    and then you can be pretty confident that the whole thing will end in
    tears when signs change unexpectedly.  It would be much more sensible to leave signed overflow undefined, and let the compiler optimise on that
    basis.


    You absolutely do want defined behavior on overflow. There are
    algorithms that depend on that. Bakery algorithms for instance.
    Unless you think a real life bakery with service tickets
    numbering from 1 to 50 either never gets more than 50 customers
    in a day or closes after their 50th customer. :)
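    In C, the defined wrap-around described here is available through unsigned types, while signed overflow stays undefined, so both positions can coexist. A sketch of comparing wrapping ticket numbers in the style of serial-number arithmetic (RFC 1982); the name `ticket_before` is hypothetical, and the answer is only meaningful while the two tickets are fewer than 2^31 apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if ticket a was issued before ticket b, allowing the counter to
   wrap. b - a is computed modulo 2^32, which C defines for unsigned. */
static bool ticket_before(uint32_t a, uint32_t b)
{
    return a != b && (uint32_t)(b - a) < UINT32_C(0x80000000);
}
```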

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to David Brown on Wed Sep 4 08:41:22 2024
    On 9/3/24 19:14, David Brown wrote:

    Absolutely.  There's things about newer languages, like Rust, Go, and
    Swift that I like.  For example, they are designed with concurrency and multi-threading from the start, rather than an add-on.  C++, as we know
    it today, has grown gradually, and a lot of its complexity is because of features added on rather than having been part of the original design.


    Rust and Go use the C/C++ atomics and concurrency model. I think that's
    maybe to do with using common compiler back ends. They do try to make/encourage programmers to use language constructs that they think
    are safe, fool proof, and generalizable (though that's up to debate).

    I don't know about Swift. Apple is off in their own alternate
    reality. I like some of their hardware. I would get the M4
    Mac mini, but I've owned both an x86 and a PowerPC mini, and dealing
    with their toolchains and APIs is an absolute nightmare.

    Part of the problem with concurrency support is that it is limited
    by the imagination and foibles of the language architects. It
    sucks that even today you have to resort to assembler to implement
    some very basic and fundamental lock-free algorithms that are 30
    to 50 years old at least.
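For reference, the shared C11 atomics model does let you write the following portably: the push half of a classic lock-free (Treiber) stack, built on compare-and-swap. The types and names are hypothetical. The matching pop is where the well-known ABA problem appears. Common fixes, such as tagged pointers with a double-width CAS, are one example of the primitives that often still need compiler extensions or assembler.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node and stack types for illustration. */
struct node {
    struct node *next;
    int value;
};

struct stack {
    _Atomic(struct node *) head;
};

/* Lock-free push using C11 compare-and-swap. On failure,
   atomic_compare_exchange_weak_explicit stores the current head into
   'expected', so the loop relinks the node and retries. */
static void stack_push(struct stack *s, struct node *n)
{
    struct node *expected = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = expected;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &expected, n,
                 memory_order_release, memory_order_relaxed));
}
```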

    Joe Seigh

  • From Tim Rentsch@21:1/5 to Terje Mathisen on Wed Sep 4 09:07:20 2024
    Terje Mathisen <[email protected]> writes:

    Michael S wrote:

    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:

    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old, popular
    book that said that it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.

    What does "compact buffers" mean?

    Assume a buffer consisting of records of some type, some of them
    marked as deleted. Iterating over them while removing the gaps means
    that you are always copying to a destination lower in memory, right?

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.
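Here is a sketch of the compaction Terje describes, using a hypothetical record type with a deleted flag. Live records only ever move to lower addresses. A run of them can still overlap its new position, though, and memcpy() on overlapping objects is undefined in ISO C, so the sketch uses memmove().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type: 'deleted' marks entries to be squeezed out. */
struct record {
    bool deleted;
    int payload;
};

/* Compacts recs[0..n) in place, keeping live records in order, and
   returns the new count. Runs of live records are moved with memmove
   because a run's source and destination may overlap. */
static size_t compact_records(struct record *recs, size_t n)
{
    size_t dst = 0, i = 0;
    while (i < n) {
        if (recs[i].deleted) { i++; continue; }
        size_t run = i;                      /* start of a run of live records */
        while (i < n && !recs[i].deleted)
            i++;
        size_t len = i - run;
        if (dst != run)
            memmove(&recs[dst], &recs[run], len * sizeof recs[0]);
        dst += len;
    }
    return dst;
}
```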

  • From MitchAlsup1@21:1/5 to David Brown on Wed Sep 4 16:32:37 2024
    On Wed, 4 Sep 2024 7:29:01 +0000, David Brown wrote:

    On 03/09/2024 22:22, MitchAlsup1 wrote:
    On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

    Specifications are an agreement between the supplier and the client. The
    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    # define int int64_t
    ..

    makes it easier.

    That's UB, I believe :-) And it will certainly be confusing.

    On 64-bit machines it re-establishes the dusty-deck old K&R C where
    int was the fastest integer type.

    But good use of size-specific types is helpful to writing correct code.
    If your calculations could conceivably overflow 32 bits, int64_t is a
    good choice.

    For smaller numbers and portable code, you might want int_fast32_t or int_fast16_t, which on most 64-bit systems will be faster than "int".

    You can call it /ugly/, but it's not /hard/.
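A small sketch of the int64_t point, with a hypothetical function name. Each value fits in 32 bits but their sum might not, so only the accumulator is widened.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums 32-bit sizes into a 64-bit total. Each element is converted to
   int64_t before it is added, so the running sum cannot overflow 32 bits. */
static int64_t total_size(const int32_t *sizes, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += sizes[i];
    return total;
}
```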

  • From David Brown@21:1/5 to jseigh on Wed Sep 4 18:34:29 2024
    On 04/09/2024 14:53, jseigh wrote:
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back
    to standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging),
    then your code never makes use of the results of an overflow - thus
    why is it defined behaviour?  It makes no sense.  The only time when
    you would actually see wrapping in final code is if you hadn't tested
    it properly, and then you can be pretty confident that the whole thing
    will end in tears when signs change unexpectedly.  It would be much
    more sensible to leave signed overflow undefined, and let the compiler
    optimise on that basis.


    You absolutely do want defined behavior on overflow.

    No, you absolutely do /not/ want that - for the vast majority of use-cases.

    There are times when you want wrapping behaviour, yes. More generally,
    you want modulo arithmetic rather than a model of mathematical integer arithmetic. But those cases are rare, and in C they are easily handled
    using unsigned integers.
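Hashing is one common case of intended modulo arithmetic. As an example (my choice, not something from the thread), the 32-bit FNV-1a hash relies on its multiplication wrapping modulo 2^32, which is well defined for uint32_t.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a. The multiply deliberately wraps modulo 2^32, which is
   well defined because h is an unsigned type. */
static uint32_t fnv1a32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = UINT32_C(2166136261);        /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= UINT32_C(16777619);              /* FNV prime */
    }
    return h;
}
```

The same code written with a signed accumulator would overflow, which is undefined behaviour in C.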

    You can't use signed integers for them in C (except of course if you use explicit modulo and none of your intermediary results overflow int),
    because signed integer overflow is UB. You can't use signed integers
    for the purpose in Rust either, even though it is defined behaviour in
    release mode, because it is a run-time error in debug mode. (That's why
    Rust's attitude here is completely daft to me.)

    There are
    algorithms that depend on that.  Bakery algorithms for instance.
    Unless you think a real life bakery with service tickets
    numbering from 1 to 50 either never gets more than 50 customers
    in a day or closes after their 50th customer. :)

    Joe Seigh


  • From David Brown@21:1/5 to Tim Rentsch on Wed Sep 4 19:53:13 2024
    On 04/09/2024 18:07, Tim Rentsch wrote:
    Terje Mathisen <[email protected]> writes:

    Michael S wrote:

    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:

    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old, popular
    book that said that it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.

    What does "compact buffers" mean?

    Assume a buffer consisting of records of some type, some of them
    marked as deleted. Iterating over them while removing the gaps means
    that you are always copying to a destination lower in memory, right?

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(), which
    will choose to run forwards or backwards as needed. So you might as well
    just call memmove() any time you are not sure memcpy() is safe and
    appropriate.
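Here is a simplified, byte-at-a-time sketch of that direction test, under a hypothetical name. Real library versions copy in larger chunks. Ordering comparisons between pointers into different objects are undefined in ISO C, so the sketch compares uintptr_t values. That assumes a flat address space.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wise memmove showing the forward/backward choice. */
static void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if ((uintptr_t)d < (uintptr_t)s) {
        for (size_t i = 0; i < n; i++)   /* dst below src: copy forwards */
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        for (size_t i = n; i-- > 0; )    /* dst above src: copy backwards */
            d[i] = s[i];
    }
    return dst;                          /* equal addresses: nothing to do */
}
```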

  • From Thomas Koenig@21:1/5 to David Brown on Wed Sep 4 17:25:44 2024
    David Brown <[email protected]> schrieb:

    I'm all in favour of temporarily having checks for overflow (and other errors) during debugging, but I am sceptical of having distinct
    debug/release builds. It encourages people to use debug builds during development, bug hunting and testing, then when all looks good they
    switch to release build and deploy it. I prefer a single build, and
    enable run-time checks on parts of it if and when necessary.

    Wise man once said...

    # It is absurd to make elaborate security checks on debugging runs,
    # when no trust is put in the results, and then remove them in
    # production runs, when an erroneous result could be expensive or
    # disastrous. What would we think of a sailing enthusiast who wears
    # his lifejacket when training on dry land, but takes it off as soon
    # as he goes to sea?

    (C.A.R. Hoare, in "Hints on Programming Language Design")
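A minimal sketch of a check that stays enabled in a release build, using a hypothetical CHECK macro. Unlike assert(), it is not compiled out when NDEBUG is defined. A failing check still only stops the program. It does not correct the program.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical always-on check: reports the failed condition and aborts,
   regardless of NDEBUG. */
#define CHECK(cond)                                                     \
    do {                                                                \
        if (!(cond)) {                                                  \
            fprintf(stderr, "%s:%d: check failed: %s\n",                \
                    __FILE__, __LINE__, #cond);                         \
            abort();                                                    \
        }                                                               \
    } while (0)

/* Range-checked element access that keeps its check in production. */
static int element_at(const int *a, size_t n, size_t i)
{
    CHECK(i < n);
    return a[i];
}
```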

  • From MitchAlsup1@21:1/5 to David Brown on Wed Sep 4 18:13:17 2024
    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:
    Terje Mathisen <[email protected]> writes:

    Michael S wrote:

    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:

    3 years ago Terje Mathisen wrote that many years ago he read that
    behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983".
    So, two people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old, popular
    book that said that it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime doc?

    Specifying that it would always copy from beginning to end of the
    source buffer, in increasing address order meant that it was
    guaranteed safe when used to compact buffers.

    What does "compact buffers" mean?

    Assume a buffer consisting of records of some type, some of them
    marked as deleted. Iterating over them while removing the gaps means
    that you are always copying to a destination lower in memory, right?

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(), which
    will choose to run forwards or backwards as needed. So you might as well
    just call memmove() any time you are not sure memcpy() is safe and appropriate.

    Memmove() is always appropriate unless you are doing something
    nefarious.
    So:
    # define memcpy memmove
    and move forward with life--for the 2 extra cycles memmove costs it
    saves everyone long term grief.

    When you need the nefarious activities of memcpy write it as a
    for loop by yourself and comment the nefariousness of the use.
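The back-reference copy in LZ77-style decompressors is one example of such a deliberately overlapping copy. This is my choice of illustration, and the function name is hypothetical. It depends on a strictly ascending, byte-by-byte order. memcpy() does not guarantee that order, and memmove() deliberately avoids it by copying as if through a temporary buffer.

```c
#include <stddef.h>

/* Deliberately overlapping forward copy: copying len bytes from dist bytes
   back repeats the last dist bytes as a pattern. Relies on ascending,
   byte-at-a-time order. Precondition (caller's job): 1 <= dist <= pos. */
static void lz_copy_match(unsigned char *out, size_t pos, size_t dist, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[pos + i] = out[pos + i - dist];
}
```

For example, with "ab" at the start of the buffer, copying 6 bytes at position 2 with distance 2 yields "abababab".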

  • From Brett@21:1/5 to David Brown on Wed Sep 4 20:15:24 2024
    David Brown <[email protected]> wrote:
    On 03/09/2024 21:28, Stefan Monnier wrote:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety". I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Agreed.

    I've heard it said that the power of a programming language comes not
    from what you can do with the language, but from what you cannot do.

    Wrong, the last version of Swift added all the garbage programming concepts that one should avoid.

    You have to give people the tools to do anything.

  • From Brett@21:1/5 to [email protected] on Wed Sep 4 20:58:32 2024
    MitchAlsup1 <[email protected]> wrote:
    On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:

    David Brown <[email protected]> wrote:
    On 03/09/2024 21:28, Stefan Monnier wrote:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety". I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Agreed.

    I've heard it said that the power of a programming language comes not
    from what you can do with the language, but from what you cannot do.

    Wrong, the last version of Swift added all the garbage programming
    concepts that one should avoid.

    You have to give people the tools to do anything.

    It is impossible to create a computer programming language where
    the programmer cannot shoot himself in the foot.

    https://www-users.york.ac.uk/~ss44/joke/foot.htm

  • From MitchAlsup1@21:1/5 to Brett on Wed Sep 4 20:18:55 2024
    On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:

    David Brown <[email protected]> wrote:
    On 03/09/2024 21:28, Stefan Monnier wrote:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety". I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Agreed.

    I've heard it said that the power of a programming language comes not
    from what you can do with the language, but from what you cannot do.

    Wrong, the last version of Swift added all the garbage programming
    concepts that one should avoid.

    You have to give people the tools to do anything.

    It is impossible to create a computer programming language where
    the programmer cannot shoot himself in the foot.

  • From Michael S@21:1/5 to Thomas Koenig on Wed Sep 4 23:53:58 2024
    On Wed, 4 Sep 2024 17:25:44 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    David Brown <[email protected]> schrieb:

    I'm all in favour of temporarily having checks for overflow (and
    other errors) during debugging, but I am sceptical to having
    distinct debug/release builds. It encourages people to use debug
    builds during development, bug hunting and testing, then when all
    looks good they switch to release build and deploy it. I prefer a
    single build, and enable run-time checks on parts of it if and when necessary.

    Wise man once said...

    # It is absurd to make elaborate security checks on debugging runs,
    # when no trust is put in the results, and then remove them in
    # production runs, when an erroneous result could be expensive or
    # disastrous. What would we think of a sailing enthusiast who wears
    # his lifejacket when training on dry land, but takes it off as soon
    # as he goes to sea?

    (C.A.R. Hoare, in "Hints on Programming Language Design")

    Wise man was wrong.
    Range checks are not similar to life jackets. They do not turn an incorrect
    program into a correct one.

  • From Brett@21:1/5 to David Brown on Wed Sep 4 20:31:24 2024
    David Brown <[email protected]> wrote:
    On 04/09/2024 14:53, jseigh wrote:
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back
    to standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging),
    then your code never makes use of the results of an overflow - thus
    why is it defined behaviour?  It makes no sense.  The only time when
    you would actually see wrapping in final code is if you hadn't tested
    it properly, and then you can be pretty confident that the whole thing
    will end in tears when signs change unexpectedly.  It would be much
    more sensible to leave signed overflow undefined, and let the compiler
    optimise on that basis.


    You absolutely do want defined behavior on overflow.

    No, you absolutely do /not/ want that - for the vast majority of use-cases.

    There are times when you want wrapping behaviour, yes. More generally,
    you want modulo arithmetic rather than a model of mathematical integer arithmetic. But those cases are rare, and in C they are easily handled
    using unsigned integers.

    I tried using unsigned for a bunch of my data types that should never go negative, but every time I would have to compare them with an int somewhere
    and that would cause a compiler warning, because the goal was to also
    remove unsafe code.

    Complete and udder disaster, went back to plain sized ints.

    You can't use signed integers for them in C (except of course if you use explicit modulo and none of your intermediary results overflow int),
    because signed integer overflow is UB. You can't use signed integers
    for the purpose in Rust either, even though it is defined behaviour in release mode, because it is a run-time error in debug mode. (That's why Rust's attitude here is completely daft to me.)

    There are
    algorithms that depend on that.  Bakery algorithms for instance.
    Unless you think a real life bakery with service tickets
    numbering from 1 to 50 either never gets more than 50 customers
    in a day or closes after their 50th customer. :)

    Joe Seigh




  • From Scott Lurndal@21:1/5 to Brett on Wed Sep 4 20:59:07 2024
    Brett <[email protected]> writes:
    David Brown <[email protected]> wrote:
    On 04/09/2024 14:53, jseigh wrote:
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back to
    standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging),
    then your code never makes use of the results of an overflow - thus
    why is it defined behaviour?  It makes no sense.  The only time when
    you would actually see wrapping in final code is if you hadn't tested
    it properly, and then you can be pretty confident that the whole thing
    will end in tears when signs change unexpectedly.  It would be much
    more sensible to leave signed overflow undefined, and let the compiler
    optimise on that basis.


    You absolutely do want defined behavior on overflow.

    No, you absolutely do /not/ want that - for the vast majority of use-cases.
    There are times when you want wrapping behaviour, yes. More generally,
    you want modulo arithmetic rather than a model of mathematical integer
    arithmetic. But those cases are rare, and in C they are easily handled
    using unsigned integers.

    I tried using unsigned for a bunch of my data types that should never go
    negative, but every time I would have to compare them with an int somewhere
    and that would cause a compiler warning, because the goal was to also
    remove unsafe code.

    We use it exclusively for datatypes in the domain [0, 2**n). It's always compared against other unsigned variables or constants. Works quite well. Safer and cleaner than willy-nilly using int.

    This is in a multi-million line C++ application.


    Complete and udder disaster, went back to plain sized ints.

    s/udder/utter/

  • From Thomas Koenig@21:1/5 to Scott Lurndal on Wed Sep 4 21:02:35 2024
    Scott Lurndal <[email protected]> schrieb:

    [unsigned]

    We use it exclusively for datatypes in the domain [0, 2**n). It's always compared against other unsigned variables or constants. Works quite well. Safer and cleaner than willy-nilly using int.

    The proposal for adding an unsigned data type to Fortran, which
    I initiated and which I am currently implementing for gfortran,
    does exactly that - no comparisons of signed vs. unsigned without
    explicit conversion (and no arithmetic either).
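C has no such rule, but you can get the same effect by spelling the comparison out. Here is a sketch with a hypothetical helper name. C++20 provides std::cmp_less in <utility> for the same job.

```c
#include <stdbool.h>

/* Compares a signed and an unsigned value by their mathematical values.
   A plain 'a < b' would convert a to unsigned first, so -1 < 0u would be
   false. */
static bool int_less_than_uint(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}
```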

  • From MitchAlsup1@21:1/5 to Brett on Wed Sep 4 21:22:34 2024
    On Wed, 4 Sep 2024 20:31:24 +0000, Brett wrote:

    David Brown <[email protected]> wrote:
    On 04/09/2024 14:53, jseigh wrote:
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back to
    standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging),
    then your code never makes use of the results of an overflow - thus
    why is it defined behaviour?  It makes no sense.  The only time when
    you would actually see wrapping in final code is if you hadn't tested
    it properly, and then you can be pretty confident that the whole thing
    will end in tears when signs change unexpectedly.  It would be much
    more sensible to leave signed overflow undefined, and let the compiler
    optimise on that basis.


    You absolutely do want defined behavior on overflow.

    No, you absolutely do /not/ want that - for the vast majority of
    use-cases.

    There are times when you want wrapping behaviour, yes. More generally,
    you want modulo arithmetic rather than a model of mathematical integer
    arithmetic. But those cases are rare, and in C they are easily handled
    using unsigned integers.

    I tried using unsigned for a bunch of my data types that should never go
    negative, but every time I would have to compare them with an int
    somewhere and that would cause a compiler warning, because the goal was
    to also remove unsafe code.

    For the last 25 years I have used nothing but unsigned (other than places
    where the interface standard passes an int argument or returns an int
    result or I explicitly expect a negative number.) It has worked
    fabulously well for me.

    I would LIKE a compiler warning if it sees::

    for( int i = positive; i < something_positive; i++ )

    The warning being:: "signed loop variable should be unsigned."
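Counting down is one place where unsigned loop variables need care. Here is a sketch with a hypothetical function name.

```c
#include <stddef.h>

/* Copies a[0..n) into out[0..n) in reverse order using an unsigned index
   that counts down. The tempting 'for (size_t i = n - 1; i >= 0; i--)'
   never ends, because an unsigned i is always >= 0 (and n - 1 wraps when
   n == 0). Testing 'i-- > 0' before the body avoids both problems. */
static void reverse_copy(const int *a, int *out, size_t n)
{
    size_t j = 0;
    for (size_t i = n; i-- > 0; )
        out[j++] = a[i];
}
```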

  • From MitchAlsup1@21:1/5 to BGB on Wed Sep 4 22:50:18 2024
    On Wed, 4 Sep 2024 22:11:47 +0000, BGB wrote:

    On 9/4/2024 3:18 PM, MitchAlsup1 wrote:
    On Wed, 4 Sep 2024 20:15:24 +0000, Brett wrote:

    David Brown <[email protected]> wrote:
    On 03/09/2024 21:28, Stefan Monnier wrote:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety".  I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff".  On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Agreed.

    I've heard it said that the power of a programming language comes not
    from what you can do with the language, but from what you cannot do.

    Wrong, the last version of Swift added all the garbage programming
    concepts that one should avoid.

    You have to give people the tools to do anything.

    It is impossible to create a computer programming language where
    the programmer cannot shoot himself in the foot.

    A language could alternatively try to go in a direction like HolyC:
    Take C:
    Remove most advanced features;
    Add some weird syntax tweaks;
    Make all the types explicit sized.


    Some of it is almost half tempting, except that I would probably make
    the type-names lower-case to match with my existing usage (and save
    needing to hit SHIFT as often).

    Say:
    u0: void
    u1: _Bool
    u8: unsigned char
    u16: unsigned short
    ...
    i16/s16: signed short
    i32/s32: signed int
    i64/s64: signed long long

    I suspect that My 66000 is the only current ISA that efficiently
    supports::
    u7:
    u11:
    u15:
    u21:
    s47:
    s19:
    ..

    f32: float
    f64: double
    m32: opaque 32-bit type
    m64: opaque 64-bit type
    m128: opaque 128-bit type

    ....

    Then, say:
    u0 foo(args...)
    {
    ...
    }

    Where, args is exposed as an array of u32 or u64 depending on the target architecture.

    ....
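The following sketches how such names could be mapped onto <stdint.h> in ordinary C. All names are hypothetical and follow the list above. The exact-width types are optional in ISO C but present on mainstream platforms. Odd widths like u7 can only be approximated, for instance with bit-fields, whose layout is implementation-defined.

```c
#include <stdbool.h>
#include <stdint.h>

typedef void     u0;
typedef bool     u1;
typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int16_t  s16;
typedef int32_t  s32;
typedef int64_t  s64;
typedef float    f32;   /* assumes float is IEEE 754 binary32 */
typedef double   f64;   /* assumes double is IEEE 754 binary64 */

/* Approximating odd widths with unsigned bit-fields. */
struct odd_widths {
    unsigned int u7  : 7;
    unsigned int u21 : 21;
};

static u0 store_u7(struct odd_widths *w, u8 v)
{
    w->u7 = v;   /* stored modulo 2^7, as for any unsigned bit-field */
}
```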

  • From MitchAlsup1@21:1/5 to BGB on Thu Sep 5 00:56:22 2024
    On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

    On 9/4/2024 3:59 PM, Scott Lurndal wrote:

    Say:
    long z;
    int x, y;
    ...
    z=x*y;
    Would auto-promote to long before the multiply.

    I may have to use this as an example of C allowing the programmer
    to shoot himself in the foot; promotion or no promotion.

  • From David Brown@21:1/5 to All on Thu Sep 5 09:54:56 2024
    On 05/09/2024 02:56, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

    On 9/4/2024 3:59 PM, Scott Lurndal wrote:

    Say:
       long z;
       int x, y;
       ...
       z=x*y;
    Would auto-promote to long before the multiply.

    I may have to use this as an example of C allowing the programmer
    to shoot himself in the foot; promotion or no promotion.

    You snipped rather unfortunately here - it makes it look like this was
    code that Scott wrote, and you've removed essential context by BGB.


    While I agree it is an example of the kind of code that people sometimes
    write when they don't understand C arithmetic, I don't think it is
    C-specific. I can't think of any language off-hand where expressions
    are evaluated differently depending on types used further out in the expression. Can you give any examples of languages where the equivalent
    code would either do the multiplication as "long", or give an error so
    that the programmer would be informed of their error?



    (I don't count personal one-person languages here. They are very rarely formally or accurately specified.)
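For the C case under discussion, here is a sketch of the usual fix, with a hypothetical function name. Convert one operand before multiplying. The usual arithmetic conversions then perform the multiplication itself in 64 bits, instead of widening an int result that has already overflowed.

```c
#include <stdint.h>

/* In 'z = x * y' with int operands, the product is computed as int and
   only then converted for the assignment. Casting one operand makes the
   other one convert too, so the product is computed in 64 bits. */
static int64_t mul_wide(int32_t x, int32_t y)
{
    return (int64_t)x * y;
}
```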

  • From David Brown@21:1/5 to Brett on Thu Sep 5 09:43:37 2024
    On 04/09/2024 22:31, Brett wrote:
    David Brown <[email protected]> wrote:
    On 04/09/2024 14:53, jseigh wrote:
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back to
    standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging),
    then your code never makes use of the results of an overflow - thus
    why is it defined behaviour?  It makes no sense.  The only time when
    you would actually see wrapping in final code is if you hadn't tested
    it properly, and then you can be pretty confident that the whole thing
    will end in tears when signs change unexpectedly.  It would be much
    more sensible to leave signed overflow undefined, and let the compiler
    optimise on that basis.


    You absolutely do want defined behavior on overflow.

    No, you absolutely do /not/ want that - for the vast majority of use-cases.
    There are times when you want wrapping behaviour, yes. More generally,
    you want modulo arithmetic rather than a model of mathematical integer
    arithmetic. But those cases are rare, and in C they are easily handled
    using unsigned integers.

    I tried using unsigned for a bunch of my data types that should never go negative, but every time I would have to compare them with an int somewhere and that would cause a compiler warning, because the goal was to also
    remove unsafe code.

    Complete and udder disaster, went back to plain sized ints.


    That's a matter of choice in the warnings you pick and the style you use
    - these should match.

    However, I don't think C's integer promotion rules are ideal in regard
    to mixing signed and unsigned arithmetic - converting both to "unsigned"
    can easily lead to trouble.
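Here is a sketch of the kind of trouble meant, with hypothetical function names. When an int and an unsigned int meet in an expression, the int is converted to unsigned first, so a negative operand becomes a large value.

```c
#include <limits.h>

/* x is converted to unsigned before the addition, so -5 + 3u gives
   UINT_MAX - 1 rather than -2. */
static unsigned naive_sum(int x, unsigned y)
{
    return x + y;
}

/* Widening first keeps the mathematical value. Assumes long long can
   represent every unsigned int value (true when unsigned int is 32 bits). */
static long long exact_sum(int x, unsigned y)
{
    return (long long)x + y;
}
```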

    Some people recommend using unsigned int everywhere you can, because the overflow behaviour is defined - I think that is simply wrong. Use
    unsigned int where it is appropriate, but it is very rare (though it
    happens sometimes) that you want any arithmetic to overflow in any way.
    So the justification is wrong.

    Some people like to use unsigned int when the values will not be
    negative. I don't think that is a good idea either. In general, for
    any given use you only need a limited range of values. 0 to 10000 is
    just as much a subset of "int" as "unsigned int", and using "unsigned
    int" does not give any advantages. On the contrary, using "int" can
    give more efficient code in many places, and lets you enable warnings
    about mixed unsigned / signed operations for when you actually want them.

    Unsigned types are ideal for "raw" memory access or external data, for
    anything involving bit manipulation (use of &, |, ^, << and >> on signed
    types is usually wrong, IMHO), as building blocks in extended arithmetic
    types, for the few occasions when you want two's complement wrapping,
    and for the even fewer occasions when you actually need that last bit of
    range.
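As an example of unsigned types as building blocks for extended arithmetic, here is a sketch of 128-bit addition from two 64-bit halves. The type and function names are hypothetical. The well-defined wrap of the low half is what reveals the carry.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned value built from two 64-bit halves. */
struct wide128 {
    uint64_t lo, hi;
};

/* Adds a and b modulo 2^128. If the low half wrapped, r.lo < a.lo, and
   that comparison (0 or 1) is the carry into the high half. */
static struct wide128 wide128_add(struct wide128 a, struct wide128 b)
{
    struct wide128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}
```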

    It would be nice if C had subrange types like Pascal or Ada, but it does
    not. Usually int - or sized ints - are the practical choice.
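The nearest plain-C approximation is probably a wrapper struct with a checked constructor. Here is a sketch for a hypothetical 0..100 range. Unlike a real Pascal or Ada subrange, the check happens only at run time and only in the constructor.

```c
#include <stdbool.h>

/* Hypothetical subrange 0..100. The struct wrapper stops it mixing with
   plain int by accident. */
typedef struct {
    int value;
} percent;

/* Stores v in *out and returns true if v is in range. Otherwise returns
   false and leaves *out unchanged. */
static bool percent_make(int v, percent *out)
{
    if (v < 0 || v > 100)
        return false;
    out->value = v;
    return true;
}
```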

  • From Niklas Holsti@21:1/5 to David Brown on Thu Sep 5 11:51:32 2024
    On 2024-09-05 10:54, David Brown wrote:
    On 05/09/2024 02:56, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

    On 9/4/2024 3:59 PM, Scott Lurndal wrote:

    Say:
       long z;
       int x, y;
       ...
       z=x*y;
    Would auto-promote to long before the multiply.

    I may have to use this as an example of C allowing the programmer
    to shoot himself in the foot; promotion or no promotion.

    You snipped rather unfortunately here - it makes it look like this was
    code that Scott wrote, and you've removed essential context by BGB.


    While I agree it is an example of the kind of code that people sometimes write when they don't understand C arithmetic, I don't think it is C-specific.  I can't think of any language off-hand where expressions
    are evaluated differently depending on types used further out in the expression.  Can you give any examples of languages where the equivalent code would either do the multiplication as "long", or give an error so
    that the programmer would be informed of their error?


    The Ada language can work in both ways. If you just have:

    z : Long_Integer; -- Not a standard Ada type, but often provided.
    x, y : Integer;
    ...
    z := x * y;

    the compiler will inform you that the types in the assignment do not
    match: using the standard (predefined) operator "*", the product of two Integers gives an Integer, not a Long_Integer. If you add this
    definition to the code:

    function "*" (Left, Right : Integer) return Long_Integer
    is (Long_Integer(Left) * Long_Integer(Right));

    the compiler sees that there is now /also/ an Integer * Integer =>
    Long_Integer multiplication operator, and uses that. Function
    overloading in Ada can depend on the type expected of the result.

    Perhaps you asked for a language that worked like this "out of the box", without the programmer having to add things like the "*" function above,
    and then Ada would not qualify on the second alternative (automatic
    lengthening before multiplication, depending on the result type desired).
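    For comparison, C has no overloading on the result type, so the usual
    way to get the wide product is to convert one operand before the
    multiply; the usual arithmetic conversions then widen the other operand
    too. A minimal sketch, assuming a 32-bit int (the name `mul_wide` is
    hypothetical; `long long` is used because `long` is only 32 bits on some
    64-bit ABIs, such as 64-bit Windows):

```c
/* z = x * y with int operands multiplies in int: a product that does not
   fit overflows (undefined behaviour) before the result is widened.
   Converting one operand first makes the whole multiply happen in long long. */
static long long mul_wide(int x, int y)
{
    return (long long)x * y;   /* y is converted to long long as well */
}
```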


    (I don't count personal one-person languages here.)

    While Ada has low market penetration, I don't think it quite qualifies
    as a one-person language -- yet :-)

  • From Terje Mathisen@21:1/5 to David Brown on Thu Sep 5 11:12:04 2024
    David Brown wrote:
    On 04/09/2024 22:31, Brett wrote:
    David Brown wrote:
    On 04/09/2024 14:53, jseigh wrote:
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back
    to standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging),
    then your code never makes use of the results of an overflow - thus
    why is it defined behaviour?  It makes no sense.  The only time when
    you would actually see wrapping in final code is if you hadn't tested
    it properly, and then you can be pretty confident that the whole thing
    will end in tears when signs change unexpectedly.  It would be much
    more sensible to leave signed overflow undefined, and let the compiler
    optimise on that basis.
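    Whichever side one takes, overflow can be detected in portable C
    without relying on wrapping, by testing the operands first so that the
    overflowing addition is never evaluated. A minimal sketch; the name
    `checked_add` is hypothetical (GCC and Clang also offer
    `__builtin_add_overflow`, and C23 adds `ckd_add` in <stdckdint.h>):

```c
#include <limits.h>
#include <stdbool.h>

/* Store a + b in *out and return true, or return false (leaving *out
   untouched) if the mathematical result does not fit in an int. */
static bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;          /* would overflow: a + b is never evaluated */
    *out = a + b;
    return true;
}
```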


    You absolutely do want defined behavior on overflow.

    No, you absolutely do /not/ want that - for the vast majority of
    use-cases.

    There are times when you want wrapping behaviour, yes.  More generally,
    you want modulo arithmetic rather than a model of mathematical integer
    arithmetic.  But those cases are rare, and in C they are easily handled
    using unsigned integers.
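    A typical case where modulo arithmetic is exactly what is wanted is a
    hash function. As an illustration chosen here (not taken from the
    thread), 32-bit FNV-1a depends on `uint32_t` multiplication wrapping
    modulo 2^32, which is well defined for unsigned types:

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a byte buffer. The multiply is meant to wrap
   modulo 2^32; with uint32_t operands that is defined behaviour. */
static uint32_t fnv1a32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}
```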

    I tried using unsigned for a bunch of my data types that should never go
    negative, but every time I would have to compare them with an int
    somewhere
    and that would cause a compiler warning, because the goal was to also
    remove unsafe code.

    Complete and utter disaster, went back to plain sized ints.


    That's a matter of choice in the warnings you pick and the style you use
    - these should match.

    However, I don't think C's integer promotion rules are ideal in regard
    to mixing signed and unsigned arithmetic - converting both to "unsigned"
    can easily lead to trouble.
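    A short illustration of that trouble: in `i < u` with `int i` and
    `unsigned u`, the usual arithmetic conversions turn `i` into unsigned,
    so -1 becomes UINT_MAX and compares greater than 1. (GCC's
    -Wsign-compare, enabled by -Wextra in C, warns about such comparisons.)
    A sketch with hypothetical helper names that handles the negative case
    explicitly:

```c
#include <stdbool.h>

/* What i < u actually computes: i is converted to unsigned first,
   so naive_less(-1, 1u) is false. */
static bool naive_less(int i, unsigned u)
{
    return (unsigned)i < u;
}

/* Mathematically correct comparison of an int with an unsigned int. */
static bool int_less_uint(int i, unsigned u)
{
    return i < 0 || (unsigned)i < u;
}
```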

    Some people recommend using unsigned int everywhere you can, because the overflow behaviour is defined - I think that is simply wrong.  Use
    unsigned int where it is appropriate, but it is very rare (though it
    happens sometimes) that you want any arithmetic to overflow in any way.
    So the justification is wrong.

    Some people like to use unsigned int when the values will not be
    negative.  I don't think that is a good idea either.  In general, for
    any given use you only need a limited range of values.  0 to 10000 is
    just as much a subset of "int" as "unsigned int", and using "unsigned
    int" does not give any advantages.  On the contrary, using "int" can
    give more efficient code in many places, and lets you enable warnings
    about mixed unsigned / signed operations for when you actually want them.

    Unsigned types are ideal for "raw" memory access or external data, for anything involving bit manipulation (use of &, |, ^, << and >> on signed types is usually wrong, IMHO), as building blocks in extended arithmetic types, for the few occasions when you want two's complement wrapping,
    and for the even fewer occasions when you actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for integer-type variables, with (like Mitch) a few apis that use (-1) as an
    error signal that has to be handled with special code.


    It would be nice if C had subrange types like Pascal or Ada, but it does not.  Usually int - or sized ints - are the practical choice.


    Agreed 100%

    I wrote enough Pascal with ranged types that I got used to it, and found
    that I was missing the feature when I used C.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From John Dallman@21:1/5 to [email protected] on Thu Sep 5 13:15:00 2024
    MitchAlsup1 wrote:

    I suspect that My 66000 is the only current ISA that efficiently
    supports::
    u7:
    u11:
    u15:
    u21:
    s47:
    s19:

    Concertina II has them on the way...

    John

  • From David Brown@21:1/5 to Michael S on Thu Sep 5 15:08:44 2024
    On 04/09/2024 22:53, Michael S wrote:
    On Wed, 4 Sep 2024 17:25:44 -0000 (UTC)
    Thomas Koenig wrote:

    David Brown wrote:

    I'm all in favour of temporarily having checks for overflow (and
    other errors) during debugging, but I am sceptical to having
    distinct debug/release builds. It encourages people to use debug
    builds during development, bug hunting and testing, then when all
    looks good they switch to release build and deploy it. I prefer a
    single build, and enable run-time checks on parts of it if and when
    necessary.

    Wise man once said...

    # It is absurd to make elaborate security checks on debugging runs,
    # when no trust is put in the results, and then remove them in
    # production runs, when an erroneous result could be expensive or
    # disastrous. What would we think of a sailing enthusiast who wears
    # his lifejacket when training on dry land, but takes it off as soon
    # as he goes to sea?

    (C.A.R. Hoare, in "Hints on Programming Language Design")

    Wise man was wrong.
    Range checks are not similar to life jackets. They do not turn an
    incorrect program into a correct one.


    Wise man was right. Range checks are not intended to turn incorrect
    programs into correct ones - they are for damage mitigation. Life
    jackets don't stop you falling overboard, they stop you drowning if you
    /do/ fall overboard. The context of the quotation was "security
    checks", which is different from debugging and fault-finding.
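    In C terms, Hoare's point argues against relying only on assert(),
    which expands to nothing when NDEBUG is defined, as it often is in
    release builds. A minimal sketch of a check that stays in every build
    and leaves damage limitation to the caller (the function name and the
    bool-return convention are assumptions for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Read arr[idx] into *out only if idx is in range. Unlike assert(),
   this check is not removed by -DNDEBUG, so it also runs in production. */
static bool get_checked(const int *arr, size_t len, size_t idx, int *out)
{
    if (idx >= len)
        return false;      /* caller decides how to limit the damage */
    *out = arr[idx];
    return true;
}
```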


    For some kinds of software, you have to think about what can go wrong
    outside the context of software bugs, and what can be done about it -
    such as damage limitation. There can be external effects such as
    malicious or accidental corruption of data, hardware failures, etc.
    These are outside the scope of C, and need special treatment such as
    using "volatile" to inform the compiler that something has observable behaviour, or using inline assembly or intrinsic functions for fine
    control. And you have to accept that usually, there is no way to handle
    these things entirely in software.

  • From Anton Ertl@21:1/5 to David Brown on Thu Sep 5 11:31:02 2024
    David Brown writes:
    Anton writes code that seriously pushes the boundary of what can be
    achieved. For at least some of the things he does (such as GForth) he
    is trying to squeeze every last drop of speed out of the target. And he
    is /really/ good at it. But that means he is forever relying on nuances
    about code generation. His code, at least for efficiency if not for
    correctness, is dependent on details far beyond what is specified and
    documented for C and for the gcc compiler. He might spend a long time
    working with his code and a version of gcc, fine-tuning the details of
    his source code to get out exactly the assembly he wants from the
    compiler.

    No. We distribute Gforth as source code. It works for a wide variety
    of architectures and compilers. So unlike what you suggest and what
    some people have suggested earlier to avoid problems with new
    "optimizations" in newer releases of gcc, we don't concentrate on a
    specific version of gcc.

    Of course it is frustrating for him when the next version of
    gcc generates very different assembly from that same source, but he is
    not really programming at the level of C, and he should not expect
    consistency from C compilers like he does.

    It's normal and no problem when the next version of gcc generates
    different assembly language. There are some basic assumptions that
    our code relies on, and that mostly does not change between gcc
    versions.

    An essential assumption is that, when we have:

    A:
    C code
    B:

    ... that when we do &&A and &&B (which is documented in the GNU C
    manual), we get the addresses pointing to the start and end of the
    machine code corresponding to the C code. In the days starting with
    gcc-3.0, we found that gcc started reordering the basic blocks within
    loops, so replaced loops in the part of the code that needs such
    assumptions into separate functions. Around gcc-7, gcc started to
    compile

    A: C-code1
    B: C-code2
    C: goto *...

    to the same code as

    A: C-code1; C-code2; goto *...;
    B: C-code2; goto *...;
    C: goto *...;

    I found a workaround that avoids this kind of code generation.

    Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
    that gcc compiled

    goto *ca;

    into the equivalent of

    goto gotoca;

    /* and elsewhere */
    gotoca: goto *ca;

    We reported that repeatedly. At one point a gcc maintainer gave us
    some bullshit about a possible performance advantage from this
    transformation, of course without presenting any empirical support,
    while we saw a big slowdown on our code. We developed workarounds for
    that, and they are in Gforth to this day, even though we have not
    encountered a new gcc version with this problem for over a decade, but
    new Gforth should also work on old gcc.
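    For readers unfamiliar with the GNU C extension these assumptions rest
    on: `&&label` yields the address of a label as a `void *`, and
    `goto *expr` jumps to such an address. A toy direct-threaded
    interpreter that computes (2 + 3) * 4 (an illustration only, not
    Gforth's code; it needs GCC or Clang, and all names are hypothetical):

```c
/* One cell of threaded code: either a label address or an inline operand. */
typedef union { void *op; long n; } cell;

static long run_demo(void)
{
    cell prog[9];
    long stack[8];
    long *sp = stack;          /* points one past the top of the stack */
    cell *ip = prog;           /* instruction pointer into prog */

    /* Build the program at run time; label addresses can only be
       taken inside the function that contains the labels. */
    prog[0].op = &&lit;  prog[1].n = 2;
    prog[2].op = &&lit;  prog[3].n = 3;
    prog[4].op = &&add;
    prog[5].op = &&lit;  prog[6].n = 4;
    prog[7].op = &&mul;
    prog[8].op = &&halt;

/* Dispatch: fetch the next cell and jump to the label it holds. */
#define NEXT goto *(ip++)->op
    NEXT;

lit:                           /* push the inline operand */
    *sp++ = (ip++)->n;
    NEXT;
add:                           /* pop two, push their sum */
    sp--;
    sp[-1] += sp[0];
    NEXT;
mul:                           /* pop two, push their product */
    sp--;
    sp[-1] *= sp[0];
    NEXT;
halt:
#undef NEXT
    return sp[-1];
}
```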

    Another assumption is that when we concatenate the code snippet
    between label A and B (which contains C-code1) and the code snippet
    between label X and Y (which contains C-code3), executing the result
    will behave like the concatenation of C-code1 and C-code3 in source
    code. This assumption has two aspects:

    1) Do the register assignments at the labels fit together. It turns
    out that we never had a problem with that, and I think that the
    reason for that is that the "goto *" can jump to any of those
    labels (all their addresses are taken), and so the register
    assignment must be the same right after each label.

    What guarantees that the assignments are the same right before each
    label? Probably that after the label, there is not much between
    the label and the next goto*, and that makes all registers at
    potential targets live.

    2) If we have two pieces of machine code produced in this fashion,
    does the architecture guarantee that such a concatenation works?
    It turns out that in general-purpose architectures, all-but-one do.
    That includes IA-64. The exception is MIPS with its architectural
    load-delay slot (and there are also scheduling restrictions having
    to do with the hilo register that may be problematic): the first
    code snippet may end in a load, and the next code snippet may start
    with an instruction that reads the result of the load. So we just
    disabled this concatenation on MIPS.

    We do a number of things to achieve stability: We do sanity-checking
    on the resulting machine code snippets and fall back to plain threaded
    code if the snippets turn out not to be relocatable.

    Also, we enable all the flags for defining behaviour in gcc that we
    find (unfortunately, in the documentation they are intermixed with
    other options). For good measure, this includes -fno-delete-null-pointer-checks, although I doubt that it makes a
    difference for our code either way.

    One thing that came up about a year ago was that gcc auto-vectorizes
    adjacent memory accesses on AMD64 (apparently the AMD64 port
    maintainers are unhappy because AMD64 does not have instructions like
    ARM A64's ldp and stp:-), which did not impact correctness, but led to
    worse performance (not just in Gforth; I have also seen it in the
    bubble benchmark from John Hennessy's Stanford small integer
    benchmarks; I'm sure there is some SPEC benchmark that benefits). A
    quick addition of -fno-tree-vectorize fixed that.

    We have been thinking about moving from C to a better-defined
    language, namely assembly language, but have not yet taken the plunge,
    and it has not been necessary yet. Gcc has not been as crazy in our
    experience as the UB rhetoric might make one think. Why is that? I
    think the reasons are:

    1) Gforth and a lot of other "irrelevant" (to the gcc maintainers)
    projects sail in the slipstream of "relevant" code like SPEC and
    the Linux kernel that are all full of undefined behaviour (Linux
    defines many of them with flags, like Gforth does), so gcc does not
    "optimize" as crazily as a UB fan might wish.

    2) The code snippets are very short, with many in-edges on the
    preceding and following label, which tends to destroy any
    "knowledge" that the compiler might have derived from the
    assumption that the program does not exercise undefined behaviour.
    This severely limits the distance over which such "optimizations"
    can be performed.

    Nevertheless, the last time I tried what happens if I compile without
    the behaviour-defining options, the result did not work; I did not
    investigate this further.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup

  • From David Brown@21:1/5 to Niklas Holsti on Thu Sep 5 15:27:47 2024
    On 05/09/2024 10:51, Niklas Holsti wrote:
    On 2024-09-05 10:54, David Brown wrote:
    On 05/09/2024 02:56, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

    On 9/4/2024 3:59 PM, Scott Lurndal wrote:

    Say:
       long z;
       int x, y;
       ...
       z=x*y;
    Would auto-promote to long before the multiply.

    I may have to use this as an example of C allowing the programmer
    to shoot himself in the foot; promotion or no promotion.

    You snipped rather unfortunately here - it makes it look like this was
    code that Scott wrote, and you've removed essential context by BGB.


    While I agree it is an example of the kind of code that people
    sometimes write when they don't understand C arithmetic, I don't think
    it is C-specific.  I can't think of any language off-hand where
    expressions are evaluated differently depending on types used further
    out in the expression.  Can you give any examples of languages where
    the equivalent code would either do the multiplication as "long", or
    give an error so that the programmer would be informed of their error?


    The Ada language can work in both ways. If you just have:

       z : Long_Integer;  -- Not a standard Ada type, but often provided.
       x, y : Integer;
       ...
       z := x * y;

    the compiler will inform you that the types in the assignment do not
    match: using the standard (predefined) operator "*", the product of two Integers gives an Integer, not a Long_Integer.

    That seems like a safe choice. C's implicit promotion of int to long
    int can be convenient, but convenience is sometimes at odds with safety.


    If you add this
    definition to the code:

       function "*" (Left, Right : Integer) return Long_Integer
       is (Long_Integer(Left) * Long_Integer(Right));

    the compiler sees that there is now /also/ an Integer * Integer => Long_Integer multiplication operator, and uses that. Function
    overloading in Ada can depend on the type expected of the result.


    You can make types in C++ that have this effect, but you have to make
    them and use them consistently. You can't overload operators on
    standard types like that.

    Perhaps you asked for a language that worked like this "out of the box", without the programmer having to add things like the "*" function above,
    and then Ada would not qualify on the second alternative (automatic lengthening before multiplication, depending on the result type desired).


    I asked for either, and you gave me both :-)


    (I don't count personal one-person languages here.)

    While Ada has low market penetration, I don't think it quite qualifies
    as a one-person language -- yet :-)


  • From David Brown@21:1/5 to Terje Mathisen on Thu Sep 5 15:29:47 2024
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    On 04/09/2024 22:31, Brett wrote:
    David Brown wrote:
    On 04/09/2024 14:53, jseigh wrote:
    On 9/4/24 06:57, David Brown wrote:
    On 04/09/2024 09:15, Terje Mathisen wrote:
    David Brown wrote:

    Maybe?

    Rust will _always_ check for such overflow in debug builds, then when
    you've determined that they don't occur, the release build falls back
    to standard CPU behavior, i.e. wrapping around with no panics.

    But if you've determined that they do not occur (during debugging),
    then your code never makes use of the results of an overflow - thus
    why is it defined behaviour?  It makes no sense.  The only time when
    you would actually see wrapping in final code is if you hadn't tested
    it properly, and then you can be pretty confident that the whole thing
    will end in tears when signs change unexpectedly.  It would be much
    more sensible to leave signed overflow undefined, and let the compiler
    optimise on that basis.


    You absolutely do want defined behavior on overflow.

    No, you absolutely do /not/ want that - for the vast majority of
    use-cases.

    There are times when you want wrapping behaviour, yes.  More generally,
    you want modulo arithmetic rather than a model of mathematical integer
    arithmetic.  But those cases are rare, and in C they are easily handled
    using unsigned integers.

    I tried using unsigned for a bunch of my data types that should never go
    negative, but every time I would have to compare them with an int
    somewhere
    and that would cause a compiler warning, because the goal was to also
    remove unsafe code.

    Complete and utter disaster, went back to plain sized ints.


    That's a matter of choice in the warnings you pick and the style you
    use - these should match.

    However, I don't think C's integer promotion rules are ideal in regard
    to mixing signed and unsigned arithmetic - converting both to
    "unsigned" can easily lead to trouble.

    Some people recommend using unsigned int everywhere you can, because
    the overflow behaviour is defined - I think that is simply wrong.  Use
    unsigned int where it is appropriate, but it is very rare (though it
    happens sometimes) that you want any arithmetic to overflow in any
    way. So the justification is wrong.

    Some people like to use unsigned int when the values will not be
    negative.  I don't think that is a good idea either.  In general, for
    any given use you only need a limited range of values.  0 to 10000 is
    just as much a subset of "int" as "unsigned int", and using "unsigned
    int" does not give any advantages.  On the contrary, using "int" can
    give more efficient code in many places, and lets you enable warnings
    about mixed unsigned / signed operations for when you actually want them.

    Unsigned types are ideal for "raw" memory access or external data, for
    anything involving bit manipulation (use of &, |, ^, << and >> on
    signed types is usually wrong, IMHO), as building blocks in extended
    arithmetic types, for the few occasions when you want two's complement
    wrapping, and for the even fewer occasions when you actually need that
    last bit of range.

    That last paragraph enumerates pretty much all the uses I have for integer-type variables, with (like Mitch) a few apis that use (-1) as an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?


    It would be nice if C had subrange types like Pascal or Ada, but it
    does not.  Usually int - or sized ints - are the practical choice.


    Agreed 100%

    I wrote enough Pascal with ranged types that I got used to it, and found
    that I was missing the feature when I used C.

    Terje


  • From David Brown@21:1/5 to Brett on Thu Sep 5 15:35:59 2024
    On 04/09/2024 22:15, Brett wrote:
    David Brown wrote:
    On 03/09/2024 21:28, Stefan Monnier wrote:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety". I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Agreed.

    I've heard it said that the power of a programming language comes not
    from what you can do with the language, but from what you cannot do.

    Wrong, the last version of Swift added all the garbage programming concepts that one should avoid.


    That does not show that I was wrong - perhaps Swift is not a powerful programming language!

    Of course, it all depends on what you mean by "powerful".

    (I don't know Swift at all.)

    You have to give people the tools to do anything.


    You don't /have/ to do that. But it's often easier to market a language
    that can do anything.

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Sep 5 13:19:59 2024
    Stefan Monnier writes:
    Specifications are an agreement between the supplier and the client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    For programs there is no conformance level "free from UB" in the C
    standard. There are two conformance levels for programs:

    1) A strictly conforming program shall use only those features of the
    language and library specified in this International Standard.
    This excludes all programs that terminate, including the "Hello,
    World" program. And of course it also excludes pretty much all
    non-terminating programs.

    2) A conforming program is one that is acceptable to a conforming
    implementation. So if, say, gcc-10 is a conforming implementation
    (and I think that it claims so), and it accepts your program, your
    program is a conforming program.

    One first would have to agree on whether the program should be
    conforming or strictly conforming. In the "strictly conforming" case,
    it is indeed hard to write any useful code (I find it even hard to
    think of a useful non-terminating program that uses only features
    specified in the C standard). OTOH, conforming programs
    include many that exercise undefined, unspecified, or
    implementation-defined behaviour, so in that case the C standard does
    not serve as specification.

    In either case, treating the C standard as agreement is nonsense.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup

  • From David Brown@21:1/5 to All on Thu Sep 5 15:48:37 2024
    On 04/09/2024 20:13, MitchAlsup1 wrote:
    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(), which
    will chose to run forwards or backwards as needed.  So you might as well
    just call memmove() any time you are not sure memcpy() is safe and
    appropriate.

    Memmove() is always appropriate unless you are doing something
    nefarious.
    So:
    # define memcpy memmove
    and move forward with life--for the 2 extra cycles memmove costs it
    saves everyone long term grief.


    Or just use memmove, and not memcpy, whenever you are moving stuff
    around in the same buffer.

    When you need the nefarious activities of memcpy write it as a
    for loop by yourself and comment the nefariousness of the use.

    memcpy is not nefarious. It's quite simple, and does what it says on
    the tin. Use it when you want to copy non-overlapping memory areas.
    Don't use it if you want to do something other than that. I have never understood why anyone would find this difficult.

  • From David Brown@21:1/5 to All on Thu Sep 5 16:01:26 2024
    On 04/09/2024 18:32, MitchAlsup1 wrote:
    On Wed, 4 Sep 2024 7:29:01 +0000, David Brown wrote:

    On 03/09/2024 22:22, MitchAlsup1 wrote:
    On Tue, 3 Sep 2024 19:30:21 +0000, Stefan Monnier wrote:

    Specifications are an agreement between the supplier and the
    client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    # define int int64_t
    ..

    makes it easier.

    That's UB, I believe :-)  And it will certainly be confusing.

    On 64-bit machines it re-establishes the dusty-deck old K&R C where
    int was the fastest integer type.


    No, it does not. It gives you an inconsistent mess and opens up all
    sorts of potential complications when interacting with code that uses
    "int" properly. It does not help you avoid UB - it creates a lot more potential for mistakes.

    Now, if you had suggested that we'd have been better off if the powers
    that be had made int 64-bit on 64-bit targets, then it would be a very
    different matter. It would reduce the risk of UB from signed integer
    overflow quite considerably - few numbers are big enough to overflow 64
    bits without being so big that you are using multi-precision numerics
    libraries anyway.

    It would also mean that a lot of existing code that incorrectly or
    non-portably assumes "int" is 32-bit, would fail to work on the new systems.

    We can't change existing non-portable code. We can't change existing
    ABI's for 64-bit targets. Slapping a #define band-aid on the code will
    not fix anything.

    A better answer is to use int_fastNN_t types in your own code, picking a
    size that matches what you actually need. (Perhaps limit it to 32 or
    64, to be portable to most systems - or just 64 if you really are sure
    the code will not be used on smaller targets.) int_fast32_t and
    int_fast64_t are both 64-bit on x86-64.
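    A small sketch of that advice, with a hypothetical function name: the
    loop counter only needs 32 bits, while the running total can exceed
    2^31 and so gets a 64-bit type. The exact widths of the int_fastNN_t
    types are ABI-specific; only the minimum widths are guaranteed.

```c
#include <stdint.h>

/* Sum 1..n. n fits comfortably in 32 bits, but the sum may not
   (n = 100000 gives 5000050000), so the accumulator is 64-bit. */
static int_fast64_t sum_to(int_fast32_t n)
{
    int_fast64_t total = 0;
    for (int_fast32_t i = 1; i <= n; i++)
        total += i;
    return total;
}
```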

  • From Scott Lurndal@21:1/5 to David Brown on Thu Sep 5 14:06:45 2024
    David Brown writes:
    On 05/09/2024 11:12, Terje Mathisen wrote:

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as an
    error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    We do. There is no issue using unsigned loop counters, array
    indices are always positive and unsigned integer arithmetic works
    just fine in our application.

  • From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Sep 5 14:05:08 2024
    Terje Mathisen writes:
    David Brown wrote:

    Unsigned types are ideal for "raw" memory access or external data, for
    anything involving bit manipulation (use of &, |, ^, << and >> on signed
    types is usually wrong, IMHO), as building blocks in extended arithmetic
    types, for the few occasions when you want two's complement wrapping,
    and for the even fewer occasions when you actually need that last bit of
    range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as an
    error signal that has to be handled with special code.

    Same here.


    It would be nice if C had subrange types like Pascal or Ada, but it does
    not.  Usually int - or sized ints - are the practical choice.


    Agreed 100%

    Although absent architecture support, how does one ensure that the
    value remains within the subrange?
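    Absent such support, the check has to be done in software wherever a
    value enters the type. A minimal C sketch emulating a Pascal-style
    subrange 0..100 (the type and function names are hypothetical);
    wrapping the value in a struct keeps it from mixing silently with plain
    ints, though C cannot stop code from writing the member directly:

```c
#include <stdbool.h>

/* Emulated subrange type 0..100; construct only through percent_make. */
typedef struct { int value; } percent;

enum { PERCENT_MIN = 0, PERCENT_MAX = 100 };

/* Range check at construction: the software equivalent of the check
   Pascal or Ada compilers typically insert on assignment to a subrange. */
static bool percent_make(int v, percent *out)
{
    if (v < PERCENT_MIN || v > PERCENT_MAX)
        return false;
    out->value = v;
    return true;
}
```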


    I wrote enough Pascal with ranged types that I got used to it, and found
    that I was missing the feature when I used C.

    On the Burroughs Medium Systems, which addressed to the digit/nibble,
    ranged types (up to 100 digits/bytes) were de rigueur. Natural types
    for COBOL.

  • From Thomas Koenig@21:1/5 to Anton Ertl on Thu Sep 5 14:49:42 2024
    Anton Ertl wrote:

    In either case, treating the C standard as agreement is nonsense.

    That's a good summary of your attitude. Using this argument with
    compiler writers will get you precisely nowhere, but you already
    have experience with that.

    Now for a challenge: Try to specify the behavior of any piece of
    undefined behavior whose treatment by compilers you object to in
    a way that a compiler writer can follow it. Think that it could
    be published as an annex to the standard.

    What could this be?

  • From Thomas Koenig@21:1/5 to Scott Lurndal on Thu Sep 5 14:58:54 2024
    Scott Lurndal wrote:
    David Brown writes:
    On 05/09/2024 11:12, Terje Mathisen wrote:

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as an
    error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    We do. There is no issue using unsigned loop counters,

    I find counting down from n to 0 using unsigned variables
    unintuitive. Or do you always count up and then calculate
    what you actually use? Induction variable optimization
    should take care of that, but it would be more complicated
    to use.
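    One common idiom counts down to 0 with an unsigned variable directly,
    without deriving the index from an upward counter. It folds the test and
    the decrement into the loop condition. This is a minimal sketch; the
    function name is hypothetical, and the loop builds a number from the
    elements so the visiting order is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Visit data[n-1], data[n-2], ..., data[0] using an unsigned index.
   "i-- > 0" tests the old value of i and then decrements it, so the body
   sees n-1 down to 0. After the final test i wraps to SIZE_MAX, which is
   well-defined for unsigned types, and the value is never used. */
static long digits_backwards(const int *data, size_t n) {
    long h = 0;
    for (size_t i = n; i-- > 0; )
        h = h * 10 + data[i];   /* append each element as a decimal digit */
    return h;
}
```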

  • From David Brown@21:1/5 to Anton Ertl on Thu Sep 5 17:09:19 2024
    On 05/09/2024 13:31, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    Anton writes code that seriously pushes the boundary of what can be
    achieved. For at least some of the things he does (such as GForth) he
    is trying to squeeze every last drop of speed out of the target. And he
    is /really/ good at it. But that means he is forever relying on nuances
    about code generation. His code, at least for efficiency if not for
    correctness, is dependent on details far beyond what is specified and
    documented for C and for the gcc compiler. He might spend a long time
    working with his code and a version of gcc, fine-tuning the details of
    his source code to get out exactly the assembly he wants from the
    compiler.

    No. We distribute Gforth as source code. It works for a wide variety
    of architectures and compilers. So unlike what you suggest and what
    some people have suggested earlier to avoid problems with new
    "optimizations" in newer releases of gcc, we don't concentrate on a
    specific version of gcc.


    OK.

    Of course it is frustrating for him when the next version of
    gcc generates very different assembly from that same source, but he is
    not really programming at the level of C, and he should not expect
    consistency from C compilers like he does.

    It's normal and no problem when the next version of gcc generates
    different assembly language. There are some basic assumptions that
    our code relies on, and that mostly does not change between gcc
    versions.


    As long as you are sticking to defined behaviour (defined by the C
    standards, or by the gcc documentation), and use specified C standard
    versions in the build, then there should not be any incorrect behaviour
    in different versions. Performance might regress, and of course there's
    always the risk of bugs.

    An essential assumption is that, when we have:

    A:
    C code
    B:

    ... that when we do &&A and &&B (which is documented in the GNU C
    manual), we get the addresses pointing to the start and end of the
    machine code corresponding to the C code.
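    For readers unfamiliar with the extension, here is a minimal sketch of
    GNU C labels-as-values used for threaded dispatch. It relies only on the
    documented behavior: `&&label` yields an address that `goto *` can jump
    to. The tiny instruction set and `run_program` are hypothetical, and the
    code needs GCC or Clang.

```c
#include <assert.h>

/* Each "instruction" is a label address (GNU C's unary &&), and every
   handler ends with its own "goto *" dispatch, i.e. threaded code. */
static int run_program(int start) {
    /* Label addresses are only meaningful inside this function. */
    static void *const ops[] = { &&op_inc, &&op_double, &&op_halt };
    enum { INC, DOUBLE, HALT };
    const unsigned char program[] = { INC, DOUBLE, INC, HALT };
    const unsigned char *ip = program;
    int acc = start;

    goto *ops[*ip++];          /* dispatch the first instruction */

op_inc:
    acc += 1;
    goto *ops[*ip++];
op_double:
    acc *= 2;
    goto *ops[*ip++];
op_halt:
    return acc;
}
```

    Nothing here depends on where the machine code for `op_inc` starts or
    ends. Gforth's code copying, described in this thread, assumes more than
    this.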

    I don't see anything in the gcc reference manual suggesting that &&B is
    the end of the corresponding code. What you get - all you get - is that
    "goto * &&A" gives the same effect as "goto A".

    In the days starting with
    gcc-3.0, we found that gcc started reordering the basic blocks within
    loops, so replaced loops in the part of the code that needs such
    assumptions into separate functions. Around gcc-7, gcc started to
    compile

    A: C-code1
    B: C-code2
    C: goto *...

    to the same code as

    A: C-code1; C-code2; goto *...;
    B: C-code2; goto *...;
    C: goto *...;

    I found a workaround that avoids this kind of code generation.

    This is all the kind of thing you can expect when you make assumptions
    about code generation that are not supported by the documentation.
    Compilers can, and do, move code around in various ways, duplicate it,
    combine it, unroll it, compress it - whatever gives (or tries to give - optimisation is not an exact science) better results while giving the documented behaviour.

    I too have written code that relies on being able to identify the start
    and end of certain bits of code - typically for microcontrollers where
    you want some bits of code (like flash programming routines or very
    timing critical interrupt code) put in ram rather than flash. Sometimes
    that can be done with compiler extensions, sometimes it takes extra
    flags, linker file magic, or other messing around. But it's not
    something I would expect to be portable, and it needs to be confirmed for
    every compiler version and selection of flags used. (I realise that
    this is a vastly simpler task for the kind of work I do than for an open
    source project!)


    Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
    that gcc compiled

    goto *ca;

    into the equivalent of

    goto gotoca;

    /* and elsewhere */
    gotoca: goto *ca;

    We reported that repeatedly. At one point a gcc maintainer gave us
    some bullshit about a possible performance advantage from this transformation, of course without presenting any empirical support,
    while we saw a big slowdown on our code. We developed workarounds for
    that, and they are in Gforth to this day, even though we have not
    encountered a new gcc version with this problem for over a decade, but
    new Gforth should also work on old gcc.


    Again, the compiler is not doing anything outside its specifications.
    What you want here is a guarantee of behaviour that is not defined
    anywhere. You are not seeing a bug in the compiler, or an
    incompatibility with previous versions - you are seeing the need for a
    feature (and a controlling compiler flag) that gcc does not currently
    have. It's a potential feature that might be useful to other people
    too, while being an anti-feature to others.

    Another assumption is that when we concatenate the code snippet
    between label A and B (which contains C-code1) and the code snippet
    between label X and Y (which contains C-code3), executing the result
    will behave like the concatenation of C-code1 and C-code3 in source
    code. This assumption has two aspects:

    1) Do the register assignments at the labels fit together. It turns
    out that we never had a problem with that, and I think that the
    reason for that is that the "goto *" can jump to any of those
    labels (all their addresses are taken), and so the register
    assignment must be the same right after each label.

    What guarantees that the assignments are the same right before each
    label? Probably that after the label, there is not much between
    the label and the next goto*, and that makes all registers at
    potential targets live.

    2) If we have two pieces of machine code produced in this fashion,
    does the architecture guarantee that such a concatenation works?
    It turns out that in general-purpose architectures, all-but-one do.
    That includes IA-64. The exception is MIPS with its architectural
    load-delay slot (and there are also scheduling restrictions having
    to do with the hilo register that may be problematic): the first
    code snippet may end in a load, and the next code snippet may start
    with an instruction that reads the result of the load. So we just
    disabled this concatenation on MIPS.

    We do a number of things to achieve stability: We do sanity-checking
    on the resulting machine code snippets and fall back to plain threaded
    code if the snippets turn out not to be relocatable.

    Also, we enable all the flags for defining behaviour in gcc that we
    find (unfortunately, in the documentation they are intermixed with
    other options). For good measure, this includes -fno-delete-null-pointer-checks, although I doubt that it makes a
    difference for our code either way.


    (-fno-delete-null-pointer-checks will make no difference to code that
    doesn't accidentally use leap-before-you-look checking.)
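    Here is a minimal sketch of the "leap before you look" pattern this flag
    affects, using a hypothetical `struct node`. In the first function the
    dereference comes before the NULL test, so GCC may infer `p != NULL` and
    delete the test. In the second function the order is reversed, and the
    code is well-defined with or without the flag.

```c
#include <assert.h>
#include <stddef.h>

struct node { int value; };

/* Leap before you look: p is dereferenced first. If p could be NULL this
   is undefined, so the compiler may assume p != NULL and remove the test
   below (-fno-delete-null-pointer-checks tells GCC not to). */
static int leap_then_look(const struct node *p) {
    int v = p->value;
    if (p == NULL)
        return -1;             /* candidate for deletion by the optimizer */
    return v;
}

/* Look before you leap: test first, then dereference. */
static int look_then_leap(const struct node *p) {
    if (p == NULL)
        return -1;
    return p->value;
}
```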

    There are certainly a few cases (-fno-strict-aliasing is a prime
    example) where flags are documented as disabling optimisations, when
    they are better viewed as adding definitions to the language and would
    be better documented under "Options Controlling C Dialect" or "Options
    for Code Generation Conventions".
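    For comparison, this is a minimal sketch of the kind of code usually
    built with -fno-strict-aliasing, alongside its standard-conforming
    replacement. It assumes a 32-bit IEEE 754 `float`, and the function name
    is illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits (IEEE 754 binary32 on common targets). */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Aliasing-violating version, typically paired with -fno-strict-aliasing:
       uint32_t bits = *(uint32_t *)&f;
   Conforming version: copy the object representation with memcpy, which
   GCC and Clang normally reduce to a single register move when optimizing. */
static uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```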

    One thing that came up about a year ago was that gcc auto-vectorizes
    adjacent memory accesses on AMD64 (apparently the AMD64 port
    maintainers are unhappy because AMD64 does not have instructions like
    ARM A64's ldp and stp:-), which did not impact correctness, but led to
    worse performance (not just in Gforth; I have also seen it in the
    bubble benchmark from John Hennessy's Stanford small integer
    benchmarks; I'm sure there is some SPEC benchmark that benefits). A
    quick addition of -fno-tree-vectorize fixed that.


    That happens sometimes. In my brief testing of clang, it often seems a
    bit too keen on vectorising code that would be better kept short and
    simple. I have no doubt gcc gets that wrong sometimes too.

    We have been thinking about moving from C to a better-defined
    language, namely assembly language, but have not yet taken the plunge,
    and it has not been necessary yet. Gcc has not been as crazy in our experience as the UB rhetoric might make one think. Why is that? I
    think the reasons are:

    1) Gforth and a lot of other "irrelevant" (to the gcc maintainers)
    projects sail in the slipstream of "relevant" code like SPEC and
    the Linux kernel that are all full of undefined behaviour (Linux
    defines many of them with flags, like Gforth does), so gcc does not
    "optimize" as crazily as a UB fan might wish.

    2) The code snippets are very short, with many in-edges on the
    preceding and following label, which tends to destroy any
    "knowledge" that the compiler might have derived from the
    assumption that the program does not exercise undefined behaviour.
    This severely limits the distance over which such "optimizations"
    can be performed.

    Nevertheless, the last time I tried what happens if I compile without
    the behaviour-defining options, the result did not work; I did not investigate this further.


    You are looking for more than C and the gcc documented extensions give
    you. That is always going to be hard.

    Ideally, you need a new gcc flag or two with documented and guaranteed
    effects to give you the assurance you need for your code. That's going
    to take a lot of effort, I would expect, and I can see it being hard for
    a relatively niche project like Gforth to push for that. Linux has the
    backing here to push for changes - even if Linus Torvalds rants and
    insults the gcc developers, IBM and friends can still pay gcc developers
    to make the changes he wants.

    Thank you for your explanation of your needs here, and information about
    how your code works. I'm afraid I can't do anything to help, but it
    helps me understand where you are coming from.

    (I'm still a fan of the principle of UB, and of compilers using
    knowledge of UB for optimisation - but that does not mean I can't
    sympathise with people who find that frustrating and who see things differently.)

  • From Terje Mathisen@21:1/5 to David Brown on Thu Sep 5 19:04:41 2024
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >> on
    signed types is usually wrong, IMHO), as building blocks in extended
    arithmetic types, for the few occasions when you want two's
    complement wrapping, and for the even fewer occasions when you
    actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as
    an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
    with unsigned i, arrays always use a zero base so in Rust the only array
    index type is usize, i.e the largest supported unsigned type in the
    system, typically the same as u64.

    unsigned arithmetic is easier than signed integer arithmetic, including comparisons that would result in a negative value, you just have to make
    the test before subtracting, instead of checking if the result was negative.
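    Here is a minimal sketch of "test before subtracting" with unsigned
    operands; the function names are hypothetical. With unsigned operands,
    the naive subtraction never yields a negative number. It wraps modulo
    2^N instead.

```c
#include <assert.h>
#include <limits.h>

/* Naive unsigned subtraction: when b > a the result wraps modulo
   UINT_MAX + 1 (e.g. 3u - 5u == UINT_MAX - 1u) instead of going negative. */
static unsigned naive_sub(unsigned a, unsigned b) {
    return a - b;
}

/* Compare first, subtract only when the result is representable.
   Here a would-be negative result becomes 0; a caller could equally
   treat that case as an error. */
static unsigned sub_or_zero(unsigned a, unsigned b) {
    return (a >= b) ? a - b : 0u;
}
```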

    I.e. I cannot easily replicate a downward loop that exits when the
    counter becomes negative:

      for (int i = START; i >= 0; i-- ) {
        // Do something with data[i]
      }

    One of my alternatives is

      unsigned u = start; // Cannot be less than zero
      if (u) {
        u++;
        do {
          u--;
          // Do something with data[u]
        } while (u);
      }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump if Greater or Equal, signed) instead
    of JA (Jump if Above, unsigned), but my version is far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

    for (unsigned u = start; (u & TOPBIT) == 0; u--)
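    Filled out as a compilable function, the loop looks like this. The sketch
    assumes a hypothetical `TOPBIT` defined as the most significant bit of
    `unsigned`. Here `start` is the first index visited (inclusive) and must
    be below `TOPBIT`.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical definition: the most significant bit of unsigned int. */
#define TOPBIT (UINT_MAX ^ (UINT_MAX >> 1))

/* Visit data[start], data[start-1], ..., data[0]. When u-- wraps from 0
   to UINT_MAX the top bit becomes set and the loop ends. Requires
   start < TOPBIT, i.e. the top bit of the counter is never needed. */
static long digits_topbit(const int *data, unsigned start) {
    long h = 0;
    for (unsigned u = start; (u & TOPBIT) == 0; u--)
        h = h * 10 + data[u];   /* append each element as a decimal digit */
    return h;
}
```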

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Thu Sep 5 17:15:42 2024
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:
    David Brown <[email protected]> writes:
    On 05/09/2024 11:12, Terje Mathisen wrote:

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as an
    error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    We do. There is no issue using unsigned loop counters,

    I find counting down from n to 0 using unsigned variables
    unintuitive. Or do you always count up and then calculate
    what you actually use? Induction variable optimization
    should take care of that, but it would be more complicated
    to use.

    Just checked current project; out of some 5000 'for' loops, only two counted backwards, and for those two, terminating at zero worked algorithmically.
    About 20% of loops were iterating using standard C++ iterators,
    the rest were size_t or other unsigned integer types.

  • From Anton Ertl@21:1/5 to David Brown on Thu Sep 5 15:49:39 2024
    David Brown <[email protected]> writes:
    On 05/09/2024 13:31, Anton Ertl wrote:
    It's normal and no problem when the next version of gcc generates
    different assembly language. There are some basic assumptions that
    our code relies on, and that mostly does not change between gcc
    versions.
    ...
    In the days starting with
    gcc-3.0, we found that gcc started reordering the basic blocks within
    loops, so replaced loops in the part of the code that needs such
    assumptions into separate functions. Around gcc-7, gcc started to
    compile

    A: C-code1
    B: C-code2
    C: goto *...

    to the same code as

    A: C-code1; C-code2; goto *...;
    B: C-code2; goto *...;
    C: goto *...;

    I found a workaround that avoids this kind of code generation.

    This is all the kind of thing you can expect when you make assumptions
    about code generation that are not supported by the documentation.

    Nobody said that gcc did anything wrong here. We were, however,
    surprised that -fno-reorder-blocks did not suppress the reordering; we
    reported this as bug, but were told that this option does something
    different from what it says. Anyway, we developed a workaround. And
    we also developed a workaround for the code duplication problem that
    showed up in gcc-7.

    I too have written code that relies on being able to identify the start
    and end of certain bits of code - typically for microcontrollers where
    you want some bits of code (like flash programming routines or very
    timing critical interrupt code) put in ram rather than flash. Sometimes
    that can be done with compiler extensions, sometimes it takes extra
    flags, linker file magic, or other messing around. But it's not
    something I would expect to be portable, and it needs to be confirmed for
    every compiler version and selection of flags used. (I realise that
    this is a vastly simpler task for the kind of work I do than for an open
    source project!)

    Between what we developed for gcc-3.2 (released 2002) in 2003 and
    today, the only new development in these 21 years was the code
    duplication in gcc-7 and the workaround for that. IIRC Gforth also
    worked without that workaround, but was slower.

    Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
    that gcc compiled

    goto *ca;

    into the equivalent of

    goto gotoca;

    /* and elsewhere */
    gotoca: goto *ca;

    We reported that repeatedly. At one point a gcc maintainer gave us
    some bullshit about a possible performance advantage from this
    transformation, of course without presenting any empirical support,
    while we saw a big slowdown on our code. We developed workarounds for
    that, and they are in Gforth to this day, even though we have not
    encountered a new gcc version with this problem for over a decade, but
    new Gforth should also work on old gcc.


    Again, the compiler is not doing anything outside its specifications.

    Nobody said it did. We did, however, report this as a pessimization repeatedly. And eventually the gcc people fixed it; we already saw
    versions without this bug in gcc-4.0 or 4.1 IIRC, but in 4.4 it was
    there again, but apparently they have since fixed it for good.

    You are looking for more than C and the gcc documented extensions give
    you. That is always going to be hard.

    Really? It works.

    Ideally, you need a new gcc flag or two with documented and guaranteed
    effects to give you the assurance you need for your code. That's going
    to take a lot of effort, I would expect, and I can see it being hard for
    a relatively niche project like Gforth to push for that.

    Our approach has been to find sanity-checks and workarounds based on
    what gcc provided.

    However, we were not the only ones working with code copying, and
    Prokopski and Verbrugge have implemented changes to gcc that support
    this technique, and presented it at the GCC Developers’ Summit 2007 <https://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=get&target=GCC2007-Proceedings.pdf>
    and at CC'08:

    @InProceedings{prokopski&verbrugge08,
    author = {Gregory B. Prokopski and Clark Verbrugge},
    title = {Compiler-Guaranteed Safety in Code-Copying Virtual
    Machines},
    booktitle = {Compiler Construction (CC'08)},
    pages = {163--177},
    year = {2008},
    publisher = {Springer LNCS 4959},
    url = {http://www.sable.mcgill.ca/publications/papers/2008-2/paper.pdf},
    OPTannote = {}
    }

    The source code was available, but the gcc maintainers were apparently
    not interested. So much for "patches welcome".

    Looking back, while there was quite a bit of interest in code-copying
    (both for interpreters and for partial evaluators) from about
    1998-2008, AFAIK Gforth is the only project that stuck with this
    technique.

    When others consider relatively unsophisticated interpreters to be too
    slow, they tend to go for JIT compilers that generate machine code
    using target-specific code (including machine-code encoding code).

    Maybe the constant advocacy that everything outside the standard is
    considered to be broken and the next compiler will not compile it as
    intended has had its effects. Or maybe if we had published a
    code-copying howto, more people would have found out how to do it in a
    way that works pretty reliably.

    OTOH, we ourselves have been thinking about switching to the kind of
    JIT compiler that others have gone for. So we fell for this advocacy ourselves. But looking at the stability of Gforth, this is not really justified. Still, a solid foundation like machine code provides more confidence than a foundation based on C where every new compiler
    version may bring unpleasant surprises (and not just for projects such
    as Gforth), even if the experience is that things work.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Niklas Holsti@21:1/5 to Anton Ertl on Thu Sep 5 21:38:07 2024
    On 2024-09-05 18:49, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    On 05/09/2024 13:31, Anton Ertl wrote:

    [ discussion of the implementation of Gforth as a code-copying
    and code-pasting interpreter, and the maintenance problems
    this leads to when changing gcc versions ]

    It seems to me that this discussion (of Gforth) has very little to do
    with the ability of C compilers to optimize away or do something else
    with C code that the compiler detects invokes Undefined Behavior, and
    instead concerns how successive gcc versions break the assumptions that
    Gforth developers make about the structure of the machine code that gcc
    emits for legal C code that does not invoke Undefined Behavior if
    executed without modification.

    If you try to restructure or modify the machine code that Gcc produces
    on the fly, during program execution, as Gforth tries to do, that is so
    outside the C standard that it is only Undefined Behavior in the sense
    of not being even considered in the standard.

    I don't doubt that Anton has experienced bad effects of the
    "optimization" of Undefined Behavior, in other contexts, but I tend to
    agree with David on that issue.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Sep 5 19:29:36 2024
    On Thu, 5 Sep 2024 14:05:08 +0000, Scott Lurndal wrote:

    Terje Mathisen <[email protected]> writes:
    David Brown wrote:

    It would be nice if C had subrange types like Pascal or Ada, but it does
    not. Usually int - or sized ints - are the practical choice.


    Agreed 100%

    Although absent architecture support, how does one ensure that the
    value remains within the subrange?

    result = min(max(min_range,x),max_range);

    or for 2^n values

    result = ( ( x << (64-width) ) >> (64-width) );

    The top is 2 instructions, the bottom 1 (both signed and unsigned).
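    Both approaches can be sketched in C; the names are illustrative. There
    is one caveat about the shift one-liner when `x` is a signed 64-bit type.
    In ISO C, `x << (64-width)` is undefined if `x` is negative or the result
    overflows, and `>>` of a negative value is implementation-defined. The
    sign-extending version below therefore masks and then uses the xor and
    subtract trick. It still relies on the implementation-defined conversion
    from `uint64_t` to `int64_t`, which GCC documents as reduction modulo
    2^64. `width` is assumed to be in 1..64.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating: force x into [lo, hi]. Assumes lo <= hi. */
static int64_t clamp_i64(int64_t x, int64_t lo, int64_t hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Wrap-around, unsigned: keep the low `width` bits (1 <= width <= 64). */
static uint64_t wrap_u64(uint64_t x, unsigned width) {
    return width == 64 ? x : (x & ((UINT64_C(1) << width) - 1));
}

/* Wrap-around, signed: keep the low `width` bits and sign-extend them.
   (v ^ m) - m maps field values with the sign bit clear to themselves and
   those with it set to v - 2^width, all in unsigned arithmetic. */
static int64_t wrap_i64(int64_t x, unsigned width) {
    uint64_t m = UINT64_C(1) << (width - 1);   /* sign bit of the field */
    uint64_t v = wrap_u64((uint64_t)x, width);
    return (int64_t)((v ^ m) - m);
}
```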

  • From MitchAlsup1@21:1/5 to David Brown on Thu Sep 5 19:24:19 2024
    On Thu, 5 Sep 2024 13:48:37 +0000, David Brown wrote:

    On 04/09/2024 20:13, MitchAlsup1 wrote:
    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(), which
    will choose to run forwards or backwards as needed. So you might as well
    just call memmove() any time you are not sure memcpy() is safe and
    appropriate.
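    This is a byte-at-a-time sketch of how a memmove() can choose its copy
    direction. `my_memmove` and the demo helpers are hypothetical, and real
    implementations copy in larger chunks. ISO C leaves `<` undefined for
    pointers into different objects, so the comparison is done on
    `uintptr_t`. That is a practical choice, not a strictly portable one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *my_memmove(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if ((uintptr_t)d <= (uintptr_t)s || (uintptr_t)d >= (uintptr_t)s + n) {
        /* dst is below src, or past the end of src: copying forwards never
           overwrites a source byte before it has been read. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* dst starts inside [src, src + n): copy backwards instead. */
        for (size_t i = n; i-- > 0; )
            d[i] = s[i];
    }
    return dst;
}

/* Overlapping shift to the right within one buffer: "abcdef" -> "ababcd". */
static int demo_overlap_right(void) {
    char buf[8] = "abcdef";
    my_memmove(buf + 2, buf, 4);
    return strcmp(buf, "ababcd") == 0;
}

/* Overlapping shift to the left within one buffer: "abcdef" -> "cdefef". */
static int demo_overlap_left(void) {
    char buf[8] = "abcdef";
    my_memmove(buf, buf + 2, 4);
    return strcmp(buf, "cdefef") == 0;
}
```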

    Memmove() is always appropriate unless you are doing something
    nefarious.
    So:
    #define memcpy memmove
    and move forward with life--for the 2 extra cycles memmove costs it
    saves everyone long term grief.


    Or just use memmove, and not memcpy, whenever you are moving stuff
    around in the same buffer.

    When you need the nefarious activities of memcpy write it as a
    for loop by yourself and comment the nefariousness of the use.

    memcpy is not nefarious. It's quite simple, and does what it says on
    the tin. Use it when you want to copy non-overlapping memory areas.
    Don't use it if you want to do something other than that. I have never understood why anyone would find this difficult.

    There are compilers that:: s/memcpy/memmove/g

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Sep 5 19:31:23 2024
    On Thu, 5 Sep 2024 14:06:45 +0000, Scott Lurndal wrote:

    ---------------------------------------------------------- array
    indices are always positive


    Not in ada or fortran.

  • From David Brown@21:1/5 to Terje Mathisen on Thu Sep 5 21:36:04 2024
    On 05/09/2024 19:04, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >>
    on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want two's
    complement wrapping, and for the even fewer occasions when you
    actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as
    an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
    with unsigned i, arrays always use a zero base so in Rust the only array index type is usize, i.e the largest supported unsigned type in the
    system, typically the same as u64.

    Loop counters can usually be signed or unsigned, and it usually makes no difference. Array indices are also usually much the same signed or
    unsigned, and it can feel more natural to use size_t here (an unsigned
    type). It can make a difference to efficiency, however. On x86-64,
    this code is 3 instructions with T as "unsigned long int" or "long int",
    4 with "int", and 5 with "unsigned int".

    int foo(int * p, T x) {
    int a = p[x++];
    int b = p[x++];
    return a + b;
    }


    Anyway, I count loop counters and array indices as "use of integer-type variables", whether you prefer signed or unsigned.




    unsigned arithmetic is easier than signed integer arithmetic, including comparisons that would result in a negative value, you just have to make
    the test before subtracting, instead of checking if the result was
    negative.

    I can't follow that at all. Unsigned and signed arithmetic and
    comparisons both work simply and as you'd expect. /Mixing/ signed and
    unsigned types can get things wrong.


    I.e. I cannot easily replicate a downward loop that exits when the
    counter becomes negative:

      for (int i = START; i >= 0; i-- ) {
        // Do something with data[i]
      }

    One of my alternatives is

      unsigned u = start; // Cannot be less than zero
      if (u) {
        u++;
        do {
          u--;
          // Do something with data[u]
        } while (u);
      }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump if Greater or Equal, signed) instead
    of JA (Jump if Above, unsigned), but my version is far more verbose.


    A more important thing is that the first version, with signed i, is
    /vastly/ simpler and clearer in the source code.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

      for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje


    Or you could just write sane code that matches what you want to say.

  • From Bernd Linsel@21:1/5 to Anton Ertl on Thu Sep 5 22:05:17 2024
    On 05.09.24 17:49, Anton Ertl wrote:

    Nobody said that gcc did anything wrong here. We were, however,
    surprised that -fno-reorder-blocks did not suppress the reordering; we reported this as bug, but were told that this option does something
    different from what it says. Anyway, we developed a workaround. And
    we also developed a workaround for the code duplication problem that
    showed up in gcc-7.


    Have you tried interspersing `asm volatile("")` statements?

    It is very often an effective means to prevent gcc from reordering code
    from before and after the asm statement.

    If you additional specify inputs, e.g. `asm volatile("" :: "r" (foo))`,
    you can force gcc to keep `foo` alive up to this point.
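    Here is a minimal sketch of such barriers, assuming GCC or Clang. The
    sketch spells the keyword `__asm__`, which also works in strict ISO
    modes, and the empty template emits no instructions. The GCC manual notes
    that even a volatile asm can still be moved relative to other code. These
    statements therefore constrain the layout rather than guarantee it. The
    function itself is illustrative.

```c
#include <assert.h>

static int sum_with_barriers(const int *a, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += a[i];
        /* Empty volatile asm with an "r" input: acc must be available in a
           register at this point, and the statement is not deleted. */
        __asm__ volatile("" :: "r"(acc));
    }
    /* "memory" clobber: a compiler-level barrier, so values held in
       registers are not assumed to match memory across it. */
    __asm__ volatile("" ::: "memory");
    return acc;
}
```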

    --
    Bernd Linsel

  • From Niklas Holsti@21:1/5 to All on Thu Sep 5 23:05:51 2024
    On 2024-09-05 22:29, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 14:05:08 +0000, Scott Lurndal wrote:

    Terje Mathisen <[email protected]> writes:
    David Brown wrote:

    It would be nice if C had subrange types like Pascal or Ada, but it
    does not. Usually int - or sized ints - are the practical choice.


    Agreed 100%

    Although absent architecture support, how does one ensure that the
    value remains within the subrange?

      result = min(max(min_range,x),max_range);


    That would be a /saturating/ ranged type. Neither Pascal nor Ada
    provides such types.


    or for 2^n values

      result = ( ( x << (64-width) ) >> (64-width) );

    The top is 2 instructions, the bottom 1 (both signed and unsigned).


    That would be a /wrap-around/ ranged type (if I understand the code
    correctly). Pascal does not provide such; Ada does (modular integers)
    and for any modulus, not only powers of two.
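    Here is a sketch of an Ada-style modular type whose modulus is not a
    power of two, in the spirit of `type Hour is mod 24;`. It uses a
    hypothetical `HOUR_MOD` constant and assumes the operands are already
    reduced to 0..23.

```c
#include <assert.h>

enum { HOUR_MOD = 24 };   /* hypothetical modulus */

/* a, b < HOUR_MOD, so a + b cannot overflow unsigned. */
static unsigned hour_add(unsigned a, unsigned b) {
    return (a + b) % HOUR_MOD;
}

/* Add the modulus before subtracting so the unsigned result never wraps. */
static unsigned hour_sub(unsigned a, unsigned b) {
    return (a + HOUR_MOD - b) % HOUR_MOD;
}
```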

    Pascal ranged types are expected to trap (abort) on exceeding the range,
    IIRC, and Ada non-modular ranged types are expected to raise an
    exception. Probably that, too, is only a couple of instructions for Mitch.

  • From Bill Findlay@21:1/5 to All on Thu Sep 5 22:23:51 2024
    On 5 Sep 2024, MitchAlsup1 wrote
    (in article<[email protected]>):

    On Thu, 5 Sep 2024 14:06:45 +0000, Scott Lurndal wrote:

    ---------------------------------------------------------- array
    indices are always positive

    Not in ada or fortran.

    Or C.

    --
    Bill Findlay

  • From Brett@21:1/5 to David Brown on Thu Sep 5 20:46:24 2024
    David Brown <[email protected]> wrote:
    On 04/09/2024 22:15, Brett wrote:
    David Brown <[email protected]> wrote:
    On 03/09/2024 21:28, Stefan Monnier wrote:
    My impression - based on hearsay for Rust as I have no experience - is that
    the key point of Rust is memory "safety". I use scare-quotes here, since it
    is simply about correct use of dynamic memory and buffers.

    It is entirely possible to have correct use of memory in C,

    If you look at the evolution of programming languages, "higher-level"
    doesn't mean "you can do more stuff". On the contrary, making
    a language "higher-level" means deciding what it is we want to make
    harder or even impossible.


    Agreed.

    I've heard it said that the power of a programming language comes not
    from what you can do with the language, but from what you cannot do.

    Wrong, the last version of Swift added all the garbage programming concepts
    that one should avoid.


    That does not show that I was wrong - perhaps Swift is not a powerful programming language!

    Of course, it all depends on what you mean by "powerful".

    (I don't know Swift at all.)

    Clearly, you are not developing in the Apple ecosystem.
    Swift has completely replaced Objective-C as the development language used on Apple hardware. C++ was not used for OSX development.

    You have to give people the tools to do anything.


    You don't /have/ to do that. But it's often easier to market a language
    that can do anything.

  • From Bernd Linsel@21:1/5 to All on Thu Sep 5 23:07:22 2024
    On 05.09.24 19:04, Terje Mathisen wrote:
    One of my alternatives is

    unsigned u = start; // Cannot be less than zero
    if (u) {
    u++;
    do {
    u--;
    data[u]...
    while (u);
    }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
    of JA (Jump Above), but my version is far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

    for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje

    What about:

    for (unsigned u = start; u != ~0u; --u)
    ...

    or even

    for (unsigned u = start; (int)u >= 0; --u)
    ...

    ?

    I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
    on godbolt.org:
    - 32 bit version: https://godbolt.org/z/TMhhx3nch
    - 64 bit version: https://godbolt.org/z/8oxzTf5Gf

    --
    Bernd Linsel

  • From Scott Lurndal@21:1/5 to Bernd Linsel on Thu Sep 5 21:39:00 2024
    Bernd Linsel <[email protected]> writes:
    On 05.09.24 19:04, Terje Mathisen wrote:
    One of my alternatives is

    unsigned u = start; // Cannot be less than zero
    if (u) {
    u++;
    do {
    u--;
    data[u]...
    while (u);
    }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
    of JA (Jump Above), but my version is far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

    for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje


    What about:

    for (unsigned u = start; u != ~0u; --u)

    This is the form we use most when we need
    to work in reverse.

    ...

    or even

    for (unsigned u = start; (int)u >= 0; --u)
    ...

    ?

    I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
    on godbolt.org:
    - 32 bit version: https://godbolt.org/z/TMhhx3nch
    - 64 bit version: https://godbolt.org/z/8oxzTf5Gf


    No significant differences in code generation for unsigned vs. signed.
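A minimal sketch, assuming a C99 compiler, of the `u != ~0u` form Scott describes, checked against the signed loop it is meant to replace. The function names and the `out`/`cap` recording parameters are hypothetical, introduced only so the visiting order can be inspected.

```c
#include <stddef.h>

/* Visit start, start-1, ..., 0 with an unsigned counter.  Decrementing
   0 wraps to UINT_MAX, which equals ~0u and ends the loop; unsigned
   wraparound is well defined in C.  Caveat: if start == UINT_MAX the
   body never runs.  Each visited index is recorded in out[] (up to cap
   entries) and the total iteration count is returned. */
size_t visit_down_unsigned(unsigned start, unsigned *out, size_t cap)
{
    size_t n = 0;
    for (unsigned u = start; u != ~0u; --u) {
        if (n < cap)
            out[n] = u;
        n++;
    }
    return n;
}

/* Reference: the signed loop the unsigned form is meant to match. */
size_t visit_down_signed(int start, unsigned *out, size_t cap)
{
    size_t n = 0;
    for (int i = start; i >= 0; i--) {
        if (n < cap)
            out[n] = (unsigned)i;
        n++;
    }
    return n;
}
```

Both run the body start + 1 times for a non-negative start; the unsigned form's one blind spot is the sentinel value itself.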

  • From Tim Rentsch@21:1/5 to Terje Mathisen on Thu Sep 5 15:04:14 2024
    Terje Mathisen <[email protected]> writes:

    [...]

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
    fine with unsigned i. Arrays always use a zero base, so in Rust the
    only array index type is usize, i.e. the pointer-sized unsigned
    type, typically the same as u64 on 64-bit targets.

    Unsigned arithmetic is easier than signed integer arithmetic,
    including comparisons that would result in a negative value: you just
    have to make the test before subtracting, instead of checking whether
    the result was negative.

    However, I cannot easily replicate a downward loop that exits when the
    counter becomes negative:

    for (int i = START; i >= 0; i-- ) {
    // Do something with data[i]
    }

    See below.

    One of my alternatives is

    unsigned u = start; // Cannot be less than zero
    if (u) {
    u++;
    do {
    u--;
    data[u]...
    } while (u); /* presumably the } was intended */
    }

    This code isn't the same as the for() loop above. If start is
    0, the for() loop runs once, but the do..while loop runs zero times.

    Regarding the given for() loop, namely this:

    for (int i = START; i >= 0; i-- ) {
    // Do something with data[i]
    }

    If START is signed (presumably of type int), so the loop might run
    zero times, but never more than INT_MAX times, then

    for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
    // Do something with data[u]
    }

    If START is unsigned, so in all cases the loop must run at
    least once, then

    unsigned u = START;
    do {
    // Do something with data[u]
    } while( u > 0 && u-- );

    (Yes I know the 'u > 0' expressions can be replaced by just 'u'.)

    The optimizer should be smart enough to realize that if 'u > 0'
    is true then the test 'u--' will also be true. The same should
    hold if 'u > 0' is replaced by just 'u'.

    (Disclaimer: code not compiled.)
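Since the loops above were posted uncompiled, here is a compilable sketch of both of Tim's forms, with the body's index taken from the loop variable `u`. The function names and the `out`/`cap` parameters are hypothetical scaffolding for recording which indices the body sees.

```c
#include <stddef.h>

/* First form: start is a signed int.  The body sees u = start..0 and
   runs zero times when start < 0.  start + 1u is computed in unsigned
   arithmetic, so start == INT_MAX does not overflow. */
size_t tim_signed_start(int start, unsigned *out, size_t cap)
{
    size_t n = 0;
    for (unsigned u = start < 0 ? 0 : start + 1u; u > 0 && u--; ) {
        if (n < cap)
            out[n] = u;              /* "do something with data[u]" */
        n++;
    }
    return n;
}

/* Second form: start is unsigned, so the body always runs at least
   once, with u = start..0. */
size_t tim_unsigned_start(unsigned start, unsigned *out, size_t cap)
{
    size_t n = 0;
    unsigned u = start;
    do {
        if (n < cap)
            out[n] = u;              /* "do something with data[u]" */
        n++;
    } while (u > 0 && u--);
    return n;
}
```

In both forms the `u--` in the condition is only evaluated when `u > 0`, so it never wraps.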

  • From Bernd Linsel@21:1/5 to Tim Rentsch on Fri Sep 6 00:19:52 2024
    On 06.09.24 00:04, Tim Rentsch wrote:

    If START is signed (presumably of type int), so the loop might run
    zero times, but never more than INT_MAX times, then

    for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
    // Do something with data[u]
    }

    If START is unsigned, so in all cases the loop must run at
    least once, then

    unsigned u = START;
    do {
    // Do something with data[u]
    } while( u > 0 && u-- );

    (Yes I know the 'u > 0' expressions can be replaced by just 'u'.)

    The optimizer should be smart enough to realize that if 'u > 0'
    is true then the test 'u--' will also be true. The same should
    hold if 'u > 0' is replaced by just 'u'.

    (Disclaimer: code not compiled.)

    Both yield rather inelegant code:

    https://godbolt.org/z/M4Y5PYP3v

    --
    Bernd Linsel

  • From David Brown@21:1/5 to BGB on Fri Sep 6 09:10:04 2024
    On 06/09/2024 00:51, BGB wrote:
    On 9/5/2024 8:27 AM, David Brown wrote:
    On 05/09/2024 10:51, Niklas Holsti wrote:
    On 2024-09-05 10:54, David Brown wrote:
    On 05/09/2024 02:56, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 0:41:36 +0000, BGB wrote:

    On 9/4/2024 3:59 PM, Scott Lurndal wrote:

    Say:
       long z;
       int x, y;
       ...
       z=x*y;
    Would auto-promote to long before the multiply.

    I may have to use this as an example of C allowing the programmer
    to shoot himself in the foot; promotion or no promotion.

    You snipped rather unfortunately here - it makes it look like this
    was code that Scott wrote, and you've removed essential context by BGB.

    While I agree it is an example of the kind of code that people
    sometimes write when they don't understand C arithmetic, I don't
    think it is C-specific.  I can't think of any language off-hand
    where expressions are evaluated differently depending on types used
    further out in the expression.  Can you give any examples of
    languages where the equivalent code would either do the
    multiplication as "long", or give an error so that the programmer
    would be informed of their error?


    The Ada language can work in both ways. If you just have:

        z : Long_Integer;  -- Not a standard Ada type, but often provided.
        x, y : Integer;
        ...
        z := x * y;

    the compiler will inform you that the types in the assignment do not
    match: using the standard (predefined) operator "*", the product of
    two Integers gives an Integer, not a Long_Integer.

    That seems like a safe choice.  C's implicit promotion of int to long
    int can be convenient, but convenience is sometimes at odds with safety.


    A lot of the time, implicit promotion will be the "safer" option
    compared to first doing an operation that overflows and then promoting.

    Annoyingly, one can't really switch to promoting first instead of
    promoting afterwards, as there may be programs that actually rely on this particular bit of overflow behavior.


    A programming language has to work as it is defined. And people should
    not be relying on code doing things that are /not/ defined.

    So promoting arguments implicitly before the operation is only useful if
    it is a clearly defined part of the language. (In C, that is the way it
    works up to the size of "int".)

    In C, if you have :

    long int foo(int x, int y) {
    return x * y;
    }

    then the compiler is free to implement this as full 64-bit
    multiplication and return that 64-bit value. This is because a
    32 x 32 bit multiplication either gives the correct answer without overflow, and promoting it to 64 bits keeps that value, or there is an
    overflow and the results are undefined, so the compiler can return
    whatever it likes.

    But unless the compiler documents this behaviour (in which case the code
    would be correct but non-portable), the code is buggy.

    Conversely, if unsigned types are used here, the results of the
    multiplication must be truncated to 32 bits - keeping higher bits in the
    return value would be a compiler bug.


    However a language wants to handle this, it needs to be specified by the language. Most languages (AFAIK) have no implicit promotion that is
    dependent on what you are doing with the results. (Some, including C,
    will have various degrees of implicit promotion dependent solely on the expression itself, but not on what is done with the evaluated result.)
    Ada, AFAIK, does not have implicit promotions between types - "int" does
    not automatically promote to "long int". This can be seen as an
    inconvenience compared to many other languages, but it means that it is possible to have a consistent and safe way to overload by return type.


    In effect, in my case, the promotion behavior ends up needing to depend
    on the language-mode (it is either this or maybe internally split the operators into widening or non-widening variants, which are selected
    when translating the AST into the IR stage).


    Dependency on a "language mode" does not sound "safe" to me.

    Well, as opposed to dealing with the widening cases by emitting IR with
    implicit casts added into the IR.
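For the `z = x*y` case discussed in this post, a minimal sketch of the usual fix, assuming `int32_t`/`int64_t` stand in for the thread's 32-bit `int` and 64-bit `long`, and that `int` itself is 32 bits wide: convert one operand before the multiplication, so the multiply is done in the wider type. The function names are hypothetical.

```c
#include <stdint.h>

/* Widen first: x is converted to int64_t, so y is converted too (usual
   arithmetic conversions) and the 64-bit product cannot overflow. */
int64_t mul_widen(int32_t x, int32_t y)
{
    return (int64_t)x * y;
}

/* Narrow unsigned multiply: with a 32-bit int, uint32_t operands are
   not promoted, so the product wraps modulo 2^32 *before* it is
   converted to the 64-bit return type. */
uint64_t mul_unsigned_narrow(uint32_t x, uint32_t y)
{
    return x * y;
}
```

The unsigned version illustrates the point above that keeping the high bits would be a compiler bug: the truncation to 32 bits is required, not merely permitted.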



  • From David Brown@21:1/5 to All on Fri Sep 6 09:23:21 2024
    On 05/09/2024 21:24, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 13:48:37 +0000, David Brown wrote:

    On 04/09/2024 20:13, MitchAlsup1 wrote:
    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(), which
    will choose to run forwards or backwards as needed.  So you might as
    well just call memmove() any time you are not sure memcpy() is safe and
    appropriate.

    Memmove() is always appropriate unless you are doing something
    nefarious.
    So:
    # define memcpy memmove
    and move forward with life--for the 2 extra cycles memmove costs it
    saves everyone long term grief.


    Or just use memmove, and not memcpy, whenever you are moving stuff
    around in the same buffer.

    When you need the nefarious activities of memcpy write it as a
    for loop by yourself and comment the nefariousness of the use.

    memcpy is not nefarious.  It's quite simple, and does what it says on
    the tin.  Use it when you want to copy non-overlapping memory areas.
    Don't use it if you want to do something other than that.  I have never
    understood why anyone would find this difficult.

    There are compilers that:: s/memcpy/memmove/g

    They can do that if they want - memcpy can be implemented using memmove,
    but not vice versa.

    That doesn't mean it is at all a good idea to use memcpy when you mean
    memmove.
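A small sketch of the case under discussion, moving data around within one buffer; the `insert_char` helper is hypothetical. Because the source and destination ranges overlap, `memmove` is the correct call here, and `memcpy` would be undefined behaviour.

```c
#include <string.h>

/* Insert c at index pos of the NUL-terminated string in buf, whose
   total capacity is cap bytes.  Returns 0 on success, -1 if pos is
   past the end or there is no room for one more character. */
int insert_char(char *buf, size_t cap, size_t pos, char c)
{
    size_t len = strlen(buf);
    if (pos > len || len + 2 > cap)
        return -1;
    /* Shift the tail (including the terminating NUL) right by one.
       buf+pos and buf+pos+1 overlap, so memmove, not memcpy. */
    memmove(buf + pos + 1, buf + pos, len - pos + 1);
    buf[pos] = c;
    return 0;
}
```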

  • From Anton Ertl@21:1/5 to Bernd Linsel on Fri Sep 6 07:16:43 2024
    Bernd Linsel <[email protected]> writes:
    On 05.09.24 17:49, Anton Ertl wrote:

    Nobody said that gcc did anything wrong here. We were, however,
    surprised that -fno-reorder-blocks did not suppress the reordering; we
    reported this as a bug, but were told that this option does something
    different from what it says. Anyway, we developed a workaround. And
    we also developed a workaround for the code duplication problem that
    showed up in gcc-7.


    Have you tried interspersing `asm volatile("")` statements?

    It is very often an effective means to prevent gcc from reordering code
    from before and after the asm statement.

    We are using asm statements that result in no machine code for various
    purposes (including the workaround for the code duplication of gcc-7
    ff.)

    We have not tried it for suppressing the basic block reordering, and I
    would not expect such a statement to suppress that, because asm
    volatile("") acts as a data-flow barrier, and basic-block reordering
    has nothing to do with data flow.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Scott Lurndal on Fri Sep 6 09:17:45 2024
    On 05/09/2024 23:39, Scott Lurndal wrote:
    Bernd Linsel <[email protected]> writes:
    On 05.09.24 19:04, Terje Mathisen wrote:
    One of my alternatives is

    unsigned u = start; // Cannot be less than zero
    if (u) {
    u++;
    do {
    u--;
    data[u]...
    while (u);
    }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump (signed) Greater or Equal) instead
    of JA (Jump Above), but my version is far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

    for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje


    What about:

    for (unsigned u = start; u != ~0u; --u)

    This is the form we use most when we need
    to work in reverse.


    In a code review, I would reject that - and all the other nonsense
    suggested here as a way to force all loop indices to be unsigned types
    as though that rule was the 11th commandment.

    Just write code that makes sense - it's /not/ hard in this case!

    for (int i = start; i >= 0; i--) ...

    If you need the loop counter to be an unsigned type inside the loop
    code, make an unsigned version:

    for (int i = start; i >= 0; i--) {
    const unsigned int u = i;
    ...
    }

    Sometimes it amazes me the kind of nonsense people write in code because
    of obsession about particular rules. Code clarity trumps /all/
    stylistic rules.

  • From Anton Ertl@21:1/5 to Niklas Holsti on Fri Sep 6 07:25:35 2024
    Niklas Holsti <[email protected]> writes:
    On 2024-09-05 18:49, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    On 05/09/2024 13:31, Anton Ertl wrote:

    [ discussion of the implementation of Gforth as a code-copying
    and code-pasting interpreter, and the maintenance problems
    this leads to when changing gcc versions ]

    It seems to me that this discussion (of Gforth) has very little to do
    with the ability of C compilers to optimize away or do something else
    with C code that the compiler detects invokes Undefined Behavior

    Yes. What I wrote about was just to show what is happening in Gforth,
    and that the techniques, even though they may seem totally outlandish
    to some, are actually pretty usable across many releases of gcc
    (despite the lack of guarantees from gcc); in the last 20 years we
    have needed to deal with one new development, and our workaround for
    that also works on older gcc releases.

    What some C compilers tend to do is, however, better described as
    "Assume That Undefined Behaviour Does Not Happen" (ATUBDNH), and
    deriving "knowledge" from that (e.g., about the possible values of a
    variable), and then using that "knowledge" in "optimizations".

    I don't doubt that Anton has experienced bad effects of the
    "optimization" of Undefined Behavior, in other contexts

    The main bad effect is that I replaced more efficient and shorter code
    with less efficient and longer code. In theory the compiler can
    generate the same code for both, but in practice that does not happen.
    As an example, the test for the smallest signed integer can be written
    with -fwrapv as:

    if (x<=x-1)

    and gcc -fwrapv compiles this to shorter code on AMD64 than

    if (x==CELL_MIN)

    What gcc produces for both formulations is longer than

    dec %rdi
    jno ...

    Maybe instead of pursuing "optimizations" against the intentions of
    the programmer, they should concentrate on implementing real
    optimizations like optimizing either variant into the small code shown
    last.

    Interestingly, the first idiom is a case where gcc recognizes what the intention of the programmer is, and warns that it is going to
    miscompile it. The warning is good, the miscompilation not (but it
    would be worse without the warning).
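A minimal sketch contrasting two ways to test for the smallest `long`. The first is portable ISO C; the second uses `__builtin_sub_overflow`, a GCC/Clang extension (not standard C) that also appears later in this thread, and it needs neither `-fwrapv` nor any overflowing arithmetic. The function names are hypothetical, and which formulation compiles to the shortest code depends on the compiler version and target.

```c
#include <limits.h>
#include <stdbool.h>

/* Portable: compare against the constant directly. */
bool is_min_cmp(long x)
{
    return x == LONG_MIN;
}

/* GCC/Clang builtin: reports whether x - 1 overflows, which happens
   only for LONG_MIN.  The wrapped difference is stored in r but not
   otherwise used. */
bool is_min_ovf(long x)
{
    long r;
    return __builtin_sub_overflow(x, 1L, &r);
}
```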

    In any case, while the actual experience is that I have not been hit
    by "optimizations" that ATUBDNH in production code, possibly because I
    minimize these assumptions with flags like -fwrapv, the possibility
    that my code might be hit by such an "optimization" (e.g., a new one
    in a new compiler version, if I am lucky with a new flag for disabling
    the assumption, but my source code does not know about it yet) and the
    attitude of people who implement such "optimizations" is what I
    resent.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Fri Sep 6 13:57:18 2024
    On Fri, 06 Sep 2024 07:25:35 GMT
    [email protected] (Anton Ertl) wrote:

    The main bad effect is that I replaced more efficient and shorter code
    with less efficient and longer code. In theory the compiler can
    generate the same code for both, but in practice that does not happen.
    As an example, the test for the smallest signed integer can be written
    with -fwrapv as:

    if (x<=x-1)

    and gcc -fwrapv compiles this to shorter code on AMD64 than

    if (x==CELL_MIN)

    What gcc produces for both formulations is longer than

    dec %rdi
    jno ...


    Good trick.
    The same trick in non-destructive form would be 1 byte longer.
    cmp $1, %rdi
    jno ...

    But I was not able to force any of compilers currently installed on my
    home desktop (gcc 13.2, clang 18.1, MSVC 19.30.30706 == VS2022) to
    produce such code.

    The closest was MSVC that sometimes (not in all circumstances) produces
    2 bytes longer version:
    49 8d 49 ff lea -0x1(%r9),%rcx
    4c 3b c9 cmp %rcx,%r9

    Of course, it's still good deal shorter than
    48 ba 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rdx
    4c 3b ca cmp %rdx,%r9

    Both gcc and clang [under -fwrapv] insisted on turning x<=x-1 into x==LLONG_MIN.

    However even if we were able to force compiler to produce desired code,
    the space saving is architecture-specific.
    E.g. I expect no saving on ARM64 where both variants occupy 8 bytes.

    Maybe instead of pursuing "optimizations" against the intentions of
    the programmer, they should concentrate on implementing real
    optimizations like optimizing either variant into the small code shown
    last.

    Interestingly, the first idiom is a case where gcc recognizes what the intention of the programmer is, and warns that it is going to
    miscompile it. The warning is good, the miscompilation not (but it
    would be worse without the warning).


    You had more luck with warnings than I did.
    In all my test cases both gcc and clang [in absence of -fwrapv]
    silently dropped the check and dependent code.
    MSVC didn't drop it, so, naturally, it also produced no warning.

    In any case, while the actual experience is that I have not been hit
    by "optimizations" that ATUBDNH in production code, possibly because I minimize these assumptions with flags like -fwrapv, the possibility
    that my code might be hit by such an "optimization" (e.g., a new one
    in a new compiler version, if I am lucky with a new flag for disabling
    the assumption, but my source code does not know about it yet) and the attitude of people who implement such "optimizations" is what I
    resent.

    - anton

  • From David Brown@21:1/5 to Bernd Linsel on Fri Sep 6 13:58:15 2024
    On 05/09/2024 22:05, Bernd Linsel wrote:
    On 05.09.24 17:49, Anton Ertl wrote:

    Nobody said that gcc did anything wrong here.  We were, however,
    surprised that -fno-reorder-blocks did not suppress the reordering; we
    reported this as a bug, but were told that this option does something
    different from what it says.  Anyway, we developed a workaround.  And
    we also developed a workaround for the code duplication problem that
    showed up in gcc-7.


    Have you tried interspersing `asm volatile("")` statements?

    It is very often an effective means to prevent gcc from reordering code
    from before and after the asm statement.


    (I am quite confident that Anton has used asm volatile statements like
    this.)

    That only prevents movement of observable behaviour - basically volatile accesses, calls to externally defined functions, and other volatile asm statements. It does not prevent the movement of any other code.

    A commonly used variant is `asm volatile("" ::: "memory")` which is a
    local memory barrier, and blocks movements of loads and stores. But
    that can often be costly in performance, and also does not block
    movement of code that does not load or store memory.

    The compiler is also free to duplicate and shuffle around these
    "instructions", as long as they are "executed" as required. So it can
    do the same kinds of movements as it did before, transforming freely
    between:

    A:
    asm volatile("");
    doThis();
    asm volatile("");
    B:
    asm volatile("");
    doThat();
    asm volatile("");
    C:

    and

    A:
    asm volatile("");
    doThis();
    asm volatile("");
    asm volatile("");
    doThat();
    asm volatile("");
    goto C;
    B:
    asm volatile("");
    doThat();
    asm volatile("");
    C:


    If you additionally specify inputs, e.g. `asm volatile("" :: "r" (foo))`,
    you can force gcc to keep `foo` alive up to this point.


    That is sometimes a useful form of code. I've used it in sequences like
    this:

    x = long_calculation();
    asm volatile ("" :: "g" (x));
    get_lock();
    use_x(x);
    release_lock();

    Without that block, the compiler is free to move long_calculation()
    inside the locked area (within limitations from its knowledge of
    observable behaviour). In most practical cases, the get_lock() and release_lock() parts will have a memory barrier, and you don't actually
    get much of long_calculation() that might be moved, but it is certainly
    a possibility.


    asm volatile("" : "+g" (x));

    can also be useful. It not only forces "x" to be stable before the
    statement is "executed", but also tells the compiler to forget all it
    knows about "x" after it.
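A minimal sketch of this constraint trick, assuming GCC or Clang (extended `asm` is not ISO C). It uses `"+r"` rather than the `"+g"` above, which restricts the operand to a register instead of also allowing memory. The `opaque_long` and `sum_to` names are hypothetical.

```c
/* The asm emits no instructions, but the "+r" operand tells the
   compiler that x may be read and modified here: x must be computed
   into a register before this point, and afterwards the compiler may
   not assume it knows x's value. */
static inline long opaque_long(long x)
{
    __asm__ volatile("" : "+r"(x));
    return x;
}

/* Example use: if sum_to is inlined with a constant argument, the
   compiler may still compute s at compile time, but it cannot carry
   that known value past the barrier into the caller. */
long sum_to(long n)
{
    long s = 0;
    for (long i = 1; i <= n; i++)
        s += i;
    return opaque_long(s);
}
```

The barrier changes only what the optimizer may assume, never the value; inspecting the generated assembly is the practical way to confirm its effect.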

  • From Tim Rentsch@21:1/5 to Thomas Koenig on Fri Sep 6 06:37:13 2024
    Thomas Koenig <[email protected]> writes:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, putting "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior. (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here?

    If a < INT_MAX, the behavior is the same as replacing the if()
    test with 'if(0)'. If the compiler can accurately deduce that
    the condition 'a < INT_MAX' will hold in all cases then the if()
    can be optimized away accordingly.

    If a == INT_MAX, one possibility is that code is generated to
    evaluate the addition and the comparison, and the if-block is
    either evaluated or it isn't, depending on the outcome of the
    comparison. Important: the compiler is disallowed from drawing
    any inferences based on "knowing" the result of either the
    addition or the comparison; code must be generated under a "best
    efforts" umbrella, and whatever the code does dictates whether
    the if-block is evaluated or not, with the compiler being
    forbidden to draw any conclusions based on what the result will
    be.

    If a == INT_MAX, it also should be possible for the addition to
    abort the program. Here again the compiler is disallowed from
    drawing any inferences based on knowing this will happen. To
    make this work the rule allowing "UB to travel backwards in time"
    must be revoked; unless a compiler can accurately deduce that a
    given piece of code cannot transgress into UB then other code in
    the program must not be moved (either forwards or backwards) past
    the possibly-not-well-defined code segment.

    Let me be clear that I have not thought through all the details
    about exactly what the rules are or how they might be put into
    effect. Hopefully though my comments here give you a better
    sense of the direction meant to be suggested.
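Under the current rules, the question that `a > a + 1` is trying to ask can be written without ever evaluating an overflowing addition, which leaves the compiler no undefined behaviour to reason from. A minimal sketch of the standard pre-check, generalised from `a + 1` to `a + b`; the function name is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* True if a + b would overflow int.  Neither INT_MAX - b (for b > 0)
   nor INT_MIN - b (for b <= 0) can overflow, so every expression
   evaluated here is well defined. */
bool add_overflows(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}
```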

  • From Anton Ertl@21:1/5 to Michael S on Fri Sep 6 13:26:42 2024
    Michael S <[email protected]> writes:
    On Fri, 06 Sep 2024 07:25:35 GMT
    [email protected] (Anton Ertl) wrote:
    What gcc produces for both formulations is longer than

    dec %rdi
    jno ...


    Good trick.

    Thanks. It's not from me. I published it in 2015 <https://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf>,
    but unfortunately did not give a reference to where I have it from (I
    read it elsewhere).

    The same trick in non-destructive form would be 1 byte longer.
    cmp $1, %rdi
    jno ...

    But I was not able to force any of compilers currently installed on my
    home desktop (gcc 13.2, clang 18.1, MSVC 19.30.30706 == VS2022) to
    produce such code.

    The closest was MSVC that sometimes (not in all circumstances) produces
    2 bytes longer version:
    49 8d 49 ff lea -0x1(%r9),%rcx
    4c 3b c9 cmp %rcx,%r9

    Of course, it's still good deal shorter than
    48 ba 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rdx
    4c 3b ca cmp %rdx,%r9

    Both gcc and clang [under -fwrapv] insisted on turning x<=x-1 into x==LLONG_MIN.

    However even if we were able to force compiler to produce desired code,
    the space saving is architecture-specific.

    With this gcc-specific code we can force it:

    extern long foo1(long);
    extern long foo2(long);

    long bar(long a, long b)
    {
    long c;
    if (__builtin_sub_overflow(b,1,&c))
    return foo1(a);
    else
    return foo2(a);
    }

    gcc -O3 -c and gcc -Os -c (gcc-12.2) produce, on AMD64:

    0: 48 83 c6 ff add $0xffffffffffffffff,%rsi
    4: 70 05 jo b <bar+0xb>
    6: e9 00 00 00 00 jmp b <bar+0xb>
    b: e9 00 00 00 00 jmp 10 <bar+0x10>

    So, even though %rsi is dead afterwards, it does not use dec, but it's certainly better than the other variants.

    On ARM A64 both gcc invocations (gcc-10.2) produce:

    0: f1000421 subs x1, x1, #0x1
    4: 54000046 b.vs c <bar+0xc>
    8: 14000000 b 0 <foo2>
    c: 14000000 b 0 <foo1>

    On RV64GC both gcc invocations (gcc-10.3) produce:

    0000000000000000 <bar>:
    0: fff58793 addi a5,a1,-1
    4: 00f5c663 blt a1,a5,10 <.L6>
    8: 00000317 auipc t1,0x0
    c: 00030067 jr t1 # 8 <bar+0x8>

    0000000000000010 <.L6>:
    10: 00000317 auipc t1,0x0
    14: 00030067 jr t1 # 10 <.L6>

    So on RISC-V gcc manages to actually translate the if back into "if (b
    < b-1)" without pessimising that (but gcc-10 does not pessimize this
    code on AMD64, either).

    E.g. I expect no saving on ARM64 where both variants occupy 8 bytes.

    Here we have the three variants:

    #include <limits.h>

    extern long foo1(long);
    extern long foo2(long);

    long bar(long a, long b)
    {
    long c;
    if (__builtin_sub_overflow(b,1,&c))
    return foo1(a);
    else
    return foo2(a);
    }

    long bar2(long a, long b)
    {
    if (b < b-1)
    return foo1(a);
    else
    return foo2(a);
    }

    long bar3(long a, long b)
    {
    if (b == LONG_MIN)
    return foo1(a);
    else
    return foo2(a);
    }

    And here is what gcc-10 -Os -fwrapv -Wall -c produces:

    ARM A64:
    bar:                bar2:                 bar3:
    subs x1, x1, #0x1   sub x2, x1, #0x1      mov x2, #0x8000000000000000
    b.vs c <bar+0xc>    cmp x2, x1            cmp x1, x2
                        b.le 20 <bar2+0x10>   b.ne 34 <bar3+0x10>

    RV64GC:
    bar:                bar2:                 bar3:
    addi a5,a1,-1       addi a5,a1,-1         li a5,-1
    bge a1,a5,10 <.L4>  bge a1,a5,28 <.L6>    slli a5,a5,0x3f
                                              bne a1,a5,40 <.L8>
    8 Bytes             8 Bytes               8 Bytes

    AMD64:
    bar:                bar2:                 bar3:
    add $-1,%rsi        lea -0x1(%rsi),%rax   mov $0x1,%eax
    jo b <bar+0xb>      cmp %rsi,%rax         shl $0x3f,%rax
                        jle 1e <bar2+0xe>     cmp %rax,%rsi
                                              jne 36 <bar3+0x13>
    6 Bytes             9 Bytes               14 Bytes

    With gcc-12 on AMD64:
    bar:                bar2:                 bar3:
    add $-1,%rsi        mov $0x1,%eax         mov $0x1,%eax
    jo b <bar+0xb>      shl $0x3f,%rax        shl $0x3f,%rax
                        cmp %rax,%rsi         cmp %rax,%rsi
                        jne 23 <bar2+0x13>    jne 23 <bar2+0x13>
    6 Bytes             14 Bytes              14 Bytes

    (Actually in the latter case gcc recognizes that bar2 and bar3 are
    equivalent and jumps from bar3 to bar2, but I am sure that without
    bar2, bar3 would look the same as bar2 does now).

    So when gcc does not pessimize "b < b-1" into "b == LONG_MIN", the straightforward code for the former has the same or smaller size, and
    the same or smaller number of instructions on these architectures.
    The "__builtin_sub_overflow(b,1,&c)" has the same or fewer bytes than
    "b < b-1" and the same or fewer instructions. So, with
    straightforward translations "__builtin_sub_overflow(b,1,&c)"
    dominates "b < b-1", which dominates "b == LONG_MIN".

    As a new feature, gcc-12 recognizes "b < b-1" and pessimizes it into
    the same code as "b == LONG_MIN".

    Interestingly, the first idiom is a case where gcc recognizes what the
    intention of the programmer is, and warns that it is going to
    miscompile it. The warning is good, the miscompilation not (but it
    would be worse without the warning).


    You had more luck with warnings than I did.
    In all my test cases both gcc and clang [in absence of -fwrapv]
    silently dropped the check and depended code.

    Interesting. I tried both "b < b-1" and "b >= b+1" and got no warning
    (with gcc-10 and gcc-12), but I have seen a warning with one of those
    idioms in the past. Maybe someone decided that warning about this
    idiom is unnecessary, while "optimizing" it is.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Tim Rentsch on Fri Sep 6 18:17:50 2024
    On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

    Thomas Koenig <[email protected]> writes:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, put "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior. (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here?

    If a < INT_MAX, the behavior is the same as replacing the if()
    test with 'if(0)'. If the compiler can accurately deduce that
    the condition 'a < INT_MAX' will hold in all cases then the if()
    can be optimized away accordingly.

    If a == INT_MAX, one possibility is that code is generated to
    evaluate the addition and the comparison, and the if-block is
    either evaluated or it isn't, depending on the outcome of the
    comparison. Important: the compiler is disallowed from drawing
    any inferences based on "knowing" the result of either the
    addition or the comparison; code must be generated under a "best
    efforts" umbrella, and whatever the code does dictates whether
    the if-block is evaluated or not, with the compiler being
    forbidden to draw any conclusions based on what the result will
    be.

    If a == INT_MAX, it also should be possible for the addition to
    abort the program. Here again the compiler is disallowed from
    drawing any inferences based on knowing this will happen. To
    make this work the rule allowing "UB to travel backwards in time"
    must be revoked; unless a compiler can accurately deduce that a
    given piece of code cannot transgress into UB then other code in
    the program must not be moved (either forwards or backwards) past
    the possibly-not-well-defined code segment.

    It is also possible if a == INT_MAX that the exception will
    transfer control to a signal handler to do some SW orchestrated
    recovery.

    Let me be clear that I have not thought through all the details
    about exactly what the rules are or how they might be put into
    effect. Hopefully though my comments here give you a better
    sense of the direction meant to be suggested.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Fri Sep 6 23:10:16 2024
    On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 10:04 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >>
    on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want two's complement wrapping, and for the even fewer occasions when you
    actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as
    an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
    with unsigned i, arrays always use a zero base so in Rust the only array
    index type is usize, i.e. the largest supported unsigned type in the
    system, typically the same as u64.

    unsigned arithmetic is easier than signed integer arithmetic, including
    comparisons that would result in a negative value, you just have to make
    the test before subtracting, instead of checking if the result was
    negative.

    I.e. I cannot easily replicate a downward loop that exits when the
    counter becomes negative:

      for (int i = START; i >= 0; i-- ) {
        // Do something with data[i]
      }

    for (int i = START; i > -1; i-- ) {
    // Do something with data[i]
    }

    ;^)

    # define START 0x80000001



    One of my alternatives are

      unsigned u = start; // Cannot be less than zero
      if (u) {
        u++;
        do {
          u--;
          data[u]...
        } while (u);
      }

    any unsigned integer cannot be less than zero?



    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump if Greater or Equal, signed) instead
    of JA (Jump if Above, unsigned), but my version is far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

      for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje


  • From David Brown@21:1/5 to All on Sat Sep 7 09:15:11 2024
    On 07/09/2024 01:10, MitchAlsup1 wrote:
    On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 10:04 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data, for anything involving bit manipulation (use of &, |, ^, << and >> on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want two's complement wrapping, and for the even fewer occasions when you
    actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
    with unsigned i, arrays always use a zero base so in Rust the only array index type is usize, i.e. the largest supported unsigned type in the
    system, typically the same as u64.

    unsigned arithmetic is easier than signed integer arithmetic, including
    comparisons that would result in a negative value, you just have to make the test before subtracting, instead of checking if the result was
    negative.

    I.e. I cannot easily replicate a downward loop that exits when the
    counter becomes negative:

       for (int i = START; i >= 0; i-- ) {
         // Do something with data[i]
       }

    for (int i = START; i > -1; i-- ) {
          // Do something with data[i]
    }

    ;^)

    # define START 0x80000001


    No.

    The great thing about 32 bit integers is that your numbers are never
    anywhere close to being too big - or you /know/ you are dealing with
    very big numbers and you can take that into account such as by using
    64-bit integer types.

    A number that is the start or end of a normal count range is /never/ 0x80000001. Write code that is clear, simple and correct for what you
    are actually doing. And if you think such big numbers are realistic,
    write the same clear, simple and correct code with "int64_t" instead.

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Sat Sep 7 07:26:51 2024
    Tim Rentsch <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, put "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior.

    That sentence makes no sense to me.

    Behavior is defined by the standard, by the compiler documentation,
    by other standards (such as OpenMP) or it is undefined.

    "Giving less freedom" has no difference from defining.

    (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here?

    If a < INT_MAX, the behavior is the same as replacing the if()
    test with 'if(0)'. If the compiler can accurately deduce that
    the condition 'a < INT_MAX' will hold in all cases then the if()
    can be optimized away accordingly.

    If a == INT_MAX, one possibility is that code is generated to
    evaluate the addition and the comparison, and the if-block is
    either evaluated or it isn't, depending on the outcome of the
    comparison. Important: the compiler is disallowed from drawing
    any inferences based on "knowing" the result of either the
    addition or the comparison; code must be generated under a "best
    efforts" umbrella, and whatever the code does dictates whether
    the if-block is evaluated or not, with the compiler being
    forbidden to draw any conclusions based on what the result will
    be.

    If a == INT_MAX, it also should be possible for the addition to
    abort the program. Here again the compiler is disallowed from
    drawing any inferences based on knowing this will happen. To
    make this work the rule allowing "UB to travel backwards in time"
    must be revoked; unless a compiler can accurately deduce that a
    given piece of code cannot transgress into UB then other code in
    the program must not be moved (either forwards or backwards) past
    the possibly-not-well-defined code segment.

    After thinking about this for a time, what you want looks a lot
    like volatile.

    Is there any requirement that you can think of that would not
    be fulfilled with "volatile int a"?

    Is there anything with "volatile int a" that you do not want?

    If volatile is close to what you want, then this would be
    straightforward to incorporate into an existing compiler such as
    gcc, just add an option which declares every variable in the C
    front end volatile, weed out the resulting bugs (yes, that is a
    mixed metaphor) and be done.

  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sat Sep 7 05:51:41 2024
    Thomas Koenig <[email protected]> writes:

    Scott Lurndal <[email protected]> schrieb:

    David Brown <[email protected]> writes:

    On 05/09/2024 11:12, Terje Mathisen wrote:

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1) as an error signal that has to be handled with special code.

    You don't have loop counters, array indices, or integer arithmetic?

    We do. There is no issue using unsigned loop counters,

    I find counting down from n to 0 using unsigned variables
    unintuitive. Or do you always count up and then calculate
    what you actually use? Induction variable optimization
    should take care of that, but it would be more complicated
    to use.

    In most cases of counting down the upper bound is one more
    than the value to be used, reflecting a half-open interval.
    These ranges are analogous to pointers traversing arrays
    downwards:

    int stuff[20];

    for( int *p = stuff+20; p > stuff; ){
    p--;
    .. do something with *p ..
    }

    For pointers it's important that the pointer not "fall off the
    bottom" of the array. That needn't apply to unsigned index
    variables, so the decrement can be absorbed into the test:

    int stuff[20];

    for( unsigned i = 20; i-- > 0; ){
    .. do something with stuff[i] ..
    }

    If you adopt patterns similar to this one I think you will
    get used to it quickly and it will start to seem quite
    natural. Counting down is the mirror image of counting
    up. When counting up we "point at" and increment after using.
    When counting down we "point after" and decrement before using.

    Using half-open intervals also comes up in binary search:

    int stuff[N];

    unsigned low = 0, limit = N;
    while( low+1 != limit ){
    unsigned m = low + (limit-low)/2;
    .. test stuff[m] and pick one of ..
    .. low = m .. (or)
    .. limit = m ..
    }
    .. stuff[low] has the answer, if there is one ..

    At each point in the search we are considering a half-open
    interval. That makes writing (or reading) invariants for
    the code very easy. When low+1 == limit then there is only
    one element to consider.

  • From Tim Rentsch@21:1/5 to [email protected] on Sat Sep 7 06:52:02 2024
    [email protected] (MitchAlsup1) writes:

    On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

    Thomas Koenig <[email protected]> writes:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, put "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior. (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here?

    If a < INT_MAX, the behavior is the same as replacing the if()
    test with 'if(0)'. If the compiler can accurately deduce that
    the condition 'a < INT_MAX' will hold in all cases then the if()
    can be optimized away accordingly.

    If a == INT_MAX, one possibility is that code is generated to
    evaluate the addition and the comparison, and the if-block is
    either evaluated or it isn't, depending on the outcome of the
    comparison. Important: the compiler is disallowed from drawing
    any inferences based on "knowing" the result of either the
    addition or the comparison; code must be generated under a "best
    efforts" umbrella, and whatever the code does dictates whether
    the if-block is evaluated or not, with the compiler being
    forbidden to draw any conclusions based on what the result will
    be.

    If a == INT_MAX, it also should be possible for the addition to
    abort the program. Here again the compiler is disallowed from
    drawing any inferences based on knowing this will happen. To
    make this work the rule allowing "UB to travel backwards in time"
    must be revoked; unless a compiler can accurately deduce that a
    given piece of code cannot transgress into UB then other code in
    the program must not be moved (either forwards or backwards) past
    the possibly-not-well-defined code segment.

    It is also possible if a == INT_MAX that the exception will
    transfer control to a signal handler to do some SW orchestrated
    recovery.

    Philosophically this reaction doesn't fit with the others. Assuming
    for the sake of discussion that raising an implementation-defined
    signal is an important behavior to support, it should go into the
    C standard in a different way than making it part of the "limited
    undefined behavior" idea outlined above.

  • From Tim Rentsch@21:1/5 to Brett on Sat Sep 7 06:38:31 2024
    Brett <[email protected]> writes:

    I tried using unsigned for a bunch of my data types that should
    never go negative, but every time I would have to compare them
    with an int somewhere and that would cause a compiler warning,
    because the goal was to also remove unsafe code.

    What sort of ints? How many of those were constants? In which
    cases were the int values negative, and which cases non-negative?
    More generally, what are the circumstances that prompted you to
    compare a can-never-be-negative value to a potentially-negative
    value? Are most of the comparisons relational, or are there
    lots of equality/inequality?

    There are easy ways to compare (without getting warnings) signed
    values and unsigned values, but how a particular case should be
    addressed depends on the details. Can you supply more information?

  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sat Sep 7 14:30:30 2024
    On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

    Thomas Koenig <[email protected]> writes:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, put "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior. (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here?

    If a < INT_MAX, the behavior is the same as replacing the if()
    test with 'if(0)'. If the compiler can accurately deduce that
    the condition 'a < INT_MAX' will hold in all cases then the if()
    can be optimized away accordingly.

    If a == INT_MAX, one possibility is that code is generated to
    evaluate the addition and the comparison, and the if-block is
    either evaluated or it isn't, depending on the outcome of the
    comparison. Important: the compiler is disallowed from drawing
    any inferences based on "knowing" the result of either the
    addition or the comparison; code must be generated under a "best
    efforts" umbrella, and whatever the code does dictates whether
    the if-block is evaluated or not, with the compiler being
    forbidden to draw any conclusions based on what the result will
    be.

    If a == INT_MAX, it also should be possible for the addition to
    abort the program. Here again the compiler is disallowed from
    drawing any inferences based on knowing this will happen. To
    make this work the rule allowing "UB to travel backwards in time"
    must be revoked; unless a compiler can accurately deduce that a
    given piece of code cannot transgress into UB then other code in
    the program must not be moved (either forwards or backwards) past
    the possibly-not-well-defined code segment.

    It is also possible if a == INT_MAX that the exception will
    transfer control to a signal handler to do some SW orchestrated
    recovery.

    Philosophically this reaction doesn't fit with the others. Assuming
    for the sake of discussion that raising an implementation-defined
    signal is an important behavior to support, it should go into the
    C standard in a different way than making it part of the "limited
    undefined behavior" idea outlined above.

    With it "being difficult" to determine when an integer overflow
    has occurred in many architectures, it is unlikely that integer
    overflow could ever be put into the C standard.

  • From Tim Rentsch@21:1/5 to Bernd Linsel on Sat Sep 7 09:33:22 2024
    Bernd Linsel <[email protected]> writes:

    On 06.09.24 00:04, Tim Rentsch wrote:

    If START is signed (presumably of type int), so the loop might run
    zero times, but never more than INT_MAX times, then

    for( unsigned u = START < 0 ? 0 : START + 1u; u > 0 && u--; ){
    // Do something with data[u]
    }

    If START is unsigned, so in all cases the loop must run at
    least once, then

    unsigned u = START;
    do {
    // Do something with data[u]
    } while( u > 0 && u-- );

    (Yes I know the 'u > 0' expressions can be replaced by just 'u'.)

    The optimizer should be smart enough to realize that if 'u > 0'
    is true then the test 'u--' will also be true. The same should
    hold if 'u > 0' is replaced by just 'u'.

    (Disclaimer: code not compiled.)

    Both yield not very elegant code:

    https://godbolt.org/z/M4Y5PYP3v

    The problem being solved is not typical. In most cases
    downward-counting loops start at one past the end of the
    values, not at the last value. I didn't choose the problem.

    Any "inelegancy" might just as well as come from how the
    optimizer was written as from the code. Clearly optimizers
    do better on some patterns than others. (For that matter,
    the earlier code shown may have resulted in generated code
    that is just as unappealing.)

    The generated code being not very elegant doesn't necessarily
    imply poor performance.

    In almost all cases the performance implications don't matter.
    Premature optimization is the root of all evil. The first
    reaction should never be to look at what code is generated.

    The purpose of the example (besides fixing a bug in the original,
    which was removed) is, one, to illustrate an idea, and two, to
    show an alternate example pattern that may help in unrelated
    cases. It helps to be familiar with different approaches to
    common situations. For this particular problem, probably it's
    better to revise code outside the loop so the loop would be
    done differently. The point here is not this code specifically
    but a pattern and a principle that might be applicable in a
    range of coding circumstances.

  • From Brett@21:1/5 to [email protected] on Sat Sep 7 16:59:39 2024
    MitchAlsup1 <[email protected]> wrote:
    On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:

    Thomas Koenig <[email protected]> writes:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, put "dereferencing a NULL pointer shall be an observable
    error" would make sure that no null pointer checks are thrown
    away, and that this requires a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).

    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior. (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here?

    If a < INT_MAX, the behavior is the same as replacing the if()
    test with 'if(0)'. If the compiler can accurately deduce that
    the condition 'a < INT_MAX' will hold in all cases then the if()
    can be optimized away accordingly.

    If a == INT_MAX, one possibility is that code is generated to
    evaluate the addition and the comparison, and the if-block is
    either evaluated or it isn't, depending on the outcome of the
    comparison. Important: the compiler is disallowed from drawing
    any inferences based on "knowing" the result of either the
    addition or the comparison; code must be generated under a "best
    efforts" umbrella, and whatever the code does dictates whether
    the if-block is evaluated or not, with the compiler being
    forbidden to draw any conclusions based on what the result will
    be.

    If a == INT_MAX, it also should be possible for the addition to
    abort the program. Here again the compiler is disallowed from
    drawing any inferences based on knowing this will happen. To
    make this work the rule allowing "UB to travel backwards in time"
    must be revoked; unless a compiler can accurately deduce that a
    given piece of code cannot transgress into UB then other code in
    the program must not be moved (either forwards or backwards) past
    the possibly-not-well-defined code segment.

    It is also possible if a == INT_MAX that the exception will
    transfer control to a signal handler to do some SW orchestrated
    recovery.

    Philosophically this reaction doesn't fit with the others. Assuming
    for the sake of discussion that raising an implementation-defined
    signal is an important behavior to support, it should go into the
    C standard in a different way than making it part of the "limited
    undefined behavior" idea outlined above.

    With it "being difficult" to determine when an integer overflow
    has occurred in many architectures, it is unlikely that integer
    overflow could ever be put into the C standard.


    Swift traps on all overflows:

    https://docs.swift.org/swift-book/documentation/the-swift-programming-language/advancedoperators/#

    Such branches are predicted perfectly so they only cost some code density.

  • From Michael S@21:1/5 to Anton Ertl on Sat Sep 7 22:17:36 2024
    On Fri, 06 Sep 2024 13:26:42 GMT
    [email protected] (Anton Ertl) wrote:


    ARM A64:
    mov x2, #0x8000000000000000
    cmp x1, x2
    b.le 20 <bar2+0x10>

    I am hardly an expert in aarch64 code generation, but IMHO gcc is
    missing the shortest code:
    eor x1, x1, #0x8000000000000000
    b.eq 20 <bar2+0x10>

  • From Tim Rentsch@21:1/5 to Anton Ertl on Sat Sep 7 16:45:45 2024
    [email protected] (Anton Ertl) writes:

    Stefan Monnier <[email protected]> writes:

    Specifications are an agreement between the supplier and the client. The
    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    For programs there is no conformance level "free from UB" in the C
    standard.

    The C standard doesn't define any conformance "levels": it defines
    the term "strictly conforming program", for its own convenience in
    defining the language; it also defines the term "conforming
    program", for no apparent purpose at all. In both cases however
    what is given are simply definitions; there is no reason an
    interested party couldn't give a definition of some other term, for
    the purpose of identifying a class of C programs that have some
    particular property -- such as being free from undefined behavior --
    where membership in the class is completely determined by statements
    in the C standard, being used as a reference document.

    There are two conformance levels for programs:

    1) A strictly conforming program shall use only those features of the
    language and library specified in this International Standard.
    This excludes all programs that terminate, including the "Hello,
    World" program. [...]

    I don't know why you say this. Which aspects of the definition for
    "strictly conforming program" do you think are violated by a typical
    'Hello, World' program? I'm confident the people who wrote the C
    standard would say such a program is strictly conforming.

  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Sep 8 00:12:40 2024
    On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

    [email protected] (Anton Ertl) writes:

    Stefan Monnier <[email protected]> writes:

    Specifications are an agreement between the supplier and the client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    For programs there is no conformance level "free from UB" in the C
    standard.

    The C standard doesn't define any conformance "levels": it defines
    the term "strictly conforming program", for its own convenience in
    defining the language; it also defines the term "conforming
    program", for no apparent purpose at all. In both cases however
    what is given are simply definitions; there is no reason an
    interested party couldn't give a definition of some other term, for
    the purpose of identifying a class of C programs that have some
    particular property -- such as being free from undefined behavior --
    where membership in the class is completely determined by statements
    in the C standard, being used as a reference document.

    There are two conformance levels for programs:

    1) A strictly conforming program shall use only those features of the
    language and library specified in this International Standard.
    This excludes all programs that terminate, including the "Hello,
    World" program. [...]

    I don't know why you say this. Which aspects of the definition for
    "strictly conforming program" do you think are violated by a typical
    'Hello, World' program? I'm confident the people who wrote the C
    standard would say such a program is strictly conforming.

    The standard "Hello World !" program does not return a value to
    <effectively> crt0.

    Secondarily while one is supposed to return 0 for success and
    something else for failure, there is no standard C defined way
    that this is related back to the invoker of the program.

    Another issue is that main() may not have the 3 defined arguments
    and the containing environment is not supposed to complain when
    argc, argv, and envp are unused or even unnamed as arguments.

  • From MitchAlsup1@21:1/5 to David Brown on Sun Sep 8 00:17:25 2024
    On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

    On 07/09/2024 01:10, MitchAlsup1 wrote:
    On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 10:04 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >>
    on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want two's
    complement wrapping, and for the even fewer occasions when you
    actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few APIs that use (-1) as
    an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
    with unsigned i, arrays always use a zero base so in Rust the only array
    index type is usize, i.e., the largest supported unsigned type in the
    system, typically the same as u64.

    unsigned arithmetic is easier than signed integer arithmetic, including
    comparisons that would result in a negative value, you just have to make
    the test before subtracting, instead of checking if the result was
    negative.

    I.e., I cannot easily replicate a downward loop that exits when the
    counter becomes negative:

       for (int i = START; i >= 0; i-- ) {
         // Do something with data[i]
       }

    for (int i = START; i > -1; i-- ) {
          // Do something with data[i]
    }

    ;^)

    # define START 0x80000001


    No.

    The great thing about 32 bit integers is that your numbers are never
    anywhere close to being too big - or you /know/ you are dealing with
    very big numbers and you can take that into account such as by using
    64-bit integer types.

    A number that is the start or end of a normal count range is /never/ 0x80000001. Write code that is clear, simple and correct for what you
    are actually doing. And if you think such big numbers are realistic,
    write the same clear, simple and correct code with "int64_t" instead.

    static uint64_t array[1024*1024*512+1];
    static int SIZE = sizeof(array)/sizeof(uint64_t);
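    A downward loop does not need a signed counter if the decrement is
    folded into the test. Below is a minimal sketch using a hypothetical
    `find_last` helper. The test `i-- > 0` compares the old value and then
    decrements, so the body sees n-1 down to 0. When n == 0, the loop runs
    no iterations.

```c
#include <stddef.h>

/* Hypothetical helper: index of the last element equal to key, or n if
   there is none. The counter is a size_t; when i reaches 0 the test
   "0 > 0" fails, and the wrapped value left in i is never used. */
static size_t find_last(const int *data, size_t n, int key)
{
    for (size_t i = n; i-- > 0; ) {
        if (data[i] == key)
            return i;
    }
    return n;
}
```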

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Sep 8 00:23:38 2024
    And just for fun::

    On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:
    Here we have the three variants:

    #include <limits.h>

    extern long foo1(long);
    extern long foo2(long);

    long bar(long a, long b)
    {
        long c;
        if (__builtin_sub_overflow(b,1,&c))
            return foo1(a);
        else
            return foo2(a);
    }

    long bar2(long a, long b)
    {
        if (b < b-1)
            return foo1(a);
        else
            return foo2(a);
    }

    long bar3(long a, long b)
    {
        if (b == LONG_MIN)
            return foo1(a);
        else
            return foo2(a);
    }

    My 66000:
    bar               bar2              bar3
    add r3,R1,#-1     add r3,r1,#-1     bepm r1,.L4
    bge R3,.L4        bge r3,.L4
    8 bytes           8 bytes           4 bytes

    I have a direct test for POSMAX in ISA that does not use a constant.

  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sat Sep 7 18:46:20 2024
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:

    Thomas Koenig <[email protected]> writes:

    Thomas Koenig <[email protected]> schrieb:

    "Don't do this" or "don't do that" is not sufficient. Maybe you,
    together with like-minded people, could try formulating some rules
    as an extension to the C standard, and see where it gets you.
    Maybe you can get it published as an annex.

    Hm... putting some thought into it, it may be a good first step
    to define cases for which a diagnostic is required; maybe
    "observable error" would be a reasonable term.

    So, making "dereferencing a NULL pointer shall be an observable
    error" a rule would make sure that no null pointer checks are thrown
    away, and would require a run-time diagnostic.

    If that is the case, should dereferencing a member of a struct
    pointed to by a null pointer also be an observable error, and
    be required to be caught at run-time?

    Or is this completely the wrong track, and you would like to do
    something entirely different? Any annex to the C standard would
    still be constrained to the abstract machine (probably).
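    The kind of null pointer check at stake takes only a few lines to show.
    Below is a minimal C sketch with hypothetical function names.

    - In the first function, the dereference comes before the test. A
      compiler may therefore conclude that `p` is non-null and delete the
      test. GCC documents this as `-fdelete-null-pointer-checks`, which is
      enabled by default on most targets.
    - In the second function, the test comes first, so it stays meaningful
      under any rules.

```c
#include <stddef.h>

/* Dereference first, test second: because *p would be undefined for a
   null p, a compiler may treat the later test as always false and
   remove it. */
static int read_then_check(const int *p)
{
    int v = *p;
    if (p == NULL)
        return -1;
    return v;
}

/* Test first, dereference second: the test guards the access and
   cannot be optimized away on these grounds. */
static int check_then_read(const int *p)
{
    if (p == NULL)
        return -1;
    return *p;
}
```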

    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior.

    That sentence makes no sense to me.

    Behavior is defined by the standard, by the compiler documentation,
    by other standards (such as OpenMP) or it is undefined.

    "Giving less freedom" has no difference from defining.

    I use the term "undefined behavior" in the same sense that the C
    standard does. For example, if a particular C implementation
    supports the POSIX extensions to printf(), including documenting
    them, using those extensions still falls under the heading of
    undefined behavior, support and documentation notwithstanding.

    The idea is to define a new classification, perhaps "limited
    undefined behavior", that gives more freedom than "unspecified
    behavior" but not nearly as much as "undefined behavior" does
    now.

    (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here?

    If a < INT_MAX, the behavior is the same as replacing the if()
    test with 'if(0)'. If the compiler can accurately deduce that
    the condition 'a < INT_MAX' will hold in all cases then the if()
    can be optimized away accordingly.

    If a == INT_MAX, one possibility is that code is generated to
    evaluate the addition and the comparison, and the if-block is
    either evaluated or it isn't, depending on the outcome of the
    comparison. Important: the compiler is disallowed from drawing
    any inferences based on "knowing" the result of either the
    addition or the comparison; code must be generated under a "best
    efforts" umbrella, and whatever the code does dictates whether
    the if-block is evaluated or not, with the compiler being
    forbidden to draw any conclusions based on what the result will
    be.

    If a == INT_MAX, it also should be possible for the addition to
    abort the program. Here again the compiler is disallowed from
    drawing any inferences based on knowing this will happen. To
    make this work the rule allowing "UB to travel backwards in time"
    must be revoked; unless a compiler can accurately deduce that a
    given piece of code cannot transgress into UB then other code in
    the program must not be moved (either forwards or backwards) past
    the possibly-not-well-defined code segment.

    After thinking about this for a time, what you want looks a lot
    like volatile.

    That's a good insight. Certainly there are aspects of what I
    have proposed that are similar to how volatile works.

    Is there any requirement that you can think of that would not
    be fulfilled with "volatile int a"?

    Is there anything with "volatile int a" that you do not want?

    Something being volatile has consequences only in reference to
    objects, and only when a memory access (either read or write) is
    requested. There are no such things as volatile values. What
    we're looking for here is constraints on operations, not on
    memory accesses. In a sense one might say what we want is
    "volatile operators": similar in concept to how volatile works,
    but in a different area of language semantics.

    Also there are aspects of how 'volatile' is defined now that are too
    lax for what I think "volatile operators" need to do. However
    that is a fine point, I mention it only for completeness.

    If volatile is close to what you want, then this would be
    straightforward to incorporate into an existing compiler such as
    gcc, just add an option which declares every variable in the C
    front end volatile, weed out the resulting bugs (yes, that is a
    mixed metaphor) and be done.

    Like I said, it isn't the variables, it's the operators. Maybe
    though you have a good idea there, looking at how volatile is
    handled in gcc or clang might give some useful ideas about how to
    implement volatile operators.

  • From Tim Rentsch@21:1/5 to [email protected] on Sat Sep 7 19:47:38 2024
    [email protected] (MitchAlsup1) writes:

    On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

    [email protected] (Anton Ertl) writes:

    Stefan Monnier <[email protected]> writes:

    Specifications are an agreement between the supplier and the client. The

    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    For programs there is no conformance level "free from UB" in the C
    standard.

    The C standard doesn't define any conformance "levels": it defines
    the term "strictly conforming program", for its own convenience in
    defining the language; it also defines the term "conforming
    program", for no apparent purpose at all. In both cases however
    what is given are simply definitions; there is no reason an
    interested party couldn't give a definition of some other term, for
    the purpose of identifying a class of C programs that have some
    particular property -- such as being free from undefined behavior --
    where membership in the class is completely determined by statements
    in the C standard, being used as a reference document.

    There are two conformance levels for programs:

    1) A strictly conforming program shall use only those features of the
    language and library specified in this International Standard.
    This excludes all programs that terminate, including the "Hello,
    World" program. [...]

    I don't know why you say this. Which aspects of the definition for
    "strictly conforming program" do you think are violated by a typical
    'Hello, World' program? I'm confident the people who wrote the C
    standard would say such a program is strictly conforming.

    The standard "Hello World !" program does not return a value to
    <effectively> crt0.

    That has no effect on whether the program is strictly conforming.

    Secondarily while one is supposed to return 0 for success and
    something else for failure, there is no standard C defined way
    that this is related back to the invoker of the program.

    That has no effect on whether the program is strictly conforming.

    Another issue is that main() may not have the 3 defined arguments
    and the containing environment is not supposed to complain when
    argc, argv, and envp are unused or even unnamed as arguments.

    The usual "Hello, World" program defines main() either with no
    arguments

    int
    main(){
    ...
    }

    or with two arguments

    int
    main( int argc, char *argv[] ){
    ...
    }

    and in both cases main() has defined behavior and does not
    violate the strictures of strictly conforming programs.

    If the surrounding OS or whatever cannot support these, that
    doesn't change whether the program is strictly conforming. The
    condition of being strictly conforming is a predicate on
    programs, not on implementations.

  • From Tim Rentsch@21:1/5 to [email protected] on Sat Sep 7 19:32:43 2024
    [email protected] (MitchAlsup1) writes:

    On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:
    [...]
    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior. (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here? [...] If a == INT_MAX, it also should be
    possible for the addition to abort the program. [...]

    It is also possible if a == INT_MAX that the exception will
    transfer control to a signal handler to do some SW orchestrated
    recovery.

    Philosophically this reaction doesn't fit with the others. Assuming
    for the sake of discussion that raising an implementation-defined
    signal is an important behavior to support, it should go into the
    C standard in a different way than making it part of the "limited
    undefined behavior" idea outlined above.

    With it "being difficult" to determine when an integer overflow
    has occurred on many architectures, it is unlikely that integer
    overflow could ever be put into the C standard.

    It could easily be added to the C standard just by making the
    signal-raise option be conditional: give each implementation
    the choice of either (a) stipulating that overflow causes an implementation-defined signal to be raised, or (b) letting the
    operation be limited undefined behavior. Limited undefined
    behavior can be provided simply by naively compiling the code
    in question, so that can be accommodated regardless of how
    unsophisticated the processor is.

  • From Tim Rentsch@21:1/5 to Anton Ertl on Sat Sep 7 21:17:02 2024
    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.
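    One overlap check that stays within defined behavior uses only `==`.
    The standard defines `==` for pointers into different objects, but it
    does not define `<`, `<=`, `>` or `>=` for them. Below is a minimal
    sketch with a hypothetical `regions_overlap`. It assumes both regions
    really are `n` accessible bytes. It costs O(n) comparisons, so it suits
    assertions and debugging rather than every copy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Do [dst, dst+n) and [src, src+n) share any byte? Uses only pointer
   equality, which is defined even for pointers into different objects.
   Runs in O(n). */
static bool regions_overlap(const void *dst, const void *src, size_t n)
{
    const unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++) {
        /* Two non-empty ranges of equal length overlap iff the start of
           one lies inside the other. */
        if (d + i == s || s + i == d)
            return true;
    }
    return false;
}
```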

  • From David Brown@21:1/5 to All on Sun Sep 8 08:25:10 2024
    On 08/09/2024 02:17, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

    On 07/09/2024 01:10, MitchAlsup1 wrote:
    On Fri, 6 Sep 2024 22:41:12 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 10:04 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >>
    on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want two's
    complement wrapping, and for the even fewer occasions when you
    actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few APIs that use (-1) as
    an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course fine
    with unsigned i, arrays always use a zero base so in Rust the only array
    index type is usize, i.e., the largest supported unsigned type in the
    system, typically the same as u64.

    unsigned arithmetic is easier than signed integer arithmetic, including
    comparisons that would result in a negative value, you just have to make
    the test before subtracting, instead of checking if the result was
    negative.

    I.e., I cannot easily replicate a downward loop that exits when the
    counter becomes negative:

       for (int i = START; i >= 0; i-- ) {
         // Do something with data[i]
       }

    for (int i = START; i > -1; i-- ) {
          // Do something with data[i]
    }

    ;^)

    # define START 0x80000001


    No.

    The great thing about 32 bit integers is that your numbers are never
    anywhere close to being too big - or you /know/ you are dealing with
    very big numbers and you can take that into account such as by using
    64-bit integer types.

    A number that is the start or end of a normal count range is /never/
    0x80000001.  Write code that is clear, simple and correct for what you
    are actually doing.  And if you think such big numbers are realistic,
    write the same clear, simple and correct code with "int64_t" instead.

    static uint64_t array[1024*1024*512+1];
    static int      SIZE = sizeof(array)/sizeof(uint64_t);

    Surely you mean :

    static const size_t array_size = sizeof(array) / sizeof(uint64_t);

    ?

    Look, if you want to write such strange code, I certainly can't stop
    you. But I can tell you that /I/ think it's very poor style, and that
    /I/ would reject it in a code review.
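    This sketch divides by the size of the first element instead of
    restating the element type. That keeps the count correct if the type
    changes, and it rules out a typo in the type name. `ARRAY_COUNT` and
    `table` are hypothetical names. The array is deliberately much smaller
    than the one above. The macro only works on real arrays, not on
    pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Element count of a true array (not a pointer), as a size_t. */
#define ARRAY_COUNT(arr) (sizeof (arr) / sizeof (arr)[0])

/* Small stand-in for the 2^29+1-element array discussed above. */
static uint64_t table[1024 + 1];
static const size_t table_size = ARRAY_COUNT(table);
```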

  • From David Brown@21:1/5 to Tim Rentsch on Sun Sep 8 09:20:53 2024
    On 08/09/2024 04:32, Tim Rentsch wrote:
    [email protected] (MitchAlsup1) writes:

    On Sat, 7 Sep 2024 13:52:02 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Fri, 6 Sep 2024 13:37:13 +0000, Tim Rentsch wrote:
    [...]
    The idea is not to make more of the language defined but to give
    less freedom to cases of undefined behavior. (It might make
    sense to define certain cases that are undefined behavior now but
    that is a separate discussion.) Let me take an example from
    another of your postings:

    int a;

    ...

    if (a > a + 1) {
    ...
    }


    Stipulating that 'a' has a well-defined int value, what behaviors
    are allowable here? [...] If a == INT_MAX, it also should be
    possible for the addition to abort the program. [...]

    It is also possible if a == INT_MAX that the exception will
    transfer control to a signal handler to do some SW orchestrated
    recovery.

    Philosophically this reaction doesn't fit with the others. Assuming
    for the sake of discussion that raising an implementation-defined
    signal is an important behavior to support, it should go into the
    C standard in a different way than making it part of the "limited
    undefined behavior" idea outlined above.

    With it "being difficult" to determine when an integer overflow
    has occurred on many architectures, it is unlikely that integer
    overflow could ever be put into the C standard.


    The ckd_add, ckd_sub and ckd_mul macros from C23 (<stdckdint.h>) make
    it easy to check for integer overflow. And of course C has had
    guaranteed 64-bit support since C99 - it's very rare to overflow these.
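    Here is a minimal sketch of such a check, using a hypothetical
    `checked_add` wrapper. It uses C23 `ckd_add` from `<stdckdint.h>` when
    that header is available. Otherwise it falls back to
    `__builtin_add_overflow`, a GCC/Clang extension. It also assumes
    `__has_include`, which C23 standardizes and GCC and Clang support as an
    extension. The two checking primitives take their arguments in
    different orders, and both return true when the result overflowed.

```c
#include <stdbool.h>

#if defined(__has_include)
#  if __has_include(<stdckdint.h>)
#    include <stdckdint.h>
#  endif
#endif

/* Hypothetical helper: store a + b in *sum and return true on success,
   false if the mathematical result does not fit in an int. */
static bool checked_add(int a, int b, int *sum)
{
#ifdef ckd_add
    /* C23: ckd_add(result_ptr, a, b) returns true on overflow. */
    return !ckd_add(sum, a, b);
#else
    /* Fallback: GCC/Clang builtin, operands first, result pointer last. */
    return !__builtin_add_overflow(a, b, sum);
#endif
}
```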

    It could easily be added to the C standard just by making the
    signal-raise option be conditional: give each implementation
    the choice of either (a) stipulating that overflow causes an implementation-defined signal to be raised, or (b) letting the
    operation be limited undefined behavior. Limited undefined
    behavior can be provided simply by naively compiling the code
    in question, so that can be accommodated regardless of how
    unsophisticated the processor is.

    The C standard doesn't have anything where implementations have an
    option between a particular behaviour or undefined behaviour - because
    that would simply be the same as undefined behaviour. It sometimes has footnotes with suggestions of possible results, and it could add such a footnote for signed integer arithmetic overflow treatment. But it would
    not have any greater blessing from the standard than wrapping,
    saturating, or assuming it is impossible.

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Sun Sep 8 08:26:25 2024
    Tim Rentsch <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:

    After thinking about this for a time, what you want looks a lot
    like volatile.

    That's a good insight. Certainly there are aspects of what I
    have proposed that are similar to how volatile works.

    The way I understand you is the following: You want the
    compiler to be forbidden to remove codepaths on the assumption
    that undefined behavior cannot happen, and you want a
    "best effort" in that case, which includes throwing an error
    or just ignoring everything and proceeding.

    The observable behavior includes (n2596)

    "Volatile accesses to objects are evaluated strictly according to
    the rules of the abstract machine."

    So, assuming that variables are objects (if there's a definition
    of an object in n2596, I missed it) the compiler cannot remove
    accessing a in

    volatile int a;

    if (a > a + 1)

    so it cannot remove any code path leading to the if statement, which
    is what you want. An interesting point is what "volatile access"
    actually means, especially for automatic variables; it seems that
    all compilers treat this as a memory access (which makes limited
    sense in my opinion - is there an explanation for this?)


    Is there any requirement that you can think of that would not
    be fulfilled with "volatile int a"?

    Is there anything with "volatile int a" that you do not want?

    Something being volatile has consequences only in reference to
    objects, and only when a memory access (either read or write) is
    requested. There are no such things as volatile values. What
    we're looking for here is constraints on operations, not on
    memory accesses. In a sense one might say what we want is
    "volatile operators": similar in concept to how volatile works,
    but in a different area of language semantics.

    Hmm.. OK. The nice thing about SSA is that it transforms
    complicated expressions like "a + b + c" into

    tmp1 = a + b
    tmp2 = tmp1 + c

    so it would be possible to write a pass which would declare as
    volatile those variables that you want (not needed for unsigned, for
    example).

    Alternatively, you could write a pass which translates

    int a, b;

    tmp1 = a + b;

    into

    tmp1 = (int) ((unsigned) a + (unsigned) b);

    or just use -fwrapv in the first place.

    So, SSA offers you the possibility of working on operators, like
    you want to.
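    This is a hand-written sketch of what such a pass might produce for
    `a + b + c`. There is one temporary per operator, and each signed `+`
    is replaced by the unsigned round trip above. `wrap_add`, `sum3` and
    `exceeds_successor` are hypothetical names. Unsigned addition wraps by
    definition. Converting the out-of-range result back to `int` is
    implementation-defined rather than undefined, and GCC and Clang
    document it as reduction modulo 2^N.

```c
#include <limits.h>
#include <stdbool.h>

/* Wrapping addition: computed in unsigned arithmetic, then converted back.
   The conversion of an out-of-range value is implementation-defined
   (modulo 2^N on GCC and Clang), so the compiler may not assume it
   never happens. */
static int wrap_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}

/* a + b + c lowered to three-address form, one temporary per operator. */
static int sum3(int a, int b, int c)
{
    int tmp1 = wrap_add(a, b);
    int tmp2 = wrap_add(tmp1, c);
    return tmp2;
}

/* The running example: with wrapping addition, a > a + 1 is a real test. */
static bool exceeds_successor(int a)
{
    return a > wrap_add(a, 1);
}
```

    With this lowering, `a > a + 1` is a genuine run-time test that is true
    only for `a == INT_MAX`. GCC's and Clang's `-fwrapv` gives plain `+`
    the same semantics.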

  • From Anton Ertl@21:1/5 to Tim Rentsch on Sun Sep 8 15:12:03 2024
    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    Stefan Monnier <[email protected]> writes:

    Specifications are an agreement between the supplier and the client. The
    The problem here is that the C standard, seen as a contract, is unfair
    to the programmer, because it's so excruciatingly hard to write code
    that is guaranteed to be free from UB.

    For programs there is no conformance level "free from UB" in the C
    standard.

    The C standard doesn't define any conformance "levels": it defines
    the term "strictly conforming program", for its own convenience in
    defining the language; it also defines the term "conforming
    program", for no apparent purpose at all.

    It defines both terms in the section on "Conformance", so I take it
    that both are there for defining the conformance of programs; you may
    not consider them to be levels, but given that all "strictly
    conforming programs" are also "conforming programs", it has the
    feeling of conformance levels to me.

    In both cases however
    what is given are simply definitions; there is no reason an
    interested party couldn't give a definition of some other term, for
    the purpose of identifying a class of C programs that have some
    particular property -- such as being free from undefined behavior --
    where membership in the class is completely determined by statements
    in the C standard, being used as a reference document.

    Sure, but the C standard does not give such a definition, so the
    "interested party" would cherry-pick from the C standard.

    There are two conformance levels for programs:

    1) A strictly conforming program shall use only those features of the
    language and library specified in this International Standard.
    This excludes all programs that terminate, including the "Hello,
    World" program. [...]

    I don't know why you say this. Which aspects of the definition for
    "strictly conforming program" do you think are violated by a typical
    'Hello, World' program?

    A typical "Hello, World" program terminates, and as mentioned, no
    terminating program can be strictly conforming, because it exercises
    at least implementation-defined behaviour (e.g., look at section
    7.22.4.4 of C11).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Tim Rentsch on Sun Sep 8 15:36:39 2024
    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    2) Even if there is such a check, you have to be aware that there is a potential problem with memcpy(). In that case the way to go is to
    just use memmove(). But that does not help you with the next "clever"
    idea that some compiler or library maintainer has.

    - anton

  • From Anton Ertl@21:1/5 to [email protected] on Sun Sep 8 15:32:02 2024
    [email protected] (MitchAlsup1) writes:
    And just for fun::

    On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:
    Here we have the three variants:

    #include <limits.h>

    extern long foo1(long);
    extern long foo2(long);

    long bar(long a, long b)
    {
        long c;
        if (__builtin_sub_overflow(b,1,&c))
            return foo1(a);
        else
            return foo2(a);
    }

    long bar2(long a, long b)
    {
        if (b < b-1)
            return foo1(a);
        else
            return foo2(a);
    }

    long bar3(long a, long b)
    {
        if (b == LONG_MIN)
            return foo1(a);
        else
            return foo2(a);
    }

    My 66000:
    bar               bar2              bar3
    add r3,R1,#-1     add r3,r1,#-1     bepm r1,.L4
    bge R3,.L4        bge r3,.L4
    8 bytes           8 bytes           4 bytes

    I have a direct test for POSMAX in ISA that does not use a constant.

    How does bge work in the first and second column? My impression was
    that you are using an 88k-style flags-in-GPR architecture.

    Concerning the last column, the gcc developer who added the
    transformation of bar2() into bar3() apparently had My66000 in mind.
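A small sketch of why the rewrite of bar2() into bar3() is sound: b - 1 overflows a long only when b == LONG_MIN. The helper names below are hypothetical. The second helper uses the same GCC/Clang __builtin_sub_overflow() as bar(), and neither helper ever evaluates an overflowing subtraction.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper mirroring the test in bar3():
   b - 1 is representable for every long except LONG_MIN. */
static bool dec_overflows(long b)
{
    return b == LONG_MIN;
}

/* Hypothetical helper mirroring bar(): the GCC/Clang builtin returns true
   when the exact value of b - 1 does not fit in c (c then holds the
   wrapped result), so no undefined behavior occurs. */
static bool dec_overflows_builtin(long b)
{
    long c;
    return __builtin_sub_overflow(b, 1L, &c);
}
```

Under wrapping semantics, bar2()'s b < b-1 is true for exactly the same single value. In ISO C, however, evaluating b-1 for LONG_MIN is itself undefined, so the compiler is free to substitute the cheaper comparison.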

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Tim Rentsch@21:1/5 to Niklas Holsti on Sun Sep 8 09:19:13 2024
    Niklas Holsti <[email protected]d> writes:

    On 2024-09-03 11:10, David Brown wrote:

    [snip]

    (There are a few situations where UB in C could be diagnosed at
    compile-time, which are probably historical decisions to avoid
    imposing too much work on early compilers. Where possible, UB that
    can be caught at compile time, could usefully be turned into
    constraint violations that must be diagnosed.)

    A thoughtless, knee-jerk reaction, ending in a wrongheaded
    conclusion.

    The problem, as you of course know, is that the "can" in "can be
    caught at compile time" depends on the amount and kind of analysis
    that is done at compile time -- some cases of UB "can" be caught at
    compile time but only by advanced and costly analysis. If the
    language standard requires that such things /must/ be detected by
    the compiler, it can place quite a burden on the developers of
    conforming compilers.

    That is one problem.

    As I understand it, current C compilers detect UB mostly as a side
    effect of the analyses they do for code optimization purposes, which
    vary widely between compilers, and so the UB-detections also vary.

    There are different kinds of undefined behavior; some are easy
    to detect, others require more extensive analysis. In the second
    category the analysis usually is approximate rather than exact;
    false positive cases need to be weighed against false negative
    cases, looking for the right balance, and very often it happens
    that neither of those is zero. Obviously any requirement that a
    mandatory diagnostic be issued should have no false positives,
    which often means doing a different analysis. More work.

    Another problem is that just the act of specifying the condition under
    which a diagnostic is required means a lot of work and a non-trivial
    amount of additional text needed in the C standard. If someone is
    interested to investigate this a good place to start is the Java
    standard, where there are specific rules for deciding if variables are
    all initialized before any use. Alternatively look in the C standard
    at the formal definition of 'restrict'. Besides being hard to write,
    both of these are quite difficult to read and understand. Even more
    of those? No thanks.

    Let me add: it is not always a good idea to require a diagnostic,
    even in cases where it is 100% certain that there is undefined
    behavior. Unfortunately it seems there are a fair number of people
    who don't get this.

    This issue (compile-time detection) has now and then been discussed in
    the Ada standards group. Given the currently low market penetration of
    Ada, the group has been reluctant to require too much of the
    compilers, and so the more advanced UB-detecting tools are
    stand-alone, such as the SPARK tools.

    I'm all in favor of static analysis. And I don't mind if compilers do
    it (selectively), instead of or in addition to stand-alone tools. But
    there is a huge chasm between saying compilers /can/ do it and saying
    compilers /must/ do it. Crossing that chasm is a bridge too far.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Sep 8 17:45:03 2024
    On Sun, 8 Sep 2024 15:32:02 +0000, Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    And just for fun::

    On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:
    Here we have the three variants:

    #include <limits.h>

    extern long foo1(long);
    extern long foo2(long);

    long bar(long a, long b)
    {
    long c;
    if (__builtin_sub_overflow(b,1,&c))
    return foo1(a);
    else
    return foo2(a);
    }

    long bar2(long a, long b)
    {
    if (b < b-1)
    return foo1(a);
    else
    return foo2(a);
    }

    long bar3(long a, long b)
    {
    if (b == LONG_MIN)
    return foo1(a);
    else
    return foo2(a);
    }

    My 66000:
    bar:             bar2:            bar3:
    add r3,R1,#-1    add r3,r1,#-1    bepm r1,.L4
    bge R3,.L4       bge r3,.L4
    8-bytes          8-bytes          4-bytes

    I have a direct test for POSMAX in ISA that does not use a constant.

    How does bge work in the first and second column? My impression was
    that you are using an 88k-style flags-in-GPR architecture.

    I just copied the RISC-V code

    Concerning the last column, the gcc developer who added the
    transformation of bar2() into bar3() apparently had My66000 in mind.

    My branch on comparison to zero (BC) instruction has 32 variants
    with only ~20 being normal uses. This gave room for signed and
    unsigned int-MAX and int-MIN.

    BTW I had the comparisons to int-MAX/MIN in since about 2016.

    - anton

  • From MitchAlsup1@21:1/5 to David Brown on Sun Sep 8 18:32:10 2024
    On Sun, 8 Sep 2024 6:25:10 +0000, David Brown wrote:

    On 08/09/2024 02:17, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

    static uint64_t array[1024*1024*512+1];
    static int      SIZE = sizeof(array)/sizeof(uint64_t);

    Surely you mean :

    static const size_t array_size = sizeof(array) / sizeof(uint64_t);


    I wanted SIZE to have the same type as i.

  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Sep 8 18:34:54 2024
    On Sun, 8 Sep 2024 2:47:38 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

    Another issue is that main() may not have the 3 defined arguments
    and the containing environment is not supposed to complain when
    argc, argv, and envp are unused or even unnamed as arguments.

    The usual "Hello, World" program defines main() either with no
    arguments

    int
    main(){
    ...
    }

    or with two arguments

    int
    main( int argc, char *argv[] ){
    ...
    }

    and in both cases main() has defined behavior and does not
    violate the strictures of strictly conforming programs.

    The Linux environment (crt0) calls main with 3 arguments.

    Are you arguing that a program can be strictly conforming and
    not be type safe at its call/return interfaces ??

    If the surrounding OS or whatever cannot support these, that
    doesn't change whether the program is strictly conforming. The
    condition of being strictly conforming is a predicate on
    programs, not on implementations.

  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Sep 8 11:18:46 2024
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:

    Thomas Koenig <[email protected]> writes:

    After thinking about this for a time, what you want looks a lot
    like volatile.

    That's a good insight. Certainly there are aspects of what I
    have proposed that are similar to how volatile works.

    The way I understand you is the following: You want the
    compiler to be forbidden to remove codepaths on the assumption
    that undefined behavior cannot happen, and you want a
    "best effort" in that case, which includes throwing an error
    or just ignoring everything and proceeding.

    The key point is not about removing (or forcing) code paths, but
    about what inferences may be drawn. Consider this example:

    int a = .. something ..;
    if( a > a+1 ){ .. stuff not involving a .. }
    if( a != INT_MAX ){ ... }

    Relying on the premise that "undefined behavior doesn't happen",
    a compiler might discard the dependent block of the first if().
    But the compiler might also always execute the dependent block
    of the second if(), because if a == INT_MAX then the first test
    would have been undefined behavior, which violates our premise.
    It is just as wrong to skip the test in the second if() as it
    is to remove the controlled block in the first if().

    Consider a related example:

    int a = .. something ..;
    if( a < a+1 ){ .. stuff not involving a .. }
    if( a != INT_MAX ){ .. other stuff not involving a .. }

    Again operating under the premise that the program has no
    undefined behavior, both controlled blocks can be executed
    unconditionally, because the assumption of there being no
    undefined behavior leads to a bad inference for the second
    if() test. Notice by the way that the same bad inference
    can be drawn if the order of the if() statements is reversed,
    because of the rule that undefined behavior "can travel
    backwards in time".

    The observable behavior includes (n2596)

    "Volatile accesses to objects are evaluated strictly according to
    the rules of the abstract machine."

    So, assuming that variables are objects (if there's a definition
    of an object in n2596, I missed it) the compiler cannot remove
    accessing a in

    volatile int a;

    if (a > a + 1)

    so it cannot remove any code path leading to the if statement, which
    is what you want. An interesting point is what "volatile access"
    actually means, especially for automatic variables; it seems that
    all compilers treat this as a memory access (which makes limited
    sense in my opinion - is there an explanation for this?)

    The original motivation for volatile is to ensure an actual memory
    access occurs, in cases where what is happening is outside what
    the C implementation knows about. Examples are reading or writing
    by another process (perhaps not written in C) or a memory-mapped
    I/O port. It may be unlikely that a function-local variable would
    fall into such a category, but volatile is there in case someone
    thinks it does.

    Is there any requirement that you can think of that would not
    be fulfilled with "volatile int a"?

    Is there anything with "volatile int a" that you do not want?

    Something being volatile has consequences only in reference to
    objects, and only when a memory access (either read or write) is
    requested. There are no such things as volatile values. What
    we're looking for here is constraints on operations, not on
    memory accesses. In a sense one might say what we want is
    "volatile operators": similar in concept to how volatile works,
    but in a different area of language semantics.

    Hmm.. OK. The nice thing about SSA is that it transforms
    complicated expressions like "a + b + c" into

    tmp1 = a + b
    tmp2 = tmp1 + c

    so it would be possible to write a pass which would declare those
    variables as volatile that you want (not needed for unsigned, for
    example).

    Alternatively, you could write a pass which translates

    int a, b;

    tmp1 = a + b;

    into

    tmp1 = (int) ((unsigned) a + (unsigned) b);

    or just use -fwrapv in the first place.

    So, SSA offers you the possibility of working on operators, like
    you want to.
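The unsigned translation above can be packaged as a helper. This is a minimal sketch, assuming gcc or clang. Unsigned addition is defined to wrap, and converting an out-of-range result back to int is implementation-defined rather than undefined (gcc documents reduction modulo 2^N). The result therefore matches what -fwrapv gives for plain int addition. The helper names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: int addition with wrapping semantics, computed
   via unsigned arithmetic, which is defined to wrap modulo UINT_MAX + 1.
   The final conversion to int is implementation-defined, not undefined;
   gcc reduces it modulo 2^N, i.e. two's-complement wraparound. */
static int wrap_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}

/* The test (a > a+1) rebuilt on wrap_add(): true only for INT_MAX. */
static bool gt_succ(int a)
{
    return a > wrap_add(a, 1);
}
```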

    We're talking about different things. What you are talking about is
    (perhaps only partially) an implementation strategy. What I am
    talking about is how to define the abstract semantics. Exactly what
    the rules are has to come first; after the rules are known then we
    can think about how they might be implemented.

    In terms of defining the abstract semantics, volatile doesn't do the
    job. There are several reasons for this, but the most important is
    that undefined behavior takes precedence over volatile. If we have
    a program

    volatile int *p;
    ...
    *p = 0;
    ... much further down ...
    if( 1/0 ) ...

    the assignment to *p doesn't have to have happened, regardless of
    the volatile status of *p. There needs to be a meaning defined
    for some more constrained form of undefined behavior, which I have
    called "limited undefined behavior" in other postings, and a change
    to the semantics of some constructs from "undefined behavior" to
    "limited undefined behavior" (or some other suitable term), to get
    the results desired.

    I hope you can see what I'm trying to get at here. I admit that my
    descriptions are more abstruse than I would like. It's not an easy
    area to talk about.

  • From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 8 17:52:33 2024
    [email protected] (MitchAlsup1) writes:

    On Sun, 8 Sep 2024 2:47:38 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

    Another issue is that main() may not have the 3 defined arguments
    and the containing environment is not supposed to complain when
    argc, argv, and envp are unused or even unnamed as arguments.

    The usual "Hello, World" program defines main() either with no
    arguments

    int
    main(){
    ...
    }

    or with two arguments

    int
    main( int argc, char *argv[] ){
    ...
    }

    and in both cases main() has defined behavior and does not
    violate the strictures of strictly conforming programs.

    The Linux environment (crt0) calls main with 3 arguments.

    Are you arguing that a program can be strictly conforming and
    not be type safe at its call/return interfaces ??

    Note by the way that the C standard doesn't make any guarantees
    about how a strictly conforming program will run under any given
    implementation. All the standard does say is that a conforming
    implementation shall accept any strictly conforming program (with
    slightly different rules for conforming hosted implementations as
    compared to conforming freestanding implementations).

  • From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 8 17:31:19 2024
    [email protected] (MitchAlsup1) writes:

    On Sun, 8 Sep 2024 2:47:38 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Sat, 7 Sep 2024 23:45:45 +0000, Tim Rentsch wrote:

    Another issue is that main() may not have the 3 defined arguments
    and the containing environment is not supposed to complain when
    argc, argv, and envp are unused or even unnamed as arguments.

    The usual "Hello, World" program defines main() either with no
    arguments

    int
    main(){
    ...
    }

    or with two arguments

    int
    main( int argc, char *argv[] ){
    ...
    }

    and in both cases main() has defined behavior and does not
    violate the strictures of strictly conforming programs.

    The Linux environment (crt0) calls main with 3 arguments.

    The C standard allows defining main() either with no parameters,
    with two parameters (of types int and char **), or "in some other
    implementation-defined manner". (Note: this rule applies only
    to hosted implementations; freestanding implementations have a
    different rule. Compilers on Linux are hosted implementations.)

    On Ubuntu Linux, both gcc and clang accept (under -pedantic with
    either -std=c99 or -std=c11) this input

    #include <stdio.h>

    int
    main(){
    printf( "Hello, world\n" );
    return 0;
    }

    and this input

    #include <stdio.h>

    int
    main( int argc, char *argv[] ){
    printf( "Hello, world\n" );
    return 0;
    }

    and this input

    #include <stdio.h>

    int
    main( int argc, char *argv[], char *envp[] ){
    printf( "Hello, world\n" );
    return 0;
    }

    without giving any diagnostics. The executable produced in each
    case runs fine. In fact using -S to look at generated code, all
    three compile to the same code (different generated code under
    gcc compared to clang, but the same code for all versions under
    each compiler).

    As a sanity check, I tried this input

    #include <stdio.h>

    int
    main( int argc, char *argv[], double *envp[] ){
    printf( "Hello, world\n" );
    return 0;
    }

    which from gcc gives a warning diagnostic, and from clang gives
    an error diagnostic. The generated code under gcc is the same as
    that produced by gcc for the other inputs, and the produced
    executable runs and does the same thing as the other versions (as
    one would expect, since the generated code is the same).


    Are you arguing that a program can be strictly conforming and
    not be type safe at its call/return interfaces ??

    Both of the first two versions (with a no-parameters main() and
    with a two-parameter main()) satisfy all the criteria of strictly
    conforming programs.

    The third version (with a third parameter of type char **) does
    not satisfy the definition of strictly conforming programs,
    because it uses a feature not specified as part of the language
    or library -- namely, the implementation-defined form of main().

    The C standard requires every implementation to accept all
    strictly conforming programs (or the implementation is not
    conforming if it chooses not to accept a SC program for any
    reason). We don't expect all C compilers to accept a main()
    defined with three parameters, which is consistent with the
    rule that they are required to accept all strictly conforming
    programs.

    Does this explanation help clear things up? Or is there still
    some aspect I haven't explained adequately?

  • From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 8 19:20:06 2024
    [email protected] (MitchAlsup1) writes:

    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:

    Terje Mathisen <[email protected]> writes:

    Michael S wrote:

    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:

    3 years ago Terje Mathisen wrote that many years ago he read
    that the behaviour of memcpy() with overlapped src/dst was defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983". So, two people of
    different ages living in different parts of the world are telling
    the same story. Maybe there exists an old popular book that said
    that it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime
    doc?

    Specifying that it would always copy from beginning to end of
    the source buffer, in increasing address order meant that it
    was guaranteed safe when used to compact buffers.

    What is "compact buffers" ?

    Assume a buffer consisting of records of some type, some of
    them marked as deleted. Iterating over them while removing
    the gaps means that you are always copying to a destination
    lower in memory, right?

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(),
    which will choose to run forwards or backwards as needed. So you
    might as well just call memmove() any time you are not sure
    memcpy() is safe and appropriate.

    The ever-shallow David Brown first misses the point, then makes a
    slightly incorrect statement, and finally makes a recommendation
    that surely is familiar to every reader in the newsgroup.

    Memmove() is always appropriate unless you are doing something
    nefarious.

    So:
    # define memcpy memmove

    Incidentally, if one wants to do this, it's advisable to write

    #undef memcpy

    before the #define of memcpy.

    and move forward with life--for the 2 extra cycles memmove costs
    it saves everyone long term grief.

    When you need the nefarious activities of memcpy write it as a
    for loop by yourself and comment the nefariousness of the use.

    The point of my comment is that there is extra information
    available in the scenario described, and it might be useful to
    take advantage of that information not to make a low-level change
    (eg, substitute memmove() for memcpy()) but to switch to a
    different higher level strategy, such as using a semi-space
    compactor (or other possibilities).

    Simply replacing memcpy() by memmove() of course will always
    work, but there might be negative consequences beyond a cost
    of 2 extra cycles -- for example, if a negative stride is
    better performing than a positive stride, but the nature
    of the compaction forces memmove() to always take the slower
    choice.

    It's always useful to have more options to choose from when there
    is more information, even if ultimately what path is chosen
    is the zero-information path.
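Tim leaves the "simple test" unstated. Here is a plausible sketch, assuming the records live in one array and are addressed by index, so every comparison is between plain integers. A block copy of count elements from index src to index dst overlaps exactly when the two start indices are closer together than count. The helper names are hypothetical. The fallback to memmove() lets the library pick the copy direction.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: within one array, a copy of `count` elements from
   index src to index dst overlaps iff the start indices are closer
   together than `count` (and something is actually copied). */
static bool block_overlaps(size_t dst, size_t src, size_t count)
{
    size_t dist = dst > src ? dst - src : src - dst;
    return count != 0 && dist < count;
}

/* Hypothetical helper: move `count` ints inside arr, using memcpy() only
   when the test proves the regions are disjoint, memmove() otherwise. */
static void move_ints(int *arr, size_t dst, size_t src, size_t count)
{
    if (block_overlaps(dst, src, count))
        memmove(&arr[dst], &arr[src], count * sizeof arr[0]);
    else
        memcpy(&arr[dst], &arr[src], count * sizeof arr[0]);
}
```

When compaction moves one record at a time toward lower indices, count is 1 and dst != src, so the memcpy() branch is always the one taken.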

  • From Tim Rentsch@21:1/5 to Anton Ertl on Sun Sep 8 22:44:54 2024
    [email protected] (Anton Ertl) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there
    is an overlap of the memory areas. But then I remembered that you
    cannot write such a check in standard C without (in the general
    case) exercising undefined behaviour;

    Yes, I can.

    and then the compiler could eliminate the check or do something
    else that's unexpected. Do you have such a check in mind that
    does not exercise undefined behaviour in the general case?

    Sure. I wouldn't have made my earlier statement otherwise.

    2) Even if there is such a check, you have to be aware that there
    is a potential problem with memcpy(). In that case the way to go
    is to just use memmove().

    The point of my previous comment was only to address the question
    of whether any existing memcpy() calls are problematic. If all
    of the checks return "no overlap" then memcpy() is not the problem.

    That said, using memmove() in place of memcpy() is one way to get
    around problems with undesired behavior from memcpy(), but depending
    on circumstances there may be other ways that are better.

    But that does not help you with the next "clever" idea that some
    compiler or library maintainer has.

    I have the impression that this is an editorial comment having
    nothing to do with memcpy() or memmove(). If that impression
    is wrong then I'm at a loss to understand what you are talking
    about, and would you please elaborate.
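Tim does not show his check. One possibility, which is an assumption and not necessarily what he has in mind, uses only == on pointers. Equality comparison is defined even for pointers into different objects, whereas <, > and pointer subtraction are not. The price is O(n) comparisons, so it suits a test build that instruments memcpy() calls rather than production code. The helper name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: do the n-byte regions at a and b overlap?
   Two equal-length regions overlap exactly when one start address lies
   inside the other region, i.e. b == a + i or a == b + i for some i < n.
   Only pointer equality is used, so no comparison is undefined. */
static bool regions_overlap(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    for (size_t i = 0; i < n; i++) {
        if (pa + i == pb || pb + i == pa)
            return true;
    }
    return false;
}
```

In such a build, each call could be preceded by assert(!regions_overlap(dst, src, n)).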

  • From Anton Ertl@21:1/5 to [email protected] on Mon Sep 9 05:55:08 2024
    [email protected] (MitchAlsup1) writes:
    On Sun, 8 Sep 2024 15:32:02 +0000, Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    And just for fun::

    On Fri, 6 Sep 2024 13:26:42 +0000, Anton Ertl wrote:
    Here we have the three variants:

    #include <limits.h>

    extern long foo1(long);
    extern long foo2(long);

    long bar(long a, long b)
    {
    long c;
    if (__builtin_sub_overflow(b,1,&c))
    return foo1(a);
    else
    return foo2(a);
    }

    long bar2(long a, long b)
    {
    if (b < b-1)
    return foo1(a);
    else
    return foo2(a);
    }

    long bar3(long a, long b)
    {
    if (b == LONG_MIN)
    return foo1(a);
    else
    return foo2(a);
    }

    My 66000:
    bar:             bar2:            bar3:
    add r3,R1,#-1    add r3,r1,#-1    bepm r1,.L4
    bge R3,.L4       bge r3,.L4
    8-bytes          8-bytes          4-bytes

    I have a direct test for POSMAX in ISA that does not use a constant.

    How does bge work in the first and second column? My impression was
    that you are using an 88k-style flags-in-GPR architecture.

    I just copied the RISC-V code

    The RISC-V bge has two operands (plus the branch target), the bge in
    your code has only one operand. Here's the RISC-V code:

    RV64GC:
    bar:                  bar2:                 bar3:
    addi a5,a1,-1         addi a5,a1,-1         li   a5,-1
    bge  a1,a5,10 <.L4>   bge  a1,a5,28 <.L6>   slli a5,a5,0x3f
                                                bne  a1,a5,40 <.L8>

    Concerning the last column, the gcc developer who added the
    transformation of bar2() into bar3() apparently had My66000 in mind.
    ...
    BTW I had the comparisons to int-MAX/MIN in since about 2016.

    The transformation was added to gcc after gcc-10 was released in 2020,
    so my tongue-in-cheek theory is not falsified by the timing of events.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Terje Mathisen@21:1/5 to David Brown on Mon Sep 9 08:56:45 2024
    David Brown wrote:
    On 05/09/2024 19:04, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >>
    on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want
    two's complement wrapping, and for the even fewer occasions when
    you actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1)
    as an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
    fine with unsigned i, arrays always use a zero base so in Rust the
    only array index type is usize, i.e. the pointer-sized unsigned
    type, typically the same as u64 on 64-bit systems.

    Loop counters can usually be signed or unsigned, and it usually makes
    no difference. Array indices are also usually much the same signed or
    unsigned, and it can feel more natural to use size_t here (an unsigned
    type). It can make a difference to efficiency, however. On x86-64,
    this code is 3 instructions with T as "unsigned long int" or "long int",
    4 with "int", and 5 with "unsigned int".

    int foo(int * p, T x) {
        int a = p[x++];
        int b = p[x++];
        return a + b;
    }

    ;; assume p in rdi, x in rsi

    mov eax,[rdi+rsi*4]
    add eax,[rdi+rsi*4+4]
    ret



    Anyway, I count loop counters and array indices as "use of
    integer-type variables", whether you prefer signed or unsigned.


    OK



    unsigned arithmetic is easier than signed integer arithmetic,
    including comparisons that would result in a negative value, you just
    have to make the test before subtracting, instead of checking if the
    result was negative.

    I can't follow that at all. Unsigned and signed arithmetic and
    comparisons both work simply and as you'd expect. /Mixing/ signed
    and unsigned types can get things wrong.

    Oh yeah!



    I.e I cannot easily replicate a downward loop that exits when the
    counter become negative:

      for (int i = START; i >= 0; i-- ) {
        // Do something with data[i]
      }

    One of my alternatives is

      unsigned u = start; // Cannot be less than zero
      if (u) {
        u++;
        do {
          u--;
          data[u]...
        } while (u);
      }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump if Greater or Equal, signed)
    instead of JA (Jump if Above, unsigned), but my version is far more verbose.


    A more important thing is that the first version, with signed i, is
    /vastly/ simpler and clearer in the source code.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

      for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje


    Or you could just write sane code that matches what you want to say.

    :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Bernd Linsel on Mon Sep 9 09:09:13 2024
    Bernd Linsel wrote:
    On 05.09.24 19:04, Terje Mathisen wrote:
    One of my alternatives is

      unsigned u = start; // Cannot be less than zero
      if (u) {
        u++;
        do {
          u--;
          data[u]...
        } while (u);
      }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump if Greater or Equal, signed)
    instead of JA (Jump if Above, unsigned), but my version is far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

      for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje


    What about:

    for (unsigned u = start; u != ~0u; --u)

    I like that one!
       ...

    or even

    for (unsigned u = start; (int)u >= 0; --u)

    That is the one that I've actually been using, i.e. casting to the
    corresponding signed type.
       ...

    ?

    I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
    on godbolt.org:
    - 32 bit version: https://godbolt.org/z/TMhhx3nch
    - 64 bit version: https://godbolt.org/z/8oxzTf5Gf
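The loop forms above can also be checked for agreement, separately from the generated code compared on godbolt. Below is a sketch that records the indices each form visits. The function names are hypothetical. TOPBIT is defined here only because the thread leaves it open. Each loop's assumptions are noted in its comments.

```c
#include <stddef.h>

/* Hypothetical definition of Terje's TOPBIT: the highest bit of unsigned. */
#define TOPBIT (~(~0u >> 1))

/* Each function stores the indices it visits in out[] and returns how many
   it visited; all are meant to visit start, start-1, ..., 0. */

static size_t visit_signed(int start, unsigned out[])
{
    size_t k = 0;
    for (int i = start; i >= 0; i--)
        out[k++] = (unsigned)i;
    return k;
}

/* Stops when u wraps around to ~0u; assumes start != ~0u. */
static size_t visit_wrap(unsigned start, unsigned out[])
{
    size_t k = 0;
    for (unsigned u = start; u != ~0u; --u)
        out[k++] = u;
    return k;
}

/* Converting a value above INT_MAX to int is implementation-defined
   (it wraps on gcc and clang); assumes start <= INT_MAX. */
static size_t visit_cast(unsigned start, unsigned out[])
{
    size_t k = 0;
    for (unsigned u = start; (int)u >= 0; --u)
        out[k++] = u;
    return k;
}

/* Stops once the decrement sets the top bit; assumes start < TOPBIT. */
static size_t visit_topbit(unsigned start, unsigned out[])
{
    size_t k = 0;
    for (unsigned u = start; (u & TOPBIT) == 0; u--)
        out[k++] = u;
    return k;
}
```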

    Thanks!

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Tim Rentsch on Mon Sep 9 07:40:38 2024
    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there
    is an overlap of the memory areas. But then I remembered that you
    cannot write such a check in standard C without (in the general
    case) exercising undefined behaviour;

    Yes, I can.

    and then the compiler could eliminate the check or do something
    else that's unexpected. Do you have such a check in mind that
    does not exercise undefined behaviour in the general case?

    Sure. I wouldn't have made my earlier statement otherwise.

    You also stated "I'm confident the people who wrote the C standard
    would say such a program is strictly conforming." about a program
    with implementation-defined behaviour, so I lack confidence in your
    claim.

    2) Even if there is such a check, you have to be aware that there
    is a potential problem with memcpy(). In that case the way to go
    is to just use memmove().

    The point of my previous comment was only to address the question
    of whether any existing memcpy() calls are problematic. If all
    of the checks return "no overlap" then memcpy() is not the problem.

    At least for the test runs.
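Tim does not show the check he has in mind. The following is only an assumption about what such a test could look like. It uses pointer equality alone, because == is defined for pointers into different objects, while <, > and pointer subtraction are not. The names regions_overlap() and checked_memcpy() are hypothetical debug helpers, not a library API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Report whether [a, a+n) and [b, b+n) share any byte, using only
   pointer equality, which is defined even for pointers into different
   objects. Two equal-length regions overlap exactly when one start
   lies inside the other region. Cost is O(n). */
static bool regions_overlap(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    for (size_t i = 0; i < n; i++) {
        if (pa + i == pb || pb + i == pa)
            return true;
    }
    return false;
}

/* Debug wrapper: count calls whose regions overlap, then fall back
   to memmove() so the copy is still correct. */
static unsigned long overlap_count = 0;

static void *checked_memcpy(void *dst, const void *src, size_t n)
{
    if (regions_overlap(dst, src, n)) {
        overlap_count++;
        return memmove(dst, src, n);
    }
    return memcpy(dst, src, n);
}
```

As Anton notes, a zero count only covers the calls that were actually exercised in the test runs.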

    But that does not help you with the next "clever" idea that some
    compiler or library maintainer has.

    I have the impression that this is an editorial comment having
    nothing to do with memcpy() or memmove(). If that impression
    is wrong then I'm at a loss to understand what you are talking
    about, and would you please elaborate.

    There are at least 200 undefined behaviours in the C standard, and
    according to some people, C programmers should avoid all of them. So
    the possible breakage of memcpy() is just one of many problems that
    the programmers should be aware of and that they should test for.

    Just because we discussed memcpy() as one of the problems with this
    approach does not mean that having a way to deal with memcpy() solves
    the larger problem.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Tim Rentsch on Mon Sep 9 07:07:25 2024
    Tim Rentsch <[email protected]> writes:
    [email protected] (MitchAlsup1) writes:
    So:
    # define memcpy memmove

    Incidentally, if one wants to do this, it's advisable to write

    #undef memcpy

    before the #define of memcpy.

    and move forward with life--for the 2 extra cycles memmove costs
    it saves everyone long term grief.

    Is it two extra cycles? Here are some data points from <[email protected]>:

    Haswell (Core i7-4790K), glibc 2.19
    block size    1    8   32   64  128  256  512   1K   2K   4K   8K  16K
    memmove      14   14   15   15   17   30   48   85  150  281  570 1370
    memcpy       15   16   13   16   19   32   48   86  161  327  631 1420

    Skylake (Core i5-6600K), glibc 2.19
    block size    1    8   32   64  128  256  512   1K   2K   4K   8K  16K
    memmove      14   14   14   14   15   27   43   77  147  305  573 1417
    memcpy       13   14   10   12   14   27   46   85  165  313  607 1350

    Zen (Ryzen 5 1600X), glibc 2.24
    block size    1    8   32   64  128  256  512   1K   2K   4K   8K  16K
    memmove      16   16   16   17   32   43   66  107  177  328  601 1225
    memcpy       13   13   14   13   38   49   73  116  188  336  610 1233

    I don't see a consistent speedup of memcpy over memmove here.

    However, when one uses memcpy(&var,ptr,8) or the like to perform an
    unaligned access, gcc transforms this into a load (or store) without
    the redefinition of memcpy, but into much slower code with the
    redefinition (i.e., when using memmove instead of memcpy).
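Here is a minimal sketch of the memcpy() idiom for unaligned access described above. The helper names load_u64() and store_u64() are hypothetical. Whether the call is actually reduced to a single load or store depends on the compiler, flags and target. As observed above, spelling it memmove() may lose that reduction.

```c
#include <stdint.h>
#include <string.h>

/* Read a uint64_t from a possibly unaligned address. Compilers such
   as gcc on x86-64 typically turn this into one 8-byte load. */
static uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a uint64_t to a possibly unaligned address. */
static void store_u64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}
```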

    Simply replacing memcpy() by memmove() of course will always
    work, but there might be negative consequences beyond a cost
    of 2 extra cycles -- for example, if a negative stride is
    better performing than a positive stride, but the nature
    of the compaction forces memmove() to always take the slower
    choice.

    If the two memory blocks don't overlap, memmove() can use the fastest
    stride. If the two memory blocks overlap, memcpy() as implemented in
    glibc is a bad idea.

    The way to go for memmove() is:

    On hardware where positive stride is faster:

        if ((uintptr_t)(dest - src) >= len)
            return memcpy_posstride(dest, src, len);
        else
            return memcpy_negstride(dest, src, len);

    On hardware where the negative stride is faster:

        if ((uintptr_t)(src - dest) >= len)
            return memcpy_negstride(dest, src, len);
        else
            return memcpy_posstride(dest, src, len);

    And I expect that my test is undefined behaviour, but most people
    except the UB advocates should understand what I mean.

    The benefit of this comparison over just comparing the addresses is
    that the branch will have a much lower miss rate.
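The same test can be written without the pointer subtraction by subtracting uintptr_t values instead. That moves it from undefined to implementation-defined behaviour. It assumes a flat address space where converting pointers to integers preserves ordering. In this sketch, byte-wise loops are hypothetical stand-ins for memcpy_posstride()/memcpy_negstride(). A real library would use wide, unrolled copies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a positive-stride copy. */
static void copy_forward(unsigned char *d, const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Hypothetical stand-in for a negative-stride copy. */
static void copy_backward(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n > 0) {
        n--;
        d[n] = s[n];
    }
}

/* memmove()-like copy for hardware where positive stride is preferred.
   If dest is below src, the unsigned difference wraps to a huge value,
   so the forward copy is chosen. The backward copy is chosen exactly
   when src <= dest < src + n (dest == src is harmless either way). */
static void *move_prefer_forward(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    if ((uintptr_t)d - (uintptr_t)s >= n)
        copy_forward(d, s, n);
    else
        copy_backward(d, s, n);
    return dest;
}
```

In the common non-overlapping case the branch goes the same way no matter which buffer sits higher in memory. That is the lower miss rate the post refers to.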

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Terje Mathisen@21:1/5 to Tim Rentsch on Mon Sep 9 10:20:00 2024
    Tim Rentsch wrote:
    [email protected] (MitchAlsup1) writes:

    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:

    Terje Mathisen <[email protected]> writes:

    Michael S wrote:

    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:

    3 years ago Terje Mathisen wrote that many years ago he read
    that behaviour of memcpy() with overlappped src/dst was defined. >>>>>>>> https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ >>>>>>>> Mitch Alsup answered "That was true in 1983". So, two people of >>>>>>>> different age living in different parts of the world are telling >>>>>>>> the same story. May be, there exist old popular book that said >>>>>>>> that it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime
    doc?

    Specifying that it would always copy from beginning to end of
    the source buffer, in increasing address order meant that it
    was guaranteed safe when used to compact buffers.

    What are "compact buffers"?

    Assume a buffer consisting of records of some type, some of
    them marked as deleted. Iterating over them while removing
    the gaps means that you are always copying to a destination
    lower in memory, right?
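To make the compaction case concrete, here is a sketch that assumes a hypothetical record type with a deleted flag. Each run of live records is slid down over the gaps. Source and destination can overlap, which is exactly the case memcpy() does not cover, so the sketch calls memmove().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout, for illustration only. */
struct record {
    int  key;
    bool deleted;
};

/* Remove deleted records in place and return the new count.
   The write position never exceeds the read position, so every
   copy moves data to a lower (or equal) address. */
static size_t compact_records(struct record *r, size_t n)
{
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        if (r[i].deleted) {
            i++;
            continue;
        }
        size_t run = i;
        while (run < n && !r[run].deleted)
            run++;                         /* end of this live run */
        if (out != i)
            memmove(&r[out], &r[i], (run - i) * sizeof r[0]);
        out += run - i;
        i = run;
    }
    return out;
}
```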

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(),
    which will choose to run forwards or backwards as needed. So you
    might as well just call memmove() any time you are not sure
    memcpy() is safe and appropriate.

    The ever-shallow David Brown first misses the point, then makes a
    slightly incorrect statement, and finally makes a recommendation
    that surely is familiar to every reader in the newsgroup.

    Memmove() is always appropriate unless you are doing something
    nefarious.

    So:
    # define memcpy memmove

    Incidentally, if one wants to do this, it's advisable to write

    #undef memcpy

    before the #define of memcpy.

    What really worries me is that I've been told (and shown in godbolt)
    that memcpy() can be magic, i.e. the compiler is allowed to make it NOP
    when I use it to move data between an integer and float variable:

    float invsqrt(float x)
    {
    ...
    int32_t ix = *(int32_t *) &x;

    is deprecated, instead do something like this:

    int32_t ix;
    memcpy(&ix, &x, sizeof(ix));

    and the compiler will see that x and ix can share the same register.
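Terje's function body is elided in the post. Purely as an illustration, this is the well-known Quake III style approximation with the float/int reinterpretation done through memcpy(). The magic constant and the single Newton step come from that widely published version, not from Terje's code. IEEE-754 binary32 floats and x > 0 are assumed.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0 using the bit-pattern trick. */
static float invsqrt(float x)
{
    float half = 0.5f * x;
    int32_t ix;
    memcpy(&ix, &x, sizeof ix);        /* float bits -> integer */
    ix = 0x5f3759df - (ix >> 1);       /* initial estimate from the bits */
    memcpy(&x, &ix, sizeof x);         /* integer bits -> float */
    return x * (1.5f - half * x * x);  /* one Newton-Raphson step */
}
```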

    I don't suppose memmove() can be depended upon to do the same?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Sep 9 09:06:43 2024
    Terje Mathisen <[email protected]> writes:
    float invsqrt(float x)
    [...]
    int32_t ix;
    memcpy(&ix, &x, sizeof(ix));

    and the compiler will see that x and ix can share the same register.

    I don't suppose memmove() can be depended upon to do the same?

    There is nothing that prevents the compiler from doing it, or forcing
    the compiler to do it with memcpy(). So a compiler could call the
    function memcpy() for the code above, and optimize it as you prefer
    with memmove(). What actual compilers do is something you can try
    out. My experience is that memcpy() is given more love by compiler
    maintainers than memmove(). It's as if, despite all the rhetoric that
    C programmers should "sanitize" programs to get rid of undefined
    behaviours in our programs, they actually prefer that we use functions
    with less defined behaviour like memcpy() instead of functions with
    more defined behaviour like memmove().

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to All on Mon Sep 9 12:26:57 2024
    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

  • From Michael S@21:1/5 to Terje Mathisen on Mon Sep 9 12:22:19 2024
    On Mon, 9 Sep 2024 10:20:00 +0200
    Terje Mathisen <[email protected]> wrote:

    Tim Rentsch wrote:
    [email protected] (MitchAlsup1) writes:

    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:

    Terje Mathisen <[email protected]> writes:

    Michael S wrote:

    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:

    3 years ago Terje Mathisen wrote that many years ago he read
    that behaviour of memcpy() with overlapped src/dst was
    defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983". So, two
    people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old
    popular book that said that it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime
    doc?

    Specifying that it would always copy from beginning to end of
    the source buffer, in increasing address order meant that it
    was guaranteed safe when used to compact buffers.

    What are "compact buffers"?

    Assume a buffer consisting of records of some type, some of
    them marked as deleted. Iterating over them while removing
    the gaps means that you are always copying to a destination
    lower in memory, right?

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(),
    which will choose to run forwards or backwards as needed. So you
    might as well just call memmove() any time you are not sure
    memcpy() is safe and appropriate.

    The ever-shallow David Brown first misses the point, then makes a
    slightly incorrect statement, and finally makes a recommendation
    that surely is familiar to every reader in the newsgroup.

    Memmove() is always appropriate unless you are doing something
    nefarious.

    So:
    # define memcpy memmove

    Incidentally, if one wants to do this, it's advisable to write

    #undef memcpy

    before the #define of memcpy.

    What really worries me is that I've been told (and shown in godbolt)
    that memcpy() can be magic, i.e. the compiler is allowed to make it
    NOP when I use it to move data between an integer and float variable:

    float invsqrt(float x)
    {
    ...
    int32_t ix = *(int32_t *) &x;

    is deprecated, instead do something like this:

    int32_t ix;
    memcpy(&ix, &x, sizeof(ix));

    and the compiler will see that x and ix can share the same register.

    I don't suppose memmove() can be depended upon to do the same?

    Terje


    In simple situations like shown above, memmove is as dependable as
    memcpy.

    I don't know if it is always true in more complex cases, where the absence
    of aliasing is less obvious to the compiler. However, I'd expect that as
    long as a copied item fits in register, the magic will work equally
    with both memcpy and memmove.

    It depends on compiler, too.
    MSVC from VS2019 produces the same code for both variants d_to_u below.
    But MSVC from VS2017 does not.

    #include <stdint.h>
    #include <string.h>

    void d_to_u_cpy(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
    }

    #define memcpy memmove

    void d_to_u_move(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
    }

  • From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 10:30:34 2024
    Michael S <[email protected]> writes:
    On Mon, 9 Sep 2024 10:20:00 +0200
    Terje Mathisen <[email protected]> wrote:
    float invsqrt(float x)
    {
    ...
    int32_t ix = *(int32_t *) &x;
    [...]
    int32_t ix;
    memcpy(&ix, &x, sizeof(ix));
    ...
    I don't know if it is always true in more complex cases, where the absence
    of aliasing is less obvious to the compiler.

    Something like

    memmove(p, q, 8)

    can be translated to something like

    0: 48 8b 06 mov (%rsi),%rax
    3: 48 89 07 mov %rax,(%rdi)

    without any aliasing worries, and indeed, gcc-9, gcc-10, and gcc-12
    do that.

    However, I'd expect that as
    long as a copied item fits in register, the magic will work equally
    with both memcpy and memmove.

    One would hope so, but here's what happens with gcc-12:

    #include <string.h>

    void foo1(char *p, char* q)
    {
    memcpy(p,q,32);
    }

    void foo2(char *p, char* q)
    {
    memmove(p,q,32);
    }

    gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

    0000000000000000 <foo1>:
    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret
    13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
    1a: 00 00 00 00
    1e: 66 90 xchg %ax,%ax

    0000000000000020 <foo2>:
    20: ba 20 00 00 00 mov $0x20,%edx
    25: e9 00 00 00 00 jmp 2a <foo2+0xa>

    The jmp in line 25 is probably a tail-call to memmove().

    My guess is that xmm registers and unrolling are used here rather than
    ymm registers because waking up the second 128 bits takes time. But
    even with that, the code uses two different registers, and if
    scheduled differently, could be used for implementing foo2():

    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret
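Anton's reordered sequence is safe for overlapping buffers because both halves are read before either is written. The same idea in portable C looks like the sketch below. The name move32_loads_first() is hypothetical, and nothing guarantees that gcc compiles it to exactly these four vmovdqu instructions.

```c
#include <string.h>

/* Move exactly 32 bytes: load both 16-byte halves into temporaries
   before storing either. Nothing is written until everything has
   been read, so the result matches memmove() even when the regions
   overlap. */
static void move32_loads_first(void *dst, const void *src)
{
    unsigned char lo[16], hi[16];
    memcpy(lo, src, 16);
    memcpy(hi, (const unsigned char *)src + 16, 16);
    memcpy(dst, lo, 16);
    memcpy((unsigned char *)dst + 16, hi, 16);
}
```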

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Terje Mathisen on Mon Sep 9 13:03:19 2024
    On 09/09/2024 08:56, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 19:04, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >>
    on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want
    two's complement wrapping, and for the even fewer occasions when
    you actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1)
    as an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
    fine with unsigned i, arrays always use a zero base so in Rust the
    only array index type is usize, i.e the largest supported unsigned
    type in the system, typically the same as u64.

    Loop counters can usually be signed or unsigned, and it usually makes
    no difference.  Array indices are also usually much the same signed or
    unsigned, and it can feel more natural to use size_t here (an unsigned
    type).  It can make a difference to efficiency, however.  On x86-64,
    this code is 3 instructions with T as "unsigned long int" or "long
    int", 4 with "int", and 5 with "unsigned int".

    int foo(int * p, T x) {
         int a = p[x++];
         int b = p[x++];
         return a + b;
    }

    ;;  assume p in rdi, x in rsi

      mov eax,[rdi+rsi*4]
      add eax,[rdi+rsi*4+4]
      ret

    Yes - that's three instructions for 64-bit type T. (To be clear, I had
    counted the "ret" here.)

    With 32-bit int for T, you need a "movsx rsi, esi" first to sign-extend
    the 32-bit int parameter "x" to 64 bits. (That could be different for different ABIs.) With 32-bit unsigned int for T you need an additional instruction to make sure the result of the first "x++" is wrapped as
    32-bit unsigned.
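Here is a small check of the wrap-around that forces the extra instruction. The helper names are made up. With a uint32_t index, x++ from 0xFFFFFFFF must give 0, so the increment cannot simply be folded into a 64-bit address offset. A 64-bit index does not wrap at any practical value. For signed int, overflow is undefined, which is why the compiler may sign-extend once and fold the +1. That case is deliberately not exercised here.

```c
#include <stdint.h>

/* Index used by the second p[x++] when x starts at the given value. */
static uint64_t second_index_u32(uint32_t x)
{
    x++;            /* wraps modulo 2^32 */
    return x;
}

static uint64_t second_index_u64(uint64_t x)
{
    x++;            /* no wrap for any realistic array index */
    return x;
}
```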


    Or you could just write sane code that matches what you want to say.

    :-)


    Of course the fine line between "smart code" and "smart-arse code" is
    somewhat subjective!

    It also varies over time, and depends on the needs of the code.
    Sometimes it makes sense to prioritise efficiency over readability - but
    that is rare, and has been getting steadily rarer over the decades as processors have been getting faster (disproportionately so for
    inefficient code) and compilers have been getting better.

    Often you get the most efficient results by writing code clearly and
    simply so that the compiler can understand it better and generate good object
    code. This is particularly true if you want the same source to be used
    on different targets or different variants of a target - few people can
    track the instruction scheduling and timings on multiple processors
    better than a good compiler. (And the few people who /can/ do that
    spend their time chatting in comp.arch instead of writing code...) When
    you do hand-made micro-optimisations, these can work against the
    compiler and give poorer results overall. This is especially the case
    when code is moved around with inlining, constant propagation,
    unrolling, link-time optimisation, etc.

    Long ago, it was a different matter - then compilers needed more help to
    get good results. And compilers are far from perfect - there are still
    times when "smart" code or assembly-like C is needed (such as when
    taking advantage of some vector and SIMD facilities).

  • From David Brown@21:1/5 to All on Mon Sep 9 13:19:49 2024
    On 08/09/2024 20:32, MitchAlsup1 wrote:
    On Sun, 8 Sep 2024 6:25:10 +0000, David Brown wrote:

    On 08/09/2024 02:17, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 7:15:11 +0000, David Brown wrote:

    static uint64_t array[1024*1024*512+1]
    static int      SIZE = sizeof(array)/sizeof(uint65_t);

    Surely you mean :

    static const size_t array_size = sizeof(array) / sizeof(uint64_t);


    I wanted SIZE to have the same type as i.

    Okay, I suppose - though I would rather have it being an appropriate
    type and, if necessary, change the type of "i". But I still don't get
    your point - what has this "SIZE" of 0x20000001 got to do with a "START"
    that you want to equal 0x80000001 ? Were you just trying to show that
    it is possible to make the number 0x80000001 in code, and got the
    numbers wrong? If you know that you might have numbers exceeding 32-bit ranges, then you need to use a 64-bit type as the index variable - and
    it can still happily be signed rather than writing more complicated code
    just to force it into an obsessive rule about using unsigned types.

  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Sep 9 04:32:17 2024
    [email protected] (Anton Ertl) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    [...]

    1) A strictly conforming program shall use only those features
    of the language and library specified in this International
    Standard. This excludes all programs that terminate,
    including the "Hello, World" program. [...]

    I don't know why you say this. Which aspects of the definition
    for "strictly conforming program" do you think are violated by a
    typical 'Hello, World' program?

    A typical "Hello, World" program terminates, and as mentioned,
    no terminating program can be strictly conforming, because it
    exercises at least implementation-defined behaviour (e.g., look
    at section 7.22.4.4 of C11).

    I'm familiar with the exit() function and how the C standard
    defines it. You should re-read the definition of strictly
    conforming program, which says in part

    It shall not produce output dependent on any unspecified,
    undefined, or implementation-defined behavior

    It is not any use of implementation-defined behavior that is off
    limits, only those uses that produce output dependent on such
    behavior. The return status of a program is not an output.

  • From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 14:58:54 2024
    On Mon, 09 Sep 2024 10:30:34 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Mon, 9 Sep 2024 10:20:00 +0200
    Terje Mathisen <[email protected]> wrote:
    float invsqrt(float x)
    {
    ...
    int32_t ix = *(int32_t *) &x;
    [...]
    int32_t ix;
    memcpy(&ix, &x, sizeof(ix));
    ...
    I don't know if it is always true in more complex cases, where the
    absence of aliasing is less obvious to the compiler.

    Something like

    memmove(p, q, 8)

    can be translated to something like

    0: 48 8b 06 mov (%rsi),%rax
    3: 48 89 07 mov %rax,(%rdi)

    without any aliasing worries, and indeed, gcc-9, gcc-10, and gcc-12
    do that.

    However, I'd expect that as
    long as a copied item fits in register, the magic will work equally
    with both memcpy and memmove.

    One would hope so, but here's what happens with gcc-12:

    #include <string.h>

    void foo1(char *p, char* q)
    {
    memcpy(p,q,32);
    }

    void foo2(char *p, char* q)
    {
    memmove(p,q,32);
    }

    gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

    0000000000000000 <foo1>:
    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret
    13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
    1a: 00 00 00 00
    1e: 66 90 xchg %ax,%ax

    0000000000000020 <foo2>:
    20: ba 20 00 00 00 mov $0x20,%edx
    25: e9 00 00 00 00 jmp 2a <foo2+0xa>

    The jmp in line 25 is probably a tail-call to memmove().

    My guess is that xmm registers and unrolling are used here rather than
    ymm registers because waking up the second 128 bits takes time. But
    even with that, the code uses two different registers, and if
    scheduled differently, could be used for implementing foo2():

    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret

    - anton

    Try -march instead of -mavx2. E.g. -march=haswell
    Sometimes gcc is beyond logic.

  • From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 11:11:04 2024
    Michael S <[email protected]> writes:
    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

    At least that was claimed as the rationale for implementing a memcpy
    with negative stride in glibc in 2010. Of course, we have every
    reason to be skeptical, given that bullshit about undisclosed
    performance advantages of their misdeeds is common in those circles.

    And when somebody made the mistake of actually being a bit more
    concrete with their claims, and I actually checked it <http://www.complang.tuwien.ac.at/anton/autovectors/>, it turned out
    that the claimed-better version had essentially the same performance
    as the more benign version.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Michael S on Mon Sep 9 13:39:40 2024
    On 09/09/2024 11:22, Michael S wrote:
    On Mon, 9 Sep 2024 10:20:00 +0200
    Terje Mathisen <[email protected]> wrote:

    Tim Rentsch wrote:
    [email protected] (MitchAlsup1) writes:

    On Wed, 4 Sep 2024 17:53:13 +0000, David Brown wrote:

    On 04/09/2024 18:07, Tim Rentsch wrote:

    Terje Mathisen <[email protected]> writes:

    Michael S wrote:

    On Tue, 3 Sep 2024 17:41:40 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:

    3 years ago Terje Mathisen wrote that many years ago he read
    that behaviour of memcpy() with overlapped src/dst was
    defined.
    https://groups.google.com/g/comp.arch/c/rSk8c7Urd_Y/m/ZWEG5V1KAQAJ
    Mitch Alsup answered "That was true in 1983". So, two
    people of different age living in different parts of the
    world are telling the same story. Maybe there exists an old
    popular book that said that it was defined?

    It probably wasn't written in the official C standard, which I
    couldn't have afforded to buy/read, but in a compiler runtime
    doc?

    Specifying that it would always copy from beginning to end of
    the source buffer, in increasing address order meant that it
    was guaranteed safe when used to compact buffers.

    What are "compact buffers"?

    Assume a buffer consisting of records of some type, some of
    them marked as deleted. Iterating over them while removing
    the gaps means that you are always copying to a destination
    lower in memory, right?

    If all the records are in one large array, there is a simple
    test to see if memcpy() must work or whether some alternative
    should be used instead.

    Such tests are usually built into implementations of memmove(),
    which will choose to run forwards or backwards as needed. So you
    might as well just call memmove() any time you are not sure
    memcpy() is safe and appropriate.

    The ever-shallow David Brown first misses the point, then makes a
    slightly incorrect statement, and finally makes a recommendation
    that surely is familiar to every reader in the newsgroup.

    Memmove() is always appropriate unless you are doing something
    nefarious.

    So:
    # define memcpy memmove

    Incidentally, if one wants to do this, it's advisable to write

    #undef memcpy

    before the #define of memcpy.

    What really worries me is that I've been told (and shown in godbolt)
    that memcpy() can be magic, i.e. the compiler is allowed to make it
    NOP when I use it to move data between an integer and float variable:

    float invsqrt(float x)
    {
    ...
    int32_t ix = *(int32_t *) &x;

    is deprecated, instead do something like this:

    int32_t ix;
    memcpy(&ix, &x, sizeof(ix));

    and the compiler will see that x and ix can share the same register.

    I don't suppose memmove() can be depended upon to do the same?

    Terje


    In simple situations like shown above, memmove is as dependable as
    memcpy.

    I don't know if it is always true in more complex cases, where the absence
    of aliasing is less obvious to the compiler. However, I'd expect that as
    long as a copied item fits in register, the magic will work equally
    with both memcpy and memmove.


    That's my experience too, but as you say, it is compiler (and flag)
    dependent.

    In most such cases, there's no overlap so memcpy() is the common choice.
    (Even if the same register is used as a result of optimisation,
    logically the variables are independent.)

    You could, I suppose, be trying to use memcpy() or memmove() on members
    of a union in C++ (where type-punning using unions is UB, unlike in C).
    Then you would have to use memmove() to be correct. (gcc can warn about aliases and overlaps for the "restrict" parameters of memcpy() in simple cases.)
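For the C side of that remark, here is a sketch of union-based type punning. The name float_bits() is hypothetical. In C, reading a different member than the one last stored reinterprets the bytes. In C++ the same code is undefined behaviour, and memcpy() or C++20's std::bit_cast is the usual alternative. IEEE-754 binary32 floats are assumed.

```c
#include <stdint.h>

/* Return the raw bit pattern of a float via a union (valid in C). */
static uint32_t float_bits(float f)
{
    union {
        float    f;
        uint32_t u;
    } pun;
    pun.f = f;
    return pun.u;
}
```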

    It depends on compiler, too.
    MSVC from VS2019 produces the same code for both variants d_to_u below.
    But MSVC from VS2017 does not.

    #include <stdint.h>
    #include <string.h>

    void d_to_u_cpy(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
    }

    #define memcpy memmove

    void d_to_u_move(uint64_t* u, const double* d) {
    memcpy(u, d, sizeof(*u));
    }

  • From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 12:28:13 2024
    Michael S <[email protected]> writes:
    On Mon, 09 Sep 2024 10:30:34 GMT
    [email protected] (Anton Ertl) wrote:
    One would hope so, but here's what happens with gcc-12:

    #include <string.h>

    void foo1(char *p, char* q)
    {
    memcpy(p,q,32);
    }

    void foo2(char *p, char* q)
    {
    memmove(p,q,32);
    }

    gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

    0000000000000000 <foo1>:
    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret
    13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
    1a: 00 00 00 00
    1e: 66 90 xchg %ax,%ax

    0000000000000020 <foo2>:
    20: ba 20 00 00 00 mov $0x20,%edx
    25: e9 00 00 00 00 jmp 2a <foo2+0xa>

    The jmp in line 25 is probably a tail-call to memmove().

    My guess is that xmm registers and unrolling are used here rather than
    ymm registers because waking up the second 128 bits takes time. But
    even with that, the code uses two different registers, and if
    scheduled differently, could be used for implementing foo2():

    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret

    - anton

    Try -march instead of -mavx2. E.g. -march=haswell
    Sometimes gcc is beyond logic.

    For gcc -O3 -march=haswell I got the same result (with gcc-12). I
    also tried -march=x86-64-v3 with the same result.

    But gcc -O3 -march=x86-64-v4 produced:

    0000000000000000 <foo1>:
    0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
    4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
    8: c5 f8 77 vzeroupper
    b: c3 ret
    c: 0f 1f 40 00 nopl 0x0(%rax)

    0000000000000010 <foo2>:
    10: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
    14: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
    18: c5 f8 77 vzeroupper
    1b: c3 ret

    And when changing the length to 64:

    0000000000000000 <foo1>:
    0: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
    6: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
    c: c5 f8 77 vzeroupper
    f: c3 ret

    0000000000000010 <foo2>:
    10: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
    16: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
    1c: c5 f8 77 vzeroupper
    1f: c3 ret

    But when changing the length to 63:

    0000000000000000 <foo1>:
    0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
    4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
    8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
    d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
    12: c5 f8 77 vzeroupper
    15: c3 ret
    16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
    1d: 00 00 00

    0000000000000020 <foo2>:
    20: ba 3f 00 00 00 mov $0x3f,%edx
    25: e9 00 00 00 00 jmp 2a <foo2+0xa>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 16:08:47 2024
    On Mon, 09 Sep 2024 12:28:13 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Mon, 09 Sep 2024 10:30:34 GMT
    [email protected] (Anton Ertl) wrote:
    One would hope so, but here's what happens with gcc-12:

    #include <string.h>

    void foo1(char *p, char* q)
    {
    memcpy(p,q,32);
    }

    void foo2(char *p, char* q)
    {
    memmove(p,q,32);
    }

    gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:

    0000000000000000 <foo1>:
    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret
    13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
    1a: 00 00 00 00
    1e: 66 90 xchg %ax,%ax

    0000000000000020 <foo2>:
    20: ba 20 00 00 00 mov $0x20,%edx
    25: e9 00 00 00 00 jmp 2a <foo2+0xa>

    The jmp at offset 0x25 is probably a tail call to memmove().

    My guess is that xmm registers and unrolling are used here rather
    than ymm registers because waking up the second 128 bits takes
    time. But even with that, the code uses two different registers,
    and if scheduled differently, could be used for implementing
    foo2():

    0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
    8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
    4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
    d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
    12: c3 ret

    - anton

    Try -march instead of -mavx2. E.g. -march=haswell
    Sometimes gcc is beyond logic.

    For gcc -O3 -march=haswell I got the same result (with gcc-12). I
    also tried -march=x86-64-v3 with the same result.

    But gcc -O3 -march=x86-64-v4 produced:


    My gcc was 14.1, with -O2. It produced the same code as yours below (for
    the case of 32) with -march=haswell

    0000000000000000 <foo1>:
    0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
    4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
    8: c5 f8 77 vzeroupper
    b: c3 ret
    c: 0f 1f 40 00 nopl 0x0(%rax)

    0000000000000010 <foo2>:
    10: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
    14: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
    18: c5 f8 77 vzeroupper
    1b: c3 ret

    And when changing the length to 64:

    0000000000000000 <foo1>:
    0: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
    6: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
    c: c5 f8 77 vzeroupper
    f: c3 ret

    0000000000000010 <foo2>:
    10: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
    16: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
    1c: c5 f8 77 vzeroupper
    1f: c3 ret


    And here I got different code for -march=tigerlake and
    -march=znver4 despite both having approximately the same ISA.
    It seems that for Tiger Lake gcc is over-concerned about the impact of
    unaligned 64-byte accesses.

    But when changing the length to 63:

    0000000000000000 <foo1>:
    0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
    4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
    8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
    d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
    12: c5 f8 77 vzeroupper
    15: c3 ret
    16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
    1d: 00 00 00

    0000000000000020 <foo2>:
    20: ba 3f 00 00 00 mov $0x3f,%edx
    25: e9 00 00 00 00 jmp 2a <foo2+0xa>

    - anton

  • From Michael S@21:1/5 to Terje Mathisen on Mon Sep 9 15:21:27 2024
    On Mon, 9 Sep 2024 08:56:45 +0200
    Terje Mathisen <[email protected]> wrote:

    David Brown wrote:
    On 05/09/2024 19:04, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external
    data, for anything involving bit manipulation (use of &, |, ^,
    << and >> on signed types is usually wrong, IMHO), as building
    blocks in extended arithmetic types, for the few occasions when
    you want two's complement wrapping, and for the even fewer
    occasions when you actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have
    for integer-type variables, with (like Mitch) a few apis that
    use (-1) as an error signal that has to be handled with special
    code.

    You don't have loop counters, array indices, or integer
    arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
    fine with unsigned i, arrays always use a zero base so in Rust the
    only array index type is usize, i.e. the largest supported unsigned
    type in the system, typically the same as u64.

    Loop counters can usually be signed or unsigned, and it usually
    makes no difference.  Array indices are also usually much the same
    signed or unsigned, and it can feel more natural to use size_t here
    (an unsigned type).  It can make a difference to efficiency,
    however.  On x86-64, this code is 3 instructions with T as
    "unsigned long int" or "long int", 4 with "int", and 5 with
    "unsigned int".

    int foo(int * p, T x) {
        int a = p[x++];
        int b = p[x++];
        return a + b;
    }

    ;; assume *p in rdi, x in rsi

    mov rax,[rdi+rsi]
    add rax,[rdi+rsi+8]
    ret


    more like
    mov rax,[rdi+rsi*4]
    add rax,[rdi+rsi*4+8]
    ret

    But that's not the point (==trap).
    The point (==trap), I'd guess, is that for T=uint32_t the code generator
    has to account for the possibility of x==2**32-1.

  • From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Sep 9 06:21:12 2024
    Terje Mathisen <[email protected]> writes:

    Bernd Linsel wrote:

    On 05.09.24 19:04, Terje Mathisen wrote:

    One of my alternatives is

    unsigned u = start; // Cannot be less than zero
    if (u) {
        u++;
        do {
            u--;
            data[u]...
        } while (u);
    }

    This typically results in effectively the same asm code as the
    signed version, except for a bottom JGE (Jump if Greater or Equal,
    signed) instead of JA (Jump if Above, unsigned), but my version is
    far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type,
    then you can subtract and check if the top bit is set in the
    result:

    for (unsigned u = start; (u & TOPBIT) == 0; u--)

    What about:

    for (unsigned u = start; u != ~0u; --u)

    I like that one!

    ...

    or even

    for (unsigned u = start; (int)u >= 0; --u)

    That is the one that I've actually been using, i.e. casting to the corresponding signed type.

    I don't like either of these because they need a redundant
    specification of the index variable's type (and similarly the
    definition of TOPBIT depends on knowing that type). Needing to
    redundantly know the type is dangerous because the two type
    specifications might get out of sync. Instead, either

    for (unsigned u = start; u != -1; --u)

    or

    for (unsigned u = start; u+1 != 0; --u)

    avoids the danger of having types be out of sync (and also can be
    used with signed types, not that I would advocate doing that).
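A minimal C sketch of the two type-agnostic countdown conditions above, assuming a plain unsigned counter and a start value below UINT_MAX (with start == UINT_MAX both loops would exit immediately). The function names are hypothetical; each loop records the indices it visits so the traversal order can be checked.

```c
#include <stddef.h>

/* Count down from start to 0 inclusive with an unsigned index, storing
   each visited index in out[] and returning how many were visited.
   Terminates via u != -1: the -1 converts to UINT_MAX, which is the
   value --u produces when it steps below zero. */
static size_t count_down_ne_minus1(unsigned start, unsigned *out)
{
    size_t n = 0;
    for (unsigned u = start; u != -1; --u)
        out[n++] = u;
    return n;
}

/* Same traversal with the u+1 != 0 form: u + 1 wraps to 0 exactly
   when u == UINT_MAX. (A plain u >= 0 test would never end, since an
   unsigned value is always >= 0.) */
static size_t count_down_plus1(unsigned start, unsigned *out)
{
    size_t n = 0;
    for (unsigned u = start; u + 1 != 0; --u)
        out[n++] = u;
    return n;
}
```

Neither condition mentions the counter's type, which is the property Tim is after: changing `unsigned` to `unsigned long` in the declaration leaves the loop test correct.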

  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Sep 9 06:24:35 2024
    [email protected] (Anton Ertl) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (MitchAlsup1) writes:

    So:
    # define memcpy memmove

    Incidentally, if one wants to do this, it's advisable to write

    #undef memcpy

    before the #define of memcpy.

    and move forward with life--for the 2 extra cycles memmove costs
    it saves everyone long term grief.

    Is it two extra cycles? Here are some data points from <[email protected]>:

    Haswell (Core i7-4790K), glibc 2.19
    block size    1    8   32   64  128  256  512   1K   2K   4K   8K  16K
    memmove      14   14   15   15   17   30   48   85  150  281  570 1370
    memcpy       15   16   13   16   19   32   48   86  161  327  631 1420

    Skylake (Core i5-6600K), glibc 2.19
    block size    1    8   32   64  128  256  512   1K   2K   4K   8K  16K
    memmove      14   14   14   14   15   27   43   77  147  305  573 1417
    memcpy       13   14   10   12   14   27   46   85  165  313  607 1350

    Zen (Ryzen 5 1600X), glibc 2.24
    block size    1    8   32   64  128  256  512   1K   2K   4K   8K  16K
    memmove      16   16   16   17   32   43   66  107  177  328  601 1225
    memcpy       13   13   14   13   38   49   73  116  188  336  610 1233

    I don't see a consistent speedup of memcpy over memmove here.

    However, when one uses memcpy(&var,ptr,8) or the like to perform an
    unaligned access, gcc transforms this into a load (or store) without
    the redefinition of memcpy, but into much slower code with the
    redefinition (i.e., when using memmove instead of memcpy).
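For reference, the unaligned-access idiom being discussed looks roughly like this. It is a sketch with hypothetical helper names; whether gcc actually reduces each call to a single mov depends on the compiler version, flags and target, and on memcpy not having been redefined.

```c
#include <stdint.h>
#include <string.h>

/* Unaligned 64-bit load expressed with memcpy(). The fixed size lets
   an optimizing compiler typically replace the call with one load. */
static uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Matching unaligned 64-bit store; p need not be 8-byte aligned. */
static void store_u64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}
```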

    Simply replacing memcpy() by memmove() of course will always
    work, but there might be negative consequences beyond a cost
    of 2 extra cycles -- for example, if a negative stride is
    better performing than a positive stride, but the nature
    of the compaction forces memmove() to always take the slower
    choice.

    If the two memory blocks don't overlap, memmove() can use the
    fastest stride.

    It /could/ use the fastest stride. Whether it /does/ use the
    fastest stride is a different question (and one that may have
    different answers on different platforms).

  • From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 16:30:50 2024
    On Mon, 09 Sep 2024 12:28:13 GMT
    [email protected] (Anton Ertl) wrote:


    But when changing the length to 63:

    0000000000000000 <foo1>:
    0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
    4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
    8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
    d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
    12: c5 f8 77 vzeroupper
    15: c3 ret
    16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
    1d: 00 00 00

    0000000000000020 <foo2>:
    20: ba 3f 00 00 00 mov $0x3f,%edx
    25: e9 00 00 00 00 jmp 2a <foo2+0xa>

    - anton

    An interesting question is which code I want in this case.
    In absence of -march options and with -O1|2|3 I want something like
    that:

    foo2:
    movups (%rsi), %xmm0
    movups 16(%rsi), %xmm1
    movups 32(%rsi), %xmm2
    movups 47(%rsi), %xmm3
    movups %xmm0, (%rdi)
    movups %xmm1, 16(%rdi)
    movups %xmm2, 32(%rdi)
    movups %xmm3, 47(%rdi)
    ret

    Without deep thinking I don't see why I would want anything
    different for foo1().
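The same four-chunk idea can be written portably in C as a sketch (the name move63 is hypothetical). Doing all four 16-byte loads before any store is what makes it safe for overlapping buffers, i.e. usable for foo2()'s memmove(); for foo1()'s memcpy() the loads and stores could be interleaved instead. Whether a given gcc version turns this into the movups sequence above is not guaranteed.

```c
#include <string.h>

/* Copy 63 bytes with four 16-byte chunks at offsets 0, 16, 32 and 47;
   the last two chunks overlap by one byte, so no scalar tail is needed.
   All loads happen before any store, giving memmove() semantics. */
static void move63(unsigned char *dst, const unsigned char *src)
{
    unsigned char a[16], b[16], c[16], d[16];

    memcpy(a, src,      16);
    memcpy(b, src + 16, 16);
    memcpy(c, src + 32, 16);
    memcpy(d, src + 47, 16);   /* covers bytes 47..62 */

    memcpy(dst,      a, 16);
    memcpy(dst + 16, b, 16);
    memcpy(dst + 32, c, 16);
    memcpy(dst + 47, d, 16);
}
```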

  • From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Sep 9 06:41:15 2024
    Terje Mathisen <[email protected]> writes:

    Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    [...]

    Memmove() is always appropriate unless you are doing something
    nefarious.

    So:
    # define memcpy memmove

    Incidentally, if one wants to do this, it's advisable to write

    #undef memcpy

    before the #define of memcpy.

    What really worries me is that I've been told (and shown in
    godbolt) that memcpy() can be magic, i.e. the compiler is allowed
    to turn it into a no-op when I use it to move data between an integer
    and a float variable:

    float invsqrt(float x)
    {
    ...
    int32_t ix = *(int32_t *) &x;

    is deprecated, instead do something like this:

    int32_t ix;
    memcpy(&ix, &x, sizeof(ix));

    and the compiler will see that x and ix can share the same
    register.

    I don't suppose memmove() can be depended upon to do the same?

    In such cases I almost always use unions rather than memcpy()
    or memmove():

    float
    invsqrt(float x){
    int32_t ix = (union {float f; int32_t i32;}){ x } .i32;
    // ...
    }

    No need for addresses, aliasing concerns, or any <string.h>
    functions. And typically the unioning/deunioning produces
    no generated code.

    Of course it helps to have an appropriate union type predefined;
    here I wrote it inline to make the example self-contained.
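A sketch comparing the memcpy() idiom from the question with the union approach, assuming a 32-bit IEEE-754 float (checked with _Static_assert). Both helpers have hypothetical names and should return the same bit pattern, e.g. 0x3f800000 for 1.0f.

```c
#include <stdint.h>
#include <string.h>

/* Assumes float is a 32-bit IEEE-754 type, as on x86-64. */
_Static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");

/* memcpy() version: defined behaviour, and compilers typically turn the
   fixed-size copy into a register move or nothing at all. */
static int32_t float_bits_memcpy(float x)
{
    int32_t ix;
    memcpy(&ix, &x, sizeof ix);
    return ix;
}

/* Union version, using a compound literal as in the post above: the
   initializer { x } sets the first member (f), and reading i32
   reinterprets those bytes (permitted in C, though not in C++). */
static int32_t float_bits_union(float x)
{
    return (union { float f; int32_t i32; }){ x }.i32;
}
```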

  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Sep 9 08:31:13 2024
    [email protected] (Anton Ertl) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there
    is an overlap of the memory areas. But then I remembered that you
    cannot write such a check in standard C without (in the general
    case) exercising undefined behaviour;

    Yes, I can.

    and then the compiler could eliminate the check or do something
    else that's unexpected. Do you have such a check in mind that
    does not exercise undefined behaviour in the general case?

    Sure. I wouldn't have made my earlier statement otherwise.

    You also stated "I'm confident the people who wrote the C standard
    would say such a program is strictly conforming." about a program with implementation-defined behaviour, so I lack confidence in your claim.

    Oh? Do you have some reason to think your sense of the beliefs and
    attitudes of people on the ISO C committee is better than mine?

    2) Even if there is such a check, you have to be aware that there
    is a potential problem with memcpy(). In that case the way to go
    is to just use memmove().

    The point of my previous comment was only to address the question
    of whether any existing memcpy() calls are problematic. If all
    of the checks return "no overlap" then memcpy() is not the problem.

    At least for the test runs.

    Yes, the notion is to test exactly the runs that customers say
    are giving problems, if necessary by having customers run a
    version with the overlapping checks put in.

    But that does not help you with the next "clever" idea that some
    compiler or library maintainer has.

    I have the impression that this is an editorial comment having
    nothing to do with memcpy() or memmove(). If that impression
    is wrong then I'm at a loss to understand what you are talking
    about, and would you please elaborate.

    There are at least 200 undefined behaviours in the C standard, and
    according to some people, C programmers should avoid all of them. So
    the possible breakage of memcpy() is just one of many problems that
    the programmers should be aware of and that they should test for.

    Just because we discussed memcpy() as one of the problems with this
    approach does not mean that having a way to deal with memcpy() solves
    the larger problem.

    So you're saying my impression that your comment didn't really
    have anything to do with memcpy() or memmove() is right?

  • From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 15:02:51 2024
    Michael S <[email protected]> writes:
    On Mon, 09 Sep 2024 12:28:13 GMT
    [email protected] (Anton Ertl) wrote:


    But when changing the length to 63:
    ...
    An interesting question is which code I want in this case.
    In absence of -march options and with -O1|2|3 I want something like
    that:

    foo2:
    movups (%rsi), %xmm0
    movups 16(%rsi), %xmm1
    movups 32(%rsi), %xmm2
    movups 47(%rsi), %xmm3
    movups %xmm0, (%rdi)
    movups %xmm1, 16(%rdi)
    movups %xmm2, 32(%rdi)
    movups %xmm3, 47(%rdi)
    ret

    Yes.

    Without deep thinking I don't see why I would want anything
    different for foo1().

    I don't think that deep thinking helps here. One could run microbenchmarks, but do they actually represent application use?

    Given that the code is inlined, you can reduce register pressure (and
    potential spilling and refilling cost) with:

    foo1:
    movups (%rsi), %xmm0
    movups %xmm0, (%rdi)
    movups 16(%rsi), %xmm0
    movups %xmm0, 16(%rdi)
    movups 32(%rsi), %xmm0
    movups %xmm0, 32(%rdi)
    movups 47(%rsi), %xmm0
    movups %xmm0, 47(%rdi)

    Interestingly, gcc uses this kind of scheduling, but different
    register names, squandering that advantage of its scheduling. But I
    did not test that in a situation where register pressure plays a role.

    - anton

  • From Brett@21:1/5 to David Brown on Mon Sep 9 19:25:51 2024
    David Brown <[email protected]> wrote:
    On 09/09/2024 08:56, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 19:04, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external data,
    for anything involving bit manipulation (use of &, |, ^, << and >>
    on signed types is usually wrong, IMHO), as building blocks in
    extended arithmetic types, for the few occasions when you want
    two's complement wrapping, and for the even fewer occasions when
    you actually need that last bit of range.

    That last paragraph enumerates pretty much all the uses I have for
    integer-type variables, with (like Mitch) a few apis that use (-1)
    as an error signal that has to be handled with special code.


    You don't have loop counters, array indices, or integer arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of course
    fine with unsigned i, arrays always use a zero base so in Rust the
    only array index type is usize, i.e. the largest supported unsigned
    type in the system, typically the same as u64.

    Loop counters can usually be signed or unsigned, and it usually makes
    no difference.  Array indices are also usually much the same signed or
    unsigned, and it can feel more natural to use size_t here (an unsigned
    type).  It can make a difference to efficiency, however.  On x86-64,
    this code is 3 instructions with T as "unsigned long int" or "long
    int", 4 with "int", and 5 with "unsigned int".

    int foo(int * p, T x) {
         int a = p[x++];
         int b = p[x++];
         return a + b;
    }

    ;;  assume *p in rdi, x in rsi

      mov rax,[rdi+rsi]
      add rax,[rdi+rsi+8]
      ret

    Yes - that's three instructions for 64-bit type T. (To be clear, I had counted the "ret" here.)

    With 32-bit int for T, you need a "movsx rsi, esi" first to sign-extend
    the 32-bit int parameter "x" to 64 bits. (That could be different for different ABIs.) With 32-bit unsigned int for T you need an additional instruction to make sure the result of the first "x++" is wrapped as
    32-bit unsigned.
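A small sketch of why the unsigned int case needs the extra instruction: after x++ the value wraps from UINT32_MAX to 0, so p[x] and p[x+1] need not be neighbours in memory and the compiler cannot simply use [rdi+rsi*4] and [rdi+rsi*4+4]. With a signed int, overflow is undefined, so the compiler may assume it does not happen. The helper names below are hypothetical.

```c
#include <stdint.h>

/* Value of x after the first x++ in foo(), computed in the type of x,
   i.e. the index of the second element foo() reads. */
static uint64_t next_index_u32(uint32_t x)
{
    x++;          /* defined to wrap: UINT32_MAX + 1 becomes 0 */
    return x;
}

static uint64_t next_index_u64(uint64_t x)
{
    x++;          /* would wrap only at UINT64_MAX */
    return x;
}
```

For x == UINT32_MAX the 32-bit version yields 0 while the 64-bit version yields 4294967296, which is exactly the case the code generator has to handle for T = unsigned int.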


    Or you could just write sane code that matches what you want to say.

    :-)


    Of course the fine line between "smart code" and "smart-arse code" is somewhat subjective!

    It also varies over time, and depends on the needs of the code.
    Sometimes it makes sense to prioritise efficiency over readability - but
    that is rare, and has been getting steadily rarer over the decades as processors have been getting faster (disproportionately so for
    inefficient code) and compilers have been getting better.

    Often you get the most efficient results by writing code clearly and
    simply, so that the compiler can understand it better and generate good object
    code. This is particularly true if you want the same source to be used
    on different targets or different variants of a target - few people can
    track the instruction scheduling and timings on multiple processors
    better than a good compiler. (And the few people who /can/ do that
    spend their time chatting in comp.arch instead of writing code...) When
    you do hand-made micro-optimisations, these can work against the
    compiler and give poorer results overall.

    I know of no example where hand-optimized code does worse on a newer CPU.
    A newer CPU with bigger OoOE will effectively unroll your code and schedule
    it even better.

    It's older, lesser CPUs where your hand-optimized code might fail hard, and I know of few examples of that. None, actually.

    This is especially the case
    when code is moved around with inlining, constant propagation,
    unrolling, link-time optimisation, etc.

    Long ago, it was a different matter - then compilers needed more help to
    get good results. And compilers are far from perfect - there are still
    times when "smart" code or assembly-like C is needed (such as when
    taking advantage of some vector and SIMD facilities).

  • From Stefan Monnier@21:1/5 to All on Mon Sep 9 15:55:32 2024
    So it's all up to the programmer, who often doesn't know either.
    Other than using CompCert, I don't know of any reliable way for
    a programmer to make sure his C code does not suffer from UB.
    There is no foolproof or complete method for C. There are other languages for which formal methods can come closer to proving the correctness of the code, but for most practical cases this is infeasible.

    I'm not talking about proving that your code is correct. I'm talking
    about making sure that your code can do only those things that you
    wrote, as opposed to the situation with UB which includes all behaviors including those not written in your code.

    Any strongly typed language (Javascript, Python, Java, Haskell, ...)
    gives you such a guarantee with absolutely no effort required on the
    part of the programmer.


    Stefan

  • From MitchAlsup1@21:1/5 to Michael S on Mon Sep 9 20:52:29 2024
    On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

    When the negative stride can be compared to zero, yes; else, no.
    But the performance gain is often zero and sometimes negative.

  • From George Neuner@21:1/5 to Anton Ertl on Mon Sep 9 23:27:24 2024
    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The result of comparing pointers to two elements of the same array is
    defined. Cast to (char*), both src and dst can be considered to point
    to elements of the [address space sized] char array at address zero.

    Adding size_t to a pointer yields another pointer of the same type.


    All of gcc, clang and MSVC seem happy with this.


    2) Even if there is such a check, you have to be aware that there is a potential problem with memcpy(). In that case the way to go is to
    just use memmove(). But that does not help you with the next "clever"
    idea that some compiler or library maintainer has.

    - anton

  • From Thomas Koenig@21:1/5 to Michael S on Tue Sep 10 05:22:32 2024
    Michael S <[email protected]> schrieb:
    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

    Depends on what the alternative is.

    For a Fortran assignment

    a(n1:n2) = a(n3:n4)

    the semantics of the language demand that the RHS is evaluated
    completely before the assignment. In the case of the wrong
    kind of overlap, a negative stride can be used instead of
    using an array temporary.
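In C terms, the trick Thomas describes might look like the following sketch (0-based indices, a hypothetical function name, and one array so that comparing the start indices is well defined). It is meant to illustrate the idea, not how gfortran actually implements array assignment.

```c
#include <stddef.h>

/* a[d .. d+n-1] = a[s .. s+n-1] on a single array without a temporary.
   If the destination starts after the source, a forward loop would
   overwrite source elements before reading them, so use a negative
   stride in that case. */
static void section_assign(double *a, size_t d, size_t s, size_t n)
{
    if (d <= s) {
        for (size_t i = 0; i < n; i++)     /* positive stride */
            a[d + i] = a[s + i];
    } else {
        for (size_t i = n; i-- > 0; )      /* negative stride */
            a[d + i] = a[s + i];
    }
}
```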

  • From Michael S@21:1/5 to [email protected] on Tue Sep 10 10:35:31 2024
    On Mon, 9 Sep 2024 20:52:29 +0000
    [email protected] (MitchAlsup1) wrote:

    On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

    When the negative stride can be compared to zero, yes. else no.
    But the performance gain is often zero and sometimes negative.

    The direction of the count is not related to the sign of the pointer's
    stride.

  • From Michael S@21:1/5 to Thomas Koenig on Tue Sep 10 10:36:55 2024
    On Tue, 10 Sep 2024 05:22:32 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

    Depends on what the alternative is.

    For a Fortran assignment

    a(n1:n2) = a(n3:n4)

    the semantics of the language demand that the RHS is evaluated
    completely before the assignment. In the case of the wrong
    kind of overlap, a negative stride can be used instead of
    using an array temporary.

    That sounds like memmove. The context of the discussion was memcpy.

  • From Michael S@21:1/5 to George Neuner on Tue Sep 10 11:21:01 2024
    On Mon, 09 Sep 2024 23:27:24 -0400
    George Neuner <[email protected]> wrote:

    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you
    cannot write such a check in standard C without (in the general case) exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The result of comparing pointers to two elements of the same array is defined. Cast to (char*), both src and dst can be considered to point
    to elements of the [address space sized] char array at address zero.


    According to my understanding, your 'can be considered' part is not
    codified in the C Standard.

    Adding size_t to a pointer yields another pointer of the same type.


    All of gcc, clang and MSVC seem happy with this.


    It works. But is it guaranteed to work in the future by some sort of
    document? I am pretty sure that no such guarantee exists in gcc and
    MSVC docs. I did not look in clang docs. Trying to find anything in
    LLVM/clang docs makes me sad.

  • From Anton Ertl@21:1/5 to George Neuner on Tue Sep 10 08:19:32 2024
    George Neuner <[email protected]> writes:
    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:
    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The result of comparing pointers to two elements of the same array is defined. Cast to (char*), both src and dst can be considered to point
    to elements of the [address space sized] char array at address zero.

    Yes, that would be reasonable. Unfortunately, "optimizations" that
    assume that undefined behaviour does not happen are not justified by
    assigning reasonable meaning to language constructs, but by giving
    only the little meaning to language constructs that the standard
    requires, and in the case of relational comparisons (<, >, <=, >=)
    between pointers to different objects, the C standard does not define
    a meaning for that.

    All of gcc, clang and MSVC seem happy with this.

    But the next version of gcc or clang might see such a check and decide
    to bite you.

    One can cast the pointers into an uintptr_t, and try to do the check
    there. AFAIK the result would be implementation-defined, but on an architecture with a flat address space it's unlikely that they will
    find a way to compile the code in a different way than the programmer
    intended without making "relevant" programs slower.
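A sketch of the uintptr_t-based overlap check, with a hypothetical debug wrapper of the kind Tim suggests for test runs. As noted above, the pointer-to-integer conversion is implementation-defined; the code assumes a flat address space where it yields the plain byte address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if [a, a+n) and [b, b+n) share at least one byte. Comparing
   uintptr_t values avoids the undefined relational comparison of
   pointers into different objects (assumes a flat address space). */
static int ranges_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t x = (uintptr_t)a, y = (uintptr_t)b;
    return n != 0 && (x < y ? y - x < n : x - y < n);
}

/* Hypothetical debugging wrapper: abort on an overlapping memcpy()
   instead of invoking undefined behaviour. Disabled by NDEBUG. */
static void *checked_memcpy(void *dst, const void *src, size_t n)
{
    assert(!ranges_overlap(dst, src, n));
    return memcpy(dst, src, n);
}
```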

    - anton

  • From Michael S@21:1/5 to Michael S on Tue Sep 10 12:50:20 2024
    On Mon, 9 Sep 2024 15:21:27 +0300
    Michael S <[email protected]> wrote:

    On Mon, 9 Sep 2024 08:56:45 +0200
    Terje Mathisen <[email protected]> wrote:

    David Brown wrote:
    On 05/09/2024 19:04, Terje Mathisen wrote:
    David Brown wrote:
    On 05/09/2024 11:12, Terje Mathisen wrote:
    David Brown wrote:
    Unsigned types are ideal for "raw" memory access or external
    data, for anything involving bit manipulation (use of &, |, ^,
    << and >> on signed types is usually wrong, IMHO), as building
    blocks in extended arithmetic types, for the few occasions
    when you want two's complement wrapping, and for the even
    fewer occasions when you actually need that last bit of
    range.

    That last paragraph enumerates pretty much all the uses I have
    for integer-type variables, with (like Mitch) a few apis that
    use (-1) as an error signal that has to be handled with special
    code.

    You don't have loop counters, array indices, or integer
    arithmetic?

    Loop counters of the for (i= 0; i < LIMIT; i++) type are of
    course fine with unsigned i, arrays always use a zero base so in
    Rust the only array index type is usize, i.e. the largest
    supported unsigned type in the system, typically the same as
    u64.

    Loop counters can usually be signed or unsigned, and it usually
    makes no difference.  Array indices are also usually much the same signed or unsigned, and it can feel more natural to use size_t
    here (an unsigned type).  It can make a difference to efficiency, however.  On x86-64, this code is 3 instructions with T as
    "unsigned long int" or "long int", 4 with "int", and 5 with
    "unsigned int".

    int foo(int * p, T x) {
        int a = p[x++];
        int b = p[x++];
        return a + b;
    }

    ;; assume *p in rdi, x in rsi

    mov rax,[rdi+rsi]
    add rax,[rdi+rsi+8]
    ret


    more like
    mov rax,[rdi+rsi*4]
    add rax,[rdi+rsi*4+8]
    ret


    Should be:
    mov eax,[rdi+rsi*4]
    add eax,[rdi+rsi*4+4]
    ret
    :(


    But that's not the point (==trap).
    The point (==trap), I'd guess, is that for T=uint32_t the code generator
    has to account for the possibility of x==2**32-1.
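    The wrap-around can be shown without assembly. With a 32-bit unsigned int index, x++ is defined to wrap from UINT_MAX to 0, so the second load in foo() is not necessarily 4 bytes after the first, and the compiler has to zero-extend x and x+1 separately instead of using one base+index pair with a +4 displacement. With int, overflow is undefined, so the compiler may assume x+1 does not wrap and only needs a sign extension; with a 64-bit type no extension is needed at all. The function names are illustrative, and the sketch assumes unsigned int is 32 bits and size_t 64 bits, as on x86-64.

```c
#include <limits.h>
#include <stddef.h>

/* Second index in foo() when T is unsigned int: unsigned arithmetic
   wraps, so next_index_uint(UINT_MAX) is 0, not UINT_MAX + 1. */
static unsigned int next_index_uint(unsigned int x)
{
    return x + 1u;
}

/* The same step with T = size_t; on x86-64 size_t is 64 bits,
   so incrementing any 32-bit value cannot wrap. */
static size_t next_index_size(size_t x)
{
    return x + 1;
}
```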

  • From Michael S@21:1/5 to Anton Ertl on Tue Sep 10 11:49:03 2024
    On Sun, 08 Sep 2024 15:36:39 GMT
    [email protected] (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?


    The check that reliably catches all overlaps seems easy.
    E.g. (src <= dst) == (src+len > dst)

    In theory, on unusual hardware platform it can give false positives.
    Maybe, for the task at hand, that's OK.

  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Sep 10 08:33:45 2024
    Thomas Koenig <[email protected]> writes:
    Michael S <[email protected]> schrieb:
    [on memcpy() where glibc used negative stride on some hardware and
    existing binaries no longer worked as intended]
    Does hardware on which negative stride is faster really exist?

    Depends on what the alternative is.

    For a Fortran assignment

    a(n1:n2) = a(n3:n4)

    the semantics of the language demand that the RHS is evaluated
    completely before the assignment. In the case of the wrong
    kind of overlap, a negative stride can be used instead of
    using an array temporary.
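    A C sketch of what a compiler can do for a(n1:n2) = a(n3:n4) without a temporary, assuming both sections lie in the same array and have the same length. Because both are indices into one array, choosing the direction is plain integer arithmetic. The function name assign_section is hypothetical, and it uses 0-based indices rather than Fortran's default lower bound of 1.

```c
#include <stddef.h>

/* a[dst .. dst+len-1] = a[src .. src+len-1] with Fortran semantics:
   the result is as if the whole right-hand side were read first. */
static void assign_section(int *a, size_t dst, size_t src, size_t len)
{
    if (dst <= src) {
        /* Destination starts at or before the source: a positive stride
           only overwrites elements that have already been read. */
        for (size_t i = 0; i < len; i++)
            a[dst + i] = a[src + i];
    } else {
        /* Destination starts after the source: walk backwards
           (negative stride) to get the same guarantee. */
        for (size_t i = len; i > 0; i--)
            a[dst + i - 1] = a[src + i - 1];
    }
}
```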

    Which is a completely different situation from the one that was
    assumed by Ulrich Drepper: that there is no overlap between the source
    and the target of memcpy(), and if there is, the programmer "should
    never have been allowed to touch a keyboard" (i.e., the user of the programmer's program deserves the breakage). So Ulrich Drepper
    considered himself free to use an arbitrary stride, with no language
    semantics limiting him. And he claimed that for some hardware,
    negative stride is faster.
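    Why the stride matters to callers that (wrongly) pass overlapping buffers: an ascending byte copy and a descending byte copy give different results on overlapping regions, so a binary that happened to work with one memcpy() direction can break when the library switches to the other. The two loops below are hypothetical stand-ins, not glibc's code.

```c
#include <stddef.h>

/* Copies n bytes in ascending address order (positive stride). */
static void copy_up(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copies n bytes in descending address order (negative stride). */
static void copy_down(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = n; i > 0; i--)
        dst[i - 1] = src[i - 1];
}
```

    With buf = "abcdefg", copy_up(buf + 1, buf, 4) smears the first byte and leaves "aaaaafg", while copy_down(buf + 1, buf, 4) leaves "aabcdfg", which is what memmove() would produce.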

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Brett on Tue Sep 10 12:47:37 2024
    On 09/09/2024 21:25, Brett wrote:
    David Brown <[email protected]> wrote:

    Of course the fine line between "smart code" and "smart-arse code" is
    somewhat subjective!

    It also varies over time, and depends on the needs of the code.
    Sometimes it makes sense to prioritise efficiency over readability - but
    that is rare, and has been getting steadily rarer over the decades as
    processors have been getting faster (disproportionally so for
    inefficient code) and compilers have been getting better.

    Often you get the most efficient results by writing code clearly and
    simply so that the compiler can understand it better and generate good object
    code. This is particularly true if you want the same source to be used
    on different targets or different variants of a target - few people can
    track the instruction scheduling and timings on multiple processors
    better than a good compiler. (And the few people who /can/ do that
    spend their time chatting in comp.arch instead of writing code...) When
    you do hand-made micro-optimisations, these can work against the
    compiler and give poorer results overall.

    I know of no example where hand optimized code does worse on a newer CPU.
    A newer CPU with bigger OoOe will effectively unroll your code and schedule it even better.

    I would agree with you there. For the same object code, newer CPUs
    (with the same ISA) are typically faster for a variety of reasons.
    There may be the odd regression, but it is hard to market a newer CPU if
    it is slower than the older ones!

    However, my point was that "hand-optimised" source code can lead to
    poorer results on newer /compilers/ compared to simpler source code. If
    you've googled for "bit twiddling hacks" for cool tricks, or written
    something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the
    results will be slower with a modern compiler and modern cpu, even
    though the "hand-optimised" version might have been faster two decades
    ago. You can expect the modern tool to convert the multiplication into
    shifts and adds if that is more efficient on the target, or a
    multiplication if that is best on the target. But you can't expect the compiler to turn the shifts and adds into a multiplication. (Sometimes
    it can, but you can't expect it to.)
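    The two spellings compute the same value because 16 + 4 + 1 = 21. A minimal check, using unsigned 64-bit arithmetic so that wraparound is defined and left shifts of negative values never arise; the function names are illustrative. It only demonstrates the equivalence, not what instructions any particular compiler emits for either form.

```c
#include <stdint.h>

/* The "hand-optimised" spelling: x*16 + x*4 + x. */
static uint64_t times21_shifts(uint64_t x)
{
    return (x << 4) + (x << 2) + x;
}

/* The plain spelling; the compiler may choose imul, lea, or shifts. */
static uint64_t times21_mul(uint64_t x)
{
    return x * 21;
}
```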


    It’s older, lesser CPUs where your hand-optimized code might fail hard, and
    I know of few examples of that. None actually.

    This is especially the case
    when code is moved around with inlining, constant propagation,
    unrolling, link-time optimisation, etc.

    Long ago, it was a different matter - then compilers needed more help to
    get good results. And compilers are far from perfect - there are still
    times when "smart" code or assembly-like C is needed (such as when
    taking advantage of some vector and SIMD facilities).


  • From Tim Rentsch@21:1/5 to Michael S on Tue Sep 10 07:45:07 2024
    Michael S <[email protected]> writes:

    On Mon, 09 Sep 2024 23:27:24 -0400
    George Neuner <[email protected]> wrote:

    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you
    cannot write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The result of comparing pointers to two elements of the same array is
    defined. Cast to (char*), both src and dst can be considered to point
    to elements of the [address space sized] char array at address zero.

    According to my understanding, your 'can be considered' part is not
    codified in the C Standard.

    Right.

    Adding size_t to a pointer yields another pointer of the same type.


    All of gcc, clang and MSVC seem happy with this.

    It works. But is it guaranteed to work in the future by some sort of document? I am pretty sure that no such guarantee exists in gcc and
    MSVC docs. I did not look in clang docs. Trying to find anything in LLVM/clang docs makes me sad.

    What is being sought is something that works on any implementation
    allowed by the C standard, including those that exist only in
    someone's imagination.

  • From Terje Mathisen@21:1/5 to Brett on Tue Sep 10 11:30:28 2024
    Brett wrote:
    David Brown <[email protected]> wrote:
    Often you get the most efficient results by writing code clearly and
    simply so that the compiler can understand it better and generate good object
    code. This is particularly true if you want the same source to be used
    on different targets or different variants of a target - few people can
    track the instruction scheduling and timings on multiple processors
    better than a good compiler. (And the few people who /can/ do that
    spend their time chatting in comp.arch instead of writing code...) When
    you do hand-made micro-optimisations, these can work against the
    compiler and give poorer results overall.

    I know of no example where hand optimized code does worse on a newer CPU.
    A newer CPU with bigger OoOe will effectively unroll your code and schedule it even better.

    Not true:

    My favorite benchmark program for 20+ years was Word Count, I
    re-optimized that for every new x86 generation, and on the Pentium I got
    it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).

    When the PentiumPro came out, it did a 10-20 cycle stall for every pair
    of characters, so about an order of magnitude slower in cycle count.
    (But only about 3X clock time due to being 200 instead of 60 MHz.)


    It’s older, lesser CPUs where your hand-optimized code might fail hard, and
    I know of few examples of that. None actually.

    This is especially the case
    when code is moved around with inlining, constant propagation,
    unrolling, link-time optimisation, etc.

    Long ago, it was a different matter - then compilers needed more help to
    get good results. And compilers are far from perfect - there are still
    times when "smart" code or assembly-like C is needed (such as when
    taking advantage of some vector and SIMD facilities).

    Right.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Tim Rentsch@21:1/5 to Michael S on Tue Sep 10 07:37:59 2024
    Michael S <[email protected]> writes:

    On Sun, 08 Sep 2024 15:36:39 GMT
    [email protected] (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The check that reliably catches all overlaps seems easy.
    E.g. (src <= dst) == (src+len > dst)

    In theory, on unusual hardware platform it can give false positives.
    Maybe, for the task at hand, that's OK.

    The challenge is to find portable C that doesn't enter the arena
    of undefined behavior (and also detects exactly those cases where
    overlap occurs), and that is quite a stringent criterion.

    The comparison shown works if src and dst both point to elements
    of the same array. But if they don't, comparing pointers to see
    if one is less than another (or any of <, <=, >, >=) is undefined
    behavior. At the bit level it wouldn't surprise me to learn that
    the test shown always returns accurate information. However the
    C standard doesn't promise that a bit-level comparison will be
    done, and implementations are allowed to do anything at all for
    this test in cases where src and dst point to (somewhere within)
    different top-level objects. What the hardware does doesn't
    matter - what needs to be satisfied are the rules of the C
    standard, and they are less forgiving.

    I should add that I appreciate your proposed solution; it's
    better than what I think I would have come up with under a
    similar set of assumptions.

  • From Anton Ertl@21:1/5 to Michael S on Tue Sep 10 15:30:26 2024
    Michael S <[email protected]> writes:
    On Sun, 08 Sep 2024 15:36:39 GMT
    [email protected] (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:
    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?


    The check that reliably catches all overlaps seems easy.
    E.g. (src <= dst) == (src+len > dst)

    In theory, on unusual hardware platform it can give false positives.

    That is probably the original motivation for that lack of definition
    (e.g., compare only the offset on large-model 8086).

    However, if the compiler ATUBDNH (assumes that undefined behaviour does not happen), that assumption can lead to the
    "knowledge" that src and dest point into the same object, and that may
    produce unintended results beyond false positives on some hardware
    platforms.

    I have not heard about a C compiler that has this misfeature, but I
    would not be surprised if it shows up at some point (hopefully with
    some flag to define the ordering of pointers to different objects).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From George Neuner@21:1/5 to [email protected] on Tue Sep 10 11:33:05 2024
    On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
    <[email protected]> wrote:

    On Mon, 09 Sep 2024 23:27:24 -0400
    George Neuner <[email protected]> wrote:

    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you
    cannot write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The result of comparing pointers to two elements of the same array is
    defined. Cast to (char*), both src and dst can be considered to point
    to elements of the [address space sized] char array at address zero.


    According to my understanding, your 'can be considered' part is not
    codified in the C Standard.

    Adding size_t to a pointer yields another pointer of the same type.


    All of gcc, clang and MSVC seem happy with this.


    It works. But is it guaranteed to work in the future by some sort of document? I am pretty sure that no such guarantee exists in gcc and
    MSVC docs. I did not look in clang docs. Trying to find anything in LLVM/clang docs makes me sad.

    I know that it has worked as expected with every version of gcc and
    Microsoft I've used since 1988. [clang I don't use, but I tried it on godbolt.org with the most recent version]

    Will it continue to work ... who knows?


    I definitely am NOT an expert on the C standard, but thinking about
    it, it occurred to me that if an array is explicitly defined that
    *might* cover all memory (or at least all heap), then the compiler
    would have to honor any apparent pointers into it.

    E.g., char (*all_memory)[] = 0;

    None of the compilers at godbolt seem to need this to compare
    arbitrary addresses as char*, but all accept it.

    Obviously speculation, but it's the best I have.

  • From Anton Ertl@21:1/5 to David Brown on Tue Sep 10 17:16:07 2024
    David Brown <[email protected]> writes:
    However, my point was that "hand-optimised" source code can lead to
    poorer results on newer /compilers/ compared to simpler source code. If >you've googled for "bit twiddling hacks" for cool tricks, or written >something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the >results will be slower with a modern compiler and modern cpu, even
    though the "hand-optimised" version might have been faster two decades
    ago. You can expect the modern tool to convert the multiplication into >shifts and adds if that is more efficient on the target, or a
    multiplication if that is best on the target. But you can't expect the >compiler to turn the shifts and adds into a multiplication.

    Why not? Let's see:

    [b3:~/tmp:109062] gcc -Os -c xxx-mul.c && objdump -d xxx-mul.o

    xxx-mul.o: file format elf64-x86-64


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 48 6b c7 15 imul $0x15,%rdi,%rax
    4: c3 ret
    [b3:~/tmp:109063] gcc -O3 -c xxx-mul.c && objdump -d xxx-mul.o

    xxx-mul.o: file format elf64-x86-64


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 48 8d 04 bf lea (%rdi,%rdi,4),%rax
    4: 48 8d 04 87 lea (%rdi,%rax,4),%rax
    8: c3 ret

    So gcc-12 obviously understands that your "hand-optimized" version is equivalent to the multiplication, and with -O3 then decides that the
    leas are faster.

    (Sometimes it can, but you can't expect it to.)

    That also works the other way.

    But it becomes really annoying when I intend it not to perform a transformation, and it performs the transformation, like when writing
    "-(x>0)" and the compiler turns that into a conditional branch. These
    days gcc does not do that, but I have just seen another twist:

    long bar(long x)
    {
    return -(x>0);
    }

    gcc-12 -O3 turns this into:

    10: 31 c0 xor %eax,%eax
    12: 48 85 ff test %rdi,%rdi
    15: 0f 9f c0 setg %al
    18: f7 d8 neg %eax
    1a: 48 98 cltq
    1c: c3 ret

    So sign-extension optimization is apparently still lacking.
    Clang-14 handles this fine:

    10: 31 c0 xor %eax,%eax
    12: 48 85 ff test %rdi,%rdi
    15: 0f 9f c0 setg %al
    18: 48 f7 d8 neg %rax
    1b: c3 ret
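    For context, -(x>0) is a branch-free idiom: the comparison yields 0 or 1, and negating gives 0 or -1, i.e. an all-zeros or all-ones mask (assuming two's complement, which is universal today and required since C23). A minimal sketch of a typical use, a branchless select; the helper select_if_positive is hypothetical, and whether a given compiler keeps it free of branches is, as the listings show, up to the compiler.

```c
/* bar() from the post: 0 if x <= 0, -1 (all bits set) if x > 0. */
long bar(long x)
{
    return -(x > 0);
}

/* Hypothetical helper: returns a if x > 0, else b, without an if. */
long select_if_positive(long x, long a, long b)
{
    long mask = bar(x);              /* all ones or all zeros */
    return (a & mask) | (b & ~mask);
}
```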

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Brett@21:1/5 to Terje Mathisen on Tue Sep 10 18:03:01 2024
    Terje Mathisen <[email protected]> wrote:
    Brett wrote:
    David Brown <[email protected]> wrote:
    Often you get the most efficient results by writing code clearly and
    simply so that the compiler can understand it better and generate good object
    code. This is particularly true if you want the same source to be used
    on different targets or different variants of a target - few people can
    track the instruction scheduling and timings on multiple processors
    better than a good compiler. (And the few people who /can/ do that
    spend their time chatting in comp.arch instead of writing code...) When
    you do hand-made micro-optimisations, these can work against the
    compiler and give poorer results overall.

    I know of no example where hand optimized code does worse on a newer CPU.
    A newer CPU with bigger OoOe will effectively unroll your code and schedule it even better.

    Not true:

    My favorite benchmark program for 20+ years was Word Count, I
    re-optimized that for every new x86 generation, and on the Pentium I got
    it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).

    When the PentiumPro came out, it did a 10-20 cycle stall for every pair
    of characters, so about an order of magnitude slower in cycle count.
    (But only about 3X clock time due to being 200 instead of 60 MHz.)

    But how big a slowdown did the unoptimized code get?

    Are you describing a glass jaw handling unpredictable branches on a CPU
    with a much longer pipeline?

    A shorter pipeline with better worst case handling is going to do better,
    even if older. Intel was going for high clock benchmark speed, not
    performance.

    It’s older, lesser CPUs where your hand-optimized code might fail hard, and
    I know of few examples of that. None actually.

    This is especially the case
    when code is moved around with inlining, constant propagation,
    unrolling, link-time optimisation, etc.

    Long ago, it was a different matter - then compilers needed more help to
    get good results. And compilers are far from perfect - there are still
    times when "smart" code or assembly-like C is needed (such as when
    taking advantage of some vector and SIMD facilities).

    Right.

    Terje


  • From MitchAlsup1@21:1/5 to Michael S on Tue Sep 10 18:36:38 2024
    On Tue, 10 Sep 2024 7:35:31 +0000, Michael S wrote:

    On Mon, 9 Sep 2024 20:52:29 +0000
    [email protected] (MitchAlsup1) wrote:

    On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

    When the negative stride can be compared to zero, yes; else, no.
    But the performance gain is often zero and sometimes negative.

    Direction of the count is not related to the sign of pointer's
    stride.

    For the record; I was responding to an array index stride not a
    pointer stride.
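    A sketch of the "compared to zero" case in C, assuming it refers to a loop whose index counts down and terminates on zero. On ISAs where the decrement sets the flags, or that have a branch-on-zero instruction, the descending loop can drop the explicit compare against n. The function names are illustrative, and a modern compiler may rewrite either loop (for instance into pointer increments or vector code), so, as said above, the gain is often nothing.

```c
#include <stddef.h>

/* Ascending index: each iteration compares i against n. */
static long sum_up(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Descending index (negative stride): the loop test is i != 0,
   a compare against zero. */
static long sum_down(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = n; i != 0; i--)
        s += a[i - 1];
    return s;
}
```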

  • From David Brown@21:1/5 to Anton Ertl on Tue Sep 10 20:45:53 2024
    On 10/09/2024 19:16, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    However, my point was that "hand-optimised" source code can lead to
    poorer results on newer /compilers/ compared to simpler source code. If
    you've googled for "bit twiddling hacks" for cool tricks, or written
    something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the
    results will be slower with a modern compiler and modern cpu, even
    though the "hand-optimised" version might have been faster two decades
    ago. You can expect the modern tool to convert the multiplication into
    shifts and adds if that is more efficient on the target, or a
    multiplication if that is best on the target. But you can't expect the
    compiler to turn the shifts and adds into a multiplication.

    Why not? Let's see:

    [b3:~/tmp:109062] gcc -Os -c xxx-mul.c && objdump -d xxx-mul.o

    xxx-mul.o: file format elf64-x86-64


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 48 6b c7 15 imul $0x15,%rdi,%rax
    4: c3 ret
    [b3:~/tmp:109063] gcc -O3 -c xxx-mul.c && objdump -d xxx-mul.o

    xxx-mul.o: file format elf64-x86-64


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 48 8d 04 bf lea (%rdi,%rdi,4),%rax
    4: 48 8d 04 87 lea (%rdi,%rax,4),%rax
    8: c3 ret

    So gcc-12 obviously understands that your "hand-optimized" version is equivalent to the multiplication, and with -O3 then decides that the
    leas are faster.

    (Sometimes it can, but you can't expect it to.)

    Again - sometimes a compiler will recognise a particular hand-optimised pattern, turn it back to something logically simpler, then optimise from
    there. But you cannot /expect/ that. On the whole, compilers are more
    likely to recognise clear and simple patterns than complex ones,
    especially using bit manipulation in odd ways.

    There will always be exceptions, this is just a general rule.

    And a related general rule is that /humans/ are much better at
    understanding clear code written in a logical way, than something weird
    and hand-optimised.


    That also works the other way.

    But it becomes really annoying when I intend it not to perform a transformation, and it performs the transformation, like when writing "-(x>0)" and the compiler turns that into a conditional branch. These
    days gcc does not do that, but I have just seen another twist:

    long bar(long x)
    {
    return -(x>0);
    }

    gcc-12 -O3 turns this into:

    10: 31 c0 xor %eax,%eax
    12: 48 85 ff test %rdi,%rdi
    15: 0f 9f c0 setg %al
    18: f7 d8 neg %eax
    1a: 48 98 cltq
    1c: c3 ret

    So sign-extension optimization is apparently still lacking. Clang-14 handles this fine:

    10: 31 c0 xor %eax,%eax
    12: 48 85 ff test %rdi,%rdi
    15: 0f 9f c0 setg %al
    18: 48 f7 d8 neg %rax
    1b: c3 ret


    One day, perhaps, compilers will be perfect. But not yet :-(

  • From Michael S@21:1/5 to Tim Rentsch on Tue Sep 10 22:27:02 2024
    On Tue, 10 Sep 2024 07:37:59 -0700
    Tim Rentsch <[email protected]> wrote:


    I should add that I appreciate your proposed solution; it's
    better than what I think I would have come up with under a
    similar set of assumptions.

    Unfortunately, my solution is wrong and the mistake is not even subtle.
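    The post does not spell the mistake out; one reading is that (src <= dst) == (src+len > dst) only detects overlap when dst lies at or after src. If src > dst, then src+len > dst is always true, so the equality is false even when src lies inside [dst, dst+len). The sketch keeps both pointers inside one array, where relational comparison is defined, so the flaw shows up without undefined behaviour. Function names are hypothetical, and the symmetric version assumes both regions have length len, as with memcpy().

```c
#include <stddef.h>

/* The expression as posted. */
static int overlap_as_posted(const char *src, const char *dst, size_t len)
{
    return (src <= dst) == (src + len > dst);
}

/* Symmetric test: equal-length, non-empty ranges overlap iff each one
   starts before the other ends. Only defined for pointers into the
   same array, which is the whole difficulty discussed in this thread. */
static int overlap_symmetric(const char *src, const char *dst, size_t len)
{
    return len != 0 && src < dst + len && dst < src + len;
}
```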

  • From Stefan Monnier@21:1/5 to All on Tue Sep 10 15:09:41 2024
    Again - sometimes a compiler will recognise a particular hand-optimised pattern, turn it back to something logically simpler, then optimise from there. But you cannot /expect/ that.

    You might even consider those as performance bugs, since the
    hand-optimized code is sometimes chosen specifically to try and impose
    a particular kind of code. Compilers' "optimizations" are usually just heuristics, so compilers are often better off not being "too clever", so
    as to allow manual-optimization to override the heuristics: if
    programmers want to use the heuristics, they should write
    simple&clear code.


    Stefan

  • From Michael S@21:1/5 to Brett on Tue Sep 10 22:34:11 2024
    On Tue, 10 Sep 2024 18:03:01 -0000 (UTC)
    Brett <[email protected]> wrote:

    Terje Mathisen <[email protected]> wrote:
    Brett wrote:
    David Brown <[email protected]> wrote:
    Often you get the most efficient results by writing code clearly
    and simply so that the compiler can understand it better and good
    object code. This is particularly true if you want the same
    source to be used on different targets or different variants of a
    target - few people can track the instruction scheduling and
    timings on multiple processors better than a good compiler. (And
    the few people who /can/ do that spend their time chatting in
    comp.arch instead of writing code...) When you do hand-made
    micro-optimisations, these can work against the compiler and give
    poorer results overall.

    I know of no example where hand optimized code does worse on a
    newer CPU. A newer CPU with bigger OoOe will effectively unroll
    your code and schedule it even better.

    Not true:

    My favorite benchmark program for 20+ years was Word Count, I
    re-optimized that for every new x86 generation, and on the Pentium
    I got it to run at 1.5 clock cycles per character (40 MB/s on a 60
    MHz Pentium).

    When the PentiumPro came out, it did a 10-20 cycle stall for every
    pair of characters, so about an order of magnitude slower in cycle
    count. (But only about 3X clock time due to being 200 instead of 60
    MHz.)

    But how big a slowdown did the unoptimized code get?

    Are you describing a glass jaw handling unpredictable branches on a
    CPU with a much longer pipeline?

    No, the glass jaw of PPro described by Terje is known as partial
    register stall.


    A shorter pipeline with better worst case handling is going to do
    better, even if older. Intel was going for high clock benchmark
    speed, not performance.


    Typically, PPro was much faster than Pentium clock-for-clock,
    especially so when running 32-bit software.
    But it had a few weak points.

  • From Michael S@21:1/5 to [email protected] on Tue Sep 10 22:35:16 2024
    On Tue, 10 Sep 2024 18:36:38 +0000
    [email protected] (MitchAlsup1) wrote:

    On Tue, 10 Sep 2024 7:35:31 +0000, Michael S wrote:

    On Mon, 9 Sep 2024 20:52:29 +0000
    [email protected] (MitchAlsup1) wrote:

    On Mon, 9 Sep 2024 9:26:57 +0000, Michael S wrote:

    On Mon, 09 Sep 2024 07:07:25 GMT
    [email protected] (Anton Ertl) wrote:

    Does hardware on which negative stride is faster really exist?

    When the negative stride can be compared to zero, yes. else no.
    But the performance gain is often zero and sometimes negative.

    Direction of the count is not related to the sign of pointer's
    stride.

    For the record; I was responding to an array index stride not a
    pointer stride.

    Same thing

  • From Josh Vanderhoof@21:1/5 to Anton Ertl on Tue Sep 10 16:44:21 2024
    [email protected] (Anton Ertl) writes:

    George Neuner <[email protected]> writes:
    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:
    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The result of comparing pointers to two elements of the same array is
    defined. Cast to (char*), both src and dst can be considered to point
    to elements of the [address space sized] char array at address zero.

    Yes, that would be reasonable. Unfortunately, "optimizations" that
    assume that undefined behaviour does not happen are not justified by assigning reasonable meaning to language constructs, but by giving
    only the little meaning to language constructs that the standard
    requires, and in case of ordered comparisons between pointers to
    different objects, the C standard does not define a meaning for that.

    All of gcc, clang and MSVC seem happy with this.

    But the next version of gcc or clang might see such a check and decide
    to bite you.

    One can cast the pointers into an uintptr_t, and try to do the check
    there. AFAIK the result would be implementation-defined, but on an architecture with a flat address space it's unlikely that they will
    find a way to compile the code in a different way than the programmer intended without making "relevant" programs slower.

    It is legal to test for equality between pointers to different objects
    so you could test for overlap by testing against every element in the
    array. It seems like it should be possible for the compiler to figure
    out what's happening and optimize those tests away, but unfortunately
    no compiler I tested did it.

  • From Michael S@21:1/5 to Michael S on Wed Sep 11 02:24:13 2024
    On Tue, 10 Sep 2024 22:27:02 +0300
    Michael S <[email protected]> wrote:

    On Tue, 10 Sep 2024 07:37:59 -0700
    Tim Rentsch <[email protected]> wrote:


    I should add that I appreciate your proposed solution; it's
    better than what I think I would have come up with under a
    similar set of assumptions.

    Unfortunately, my solution is wrong and the mistake is not even subtle.



    This one appears to work: (src < dst+len) == (dst < src+len)

  • From Brett@21:1/5 to Michael S on Wed Sep 11 05:47:59 2024
    Michael S <[email protected]> wrote:
    On Tue, 10 Sep 2024 18:03:01 -0000 (UTC)
    Brett <[email protected]> wrote:

    Terje Mathisen <[email protected]> wrote:
    Brett wrote:
    David Brown <[email protected]> wrote:
    Often you get the most efficient results by writing code clearly
    and simply so that the compiler can understand it better and good
    object code. This is particularly true if you want the same
    source to be used on different targets or different variants of a
    target - few people can track the instruction scheduling and
    timings on multiple processors better than a good compiler. (And
    the few people who /can/ do that spend their time chatting in
    comp.arch instead of writing code...) When you do hand-made
    micro-optimisations, these can work against the compiler and give
    poorer results overall.

    I know of no example where hand optimized code does worse on a
    newer CPU. A newer CPU with bigger OoOe will effectively unroll
    your code and schedule it even better.

    Not true:

    My favorite benchmark program for 20+ years was Word Count, I
    re-optimized that for every new x86 generation, and on the Pentium
    I got it to run at 1.5 clock cycles per character (40 MB/s on a 60
    MHz Pentium).

    When the PentiumPro came out, it did a 10-20 cycle stall for every
    pair of characters, so about an order of magnitude slower in cycle
    count. (But only about 3X clock time due to being 200 instead of 60
    MHz.)

    But how big a slowdown did the unoptimized code get?

    Are you describing a glass jaw handling unpredictable branches on a
    CPU with a much longer pipeline?

    No, the glass jaw of PPro described by Terje is known as partial
    register stall.

    That is an exception that proves the rule. ;)

    A shorter pipeline with better worst case handling is going to do
    better, even if older. Intel was going for high clock benchmark
    speed, not performance.


    Typically, PPro was much faster than Pentium clock-for-clock,
    especially so when running 32-bit software.
    But it had a few weak points.

  • From Terje Mathisen@21:1/5 to Tim Rentsch on Wed Sep 11 13:07:33 2024
    Tim Rentsch wrote:
    Michael S <[email protected]> writes:

    On Sun, 08 Sep 2024 15:36:39 GMT
    [email protected] (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The check that reliably catches all overlaps seems easy.
    E.g. (src <= dst) == (src+len > dst)

    Does that work for dst < src? What if dst+len < src?

    I.e. no overlap?

    The first test will be false while the second test will always be true
    when src >= dst, so I think it will have false positives?

    What about:

    max(src,dst) < (min(src,dst)+len)

    If you have a min/max circuit, i.e. a two-element sorter, then it could
    be quite efficient, otherwise run the min first, then the max and the
    add during the second cycle, before the less than test in the third cycle.


    In theory, on an unusual hardware platform it can give false positives.
    Maybe, for the task at hand, that's OK.

    The challenge is to find portable C that doesn't enter the arena
    of undefined behavior (and also detects exactly those cases where
    overlap occurs), and that is quite a stringent criterion.

    The comparison shown works if src and dst both point to elements
    of the same array. But if they don't, comparing pointers to see
    if one is less than another (or any of <, <=, >, >=) is undefined
    behavior. At the bit level it wouldn't surprise me to learn that
    the test shown always returns accurate information. However the
    C standard doesn't promise that a bit-level comparison will be
    done, and implementations are allowed to do anything at all for
    this test in cases where src and dst point to (somewhere within)
    different top-level objects. What the hardware does doesn't
    matter - what needs to be satisfied are the rules of the C
    standard, and they are less forgiving.

    I should add that I appreciate your proposed solution; it's
    better than what I think I would have come up with under a
    similar set of assumptions.


    I do believe though that in reality it could be faster to use the
    branchy version, and let the branch predictors do their job instead of
    having to wait to evaluate all three terms:

    #include <stdbool.h>
    #include <stddef.h>

    bool is_overlap(char *src, char *dst, size_t len)
    {
        if (src < dst) {
            return (src+len > dst);
        }
        return (dst+len > src);
    }

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Brett on Wed Sep 11 13:31:50 2024
    Brett wrote:
    Terje Mathisen <[email protected]> wrote:
    Brett wrote:
    David Brown <[email protected]> wrote:
    Often you get the most efficient results by writing code clearly and
    simply so that the compiler can understand it better and good object
    code. This is particularly true if you want the same source to be used
    on different targets or different variants of a target - few people can
    track the instruction scheduling and timings on multiple processors
    better than a good compiler. (And the few people who /can/ do that
    spend their time chatting in comp.arch instead of writing code...) When
    you do hand-made micro-optimisations, these can work against the
    compiler and give poorer results overall.

    I know of no example where hand optimized code does worse on a newer CPU.
    A newer CPU with bigger OoOe will effectively unroll your code and schedule
    it even better.

    Not true:

    My favorite benchmark program for 20+ years was Word Count, I
    re-optimized that for every new x86 generation, and on the Pentium I got
    it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).

    When the PentiumPro came out, it did a 10-20 cycle stall for every pair
    of characters, so about an order of magnitude slower in cycle count.
    (But only about 3X clock time due to being 200 instead of 60 MHz.)

    But how big a slowdown did the unoptimized code get?

    The gcc-optimized unix wc was probably still slower than my glass
    jaw-hitting asm code: The issue was partial register stalls, where I had
    been using the relatively tricky concept of interleaving updates to the
    BL and BH halves of BX, then using BX to index into a table of combined
    word and line increments:

    add dx,ax
    mov ax,incr_table[bx]
    mov bl,extra_segment[di]
    mov di,[si+offset]

    followed by

    add dx,ax
    mov ax,incr_table[bx+16] ;; Transposed table interleaved at +16
    mov bh,extra_segment[di]
    mov di,[si+offset+2]

    All of the above unrolled 64 times so that the code would load & count
    256 characters with zero branches.



    Are you describing a glass jaw handling unpredictable branches on a CPU
    with a much longer pipeline?

    PRS (partial register stalls) were the single largest glass jaw on the
    PentiumPro, but they were very rare in compiled code.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Wed Sep 11 14:51:16 2024
    On Wed, 11 Sep 2024 13:07:33 +0200
    Terje Mathisen <[email protected]> wrote:

    Tim Rentsch wrote:
    Michael S <[email protected]> writes:

    On Sun, 08 Sep 2024 15:36:39 GMT
    [email protected] (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,


    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether
    there is an overlap of the memory areas. But then I remembered
    that you cannot write such a check in standard C without (in the
    general case) exercising undefined behaviour; and then the
    compiler could eliminate the check or do something else that's
    unexpected. Do you have such a check in mind that does not
    exercise undefined behaviour in the general case?

    The check that reliably catches all overlaps seems easy.
    E.g. (src <= dst) == (src+len > dst)

    Does that work for dst < src? What if dst+len < src?


    No, it doesn't. See the followup post.

  • From Anton Ertl@21:1/5 to Josh Vanderhoof on Wed Sep 11 10:38:24 2024
    Josh Vanderhoof <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    George Neuner <[email protected]> writes:
    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:
    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?
    ...
    It is legal to test for equality between pointers to different objects
    so you could test for overlap by testing against every element in the
    array. It seems like it should be possible for the compiler to figure
    out what's happening and optimize those tests away, but unfortunately
    no compiler I tested did it.

    That would be an interesting result of the ATUBDNH (assume that UB
    does not happen) lunacy: programmers
    would see themselves forced to write workarounds such as the one you
    suggest (with terrible performance when not optimized), and then C
    compiler maintainers would see themselves forced to optimize this kind
    of code. The end result would be that both parties have to put in
    more effort to eventually get the same result as if ordered comparison
    between different objects had been defined from the start.

    For now, the ATUBDNH advocates tell programmers that they have to work
    around the lack of definition, but there is usually no optimization
    for that.

    One case where things work somewhat along the lines you suggest is
    unaligned accesses. Traditionally, if one knows that the hardware
    supports unaligned accesses, for a 16-bit load one would write:

    #include <stdint.h>

    int16_t foo1(int16_t *p)
    {
        return *p;
    }

    If one does not know that the hardware supports unaligned accesses,
    the traditional way to perform such an access (little-endian) is
    something like:

    int16_t foo2(int16_t *p)
    {
        unsigned char *q = (unsigned char *)p;
        return (int16_t)(q[0] + (q[1] << 8));
    }

    Now, several years ago, somebody told me that the proper way is as
    follows:

    #include <string.h>

    int16_t foo3(int16_t *p)
    {
        int16_t v;
        memcpy(&v, p, 2);
        return v;
    }

    That way looked horribly inefficient to me, with v having to reside in
    memory instead of in a register and then the expensive function call,
    and all the decisions that memcpy() has to take depending on the
    length argument. But gcc optimizes this idiom into an unaligned load
    rather than taking all the steps that I expected (however, I have seen
    cases where the code produced on hardware that supports unaligned
    accesses is worse than that for foo1()). Of course, if you also want
    to support less sophisticated compilers, this idiom may be really slow
    on those, although not quite as expensive as your containment check.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Terje Mathisen on Wed Sep 11 15:35:00 2024
    On Wed, 11 Sep 2024 13:07:33 +0200
    Terje Mathisen <[email protected]> wrote:


    I do believe though that in reality it could be faster to use the
    branchy version, and let the branch predictors do their job instead
    of having to wait to evaluate all three terms:

    bool is_overlap(char *src, char *dst, size_t len)
    {
        if (src < dst) {
            return (src+len > dst);
        }
        return (dst+len > src);
    }

    Terje


    I think that under the assumptions that overlaps are very rare and that
    we have a wide OoO CPU, a one-branch solution would be faster than
    multiple branches.
    Assuming Windows x64 calling conventions (dst==RCX, src==RDX, len==R8)
    and using the algorithm that I posted last night:

    lea rax, [rcx,r8] ; rax = dst+len
    lea r9, [rdx,r8] ; r9 = src+len
    cmp rdx, rax
    setb al ; al = src < dst+len
    cmp rcx, r9
    setb r9b ; r9b = dst < src+len
    cmp al, r9b
    je handle_overlap
    ; there is no overlap

    The important observation here is that as long as the branch predictor
    correctly predicts that the branch is not taken, none of the preceding
    calculations are on the critical latency path. So the fact that the 7
    instructions before the branch have a latency of ~4 clocks does not
    matter.

    On the other hand, in your branchy variant the second branch is easy
    to predict, but the first branch, if (src < dst), is not necessarily easy.

  • From Tim Rentsch@21:1/5 to Michael S on Wed Sep 11 08:08:30 2024
    Michael S <[email protected]> writes:

    On Tue, 10 Sep 2024 07:37:59 -0700
    Tim Rentsch <[email protected]> wrote:

    I should add that I appreciate your proposed solution; it's
    better than what I think I would have come up with under a
    similar set of assumptions.

    Unfortunately, my solution is wrong and the mistake is not even subtle.

    Oh that's okay, I appreciate it nonetheless. :)

  • From Tim Rentsch@21:1/5 to Tim Rentsch on Wed Sep 11 08:07:34 2024
    Tim Rentsch <[email protected]> writes:

    Michael S <[email protected]> writes:

    On Sun, 08 Sep 2024 15:36:39 GMT
    [email protected] (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will work
    on platforms one doesn't have, but there is a relatively simple
    and portable way to tell if some memcpy() call crosses over into
    the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas. But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected. Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?

    The check that reliably catches all overlaps seems easy.
    E.g. (src <= dst) == (src+len > dst)

    In theory, on an unusual hardware platform it can give false positives.
    Maybe, for the task at hand, that's OK.

    The challenge is to find portable C that doesn't enter the arena
    of undefined behavior (and also detects exactly those cases where
    overlap occurs), and that is quite a stringent criterion.

    The comparison shown works if src and dst both point to elements
    of the same array. [...]

    Sorry, that statement isn't right. I accepted the stated test as
    being accurate without checking it myself. My bad. :)

  • From Tim Rentsch@21:1/5 to Michael S on Wed Sep 11 09:08:12 2024
    Michael S <[email protected]> writes:

    On Tue, 10 Sep 2024 22:27:02 +0300
    Michael S <[email protected]> wrote:

    On Tue, 10 Sep 2024 07:37:59 -0700
    Tim Rentsch <[email protected]> wrote:

    I should add that I appreciate your proposed solution; it's
    better than what I think I would have come up with under a
    similar set of assumptions.

    Unfortunately, my solution is wrong and the mistake is not even subtle.

    This one appears to work: (src < dst+len) == (dst < src+len)

    Yes. This time I checked it. :)

  • From Tim Rentsch@21:1/5 to All on Wed Sep 11 09:29:04 2024
    Josh Vanderhoof <[email protected]> writes:

    [how to write a portable, UB-free check if memcpy() intervals overlap]

    It is legal to test for equality between pointers to different objects

    Right. This observation is the key insight.

    so you could test for overlap by testing against every element in the
    array.

    For a complete test, compare the address of every element in both
    arrays. For example:

    #include <stddef.h>

    _Bool
    memcpy_intervals_overlap( void *const vd, void *const vs, size_t n ){
        char *d = vd, *s = vs;
        size_t k = 0;

        while( k < n && d != vs && s != vd ) k++, d++, s++;

        return k < n;
    }

  • From Michael S@21:1/5 to Tim Rentsch on Wed Sep 11 19:52:21 2024
    On Wed, 11 Sep 2024 09:29:04 -0700
    Tim Rentsch <[email protected]> wrote:

    Josh Vanderhoof <[email protected]> writes:

    [how to write a portable, UB-free check if memcpy() intervals overlap]

    It is legal to test for equality between pointers to different
    objects

    Right. This observation is the key insight.


    Real mode x86 C compilers operating in Large and Compact Models that
    were popular on IBM-compatible PCs 30-40 years ago could have more than
    one representation for the pointer to the same memory location. If my
    memory serves me, the rules of pointer comparison for equality were
    the same as the rules of comparison for < and >. In both cases, for a
    reliable result, pointers had to be explicitly normalized (i.e. converted from
    'far' to 'huge' or something like that).

    It was a long time ago and even back then I didn't use the Large model very
    often, so it's possible that I misremember. But if I remember
    correctly, does it mean that those C compilers now would be considered non-compliant?

  • From Tim Rentsch@21:1/5 to Michael S on Wed Sep 11 17:34:38 2024
    Michael S <[email protected]> writes:

    On Wed, 11 Sep 2024 09:29:04 -0700
    Tim Rentsch <[email protected]> wrote:

    Josh Vanderhoof <[email protected]> writes:

    [how to write a portable, UB-free check if memcpy() intervals overlap]

    It is legal to test for equality between pointers to different
    objects

    Right. This observation is the key insight.

    Real mode x86 C compilers operating in Large and Compact Models that
    were popular on IBM-compatible PCs 30-40 years ago could have more than
    one representation for the pointer to the same memory location. If my
    memory serves me, the rules of pointer comparison for equality were
    the same as the rules of comparison for < and >. In both cases, for a
    reliable result, pointers had to be explicitly normalized (i.e. converted from
    'far' to 'huge' or something like that).

    It was a long time ago and even back then I didn't use the Large model very
    often, so it's possible that I misremember. But if I remember
    correctly, does it mean that those C compilers now would be considered non-compliant?

    The C standard was first ratified (by ANSI) in 1989. The rules
    for pointer comparison were clarified in the C99 standard, but it
    has always been true that pointers to the same object have to
    compare equal.

    C environments that have things like 'far' or 'huge' pointers,
    etc, are not standard C but must have extensions so that they can
    deal with the different kinds of pointers. Depending on how the
    non-standard kinds of pointer worked, the implementation might or
    might not be conforming. Most likely though it's a moot point
    because once a program starts using an extension all the rules
    can change, and the C standard allows that. It's only programs
    that look like really standard C that have to do what the C
    standard says (for the implementation to be conforming); any
    code that declares a 'far' pointer or 'huge' pointer certainly
    isn't standard C.

  • From Tim Rentsch@21:1/5 to BGB on Thu Sep 12 03:12:11 2024
    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros for
    various things:
      Endianness (macros exist, typically compiler specific);
        And, apparently GCC and Clang can't agree on which strategy to use.
      Whether or not the target/compiler allows misaligned memory access;
        If set, one may use misaligned access.
      Whether or not memory uses a single address space;
        If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.

  • From Tim Rentsch@21:1/5 to George Neuner on Thu Sep 12 04:04:06 2024
    George Neuner <[email protected]> writes:

    On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
    <[email protected]> wrote:

    On Mon, 09 Sep 2024 23:27:24 -0400
    George Neuner <[email protected]> wrote:

    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will
    work on platforms one doesn't have, but there is a relatively
    simple and portable way to tell if some memcpy() call crosses
    over into the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether
    there is an overlap of the memory areas. But then I remembered
    that you cannot write such a check in standard C without (in the
    general case) exercising undefined behaviour; and then the
    compiler could eliminate the check or do something else that's
    unexpected. Do you have such a check in mind that does not
    exercise undefined behaviour in the general case?

    The result of comparing pointers to two elements of the same array
    is defined. Cast to (char*), both src and dst can be considered
    to point to elements of the [address space sized] char array at
    address zero.

    According to my understanding, your 'can be considered' part is not
    codified in the C Standard.

    Adding size_t to a pointer yields another pointer of the same
    type.

    In terms of types, that is right, but the addition works only if
    the pointer points into an array large enough to include the
    result of the addition (the result is also allowed to be just one
    past the end of the array).

    All of gcc, clang and MSVC seem happy with this.

    It works. But is it guaranteed to work in the future by some sort
    of document? I am pretty sure that no such guarantee exists in gcc
    and MSVC docs. I did not look in clang docs. Trying to find
    anything in LLVM/clang docs makes me sad.

    I know that it has worked as expected with every version of gcc
    and Microsoft I've used since 1988. [clang I don't use, but I
    tried it on godbolt.org with the most recent version]

    Will it continue to work ... who knows?


    I definitely am NOT an expert on the C standard, but thinking
    about it, it occurred to me that if an array is explicitly defined
    that *might* cover all memory (or at least all heap), then the
    compiler would have to honor any apparent pointers into it.

    E.g., char (*all_memory)[] = 0;

    This declaration introduces a pointer, not an array. Similarly
    the declaration

    char (*great_white_array)[ 999999999999999999 ] = 0;

    does not introduce an array but just a pointer (and initializes
    the pointer to be a null pointer). There is no humongous array.

    None of the compilers at godbolt seem to need this to compare
    arbitrary addresses as char*, but all accept it.

    The given declaration of 'all_memory' is strictly conforming.
    It must be accepted by any conforming C implementation (which
    all of gcc, clang, and MSVC purport to be, IIUC).

    Obviously speculation, but it's the best I have.

    It's important to realize that there are two distinct questions.
    One, does the code work (in a given implementation)? Two, does
    the code satisfy the rules given in the C standard?

    Unfortunately having an answer to the first question does not by
    itself give enough information to answer the second question.

  • From Michael S@21:1/5 to Tim Rentsch on Thu Sep 12 14:10:33 2024
    On Wed, 11 Sep 2024 17:34:38 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Wed, 11 Sep 2024 09:29:04 -0700
    Tim Rentsch <[email protected]> wrote:

    Josh Vanderhoof <[email protected]> writes:

    [how to write a portable, UB-free check if memcpy() intervals
    overlap]
    It is legal to test for equality between pointers to different
    objects

    Right. This observation is the key insight.

    Real mode x86 C compilers operating in Large and Compact Models that
    were popular on IBM-compatible PCs 30-40 years ago could have more
    than one representation for the pointer to the same memory
    location. If my memory serves me, the rules of pointer comparison
    for equality were the same as the rules of comparison for < and >. In
    both cases, for a reliable result, pointers had to be explicitly normalized
    (i.e. converted from 'far' to 'huge' or something like that).

    It was a long time ago and even back then I didn't use the Large model
    very often, so it's possible that I misremember. But if I remember correctly, does it mean that those C compilers now would be
    considered non-compliant?

    The C standard was first ratified (by ANSI) in 1989. The rules
    for pointer comparison were clarified in the C99 standard, but it
    has always been true that pointers to the same object have to
    compare equal.

    C environments that have things like 'far' or 'huge' pointers,
    etc, are not standard C but must have extensions so that they can
    deal with the different kinds of pointers. Depending on how the
    non-standard kinds of pointer worked, the implementation might or
    might not be conforming. Most likely though it's a moot point
    because once a program starts using an extension all the rules
    can change, and the C standard allows that. It's only programs
    that look like really standard C that have to do what the C
    standard says (for the implementation to be conforming); any
    code that declares a 'far' pointer or 'huge' pointer certainly
    isn't standard C.

    In Compact and Large models data pointers are 'far' by default. So,
    the source doesn't have to use non-standard declarations.

  • From Michael S@21:1/5 to Tim Rentsch on Thu Sep 12 14:29:48 2024
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros for
    various things:
      Endianness (macros exist, typically compiler specific);
        And, apparently GCC and Clang can't agree on which strategy to use.
      Whether or not the target/compiler allows misaligned memory access;
        If set, one may use misaligned access.
      Whether or not memory uses a single address space;
        If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.

    Why not?
    I don't see a practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.
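For comparison, the byte-level intent behind bar.x[8] = 42 can already be written with defined behaviour by going through the object representation of the whole struct with an unsigned char pointer, which may address any byte inside the object. A minimal sketch, assuming a named tag (struct foo is hypothetical; the post's struct is anonymous). Which member or padding byte gets written, and what y then reads as, remains layout- and endianness-dependent.

```c
#include <stddef.h>

struct foo {          /* hypothetical tag for the struct in the post */
    char x[8];
    int y;
};

/* Store 'value' into the byte that immediately follows x[7] within the
   struct, i.e. the byte bar.x[8] would address if bounds were ignored.
   Accessing any byte of an object through unsigned char * is allowed,
   as long as the byte lies inside the object, hence the size check. */
static void poke_after_x(struct foo *p, unsigned char value) {
    size_t pos = offsetof(struct foo, x) + sizeof p->x;
    if (pos < sizeof *p) {
        unsigned char *bytes = (unsigned char *)p;
        bytes[pos] = value;
    }
}
```

On common ABIs where int is 4-byte aligned, y starts at offset 8, so this writes one byte of y; whether y then reads as 42 or as 42 << 24 depends on byte order.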

  • From Tim Rentsch@21:1/5 to Anton Ertl on Thu Sep 12 06:06:46 2024
    [email protected] (Anton Ertl) writes:

    [considering which way to copy with memmove()]

    If the two memory blocks don't overlap, memmove() can use the
    fastest stride. [...]

    The way to go for memmove() is:

    On hardware where positive stride is faster:

    if (((uintptr)(dest-src)) >= len)
    return memcpy_posstride(dest,src,len)
    else
    return memcpy_negstride(dest,src,len)

    On hardware where the negative stride is faster:

    if (((uintptr)(src-dest)) >= len)
    return memcpy_negstride(dest,src,len)
    else
    return memcpy_posstride(dest,src,len)

    And I expect that my test is undefined behaviour, but most people
    except the UB advocates should understand what I mean.

    Code inside the implementation is allowed to exploit internal
    knowledge.

    The benefit of this comparison over just comparing the addresses
    is that the branch will have a much lower miss rate.

    It's a clever idea. It suffers from a few shortcomings.

    First, the type name is uintptr_t. Also, uintptr_t might not
    exist.

    Second, uintptr_t might be small, leading to incorrect behavior
    in some cases. Better to use a large unsigned type that is
    known to exist, either unsigned long long or uintmax_t.

    Third, pointer subtraction is not guaranteed to work for large
    differences because ptrdiff_t might not be big enough. This is
    just a technicality because presumably the implementation would
    know how big ptrdiff_t is and wouldn't use this approach if it
    were too small. That said, it's something to keep in mind if the
    code is meant to be used on other systems.

    Last but not least, having two different code blocks for the
    different preferences is clunky. The two blocks can be
    combined by fusing the two test expressions into a single
    expression, as for example

    #ifndef PREFER_UPWARDS
    #define PREFER_UPWARDS 1
    #endif/*PREFER_UPWARDS*/

    extern void* ascending_copy( void*, const void*, size_t );
    extern void* descending_copy( void*, const void*, size_t );

    void *
    good_memmove( void *vd, const void *vs, size_t n ){
    const char *d = vd;
    const char *s = vs;
    _Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;

    return
    upwards
    ? ascending_copy( vd, vs, n )
    : descending_copy( vd, vs, n );
    }

    Using the preprocessor symbol PREFER_UPWARDS to select between
    the two preferences (ascending or descending) allows the choice
    to be made by a -D compiler option, and we can expect the compiler
    to optimize away the part of the test that is never used.
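A self-contained variant that can be compiled and tested: it converts the pointers to uintptr_t instead of subtracting them, which sidesteps the ptrdiff_t problem mentioned above but relies on implementation-defined behaviour (a flat address space where the pointer-to-integer conversion preserves address order). The names copy_up, copy_down and sketch_memmove are hypothetical, and the byte loops are placeholders; a real library would use wide copies.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from low addresses to high: safe when dest is below src
   or the blocks do not overlap. */
static void *copy_up(void *vd, const void *vs, size_t n) {
    unsigned char *d = vd;
    const unsigned char *s = vs;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return vd;
}

/* Copy n bytes from high addresses to low: safe when dest is above src. */
static void *copy_down(void *vd, const void *vs, size_t n) {
    unsigned char *d = vd;
    const unsigned char *s = vs;
    for (size_t i = n; i > 0; i--)
        d[i - 1] = s[i - 1];
    return vd;
}

/* Assumes (uintptr_t) yields ordered linear addresses. If dest < src the
   unsigned difference wraps to a huge value, so the ascending copy is
   chosen; if dest is at or above src but within n bytes, the descending
   copy is chosen (dest == src with n > 0 also goes down, harmlessly). */
static void *sketch_memmove(void *vd, const void *vs, size_t n) {
    uintptr_t d = (uintptr_t)vd;
    uintptr_t s = (uintptr_t)vs;
    if (d - s >= n)
        return copy_up(vd, vs, n);
    return copy_down(vd, vs, n);
}
```

The single unsigned comparison is Anton's test; only the integer conversion differs.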

  • From Tim Rentsch@21:1/5 to Anton Ertl on Thu Sep 12 06:17:24 2024
    [email protected] (Anton Ertl) writes:

    Josh Vanderhoof <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    George Neuner <[email protected]> writes:

    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:

    1) At first I thought that yes, one could just check whether
    there is an overlap of the memory areas. But then I remembered
    that you cannot write such a check in standard C without (in the
    general case) exercising undefined behaviour; and then the
    compiler could eliminate the check or do something else that's
    unexpected. Do you have such a check in mind that does not
    exercise undefined behaviour in the general case?

    ...

    It is legal to test for equality between pointers to different
    objects so you could test for overlap by testing against every
    element in the array. It seems like it should be possible for the
    compiler to figure out what's happening and optimize those tests
    away, but unfortunately no compiler I tested did it.

    That would be an interesting result of the ATUBDNH lunacy:
    programmers would see themselves forced to write workarounds such
    as the one you suggest (with terrible performance when not
    optimized), and then C compiler maintainers would see themselves
    forced to optimize this kind of code. The end result would be
    that both parties have to put in more effort to eventually get the
    same result as if ordered comparison between different objects had
    been defined from the start.

    For now, the ATUBDNH advocates tell programmers that they have to
    work around the lack of definition, but there is usually no
    optimization for that.

    This reaction doesn't fit the case here. The C standard already
    provides a way to do what is needed, namely memmove(). The code
    being discussed in this thread is relevant only because someone
    (may have) wrongly used memcpy() rather than memmove(). As has
    been pointed out, all of the worries around this problem can be
    avoided by simply using memmove() rather than memcpy().
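Josh's equality-only approach can be sketched as follows. It uses only == between pointers, which is defined even for unrelated objects, and relies on the fact that two non-empty intervals overlap exactly when the start of one lies inside the other. It costs O(na + nb) comparisons, which is the performance problem Anton describes; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if [a, a+na) and [b, b+nb) share at least one byte.
   Only pointer equality is used, and every pointer formed stays
   inside its own buffer, so no relational comparison between
   unrelated objects is needed. */
static bool overlaps_eq_only(const void *a, size_t na,
                             const void *b, size_t nb) {
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    if (na == 0 || nb == 0)
        return false;
    for (size_t i = 0; i < na; i++)   /* does b start inside a? */
        if (pa + i == pb)
            return true;
    for (size_t i = 0; i < nb; i++)   /* does a start inside b? */
        if (pb + i == pa)
            return true;
    return false;
}
```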

  • From David Brown@21:1/5 to BGB on Thu Sep 12 16:18:45 2024
    On 11/09/2024 20:51, BGB wrote:
    On 9/11/2024 5:38 AM, Anton Ertl wrote:
    Josh Vanderhoof <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    George Neuner <[email protected]> writes:
    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:
    1) At first I thought that yes, one could just check whether there is
    an overlap of the memory areas.  But then I remembered that you cannot
    write such a check in standard C without (in the general case)
    exercising undefined behaviour; and then the compiler could eliminate
    the check or do something else that's unexpected.  Do you have such a
    check in mind that does not exercise undefined behaviour in the
    general case?
    ...
    It is legal to test for equality between pointers to different objects
    so you could test for overlap by testing against every element in the
    array.  It seems like it should be possible for the compiler to figure
    out what's happening and optimize those tests away, but unfortunately
    no compiler I tested did it.

    That would be an interesting result of the ATUBDNH lunacy: programmers
    would see themselves forced to write workarounds such as the one you
    suggest (with terrible performance when not optimized), and then C
    compiler maintainers would see themselves forced to optimize this kind
    of code.  The end result would be that both parties have to put in
    more effort to eventually get the same result as if ordered comparison
    between different objects had been defined from the start.

    For now, the ATUBDNH advocates tell programmers that they have to work
    around the lack of definition, but there is usually no optimization
    for that.

    One case where things work somewhat along the lines you suggest is
    unaligned accesses.  Traditionally, if knowing that the hardware
    supports unaligned accesses, for a 16-bit load one would write:

    int16_t foo1(int16_t *p)
    {
       return *p;
    }

    If one does not know that the hardware supports unaligned accesses,
    the traditional way to perform such an access (little-endian) is
    something like:

    int16_t foo2(int16_t *p)
    {
       unsignedchar *q = p;
       return (int16_t)(q[0] + (q[1]>>8));
    }

    Correcting the typos (in case anyone wants to copy-and-paste to
    godbolt.org for testing):


    int16_t foo2(int16_t *p)
    {
    unsigned char *q = (unsigned char *) p;
    return (int16_t)(q[0] + (q[1] << 8));
    }
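If the goal is to read a value with a known byte order from a byte buffer (rather than to mimic the host's native load), the same shift-and-or pattern can be written for either byte order, independent of host endianness. A minimal sketch with hypothetical function names; assembling the result as uint16_t avoids the implementation-defined conversion of an out-of-range value to int16_t.

```c
#include <stdint.h>

/* Read a 16-bit little-endian value from p[0], p[1]. */
static uint16_t load_le16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 16-bit big-endian value from p[0], p[1]. */
static uint16_t load_be16(const unsigned char *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

Both give the same result on any host; gcc and clang often recognise the pattern and emit a single load (plus a byte swap for the non-native order), but that is an optimisation, not a guarantee.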


    Now, several years ago, somebody told me that the proper way is as
    follows:

    int16_t foo3(int16_t *p)
    {
        int16_t v;
        memcpy(&v,p,2);
        return v;
    }

    That way looked horribly inefficient to me, with v having to reside in
    memory instead of in a register and then the expensive function call,
    and all the decisions that memcpy() has to take depending on the
    length argument.  But gcc optimizes this idiom into an unaligned load
    rather than taking all the steps that I expected (however, I have seen
    cases where the code produced on hardware that supports unaligned
    accesses is worse than that for foo1()).  Of course, if you also want
    to support less sophisticated compilers, this idiom may be really slow
    on those, although not quite as expensive as your containment check.



    It is an unfortunate truth that code that is correct can be inefficient
    on some compilers, while code that is efficient on those compilers is
    not correct (according to the C standards) and can fail on other
    compilers. I may be an "ATUBDNH advocate", but I can certainly
    acknowledge that much. The C standard is concerned with the behaviour
    of the code, not its efficiency, and it has always been a fact of life
    for C programmers that different compilers give better or worse results
    for different ways of writing source code. Not all code can be written
    portably /and/ efficiently without at least some conditional
    compilation.

    foo1() is defined behaviour if and only if the pointer is correctly
    aligned. For a stand-alone function,

    foo2() above is perfectly correct C and has fully defined behaviour
    (with the obvious assumptions that CHAR_BIT is 8 and that int16_t
    exists), but only gives the correct results for little-endian systems.

    foo3() is correct regardless of the endianness (with the same
    assumptions about the targets), but efficiency can vary.

    Testing these on godbolt.org with gcc and MSVC shows that both optimise
    the memcpy() into a single 16-bit load. MSVC does not recognize the
    pattern in foo2() and generates poor code for it (it even uses an
    "imul" instruction!).


    Another alternative is:

    int16_t foo1v(int16_t *p)
    {
    volatile int16_t * q = p;
    return *q;
    }

    The C standard does not say exactly what this will do, but you can
    expect the compiler to generate the load, even if it knows "p" is
    misaligned, and even if it knows the target does not support
    misaligned accesses. Of course, this has implications for
    optimisations, as the compiler can't re-order such loads.


    Would be nice, say, if there were semi-standard compiler macros for
    various things:

    Ask, and you shall receive! (Well, sometimes you might receive.)

      Endianness (macros exist, typically compiler specific);
        And, apparently GCC and Clang can't agree on which strategy to use.

    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    ...
    #elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    ...
    #else
    ...
    #endif

    Works in gcc, clang and MSVC.


    And C23 has the <stdbit.h> header with many convenient little "bit and
    byte" utilities, including endian detection:

    #include <stdbit.h>
    #if __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_LITTLE__
    ...
    #elif __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_BIG__
    ...
    #else
    ...
    #endif
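Where neither set of macros is available, a runtime probe is a portable fallback, and optimising compilers generally fold it to a constant. A minimal sketch with a hypothetical function name; it reports only whether the lowest-addressed byte of a uint32_t holds its least significant byte, so a PDP-style mixed order simply reports "not little-endian".

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the lowest-addressed byte of a uint32_t holds its
   least significant byte (little-endian), 0 otherwise. */
static int host_is_little_endian(void) {
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```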


      Whether or not the target/compiler allows misaligned memory access;
        If set, one may use misaligned access.

    Why would you need that? Any decent compiler will know what is allowed
    for the target (perhaps partly on the basis of compiler flags), and will generate the best allowed code for accesses like foo3() above.

      Whether or not memory uses a single address space;
        If set, all pointer comparisons are allowed.

    Pointer comparisons are always allowed for equality tests if they are
    pointers to objects of compatible types. (Function pointers can also
    be compared for equality, but not with the relational operators.)

    For other relational tests, the pointers must point to sub-objects of
    the same aggregate object. (That means they can't be null pointers,
    misaligned pointers, invalid pointers or pointers going nowhere.) This
    is independent of how the address space(s) are organised on the target
    machine.

    What you /can/ do, on pretty much any implementation with a single
    linear address space, is convert pointers to uintptr_t and then compare
    them. There may be some targets for which there is no uintptr_t, or
    where the mapping from pointer to integer does not match with the
    address, but that would be very unusual.
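A minimal sketch of that uintptr_t comparison, assuming uintptr_t exists and that the pointer-to-integer mapping preserves address order, which is implementation-defined rather than guaranteed by the standard. The helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Orders two arbitrary object pointers by address. The conversion to
   uintptr_t is implementation-defined; on flat-address-space targets it
   normally yields the numeric address, so this gives a total order
   even for pointers into different objects. */
static bool addr_below(const void *a, const void *b) {
    return (uintptr_t)a < (uintptr_t)b;
}
```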

    I can't think of when you would need to do such comparisons, however,
    other than to implement memmove - and library functions can use any
    kind of implementation-specific feature they like.



    Clang:
      __LITTLE_ENDIAN__, __BIG_ENDIAN__
      One or the other is defined based on endian.
    GCC:
      __BYTE_ORDER__ which may equal one of:
        __ORDER_LITTLE_ENDIAN__
        __ORDER_BIG_ENDIAN__
        __ORDER_PDP_ENDIAN__
    MSVC:
      REG_DWORD is one of:
        REG_DWORD_LITTLE_ENDIAN
        REG_DWORD_BIG_ENDIAN

    GCC:
      __SIZEOF_type__  //gives sizeof various types


    See above.


    Possible:
      __MINALIGN_type__  //minimum allowed alignment for type

    _Alignof(type) has been around since C11.
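For example (C11; C23 also spells it alignof as a keyword, and in C11 <stdalign.h> provides that spelling as a macro). The struct is a made-up example and the actual alignment values are ABI-specific, so the compile-time checks only cover properties the standard guarantees: char has alignment 1 and every alignment is a power of two.

```c
#include <stddef.h>

struct sample {            /* hypothetical type, just to have something to query */
    char c;
    double d;
};

_Static_assert(_Alignof(char) == 1, "char alignment is always 1");
_Static_assert((_Alignof(struct sample) & (_Alignof(struct sample) - 1)) == 0,
               "alignments are powers of two");

static size_t align_of_sample(void) {
    return _Alignof(struct sample);
}
```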


    Maybe also alias pointer control:
      __POINTER_ALIAS__
        __POINTER_ALIAS_CONSERVATIVE__
        __POINTER_ALIAS_STRICT__

    Where, pointer alias can be declared, and:
      If conservative, then conservative semantics are being used.
        Pointers may be freely cast without concern for pointer aliasing.
        Compiler will assume that "non restrict" pointer stores may alias.
      If strict, the compiler is using TBAA semantics.
        Compiler may assume that aliasing is based on pointer types.


    Faffing around with pointer types - breaking the "effective type" rules
    - has been a bad idea and risky behaviour since C was standardised. You
    never need to do it. (I accept, however, that on some weaker or older compilers "doing the right thing" can be noticeably less efficient than
    writing bad code.) Just get a half-decent compiler and use memcpy().
    For any situation where you might think casting pointer types would be a
    good idea, your sizes are small and known at compile time, so they are
    easy for the compiler to optimise.
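A typical instance is inspecting the bits of a float. A minimal sketch, assuming a 32-bit IEEE 754 float (only the size is checked at compile time); the helper name is hypothetical. gcc and clang normally compile the memcpy() to a plain register move.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "assumes 32-bit float");

/* Instead of *(uint32_t *)&f, which breaks the effective-type rules,
   copy the object representation into an integer. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```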

    If you /must/ do such casts, or you are dealing with questionable
    quality code that uses them, at least add this to your code:

    #ifdef __GNUC__
    #pragma GCC optimize("-fno-strict-aliasing")
    #endif

    It won't make the code correct if you are using a compiler other than
    gcc or clang, but it's a help.

    And as a general rule, if you feel you really want to break the rules of
    C and still get something useful out at the end, use "volatile" liberally.

  • From Tim Rentsch@21:1/5 to Terje Mathisen on Thu Sep 12 07:33:13 2024
    Terje Mathisen <[email protected]> writes:

    [how to detect interval overlap]

    What about:

    max(src,dst) < (min(src,dst)+len)

    If you have a min/max circuit, i.e. a two-element sorter, then it
    could be quite efficient, otherwise run the min first, then the
    max and the add during the second cycle, before the less than test
    in the third cycle.

    [...]

    I do believe though that in reality it could be faster to use the
    branchy version, and let the branch predictors do their job
    instead of having to wait to evaluate all three terms:

    bool is_overlap(char *src, char *dst, size_t len)
    {
    if (src < dst) {
    return (src+len > dst);
    }
    return (dst+len > src);
    }

    Note that there are two distinct problems that are relevant to
    the discussion: is there any overlap, and is there overlap of
    the wrong kind. The question "Is there any overlap?" can be
    answered with a simple comparison if there is a non-branching abs()
    function available (assuming a flat linear address space):

    if( abs( source - destination ) < n ) ...

    The question "Is there overlap of the wrong kind?", which is what
    memmove would want to ask, can be answered with a single
    comparison if the bad direction is known in advance, and fixed.
    An example is given by Anton, and a revision of that in my
    recent response to his posting.
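Both questions can be sketched on uintptr_t values, assuming a flat address space where the integer conversion preserves address order (implementation-defined) and two blocks of the same length n. The function names are hypothetical; the conditional in any_overlap is the abs() step, which compilers can usually make branch-free.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* "Is there any overlap?": two n-byte blocks overlap iff |a - b| < n. */
static bool any_overlap(const void *a, const void *b, size_t n) {
    uintptr_t x = (uintptr_t)a;
    uintptr_t y = (uintptr_t)b;
    uintptr_t dist = x > y ? x - y : y - x;   /* address distance, i.e. abs() */
    return n != 0 && dist < n;
}

/* "Is there overlap of the wrong kind?" for an ascending copy: a single
   unsigned compare. Conservative: also true when dst == src and n > 0. */
static bool ascending_is_unsafe(const void *dst, const void *src, size_t n) {
    return (uintptr_t)dst - (uintptr_t)src < n;
}
```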

  • From David Brown@21:1/5 to Michael S on Thu Sep 12 16:34:31 2024
    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros for
    various things:
      Endianness (macros exist, typically compiler specific);
        And, apparently GCC and Clang can't agree on which strategy to use.
      Whether or not the target/compiler allows misaligned memory access;
        If set, one may use misaligned access.
      Whether or not memory uses a single address space;
        If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.


    I fully agree that C is not, and should not be seen as, a "high-level
    assembly language". But it is a language that is very useful to
    "hardware-type folks", and there are a few things that could make it
    easier to write more portable code if they were standardised. As it is,
    we just have to accept that some things are not portable.

    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8]
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.


    And how should that be defined? And what is its "practical" definition?
    My preference would be a hard compile-time error, but specifying that
    in the standards would force compilers to do more analysis and checking
    than the standards can reasonably enforce.

    clang can warn on this - I am disappointed to see that gcc does not.

  • From Anton Ertl@21:1/5 to Tim Rentsch on Thu Sep 12 14:20:42 2024
    Tim Rentsch <[email protected]> writes:
    [email protected] (Anton Ertl) writes:

    [considering which way to copy with memmove()]

    If the two memory blocks don't overlap, memmove() can use the
    fastest stride. [...]

    The way to go for memmove() is:

    On hardware where positive stride is faster:

    if (((uintptr)(dest-src)) >= len)
    return memcpy_posstride(dest,src,len)
    else
    return memcpy_negstride(dest,src,len)

    On hardware where the negative stride is faster:

    if (((uintptr)(src-dest)) >= len)
    return memcpy_negstride(dest,src,len)
    else
    return memcpy_posstride(dest,src,len)

    And I expect that my test is undefined behaviour, but most people
    except the UB advocates should understand what I mean.
    ...
    Last but not least, having two different code blocks for the
    different preferences is clunky. The two blocks can be
    combined by fusing the two test expressions into a single
    expression, as for example

    #ifndef PREFER_UPWARDS
    #define PREFER_UPWARDS 1
    #endif/*PREFER_UPWARDS*/

    extern void* ascending_copy( void*, const void*, size_t );
    extern void* descending_copy( void*, const void*, size_t );

    void *
    good_memmove( void *vd, const void *vs, size_t n ){
    const char *d = vd;
    const char *s = vs;
    _Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;

    return
    upwards
    ? ascending_copy( vd, vs, n )
    : descending_copy( vd, vs, n );
    }

    Using the preprocessor symbol PREFER_UPWARDS to select between
    the two preferences (ascending or descending) allows the choice
    to be made by a -D compiler option, and we can expect the compiler
    to optimize away the part of the test that is never used.

    That's clever, but for usage in glibc or the like the clunky version
    is the preferred one: memmove() is usually called through the dynamic
    linking mechanism, and which implementation is actually called is
    selected based on the hardware that it runs on (what does it do when
    the program is linked statically?). There seem to be quite a few
    memmove() (and __memmove_chk()) implementations in glibc-2.36 on
    AMD64:

    __memmove_chk
    __memmove_sse2_unaligned_erms
    __memmove_chk
    __memmove_chk_erms
    __memmove_chk_evex_unaligned
    __memmove_chk_avx_unaligned
    __memmove_chk_ssse3
    __memmove_chk_sse2_unaligned
    __memmove_erms
    __memmove_avx512_unaligned
    __memmove_evex_unaligned
    __memmove_evex_unaligned_erms
    __memmove_avx_unaligned
    __memmove_avx_unaligned_erms
    __memmove_avx_unaligned_rtm
    __memmove_ssse3
    __memmove_sse2_unaligned
    __memmove_chk_sse2_unaligned_erms
    __memmove_chk_avx512_no_vzeroupper
    __memmove_chk_avx512_unaligned
    __memmove_chk_avx512_unaligned_erms
    __memmove_chk_evex_unaligned_erms
    __memmove_chk_avx_unaligned_erms
    __memmove_chk_avx_unaligned_rtm
    __memmove_chk_avx_unaligned_erms_rtm
    __memmove_avx512_no_vzeroupper
    __memmove_avx512_unaligned_erms
    __memmove_avx_unaligned_erms_rtm

    From what I read, __memmove_chk() (which has an additional destlen
    parameter) is apparently not intended to be called explicitly from the
    source code, so I guess that some compilers generate calls to it.
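As far as I know, these calls usually come from glibc's _FORTIFY_SOURCE headers, which pass the compiler-computed size of the destination object (from __builtin_object_size) as the extra destlen argument. The semantics amount to a bounds check in front of memmove(). A minimal sketch with a hypothetical name, not glibc's code (glibc reports the overflow through its own __chk_fail() path rather than a plain abort()).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a fortified memmove: destlen is the size of
   the destination object as known to the compiler, or (size_t)-1 when
   unknown, in which case the check can never fire. */
static void *checked_memmove(void *dest, const void *src, size_t len,
                             size_t destlen) {
    if (len > destlen)
        abort();               /* would overflow dest: fail hard */
    return memmove(dest, src, len);
}
```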

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Tim Rentsch@21:1/5 to Anton Ertl on Thu Sep 12 08:03:52 2024
    [email protected] (Anton Ertl) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    [considering which way to copy with memmove()]

    If the two memory blocks don't overlap, memmove() can use the
    fastest stride. [...]

    The way to go for memmove() is:

    On hardware where positive stride is faster:

    if (((uintptr)(dest-src)) >= len)
    return memcpy_posstride(dest,src,len)
    else
    return memcpy_negstride(dest,src,len)

    On hardware where the negative stride is faster:

    if (((uintptr)(src-dest)) >= len)
    return memcpy_negstride(dest,src,len)
    else
    return memcpy_posstride(dest,src,len)

    And I expect that my test is undefined behaviour, but most people
    except the UB advocates should understand what I mean.

    ...

    Last but not least, having two different code blocks for the
    different preferences is clunky. The two blocks can be
    combined by fusing the two test expressions into a single
    expression, as for example

    #ifndef PREFER_UPWARDS
    #define PREFER_UPWARDS 1
    #endif/*PREFER_UPWARDS*/

    extern void* ascending_copy( void*, const void*, size_t );
    extern void* descending_copy( void*, const void*, size_t );

    void *
    good_memmove( void *vd, const void *vs, size_t n ){
    const char *d = vd;
    const char *s = vs;
    _Bool upwards = PREFER_UPWARDS ? d-s +0ull >= n : s-d +0ull < n;
    return
    upwards
    ? ascending_copy( vd, vs, n )
    : descending_copy( vd, vs, n );
    }

    Using the preprocessor symbol PREFER_UPWARDS to select between
    the two preferences (ascending or descending) allows the choice
    to be made by a -D compiler option, and we can expect the compiler
    to optimize away the part of the test that is never used.

    That's clever, but for usage in glibc or the like the clunky version
    is the preferred one: [elaboration]

    That's irrelevant to the point I was making. People working
    inside an implementation can take advantage of knowledge unknown
    to people working at the source code level. My comment was only
    about what is visible at the source code level, not about the
    unknown hidden workings of some particular implementation.

  • From Michael S@21:1/5 to Tim Rentsch on Thu Sep 12 18:43:07 2024
    On Thu, 12 Sep 2024 08:15:29 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Wed, 11 Sep 2024 17:34:38 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:


    Real mode x86 C compilers operating in Large and Compact Models
    that were popular on IBM-compatible PCs 30-40 years ago could
    have more than one representation for the pointer to the same
    memory location. If my memory serves me, the rules of pointer
    comparison for equality were the same as the rules of comparison for
    < and >. In both cases, for a reliable result pointers had to be
    explicitly normalized (i.e. converted from 'far' to 'huge' or
    something like that).

    It was a long time ago and even back then I didn't use Large model
    very often, so it's possible that I misremember. But if I
    remember correctly, does it mean that those C compilers would now
    be considered non-compliant?

    The C standard was first ratified (by ANSI) in 1989. The rules
    for pointer comparison were clarified in the C99 standard, but it
    has always been true that pointers to the same object have to
    compare equal.

    C environments that have things like 'far' or 'huge' pointers,
    etc, are not standard C but must have extensions so that they can
    deal with the different kinds of pointers. Depending on how the
    non-standard kinds of pointer worked, the implementation might or
    might not be conforming. Most likely though it's a moot point
    because once a program starts using an extension all the rules
    can change, and the C standard allows that. It's only programs
    that look like really standard C that have to do what the C
    standard says (for the implementation to be conforming); any
    code that declares a 'far' pointer or 'huge' pointer certainly
    isn't standard C.

    In Compact and Large models data pointers are 'far' by default. So,
    the source doesn't have to use non-standard declarations.

    In that case, if the defaulted 'far' pointers don't follow the
    rules given in the C standard for regular pointers, then the
    implementation is not conforming. Extensions are allowed only if
    they don't change the behavior of any strictly conforming
    program. If undecorated pointer declarations don't observe this
    condition then it's not a valid extension, which in turn causes
    the implementation to be non-conforming.


    Thinking about it, there likely was no way to create aliases using
    only standard language constructs. That's assuming that any use of
    preserved values of pointers to de-allocated heap storage, including
    use for comparison, is non-standard.
    So it probably was a conforming implementation in the end.

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Thu Sep 12 15:53:33 2024
    Tim Rentsch <[email protected]> schrieb:

    Code inside the implementation is allowed to exploit internal
    knowledge.

    Which is a cause of envy for people who don't...

    glibc can compare pointers all it wants if it knows that the
    pointer model is flat.

  • From Tim Rentsch@21:1/5 to Michael S on Thu Sep 12 08:15:29 2024
    Michael S <[email protected]> writes:

    On Wed, 11 Sep 2024 17:34:38 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Wed, 11 Sep 2024 09:29:04 -0700
    Tim Rentsch <[email protected]> wrote:

    Josh Vanderhoof <[email protected]> writes:

    [how to write a portable, UB-free check if memcpy() intervals
    overlap]

    It is legal to test for equality between pointers to different
    objects

    Right. This observation is the key insight.

    Real mode x86 C compilers operating in Large and Compact Models that
    were popular on IBM-compatible PCs 30-40 years ago could have more
    than one representation for the pointer to the same memory
    location. If my memory serves me, the rules of pointer comparison
    for equality were the same as the rules of comparison for < and >.
    In both cases, for a reliable result pointers had to be explicitly
    normalized (i.e. converted from 'far' to 'huge' or something like
    that).

    It was a long time ago and even back then I didn't use Large model
    very often, so it's possible that I misremember. But if I remember
    correctly, does it mean that those C compilers would now be
    considered non-compliant?

    The C standard was first ratified (by ANSI) in 1989. The rules
    for pointer comparison were clarified in the C99 standard, but it
    has always been true that pointers to the same object have to
    compare equal.

    C environments that have things like 'far' or 'huge' pointers,
    etc, are not standard C but must have extensions so that they can
    deal with the different kinds of pointers. Depending on how the
    non-standard kinds of pointer worked, the implementation might or
    might not be conforming. Most likely though it's a moot point
    because once a program starts using an extension all the rules
    can change, and the C standard allows that. It's only programs
    that look like really standard C that have to do what the C
    standard says (for the implementation to be conforming); any
    code that declares a 'far' pointer or 'huge' pointer certainly
    isn't standard C.

    In Compact and Large models data pointers are 'far' by default. So,
    the source doesn't have to use non-standard declarations.

    In that case, if the defaulted 'far' pointers don't follow the
    rules given in the C standard for regular pointers, then the
    implementation is not conforming. Extensions are allowed only if
    they don't change the behavior of any strictly conforming
    program. If undecorated pointer declarations don't observe this
    condition then it's not a valid extension, which in turn causes
    the implementation to be non-conforming.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Sep 12 17:57:52 2024
    On Thu, 12 Sep 2024 14:20:42 +0000, Anton Ertl wrote:

    That's clever, but for usage in glibc or the like the clunky version
    is the preferred one: memmove() is usually called through the dynamic
    linking mechanism, and which implementation is actually called is
    selected based on the hardware that it runs on (what does it do when
    the program is linked statically?). There seem to be quite a few
    memmove() (and __memmove_chk()) implementations in glibc-2.36 on
    AMD64:

    __memmove_chk
    __memmove_sse2_unaligned_erms
    __memmove_chk
    __memmove_chk_erms
    __memmove_chk_evex_unaligned
    __memmove_chk_avx_unaligned
    __memmove_chk_ssse3
    __memmove_chk_sse2_unaligned
    __memmove_erms
    __memmove_avx512_unaligned
    __memmove_evex_unaligned
    __memmove_evex_unaligned_erms
    __memmove_avx_unaligned
    __memmove_avx_unaligned_erms
    __memmove_avx_unaligned_rtm
    __memmove_ssse3
    __memmove_sse2_unaligned
    __memmove_chk_sse2_unaligned_erms
    __memmove_chk_avx512_no_vzeroupper
    __memmove_chk_avx512_unaligned
    __memmove_chk_avx512_unaligned_erms
    __memmove_chk_evex_unaligned_erms
    __memmove_chk_avx_unaligned_erms
    __memmove_chk_avx_unaligned_rtm
    __memmove_chk_avx_unaligned_erms_rtm
    __memmove_avx512_no_vzeroupper
    __memmove_avx512_unaligned_erms
    __memmove_avx_unaligned_erms_rtm

    All of these compile to the MM instruction in My 66000,
    including the memcpy() variants.

  • From MitchAlsup1@21:1/5 to All on Thu Sep 12 18:11:39 2024
    On Thu, 12 Sep 2024 17:57:52 +0000, MitchAlsup1 wrote:

    On Thu, 12 Sep 2024 14:20:42 +0000, Anton Ertl wrote:

    That's clever, but for usage in glibc or the like the clunky version
    is the preferred one: memmove() is usually called through the dynamic
    linking mechanism, and which implementation is actually called is
    selected based on the hardware that it runs on (what does it do when
    the program is linked statically?). There seem to be quite a few
    memmove() (and __memmove_chk()) implementations in glibc-2.36 on
    AMD64:

    __memmove_chk
    __memmove_sse2_unaligned_erms
    __memmove_chk
    __memmove_chk_erms
    __memmove_chk_evex_unaligned
    __memmove_chk_avx_unaligned
    __memmove_chk_ssse3
    __memmove_chk_sse2_unaligned
    __memmove_erms
    __memmove_avx512_unaligned
    __memmove_evex_unaligned
    __memmove_evex_unaligned_erms
    __memmove_avx_unaligned
    __memmove_avx_unaligned_erms
    __memmove_avx_unaligned_rtm
    __memmove_ssse3
    __memmove_sse2_unaligned
    __memmove_chk_sse2_unaligned_erms
    __memmove_chk_avx512_no_vzeroupper
    __memmove_chk_avx512_unaligned
    __memmove_chk_avx512_unaligned_erms
    __memmove_chk_evex_unaligned_erms
    __memmove_chk_avx_unaligned_erms
    __memmove_chk_avx_unaligned_rtm
    __memmove_chk_avx_unaligned_erms_rtm
    __memmove_avx512_no_vzeroupper
    __memmove_avx512_unaligned_erms
    __memmove_avx_unaligned_erms_rtm

    All of these compile to the MM instruction in My 66000,
    including the memcpy() variants.

    But the list above is a symptom of not providing the right abstraction
    for memmove() in the ISA to begin with.
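Anton's point about run-time selection can be illustrated with a minimal, single-threaded sketch of the idea: every variant shares one signature, a resolver picks one based on a CPU feature test, and callers go through a function pointer. All names here are hypothetical, the "tuned" variant is only a stand-in for a real SIMD routine, and glibc itself does this selection through its IFUNC mechanism when the program is loaded rather than with a lazily initialised pointer like this one.

```c
#include <stddef.h>
#include <string.h>

/* Signature shared by every memmove variant. */
typedef void *(*memmove_fn)(void *dst, const void *src, size_t n);

/* Baseline variant: defers to the C library. */
static void *memmove_baseline(void *dst, const void *src, size_t n)
{
    return memmove(dst, src, n);
}

/* Hypothetical stand-in for a tuned variant.  It stages data through a
   small local buffer, which makes it overlap-safe for n <= 64, and falls
   back to the baseline for anything larger. */
static void *memmove_tuned(void *dst, const void *src, size_t n)
{
    unsigned char tmp[64];
    if (n > sizeof tmp)
        return memmove_baseline(dst, src, n);
    memcpy(tmp, src, n);   /* read all of src first ... */
    memcpy(dst, tmp, n);   /* ... then write, so overlap cannot corrupt it */
    return dst;
}

/* Hypothetical CPU feature probe; a real resolver would query CPUID,
   getauxval(AT_HWCAP) or similar.  Always "no" in this sketch. */
static int cpu_has_fast_vectors(void)
{
    return 0;
}

static memmove_fn resolve_memmove(void)
{
    return cpu_has_fast_vectors() ? memmove_tuned : memmove_baseline;
}

/* Public entry point: resolve once, then call through the pointer.
   This lazy initialisation is only safe in single-threaded code. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    static memmove_fn impl = NULL;
    if (impl == NULL)
        impl = resolve_memmove();
    return impl(dst, src, n);
}
```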

  • From Michael S@21:1/5 to Terje Mathisen on Thu Sep 12 23:10:16 2024
    On Tue, 3 Sep 2024 17:46:38 +0200
    Terje Mathisen <[email protected]> wrote:


    Q&D programming is still far faster for me in C, but using Rust I
    don't have to worry about how well the compiler will be able to
    optimize my code, it is pretty much always close to speed of light
    since the entire aliasing issue goes away.


    I am trying to compare the speed of a few compiled languages in one
    benchmark that I find interesting.
    In order to make the comparison I have to port a test bench first,
    because while most of these languages are able, with varying levels of
    difficulty, to call C routines, none of them can be called from 'C',
    at least at my level of knowledge.

    Porting the test bench from C to Go was quite easy; the only part that
    I didn't grasp immediately was related to time measurements.

    Today I started the Rust port, and it is VERY much harder. After
    several hours of reading various tutorials, examples and Stack Overflow
    articles I still don't know how to write
    switch (argv[1][0]) {
    case 't':
    case 'T':
    x = 42;
    break;
    }

    At this rate, I am not sure that my motivation will last long enough to
    finish the porting.

    Rust also gets rid of the horrible external library/configure/cmake
    mess that kept me from successfully compiling the reference LAStools
    lidar code for nearly 10 years.

    Using the Rust port I just tell cargo to add it to my project and
    that's it.

    Terje


  • From MitchAlsup1@21:1/5 to Michael S on Thu Sep 12 20:58:18 2024
    On Thu, 12 Sep 2024 20:10:16 +0000, Michael S wrote:

    On Tue, 3 Sep 2024 17:46:38 +0200
    Terje Mathisen <[email protected]> wrote:


    Q&D programming is still far faster for me in C, but using Rust I
    don't have to worry about how well the compiler will be able to
    optimize my code, it is pretty much always close to speed of light
    since the entire aliasing issue goes away.


    I am trying to compare the speed of a few compiled languages in one
    benchmark that I find interesting.
    In order to make the comparison I have to port a test bench first,
    because while most of these languages are able, with varying levels of
    difficulty, to call C routines, none of them can be called from 'C',
    at least at my level of knowledge.

    FORTRAN 77 passes arguments indirectly, so the subroutine can write to
    the location storing the argument, giving it IN-OUT capabilities.
    I never found this indirection a bother when calling FORTRAN
    from C.

    Since C only has IN-style arguments (in ADA parlance), ADA OUT and
    INOUT arguments require the compiler to know about the OUT nature of
    the argument so that, upon return, it can place the OUT argument
    variables back where they belong.
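In C itself, the usual way to get IN OUT or OUT behaviour is to pass a pointer explicitly and let the callee write through it, which is also what C code does when it calls a FORTRAN 77 routine that takes its arguments by reference. A minimal sketch with hypothetical function names:

```c
#include <stddef.h>

/* IN OUT in Ada terms: *total is read, updated, and written back. */
static void accumulate(double *total, const double *values, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *total += values[i];
}

/* OUT in Ada terms: the callee only writes through lo and hi.
   Assumes n >= 1. */
static void min_max(const double *values, size_t n, double *lo, double *hi)
{
    *lo = *hi = values[0];
    for (size_t i = 1; i < n; i++) {
        if (values[i] < *lo) *lo = values[i];
        if (values[i] > *hi) *hi = values[i];
    }
}
```

The caller passes addresses (`&t`, `&lo`, `&hi`), so the "place it back where it belongs" step happens through the pointer rather than by compiler-generated copy-back.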

  • From MitchAlsup1@21:1/5 to BGB on Thu Sep 12 21:52:32 2024
    On Thu, 12 Sep 2024 21:14:18 +0000, BGB wrote:


    This is because in some cases, the performance overhead of copying the
    last (sz&31) bytes is significant, say:
    rsz=cte-ct;
    if(rsz)
    {
        if(rsz&16)
        {
            v0=((u64 *)cs)[0]; v1=((u64 *)cs)[1];
            ((u64 *)ct)[0]=v0; ((u64 *)ct)[1]=v1;
            cs+=16; ct+=16;
        }
        if(rsz&8)
        {
            v0=((u64 *)cs)[0];
            ((u64 *)ct)[0]=v0;
            cs+=8; ct+=8;
        }
        if(rsz&4)
        {
            v0=((u32 *)cs)[0];
            ((u32 *)ct)[0]=v0;
            cs+=4; ct+=4;
        }
        if(rsz&2)
        {
            v0=((u16 *)cs)[0];
            ((u16 *)ct)[0]=v0;
            cs+=2; ct+=2;
        }
        if(rsz&1)
        {
            v0=((byte *)cs)[0];
            ((byte *)ct)[0]=v0;
            cs++; ct++;
        }
    }

    For small copies with awkward sizes, this tailing part can cost more
    than the whole rest of the copy.

    A fine rendition of why this should be in HW as an instruction.
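For comparison, a minimal sketch of the same tail handling written with fixed-size memcpy() calls instead of pointer casts, which sidesteps the alignment and effective-type problems of the cast version. gcc and clang typically turn each constant-size memcpy() into a single load/store pair, but that is compiler- and target-dependent. The function name and signature are hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Copy the final rsz bytes (rsz < 32) from cs to ct, largest chunk
   first, mirroring the bit tests in the quoted code. */
static void copy_tail(unsigned char *ct, const unsigned char *cs, size_t rsz)
{
    if (rsz & 16) { memcpy(ct, cs, 16); cs += 16; ct += 16; }
    if (rsz & 8)  { memcpy(ct, cs, 8);  cs += 8;  ct += 8;  }
    if (rsz & 4)  { memcpy(ct, cs, 4);  cs += 4;  ct += 4;  }
    if (rsz & 2)  { memcpy(ct, cs, 2);  cs += 2;  ct += 2;  }
    if (rsz & 1)  { *ct = *cs; }
}
```

It still executes up to five conditional steps for an awkward size, which is the per-call overhead the post argues a single hardware instruction would remove.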

  • From Tim Rentsch@21:1/5 to Michael S on Thu Sep 12 18:33:18 2024
    Michael S <[email protected]> writes:

    On Tue, 3 Sep 2024 17:46:38 +0200
    Terje Mathisen <[email protected]> wrote:

    Q&D programming is still far faster for me in C, but using Rust I
    don't have to worry about how well the compiler will be able to
    optimize my code, it is pretty much always close to speed of light
    since the entire aliasing issue goes away.

    I am trying to compare the speed of a few compiled languages in one
    benchmark that I find interesting.
    In order to make the comparison I have to port a test bench first,
    because while most of these languages are able, with varying levels of
    difficulty, to call C routines, none of them can be called from 'C',
    at least at my level of knowledge.

    Porting the test bench from C to Go was quite easy; the only part that
    I didn't grasp immediately was related to time measurements.

    Today I started the Rust port, and it is VERY much harder. After
    several hours of reading various tutorials, examples and Stack Overflow
    articles I still don't know how to write
    switch (argv[1][0]) {
    case 't':
    case 'T':
    x = 42;
    break;
    }

    At this rate, I am not sure that my motivation will last long enough to finish the porting.

    Disclaimer: I have very little experience with Rust. The
    example shown below looks like Rust but may very well have
    syntax errors (or worse).

    match argv[1][0] {
    't' | 'T' => { x = 42; }
    _ => { }
    }

    The _ pattern matches anything that hasn't been matched (and
    may be necessary, I'm not sure about that).

  • From Thomas Koenig@21:1/5 to Michael S on Fri Sep 13 05:40:08 2024
    Michael S <[email protected]> schrieb:

    In order to make the comparison I have to port a test bench first,
    because while most of these languages are able, with varying levels of
    difficulty, to call C routines, none of them can be called from 'C',
    at least at my level of knowledge.

    If you declare a Fortran procedure BIND(C), you can call it from C.
    gfortran will give you the C prototype with -fc-prototypes.

    Or, if you don't declare it BIND(C) and it uses old-style code,
    you can use -fc-prototypes-external.

  • From Michael S@21:1/5 to Thomas Koenig on Fri Sep 13 11:52:35 2024
    On Fri, 13 Sep 2024 05:40:08 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:

    In order to make the comparison I have to port a test bench first,
    because while most of these languages are able, with varying levels of
    difficulty, to call C routines, none of them can be called from 'C',
    at least at my level of knowledge.

    If you declare a Fortran procedure BIND(C), you can call it from C.
    gfortran will give you the C prototype with -fc-prototypes.

    Or, if you don't declare it BIND(C) and it uses old-style code,
    you can use -fc-prototypes-external.

    Thank you, but Fortran was not in the list of the languages that I
    wanted to test.

  • From Michael S@21:1/5 to Tim Rentsch on Fri Sep 13 12:04:17 2024
    On Thu, 12 Sep 2024 18:33:18 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Tue, 3 Sep 2024 17:46:38 +0200
    Terje Mathisen <[email protected]> wrote:

    Q&D programming is still far faster for me in C, but using Rust I
    don't have to worry about how well the compiler will be able to
    optimize my code, it is pretty much always close to speed of light
    since the entire aliasing issue goes away.

    I am trying to compare the speed of a few compiled languages in one
    benchmark that I find interesting.
    In order to make the comparison I have to port a test bench first,
    because while most of these languages are able, with varying levels of
    difficulty, to call C routines, none of them can be called from 'C',
    at least at my level of knowledge.

    Porting the test bench from C to Go was quite easy; the only part that
    I didn't grasp immediately was related to time measurements.

    Today I started the Rust port, and it is VERY much harder. After
    several hours of reading various tutorials, examples and Stack Overflow
    articles I still don't know how to write
    switch (argv[1][0]) {
    case 't':
    case 'T':
    x = 42;
    break;
    }

    At this rate, I am not sure that my motivation will last long
    enough to finish the porting.

    Disclaimer: I have very little experience with Rust. The
    example shown below looks like Rust but may very well have
    syntax errors (or worse).

    match argv[1][0] {
    't' | 'T' => { x = 42; }
    _ => { }
    }

    The _ pattern matches anything that hasn't been matched (and
    may be necessary, I'm not sure about that).

    My hurdle is related to the [0] part rather than to the switch/case
    part. Accessing the nth character of a String (or of a str? Or of a
    &str? I am still trying to figure out the difference.) is not as simple
    as in C or Go. One person on Stack Overflow said that he was able to
    figure it out after he learned the difference between std::string and
    std::string_view in C++. Maybe I should follow the same process, but
    I don't want to. I don't plan to become an expert Rust programmer;
    I just want to do a simple benchmark.

  • From Terje Mathisen@21:1/5 to Michael S on Fri Sep 13 12:05:10 2024
    Michael S wrote:
    On Thu, 12 Sep 2024 18:33:18 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Tue, 3 Sep 2024 17:46:38 +0200
    Terje Mathisen <[email protected]> wrote:

    Q&D programming is still far faster for me in C, but using Rust I
    don't have to worry about how well the compiler will be able to
    optimize my code, it is pretty much always close to speed of light
    since the entire aliasing issue goes away.

    I am trying to compare the speed of a few compiled languages in one
    benchmark that I find interesting.
    In order to make the comparison I have to port a test bench first,
    because while most of these languages are able, with varying levels of
    difficulty, to call C routines, none of them can be called from 'C',
    at least at my level of knowledge.

    Porting the test bench from C to Go was quite easy; the only part that
    I didn't grasp immediately was related to time measurements.

    Today I started the Rust port, and it is VERY much harder. After
    several hours of reading various tutorials, examples and Stack Overflow
    articles I still don't know how to write
    switch (argv[1][0]) {
    case 't':
    case 'T':
    x = 42;
    break;
    }

    At this rate, I am not sure that my motivation will last long
    enough to finish the porting.

    Disclaimer: I have very little experience with Rust. The
    example shown below looks like Rust but may very well have
    syntax errors (or worse).

    match argv[1][0] {
    't' | 'T' => { x = 42; }
    _ => { }
    }

    The _ pattern matches anything that hasn't been matched (and
    may be necessary, I'm not sure about that).

    My hurdle is related to the [0] part rather than to the switch/case
    part. Accessing the nth character of a String (or of a str? Or of a
    &str? I am still trying to figure out the difference.) is not as simple
    as in C or Go. One person on Stack Overflow said that he was able to
    figure it out after he learned the difference between std::string and
    std::string_view in C++. Maybe I should follow the same process, but
    I don't want to. I don't plan to become an expert Rust programmer;
    I just want to do a simple benchmark.

    Rust strings _always_ use UTF-8! If you use the .as_bytes() conversion,
    then you can in fact address the underlying u8 bytes, and since you will
    be working with 7-bit ASCII only, that will not make any difference to you.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Tim Rentsch@21:1/5 to Michael S on Fri Sep 13 04:12:21 2024
    Michael S <[email protected]> writes:

    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros for
    various things:
    Endianness (macros exist, typically compiler specific);
      And, apparently GCC and Clang can't agree on which strategy to use.
    Whether or not the target/compiler allows misaligned memory access;
      If set, one may use misaligned access.
    Whether or not memory uses a single address space;
      If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.

    Why not?

    Because it's not needed, and would make things worse rather
    than better. The result would be a bigger language but not
    a better language.

    I don't see a practical need for all those UBs, apart from buffer
    overflow. Moreover, I don't see the need for UB in certain
    limited classes of buffer overflows.

    Eliminating undefined behavior is not what's being asked for.
    These two questions are not the same.

    struct {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here the behavior should be fully defined by the implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.

  • From Michael S@21:1/5 to Tim Rentsch on Fri Sep 13 14:29:04 2024
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:


  • From Michael S@21:1/5 to Tim Rentsch on Fri Sep 13 14:44:11 2024
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros
    for various things:
    Endianness (macros exist, typically compiler specific);
      And, apparently GCC and Clang can't agree on which strategy to use.
    Whether or not the target/compiler allows misaligned memory access;
      If set, one may use misaligned access.
    Whether or not memory uses a single address space;
      If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.

    Why not?

    Because it's not needed, and would make things worse rather
    than better. The result would be a bigger language but not
    a better language.


    I beg to differ.
    Yes, the standard would be bigger. And yes, a few unimportant
    benchmarks would run a little slower. But the job of compiler writers
    would be simpler and less exciting (a good thing!). Most importantly,
    programming in the resulting language would feel more predictable.

    I don't see a practical need for all those UBs, apart from buffer
    overflow. Moreover, I don't see the need for UB in certain
    limited classes of buffer overflows.

    Eliminating undefined behavior is not what's being asked for.
    These two questions are not the same.

    struct {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here the behavior should be fully defined by the implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines,
    as actually happens in reality with all production compilers.
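Without relying on the out-of-bounds store, standard C does let code write to the object representation of y through an unsigned char pointer, which gives the effect described above in a fully defined way. A minimal sketch, assuming a 4-byte int; the names are hypothetical, the resulting value still depends on byte order, and whether bar.x[8] would actually land on y depends on there being no padding between the members, which common ABIs provide but the standard does not promise:

```c
#include <stddef.h>

struct pair {
    char x[8];
    int  y;
};

/* Store v into the lowest-addressed byte of s->y.  Accessing any object's
   representation through unsigned char * is always permitted. */
static void set_first_byte_of_y(struct pair *s, unsigned char v)
{
    ((unsigned char *)&s->y)[0] = v;
}

/* Nonzero if y starts immediately after x (no padding in between). */
static int y_follows_x(void)
{
    return offsetof(struct pair, y) == sizeof(((struct pair *)0)->x);
}
```

With y initially 0, the store yields 42 on a little-endian target and 42 << 24 on a big-endian one, matching the values Michael gives.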

  • From David Brown@21:1/5 to BGB on Fri Sep 13 17:30:35 2024
    On 12/09/2024 23:14, BGB wrote:
    On 9/12/2024 9:18 AM, David Brown wrote:
    On 11/09/2024 20:51, BGB wrote:
    On 9/11/2024 5:38 AM, Anton Ertl wrote:
    Josh Vanderhoof <[email protected]> writes:
    [email protected] (Anton Ertl) writes:


    <snip lots>

    Would be nice, say, if there were semi-standard compiler macros for
    various things:

    Ask, and you shall receive!  (Well, sometimes you might receive.)

       Endianness (macros exist, typically compiler specific);
         And, apparently GCC and Clang can't agree on which strategy to use.

    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    ...
    #elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    ...
    #else
    ...
    #endif

    Works in gcc, clang and MSVC.
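A minimal sketch that uses these predefined macros when they are present and falls back to a run-time check otherwise; the function name is hypothetical and mixed-endian layouts are not handled:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian target, 0 on a big-endian one. */
static int is_little_endian(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return 1;
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return 0;
#else
    /* Fallback: look at the lowest-addressed byte of a known value. */
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
#endif
}
```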


    Technically now also in BGBCC, since I have just recently added it.

    Good idea.



    And C23 has the <stdbit.h> header with many convenient little "bit and
    byte" utilities, including endian detection:

    #include <stdbit.h>
    #if __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_LITTLE__
    ...
    #elif __STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_BIG__
    ...
    #else
    ...
    #endif


    This is good at least.

    Though, it generally takes a few years before new features become usable.
    Like, it is only in recent years that it has become "safe" to use most
    parts of C99.


    Most of the commonly used parts of C99 have been "safe" to use for 20
    years. There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.

    There are only two serious, general purpose C compilers in mainstream
    use - gcc and clang, and both support almost all of C23 now. But it
    will take a while for the more niche tools, such as some embedded
    compilers, to catch up.

    <stdbit.h> is, however, in the standard library rather than the
    compiler, and they can be a bit slow to catch up.


       Whether or not the target/compiler allows misaligned memory access;
         If set, one may use misaligned access.

    Why would you need that?  Any decent compiler will know what is
    allowed for the target (perhaps partly on the basis of compiler
    flags), and will generate the best allowed code for accesses like
    foo3() above.
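For instance, a minimal sketch of misaligned 32-bit accessors written with memcpy(). With optimisation enabled, gcc and clang usually compile each call to a single load or store on targets that permit misaligned access, though that is typical behaviour rather than a guarantee. The function names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Read a native-endian uint32_t from p, whatever its alignment. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write v to p in native byte order, whatever p's alignment. */
static void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```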


    Imagine you have compilers that are smart enough to turn "memcpy()" into
    a load and store, but not smart enough to optimize away the memory
    accesses, or fully optimize away the wrapper functions...


    Why would I do that? If I want to have efficient object code, I use a
    good compiler. Under what realistic circumstances would you need to
    have highly efficient results but be unable to use a good optimising
    compiler? Compilers have been inlining code for 30 years at least
    (that's when I first saw it) - this is not something new and rare.

    So, for best results, the best case option is to use a pointer cast and dereference.

    For some cases, one may also need to know whether or not they can access
    the pointers in a misaligned way (and whether doing so would be better
    or worse than something like "memcpy()").


    Again, I cannot see a /real/ situation where that would be relevant.


       Whether or not memory uses a single address space;
         If set, all pointer comparisons are allowed.

    Pointer comparisons are always allowed for equality tests if they are
    pointers to objects of compatible types.  (Function pointers cannot be
    compared at all.)

    For other relational tests, the pointers must point to sub-objects of
    the same aggregate object.  (That means they can't be null pointers,
    misaligned pointers, invalid pointers or pointers going nowhere.)
    This is independent of how the address space(s) are organised on the
    target machine.

    What you /can/ do, on pretty much any implementation with a single
    linear address space, is convert pointers to uintptr_t and then
    compare them.  There may be some targets for which there is no
    uintptr_t, or where the mapping from pointer to integer does not match
    with the address, but that would be very unusual.

    I can't think when you would need to do such comparisons, however,
    other than to implement memmove - and library functions can use any
    kind of implementation-specific feature they like.
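A minimal sketch of that uintptr_t approach, here used to test whether two byte ranges overlap. uintptr_t is an optional type and the pointer-to-integer mapping is implementation-defined, so this relies on the usual flat mapping; the function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if [a, a+na) and [b, b+nb) share at least one byte.
   Assumes a flat pointer-to-integer mapping (implementation-defined). */
static int ranges_overlap(const void *a, size_t na, const void *b, size_t nb)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    if (na == 0 || nb == 0)
        return 0;
    return pa < pb + nb && pb < pa + na;
}
```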


    Yeah.

    My "_memlzcpy()" functions do a lot of relative comparisons (more than
    needed for memmove):
      dst<=src: memmove
      (dst-src)>=sz: memcpy
      (dst-src)>=32: can copy with 32B blocks
      (dst-src)>=16: can copy with 16B blocks
      (dst-src)>= 8: can copy with 8B blocks
      1/2/4: Generate a full-block fill pattern
      3/5/6/7: partial fill pattern (16B block with irregular step)
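The small-distance cases in that list follow from LZ77-style match copying, where the source may overlap the destination and the intent is to repeat the last few bytes. A minimal byte-at-a-time sketch of those semantics (not BGB's implementation; the name is hypothetical) shows why memmove() is not a drop-in replacement: memmove() copies the original bytes, whereas this loop re-reads bytes it has just written.

```c
#include <stddef.h>

/* Append len bytes at out[pos], copying from dist bytes back.
   When dist < len the source overlaps the destination, so the last
   dist bytes repeat as a pattern.  Requires 1 <= dist <= pos. */
static void lz_match_copy(unsigned char *out, size_t pos, size_t dist, size_t len)
{
    unsigned char *dst = out + pos;
    const unsigned char *src = dst - dist;
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];   /* strictly forward, one byte at a time */
}
```

For example, with "ab" already written, a match of distance 2 and length 6 produces "abababab"; distance 1 simply repeats the last byte.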


    If this is something for your library for your compiler, then of course
    you are free to do anything you want here - standard library code does
    not need to be portable, but is free to use any kind of compiler "magic"
    it likes. (For example, gcc has lots of builtins and extensions that
    are not targeted at normal code, but are targeted specifically at
    library writers.)

    There is a difference here between "_memlzcpy()" and "_memlzcpyf()" in
    that:
      the former will always copy an exact number of bytes;
      the latter may write 16-32 bytes over the limit.

    It may do /what/ ? That is a scary function!


    Possible:
       __MINALIGN_type__  //minimum allowed alignment for type

    _Alignof(type) has been around since C11.


    _Alignof tells the native alignment, not the minimum.

    It is the same thing.


    Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would give
    1 if the target supports misaligned pointers.


    The alignment of types in C is given by _Alignof. Hardware may support unaligned accesses - C does not. (By that, I mean that unaligned
    accesses are UB.)
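A minimal sketch of how this looks in C11: _Alignof gives the alignment the type requires, and a pointer can be checked against it via uintptr_t, relying on the usual flat pointer-to-integer mapping (implementation-defined). The helper name is hypothetical:

```c
#include <stdint.h>

/* Nonzero if p satisfies the alignment that int32_t requires.
   Assumes the common flat pointer-to-integer mapping. */
static int is_aligned_i32(const void *p)
{
    return ((uintptr_t)p % _Alignof(int32_t)) == 0;
}
```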



    Maybe also alias pointer control:
       __POINTER_ALIAS__
         __POINTER_ALIAS_CONSERVATIVE__
         __POINTER_ALIAS_STRICT__

    Where, pointer alias can be declared, and:
       If conservative, then conservative semantics are being used.
         Pointers may be freely cast without concern for pointer aliasing.
         Compiler will assume that "non restrict" pointer stores may alias.
       If strict, the compiler is using TBAA semantics.
         Compiler may assume that aliasing is based on pointer types.


    Faffing around with pointer types - breaking the "effective type"
    rules - has been a bad idea and risky behaviour since C was
    standardised.  You never need to do it.  (I accept, however, that on
    some weaker or older compilers "doing the right thing" can be
    noticeably less efficient than writing bad code.)  Just get a
    half-decent compiler and use memcpy(). For any situation where you
    might think casting pointer types would be a good idea, your sizes are
    small and known at compile time, so they are easy for the compiler to
    optimise.
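As an example of the memcpy() route, a minimal sketch of reading and writing the bit pattern of a float without casting pointer types. It assumes float is a 32-bit IEEE 754 binary32 type, as on mainstream targets; the function names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

/* Bit pattern of f; replaces the non-conforming *(uint32_t *)&f. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Inverse: build a float from a bit pattern. */
static float float_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```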


    It depends.

    In some things, like my ELF and PE/COFF program loaders, the code can
    get particularly nasty in these areas...

    It may look simpler in the code to do this kind of thing, but it is not /necessary/ and it is not safe unless you are writing non-portable code
    and are sure it will only be used on a compiler that supports it. Thus
    the Linux kernel requires "-fno-strict-aliasing", because some of the
    Linux kernel authors write crap C code. (Or, to be a bit fairer, some
    of the code in the Linux kernel is very old and comes from a time when
    writing things correctly while generating efficient results would need
    more effort.)


    And as a general rule, if you feel you really want to break the rules
    of C and still get something useful out at the end, use "volatile"
    liberally.


    I have used "volatile" here to good effect.




  • From Thomas Koenig@21:1/5 to David Brown on Fri Sep 13 15:55:39 2024
    David Brown <[email protected]> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use for 20
    years. There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.

    What about VLAs?

    There are only two serious, general purpose C compilers in mainstream
    use - gcc and clang, and both support almost all of C23 now. But it
    will take a while for the more niche tools, such as some embedded
    compilers, to catch up.

    It is almost impossible to gather statistics on compiler use,
    especially with free compilers, but what about MSVC and icc?

  • From George Neuner@21:1/5 to [email protected] on Fri Sep 13 13:11:37 2024
    On Thu, 12 Sep 2024 04:04:06 -0700, Tim Rentsch
    <[email protected]> wrote:

    George Neuner <[email protected]> writes:

    On Tue, 10 Sep 2024 11:21:01 +0300, Michael S
    <[email protected]> wrote:

    On Mon, 09 Sep 2024 23:27:24 -0400
    George Neuner <[email protected]> wrote:

    On Sun, 08 Sep 2024 15:36:39 GMT, [email protected]
    (Anton Ertl) wrote:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Anton Ertl) writes:

    There was still no easy way to determine whether your software
    that calls memcpy() actually works as expected on all hardware,

    There may not be a way to tell if memcpy()-calling code will
    work on platforms one doesn't have, but there is a relatively
    simple and portable way to tell if some memcpy() call crosses
    over into the realm of undefined behavior.

    1) At first I thought that yes, one could just check whether
    there is an overlap of the memory areas. But then I remembered
    that you cannot write such a check in standard C without (in the
    general case) exercising undefined behaviour; and then the
    compiler could eliminate the check or do something else that's
    unexpected. Do you have such a check in mind that does not
    exercise undefined behaviour in the general case?

    The result of comparing pointers to two elements of the same array
    is defined. Cast to (char*), both src and dst can be considered
    to point to elements of the [address space sized] char array at
    address zero.

    According to my understanding, your 'can be considered' part is not
    codified in the C Standard.

    Adding size_t to a pointer yields another pointer of the same
    type.

    In terms of types, that is right, but the addition works only if
    the pointer points into an array large enough to include the
    result of the addition (the result is also allowed to be just one
    past the end of the array).

    All of gcc, clang and MSVC seem happy with this.

    It works. But is it guaranteed to work in the future by some sort
    of document? I am pretty sure that no such guarantee exists in gcc
    and MSVC docs. I did not look in clang docs. Trying to find
    anything in LLVM/clang docs makes me sad.

    I know that it has worked as expected with every version of gcc
    and Microsoft I've used since 1988. [clang I don't use, but I
    tried it on godbolt.org with the most recent version]

    Will it continue to work ... who knows?


    I definitely am NOT an expert on the C standard, but thinking
    about it, it occurred to me that if an array is explicitly defined
    that *might* cover all memory (or at least all heap), then the
    compiler would have to honor any apparent pointers into it.

    E.g., char (*all_memory)[] = 0;

    This declaration introduces a pointer, not an array. Similarly
    the declaration

    char (*great_white_array)[ 999999999999999999 ] = 0;

    does not introduce an array but just a pointer (and initializes
    the pointer to be a null pointer). There is no humongous array.
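    A small sketch that makes Tim Rentsch's point concrete: each declaration
    below defines only a pointer object, as sizeof shows. The bound is
    reduced from the original to 1 << 20 (an arbitrary stand-in so the
    example also compiles on 32-bit targets), and big_array_ptr is a
    hypothetical name.

```c
#include <stddef.h>

/* Both declarations define a single pointer object initialised to a null
   pointer; neither allocates any array storage. */
char (*all_memory)[] = 0;           /* pointer to char array of unknown size */
char (*big_array_ptr)[1 << 20] = 0; /* pointer to a 1 MiB char array type */
```

    sizeof big_array_ptr is the size of a pointer, while sizeof
    *big_array_ptr is the size of the (nonexistent) array type it points to.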


    Of course there is no actual array ... the point was to (try to)
    define *something* such that the compiler would think there was an
    array and consider any char* as possibly pointing to an element of
    that array.
    [And yes! it might end up pessimizing character manipulating code.]

    The C standard guarantees that pointers to 2 elements of the same
    array are comparable, and current (and past) compilers do allow
    comparing arbitrary pointers when cast to char* without needing an
    actual char array that covers the addresses.

    But a guarantee wrt the standard requires the compiler to at least
    *think* there is such an array. The question is how to do that.


    None of the compilers at godbolt seem to need this to compare
    arbitrary addresses as char*, but all accept it.

    The given declaration of 'all_memory' is strictly conforming.
    It must be accepted by any conforming C implementation (which
    all of gcc, clang, and MSVC purport to be, IIUC).

    Obviously speculation, but it's the best I have.

    It's important to realize that there are two distinct questions.
    One, does the code work (in a given implementation)? Two, does
    the code satisfy the rules given in the C standard?

    Unfortunately having an answer to the first question does not by
    itself give enough information to answer the second question.
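    For comparison, one overlap test that stays within the standard relies
    only on equality comparison, which (unlike < and >) is defined for
    pointers into different objects, at the cost of an O(n) scan. This is a
    minimal sketch, not necessarily the check Tim Rentsch had in mind; the
    name regions_overlap is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the n-byte regions starting at a and b share a byte.
   Only pointer arithmetic inside each region and == between regions are
   used, so no relational comparison of unrelated pointers occurs. */
bool regions_overlap(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    for (size_t i = 0; i < n; i++) {
        /* Two equal-length regions overlap iff one starts inside the other. */
        if (pa + i == pb || pb + i == pa)
            return true;
    }
    return false;
}
```

    Casting to uintptr_t and comparing integers is the usual O(1)
    alternative, but that mapping is implementation-defined (and uintptr_t
    itself is optional), so it answers the first of Tim's two questions
    rather than the second.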

  • From Tim Rentsch@21:1/5 to Michael S on Fri Sep 13 10:42:22 2024
    Michael S <[email protected]> writes:

    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros
    for various things:
    Endianness (macros exist, typically compiler specific);
      And, apparently GCC and Clang can't agree on which strategy to use.
    Whether or not the target/compiler allows misaligned memory access;
      If set, one may use misaligned access.
    Whether or not memory uses a single address space;
      If set, all pointer comparisons are allowed.

    [elaborations on the above]
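    As an illustration of the endianness item, here is a sketch combining a
    portable run-time probe with the compile-time macros that GCC and Clang
    predefine (__BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__). The names
    is_little_endian and COMPILE_TIME_LITTLE_ENDIAN are hypothetical, and the
    fallback value -1 for "unknown" is an arbitrary choice.

```c
#include <stdint.h>
#include <string.h>

/* Portable run-time check: look at the lowest-addressed byte of a known value. */
int is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Compile-time check, available only where the compiler predefines
   these (compiler-specific) macros; -1 means "unknown". */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
#define COMPILE_TIME_LITTLE_ENDIAN (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#else
#define COMPILE_TIME_LITTLE_ENDIAN (-1)
#endif
```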

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.

    Why not?

    Because it's not needed, and would make things worse rather
    than better. The result would be a bigger language but not
    a better language.

    I beg to differ.
    Yes, the standard would be bigger. And yes, a few unimportant benchmarks
    would run a little slower. But the job of compiler writers would be
    simpler and less exciting (good thing!). Most importantly,
    programming in the resulting language would feel more predictable.

    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain
    limited classes of buffer overflows.

    Eliminating undefined behavior is not what's being asked for.
    These two questions are not the same.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.

    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    I think the consequences of changes like the ones you suggest
    would be much larger than you think they would be. The result
    would change C into a completely different language.

    Also I think the percentage of code where such considerations are
    relevant is extremely small, significantly less than a thousandth
    of a percent. That's a mighty small tail wagging a mighty large
    dog.

    I'm not trying to convince anyone; just stating a personal view.
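    For comparison, the byte-level effect Michael S describes can already be
    obtained with defined behavior by writing y's object representation
    through an unsigned char pointer; only the resulting value depends on
    byte order. This is a minimal sketch: the tag bar_t and the function name
    are hypothetical, and int32_t stands in for a 4-byte int.

```c
#include <stdint.h>

struct bar_t {
    char    x[8];
    int32_t y;
};

/* Defined-behavior counterpart of bar.x[8] = 42: store v into the
   lowest-addressed byte of y. Any object may be accessed through an
   unsigned char lvalue, so the compiler must assume y has changed. */
void set_first_byte_of_y(struct bar_t *b, unsigned char v)
{
    unsigned char *p = (unsigned char *)&b->y;
    p[0] = v;
}
```

    Starting from y == 0 and storing 42, y reads 42 on a little-endian
    machine and 42 << 24 on a big-endian one.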

  • From Tim Rentsch@21:1/5 to George Neuner on Fri Sep 13 11:09:01 2024
    George Neuner <[email protected]> writes:

    On Thu, 12 Sep 2024 04:04:06 -0700, Tim Rentsch
    <[email protected]> wrote:

    George Neuner <[email protected]> writes:

    [...]

    I definitely am NOT an expert on the C standard, but thinking
    about it, it occurred to me that if an array is explicitly defined
    that *might* cover all memory (or at least all heap), then the
    compiler would have to honor any apparent pointers into it.

    E.g., char (*all_memory)[] = 0;

    This declaration introduces a pointer, not an array. Similarly
    the declaration

    char (*great_white_array)[ 999999999999999999 ] = 0;

    does not introduce an array but just a pointer (and initializes
    the pointer to be a null pointer). There is no humongous array.

    Of course there is no actual array ... the point was to (try to)
    define *something* such that the compiler would think there was an
    array and consider any char* as possibly pointing to an element of
    that array.
    [And yes! it might end up pessimizing character manipulating code.]

    The C standard guarantees that pointers to 2 elements of the same
    array are comparable, and current (and past) compilers do allow
    comparing arbitrary pointers when cast to char* without needing an
    actual char array that covers the addresses.

    But a guarantee wrt the standard requires the compiler to at least
    *think* there is such an array. The question is how to do that.

    What the compiler thinks is irrelevant. It's only what the C
    standard thinks that matters.

    If someone fools the C compiler today they might (or might not)
    get what they want or expect. But fooling the compiler is a
    risky strategy, and it's almost never needed; people try to
    trick compilers a lot more often than circumstances actually
    warrant, and that is not even counting non-guarantees about
    future behavior.

    Incidentally, I am in the middle of porting some code from one
    platform to another. Code works fine on the original platform,
    millions of tests are picture perfect. No undefined behavior in
    sight. The new platform is a nightmare, thanks to a certain
    well-known company headquartered in the state of Washington. Any
    concerns about what happens with undefined behavior are so far
    down the list we'd need a telescope to see them. Given this
    recent experience, it's hard for me to get too worked up about
    defining these over-the-edge cases.

  • From Stephen Fuld@21:1/5 to David Brown on Fri Sep 13 13:09:00 2024
    On 9/3/2024 4:14 PM, David Brown wrote:
    On 03/09/2024 18:54, Stephen Fuld wrote:
    On 9/2/2024 11:23 PM, David Brown wrote:
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:

    Anyway, that is all mostly moot since I'm using Rust for this kind
    of programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?


    And also for Rust versus C++ ?

    I asked about C versus Rust as Terje explicitly mentioned those two
    languages, but you make a good point in general.


    I want to know about both :-)

    In my field, small-systems embedded development, C has been dominant for
    a long time, but C++ use is increasing.  Most of my new stuff in recent times has been C++.  There are some in the field who are trying out
    Rust, so I need to look into it myself - either because it is a better
    choice than C++, or because customers might want it.



    My impression - based on hearsay for Rust as I have no experience -
    is that the key point of Rust is memory "safety".  I use scare-quotes
    here, since it is simply about correct use of dynamic memory and
    buffers.

    I agree that memory safety is the key point, although I gather that it
    has other features that many programmers like.


    Sure.  There are certainly plenty of things that I think are a better
    idea in a modern programming language and that make it a good step up compared to C.  My key interest is in comparison to C++ - it is a step
    up in some ways, a step down in others, and a step sideways in many features.  But is it overall up or down, for /my/ uses?

    Examples of things that I think are good in Rust are making variables immutable by default and pattern matching.  Steps down include lack of function overloading

    Rust's generic functions are not sufficient?



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Thomas Koenig@21:1/5 to Michael S on Fri Sep 13 21:39:39 2024
    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Sep 13 23:16:19 2024
    On Fri, 13 Sep 2024 21:39:39 +0000, Thomas Koenig wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety.

    FORTRAN allowed:
    subroutine1:
    COMMON /ALPHA/i,j,k,l,m,n
    subroutine2:
    COMMON /ALPHA/x,y,z
    expecting {i,j}, which are INTEGER*4, to overlap with x, which is REAL*8;...
    {Completely neglecting the BE/LE problems,...}

    Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

  • From Thomas Koenig@21:1/5 to [email protected] on Sat Sep 14 07:25:00 2024
    MitchAlsup1 <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 +0000, Thomas Koenig wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety.

    FORTRAN allowed:
    subroutine1:
    COMMON /ALPHA/i,j,k,l,m,n
    subroutine2:
    COMMON /ALPHA/x,y,z
    expecting {i,j}, which are INTEGER*4, to overlap with x, which is REAL*8;...
    {Completely neglecting the BE/LE problems,...}

    Not only that, also different FP formats...

    The only thing that was guaranteed is the storage unit. An INTEGER
    and a REAL occupies one storage unit, a DOUBLE PRECISION occupies
    two. Through EQUIVALENCE or through different COMMON blocks in
    different procedures, an INTEGER and a REAL can occupy the same
    storage location. And if a value was assigned to a variable of
    one type (the entity became defined, in standardese), the variable
    with the same storage location becomes undefined (at least as far
    back as Fortran 77, I didn't check earlier).

    This was very widely ignored, people used COMMON and EQUIVALENCE
    for type punning all the time.

    There also was the issue of alignment; by playing tricks with
    EQUIVALENCE, you could put a double precision variable on an
    unaligned memory location. With the advent of the RISC CPUs which
    didn't support this, this became the most-ignored provision in the
    standard (but with a flag to restore standard-conforming behavior).

    Hmm... what were the alignment restrictions on double precision
    on the /360?
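    The closest C counterpart of overlaying two INTEGER*4 with a REAL*8 via
    EQUIVALENCE or mismatched COMMON blocks is a union, where reading a
    member other than the one last stored reinterprets the bytes. This is a
    rough sketch, assuming IEEE 754 doubles and an 8-byte double; the union
    and function names are hypothetical. As with Fortran, which half ends up
    where depends on byte order.

```c
#include <stdint.h>

/* Two members sharing the same storage, like EQUIVALENCE. */
union overlay {
    double  x;      /* REAL*8 */
    int32_t i[2];   /* two INTEGER*4 */
};

/* Store a double, then read back its two 32-bit halves through the
   other member (C treats this as reinterpreting the object representation). */
void split_double(double v, int32_t out[2])
{
    union overlay u;
    u.x = v;
    out[0] = u.i[0];
    out[1] = u.i[1];
}
```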

  • From Thomas Koenig@21:1/5 to BGB on Sat Sep 14 08:24:29 2024
    BGB <[email protected]> schrieb:
    On 9/13/2024 10:55 AM, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use for 20
    years. There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.

    What about VLAs?


    IIRC, VLAs and _Complex and similar still don't work in MSVC.
    Most of the rest does now at least.

    It's only been 25 years. You have to give Microsoft a bit of
    time to catch up. I'm sure they will get there by 2099.
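    Since C11 made VLAs (and _Complex) optional, portable code can test the
    feature macro __STDC_NO_VLA__ and fall back to the heap. This is a
    minimal sketch with a hypothetical function; note that a compiler that
    lacks VLAs and predates C11 conformance may not define the macro at all,
    so this check is not a complete guarantee.

```c
#include <stdlib.h>

/* Returns the sum of i*i for i in [0, n), using a scratch buffer that is a
   VLA when the implementation supports VLAs and malloc'ed otherwise.
   Returns -1 if the heap allocation fails. */
long sum_squares(int n)
{
    if (n <= 0)
        return 0;
#ifndef __STDC_NO_VLA__
    long buf[n];                 /* VLA: optional since C11 */
    long *p = buf;
#else
    long *p = malloc((size_t)n * sizeof *p);
    if (p == NULL)
        return -1;
#endif
    for (int i = 0; i < n; i++)
        p[i] = (long)i * i;
    long total = 0;
    for (int i = 0; i < n; i++)
        total += p[i];
#ifdef __STDC_NO_VLA__
    free(p);
#endif
    return total;
}
```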

  • From Kent Dickey@21:1/5 to Scott Lurndal on Sat Sep 14 13:08:05 2024
    In article <UxpCO.174965$[email protected]>,
    Scott Lurndal <[email protected]> wrote:
    Bernd Linsel <[email protected]> writes:
    On 05.09.24 19:04, Terje Mathisen wrote:
    One of my alternatives are

    unsigned u = start; // Cannot be less than zero
    if (u) {
        u++;
        do {
            u--;
            data[u]...
        } while (u);
    }

    This typically results in effectively the same asm code as the signed
    version, except for a bottom JGE (Jump if Greater or Equal, signed)
    instead of JAE (Jump if Above or Equal, unsigned), but my version is
    far more verbose.

    Alternatively, if you don't need all N bits of the unsigned type, then
    you can subtract and check if the top bit is set in the result:

    for (unsigned u = start; (u & TOPBIT) == 0; u--)

    Terje


    What about:

    for (unsigned u = start; u != ~0u; --u)

    This is the form we use most when we need
    to work in reverse.

    ...

    or even

    for (unsigned u = start; (int)u >= 0; --u)
    ...

    ?

    I've compared all variants for x86_64 with -O3 -fexpensive-optimizations
    on godbolt.org:
    - 32 bit version: https://godbolt.org/z/TMhhx3nch
    - 64 bit version: https://godbolt.org/z/8oxzTf5Gf


    No significant differences in code generation for unsigned vs. signed.
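    The three unsigned count-down forms quoted above can be checked side by
    side. In this sketch each hypothetical function visits start, start-1,
    ..., 0 and returns how many indices it visited. TOPBIT is defined here
    as an assumption (the high bit of unsigned int), and start is assumed to
    be at most INT_MAX, since the TOPBIT and (int) forms stop early above that.

```c
#define TOPBIT (~(~0u >> 1))   /* assumed: highest bit of unsigned int */

unsigned count_topbit(unsigned start)
{
    unsigned n = 0;
    for (unsigned u = start; (u & TOPBIT) == 0; u--)  /* 0 - 1 wraps, setting TOPBIT */
        n++;
    return n;
}

unsigned count_not_all_ones(unsigned start)
{
    unsigned n = 0;
    for (unsigned u = start; u != ~0u; --u)           /* stop once u wraps to UINT_MAX */
        n++;
    return n;
}

unsigned count_signed_cast(unsigned start)
{
    unsigned n = 0;
    /* Converting values above INT_MAX to int is implementation-defined;
       mainstream compilers wrap modulo 2^N, giving a negative value. */
    for (unsigned u = start; (int)u >= 0; --u)
        n++;
    return n;
}
```

    For any start in [0, INT_MAX], all three return start + 1.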

    This discussion wandered into many subthreads, but I only want to make
    one post and chose here.

    When you write code working on signed numbers and do something like:

    (a < 0) || (a >= max)

    Then the compiler realizes if you treat 'a' as unsigned, this is just:

    (unsigned)a >= max

    since any negative number, treated as unsigned, will be larger than the
    largest positive signed number. So, to do loops which count down and
    have any stride using an unsigned loop count:

    for(u = start; u <= start; u -= step)

    With the usual caveats (start must be a valid signed number, and step
    cannot be so large that start + step crosses the signed boundary).
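    Kent's strided loop relies on unsigned wraparound: once u - step would go
    below zero it wraps to a value larger than start, and the u <= start test
    fails. This is a minimal sketch with a hypothetical counting function,
    assuming start <= INT_MAX and 0 < step <= INT_MAX (step == 0 would never
    terminate).

```c
/* Counts the values visited by: for (u = start; u <= start; u -= step). */
unsigned count_strided(unsigned start, unsigned step)
{
    unsigned n = 0;
    for (unsigned u = start; u <= start; u -= step)
        n++;
    return n;
}
```

    For example, start = 9, step = 3 visits 9, 6, 3, 0 and then stops,
    because 0 - 3 wraps to UINT_MAX - 2.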

    But: unsigned numbers in C have some dangers, which no one here has
    mentioned. Some code presented comes CLOSE to being wrong, but gets
    lucky. With "int" being 32-bits, C promotion rules around unsigned
    ints, signed ints, and unsigned 64-bit can create trouble.

    uint64_t dval; uint32_t val32; int a;

    val32 = 1; dval = 1; a = 1;
    dval = val32 - 2 + dval;

    C will do (val32 - 2) first, which is (1U - 2), i.e. 0xffff_ffff, and
    then add dval, and the result is 0x1_0000_0000.

    Signed numbers don't have this risk, so if you're doing known small loops,
    you can just use ints. If you're doing possibly large loops, just use
    int64_t.
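    A runnable version of Kent's example, assuming a 32-bit int (so that
    uint32_t arithmetic is not promoted), shows the three outcomes. The
    function names are hypothetical.

```c
#include <stdint.h>

/* 32-bit unsigned subtraction wraps before widening: 0xFFFFFFFF + 1. */
uint64_t mixed_unsigned(void)
{
    uint32_t val32 = 1;
    uint64_t dval = 1;
    return val32 - 2 + dval;
}

/* Signed operand: a - 2 is -1, which converts to UINT64_MAX when added
   to dval; the sum then wraps to 0, the arithmetically expected result. */
uint64_t mixed_signed(void)
{
    int32_t a = 1;
    uint64_t dval = 1;
    return a - 2 + dval;
}

/* Widening val32 before subtracting also avoids the surprise. */
uint64_t widened_first(void)
{
    uint32_t val32 = 1;
    uint64_t dval = 1;
    return (uint64_t)val32 - 2 + dval;
}
```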

    Bringing it back to "architecture": like Anton Ertl has said, LP64 for
    C/C++ is a mistake. It should always have been ILP64, and this nonsense
    would go away. Any new architecture should make C ILP64 (looking at you RISC-V, missing yet another opportunity to not make the same mistakes as everyone else).

    Kent

  • From Anton Ertl@21:1/5 to Kent Dickey on Sat Sep 14 13:26:52 2024
    [email protected] (Kent Dickey) writes:
    Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
    C/C++ is a mistake. It should always have been ILP64, and this nonsense
    would go away. Any new architecture should make C ILP64 (looking at you
    RISC-V, missing yet another opportunity to not make the same mistakes as
    everyone else).

    We now have had more than 30 years of catering for this mistake by
    everyone involved. Given their goals, I think that RISC-V made the
    right choice for int in their ABI, even if it was the original choice
    by the MIPS and Alpha people that they follow, like everyone else, was
    wrong.

    That being said, one option would be to introduce another ABI and API
    with 64-bit int (and maybe 32-bit long short int), and programmers
    could choose whether to program for the ILP64 API, or the int=int32_t
    API. Would the ILP64 API/ABI fare better than x32? I doubt it, even
    though I would support it. This ship probably has sailed.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
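    For readers keeping the data-model names straight, a sketch that
    classifies the current target from its type sizes. The function name
    is hypothetical, and it checks only the common combinations (it assumes
    8-bit bytes).

```c
/* Classify the C data model by the sizes of int, long and pointers. */
const char *data_model(void)
{
    if (sizeof(int) == 8 && sizeof(long) == 8 && sizeof(void *) == 8)
        return "ILP64";   /* int, long, pointer all 64-bit */
    if (sizeof(int) == 4 && sizeof(long) == 8 && sizeof(void *) == 8)
        return "LP64";    /* most 64-bit Unix-like ABIs */
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 8)
        return "LLP64";   /* 64-bit Windows */
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4)
        return "ILP32";
    return "other";
}
```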

  • From Michael S@21:1/5 to Thomas Koenig on Sat Sep 14 21:59:22 2024
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs. What I propose is to formally codify 50-year-old
    existing practice.
    And no, it's both much easier to follow than old FORTRAN common blocks
    and has wider scope (applies to all storage classes, rather than just
    to global).

  • From Thomas Koenig@21:1/5 to Michael S on Sat Sep 14 19:02:43 2024
    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs. What I propose is to formally codify 50-year-old
    existing practice.

    So was Fortran's misuse of COMMON Blocks.

    And no, it's both much easier to follow than old FORTRAN common blocks

    You want to allow array bounds violations and type punning rolled into
    one?

    I beg to differ that this is in any way easier, or better.

    and has wider scope (applies to all storage classes, rather than just
    to global).

    You're correct, the potential for mischief is far greater.

  • From Thomas Koenig@21:1/5 to Kent Dickey on Sat Sep 14 19:00:35 2024
    Kent Dickey <[email protected]> schrieb:

    When you write code working on signed numbers and do something like:

    (a < 0) || (a >= max)

    Then the compiler realizes if you treat 'a' as unsigned, this is just:

    (unsigned)a >= max

    For which definition of a and max exactly?

    It certainly does not do so for

    _Bool foo(int a, int max)
    {
    return (a < 0) || (a >= max);
    }

  • From Thomas Koenig@21:1/5 to [email protected] on Sat Sep 14 19:26:13 2024
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the notion
    of int from K&R days.

    That's a designer's choice, I think. It is possible to add 32-bit
    instructions which should be as fast as (or possibly faster than)
    64-bit instructions, as AMD64 and ARM have shown.

    And having a smaller memory footprint is also beneficial, especially
    for caches.

    (Plus, there are FORTRAN's storage association rules, but these should
    be less used by now. But for a 64-bit integer, they pretty much would
    require a 64-bit REAL and a 128-bit DOUBLE PRECISION).

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Sep 14 19:11:30 2024
    On Sat, 14 Sep 2024 13:26:52 +0000, Anton Ertl wrote:

    [email protected] (Kent Dickey) writes:
    Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
    C/C++ is a mistake. It should always have been ILP64, and this nonsense
    would go away. Any new architecture should make C ILP64 (looking at you
    RISC-V, missing yet another opportunity to not make the same mistakes as
    everyone else).

    We now have had more than 30 years of catering for this mistake by
    everyone involved. Given their goals, I think that RISC-V made the
    right choice for int in their ABI, even if it was the original choice
    by the MIPS and Alpha people that they follow, like everyone else, was
    wrong.

    Until the advent of int32_t the only way to get a known 32-bit container
    was int. But I agree with the notion that ILP64 should be universal now,
    and if you want/need something smaller, use some other type indicator
    than int.

    In many cases int is slower now than long -- which violates the notion
    of int from K&R days.

    That being said, one option would be to introduce another ABI and API
    with 64-bit int (and maybe 32-bit long short int), and programmers
    could choose whether to program for the ILP64 API, or the int=int32_t
    API. Would the ILP64 API/ABI fare better than x32? I doubt it, even
    though I would support it. This ship probably has sailed.

    - anton

  • From Kent Dickey@21:1/5 to [email protected] on Sat Sep 14 19:57:04 2024
    In article <vc4mgj$1khmk$[email protected]>,
    Thomas Koenig <[email protected]> wrote:
    Kent Dickey <[email protected]> schrieb:

    When you write code working on signed numbers and do something like:

    (a < 0) || (a >= max)

    Then the compiler realizes if you treat 'a' as unsigned, this is just:

    (unsigned)a >= max

    For which definition of a and max exactly?

    It certainly does not do so for

    _Bool foo(int a, int max)
    {
    return (a < 0) || (a >= max);
    }

    Sorry, I should have made it clear that this is for max >= 0 (but not
    necessarily an unsigned variable) and, in my code, max is a constant,
    which is how the compiler knows it's positive. I have this in my code
    all the time to
    validate function inputs--a negative number is bad, and a number beyond
    a certain reasonable value is bad. And I let the compiler optimize the
    check to (unsigned)a >= (unsigned)max.

    Kent
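    The equivalence Kent relies on, and why Thomas Koenig's counterexample
    defeats it, can be seen in a short sketch (function names hypothetical).
    When max >= 0, a negative a converts to an unsigned value of at least
    2^(N-1), which exceeds any non-negative int max converted to unsigned,
    so one unsigned compare covers both tests.

```c
#include <stdbool.h>

/* Two-sided check: true if a lies outside [0, max). */
bool out_of_range(int a, int max)
{
    return (a < 0) || (a >= max);
}

/* Single-compare form; equivalent to out_of_range only when max >= 0. */
bool out_of_range_unsigned(int a, int max)
{
    return (unsigned)a >= (unsigned)max;
}
```

    With a negative max (for example a = 5, max = -1) the two disagree, which
    is why a compiler can only apply the rewrite when it can prove max is
    non-negative, as with Kent's constants.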

  • From Thomas Koenig@21:1/5 to Michael S on Sat Sep 14 20:14:23 2024
    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs.

    Maybe I should be a little bit more precise in why I think this
    is an extremely bad idea.

    struct {
    char x[8];
    int y;
    } bar;

    Assume

    bar.y = 1234;
    bar.x[i] = 42; // The compiler does not know i
    // Do something with bar.y

    The compiler should then treat the access to bar.x[i] as if bar.y
    was clobbered by the assignment statement, and reload bar.y if
    it was kept in a register? That is the semantics you propose.

    So, either bar.y is treated as if it was volatile, or hard-to-detect
    bugs would appear because, with optimization, the assignment would
    sometimes change the value of bar.y and sometimes not.

  • From Bernd Linsel@21:1/5 to Kent Dickey on Sat Sep 14 22:18:12 2024
    On 14.09.24 21:57, Kent Dickey wrote:
    In article <vc4mgj$1khmk$[email protected]>,
    Thomas Koenig <[email protected]> wrote:
    Kent Dickey <[email protected]> schrieb:

    When you write code working on signed numbers and do something like:

    (a < 0) || (a >= max)

    Then the compiler realizes if you treat 'a' as unsigned, this is just:

    (unsigned)a >= max

    For which definition of a and max exactly?

    It certainly does not do so for

    _Bool foo(int a, int max)
    {
    return (a < 0) || (a >= max);
    }

    Sorry, I should have made it clear that this is for max >= 0 (but not necessarily an unsigned variable), and for my code, a constant, which is how the
    compiler knows it's positive. I have this in my code all the time to validate function inputs--a negative number is bad, and a number beyond
    a certain reasonable value is bad. And I let the compiler optimize the
    check to (unsigned)a >= (unsigned)max.

    Kent

    And that's the information the compiler was missing to optimize foo() in
    the same way:

    _Bool foo1(int a, int max)
    {
    if (__builtin_expect(max < 0, 0)) __builtin_unreachable();

    return a < 0 || a >= max;
    }


    _Bool foo2(int a, int max)
    {
    return (unsigned)a >= (unsigned)max;
    }

    compiles to:

    foo1:
    cmp edi, esi
    setnb al
    ret
    foo2:
    cmp edi, esi
    setnb al
    ret

    (x86-64 gcc 14.2 -Wall -Wextra -Wpedantic -O3 -fexpensive-optimizations)

    --
    Bernd Linsel

  • From Michael S@21:1/5 to Thomas Koenig on Sat Sep 14 23:53:40 2024
    On Sat, 14 Sep 2024 08:24:29 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    BGB <[email protected]> schrieb:
    On 9/13/2024 10:55 AM, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use
    for 20 years. There were a few bits that MSVC did not implement
    until relatively recently, but I think even they have caught up now.

    What about VLAs?


    IIRC, VLAs and _Complex and similar still don't work in MSVC.
    Most of the rest does now at least.

    It's only been 25 years. You have to give Microsoft a bit of
    time to catch up. I'm sure they will get there by 2099.

    Microsoft does not see ISO C as their primary language.
    They are willing to do the easy stuff, but seem very reluctant to
    implement anything that is fundamentally incompatible with C++.
    Both VLA and _Complex fall under the latter category.
    Both were optional in C11/17.
    However, in C23, while VLAs are still optional, variably-modified types,
    which are also fundamentally incompatible with C++, became mandatory.
    I wonder what Microsoft would do about it.
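To make the distinction concrete: a VLA is an array object whose length is fixed at run time (typically on the stack), whereas a variably-modified type only mentions a run-time length, for example a pointer to rows of cols doubles. A minimal sketch of the latter, which heap-allocates a matrix and still indexes it as m[i][j]; the function names are illustrative. C11 lets an implementation signal missing VLA support with __STDC_NO_VLA__, and, as noted above, C23 makes types like these mandatory even where automatic VLAs are not supported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* 'm' has a variably-modified type: pointer to arrays of 'cols' doubles.
   No VLA object is created here. */
static double sum_matrix(size_t rows, size_t cols, double (*m)[cols])
{
    double total = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            total += m[i][j];
    return total;
}

/* Heap-allocates a rows x cols matrix, sets m[i][j] = i * cols + j and
   returns the sum of all elements, or -1.0 if allocation fails. */
static double fill_and_sum(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double (*m)[cols] = malloc(rows * sizeof(double[cols]));
    if (m == NULL)
        return -1.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            m[i][j] = (double)(i * cols + j);
    double total = sum_matrix(rows, cols, m);
    free(m);
    return total;
}
```

The matrix lives on the heap, so nothing here depends on stack-allocated VLAs; only the pointer type carries the run-time dimension.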

  • From Michael S@21:1/5 to David Brown on Sun Sep 15 00:11:53 2024
    On Thu, 12 Sep 2024 16:34:31 +0200
    David Brown <[email protected]> wrote:

    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros
    for various things:
    Endianness (macros exist, typically compiler specific);
    And, apparently GCC and Clang can't agree on which strategy to use.
    Whether or not the target/compiler allows misaligned memory access;
    If set, one may use misaligned access.
    Whether or not memory uses a single address space;
    If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.


    I fully agree that C is not, and should not be seen as, a "high-level assembly language". But it is a language that is very useful to "hardware-type folks", and there are a few things that could make it
    easier to write more portable code if they were standardised. As it
    is, we just have to accept that some things are not portable.

    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.


    And how should that be defined?


    bar.x[8] = 42 should be defined to be the same as
    char tmp = 42;
    memcpy(&bar.y, &tmp, sizeof(tmp));
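The proposed meaning can already be written with memcpy, which is well-defined today; only the resulting value of bar.y varies between targets. A minimal sketch, assuming a 32-bit int and no padding between x and y (the code checks the latter); the helper names are illustrative. This does not make bar.x[8] itself defined; it only shows the effect the proposal would prescribe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct S {
    char x[8];
    int y;
};

/* Stores 42 into the lowest-addressed byte of y, which is the byte that
   bar.x[8] would land on if y directly follows x, and returns bar.y. */
static int store_first_byte_of_y(void)
{
    struct S bar;
    assert(offsetof(struct S, y) == sizeof bar.x);  /* no padding assumed */
    bar.y = 0;
    char tmp = 42;
    memcpy(&bar.y, &tmp, sizeof tmp);
    return bar.y;
}

/* Returns 1 if the lowest-addressed byte of an unsigned int holds its
   least significant bits (little-endian), 0 otherwise. */
static int is_little_endian(void)
{
    unsigned int one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);
    return first == 1;
}
```

On a little-endian machine the result is 42; on a big-endian one with a 32-bit int it is 42 << 24, matching the 42*2**24 mentioned earlier in the thread.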

  • From Michael S@21:1/5 to Thomas Koenig on Sun Sep 15 00:19:39 2024
    On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would
    be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.


    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs.

    Maybe I should be a little bit more precise in why I think this
    is an extremely bad idea.

    struct {
    char x[8];
    int y;
    } bar;

    Assume

    bar.y = 1234;
    bar.x[i] = 42; // The compiler does not know i
    // Do something with bar.y

    The compiler should then treat the access to bar.x[i] as if bar.y
    was clobbered by the assignment statement, and reload bar.y if
    it was kept in a register? That is the semantics you propose.


    Yes, exactly.

    So, either bar.y is treated as if it was volatile, or hard-to-detect
    bugs would appear because, with optimization, the assignment would
    sometimes change the value of bar.y and sometimes not.

    No, the semantics are that the compiler has to reload bar.y if it keeps it in
    a register. An optimizer that does anything else is buggy.

  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sat Sep 14 19:38:36 2024
    Thomas Koenig <[email protected]> writes:

    BGB <[email protected]> schrieb:

    On 9/13/2024 10:55 AM, Thomas Koenig wrote:

    David Brown <[email protected]> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use for 20
    years. There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.

    What about VLAs?

    IIRC, VLAs and _Complex and similar still don't work in MSVC.
    Most of the rest does now at least.

    It's only been 25 years. You have to give Microsoft a bit of
    time to catch up. I'm sure they will get there by 2099.

    Microsoft is never going to catch up because they don't want to
    catch up. The choice to offer a sub-standard C compiler is the
    result of a business decision, not a technical decision; they
    want to steer people away from open environments and towards
    their proprietary environments. The world would be a better
    place if Microsoft had been broken up in the judgment of the
    anti-trust action 20+ years ago. And they certainly deserved
    it.

  • From Tim Rentsch@21:1/5 to BGB on Sat Sep 14 20:07:03 2024
    BGB <[email protected]> writes:

    On 9/12/2024 5:12 AM, Tim Rentsch wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros for
    various things:
    Endianness (macros exist, typically compiler specific);
    And, apparently GCC and Clang can't agree on which strategy to use.
    Whether or not the target/compiler allows misaligned memory access;
    If set, one may use misaligned access.
    Whether or not memory uses a single address space;
    If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.

    There are a few ways things can go:
    Define rules, have one of N permutations for how those rules can go;
    How it often worked in practice.
    Throw up hands and say it is unknowable.
    What a lot of "portability" people assert.
    Do whatever gives the fastest results in standardized benchmarks.
    What many compiler maintainers want.

    These options come from the perspective of someone writing a
    compiler. That is very different from the perspective of someone
    writing a language definition. More than 60 years ago we learned
    the lesson that we shouldn't let machine architectures be defined
    just by what the hardware does. The same lesson applies to
    defining a programming language just by what compilers do, or
    even just what compilers can do.

  • From Thomas Koenig@21:1/5 to Michael S on Sun Sep 15 08:05:47 2024
    Michael S <[email protected]> schrieb:
    On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would
    be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.


    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs.

    Maybe I should be a little bit more precise in why I think this
    is an extremely bad idea.

    struct {
    char x[8];
    int y;
    } bar;

    Assume

    bar.y = 1234;
    bar.x[i] = 42; // The compiler does not know i
    // Do something with bar.y

    The compiler should then treat the access to bar.x[i] as if bar.y
    was clobbered by the assignment statement, and reload bar.y if
    it was kept in a register? That is the semantics you propose.


    Yes, exactly.

    So, volatile for all structs, plus prescribed behavior on
    array overruns.

    At the risk of repeating myself: This is an extremely bad idea.

    I rest my case.

  • From Michael S@21:1/5 to Thomas Koenig on Sun Sep 15 12:50:06 2024
    On Sun, 15 Sep 2024 08:05:47 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by
    implementation. And in practice it is. Just not in
    theory.

    Do you mean union rather than struct? And do you mean
    bar.x[7] rather than bar.x[8]? Surely no one would expect
    that storing into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think
    should be defined by the C standard but is not? And the
    same question for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior
    would be bar.y==42 on LE machines and bar.y==42*2**24 on BE
    machines. As it actually happens in reality with all
    production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers
    (TM) have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs.

    Maybe I should be a little bit more precise in why I think this
    is an extremely bad idea.

    struct {
    char x[8];
    int y;
    } bar;

    Assume

    bar.y = 1234;
    bar.x[i] = 42; // The compiler does not know i
    // Do something with bar.y

    The compiler should then treat the access to bar.x[i] as if bar.y
    was clobbered by the assignment statement, and reload bar.y if
    it was kept in a register? That is the semantics you propose.


    Yes, exactly.

    So, volatile for all structs,

    No.
    Accesses to fields of a struct should be ordered only relative to
    accesses to other fields *of the same instance* of the struct. And,
    of course, usual 'as if' applies, so optimizing compiler can figure out
    that bar.x[7] and bar.y do not overlap and thus generate code knowing
    that write to one does not clobber the other.
    That's pretty far from semantics of volatile.

    plus prescribed behavior on array overruns.

    Only within the bounds of the struct. bar.x[12] remains UB.


    At the risk of repeating myself: This is an extremely bad idea.

    I rest my case.

    You seem to think that C should be as optimizable and as full of UBs as Fortran. Many compiler authors agree with you.
    I have a different idea. IMHO, your party exploits the letter of the C
    standard in violation of its spirit.

  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Sun Sep 15 12:30:22 2024
    Waldek Hebisch <[email protected]> schrieb:

    [...]

    struct {
    char x[8]
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    That has two drawbacks: minor one that you need to know that
    there are no padding between 'x' and 'y'.

    Similar to Fortran's problems with unaligned variables in COMMON
    blocks.

    Major drawback
    is that it would forbid bounds checking for array accesses.
    In code like above it is easy to spot out of bound access at
    compile time.

    And it happens:

    $ cat x.c

    struct {
    char x[8];
    int y;
    } bar;

    void foo()
    {
    bar.y = 0;
    bar.x[8] = 42;
    }
    $ gcc -O2 -c x.c
    x.c: In function 'foo':
    x.c:10:12: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
    10 | bar.x[8] = 42;
    | ~~~~~~~~~^~~~
    x.c:3:9: note: at offset 8 into destination object 'x' of size 8
    3 | char x[8];
    | ^

  • From Thomas Koenig@21:1/5 to Michael S on Sun Sep 15 12:38:32 2024
    On 2024-09-15, Michael S <[email protected]> wrote:

    You seem to think that C should be as optimizable and as full of UBs as Fortran.

    The only place where "undefined behavior" is mentioned in the Fortran
    standards is with reference to C.

    Many compiler authors agree with you.
    I have a different idea.

    You don't appear to believe in specifications.

    IMHO, your party exploits the letter of the C
    standard in violation of its spirit.

    If you meet the spirit of the C standard, say hello to him for me.

  • From Michael S@21:1/5 to Waldek Hebisch on Sun Sep 15 15:40:38 2024
    On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
    Waldek Hebisch <[email protected]> wrote:

    Michael S <[email protected]> wrote:
    On Thu, 12 Sep 2024 16:34:31 +0200
    David Brown <[email protected]> wrote:

    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    I fully agree that C is not, and should not be seen as, a
    "high-level assembly language". But it is a language that is very
    useful to "hardware-type folks", and there are a few things that
    could make it easier to write more portable code if they were
    standardised. As it is, we just have to accept that some things
    are not portable.
    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.


    And how should that be defined?


    bar.x[8] = 42 should be defined to be the same as
    char tmp = 42;
    memcpy(&bar.y, &tmp, sizeof(tmp));

    That has two drawbacks: minor one that you need to know that
    there are no padding between 'x' and 'y'.

    Padding is another thing that should be Implementation Defined.
    I.e. compiler should provide complete documentation of its padding
    algorithms.
    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of an integer type with the same or narrower width, then
    there should be no padding in-between.

    Major drawback
    is that it would forbid bounds checking for array accesses.
    In code like above it is easy to spot out of bound access at
    compile time. Even with variable index compiler knows size
    of 'x' so can insert bounds checking code (and AFAIK if you
    insist leading compilers will do this).

    More generally, assuming a cooperating compiler, modern C has enough
    features to eliminate out of bounds array indexing.

    In general, only by means of fat pointers.
    Fat pointers break existing ABIs.
    Also, if fat pointers are what I want, then I already have them in a few
    mainstream languages where they are integrated much better than they
    will ever be in "checked C".

    More precisely,
    I mean compiler which inserts bounds check where they are needed
    and warns or rejects constructs that can not be checked. I claim
    that it is possible to write nontrivial programs in "checked C".
    With change as above very important language construct would be
    uncheckable.

    BTW: If you need such behaviour you can get what you want by
    using unions, so there is no need to break language for folks
    that do not need this.


    Such behavior is sometimes handy, but I can easily live without it.
    Its potential usefulness is not my motivation. My motivation is
    eliminating as many UBs as is practically possible.

  • From Michael S@21:1/5 to Thomas Koenig on Sun Sep 15 15:46:03 2024
    On Sun, 15 Sep 2024 12:38:32 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    On 2024-09-15, Michael S <[email protected]> wrote:

    You seem to think that C should be as optimizable and as full of
    UBs as Fortran.

    The only place where "undefined behavior" is mentioned in the Fortran standards is with reference to C.


    The rest of the time they write "program shouldn't" or "when xyz
    the program is ill-formed" or something like that. But the meaning is
    exactly the same as UB in C.

    Many compiler authors agree with you.
    I have a different idea.

    You don't appear to believe in specifications.

    IMHO, your party exploits the letter of the C
    standard in violation of its spirit.

    If you meet the spirit of the C standard, say hello to him for me.

    If I meet him, I'd try to drink with him.

  • From Waldek Hebisch@21:1/5 to Michael S on Sun Sep 15 12:19:02 2024
    Michael S <[email protected]> wrote:
    On Thu, 12 Sep 2024 16:34:31 +0200
    David Brown <[email protected]> wrote:

    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    I fully agree that C is not, and should not be seen as, a "high-level
    assembly language". But it is a language that is very useful to
    "hardware-type folks", and there are a few things that could make it
    easier to write more portable code if they were standardised. As it
    is, we just have to accept that some things are not portable.

    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.


    And how should that be defined?


    bar.x[8] = 42 should be defined to be the same as
    char tmp = 42;
    memcpy(&bar.y, &tmp, sizeof(tmp));

    That has two drawbacks: minor one that you need to know that
    there are no padding between 'x' and 'y'. Major drawback
    is that it would forbid bounds checking for array accesses.
    In code like above it is easy to spot out of bound access at
    compile time. Even with variable index compiler knows size
    of 'x' so can insert bounds checking code (and AFAIK if you
    insist leading compilers will do this).
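GCC and Clang can insert such checks automatically with -fsanitize=bounds. A minimal hand-written equivalent: because x is a true array member, sizeof gives its declared length and each check costs a single comparison. The macro and function names are hypothetical, and the assert-based macro disappears when NDEBUG is defined.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical checked subscript: fails an assert() if i is outside the
   declared bounds of arr.  Note that i is evaluated twice. */
#define CHECKED(arr, i) \
    ((arr)[(assert((size_t)(i) < ARRAY_LEN(arr)), (size_t)(i))])

struct S {
    char x[8];
    int y;
};

/* Non-aborting variant: stores v and returns 0 for a valid index,
   returns -1 and leaves *s untouched otherwise. */
static int store_x_checked(struct S *s, size_t i, char v)
{
    if (i >= ARRAY_LEN(s->x))
        return -1;
    CHECKED(s->x, i) = v;
    return 0;
}
```

With this in place, an index of 8 is reported instead of silently overwriting the first byte of y.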

    More generally, assuming a cooperating compiler, modern C has enough
    features to eliminate out of bounds array indexing. More precisely,
    I mean compiler which inserts bounds check where they are needed
    and warns or rejects constructs that can not be checked. I claim
    that it is possible to write nontrivial programs in "checked C".
    With change as above very important language construct would be
    uncheckable.

    BTW: If you need such behaviour you can get what you want by
    using unions, so there is no need to break language for folks
    that do not need this.
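A minimal sketch of the union route, reusing the struct from the thread. The union overlays the struct with a byte array of the same size, so raw[8] is an in-bounds access that aliases the first byte of y, while s.x keeps bounds a checker can still enforce. The union and function names are illustrative, and the layout assumption (no padding between x and y) is checked rather than presumed.

```c
#include <assert.h>
#include <stddef.h>

struct S {
    char x[8];
    int y;
};

/* The byte array spans the whole struct, so indexing it past the end of
   s.x is still within the bounds of raw. */
union S_bytes {
    struct S s;
    unsigned char raw[sizeof(struct S)];
};

/* Writes 42 through raw[8] and reads the result back through s.y, which
   C defines as reinterpreting the stored bytes (type punning via a union). */
static int store_through_raw(void)
{
    union S_bytes u = { .s = { .y = 0 } };
    assert(offsetof(struct S, y) == sizeof u.s.x);  /* raw[8] overlays y */
    u.raw[8] = 42;
    return u.s.y;
}
```

The value read back depends on byte order, exactly as with the out-of-bounds store, but here every access stays within the bounds of the array it uses.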

    --
    Waldek Hebisch

  • From John Dallman@21:1/5 to Michael S on Sun Sep 15 15:41:00 2024
    In article <[email protected]>, [email protected] (Michael S) wrote:

    Padding is another thing that should be Implementation Defined.
    I.e. compiler should provide complete documentation of its padding algorithms.

    It is, and they do. I've used a lot of different compilers over the last
    29 years, needing to know about padding for a DIY varargs, and I've never
    had problems with finding out what the padding was.

    It can usually be described quite briefly, by saying that all data types
    are naturally aligned. The only variant of that I've encountered is on
    32-bit x86 Linux and 32-bit POWER AIX where in both cases 8-byte doubles
    were 4-byte aligned.
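The natural-alignment rule is easy to check with offsetof: the offset of a member placed right after a lone char shows the alignment the ABI gives that member's type. A minimal sketch; the struct and function names are illustrative. On x86-64 one would expect 8 for double, and per the note above, 4 on 32-bit x86 Linux.

```c
#include <assert.h>
#include <stddef.h>

/* A lone char followed by a double: any padding after 'c' reflects the
   alignment the ABI uses for a double member. */
struct char_double {
    char c;
    double d;
};

/* The struct from this thread. */
struct S {
    char x[8];
    int y;
};

/* Alignment the ABI applies to a double inside a struct. */
static size_t double_member_alignment(void)
{
    size_t a = offsetof(struct char_double, d);
    assert(a != 0 && (a & (a - 1)) == 0);   /* expected to be a power of two */
    return a;
}

/* Bytes of padding between x and y; expected to be 0 on naturally aligned
   ABIs, since 8 is a multiple of any usual int alignment. */
static size_t padding_before_y(void)
{
    return offsetof(struct S, y) - sizeof ((struct S *)0)->x;
}
```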

    The C standard specifies that struct members shall be stored in memory in
    the same order as they appear in the declaration. It does not specify
    padding because the standard committee feel they need to allow C to work
    on machines that are not byte-addressed or are otherwise weird.

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when field of one integer type is immediately followed
    by another field of integer type with the same or narrower width then
    there should be no padding in-between.

    That would be fine if you were willing to confine yourself to
    byte-addressed machines.

    John

  • From David Brown@21:1/5 to Thomas Koenig on Sun Sep 15 17:50:15 2024
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the notion
    of int from K&R days.

    That's a designer's choice, I think. It is possible to add 32-bit instructions which should be as fast as (or possibly faster than)
    64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not so
    easy without either making rather complicated instructions or having
    assembly instructions with undefined behaviour (imagine the terror that
    would bring to some people!).

    A classic example would be for "y = p[x++];" in a loop. For a 64-bit
    type x, you would set up one register once with "p + x", and then have a
    load with post-increment instruction in the loop. You can also do that
    with x as a 32-bit int, unless you are of the opinion that enough apples
    added to a pile should give a negative number of apples. But with a
    wrapping type for x - such as unsigned int in C or modulo types in Ada,
    you have little choice but to hold "p" and "x" separately in registers,
    add them for every load, and do the increment and modulo operation. I
    really can't see this all being handled by a single instruction.
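The two loop shapes contrasted above, as a minimal sketch for comparing compiler output (for instance gcc -O2 -S on a 64-bit target with 32-bit int). Both return the same sum when the index never wraps; the comments describe the code generation one would typically expect, not a guarantee. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Signed index: overflow of x would be undefined behaviour, so the
   compiler may assume it never wraps and simply advance one 64-bit
   pointer per iteration. */
static long sum_int_index(const int *p, int start, int n)
{
    assert(p != NULL);
    long total = 0;
    int x = start;
    for (int k = 0; k < n; ++k)
        total += p[x++];
    return total;
}

/* Unsigned index: x++ is defined to wrap modulo UINT_MAX + 1, so unless
   the compiler can prove that never happens it has to keep x as a 32-bit
   value and re-form p + x (with zero extension) for every load. */
static long sum_unsigned_index(const int *p, unsigned start, int n)
{
    assert(p != NULL);
    long total = 0;
    unsigned x = start;
    for (int k = 0; k < n; ++k)
        total += p[x++];
    return total;
}
```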

    Of course you could add a 32-bit zero extend or sign extend to many
    32-bit ALU instructions and save some instructions - many architectures
    already support that kind of thing.


    And having a smaller memory footprint is also beneficial, especially
    for caches.

    (Plus, there are FORTRAN's storage association rules, but these should
    be less used by now. But for a 64-bit integer, they pretty much would require a 64-bit REAL and a 128-bit DOUBLE PRECISION).

  • From Scott Lurndal@21:1/5 to Michael S on Sun Sep 15 15:45:39 2024
    Michael S <[email protected]> writes:
    On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
    Waldek Hebisch <[email protected]> wrote:

    That has two drawbacks: minor one that you need to know that
    there are no padding between 'x' and 'y'.

    Padding is another thing that should be Implementation Defined.
    I.e. compiler should provide complete documentation of its padding algorithms.

    This is definitely in the realm of the processor ABI, not
    the compiler. And, most processor ABIs do document the
    padding requirements (which generally reflect optimal hardware
    access rules).

    Most C and C++ compilers provide support for "packed" structures
    when the programmer wishes explicit control over structure
    member layout.
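A minimal sketch using the GCC/Clang packed attribute (MSVC spells this #pragma pack instead); the struct and function names are illustrative. The default layout follows the ABI's padding rules, while the packed one places value immediately after tag, at the price of a possibly misaligned int.

```c
#include <assert.h>
#include <stddef.h>

/* Default layout: the ABI inserts padding after 'tag' so that 'value'
   lands on an int-aligned offset. */
struct rec_natural {
    char tag;
    int value;
};

/* GCC/Clang extension: no padding at all, at the cost of a possibly
   misaligned 'value' (slower, or needing byte-wise access code, on
   some targets). */
struct __attribute__((packed)) rec_packed {
    char tag;
    int value;
};

/* Checks the layouts described above; fails an assert() otherwise. */
static void check_layouts(void)
{
    assert(offsetof(struct rec_packed, value) == 1);
    assert(sizeof(struct rec_packed) == 1 + sizeof(int));
    assert(offsetof(struct rec_natural, value) % _Alignof(int) == 0);
}
```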

  • From David Brown@21:1/5 to Stephen Fuld on Sun Sep 15 17:53:32 2024
    On 13/09/2024 22:09, Stephen Fuld wrote:
    On 9/3/2024 4:14 PM, David Brown wrote:
    On 03/09/2024 18:54, Stephen Fuld wrote:
    On 9/2/2024 11:23 PM, David Brown wrote:
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:

    Anyway, that is all mostly moot since I'm using Rust for this kind of programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?
    Can you talk about the advantages and disadvantages of Rust versus C? >>>>>

    And also for Rust versus C++ ?

    I asked about C versus Rust as Terje explicitly mentioned those two
    languages, but you make a good point in general.


    I want to know about both :-)

    In my field, small-systems embedded development, C has been dominant
    for a long time, but C++ use is increasing.  Most of my new stuff in
    recent times has been C++.  There are some in the field who are trying
    out Rust, so I need to look into it myself - either because it is a
    better choice than C++, or because customers might want it.



    My impression - based on hearsay for Rust as I have no experience -
    is that the key point of Rust is memory "safety".  I use
    scare-quotes here, since it is simply about correct use of dynamic
    memory and buffers.

    I agree that memory safety is the key point, although I gather that
    it has other features that many programmers like.


    Sure.  There are certainly plenty of things that I think are a better
    idea in a modern programming language and that make it a good step up
    compared to C.  My key interest is in comparison to C++ - it is a step
    up in some ways, a step down in others, and a step sideways in many
    features.  But is it overall up or down, for /my/ uses?

    Examples of things that I think are good in Rust are making variables
    immutable by default and pattern matching.  Steps down include lack of
    function overloading

    Rust's generic functions are not sufficient?


    I don't know Rust well enough to say for sure, but certainly in C++ a
    generic function (a template function) and an overloaded function are completely different things.

  • From David Brown@21:1/5 to Michael S on Sun Sep 15 18:02:35 2024
    On 14/09/2024 23:11, Michael S wrote:
    On Thu, 12 Sep 2024 16:34:31 +0200
    David Brown <[email protected]> wrote:

    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    [...]

    Would be nice, say, if there were semi-standard compiler macros
    for various things:
      Endianness (macros exist, typically compiler specific);
        And, apparently GCC and Clang can't agree on which strategy
        to use.
      Whether or not the target/compiler allows misaligned memory access;
        If set, one may use misaligned access.
      Whether or not memory uses a single address space;
        If set, all pointer comparisons are allowed.

    [elaborations on the above]

    I suppose it's natural for hardware-type folks to want features
    like this to be part of standard C. In a sense what is being
    asked is to make C a high-level assembly language. But that's
    not what C is. Nor should it be.


    I fully agree that C is not, and should not be seen as, a "high-level
    assembly language". But it is a language that is very useful to
    "hardware-type folks", and there are a few things that could make it
    easier to write more portable code if they were standardised. As it
    is, we just have to accept that some things are not portable.

    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.


    And how should that be defined?


    bar.x[8] = 42 should be defined to be the same as
    char tmp = 42;
    memcpy(&bar.y, &tmp, sizeof(tmp));


    No, it should not.

    It should be "defined" like any other buffer overflow - if there is some
    kind of checking mechanism possible and enabled, at compile time or
    run-time, then that should trigger and tell you you've got a bug in your
    code. If not - well, that's the way programming works. You are
    responsible for writing correct code.

    If you want the behaviour you describe here, then you might like to try:

    union {
        char x[9];
        struct {
            char padding[8];
            int y;
        };
    } bar;
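    For comparison, writing one byte into the first byte of `y` already has defined behaviour today if it is done through the object representation rather than through an out-of-bounds index. A minimal sketch, assuming 8-bit bytes and no padding between `x` and `y` (checked at compile time); `poke_first_byte_of_y` and `host_is_little_endian` are hypothetical helper names. Byte order is the one implementation-specific part: the same store gives `y == 42` on a little-endian machine and `42 << 24` on a big-endian one with 4-byte `int`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct Bar {
    char x[8];
    int  y;
};

/* The sketch relies on y directly following x (no padding),
   which holds on mainstream ABIs; check it at compile time. */
static_assert(offsetof(struct Bar, y) == 8, "no padding between x and y");

/* Store 'value' into the first byte of b->y, the byte an out-of-bounds
   b->x[8] would land on. Copying through unsigned char is well defined. */
static void poke_first_byte_of_y(struct Bar *b, char value)
{
    memcpy((unsigned char *)b + offsetof(struct Bar, y), &value, 1);
}

/* Returns 1 if the lowest-addressed byte of an unsigned int holds
   its least significant bits, i.e. the host is little-endian. */
static int host_is_little_endian(void)
{
    unsigned int one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);
    return first == 1;
}
```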


    I can understand people wanting C to behave in a different way from the
    way it is defined. I can understand people wanting to write code that
    seems simple, clear and efficient, even though the C rules say it is
    wrong. I can understand people wanting to continue using code
    constructs that they know are wrong, because they used to get away with
    it. I can understand people wanting some kind of limits to how bad
    things can go for undefined behaviour (I think this comes from some
    fundamental misunderstandings about how programming works, but I can
    understand people wanting it).

    But I really cannot get my head around the idea that someone would want
    to be able to write code that is /clearly/ wrong, totally unnecessary,
    and /clearly/ against the rules of the language, but somehow want the
    compiler to give specific behaviour to that mistake.

    It's like saying you want "1 / 0" to be defined as 6, because 6 is your favourite number.

  • From David Brown@21:1/5 to Michael S on Sun Sep 15 18:09:56 2024
    On 15/09/2024 14:40, Michael S wrote:
    On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
    Waldek Hebisch <[email protected]> wrote:

    Michael S <[email protected]> wrote:
    On Thu, 12 Sep 2024 16:34:31 +0200
    David Brown <[email protected]> wrote:

    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    I fully agree that C is not, and should not be seen as, a
    "high-level assembly language". But it is a language that is very
    useful to "hardware-type folks", and there are a few things that
    could make it easier to write more portable code if they were
    standardised. As it is, we just have to accept that some things
    are not portable.
    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.


    And how should that be defined?


    bar.x[8] = 42 should be defined to be the same as
    char tmp = 42;
    memcpy(&bar.y, &tmp, sizeof(tmp));

    That has two drawbacks. The minor one is that you need to know that
    there is no padding between 'x' and 'y'.

    Padding is another thing that should be Implementation Defined.

    It is.

    I.e. compiler should provide complete documentation of its padding algorithms.

    They do. Or, they should. Often they are lazy and say "defined by the platform ABI". Really, it is only the alignments that are needed.

    C defines the minimum padding between members in a struct - you get the
    padding needed to ensure that members are correctly aligned. I don't
    think the C standards disallow additional padding, but it would be an extraordinarily strange implementation if there were anything more than
    this minimum padding.

    But I certainly wouldn't mind if the standards dictated this minimum
    padding, and then there would be nothing left to the implementation
    other than alignments.
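    The "minimum padding" rule can be written down as arithmetic on alignments: each member goes at the next offset that is a multiple of its alignment, and the total size is rounded up to the struct's own alignment. A minimal sketch that checks a compiler's actual layout against that rule; `round_up`, `struct mixed` and `layout_is_minimal` are hypothetical names, and since the standard does not forbid extra padding this is a check on the implementation, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round 'offset' up to a multiple of 'align' (a power of two). */
static size_t round_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

struct mixed {
    char    c;
    int32_t i;
    int16_t s;
};

/* Returns 1 if struct mixed has exactly the minimum padding implied by
   its member alignments: each member at the next aligned offset and the
   total size rounded up to the alignment of the struct itself. */
static int layout_is_minimal(void)
{
    size_t off_i = round_up(sizeof(char), _Alignof(int32_t));
    size_t off_s = round_up(off_i + sizeof(int32_t), _Alignof(int16_t));
    size_t size  = round_up(off_s + sizeof(int16_t), _Alignof(struct mixed));
    return offsetof(struct mixed, i) == off_i
        && offsetof(struct mixed, s) == off_s
        && sizeof(struct mixed) == size;
}
```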

    In addition, some padding-related things can be defined by the Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by another field of an integer type with the same or narrower width, then
    there should be no padding in-between.


  • From David Brown@21:1/5 to John Dallman on Sun Sep 15 18:14:40 2024
    On 15/09/2024 16:41, John Dallman wrote:
    In article <[email protected]>, [email protected] (Michael S) wrote:

    Padding is another thing that should be Implementation Defined.
    I.e. compiler should provide complete documentation of its padding
    algorithms.

    It is, and they do. I've used a lot of different compilers over the last
    29 years, needing to know about padding for a DIY varargs, and I've never
    had problems with finding out what the padding was.

    It can usually be described quite briefly, by saying that all data types
    are naturally aligned. The only variant of that I've encountered is on
    32-bit x86 Linux and 32-bit POWER AIX where in both cases 8-byte doubles
    were 4-byte aligned.


    It is better to say types are naturally aligned up to a maximum
    appropriate for the architecture (usually the width of general-purpose registers and/or pointers). Then there are far fewer exceptions.

    (So on 8-bit devices you usually see single byte alignment even for
    64-bit types.)

    The C standard specifies that struct members shall be stored in memory in
    the same order as they appear in the declaration. It does not specify
    padding because the standard committee feel they need to allow C to work
    on machines that are not byte-addressed or are otherwise weird.


    It specifies that there can be padding between members, and members need
    to be aligned, so it gives the minimum padding (though the alignment requirements are implementation-defined). But it gives no maximum
    padding, AFAIK.

    In addition, some padding-related things can be defined by the Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed
    by another field of an integer type with the same or narrower width, then
    there should be no padding in-between.

    That would be fine if you were willing to confine yourself to
    byte-addressed machines.


    There would not be padding between one integer type and another member
    of the same or smaller integer type, unless you have a very odd
    architecture or niche features (like, say, an int24_t with 1-byte
    alignment followed by an int16_t with 2-byte alignment).

  • From Waldek Hebisch@21:1/5 to Michael S on Sun Sep 15 16:43:45 2024
    Michael S <[email protected]> wrote:
    On Sun, 15 Sep 2024 08:05:47 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by
    implementation. And in practice it is. Just not in
    theory.

    Do you mean union rather than struct? And do you mean
    bar.x[7] rather than bar.x[8]? Surely no one would expect
    that storing into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think
    should be defined by the C standard but is not? And the
    same question for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior
    would be bar.y==42 on LE machines and bar.y==42*2**24 on BE
    machines. As it actually happens in reality with all
    production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers
    (TM) have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs.

    Maybe I should be a little bit more precise about why I think this
    is an extremely bad idea.

    struct {
    char x[8];
    int y;
    } bar;

    Assume

    bar.y = 1234;
    bar.x[i] = 42; // The compiler does not know i
    // Do something with bar.y

    The compiler should then treat the access to bar.x[i] as if bar.y
    was clobbered by the assignment statement, and reload bar.y if
    it was kept in a register? That is the semantics you propose.


    Yes, exactly.

    So, volatile for all structs,

    No.
    Accesses to fields of a struct should be ordered only relative to
    accesses to other fields *of the same instance* of the struct. And,
    of course, the usual 'as if' rule applies, so an optimizing compiler can figure out
    that bar.x[7] and bar.y do not overlap and thus generate code knowing
    that a write to one does not clobber the other.
    That's pretty far from the semantics of volatile.

    plus prescribed behavior on array overruns.

    Only within the bounds of the struct. bar.x[12] remains UB.


    At the risk of repeating myself: This is an extremely bad idea.

    I rest my case.

    You seem to think that C should be as optimizable and as full of UBs as Fortran. Many compiler authors agree with you.
    I have a different idea. IMHO, your party exploits the letter of the C
    standard in violation of its spirit.

    In my copy of (a translation of) K&R there is a passage which
    says that C tries to define useful things, but, unlike PL/I, does
    not define things just to make them defined. And the PL/I experience
    was that many defined behaviours were bugs, but due to the language
    definition the compiler silently accepted them and generated code.
    The trouble was that the program was doing something different from
    what the programmer intended. Anyway, the passage in K&R that I mention
    and advice given in other places (like "implementation may do
    different things, program should not depend on any particular
    behaviour") for me means that UB was part of the _original_ C spirit.
    Later came folks from the "do not break my code" camp, and they do
    have _some_ points. But they do not represent the point of view
    of the creators of the language.

    BTW: The wording in the Pascal standard is quite different; in particular,
    Pascal uses the terms "error" and "behaviour not defined by the standard".
    But the spirit is the same as C: break the rules and your program
    may do whatever it wishes. The main difference is that C adopted the
    "trust the programmer" philosophy and offers several unsafe
    constructs not present in Pascal. And with the unsafe constructs
    came the associated UB.

    --
    Waldek Hebisch

  • From Waldek Hebisch@21:1/5 to Michael S on Sun Sep 15 16:22:13 2024
    Michael S <[email protected]> wrote:
    On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
    Waldek Hebisch <[email protected]> wrote:

    Michael S <[email protected]> wrote:
    On Thu, 12 Sep 2024 16:34:31 +0200
    David Brown <[email protected]> wrote:

    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:

    BGB <[email protected]> writes:

    I fully agree that C is not, and should not be seen as, a
    "high-level assembly language". But it is a language that is very
    useful to "hardware-type folks", and there are a few things that
    could make it easier to write more portable code if they were
    standardised. As it is, we just have to accept that some things
    are not portable.
    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.


    And how should that be defined?


    bar.x[8] = 42 should be defined to be the same as
    char tmp = 42;
    memcpy(&bar.y, &tmp, sizeof(tmp));

    That has two drawbacks. The minor one is that you need to know that
    there is no padding between 'x' and 'y'.

    Padding is another thing that should be Implementation Defined.
    I.e. the compiler should provide complete documentation of its padding algorithms.
    In addition, some padding-related things can be defined by the Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by another field of an integer type with the same or narrower width, then
    there should be no padding in-between.

    The major drawback
    is that it would forbid bounds checking for array accesses.
    In code like the above it is easy to spot an out-of-bounds access at
    compile time. Even with a variable index, the compiler knows the size
    of 'x', so it can insert bounds-checking code (and AFAIK, if you
    insist, the leading compilers will do this).

    More generally, assuming a cooperating compiler, modern C has enough
    features to eliminate out-of-bounds array indexing.

    In general, only by means of fat pointers.
    Fat pointers break existing ABIs.
    Also, if fat pointers are what I want, then I already have them in a few mainstream languages, where they are integrated much better than they
    will ever be in "checked C".

    No. When the array declaration (or allocation) is visible, adding checks
    is trivial, so the problem is passing size information to functions.
    As long as arrays have fixed sizes, one can declare the size of a function
    argument using the qualifier "static", like in

    void foo(int a[static 20]);

    For arrays of variable size there are variably modified types.
    The standard botched this, essentially saying that size info in
    the prototype should be ignored, but in non-conforming mode a
    compiler may require size info and check it for correctness.

    The point is that a "checked program" can be compiled by a standard
    C compiler. And as long as all accesses are in bounds, "checked
    code" is ABI compatible with unchecked code. Of course,
    if you take a random C program, then with probability close to 1
    it will be rejected by a checking compiler. But if you pass
    variable-sized arrays, the called routine needs _some_ way to
    find out how big the array is. And using variably modified types is a reasonable
    way to pass size info. To make them more useful, variably modified types should be
    beefed up, in particular to cover pointers inside structures.
    But even as it is now one can write useful checkable programs
    in C.
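    A minimal sketch of the two forms described above, assuming a C99-or-later compiler that supports variably modified types (optional since C11; an implementation without them defines `__STDC_NO_VLA__`). `[static n]` promises the caller passes at least `n` elements, and `double (*m)[cols]` carries the row length in the parameter's type. The function names are hypothetical, and whether a given compiler uses these sizes for warnings or run-time checks varies; the declarations themselves are standard C.

```c
#include <assert.h>
#include <stddef.h>

/* The caller promises that 'a' points to at least 'n' doubles. */
static double sum(size_t n, const double a[static n])
{
    double total = 0.0;
    for (size_t k = 0; k < n; k++)
        total += a[k];
    return total;
}

/* 'm' points to 'rows' rows of 'cols' doubles each. The row length is
   part of the parameter's type, so m[r][c] indexes correctly and a
   checking compiler has the sizes it needs. */
static double trace(size_t rows, size_t cols, double (*m)[cols])
{
    assert(rows == 0 || m != NULL);
    double t = 0.0;
    size_t n = rows < cols ? rows : cols;
    for (size_t k = 0; k < n; k++)
        t += m[k][k];
    return t;
}
```

    Both functions have the same ABI as their plain-pointer equivalents; the size information lives only in the types.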

    --
    Waldek Hebisch

  • From Scott Lurndal@21:1/5 to Robert Finch on Sun Sep 15 17:07:58 2024
    Robert Finch <[email protected]> writes:
    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, which matters
    if it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;
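    An alternative that sidesteps bit-field allocation order is to read the register as a whole `uint64_t` and extract fields with shifts and masks. Arithmetic on the loaded value does not depend on byte order. A minimal sketch using the field positions from the struct above (sbe in bits 0..8, dbe in bits 32..40); the macro and function names are hypothetical, and a real driver would also need `volatile` access to the device register itself.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions from GIC_ECC_INT_STATUSR: sbe = bits 8..0,
   dbe = bits 40..32 (bit 0 is the least significant bit). */
#define ECC_SBE_SHIFT  0
#define ECC_DBE_SHIFT  32
#define ECC_FIELD_MASK UINT64_C(0x1ff)   /* both fields are 9 bits wide */

static_assert(ECC_SBE_SHIFT + 9 <= ECC_DBE_SHIFT,
              "sbe and dbe fields must not overlap");

static inline uint32_t ecc_sbe(uint64_t reg)
{
    return (uint32_t)((reg >> ECC_SBE_SHIFT) & ECC_FIELD_MASK);
}

static inline uint32_t ecc_dbe(uint64_t reg)
{
    return (uint32_t)((reg >> ECC_DBE_SHIFT) & ECC_FIELD_MASK);
}

/* Build a value with only the given dbe bits set, e.g. for writing
   to a write-1-to-clear (R/W1C) register. */
static inline uint64_t ecc_dbe_bits(uint32_t bits)
{
    return ((uint64_t)bits & ECC_FIELD_MASK) << ECC_DBE_SHIFT;
}
```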

  • From Anton Ertl@21:1/5 to Michael S on Sun Sep 15 17:46:12 2024
    Michael S <[email protected]> writes:
    Padding is another thing that should be Implementation Defined.

    It is. It's defined in the ABI, so when the compiler documents to
    follow some ABI, you automatically get that ABI's structure layout.
    And if a compiler does not follow an ABI, it is practically useless.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sun Sep 15 17:21:42 2024
    On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

    Robert Finch <[email protected]> writes:
    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by the Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of an integer type with the same or narrower width, then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, which matters
    if it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;

    Which brings to mind a slightly different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a
    register-sized container, how does one specify that bit-field in C?

    So, assume a register contains 64 bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C?

  • From David Brown@21:1/5 to Michael S on Sun Sep 15 20:13:44 2024
    On 14/09/2024 23:19, Michael S wrote:
    On Sat, 14 Sep 2024 20:14:23 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 21:39:39 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    struct {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
    char x[8];
    int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would
    be bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.


    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    What I wrote is how all production C compilers work today. So it
    will add no new bugs.

    Maybe I should be a little bit more precise about why I think this
    is an extremely bad idea.

    struct {
    char x[8];
    int y;
    } bar;

    Assume

    bar.y = 1234;
    bar.x[i] = 42; // The compiler does not know i
    // Do something with bar.y

    The compiler should then treat the access to bar.x[i] as if bar.y
    was clobbered by the assignment statement, and reload bar.y if
    it was kept in a register? That is the semantics you propose.


    Yes, exactly.


    Contrary to your imagination - compilers have /never/ followed your
    proposed semantics. The oldest gcc version I found on godbolt.org is
    3.4.6 from 2006, and given:

    struct Bar {
    char x[8];
    int y;
    } bar;


    int foo(int i) {
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
    }

    It generates:

    foo:
    movslq %edi,%rdi
    movl $1234, %eax
    movl $1234, bar+8(%rip)
    movb $42, bar(%rdi)
    ret

    That is, y is /not/ reloaded after bar.x[i] is set.

    Your proposed semantics are extremely unexpected for most C developers,
    would involve pretty much a complete re-write of the C model if they
    were to be applied consistently to other aspects of C, would have a
    significant impact on code efficiency, and they are not something anyone
    has used or relied on before.


    So, either bar.y is treated as if it was volatile, or hard-to-detect
    bugs would appear because, with optimization, the assignment would
    sometimes change the value of bar.y and sometimes not.

    No, the semantics are that the compiler has to reload bar.y if it keeps it in a register. An optimizer that does anything else is buggy.


    Well, buggy according to your hypothetical semantics. Not buggy
    according to the way C has always worked, and the way C compilers
    generate code.

  • From David Brown@21:1/5 to Robert Finch on Sun Sep 15 20:47:11 2024
    On 15/09/2024 18:52, Robert Finch wrote:
    On 2024-09-15 12:09 p.m., David Brown wrote:
    On 15/09/2024 14:40, Michael S wrote:
    On Sun, 15 Sep 2024 12:19:02 -0000 (UTC)
    Waldek Hebisch <[email protected]> wrote:

    Michael S <[email protected]> wrote:
    On Thu, 12 Sep 2024 16:34:31 +0200
    David Brown <[email protected]> wrote:
    On 12/09/2024 13:29, Michael S wrote:
    On Thu, 12 Sep 2024 03:12:11 -0700
    Tim Rentsch <[email protected]> wrote:
    BGB <[email protected]> writes:

    I fully agree that C is not, and should not be seen as, a
    "high-level assembly language".  But it is a language that is very
    useful to "hardware-type folks", and there are a few things that
    could make it easier to write more portable code if they were
    standardised.  As it is, we just have to accept that some things
    are not portable.
    Why not?
    I don't see practical need for all those UBs apart from buffer
    overflow. More so, I don't see the need for UB in certain limited
    classes of buffer overflows.

    struct {
       char x[8];
       int  y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation.
    And in practice it is. Just not in theory.

    And how should that be defined?


    bar.x[8] = 42 should be defined to be the same as
       char tmp = 42;
       memcpy(&bar.y, &tmp, sizeof(tmp));

    That has two drawbacks. The minor one is that you need to know that
    there is no padding between 'x' and 'y'.

    Padding is another thing that should be Implementation Defined.

    It is.

    I.e. compiler should provide complete documentation of its padding
    algorithms.

    They do.  Or, they should.  Often they are lazy and say "defined by
    the platform ABI".  Really, it is only the alignments that are needed.

    C defines the minimum padding between members in a struct - you get
    the padding needed to ensure that members are correctly aligned.  I
    don't think the C standards disallow additional padding, but it would
    be an extraordinarily strange implementation if there were anything
    more than this minimum padding.

    But I certainly wouldn't mind if the standards dictated this minimum
    padding, and then there would be nothing left to the implementation
    other than alignments.

    In addition, some padding-related things can be defined by the Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of an integer type with the same or narrower width, then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, which matters
    if it's for something like an I/O device.


    Generally, they are packed if you make the fields of the same type, but
    if you change the type you get a new block that is aligned appropriately
    for the type you gave. It is certainly the case that bit-field struct
    layout is complicated, not well-specified in the C standards, and often
    not as well documented as it could be by compilers.

    When I use bit-field layouts and the layout matters (such as for an I/O
    device, rather than just to collect lots of small bits of data in less
    memory), I like to give any padding explicitly. And I put a
    static_assert on the size of the struct, to be sure I haven't got it
    wrong. Such code is, naturally, never intended to be very portable.
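    A minimal sketch of that practice for a hypothetical 32-bit device control register (the field names and widths are made up for illustration). Every bit is accounted for, including the reserved ones, and a `static_assert` catches a layout that comes out the wrong size. Bit-field types other than `_Bool`, `int` and `unsigned int`, and the order in which bits are allocated within the unit, are implementation-defined, so this assumes a compiler such as GCC or Clang that accepts `uint32_t` bit-fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device control register. All 32 bits are accounted for,
   including reserved ones, so no implicit padding is relied on. The
   allocation order (LSB-first on common ABIs) is still implementation-
   defined, so this is not intended to be portable. */
struct ctrl_reg {
    uint32_t enable   : 1;
    uint32_t mode     : 3;
    uint32_t reserved : 12;  /* explicit padding */
    uint32_t divisor  : 16;
};

/* Catch layout mistakes at compile time. */
static_assert(sizeof(struct ctrl_reg) == sizeof(uint32_t),
              "ctrl_reg must be exactly one 32-bit word");
```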

  • From David Brown@21:1/5 to All on Sun Sep 15 20:48:48 2024
    On 15/09/2024 19:21, MitchAlsup1 wrote:
    On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

    Robert Finch <[email protected]> writes:
    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by the Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of an integer type with the same or narrower width, then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, which matters
    if it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

        struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
            uint64_t reserved_41_63              : 23;
            uint64_t dbe                         :  9; /**< R/W1C/H - RAM ECC DBE detected. */
            uint64_t reserved_9_31               : 23;
            uint64_t sbe                         :  9; /**< R/W1C/H - RAM ECC SBE detected. */
    #else
            uint64_t sbe                         :  9;
            uint64_t reserved_9_31               : 23;
            uint64_t dbe                         :  9;
            uint64_t reserved_41_63              : 23;
    #endif
        } s;

    Which brings to mind a slightly different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a
    register-sized container, how does one specify that bit-field in C?

    So, assume a register contains 64 bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C?

    You do so inconveniently, perhaps with access inline functions rather
    than a bit-field struct.

    Fortunately, not many hardware designers are that sadistic. (Or perhaps
    they /are/ that sadistic, but lack the imagination for that particular
    trick.)
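    A minimal sketch of the accessor-function approach for Mitch's example, a 17-bit field at bits 53..69 of a 128-bit register. It assumes the 128-bit image is held as two `uint64_t` words with `w[0]` holding bits 0..63; that word order, and the names `struct reg128`, `get_field128` and `set_field128`, are hypothetical. The setter works bit by bit for clarity; a production version would more likely update each word with a single mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit register image: w[0] holds bits 0..63,
   w[1] holds bits 64..127. */
struct reg128 {
    uint64_t w[2];
};

#define SPAN_LSB   53u   /* first bit of the field */
#define SPAN_WIDTH 17u   /* bits 53..69, crossing the word boundary */

/* Read 'width' bits (1..64) starting at bit 'lsb'. */
static inline uint64_t get_field128(const struct reg128 *r,
                                    unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 64 && lsb + width <= 128);
    unsigned word = lsb / 64, shift = lsb % 64;
    uint64_t v = r->w[word] >> shift;
    if (shift != 0 && word == 0)
        v |= r->w[1] << (64 - shift);   /* bits from the upper word */
    uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                  : (UINT64_C(1) << width) - 1;
    return v & mask;
}

/* Replace the field with the low 'width' bits of 'value',
   leaving all other bits unchanged. */
static inline void set_field128(struct reg128 *r, unsigned lsb,
                                unsigned width, uint64_t value)
{
    assert(width >= 1 && width <= 64 && lsb + width <= 128);
    for (unsigned k = 0; k < width; k++) {
        unsigned bit = lsb + k;
        uint64_t one = UINT64_C(1) << (bit % 64);
        if ((value >> k) & 1)
            r->w[bit / 64] |= one;
        else
            r->w[bit / 64] &= ~one;
    }
}
```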

  • From David Brown@21:1/5 to Thomas Koenig on Sun Sep 15 21:03:00 2024
    On 13/09/2024 17:55, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use for 20
    years. There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.

    What about VLAs?

    I don't know if MSVC has VLAs - it's not a tool I ever use, so I don't
    have the details in my head.

    But perhaps VLAs don't count as "commonly used parts of C99". I have
    only occasionally had use for real VLAs in my own programming (more
    often I have local arrays whose size is a const known at compile time,
    but not syntactically a constant expression - then you have something
    that is technically a VLA but which the compiler can handle just like a
    normal fixed size array). A lot of people seem to get in a fluster when
    you talk about VLAs, and think their inclusion in the C standards was
    inspired by the demons trying to escape people's noses.
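For reference, here is a small C99 sketch of both kinds of VLA mentioned above. The function names are illustrative only. C11 made VLAs optional, and an implementation without them is expected to define __STDC_NO_VLA__, so the code is guarded on that macro.

```c
#include <stddef.h>

#ifndef __STDC_NO_VLA__
/* A "real" VLA use: the array parameter's dimensions come from
 * earlier parameters (a variably modified type). */
static double matrix_sum(size_t n, size_t m, double a[n][m])
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            total += a[i][j];
    return total;
}

/* The other case: 'len' is a const object known at compile time but
 * not an integer constant expression, so 'buf' is technically a VLA,
 * yet a compiler can allocate it just like a fixed-size array. */
static int fill_squares(void)
{
    const int len = 8;
    int buf[len];
    int sum = 0;
    for (int i = 0; i < len; i++) {
        buf[i] = i * i;
        sum += buf[i];
    }
    return sum;
}
#endif
```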

    There are a few more obscure parts of C99 that are often poorly
    implemented, such as some of the floating point details, and many
    embedded compilers omit much of the wide character stuff.

    I suppose you could argue that my claim is tautological - parts of C99
    that are not implemented in the mainstream C compilers will of course
    not be commonly used!


    There are only two serious, general purpose C compilers in mainstream
    use - gcc and clang, and both support almost all of C23 now. But it
    will take a while for the more niche tools, such as some embedded
    compilers, to catch up.

    It is almost impossible to gather statistics on compiler use,
    especially with free compilers, but what about MSVC and icc?

    MSVC is rarely used for C - it is primarily a C++ tool. Traditionally,
    you have had closer to modern C support using MSVC in C++ mode than in C
    mode.

    As for icc, I don't think it is nearly as popular as it used to be, but
    I have no statistics to back that up. However, I believe it has kept up
    with the standards (as well as compatibility with many of gcc and
    clang's extensions). I don't know about C23 support.

  • From David Brown@21:1/5 to BGB on Sun Sep 15 21:09:47 2024
    On 14/09/2024 04:39, BGB wrote:
    On 9/13/2024 10:55 AM, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use for 20
    years.  There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.

    What about VLAs?


    IIRC, VLAs and _Complex and similar still don't work in MSVC.
    Most of the rest does now at least.


    Thanks - you know it far better than I do.

    There are only two serious, general purpose C compilers in mainstream
    use - gcc and clang, and both support almost all of C23 now.  But it
    will take a while for the more niche tools, such as some embedded
    compilers, to catch up.

    It is almost impossible to gather statistics on compiler use,
    especially with free compilers, but what about MSVC and icc?

    From what I gather:
      GCC and Clang are popular for most mainline targets;
        GCC is the dominant C compiler on Linux.

    It is also far and away the dominant compiler for embedded systems -
    both embedded Linux and small embedded systems.

      MSVC is popular on Windows
        Has been essentially freeware/freemium for over a decade;
        Visual Studio has a fairly good debugger;
        Targets limited to things you can run Windows on (x86, x64, ARM)

    MSVC is mainly used for C++ - or for a C-like subset of C++.

      TinyCC, popular for niche use, but limited range of targets;
        x86, ARM, experimental RISC-V.
      SDCC, popular for 8/16 bit targets;

    SDCC has never been very popular. For the targets SDCC supports, Keil
    (8051) and IAR (many small CISC targets) are far more common. But for
    these kinds of devices, you are never working in anything close to
    standard C anyway.

      CC65, popular for 6502 and 65C816;

    That's getting /really/ obscure now. There are thousands of C compilers
    that are used, or have been used, for various microcontrollers. But if
    you sum all their uses over the last decade, it will not be close to 1%
    of the total use of C compilers.

  • From Tim Rentsch@21:1/5 to Scott Lurndal on Sun Sep 15 12:37:28 2024
    [email protected] (Scott Lurndal) writes:

    Robert Finch <[email protected]> writes:

    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, in
    case it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;

    Probably many people know that this code depends on an
    implementation-defined extension (allowing uint64_t as
    the type of a bitfield) and is not guaranteed to be
    portable. Using 'unsigned' instead would be portable
    (assuming typical 32-bit ints, etc).
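Here is a sketch of what Tim describes, assuming a 32-bit unsigned int. The fields are declared with unsigned int, which every conforming compiler must accept as a bit-field type. The struct tag is hypothetical. Only the type choice becomes portable: the order and placement of the bits within their storage units remain implementation-defined.

```c
/* Little-endian-branch layout with 'unsigned int' bit-fields instead of
 * the uint64_t extension. Assumes unsigned int is 32 bits, so each pair
 * of fields (9 + 23 bits) fills one storage unit. */
struct gic_ecc_int_statusr_u {
    unsigned sbe            :  9;  /* R/W1C/H - RAM ECC SBE detected */
    unsigned reserved_9_31  : 23;
    unsigned dbe            :  9;  /* R/W1C/H - RAM ECC DBE detected */
    unsigned reserved_41_63 : 23;
};
```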

  • From MitchAlsup1@21:1/5 to David Brown on Sun Sep 15 19:13:31 2024
    On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:

    On 15/09/2024 19:21, MitchAlsup1 wrote:
    On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

    Robert Finch <[email protected]> writes:
    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, in
    case it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

        struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
            uint64_t reserved_41_63              : 23;
            uint64_t dbe                         :  9; /**< R/W1C/H - RAM
    ECC DBE detected. */
            uint64_t reserved_9_31               : 23;
            uint64_t sbe                         :  9; /**< R/W1C/H - RAM
    ECC SBE detected. */
    #else
            uint64_t sbe                         :  9;
            uint64_t reserved_9_31               : 23;
            uint64_t dbe                         :  9;
            uint64_t reserved_41_63              : 23;
    #endif
        } s;

    Which brings to mind a slightly different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a register
    sized container, how does one specify that bit-field in C ??

    So, assume a register contains 64-bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C.

    You do so inconveniently, perhaps with access inline functions rather
    than a bit-field struct.

    Fortunately, not many hardware designers are that sadistic. (Or perhaps
    they /are/ that sadistic, but lack the imagination for that particular trick.)

    In My 66000 ISA it is both efficient and straightforward::

    i = struct.field;
    ..
    struct.field = j;

    CARRY Rsf1,{I}
    SRA Ri,Rsf0,<17,53>
    and
    CARRY Rsf1,{O}
    INS Rsf0,Rj,<52,17>

    Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
    need for these registers to be sequential.

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}

    If the ISA has any realistically efficient grasp on multi-precision
    integer operations, these fall out almost for free.

  • From David Brown@21:1/5 to BGB on Sun Sep 15 21:40:59 2024
    On 14/09/2024 08:34, BGB wrote:
    On 9/13/2024 10:30 AM, David Brown wrote:
    On 12/09/2024 23:14, BGB wrote:
    On 9/12/2024 9:18 AM, David Brown wrote:
    On 11/09/2024 20:51, BGB wrote:
    On 9/11/2024 5:38 AM, Anton Ertl wrote:
    Josh Vanderhoof <[email protected]> writes:
    [email protected] (Anton Ertl) writes:


    <snip lots>


    Though, it generally takes a few years before new features become usable.
    Like, it is only in recent years that it has become "safe" to use
    most parts of C99.


    Most of the commonly used parts of C99 have been "safe" to use for 20
    years.  There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.


    Until VS2013, the most one could really use was:
      // comments
      long long
    Otherwise, it was basically C90.
      'stdint.h'? Nope.
      Ability to declare variables wherever? Nope.
      ...

    Nonsense.

    MS basically gave up on C and concentrated on C++ (then later C# and
    other languages). Their C compiler gained the parts of C99 that were in
    common with C++ - and anyway, most people (that I have heard of) using
    MSVC for C programming actually use the C++ compiler but stick
    approximately to a C subset. And this has been the case for a /long/
    time - long before 2013.


    After this, it was piecewise.
      Though, IIRC, still no VLAs or similar.


    That I believe.


    There are only two serious, general purpose C compilers in mainstream
    use - gcc and clang, and both support almost all of C23 now.  But it
    will take a while for the more niche tools, such as some embedded
    compilers, to catch up.

    <stdbit.h> is, however, in the standard library rather than the
    compiler, and they can be a bit slow to catch up.


    FWIW:
    I had been adding parts of newer standards in my case, but it is more hit/miss (more adding parts as they seem relevant).


    Clearly your own compiler will only support the bits of C that you
    implement. But I am not sure that it counts as a "serious, general
    purpose C compiler in mainstream use" - no offence implied!



       Whether or not the target/compiler allows misaligned memory access;
         If set, one may use misaligned access.

    Why would you need that?  Any decent compiler will know what is
    allowed for the target (perhaps partly on the basis of compiler
    flags), and will generate the best allowed code for accesses like
    foo3() above.


    Imagine you have compilers that are smart enough to turn "memcpy()"
    into a load and store, but not smart enough to optimize away the
    memory accesses, or fully optimize away the wrapper functions...


    Why would I do that?  If I want to have efficient object code, I use a
    good compiler.  Under what realistic circumstances would you need to
    have highly efficient results but be unable to use a good optimising
    compiler?  Compilers have been inlining code for 30 years at least
    (that's when I first saw it) - this is not something new and rare.


    Say, you are using a target where you can't use GCC or similar.

    Which target would that be? Excluding personal projects, some very
    niche devices, and long-outdated small CISC chips, there really aren't
    many devices that don't have a GCC and clang port. Of course there
    /are/ processors that gcc does not support, but almost nobody writes
    code that has to be portable to such devices.

    And as for optimising compilers, I used at least two different
    optimising compilers in the mid nineties that inlined code
    automatically, before using gcc. (I can't remember if they inlined
    memcpy - it was a long time ago!). Optimising compilers are not a new
    concept, and are not limited to gcc and clang.


    Say:
    BJX2, haven't ported GCC as it looks like a pain;
      Also GCC is big and slow to recompile.

    6502 and 65C816, because these are old and probably not worth the effort
    from GCC's POV.

    Various other obscure/niche targets.


    Say, SH-5, which never saw a production run (it was a 64-bit successor
    to SH-4), but seemingly around the time Hitachi spun-out Renesas, the
    SH-5 essentially got canned. And, it apparently wasn't worth it for GCC
    to maintain a target for which there were no actual chips (comparably
    the SH-2 and SH-4 lived on a lot longer due to having niche uses).


    It would be quite ridiculous to limit the way you write code because of possible limitations for non-existent compilers for target devices that
    have never been made.


    So, for best results, the best case option is to use a pointer cast
    and dereference.

    For some cases, one may also need to know whether or not they can
    access the pointers in a misaligned way (and whether doing so would
    be better or worse than something like "memcpy()").


    Again, I cannot see a /real/ situation where that would be relevant.


    I can think of a few.

    Most often though it is in things like data compression/decompression
    code, where there is often a lot of priority on "gotta go fast".


    I still cannot see any situation where it would be relevant. If I need
    to read 4 bytes of memory from an address, and don't know if the address
    is uint32_t aligned or not, I would use memcpy(). The compiler would
    know if unaligned 32-bit reads are supported or not for the target, or
    if it is faster to use them or use byte reads. That's the compiler's
    job - I'm the programmer, not the micro-manager.
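Here is a minimal sketch of the memcpy() idiom David describes. load_u32 and store_u32 are hypothetical names. memcpy() is well-defined for any alignment. When optimising, GCC and Clang commonly compile a fixed 4-byte memcpy() into a single load or store on targets that permit unaligned access, and into byte accesses otherwise.

```c
#include <stdint.h>
#include <string.h>

/* Read a uint32_t from an address that may not be 4-byte aligned.
 * The compiler, not the programmer, picks the best instruction sequence. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Store counterpart, equally valid for any alignment. */
static inline void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```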

    And if I know that for a particular target there are particular
    instructions that could be more efficient but are unknown to the
    compiler (perhaps there are odd SIMD instructions), and it is worth the
    effort to use them, then I would be writing that code for the specific
    target. That's target-specific conditional compilation, and I still
    have no need to know if the target can access misaligned data.


    There is a difference here between "_memlzcpy()" and "_memlzcpyf()"
    in that:
       the former will always copy an exact number of bytes;
       the latter may write 16-32 bytes over the limit.

    It may do /what/ ?  That is a scary function!


    This is why the latter has an 'f' suffix (for "fast").


    I can accept that there are cases (such as you describe below) where
    this might be useful, but I would not be identifying it just with an "f".

    There are cases where it may be desirable to have the function write
    past the end in the name of speed, and others where this would not be acceptable.

    Hence why there are 2 functions.


    The main intended use-case for _memlzcpyf() being use for match-copying
    in something like my LZ4 decoder, where one may pad the decode buffer by
    an extra 32 bytes.

    Also my RP2 decoder works in a similar way.
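The difference between the two functions can be sketched as follows. The names lz_copy_exact and lz_copy_fast are hypothetical stand-ins for _memlzcpy() and _memlzcpyf(). This version copies in 8-byte chunks, so it may write up to 7 bytes past the end rather than the 16-32 bytes mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Both copy an LZ match of 'len' bytes starting 'dist' bytes back in the
 * output buffer (dist >= 1), so source and destination may overlap. */

/* Exact variant: writes exactly len bytes. A forward byte loop is needed
 * because overlapping matches (dist < len) re-read bytes written earlier
 * in the same copy. */
static void lz_copy_exact(unsigned char *dst, size_t dist, size_t len)
{
    const unsigned char *src = dst - dist;
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Fast variant: copies in 8-byte chunks and may write up to 7 bytes past
 * dst + len, so the caller must pad the output buffer. Chunking is only
 * valid when dist >= 8: each chunk then reads bytes that were completely
 * written before it, and memcpy's source and destination never overlap. */
static void lz_copy_fast(unsigned char *dst, size_t dist, size_t len)
{
    const unsigned char *src = dst - dist;
    if (dist < 8) {
        lz_copy_exact(dst, dist, len);
        return;
    }
    for (size_t i = 0; i < len; i += 8)
        memcpy(dst + i, src + i, 8);
}
```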



    Possible:
       __MINALIGN_type__  //minimum allowed alignment for type

    _Alignof(type) has been around since C11.


    _Alignof tells the native alignment, not the minimum.

    It is the same thing.


    Not necessarily, it wouldn't make sense for _Alignof to return 1 for all
    the basic integer types.

    Of course it makes sense to do that, on targets where an alignment of 1
    is safe and efficient.

    But, for "minimum alignment" it may make sense
    to return 1 for anything that can be accessed unaligned.


    Again, I see no use for this.


    Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would
    give 1 if the target supports misaligned pointers.


    The alignment of types in C is given by _Alignof.  Hardware may
    support unaligned accesses - C does not.  (By that, I mean that
    unaligned accesses are UB.)
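As a sketch of the standard tools here: _Alignof (or alignof from <stdalign.h>) gives a type's required alignment, and a pointer can be tested against it at run time. The is_aligned helper is hypothetical. The pointer-to-integer conversion it relies on is implementation-defined, but it behaves as expected on mainstream flat-address targets.

```c
#include <stdalign.h>   /* alignof: the C11 convenience macro for _Alignof */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* _Alignof(char) is 1 on every implementation; values for other types
 * are implementation-defined (commonly 4 for int32_t). */
_Static_assert(_Alignof(char) == 1, "char alignment is always 1");

/* True if p meets the given alignment requirement. */
static inline bool is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}
```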


    The point of __MINALIGN_type__ would be:
    If the compiler defines it, and it is defined as 1, then this allows the compiler to be able to tell the program that it is safe to use this type
    in an unaligned way.


    For what purpose?

    This also applies to targets where some types are unaligned but others
    are not:
    Say, if all integer types 64 bits or less are unaligned, but 128-bit
    types are not.


    For what purpose? And why do you want to worry about totally
    hypothetical systems?


    Most of this is being compiled by BGBCC for a 50 MHz CPU.

    So, the CPU is slow and the compiler doesn't generate particularly
    efficient code unless one writes it in a way it can use effectively.

    Which often means trying to write C like it was assembler and manually organizing statements to try to minimize value dependencies (often
    caching any values in variables, and using lots of variables).


    In this case, the equivalent of "-fwrapv -fno-strict-aliasing" is the
    default semantics.

    Generally, MSVC also responds well to a similar coding style as used for BGBCC (or, as it more happened, the coding styles that gave good results
    in MSVC also tended to work well in BGBCC).


    Note that MSVC most certainly does /not/ work like "gcc -fwrapv" -
    signed integer overflow is UB in MSVC, and it generates code that
    assumes it never happens. There is an obscure officially undocumented
    (or documented unofficially, if you prefer) flag to turn off such optimisations.

    Last I read about it, they had no plans to do any type-based alias
    analysis, but nor did they rule out the possibility in the future.
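Because signed overflow is UB, a compiler may fold a test such as x + 1 > x to true. Under gcc -fwrapv the same test must be false for INT_MAX. A check that never overflows avoids depending on either behaviour. The function name below is hypothetical. GCC and Clang also offer __builtin_add_overflow, and C23 adds ckd_add() in <stdckdint.h>.

```c
#include <limits.h>
#include <stdbool.h>

/* Reports whether a + b would overflow int, without ever performing an
 * overflowing operation, so the answer is the same with or without -fwrapv. */
static bool add_would_overflow(int a, int b)
{
    return (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
}
```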

  • From Michael S@21:1/5 to David Brown on Sun Sep 15 22:42:35 2024
    On Sun, 15 Sep 2024 20:13:44 +0200
    David Brown <[email protected]> wrote:

    On 14/09/2024 23:19, Michael S wrote:

    Yes, exactly.


    Contrary to your imagination - compilers have /never/ followed your
    proposed semantics. The oldest gcc version I found on godbolt.org is
    3.4.6 from 2006, and given:

    struct Bar {
    char x[8];
    int y;
    } bar;


    int foo(int i) {
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
    }

    It generates:

    foo:
    movslq %edi,%rdi
    movl $1234, %eax
    movl $1234, bar+8(%rip)
    movb $42, bar(%rdi)
    ret

    That is, y is /not/ reloaded after bar.x[i] is set.
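For contrast, here is a sketch of a version the compiler may not optimise that way. foo_bytes is a hypothetical name. Writing through an unsigned char pointer into the object representation of bar is permitted for any index below sizeof bar. Character-type accesses may alias any object, so the compiler cannot assume bar.y still holds 1234.

```c
#include <stddef.h>

struct Bar {
    char x[8];
    int y;
};

static struct Bar bar;

/* Valid for any i < sizeof bar, including the bytes of bar.y itself,
 * so bar.y must be read back after the store. */
static int foo_bytes(size_t i)
{
    bar.y = 1234;
    unsigned char *p = (unsigned char *)&bar;
    p[i] = 42;
    return bar.y;
}
```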


    No other compiler on godbolt is doing it, except possibly gcc clones.
    Not even clang, whose former leader wrote "Nasal Manifest".

  • From Tim Rentsch@21:1/5 to [email protected] on Sun Sep 15 12:51:04 2024
    [email protected] (MitchAlsup1) writes:

    On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

    Robert Finch <[email protected]> writes:

    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, in
    case it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9;
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;

    Which brings to mind a slightly different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a register
    sized container, how does one specify that bit-field in C ??

    So, assume a register contains 64-bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C.

    The 17-bit bitfield can be specified in the usual way. Example:

    struct bitfield_example {
    unsigned one : 32;
    unsigned two : 20;
    unsigned hmm : 17;
    };

    An implementation is allowed to use up the last 12 bits of the
    first 64-bit unit and the first 5 bits of the next 64-bit unit.
    But, whether that happens or not is up to the implementation.
    The bitfield for member 'hmm' could instead be put entirely in
    the second 64-bit unit, with the last 12 bits of the first 64-bit
    unit simply left as padding. There is no standard way to force
    it.

  • From Tim Rentsch@21:1/5 to Michael S on Sun Sep 15 12:54:04 2024
    Michael S <[email protected]> writes:

    My motivation is eliminating as many UBs as is practically
    possible.

    I think I understand what it is you want. What sort of case can
    you make that other people should want it, or that I should want
    it? So far I'm a very long way from being convinced.

  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Sep 15 21:05:05 2024
    On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

    Robert Finch <[email protected]> writes:

    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, in
    case it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9;
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;

    Which brings to mind a slightly different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a register
    sized container, how does one specify that bit-field in C ??

    So, assume a register contains 64-bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C.

    The 17-bit bitfield can be specified in the usual way. Example:

    struct bitfield_example {
    unsigned one : 32;
    unsigned two : 20;
    unsigned hmm : 17;
    };

    An implementation is allowed to use up the last 12 bits of the
    first 64-bit unit and the first 5 bits of the next 64-bit unit.
    But, whether that happens or not is up to the implementation.
    The bitfield for member 'hmm' could instead be put entirely in
    the second 64-bit unit, with the last 12 bits of the first 64-bit
    unit simply left as padding. There is no standard way to force
    it.

  • From Scott Lurndal@21:1/5 to Tim Rentsch on Sun Sep 15 23:43:08 2024
    Tim Rentsch <[email protected]> writes:
    [email protected] (Scott Lurndal) writes:

    Robert Finch <[email protected]> writes:

    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when a field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed, in
    case it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9; /**< R/W1C/H - RAM ECC DBE detected. */
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9; /**< R/W1C/H - RAM ECC SBE detected. */
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;

    Probably many people know that this code depends on an
    implementation-defined extension (allowing uint64_t as
    the type of a bitfield) and is not guaranteed to be
    portable. Using 'unsigned' instead would be portable
    (assuming typical 32-bit ints, etc).


    Portability in this case was not necessary. In any case,
    it's portable to clang and gcc, which is good enough.

  • From Tim Rentsch@21:1/5 to Kent Dickey on Sun Sep 15 18:32:51 2024
    [email protected] (Kent Dickey) writes:

    [examples of descending loops with unsigned loop variables]

    This discussion wandered into many subthreads, but I only want to make
    one post and chose here.

    When you write code working on signed numbers and do something like:

    (a < 0) || (a >= max)

    Then the compiler realizes if you treat 'a' as unsigned, this is just:

    (unsigned)a >= max

    since any negative number, treated as unsigned, will be larger than the largest positive signed number. So, to do loops which count down and
    have any stride using an unsigned loop count:

    for(u = start; u <= start; u -= step)

    With the usual caveats (start must be a valid signed number, and step
    cannot be so large that start + step crosses the signed boundary).

    Clever, although maybe too tricky. Better if start and step are
    also unsigned, in which case a safe test is easily seen to be
    start + step > start.
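Both tricks can be sketched in C as below. The function names are hypothetical. The loop assumes step > 0 and that start + step does not wrap, which is exactly Tim's start + step > start test when both are unsigned.

```c
#include <stdbool.h>

/* Kent's range check: for signed a and non-negative max,
 * (a < 0 || a >= max) collapses to one unsigned comparison, because a
 * negative a converts to a value larger than any non-negative int. */
static bool out_of_range(int a, int max)
{
    return (unsigned)a >= (unsigned)max;
}

/* Descending loop with an unsigned counter: once u -= step wraps below
 * zero, u becomes a huge value that fails u <= start, ending the loop. */
static int sum_descending(unsigned start, unsigned step)
{
    int sum = 0;
    for (unsigned u = start; u <= start; u -= step)
        sum += (int)u;
    return sum;
}
```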

    But: unsigned numbers in C have some dangers, which no one here has mentioned. Some code presented comes CLOSE to being wrong, but gets
    lucky. With "int" being 32-bits, C promotion rules around unsigned
    ints, signed ints, and unsigned 64-bit can create trouble.

    uint64_t dval; uint32_t val32; int a;

    val32 = 1; dval = 1; a = 1;
    dval = val32 - 2 + dval;

    C will do (val32 - 2) first, which is (1U - 2), giving 0xffff_ffff, and
    then add dval, and the result is 0x1_0000_0000.

    Not really interesting. It's usually a mistake to mix different
    types, whether or not the types have different signedness. Arithmetic
    is one problem but assignment is another. Using the same type
    throughout avoids surprises like this one.
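Here is a runnable version of the example, assuming a 32-bit int. The second function applies the fix of doing the arithmetic in the wider type throughout, in line with the advice to avoid mixing types.

```c
#include <stdint.h>

/* val32 - 2 is evaluated in 32-bit unsigned arithmetic and wraps to
 * 0xFFFFFFFF before being widened for the addition with dval. */
static uint64_t mixed(void)
{
    uint64_t dval = 1;
    uint32_t val32 = 1;
    return val32 - 2 + dval;           /* 0x100000000 */
}

/* Converting to the wider type first keeps everything in 64 bits:
 * 1 - 2 wraps to 0xFFFFFFFFFFFFFFFF, and adding 1 wraps back to 0. */
static uint64_t widened(void)
{
    uint64_t dval = 1;
    uint32_t val32 = 1;
    return (uint64_t)val32 - 2 + dval; /* 0 */
}
```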

    Signed numbers don't have this risk, so if you're doing known small loops, you can just use ints. If you're doing possibly large loops, just use int64_t.

    I consider this bad advice. Loops are doing something with the loop
    variable, and its type should be chosen according to how it is used.
    If the loop variable represents an index, or a length, or count, it
    should be unsigned (or unsigned long, etc). If the loop variable
    represents degrees C or F, or some other naturally signed measure it
    should be signed (or maybe floating point). What kind of loop it
    is, whether ascending or descending, or what the increment is, etc,
    is secondary; a more important factor is what sort of value is
    being represented, and in almost all cases that is what should
    determine the type used.
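Here is an illustration of choosing the type by meaning. An index counts elements, so it is a size_t. A descending loop can still avoid a signed variable by testing before decrementing. reverse_copy is a hypothetical example function.

```c
#include <stddef.h>

/* 'i-- > 0' compares first and decrements second, so the body sees
 * i = n-1 down to 0 and the loop never needs i to go negative. */
static void reverse_copy(int *dst, const int *src, size_t n)
{
    for (size_t i = n; i-- > 0; )
        dst[n - 1 - i] = src[i];
}
```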

    Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
    C/C++ is a mistake. It should always have been ILP64, and this nonsense would go away. Any new architecture should make C ILP64 (looking at you RISC-V, missing yet another opportunity to not make the same mistakes as everyone else).

    I believe this view is shortsighted. The big mistake is developers
    hardcoding types everywhere - especially int, but also long, and
    their unsigned variants. It's almost never a good idea to hardcode
    a specific width (eg, uint32_t) in a type name used for parameters
    or local variables, but that is by far a very common practice.
    Names of types should reflect how the variable is meant to be used,
    not the specifics of what sort of register it goes into. The more
    firmly we cement our programs to specific hardware choices, the
    greater the pain when those choices need to change, either due to
    time or moving to a different platform. The key is to keep things
    light and flexible, not encrusted onto fixed hardware choices like
    barnacles on the hull of a ship.
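Here is a small sketch of naming types by role rather than width. The typedef names and the function are hypothetical. If a representation later has to change, only the typedef changes, not every parameter and local variable that uses it.

```c
#include <stddef.h>

typedef size_t item_count;     /* a count of elements: never negative */
typedef double temperature_c;  /* a naturally signed measurement */

/* Mean of n readings; returns 0 for an empty set. */
static temperature_c mean_temperature(const temperature_c *t, item_count n)
{
    temperature_c sum = 0.0;
    for (item_count i = 0; i < n; i++)
        sum += t[i];
    return n ? sum / (temperature_c)n : 0.0;
}
```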

  • From Tim Rentsch@21:1/5 to Michael S on Sun Sep 15 18:47:06 2024
    Michael S <[email protected]> writes:

    On Sun, 15 Sep 2024 20:13:44 +0200
    David Brown <[email protected]> wrote:

    struct Bar {
    char x[8];
    int y;
    } bar;


    int foo(int i) {
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
    }

    It generates:

    foo:
    movslq %edi,%rdi
    movl $1234, %eax
    movl $1234, bar+8(%rip)
    movb $42, bar(%rdi)
    ret

    That is, y is /not/ reloaded after bar.x[i] is set.

    No other compiler on godbolt is doing it, except possibly gcc clones.
    Not even clang, whose former leader wrote "Nasal Manifest".

    Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
    both show bar.y not being overwritten (optimization levels -O1 or -O2)
    when foo() is called.

  • From Tim Rentsch@21:1/5 to Scott Lurndal on Sun Sep 15 18:51:28 2024
    [email protected] (Scott Lurndal) writes:

    Tim Rentsch <[email protected]> writes:

    [email protected] (Scott Lurndal) writes:

    Robert Finch <[email protected]> writes:

    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by
    Standard itself. Not in this particular case, but, for
    example, it could be defined that when a field of one integer
    type is immediately followed by another field of integer type
    with the same or narrower width then there should be no padding
    in-between.


    What about bit-fields in a struct? I believe they are usually
    packed. In case its for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9;
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;

    Probably many people know that this code depends on an
    implementation-defined extension (allowing uint64_t as
    the type of a bitfield) and is not guaranteed to be
    portable. Using 'unsigned' instead would be portable
    (assuming typical 32-bit ints, etc).

    Portability in this case was not necessary. In any case,
    it's portable to clang and gcc, which is good enough.

    I'm not criticizing the code; just pointing out an aspect
    in case some people weren't aware of it.
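    One portable alternative is shift-and-mask accessors on the 64-bit
    register value, which depend on neither bit-field layout nor byte order
    and need no uint64_t bit-field extension. This is only a sketch: the
    accessor names are hypothetical, and the field positions (sbe at bits
    0..8, dbe at bits 32..40) are read off the little-endian branch of the
    struct above.

```c
#include <stdint.h>

/* sbe occupies bits 0..8 and dbe bits 32..40 of the 64-bit register. */
#define ECC_SBE_SHIFT  0
#define ECC_DBE_SHIFT  32
#define ECC_FIELD_MASK UINT64_C(0x1FF)       /* both fields are 9 bits wide */

static inline unsigned ecc_sbe(uint64_t reg)
{
    return (unsigned)((reg >> ECC_SBE_SHIFT) & ECC_FIELD_MASK);
}

static inline unsigned ecc_dbe(uint64_t reg)
{
    return (unsigned)((reg >> ECC_DBE_SHIFT) & ECC_FIELD_MASK);
}

/* Return reg with the dbe field replaced by v (excess high bits of v dropped). */
static inline uint64_t ecc_with_dbe(uint64_t reg, unsigned v)
{
    reg &= ~(ECC_FIELD_MASK << ECC_DBE_SHIFT);
    return reg | (((uint64_t)v & ECC_FIELD_MASK) << ECC_DBE_SHIFT);
}
```

    Shifts and masks operate on values rather than on memory layout, so the
    same source works whichever way the compiler would have allocated the
    bit-fields.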

  • From David Brown@21:1/5 to BGB on Mon Sep 16 09:01:08 2024
    On 15/09/2024 06:42, BGB wrote:
    On 9/14/2024 8:26 AM, Anton Ertl wrote:
    [email protected] (Kent Dickey) writes:
    Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
    C/C++ is a mistake.  It should always have been ILP64, and this nonsense
    would go away.  Any new architecture should make C ILP64 (looking at you
    RISC-V, missing yet another opportunity to not make the same mistakes as
    everyone else).

    We now have had more than 30 years of catering for this mistake by
    everyone involved.  Given their goals, I think that RISC-V made the
    right choice for int in their ABI, even if it was the original choice
    by the MIPS and Alpha people that they follow, like everyone else, was
    wrong.

    That being said, one option would be to introduce another ABI and API
    with 64-bit int (and maybe 32-bit long short int), and programmers
    could choose whether to program for the ILP API, or the int=int32_t
    API.  Would the ILP API/ABI fare better than x32?  I doubt it, even
    though I would support it.  This ship probably has sailed.


    Changing the size of 'int' would likely be a massive pain from a
    software compatibility POV (possibly affecting things much more than
    changing the size of pointers, or the size of 'long'; which was a major source of pain during the 32 to 64 bit migration).


    When my project got started, I was originally going with 32-bit 'long',
    like MSVC, but then switched over to keeping 'long' matched with the
    pointer size, as code that assumed sizeof(long)==sizeof(void *) was more common than code that assumed sizeof(long)==4 (it was more common for
    code to use 'int' as the de-facto 32-bit type), as well as this being a
    more useful assumption (though this assumption breaks with 128 bit
    pointers).

    Changing sizeof(int) to be anything other than 4 is likely to break significant amounts of code, and pretty much anything that reads/writes structs to files or similar for data storage.

    But, yes, this is even with the whole thing that on a 64-bit machine,
    32-bit integers are typically handled in a way where they are sign or
    zero extended to 64 bits.


    Granted, a better alternative might be to rework code to generally use
    the "stdint.h" types, and to use "intptr_t" for integer types matched to
    the size of a pointer, ...


    uintptr_t is usually a more natural choice - on almost all systems, it
    is representing an address, and those are unsigned.
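    As an illustration of treating an address as an unsigned number, here is a
    sketch of an alignment check. It assumes the optional uintptr_t type
    exists (it does on mainstream targets) and that align is a power of two.
    The name is_aligned is hypothetical. Pointer-to-integer conversion is
    implementation-defined, but on flat-address-space targets it yields the
    address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of align. align must be a power of two, so
   align - 1 is a mask of the low address bits that must all be zero. */
static bool is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (uintptr_t)(align - 1)) == 0;
}
```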

    The other biggest hindrance (apart from breaking unwarranted assumptions
    about sizes in existing code) to 64-bit int is the number of fundamental integer types in C. You have char, short, int, long and long long. So
    if int is 64-bit, there are not sufficient standard types to have 8-bit,
    16-bit and 32-bit types as well. But at the other end you have int,
    long and long long that are all 64-bit (perhaps one of them might be
    128-bit). The integer type system in C was made at a time when 16-bit
    systems were common and 32-bit would be more than enough for anyone, and
    before the world settled on 8-bit bytes and powers of two for integer sizes.

    I think using the <stdint.h> types for anything size-specific makes a
    lot of sense. For a lot of things, exact sizes don't matter, 32-bit int
    (but often not unsigned int) is as efficient as anything else, and
    assuming at least 32 bits is not a hindrance to portability. But I would
    be reluctant to use "short", "long" or "long long" in any code -
    <stdint.h> types do a much better job.
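    The data models discussed here (ILP32, LP64, LLP64, ILP64) differ only in
    the widths of int, long and pointers. Here is a sketch that names the model
    of the target it is compiled for. The function name data_model is
    hypothetical. LLP64, with its 32-bit long, is the model 64-bit Windows
    and MSVC use.

```c
#include <limits.h>

/* Classify the target by the widths (in bits) of int, long and void *. */
const char *data_model(void)
{
    unsigned i = (unsigned)(sizeof(int) * CHAR_BIT);
    unsigned l = (unsigned)(sizeof(long) * CHAR_BIT);
    unsigned p = (unsigned)(sizeof(void *) * CHAR_BIT);

    if (i == 32 && l == 32 && p == 32) return "ILP32";
    if (i == 32 && l == 64 && p == 64) return "LP64";   /* Linux, macOS, BSDs */
    if (i == 32 && l == 32 && p == 64) return "LLP64";  /* 64-bit Windows */
    if (i == 64 && l == 64 && p == 64) return "ILP64";
    return "other";
}
```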

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Sep 16 06:50:45 2024
    Tim Rentsch <[email protected]> schrieb:
    Michael S <[email protected]> writes:

    On Sun, 15 Sep 2024 20:13:44 +0200
    David Brown <[email protected]> wrote:

    struct Bar {
    char x[8];
    int y;
    } bar;


    int foo(int i) {
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
    }

    It generates:

    foo:
    movslq %edi,%rdi
    movl $1234, %eax
    movl $1234, bar+8(%rip)
    movb $42, bar(%rdi)
    ret

    That is, y is /not/ reloaded after bar.x[i] is set.

    No other compiler on godbolt is doing it, except possibly gcc clones.
    Not even clang, whose former leader wrote "Nasal Manifest".

    Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
    both show bar.y not being overwritten (optimization levels -O1 or -O2)
    when foo() is called.

    Same for current gcc trunk (bleeding edge development version).

  • From Thomas Koenig@21:1/5 to David Brown on Mon Sep 16 07:17:44 2024
    David Brown <[email protected]> schrieb:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the notion
    of int from K&R days.

    That's a designer's choice, I think. It is possible to add 32-bit
    instructions which should be as fast (or possibly faster) than
    64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not so
    easy without either making rather complicated instructions or having
    assembly instructions with undefined behaviour (imagine the terror that
    would bring to some people!).

    It has happened, see the illegal (but sometimes useful)
    6502 instructions, or the recent RISC-V implementation snafu
    (GhostWrite).

    A classic example would be for "y = p[x++];" in a loop. For a 64-bit
    type x, you would set up one register once with "p + x", and then have a
    load with post-increment instruction in the loop. You can also do that
    with x as a 32-bit int, unless you are of the opinion that enough apples added to a pile should give a negative number of apples.

    But of course it should!

    But wait, no, the number of apples should become zero if you add
    enough of them.

    But wait... maybe if the pile becomes too large, then the apples
    will no longer be individual apples, but will be crushed under
    their weight, a bit like https://what-if.xkcd.com/4/ .

    But with a
    wrapping type for x - such as unsigned int in C or modulo types in Ada,
    you have little choice but to hold "p" and "x" separately in registers,
    add them for every load, and do the increment and modulo operation. I
    really can't see this all being handled by a single instruction.

    One reason not to use such a wrapping type.

    Although, if you have (R1+R2) addressing and a 32-bit addition, this
    could actually work, but not with a post-increment instruction.
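    Here is a sketch of the three index types under discussion. It uses a
    sentinel-terminated scan so the loop bound does not let the compiler prove
    the index stays small. The function names are hypothetical. With size_t,
    or with int (whose overflow is undefined), a compiler may keep one
    incrementing pointer. With uint32_t the index is defined to wrap at 2^32,
    so on a 64-bit target the compiler generally has to keep the 32-bit index
    and add it to p on each load. What a given compiler actually emits is best
    checked on godbolt.

```c
#include <stddef.h>
#include <stdint.h>

/* Each function returns the number of elements before the first 0 in p. */

size_t count_size_t(const int *p)
{
    size_t x = 0;
    while (p[x++] != 0) { }      /* no wrap to model: pointer-bump friendly */
    return x - 1;
}

int count_int(const int *p)
{
    int x = 0;
    while (p[x++] != 0) { }      /* overflow is UB, so x may be widened to 64 bits */
    return x - 1;
}

uint32_t count_u32(const int *p)
{
    uint32_t x = 0;
    while (p[x++] != 0) { }      /* x must wrap to 0 after 0xFFFFFFFF */
    return x - 1;
}
```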

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Sep 16 07:25:38 2024
    Tim Rentsch <[email protected]> schrieb:
    If the loop variable
    represents degrees C or F, or some other naturally signed measure it
    should be signed (or maybe floating point).

    The first one is a bad idea because temperature is a continuous
    physical quantity.

    The second has bad implications for constructs like

    DO R = 0.0, 1.0, 0.1

    where it will depend on details of floating-point arithmetic whether the
    number of loop trips is 10 or 11.

    You can argue that people can write

    DO R=0.0, 1.05, 0.1

    but this construct was error-prone enough that it was deleted
    from the Fortran standards.

    What kind of loop it
    is, whether ascending or descending, or what the increment is, etc,
    is secondary; a more important factor is what sort of value is
    being represented, and in almost all cases that is what should
    determine the type used.

    Not for floating point numbers. For that, you should simply do

    DO I=0,10
    R = I * 0.1

    or

    R = 0.0
    DO I=0,10
    ...
    R = R + 0.1
    END DO

    whichever rounding error you prefer.
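    The same effect shows up in C. This sketch assumes IEEE 754 double
    arithmetic with round-to-nearest, and its function names are
    hypothetical. Stepping by 0.1 from 0.0 while r <= 0.3 runs only three
    times, because the third sum is 0.30000000000000004, just above the
    double nearest 0.3. Deriving the value from an integer counter gives the
    expected four trips.

```c
/* Trip count of a loop controlled by a floating-point variable. */
int trips_fp_control(double start, double stop, double step)
{
    int n = 0;
    for (double r = start; r <= stop; r += step)
        n++;
    return n;
}

/* Integer-controlled loop (the DO I=0,10; R = I * 0.1 style). The value is
   recomputed from the counter each time, so errors do not accumulate.
   *last receives the final value of r. */
int trips_int_control(int count, double step, double *last)
{
    int n = 0;
    for (int i = 0; i <= count; i++) {
        *last = i * step;
        n++;
    }
    return n;
}
```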

    Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
    C/C++ is a mistake. It should always have been ILP64, and this nonsense
    would go away. Any new architecture should make C ILP64 (looking at you
    RISC-V, missing yet another opportunity to not make the same mistakes as
    everyone else).

    I believe this view is shortsighted. The big mistake is developers hardcoding types everywhere - especially int, but also long, and
    their unsigned variants. It's almost never a good idea to hardcode
    a specific width (eg, uint32_t) in a type name used for parameters
    or local variables, but that is by far a very common practice.

    Hence Fortran's SELECTED_REAL_KIND and SELECTED_INT_KIND...
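    C has no direct equivalent of SELECTED_INT_KIND, but the <stdint.h>
    "least" and "fast" types let code state a minimum range instead of an
    exact width. Here is a sketch built on a hypothetical application type,
    count9_t, which must hold any 9-digit decimal value. That is roughly what
    SELECTED_INT_KIND(9) requests.

```c
#include <stdint.h>

/* Smallest type with at least 32 bits, enough for +/-999999999.
   int_fast32_t would instead pick the fastest such type. */
typedef int_least32_t count9_t;

/* Record the range requirement itself rather than a width. */
_Static_assert(INT_LEAST32_MAX >= 999999999, "count9_t must hold 9 decimal digits");

static count9_t count9_add(count9_t a, count9_t b)
{
    return (count9_t)(a + b);   /* caller keeps |a + b| <= 999999999 */
}
```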

  • From Michael S@21:1/5 to Tim Rentsch on Mon Sep 16 11:34:56 2024
    On Sun, 15 Sep 2024 18:47:06 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sun, 15 Sep 2024 20:13:44 +0200
    David Brown <[email protected]> wrote:

    struct Bar {
    char x[8];
    int y;
    } bar;


    int foo(int i) {
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
    }

    It generates:

    foo:
    movslq %edi,%rdi
    movl $1234, %eax
    movl $1234, bar+8(%rip)
    movb $42, bar(%rdi)
    ret

    That is, y is /not/ reloaded after bar.x[i] is set.

    No other compiler on godbolt is doing it, except possibly gcc
    clones. Not even clang, whose former leader wrote "Nasal Manifest".


    Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
    both show bar.y not being overwritten (optimization levels -O1 or -O2)
    when foo() is called.

    I didn't mean to say that gcc3 is the only gcc version that returns non-overwritten value.
    I meant to say that all gcc versions are in one camp and the rest of
    compilers represented on Godbolt are in the other camp.

  • From Terje Mathisen@21:1/5 to David Brown on Mon Sep 16 10:37:47 2024
    David Brown wrote:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the notion
    of int from K&R days.

    That's a designer's choice, I think.  It is possible to add 32-bit
    instructions which should be as fast (or possibly faster) than
    64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not so
    easy without either making rather complicated instructions or having assembly instructions with undefined behaviour (imagine the terror that would bring to some people!).

    A classic example would be for "y = p[x++];" in a loop.  For a 64-bit
    type x, you would set up one register once with "p + x", and then have a load with post-increment instruction in the loop.  You can also do that with x as a 32-bit int, unless you are of the opinion that enough apples added to a pile should give a negative number of apples.  But with a wrapping type for x - such as unsigned int in C or modulo types in Ada,
    you have little choice but to hold "p" and "x" separately in registers,
    add them for every load, and do the increment and modulo operation.  I really can't see this all being handled by a single instruction.

    This becomes much simpler in Rust where usize is the only legal index type:

    Yeah, you have to actually write it as

    y = p[x];
    x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From David Brown@21:1/5 to All on Mon Sep 16 10:34:19 2024
    On 15/09/2024 21:13, MitchAlsup1 wrote:
    On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:

    On 15/09/2024 19:21, MitchAlsup1 wrote:
    On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

    Robert Finch <[email protected]> writes:
    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually
    packed. In case it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

        struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
            uint64_t reserved_41_63              : 23;
            uint64_t dbe                         :  9; /**< R/W1C/H - RAM ECC DBE detected. */
            uint64_t reserved_9_31               : 23;
            uint64_t sbe                         :  9; /**< R/W1C/H - RAM ECC SBE detected. */
    #else
            uint64_t sbe                         :  9;
            uint64_t reserved_9_31               : 23;
            uint64_t dbe                         :  9;
            uint64_t reserved_41_63              : 23;
    #endif
        } s;

    Which brings to mind a slightly different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a register
    sized container, how does one specify that bit-field in C ??

    So, assume a register contains 64-bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C.

    You do so inconveniently, perhaps with access inline functions rather
    than a bit-field struct.

    Fortunately, not many hardware designers are that sadistic.  (Or perhaps
    they /are/ that sadistic, but lack the imagination for that particular
    trick.)

    In My 66000 ISA it is both efficient and straightforward::


    That does not change that it is inconvenient in C, which is what you
    asked about. For any ISA, there will always be things that can easily be
    written in C that are awkward in assembly, and vice versa.



        i = struct.field;
    ..
        struct.field = j;

        CARRY    Rsf1,{I}
        SRA      Ri,Rsf0,<17,53>
    and
        CARRY    Rsf1,{O}
        INS      Rsf0,Rj,<52,17>

    Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
    need for these registers to be sequential.

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISA from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general. It is highly unlikely to be necessary or even
    beneficial from the hardware viewpoint, but really inconvenient on the
    software side (whether you use bit-fields or not).

    Some hardware designers seem to have no understanding of or
    consideration for the software folks that will use their designs. "HW
    Sadism" is no doubt too strong a term - ignorance and a lack of
    consideration is more realistic.

    If the ISA has any realistically efficient grasp on multi-precision
    integer operations, these fall out almost for free.

    I can't see that. I am not saying you are wrong, but I don't see the connection.
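    Here is a sketch of the "access inline functions" approach for Mitch's
    example: a 17-bit field at bits 53..69 of a 128-bit structure. It assumes
    the structure is held as two uint64_t words, with w[0] holding bits
    0..63. The type u128_words and the functions get_field and set_field are
    hypothetical.

```c
#include <stdint.h>

/* A 128-bit container as two 64-bit words: w[0] = bits 0..63, w[1] = 64..127. */
typedef struct { uint64_t w[2]; } u128_words;

/* Extract `width` bits starting at bit `lsb`.
   Assumes 0 < width < 64 and lsb + width <= 128. */
static uint64_t get_field(const u128_words *c, unsigned lsb, unsigned width)
{
    unsigned word = lsb / 64, off = lsb % 64;
    uint64_t v = c->w[word] >> off;
    if (off + width > 64)                     /* field spans both words */
        v |= c->w[word + 1] << (64 - off);
    return v & ((UINT64_C(1) << width) - 1);
}

/* Store the low `width` bits of val at bit `lsb`, leaving other bits intact. */
static void set_field(u128_words *c, unsigned lsb, unsigned width, uint64_t val)
{
    unsigned word = lsb / 64, off = lsb % 64;
    uint64_t mask = (UINT64_C(1) << width) - 1;
    val &= mask;
    c->w[word] = (c->w[word] & ~(mask << off)) | (val << off);
    if (off + width > 64) {
        unsigned lo_bits = 64 - off;          /* bits already stored in w[word] */
        c->w[word + 1] = (c->w[word + 1] & ~(mask >> lo_bits)) | (val >> lo_bits);
    }
}
```

    Used as get_field(&c, 53, 17) and set_field(&c, 53, 17, v), this costs a
    couple of shifts and an OR per access. A C compiler is unlikely to turn it
    into a single instruction on most ISAs.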

  • From David Brown@21:1/5 to Tim Rentsch on Mon Sep 16 11:14:52 2024
    On 15/09/2024 21:51, Tim Rentsch wrote:
    [email protected] (MitchAlsup1) writes:

    On Sun, 15 Sep 2024 17:07:58 +0000, Scott Lurndal wrote:

    Robert Finch <[email protected]> writes:

    On 2024-09-15 12:09 p.m., David Brown wrote:

    In addition, some padding-related things can be defined by Standard
    itself. Not in this particular case, but, for example, it could be
    defined that when field of one integer type is immediately followed by
    another field of integer type with the same or narrower width then
    there should be no padding in-between.


    What about bit-fields in a struct? I believe they are usually packed. In
    case it's for something like an I/O device.

    That's a bit more complicated as it depends on the target byte-order.

    e.g.

    struct GIC_ECC_INT_STATUSR_s {
    #if __BYTE_ORDER == __BIG_ENDIAN
    uint64_t reserved_41_63 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t sbe : 9;
    #else
    uint64_t sbe : 9;
    uint64_t reserved_9_31 : 23;
    uint64_t dbe : 9;
    uint64_t reserved_41_63 : 23;
    #endif
    } s;

    Which brings to mind a slightly different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a register
    sized container, how does one specify that bit-field in C ??

    So, assume a register contains 64-bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C.

    The 17-bit bitfield can be specified in the usual way. Example:

    struct bitfield_example {
    unsigned one : 32;
    unsigned two : 20;
    unsigned hmm : 17;
    };

    An implementation is allowed to use up the last 12 bits of the
    first 64-bit unit and the first 5 bits of the next 64-bit unit.
    But, whether that happens or not is up to the implementation.
    The bitfield for member 'hmm' could instead be put entirely in
    the second 64-bit unit, with the last 12 bits of the first 64-bit
    unit simply left as padding. There is no standard way to force
    it.


    Yes, implementations get to choose this, with most implementations
    following the specifications from the ABI for the target.

    Many implementations have a way to specify tighter packing, but
    naturally this is not standardised. But it can give a picture of the differences in code generation between the two options, which makes it
    easy to see why most compilers do not split bit-fields across two
    storage units.

    (There is a standard way to specify that "hmm" above is /not/ packed
    across two units - adding a field "unsigned : 0;" between "two" and
    "hmm" forces this.)

    <https://godbolt.org/z/sYxWjM766>

  • From David Brown@21:1/5 to BGB on Mon Sep 16 11:27:15 2024
    On 16/09/2024 09:18, BGB wrote:
    On 9/15/2024 12:46 PM, Anton Ertl wrote:
    Michael S <[email protected]> writes:
    Padding is another thing that should be Implementation Defined.

    It is.  It's defined in the ABI, so when the compiler documents to
    follow some ABI, you automatically get that ABI's structure layout.
    And if a compiler does not follow an ABI, it is practically useless.


    Though, there also isn't a whole lot of freedom of choice here regarding layout.

    If member ordering or padding differs from typical expectations, then
    any code which serializes structures to files is liable to break, and
    this practice isn't particularly uncommon.


    Your expectations here should match up with the ABI - otherwise things
    are going to go wrong pretty quickly. But I think most ABIs will have
    fairly sensible choices for padding and alignments.
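    For the struct-to-file case, one way to avoid depending on the ABI layout
    is to serialise field by field, using fixed-width types and an explicit
    byte order. This is a sketch only. The record layout, RECORD_BYTES and
    the helper names are hypothetical, and little-endian on-disk order is an
    arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>

struct record {                      /* hypothetical in-memory record */
    int32_t  id;
    uint16_t flags;
    uint64_t offset;
};

enum { RECORD_BYTES = 4 + 2 + 8 };   /* on-disk size: no padding */

static void put_le(unsigned char *p, uint64_t v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)(v >> (8 * i));
}

static uint64_t get_le(const unsigned char *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

void record_encode(const struct record *r, unsigned char out[RECORD_BYTES])
{
    put_le(out, (uint32_t)r->id, 4);  /* two's complement bit pattern of id */
    put_le(out + 4, r->flags, 2);
    put_le(out + 6, r->offset, 8);
}

void record_decode(struct record *r, const unsigned char in[RECORD_BYTES])
{
    /* Converting values above INT32_MAX back to int32_t is
       implementation-defined; gcc and clang reduce modulo 2^32. */
    r->id = (int32_t)(uint32_t)get_le(in, 4);
    r->flags = (uint16_t)get_le(in + 4, 2);
    r->offset = get_le(in + 6, 8);
}
```

    The file format then no longer depends on padding, sizeof(int) or host
    endianness, at the cost of one encode/decode function per record type.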

    Say, typical pattern:
    Members are organized in the same order they appear in the source code;

    That is required by the C standards. (A compiler can re-arrange the
    order if that does not affect any observable behaviour. gcc used to
    have an optimisation option that allowed it to re-arrange struct
    ordering when it was safe to do so, but it was removed as it was rarely
    used and a serious PITA to support with LTO.)

    If the current position is not a multiple of the member's alignment, it
    is padded to an offset that is a multiple of the member's alignment;

    That is a requirement in the C standards.

    The only implementation-defined option is whether or not there is
    /extra/ padding - and I have never seen that in practice. (And there
    are more implementation-defined options for bit-fields.)

    For primitive types, the alignment is equal to the size, which is also a power of 2;

    That is the norm, up to the maximum appropriate alignment for the
    architecture. A 16-bit cpu has nothing to gain by making 32-bit types
    32-bit aligned.

    If needed, the total size of the struct is padded to a multiple of the largest alignment of the struct members.

    That is required by the C standards.
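    The layout rules listed above can be checked from code with offsetof and
    alignof. Here is a sketch using a hypothetical struct. It checks only what
    C guarantees (declaration order, each member suitably aligned, size a
    multiple of the struct's alignment), not any particular ABI's offsets.

```c
#include <stdalign.h>
#include <stddef.h>

struct layout_demo {          /* hypothetical example */
    char   c;
    double d;                 /* padding inserted before d as needed */
    short  s;                 /* tail padding after s as needed */
};

/* Returns 1 if the layout follows the rules listed above. */
int layout_rules_hold(void)
{
    size_t oc = offsetof(struct layout_demo, c);
    size_t od = offsetof(struct layout_demo, d);
    size_t os = offsetof(struct layout_demo, s);

    return oc == 0 && oc < od && od < os            /* source order kept */
        && od % alignof(double) == 0                /* members aligned */
        && os % alignof(short) == 0
        && sizeof(struct layout_demo) % alignof(struct layout_demo) == 0;
}
```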




    For C++ classes, it is more chaotic (and more compiler dependent), but:

    Not really, no. Apart from a few hidden bits such as pointers to handle virtual methods and virtual inheritance, the data fields are ordered,
    padded and aligned just like in C structs. And these hidden pointers
    follow the same rules as any other pointer.

    The only other special bit is empty base class optimisation, and that's
    pretty simple too.

  • From Niklas Holsti@21:1/5 to Thomas Koenig on Mon Sep 16 12:26:20 2024
    On 2024-09-16 10:25, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:
    If the loop variable
    represents degrees C or F, or some other naturally signed measure it
    should be signed (or maybe floating point).

    The first one is a bad idea because temperature is a continuous
    physical quantity.

    The second has bad implications for constructs like

    DO R = 0.0, 1.0, 0.1

    where it will depend on details of floating-point arithmetic whether the
    number of loop trips is 10 or 11.

    You can argue that people can write

    DO R=0.0, 1.05, 0.1

    but this construct was error-prone enough that it was deleted
    from the Fortran standards.

    What kind of loop it
    is, whether ascending or descending, or what the increment is, etc,
    is secondary; a more important factor is what sort of value is
    being represented, and in almost all cases that is what should
    determine the type used.

    Not for floating point numbers. For that, you should simply do

    DO I=0,10
    R = I * 0.1

    or

    R = 0.0
    DO I=0,10
    ...
    R = R + 0.1
    END DO

    whichever rounding error you prefer.

    Bringing it back to "architecture" Like Anton Ertl has said, LP64 for
    C/C++ is a mistake. It should always have been ILP64, and this nonsense
    would go away. Any new architecture should make C ILP64 (looking at you
    RISC-V, missing yet another opportunity to not make the same mistakes as
    everyone else).

    I believe this view is shortsighted. The big mistake is developers
    hardcoding types everywhere - especially int, but also long, and
    their unsigned variants. It's almost never a good idea to hardcode
    a specific width (eg, uint32_t) in a type name used for parameters
    or local variables, but that is by far a very common practice.


    I agree. This issue guided the design of the scalar type system in Ada.

    C programmers can use typedef to get part way there, but not all the way because typedefs are still weakly typed.


    Hence Fortran's SELECTED_REAL_KIND and SELECTED_INT_KIND...


    And the way Ada programmers can define application-specific types with
    the ranges and precisions the application needs.
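    In C, the usual workaround for weakly typed typedefs is to wrap the value
    in a one-member struct, since distinct struct types are incompatible, and
    to check the range in a constructor. This sketch is much weaker than
    Ada's range types, because arithmetic is not checked automatically. The
    names celsius, metres and make_celsius, and the chosen range, are all
    hypothetical.

```c
#include <stdbool.h>

/* Distinct struct types do not convert into each other implicitly, unlike
   two typedefs of the same integer type. */
typedef struct { int value; } celsius;
typedef struct { int value; } metres;

/* Range-checked constructor: -273..1000 is an arbitrary example range. */
static bool make_celsius(int v, celsius *out)
{
    if (v < -273 || v > 1000)
        return false;
    out->value = v;
    return true;
}
```

    Passing a metres value where a celsius is expected is then a compile-time
    error, which a plain typedef would not give you.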

  • From David Brown@21:1/5 to BGB on Mon Sep 16 13:12:15 2024
    On 16/09/2024 02:00, BGB wrote:
    On 9/15/2024 2:09 PM, David Brown wrote:
    On 14/09/2024 04:39, BGB wrote:
    On 9/13/2024 10:55 AM, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use for 20
    years.  There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.

    What about VLAs?


    IIRC, VLAs and _Complex and similar still don't work in MSVC.
    Most of the rest does now at least.


    Thanks - you know it far better than I do.


    I use it fairly often.
    Mostly VS2022 at present.
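    Since C11, VLAs and complex types are optional features. An implementation
    that omits them is expected to define __STDC_NO_VLA__ or
    __STDC_NO_COMPLEX__. Here is a sketch of code that uses a VLA where one is
    available and falls back to malloc otherwise. The function and its
    scratch buffer are purely illustrative, and whether a given MSVC version
    and mode defines these macros should be checked, not assumed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sum of squares of v[0..n-1], computed via a scratch buffer. */
double sum_squares(size_t n, const double *v)
{
    size_t len = n ? n : 1;              /* a zero-length VLA is not allowed */
#ifndef __STDC_NO_VLA__
    double tmp[len];                     /* variable length array on the stack */
#else
    double *tmp = malloc(len * sizeof *tmp);
    if (tmp == NULL)
        return 0.0;
#endif
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        tmp[i] = v[i] * v[i];
        s += tmp[i];
    }
#ifdef __STDC_NO_VLA__
    free(tmp);
#endif
    return s;
}
```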

    There are only two serious, general purpose C compilers in mainstream
    use - gcc and clang, and both support almost all of C23 now.  But it
    will take a while for the more niche tools, such as some embedded
    compilers, to catch up.

    It is almost impossible to gather statistics on compiler use,
    especially with free compilers, but what about MSVC and icc?

     From what I gather:
       GCC and Clang are popular for most mainline targets;
         GCC is the dominant C compiler on Linux.

    It is also far and away the dominant compiler for embedded systems -
    both embedded Linux and small embedded systems.


    Albeit, ones with semi-popular CPU architectures.

    What do you mean by that? ARM currently has perhaps 90% of the market
    for small embedded systems, and gcc is used for development on perhaps
    85% of those systems. Major non-ARM microcontroller cores include AVR,
    RISC-V, ESP-32 and the undying PIC16x and 8051 cores. Only the last two
    there do not have gcc ports, but those devices have almost died out of
    the market for new designs. clang is still the "new kid on the block"
    for small-systems embedded development, and has not yet made a big
    impact. And of course there are a range of high-price commercial
    toolchains that are very popular in some areas, but not a big fraction
    of users overall.

    Though, GCC and Linux kinda go together here.

    Small embedded systems don't run Linux. And the people developing for
    them usually do so on Windows, not Linux. So in the embedded
    development world, gcc dominates, Linux does not. (But for embedded
    Linux systems, gcc dominates.)

    Say, one isn't going to find Linux ported to targets outside the scope
    of GCC,

    True.

    and GCC isn't too interested outside the scope of targets that
    could potentially run Linux and see at least semi-widespread use.

    False.

    Much of the development work in gcc is done based on which company pays
    for the work, and many of the biggest commercial backers have an
    interest in Linux (Intel, AMD, ARM, IBM, Google, Facebook, etc. - even Microsoft). But the gcc ports for smaller microcontrollers also have
    their commercial backers, and they only need to concentrate on the
    backend - they get most of the benefits (new language support, most optimisations, static error checking, etc.) for free.

    The huge majority of current embedded systems use ARM Cortex-M cores.
    The huge majority of these run software developed with gcc. None of
    them run Linux.




    .
       TinyCC, popular for niche use, but limited range of targets;
         x86, ARM, experimental RISC-V.
       SDCC, popular for 8/16 bit targets;

    SDCC has never been very popular.  For the targets SDCC support, Keil
    (8051) and IAR (many small CISC targets) are far more common.  But for
    these kinds of devices, you are never working in anything close to
    standard C anyway.


    OK.

    I had mostly heard of people using SDCC here.

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both. They are not
    representative of developers. And while I think SDCC is a very
    impressive project and it would be my own first choice if I were working
    with brain-dead 8-bitters, its popularity is close to negligible. And
    that is in a market for 8-bit cores that is rapidly disappearing.



       CC65, popular for 6502 and 65C816;

    That's getting /really/ obscure now.  There are thousands of C
    compilers that are used, or have been used, for various
    microcontrollers.  But if you sum all their uses over the last decade,
    it will not be close to 1% of the total use of C compilers.


    This is mostly for the crowd still messing around with a few older systems:
      Commodore 64/128
      Apple II / II/C / II/E
      Apple IIGS
      NES and SNES
      ...

    It is not a "crowd" - it's a small group of oddballs and enthusiasts. I
    fully support them, and playing with these things is a great hobby. I
    would maybe be doing that too, if I had twice as many hours in the week.
    But talking about "popular compilers like gcc and CC65" is like
    talking about "popular sports like football and Inuit ear pulling contests".


    Also, some newer projects, like the "Commander X16" are also using CC65
    (it was based around a 65C816 being used in a 6502 compatibility mode).


    Where, AFAIK, GCC proper has little interest in these targets.


    The GCC community would be quite happy to support such targets, but
    someone would need to make the port. And the architecture of the gcc
    compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
    getting good results for a cpu like the 6502 from gcc would be an
    extraordinary level of effort. It makes a lot more sense to look at
    tools like SDCC with an architecture that fits better.

  • From David Brown@21:1/5 to Michael S on Mon Sep 16 12:39:54 2024
    On 16/09/2024 10:34, Michael S wrote:
    On Sun, 15 Sep 2024 18:47:06 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sun, 15 Sep 2024 20:13:44 +0200
    David Brown <[email protected]> wrote:

    struct Bar {
    char x[8];
    int y;
    } bar;


    int foo(int i) {
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
    }

    It generates:

    foo:
    movslq %edi,%rdi
    movl $1234, %eax
    movl $1234, bar+8(%rip)
    movb $42, bar(%rdi)
    ret

    That is, y is /not/ reloaded after bar.x[i] is set.

    No other compiler on godbolt is doing it, except possibly gcc
    clones. Not even clang, whose former leader wrote "Nasal Manifest".


    Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
    both show bar.y not being overwritten (optimization levels -O1 or -O2)
    when foo() is called.

    I didn't mean to say that gcc3 is the only gcc version that returns non-overwritten value.

    I also did not mean to imply that - I meant merely to show that gcc has generated code this way since at least that version.

    I meant to say that all gcc versions are in one camp and the rest of compilers represented on Godbolt are in the other camp.


    Yes, but you were wrong about that. And even if you were right, it
    would still be irrelevant - your argument that "what I wrote is how all production C compilers work today" has been shattered. The most-used C compiler does not work as you thought, and has not done so for at least
    20 years. Indeed, for some targets (such as 32-bit ARM that I tested)
    it does the write to bar.x[i] first, then the write to bar.y, because
    that makes more sense from an instruction scheduling viewpoint.

  • From David Brown@21:1/5 to Michael S on Mon Sep 16 12:32:39 2024
    On 15/09/2024 21:42, Michael S wrote:
    On Sun, 15 Sep 2024 20:13:44 +0200
    David Brown <[email protected]> wrote:

    On 14/09/2024 23:19, Michael S wrote:

    Yes, exactly.


    Contrary to your imagination - compilers have /never/ followed your
    proposed semantics. The oldest gcc version I found on godbolt.org is
    3.4.6 from 2006, and given:

    struct Bar {
        char x[8];
        int y;
    } bar;


    int foo(int i) {
        bar.y = 1234;
        bar.x[i] = 42;
        return bar.y;
    }

    It generates:

    foo:
            movslq %edi,%rdi
            movl $1234, %eax
            movl $1234, bar+8(%rip)
            movb $42, bar(%rdi)
            ret

    That is, y is /not/ reloaded after bar.x[i] is set.


    No other compiler on godbolt is doing it, except possibly gcc clones.
    Not even clang, whose former leader wrote "Nasal Manifest".


    Is this going to be a "No true Scotsman" argument? Or did you forget to
    enable optimisations when testing /all/ the compilers on godbolt?

    I tested a couple more.

    With gcc for 32-bit ARM, the code re-arranges the stores - bar.x[i] gets
    the value of 42 before the store to bar.y is done, and bar.y is not
    reloaded. This is perfectly valid code generation.

    icc generates the same code as gcc for x86-64, other than the order of
    the first two instructions.


    Compilers are, of course, free to re-read bar.y. But they are not
    obliged to. And a good enough optimising compiler will not re-read
    bar.y because it is a waste of instruction cycles. Most of the C
    compilers on godbolt do not optimise as well as gcc does, though some
    (like clang and icc) will do better in a minority of cases. I know of a
    number of other heavily optimising compilers that are not on godbolt
    because they have high costs and licenses that forbid that kind of use.


    However, what we have from godbolt is a clear pattern - there is
    absolutely no basis for suggesting that accessing bar.x[] beyond the
    defined limit of the array is defined in any way, either within the C standards, practical real-world compilers, or documented extensions in compilers.

  • From David Brown@21:1/5 to BGB on Mon Sep 16 14:30:29 2024
    On 16/09/2024 01:54, BGB wrote:
    On 9/15/2024 2:40 PM, David Brown wrote:
    On 14/09/2024 08:34, BGB wrote:
    On 9/13/2024 10:30 AM, David Brown wrote:
    On 12/09/2024 23:14, BGB wrote:
    On 9/12/2024 9:18 AM, David Brown wrote:
    On 11/09/2024 20:51, BGB wrote:
    On 9/11/2024 5:38 AM, Anton Ertl wrote:
    Josh Vanderhoof <[email protected]> writes:
    [email protected] (Anton Ertl) writes:


    <snip lots>


    Though, it generally takes a few years before new features become usable.
    Like, it is only in recent years that it has become "safe" to use
    most parts of C99.


    Most of the commonly used parts of C99 have been "safe" to use for
    20 years.  There were a few bits that MSVC did not implement until
    relatively recently, but I think even they have caught up now.


    Until VS2013, the most one could really use was:
       // comments
       long long
    Otherwise, it was basically C90.
       'stdint.h'? Nope.
       Ability to declare variables wherever? Nope.
       ...

    Nonsense.

    MS basically gave up on C and concentrated on C++ (then later C# and
    other languages).  Their C compiler gained the parts of C99 that were
    in common with C++ - and anyway, most people (that I have heard of)
    using MSVC for C programming actually use the C++ compiler but stick
    approximately to a C subset.  And this has been the case for a /long/
    time - long before 2013.


    Go and try to write C with variables not declared at the start of a
    block in VS2008 or similar and see how far you get...


    While it may work in C++ mode, it did not work in C mode.

    You have tried it, I have not, so I will take your word for it. Perhaps
    those I heard of using it were, as you say, compiling in C++ mode - my understanding is that is very common with MSVC.


    IIRC, the ability to declare variables wherever got added in VS2013.
    Looks like 'stdint.h' got added in VS2010.


    <stdint.h> became part of C++ in C++11, but most C and C++ compilers
    have had it since shortly after C99 came out, even if they did not
    support much more of C99.

    I can sort of understand MS being lazy about supporting new C standards
    and features that required effort - after all, very few people use MSVC
    in C mode. But a <stdint.h> header only takes a dozen lines on a fixed platform and is directly useful in C++ as well as C. I suppose at that
    time MS was still desperately trying to fight against anything that was
    open and not tying people into their systems, so they'd rather people
    used DWORD and the like than uint32_t.




       Whether or not the target/compiler allows misaligned memory access;
         If set, one may use misaligned access.

    Why would you need that?  Any decent compiler will know what is
    allowed for the target (perhaps partly on the basis of compiler
    flags), and will generate the best allowed code for accesses like
    foo3() above.


    Imagine you have compilers that are smart enough to turn "memcpy()"
    into a load and store, but not smart enough to optimize away the
    memory accesses, or fully optimize away the wrapper functions...


    Why would I do that?  If I want to have efficient object code, I use
    a good compiler.  Under what realistic circumstances would you need
    to have highly efficient results but be unable to use a good
    optimising compiler?  Compilers have been inlining code for 30 years
    at least (that's when I first saw it) - this is not something new
    and rare.


    Say, you are using a target where you can't use GCC or similar.

    Which target would that be?  Excluding personal projects, some very
    niche devices, and long-outdated small CISC chips, there really aren't
    many devices that don't have a GCC and clang port.  Of course there
    /are/ processors that gcc does not support, but almost nobody writes
    code that has to be portable to such devices.

    And as for optimising compilers, I used at least two different
    optimising compilers in the mid nineties that inlined code
    automatically, before using gcc.  (I can't remember if they inlined
    memcpy - it was a long time ago!).  Optimising compilers are not a new
    concept, and are not limited to gcc and clang.



    It also depends on what one considers optimizing.

    Yes, that's a fair point. As far as the C language is concerned,
    there's no such thing - any generated code that gives the same (or
    equally valid) observable behaviour is simply an alternative output for
    the compiler. But it generally means that the compiler makes more than
    a minimal effort to generate more efficient results.


    But, like:
      Allocates variables into registers;
      Evaluates expressions involving constants;
      Turns "memcpy()" into inlined loads/stores in some cases;
        Essentially treating it like a builtin function.
      ...


    Well, at least BGBCC does this much.


    Very good.


    Things it doesn't do though:
      Loop unrolling;

    Loop unrolling can be difficult in a compiler - it's also not always a
    good thing in the end (cache arrangements can sometimes mean a real loop
    is faster than an unrolled loop).

      Inline functions;
      ...
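    For readers unfamiliar with the transformation, here is a hand-written sketch of 4x loop unrolling in C (the function names are made up for illustration). It trades code size for fewer loop-control instructions per element; as noted below, whether that is actually faster depends on the target and its caches.

```c
#include <stddef.h>

/* Plain loop: one add plus one loop test and branch per element. */
long sum_plain(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: one loop test per four elements, at the cost of
   larger code. Separate accumulators avoid a single dependency chain.
   A remainder loop handles n not divisible by 4. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover 0-3 elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```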


    Inlining small functions is a /very/ useful optimisation, IMHO,
    especially when it happens before other optimisations like constant propagation.
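    A small sketch of why inlining before constant propagation matters (the function names are hypothetical, and whether a given compiler actually performs these steps depends on its optimiser):

```c
/* A small helper that a compiler can inline. */
static inline int scale(int x, int factor)
{
    return x * factor + 1;
}

/* Once scale() is inlined, every operand is a compile-time constant,
   so constant propagation can reduce the body to "return 43;".
   Without inlining, the call and the arithmetic happen at run time. */
int scaled_constant(void)
{
    return scale(7, 6);
}

/* With a run-time argument, inlining still removes the call overhead
   and lets the compiler fold the known factor into the multiply. */
int scaled_by_six(int x)
{
    return scale(x, 6);
}
```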


    There is a partial feature to cache member loads and array loads within
    a basic-block, but will flush any such cached values whenever a memory
    store happens.

    Say:
      i=foo->bar->x + foo->bar->y;
    Will cache and reuse the first foo->bar.
    But, if you do:
      *ptr=0;
    Or:
      foo->z=3;
    It will flush any memory of the cached values (unless the pointers are 'restrict').

    There is an option to disable this caching though (at which point it
    will always do each member load). But, unlike TBAA, this optimization is
    less prone to break stuff.
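    A sketch of why cached member loads have to be discarded after a store (the struct and function names are invented for illustration). If ptr may point at foo->bar->x, the second read must observe the store, so a compiler that cannot prove otherwise has to reload.

```c
struct Inner { int x, y, z; };
struct Outer { struct Inner *bar; };

/* foo->bar is loaded once and reused for both member loads; no store
   happens in between, so caching it is safe. */
int sum_xy(const struct Outer *foo)
{
    return foo->bar->x + foo->bar->y;
}

/* A store through ptr sits between two reads of foo->bar->x. If ptr
   aliases foo->bar->x, reusing the first value would be wrong, so the
   second read must go back to memory. Returns how much x changed. */
int read_store_read(struct Outer *foo, int *ptr)
{
    int before = foo->bar->x;
    *ptr = 0;
    int after = foo->bar->x;
    return before - after;
}
```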


    Slow and always correct is better than fast and sometimes wrong!

    It also has a special feature where small leaf functions which can fit entirely in scratch registers may skip creation of a stack frame.


    But, I can note that even with these limitations, BGBCC+BJX2 still seems
    to be beating RV64G + "GCC -O3" in terms of performance in my tests
    (well, mostly because a clever compiler can't beat ISA limitations).



    Say:
    BJX2, haven't ported GCC as it looks like a pain;
       Also GCC is big and slow to recompile.

    6502 and 65C816, because these are old and probably not worth the
    effort from GCC's POV.

    Various other obscure/niche targets.


    Say, SH-5, which never saw a production run (it was a 64-bit
    successor to SH-4), but seemingly around the time Hitachi spun out
    Renesas, the SH-5 essentially got canned. And, it apparently wasn't
    worth it for GCC to maintain a target for which there were no actual
    chips (by comparison, the SH-2 and SH-4 lived on a lot longer due to
    having niche uses).


    It would be quite ridiculous to limit the way you write code because
    of possible limitations for non-existent compilers for target devices
    that have never been made.


    Hitachi did release an ISA spec for SH-5 at least (and it might have
    worked OK, if Renesas had pushed "upwards" rather than focusing almost exclusively on the small embedded / microcontroller space).

    Pushing upwards would have been a waste of money.



    But, at present, worrying about portability to things with non-power-of-2 integers, non-8-bit bytes, non-twos-complement
    arithmetic, etc, has a similar level of validity (or non-validity) to
    writing code for ISAs which never saw a release in "actual silicon".


    Agreed. There /are/ cores that have such features, like DSPs and very specialised cores, but the code you use on them is equally specialised.
    You don't need to port back and forth between such cores and "normal"
    targets.


    If the compiler is naive (wrt inline memcpy):
      memcpy(&v, cs, 8);
      rl=(v>>4)&15;
    Needs 5 instructions, but:
      v=*(uint64_t *)cs;
      rl=(v>>4)&15;
    Uses 3 instructions.

    Having the compiler turn the former into the latter is possible, but
    would require more complex pattern matching, and would likely need to be handled in the frontend rather than in the function-call operation in
    the backend.


    Can I recommend you try to implement gcc's __builtin_constant_p()
    function that determines if the result of an expression is known at
    compile time? (It's fine to have false negatives for complicated
    cases.) But it needs to be evaluated at compile time and used for
    dead-code elimination, otherwise there's little point.

    Then your standard library implementation of memcpy (assuming unaligned accesses are allowed) can be something approximately like:

    /* Sketch only: needs <stdint.h>; __real_memcpy is the out-of-line library memcpy. */
    #define memcpy(s1, s2, n) \
        if (__builtin_constant_p(n)) { \
            if ((n) == 1) { \
                uint8_t * p = (uint8_t *) (s1); \
                const uint8_t * q = (const uint8_t *) (s2); \
                *p = *q; \
            } else if ((n) == 2) { \
                uint16_t * p = (uint16_t *) (s1); \
                const uint16_t * q = (const uint16_t *) (s2); \
                *p = *q; \
            } else if ((n) == 4) { \
                uint32_t * p = (uint32_t *) (s1); \
                const uint32_t * q = (const uint32_t *) (s2); \
                *p = *q; \
            } else { \
                __real_memcpy(s1, s2, n); \
            } \
        } else { \
            __real_memcpy(s1, s2, n); \
        }


    This is missing several details to make it safe and to match the
    standard library specifications, but I believe it should be possible to
    do something along those lines. (Implementing gcc's statement
    expressions would help too.)


    Not necessarily, it wouldn't make sense for _Alignof to return 1 for
    all the basic integer types.

    Of course it makes sense to do that, on targets where an alignment of
    1 is safe and efficient.


    Tradition dictates that struct members are padded to their native alignment (usually equal to the size of the base type), unless
    the struct is 'packed'.

    No, tradition dictates that there is a maximum to the alignment,
    matching the size of the architecture. 16-bit implementations rarely
    have any type alignment greater than 16-bit, 32-bit implementations
    rarely have any alignment greater than 32-bit, and 8-bit implementations
    rarely have any alignment greater than 8-bit.


    An implementation where all structs are packed by default could have unforeseen consequences...

    Yes - such as poor performance. And of course some programmers make unwarranted assumptions about alignments and paddings.


    Presumably, _Alignof would give the same alignment as would appear in
    structs or similar.


    Yes. C requires that.
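    A small sketch of the relationship being described, assuming a C11 compiler: each member's offset is a multiple of its type's _Alignof. The exact padding differs between ABIs, so only the relationship is checked, and the struct is hypothetical.

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>

struct Mixed {
    char    c;
    int16_t h;
    int32_t i;
};

/* Each member sits at an offset that is a multiple of its type's
   alignment, and the struct size is a multiple of the struct's own
   alignment (so arrays of it stay aligned). */
int offsets_respect_alignment(void)
{
    return offsetof(struct Mixed, h) % _Alignof(int16_t) == 0
        && offsetof(struct Mixed, i) % _Alignof(int32_t) == 0
        && sizeof(struct Mixed) % _Alignof(struct Mixed) == 0;
}
```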


    But, for "minimum alignment" it may make sense to return 1 for
    anything that can be accessed unaligned.


    Again, I see no use for this.


    The main alternatives:
    Detect target architecture and "know" whether the architecture is unaligned-safe (ye olde mess of ifdef's);
    Have a global PP define that applies to all types, but this doesn't
    allow for cases where some types are unaligned safe but others are not.

    One possibility could be __minalign__(type), but (unlike doing it with preprocessor defines), one likely could not use it in preprocessor expressions.

    #if __MINALIGN_LONG__==1
    ...
    #else
    ...
    #endif

    Works, but:
    #if _Alignof(long)==1
    ...

    Poses problems, as generally the preprocessor is not able to evaluate
    things like this.
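    A sketch of the distinction, assuming C11: a predefined macro such as the proposed __MINALIGN_INT32__ (hypothetical; no compiler is assumed to define it) can be tested by the preprocessor, while _Alignof only becomes usable once the compiler proper sees types, for example in _Static_assert or an enum constant.

```c
#include <stdint.h>

/* A plain macro can be tested in #if. If the hypothetical
   __MINALIGN_INT32__ is not defined, assume nothing. */
#if defined(__MINALIGN_INT32__) && __MINALIGN_INT32__ == 1
#define INT32_UNALIGNED_OK 1
#else
#define INT32_UNALIGNED_OK 0
#endif

/* "#if _Alignof(int32_t) == 1" cannot work: the preprocessor knows
   nothing about types. _Alignof is still an integer constant
   expression, so the compiler can check it after preprocessing. */
_Static_assert(_Alignof(int32_t) <= sizeof(int32_t),
               "alignment never exceeds object size");

enum { INT32_ALIGN = (int)_Alignof(int32_t) };  /* usable in constant contexts */
```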


    Scrap all that and have functions to read or write from a given address
    with specified sizes, using whatever method the compiler sees as most
    efficient and supported by the target. Or implement mempcy()
    optimisations for small known sizes, and use that.
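    A minimal sketch of such fixed-size read/write helpers, built on memcpy with a constant size (the helper names are hypothetical). The copies are well-defined for any alignment, and optimising compilers commonly reduce each one to a single load or store on targets that permit unaligned access; a simpler compiler would need the memcpy optimisation discussed above to get the same result.

```c
#include <stdint.h>
#include <string.h>

/* Load/store helpers in native byte order; valid at any address. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

static inline uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store_u64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}
```

    With these, the earlier example becomes "rl = (load_u64(cs) >> 4) & 15;", with no pointer cast involved.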



    Where, _Alignof(int32_t) will give 4, but __MINALIGN_INT32__ would
    give 1 if the target supports misaligned pointers.


    The alignment of types in C is given by _Alignof.  Hardware may
    support unaligned accesses - C does not.  (By that, I mean that
    unaligned accesses are UB.)


    The point of __MINALIGN_type__ would be:
    If the compiler defines it, and it is defined as 1, then this allows
    the compiler to be able to tell the program that it is safe to use
    this type in an unaligned way.


    For what purpose?


    Probably for unaligned deref's on targets where "memcpy()" is a less desirable option (say, if it takes several additional CPU instructions).


    Make a better memcpy() implementation instead.


    This also applies to targets where some types are unaligned but
    others are not:
    Say, if all integer types 64 bits or less are unaligned, but 128-bit
    types are not.


    For what purpose?  And why do you want to worry about totally
    hypothetical systems?


    Note that a lot of what I am describing here is true of BJX2.


    Are you saying that you have no alignment restrictions for types up to
    64-bits (that is, they are placed at any address), but /do/ have
    alignment restrictions for 128-bit types? That would be so strange that
    I suspect I am misunderstanding you.

    Perhaps you are saying that unaligned accesses are allowed for types up
    to 64-bit even though the types are normally aligned for efficiency, but unaligned accesses are not allowed for 128-bit types? That is a lot
    more plausible, especially if there is a special implementation for
    128-bit accesses. (On x86-64 there are some SIMD vector instructions
    that do not support unaligned accesses.)


    It is also true of __m128 and similar in MSVC.
      __m128 v;
      v=*(__m128 *)someptr;
    May explode if someptr is not 16-byte aligned, as it may emit a "MOVDQA"
    or similar (rather than MOVDQU).

    But, in both cases, if "int *" or "long *" is misaligned, both are fine
    with it.


    This is all quite simple to handle - don't faff around converting
    pointer types unless you know exactly what you are doing, and you know
    it is safe to do and your alignments are correct according to the ABI requirements. A decent C compiler is not going to give you incorrect alignments unless you go out of your way to create them via explicit
    code (i.e., using casts).

    The only time you get problems is if your compiler makes certain
    assumptions (such as gcc x86-64 assuming 16-byte stack pointer
    alignment) and you have an OS that does something stupid (like Windows
    not necessarily aligning the stack pointer properly before calling
    callbacks). For that, you want compiler help.



    There may be other compilers in a similar camp.

    But, then again, it is kinda hypothetical to claim that one can't cast and deref a pointer, since on most existing targets, it works without issue (except that on GCC one may also need to use 'volatile').


    I've worked with targets where unaligned access does not work - or where
    it is immensely slow. This is something that the compiler should get
    right, and the user should rely on the compiler.




    Most of this is being compiled by BGBCC for a 50 MHz CPU.

    So, the CPU is slow and the compiler doesn't generate particularly
    efficient code unless one writes it in a way it can use effectively.

    Which often means trying to write C like it was assembler and
    manually organizing statements to try to minimize value dependencies
    (often caching any values in variables, and using lots of variables).


    In this case, the equivalent of "-fwrapv -fno-strict-aliasing" is the
    default semantics.

    Generally, MSVC also responds well to a similar coding style as used
    for BGBCC (or, as it more happened, the coding styles that gave good
    results in MSVC also tended to work well in BGBCC).


    Note that MSVC most certainly does /not/ work like "gcc -fwrapv" -
    signed integer overflow is UB in MSVC, and it generates code that
    assumes it never happens.  There is an obscure officially undocumented
    (or documented unofficially, if you prefer) flag to turn off such
    optimisations.

    Last I read about it, they had no plans to do any type-based alias
    analysis, but nor did they rule out the possibility in the future.


    I haven't seen any issues with MSVC and this sort of code usually works
    as expected...

    But, a lot of times, one has to supply these options to GCC otherwise
    the code will break. So, it almost makes sense to assume these semantics
    as a default.

    What you mean is that for some bad code, you have to supply these flags
    or you face "garbage in, garbage out". The code was already broken if
    these flags are needed for it to behave as the programmer intended.
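    As an example of the difference, here is a sketch of reading a float's bit pattern (the function name is hypothetical). The pointer-cast version, "*(uint32_t *)&f", breaks C's effective-type rules and is the sort of code that needs -fno-strict-aliasing; the memcpy version below is well-defined. The test values assume IEEE 754 single precision.

```c
#include <stdint.h>
#include <string.h>

/* Copy the bytes instead of reinterpreting the pointer. Optimising
   compilers typically turn this into a plain register move. */
uint32_t float_bits(float f)
{
    uint32_t u;
    _Static_assert(sizeof u == sizeof f, "assumes 32-bit float");
    memcpy(&u, &f, sizeof u);
    return u;
}
```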


    In the case of BGBCC, I decided to make these semantics the default as a matter of a policy decision.

    And IMHO that's a /really/ bad idea. Instead of telling users "we know
    you write shit code - so I'll assume your source code might be shit,
    even if the results are worse when you write good code", why not
    encourage people to write code correctly by giving them the best results
    for correct code? And if possible, give them tools - static and
    run-time - to help spot their mistakes, rather than blessing those
    mistakes as a new norm.

    There is some talk about pointer provenance semantics for C (apparently
    semi controversial), but admittedly thus far I don't fully understand
    the idea.

    It is complicated, but has big potential for improving static analysis, run-time checkers, and code optimisations. One thing you can be sure is
    that encouraging people to break the current C rules is only going to
    make it more likely that they will have trouble in the future.

  • From David Brown@21:1/5 to Thomas Koenig on Mon Sep 16 14:45:44 2024
    On 16/09/2024 09:17, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the notion of int from K&R days.

    That's a designer's choice, I think. It is possible to add 32-bit
    instructions which should be as fast as (or possibly faster than)
    64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not so
    easy without either making rather complicated instructions or having
    assembly instructions with undefined behaviour (imagine the terror that
    would bring to some people!).

    It has happened, see the illegal (but sometimes useful)
    6502 instructions, or the recent RISC-V implementation snafu
    (GhostWrite).

    I have seen plenty of undefined behaviour in ISAs over the years. (A
    very common case is that instruction encodings that are not specified
    are left as UB so that later extensions to the ISA can use them.) I was
    just thinking of the reactions you'd get if you made an ISA where
    attempting to overflow signed integer arithmetic was UB at the hardware
    level, so that you could get faster and simpler instructions.


    A classic example would be for "y = p[x++];" in a loop. For a 64-bit
    type x, you would set up one register once with "p + x", and then have a
    load with post-increment instruction in the loop. You can also do that
    with x as a 32-bit int, unless you are of the opinion that enough apples
    added to a pile should give a negative number of apples.

    But of course it should!

    But wait, no, the number of apples should become zero if you add
    enough of them.

    But wait... maybe if the pile becomes too large, then the apples
    will no longer be individual apples, but will be crushed under
    their weight, a bit like https://what-if.xkcd.com/4/ .


    :-)


    But with a
    wrapping type for x - such as unsigned int in C or modulo types in Ada,
    you have little choice but to hold "p" and "x" separately in registers,
    add them for every load, and do the increment and modulo operation. I
    really can't see this all being handled by a single instruction.
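    A sketch of the "y = p[x++]" pattern with three index types (function names hypothetical). A zero-terminated array is used so that the trip count is unknown and the compiler cannot use a loop bound to rule out wraparound. All three compute the same result; only the code a compiler is permitted to generate may differ on a 64-bit target.

```c
#include <stddef.h>

/* Signed int index: overflow is UB, so the compiler may treat x as
   if it were 64-bit and walk a single pointer. */
long sum_int_index(const int *p)
{
    long s = 0;
    int x = 0, y;
    while ((y = p[x++]) != 0)
        s += y;
    return s;
}

/* 32-bit unsigned index: x is defined to wrap at 2^32, so the
   compiler may need to keep x separate and re-add it to p. */
long sum_unsigned_index(const int *p)
{
    long s = 0;
    unsigned x = 0;
    int y;
    while ((y = p[x++]) != 0)
        s += y;
    return s;
}

/* Pointer-width index (like Rust's usize): no narrowing involved. */
long sum_size_index(const int *p)
{
    long s = 0;
    size_t x = 0;
    int y;
    while ((y = p[x++]) != 0)
        s += y;
    return s;
}
```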

    One reason not to use such a wrapping type.

    Agreed.


    Although, if you have (R1+R2) addressing and a 32-bit addition, this
    could actually work, but not with a post-increment instruction.

    Yes, but assuming you have 64-bit pointers you'd need a 64-bit + 32-bit addition. That could work, but I think you'd end up making your ISA a
    fair bit more complicated for little gain (compared to just using UB
    overflow int types and not going overboard in the software).

  • From David Brown@21:1/5 to Terje Mathisen on Mon Sep 16 14:48:50 2024
    On 16/09/2024 10:37, Terje Mathisen wrote:
    David Brown wrote:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the notion of int from K&R days.

    That's a designer's choice, I think.  It is possible to add 32-bit
    instructions which should be as fast as (or possibly faster than)
    64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not so
    easy without either making rather complicated instructions or having
    assembly instructions with undefined behaviour (imagine the terror
    that would bring to some people!).

    A classic example would be for "y = p[x++];" in a loop.  For a 64-bit
    type x, you would set up one register once with "p + x", and then have
    a load with post-increment instruction in the loop.  You can also do
    that with x as a 32-bit int, unless you are of the opinion that enough
    apples added to a pile should give a negative number of apples.  But
    with a wrapping type for x - such as unsigned int in C or modulo types
    in Ada, you have little choice but to hold "p" and "x" separately in
    registers, add them for every load, and do the increment and modulo
    operation.  I really can't see this all being handled by a single
    instruction.

    This becomes much simpler in Rust where usize is the only legal index type:

    Yeah, you have to actually write it as

      y = p[x];
      x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is an
    improvement for the programmer. (In general, I dislike trying to do too
    much in a single expression or statement, but some C constructs are
    common enough that I am happy with them. It would be hard to formulate concrete rules here.)

    And the resulting object code is less efficient than you get with signed
    int and "y = p[x++];" (or "y = p[x]; x++;") in C.

  • From Michael S@21:1/5 to David Brown on Mon Sep 16 16:04:02 2024
    On Mon, 16 Sep 2024 14:48:50 +0200
    David Brown <[email protected]> wrote:

    On 16/09/2024 10:37, Terje Mathisen wrote:
    David Brown wrote:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the
    notion of int from K&R days.

    That's a designer's choice, I think.  It is possible to add
    32-bit instructions which should be as fast as (or possibly faster
    than) 64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not
    so easy without either making rather complicated instructions or
    having assembly instructions with undefined behaviour (imagine the
    terror that would bring to some people!).

    A classic example would be for "y = p[x++];" in a loop.  For a
    64-bit type x, you would set up one register once with "p + x",
    and then have a load with post-increment instruction in the loop.
    You can also do that with x as a 32-bit int, unless you are of the
    opinion that enough apples added to a pile should give a negative
    number of apples.  But with a wrapping type for x - such as
    unsigned int in C or modulo types in Ada, you have little choice
    but to hold "p" and "x" separately in registers, add them for
    every load, and do the increment and modulo operation.  I really
    can't see this all being handled by a single instruction.

    This becomes much simpler in Rust where usize is the only legal
    index type:

    Yeah, you have to actually write it as

      y = p[x];
      x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is an improvement for the programmer. (In general, I dislike trying to do
    too much in a single expression or statement, but some C constructs
    are common enough that I am happy with them. It would be hard to
    formulate concrete rules here.)

    And the resulting object code is less efficient than you get with
    signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.




    It's not less efficient. usize in Rust is approximately the same as
    size_t in C. With one exception: usize overflow panics in debug
    builds.

  • From David Brown@21:1/5 to Michael S on Mon Sep 16 16:09:38 2024
    On 16/09/2024 15:04, Michael S wrote:
    On Mon, 16 Sep 2024 14:48:50 +0200
    David Brown <[email protected]> wrote:

    On 16/09/2024 10:37, Terje Mathisen wrote:
    David Brown wrote:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the
    notion of int from K&R days.

    That's a designer's choice, I think.  It is possible to add
    32-bit instructions which should be as fast as (or possibly faster
    than) 64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not
    so easy without either making rather complicated instructions or
    having assembly instructions with undefined behaviour (imagine the
    terror that would bring to some people!).

    A classic example would be for "y = p[x++];" in a loop.  For a
    64-bit type x, you would set up one register once with "p + x",
    and then have a load with post-increment instruction in the loop.
    You can also do that with x as a 32-bit int, unless you are of the
    opinion that enough apples added to a pile should give a negative
    number of apples.  But with a wrapping type for x - such as
    unsigned int in C or modulo types in Ada, you have little choice
    but to hold "p" and "x" separately in registers, add them for
    every load, and do the increment and modulo operation.  I really
    can't see this all being handled by a single instruction.

    This becomes much simpler in Rust where usize is the only legal
    index type:

    Yeah, you have to actually write it as

      y = p[x];
      x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is an
    improvement for the programmer. (In general, I dislike trying to do
    too much in a single expression or statement, but some C constructs
    are common enough that I am happy with them. It would be hard to
    formulate concrete rules here.)

    And the resulting object code is less efficient than you get with
    signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.




    It's not less efficient. usize in Rust is approximately the same as
    size_t in C.

    Ah, okay - I was thinking of it as a C unsigned int.

    With one exception: usize overflow panics in debug
    builds.


    I'm quite happy with unsigned types that are not allowed to overflow, as
    long as there is some other way to get efficient wrapping on the rare
    occasions when you need it.
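    One way to get both in today's C, as a sketch: unsigned arithmetic for explicit wrapping, and the GCC/Clang extension __builtin_add_overflow for a checked add (C23 provides ckd_add in <stdckdint.h> for the same job). These roughly correspond to Rust's wrapping_add and checked_add; the wrapper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Explicit wrapping: unsigned arithmetic is defined modulo 2^32. */
uint32_t wrapping_add_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(a + b);
}

/* Checked addition: returns false (and leaves *out unspecified for
   the caller's purposes) if the true sum does not fit in uint32_t.
   __builtin_add_overflow is a GCC/Clang extension. */
bool checked_add_u32(uint32_t a, uint32_t b, uint32_t *out)
{
    return !__builtin_add_overflow(a, b, out);
}
```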

    But I am completely against the idea that you have different defined
    semantics for different builds. Run-time errors in a debug/test build
    and undefined behaviour in release mode is fine - defining the behaviour
    of overflow in release mode (other than possibly to the same run-time
    checking) is wrong.

  • From EricP@21:1/5 to David Brown on Mon Sep 16 10:51:21 2024
    David Brown wrote:
    On 15/09/2024 21:13, MitchAlsup1 wrote:
    On Sun, 15 Sep 2024 18:48:48 +0000, David Brown wrote:

    On 15/09/2024 19:21, MitchAlsup1 wrote:

    Which brings to mind a slight different but related bit-field issue.

    If one has an architecture that allows a bit-field to span a register
    sized container, how does one specify that bit-field in C ??

    So, assume a register contains 64-bits and we have a 17-bit field
    starting at bit 53 and continuing to bit 69 of a 128-bit struct.
    How would one "properly" specify this in C.

    You do so inconveniently, perhaps with access inline functions rather
    than a bit-field struct.

    Fortunately, not many hardware designers are that sadistic. (Or perhaps they /are/ that sadistic, but lack the imagination for that particular
    trick.)
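As a rough illustration of the "access inline functions" approach, here is a sketch in C. It assumes the 128-bit struct is stored as two uint64_t words with bit 0 in the low word. The names u128_pair, get_field and set_field are made up for this example, and the field must be narrower than 64 bits.

```c
#include <stdint.h>

/* Hypothetical 128-bit container: w[0] holds bits 0..63, w[1] bits 64..127. */
typedef struct { uint64_t w[2]; } u128_pair;

#define FIELD_POS 53u   /* first bit of the 17-bit field from the question */
#define FIELD_LEN 17u   /* so the field covers bits 53..69 */

/* Read an unsigned field that may straddle the two words.
   Assumes 0 < len < 64 and pos + len <= 128. */
static inline uint64_t get_field(const u128_pair *s, unsigned pos, unsigned len)
{
    uint64_t mask = (UINT64_C(1) << len) - 1;
    if (pos >= 64)                        /* entirely in the high word */
        return (s->w[1] >> (pos - 64)) & mask;
    uint64_t v = s->w[0] >> pos;          /* low part of the field */
    if (pos + len > 64)                   /* straddles: append bits from w[1] */
        v |= s->w[1] << (64 - pos);
    return v & mask;
}

/* Write val into the field, leaving all other bits unchanged. */
static inline void set_field(u128_pair *s, unsigned pos, unsigned len, uint64_t val)
{
    uint64_t mask = (UINT64_C(1) << len) - 1;
    val &= mask;
    if (pos >= 64) {
        unsigned sh = pos - 64;
        s->w[1] = (s->w[1] & ~(mask << sh)) | (val << sh);
        return;
    }
    /* Field bits shifted past bit 63 fall off here... */
    s->w[0] = (s->w[0] & ~(mask << pos)) | (val << pos);
    if (pos + len > 64) {
        /* ...and are stored at the bottom of w[1] instead. */
        unsigned sh = 64 - pos;           /* number of field bits in w[0] */
        s->w[1] = (s->w[1] & ~(mask >> sh)) | (val >> sh);
    }
}
```

For the field in the question, get_field(&s, FIELD_POS, FIELD_LEN) takes 11 bits from w[0] and 6 bits from w[1]; the extra shifts and the straddle test are exactly the inconvenience being described.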

    In My 66000 ISA it is both efficient and straightforward::


    That does not change that it is inconvenient in C, which is what you
    asked about. For any ISA, there will always be things that can easily be written in C that are awkward in assembly, and vice versa.



    i = struct.field;
    ..
    struct.field = j;

    CARRY Rsf1,{I}
    SRA Ri,Rsf0,<17,53>
    and
    CARRY Rsf1,{O}
    INS Rsf0,Rj,<52,17>

    Note: Rsf1 and Rsf0 combined are the 128 bits container, but there is no
    need for these registers to be sequential.

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general. It is highly unlikely to be necessary or even beneficial from the hardware viewpoint, but really inconvenient on the software side (whether you use bit-fields or not).

    Some hardware designers seem to have no understanding of or
    consideration for the software folks that will use their designs. "HW Sadism" is no doubt too strong a term - ignorance and a lack of
    consideration is more realistic.

    If the ISA has any realistically efficient grasp on multi-precision
    integer operations, these fall out almost for free.

    I can't see that. I am not saying you are wrong, but I don't see the connection.

    These double-width bit-field straddle operations show up at 32-bits.
    Various FP64 formats (DEC's middle-endian FP being the worst example),
    Intel page table entries and segment/gate descriptors, come to mind.

    It's just going to take a while for double-width things to show up
    at the 64-bit level. But if FP128 becomes a reality...

    Codecs likely have to deal with double-width straddles a lot, whatever
    the register word size. So for them it likely happens at 64-bits already.

    I added a bunch of instructions for dealing with double-width operations.
    The main ISA design decision is whether to have register pair specifiers,
    R0, R2, R4,... or two separate {r_high,r_low} registers.
    In either case the main uArch issue is that now instructions have an extra source register and two dest registers, which has a number of consequences.
    But once you bite the bullet on that it simplifies a lot of things,
    like how to deal with carry or overflow without flags,
    full width multiplies, divide producing both quotient and remainder.
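To make the "full width multiply" point concrete: in portable C, without a double-width result (or a compiler extension such as unsigned __int128), a 64x64->128-bit product has to be assembled from four 32-bit partial products. This is a sketch with a made-up function name; hardware that writes a {hi, lo} register pair does all of this in one instruction.

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit unsigned multiply, returned as a {hi, lo} pair
   built from 32-bit partial products (portable C, no __int128). */
static void mul_full_u64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;      /* contributes to bits 0..63   */
    uint64_t p1 = a_lo * b_hi;      /* contributes to bits 32..95  */
    uint64_t p2 = a_hi * b_lo;      /* contributes to bits 32..95  */
    uint64_t p3 = a_hi * b_hi;      /* contributes to bits 64..127 */

    /* Middle column: three terms, each below 2^32, so no overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```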

  • From Michael S@21:1/5 to David Brown on Mon Sep 16 17:33:37 2024
    On Mon, 16 Sep 2024 16:09:38 +0200
    David Brown <[email protected]> wrote:

    On 16/09/2024 15:04, Michael S wrote:
    On Mon, 16 Sep 2024 14:48:50 +0200
    David Brown <[email protected]> wrote:

    On 16/09/2024 10:37, Terje Mathisen wrote:
    David Brown wrote:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the
    notion of int from K&R days.

    That's a designer's choice, I think.  It is possible to add
    32-bit instructions which should be as fast (or possibly faster)
    than 64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's
    not so easy without either making rather complicated
    instructions or having assembly instructions with undefined
    behaviour (imagine the terror that would bring to some people!).

    A classic example would be for "y = p[x++];" in a loop.  For a
    64-bit type x, you would set up one register once with "p + x",
    and then have a load with post-increment instruction in the loop.
    You can also do that with x as a 32-bit int, unless you are of
    the opinion that enough apples added to a pile should give a
    negative number of apples.  But with a wrapping type for x -
    such as unsigned int in C or modulo types in Ada, you have
    little choice but to hold "p" and "x" separately in registers,
    add them for every load, and do the increment and modulo
    operation.  I really can't see this all being handled by a
    single instruction.
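A minimal C sketch of the two cases in the loop above; only the index type differs, and the function names are invented here. Whether a given compiler really emits a separate add and zero-extend per load for the unsigned version depends on the compiler and target, so treat the comments as the typical expectation on a 64-bit machine rather than a guarantee.

```c
#include <stddef.h>

/* 32-bit unsigned index: x++ is defined to wrap from UINT_MAX to 0, after
   which p[x] is p[0] again. On a 64-bit target the compiler therefore
   usually keeps x separate and re-forms p + (zero-extended) x per load. */
long sum_unsigned_index(const int *p, unsigned start, unsigned n)
{
    long sum = 0;
    unsigned x = start;
    for (unsigned i = 0; i < n; i++)
        sum += p[x++];
    return sum;
}

/* size_t index: on a 64-bit target x is as wide as an address, so
   p + x * sizeof(int) in machine arithmetic is always "previous address
   + 4", and a single post-incrementing pointer is a valid lowering. */
long sum_size_index(const int *p, size_t start, size_t n)
{
    long sum = 0;
    size_t x = start;
    for (size_t i = 0; i < n; i++)
        sum += p[x++];
    return sum;
}
```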

    This becomes much simpler in Rust where usize is the only legal
    index type:

    Yeah, you have to actually write it as

      y = p[x];
      x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is an
    improvement for the programmer. (In general, I dislike trying to
    do too much in a single expression or statement, but some C
    constructs are common enough that I am happy with them. It would
    be hard to formulate concrete rules here.)

    And the resulting object code is less efficient than you get with
    signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.




    It's not less efficient. usize in Rust is approximately the same as
    size_t in C.

    Ah, okay - I was thinking of it as a C unsigned int.

    With one exception that usize overflow panics under debug
    build.


    I'm quite happy with unsigned types that are not allowed to overflow,
    as long as there is some other way to get efficient wrapping on the
    rare occasions when you need it.


    Rust has it in the form of builtin functions wrapping_*()

    But I am completely against the idea that you have different defined semantics for different builds. Run-time errors in a debug/test
    build and undefined behaviour in release mode is fine - defining the behaviour of overflow in release mode (other than possibly to the
    same run-time checking) is wrong.



    On the one hand, the Rust manual says that integer overflow in release mode
    wraps. On the other hand, it says that "Relying on integer overflow’s wrapping behavior is considered an error."
    It does not sound particularly consistent, and is rather close to the worst of
    both worlds.

    However, on the more important issue of out-of-bounds array access, Rust is consistent.

  • From Thomas Koenig@21:1/5 to David Brown on Mon Sep 16 15:34:25 2024
    David Brown <[email protected]> schrieb:

    The GCC community would be quite happy to support such targets, but
    someone would need to make the port. And the architecture of the gcc compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
    getting good results for a cpu like the 6502 from gcc would be an extraordinary level of effort. It makes a lot more sense to look at
    tools like SDCC with an architecture that fits better.

    Native compilation of gcc on a 6502 would be... interesting.

    But I think an adaption of gcc to a 6502 could actually work if
    the zero page was treated as 128 16-bit registers. Not going
    there, though :-)

  • From EricP@21:1/5 to David Brown on Mon Sep 16 11:39:55 2024
    David Brown wrote:
    On 16/09/2024 15:04, Michael S wrote:

    With one exception that usize overflow panics under debug
    build.


    I'm quite happy with unsigned types that are not allowed to overflow, as
    long as there is some other way to get efficient wrapping on the rare occasions when you need it.

    But I am completely against the idea that you have different defined semantics for different builds. Run-time errors in a debug/test build
    and undefined behaviour in release mode is fine - defining the behaviour
    of overflow in release mode (other than possibly to the same run-time checking) is wrong.

    In the compilers that do checking which I have worked with
    there was always a distinction between checked builds and debug builds.
    In my C code I have Assert() and AssertDbg(). Assert() checks stay in the
    production code; AssertDbg() checks are only in the debug builds.
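The post doesn't show these macros, so the following is only a guess at how such a pair might look in C. The names Assert and AssertDbg come from the post; the DEBUG_BUILD switch and the failure message are assumptions made for this sketch.

```c
#include <stdio.h>
#include <stdlib.h>

/* Always-on check: stays in production builds. */
#define Assert(cond)                                              \
    do {                                                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "%s:%d: Assert failed: %s\n",         \
                    __FILE__, __LINE__, #cond);                   \
            abort();                                              \
        }                                                         \
    } while (0)

/* Debug-only check: compiled away (argument not even evaluated)
   unless the hypothetical DEBUG_BUILD macro is defined. */
#ifdef DEBUG_BUILD
#define AssertDbg(cond) Assert(cond)
#else
#define AssertDbg(cond) ((void)0)
#endif
```

Keeping the switch separate from the optimisation level mirrors the point that checking and debugging are independent controls.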

    Debug builds disable optimizations and spill all variable updates
    to memory to make life easier for the debugger.
    One usually compiles debug builds with no-optimize and all checks enabled.

    But debug, optimize, and checking are separate controls.

    In the compilers for checking languages I've worked with,
    checking and optimization are compatible.
    For example, if the compiler uses an AddFaultOverflow x = x + 1 instruction
    to increment 'x' then it knows no overflow is possible and then
    can make all the other optimizations that C assumes are true.
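For comparison, ISO C has no trapping add, but GCC and Clang provide __builtin_add_overflow, which reports whether the exact result fitted. The sketch below (function names invented here) shows a trapping add in the spirit of an AddFaultOverflow instruction, a checked add that reports failure to the caller, and an explicit wrapping add similar to what Rust's wrapping_* functions give you.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Trapping add: either the result is exact or execution stops, so code
   after it may assume no overflow happened. GCC/Clang extension. */
static int32_t add_or_trap(int32_t a, int32_t b)
{
    int32_t r;
    if (__builtin_add_overflow(a, b, &r)) {
        fprintf(stderr, "integer overflow\n");
        abort();
    }
    return r;
}

/* Checked add: returns false (instead of trapping) on overflow. */
static bool add_checked(int32_t a, int32_t b, int32_t *r)
{
    return !__builtin_add_overflow(a, b, r);
}

/* Explicit wrapping: the addition is done in uint32_t, where wrap-around
   is defined. Converting back to int32_t is implementation-defined in
   ISO C; GCC and Clang define it as reduction modulo 2^32. */
static int32_t add_wrapping(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
```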

    And on those compilers checks can be controlled with quite fine resolution. Checks can be enabled/disabled based on kind of check,
    eg scalar overflow, array bounds,
    for a compilation unit, a routine, a section of code,
    a particular data type, a particular object.

    This was all standard on DEC Ada85 so if Rust compilers do not
    do this now they may in the near future.

  • From Michael S@21:1/5 to EricP on Mon Sep 16 18:58:57 2024
    On Mon, 16 Sep 2024 11:39:55 -0400
    EricP <[email protected]> wrote:

    David Brown wrote:
    On 16/09/2024 15:04, Michael S wrote:

    With one exception that usize overflow panics under debug
    build.


    I'm quite happy with unsigned types that are not allowed to
    overflow, as long as there is some other way to get efficient
    wrapping on the rare occasions when you need it.

    But I am completely against the idea that you have different
    defined semantics for different builds. Run-time errors in a
    debug/test build and undefined behaviour in release mode is fine -
    defining the behaviour of overflow in release mode (other than
    possibly to the same run-time checking) is wrong.

    In the compilers that do checking which I have worked with
    there was always a distinction between checked builds and debug
    builds. In my C code I have Assert() and AssertDbg(). Assert() checks stay in
    the production code; AssertDbg() checks are only in the debug builds.

    Debug builds disable optimizations and spill all variable updates
    to memory to make life easier for the debugger.
    One usually compiles debug builds with no-optimize and all checks
    enabled.

    But debug, optimize, and checking are separate controls.

    In the compilers for checking languages I've worked with,
    checking and optimization are compatible.
    For example, if the compiler uses an AddFaultOverflow x = x + 1
    instruction to increment 'x' then it knows no overflow is possible
    and then can make all the other optimizations that C assumes are true.

    And on those compilers checks can be controlled with quite fine
    resolution. Checks can be enabled/disabled based on kind of check,
    eg scalar overflow, array bounds,
    for a compilation unit, a routine, a section of code,
    a particular data type, a particular object.

    This was all standard on DEC Ada85 so if Rust compilers do not
    do this now they may in the near future.


    If the ability to control compiler checks was standard on DEC Ada, then it
    made DEC Ada non-standard.

  • From Stephen Fuld@21:1/5 to David Brown on Mon Sep 16 09:02:38 2024
    On 9/16/2024 4:12 AM, David Brown wrote:

    snip

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    I resemble that remark! :-)

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From David Brown@21:1/5 to Thomas Koenig on Mon Sep 16 18:59:21 2024
    On 16/09/2024 17:34, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:

    The GCC community would be quite happy to support such targets, but
    someone would need to make the port. And the architecture of the gcc
    compiler suite is best suited to processors with reasonably regular and
    orthogonal ISAs with plenty of registers and at least 16-bit width -
    getting good results for a cpu like the 6502 from gcc would be an
    extraordinary level of effort. It makes a lot more sense to look at
    tools like SDCC with an architecture that fits better.

    Native compilation of gcc on a 6502 would be... interesting.


    The 6502 would be a target, rather than a host!

    Of course there were C compilers, and many other languages, running on
    the 6502 BBC Micro and BBC Master computers. But those tools were a bit
    more compact than gcc :-)

    But I think an adaption of gcc to a 6502 could actually work if
    the zero page was treated as 128 16-bit registers. Not going
    there, though :-)

    That would be a starting point, yes. But I would not use the whole zero
    page there - perhaps just the first 32 bytes (and therefore 16 16-bit registers). Having a huge register bank would make function calls tough
    when you have to stack all the callee-saved registers in your one-page
    stack!

    With 16 register pairs, you would get close to how the AVR is
    treated in gcc - it has 32 8-bit registers which are, for many purposes, handled in pairs by the compiler. (Lowering ALU operations on 16-bit
    register pairs to 8-bit operations on single registers is done mostly as peephole optimisations at the backend.)
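A C sketch of the kind of lowering described above: one 16-bit add becomes an 8-bit add of the low bytes followed by an add-with-carry of the high bytes (ADD then ADC on the AVR, CLC followed by two ADCs on the 6502). The function name is made up, and a real compiler does this on registers rather than in C.

```c
#include <stdint.h>

/* 16-bit add expressed as two 8-bit adds; the carry out of the low
   byte feeds the high byte, as the carry flag does in hardware. */
static uint16_t add16_via_bytes(uint16_t a, uint16_t b)
{
    uint8_t a_lo = (uint8_t)a, a_hi = (uint8_t)(a >> 8);
    uint8_t b_lo = (uint8_t)b, b_hi = (uint8_t)(b >> 8);

    unsigned lo_sum = (unsigned)a_lo + b_lo;           /* like AVR "add" */
    uint8_t  carry  = (uint8_t)(lo_sum >> 8);          /* 0 or 1 */
    uint8_t  r_lo   = (uint8_t)lo_sum;
    uint8_t  r_hi   = (uint8_t)(a_hi + b_hi + carry);  /* like AVR "adc" */

    return (uint16_t)((r_hi << 8) | r_lo);
}
```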


    Someone did manage to get Linux running on an 8-bit AVR (by having the
    AVR run an ARM emulator, and using ARM Linux). I'm sure the same
    technique could be used to host Linux on a 6502 and run gcc on it,
    though you might not consider that "native". And "run" might be a bit
    of a misnomer - the AVR was a lot faster than a 6502, and it took 6
    hours to boot to login.

    <https://dmitry.gr/?r=05.Projects&proj=07.%20Linux%20on%208bit>

  • From EricP@21:1/5 to Michael S on Mon Sep 16 13:02:46 2024
    Michael S wrote:
    On Mon, 16 Sep 2024 11:39:55 -0400
    EricP <[email protected]> wrote:

    David Brown wrote:
    On 16/09/2024 15:04, Michael S wrote:

    With one exception that usize overflow panics under debug
    build.

    I'm quite happy with unsigned types that are not allowed to
    overflow, as long as there is some other way to get efficient
    wrapping on the rare occasions when you need it.

    But I am completely against the idea that you have different
    defined semantics for different builds. Run-time errors in a
    debug/test build and undefined behaviour in release mode is fine -
    defining the behaviour of overflow in release mode (other than
    possibly to the same run-time checking) is wrong.
    In the compilers that do checking which I have worked with
    there was always a distinction between checked builds and debug
    builds. In my C code I have Assert() and AssertDbg(). Assert() checks stay in
    the production code; AssertDbg() checks are only in the debug builds.

    Debug builds disable optimizations and spill all variable updates
    to memory to make life easier for the debugger.
    One usually compiles debug builds with no-optimize and all checks
    enabled.

    But debug, optimize, and checking are separate controls.

    In the compilers for checking languages I've worked with,
    checking and optimization are compatible.
    For example, if the compiler uses an AddFaultOverflow x = x + 1
    instruction to increment 'x' then it knows no overflow is possible
    and then can make all the other optimizations that C assumes are true.

    And on those compilers checks can be controlled with quite fine
    resolution. Checks can be enabled/disabled based on kind of check,
    eg scalar overflow, array bounds,
    for a compilation unit, a routine, a section of code,
    a particular data type, a particular object.

    This was all standard on DEC Ada85 so if Rust compilers do not
    do this now they may in the near future.


    If the ability to control compiler checks was standard on DEC Ada, then it
    made DEC Ada non-standard.

    No, pragma SUPPRESS (check_identifier [, ON => name]);
    is defined by the Ada85 standard, with 9 different kinds of checks that
    can be suppressed on named type, object, routine, task, etc within a scope.
    But support for pragmas and what they do is defined as implementation dependent, and may also be extended.

    pragma suppress (INDEX_CHECK, myArray);

    if supported, eliminates bounds check on just that array.

  • From David Brown@21:1/5 to Michael S on Mon Sep 16 18:44:47 2024
    On 16/09/2024 16:33, Michael S wrote:
    On Mon, 16 Sep 2024 16:09:38 +0200
    David Brown <[email protected]> wrote:

    On 16/09/2024 15:04, Michael S wrote:
    On Mon, 16 Sep 2024 14:48:50 +0200

    With one exception that usize overflow panics under debug
    build.


    I'm quite happy with unsigned types that are not allowed to overflow,
    as long as there is some other way to get efficient wrapping on the
    rare occasions when you need it.


    Rust has it in the form of builtin functions wrapping_*()


    Okay.

    But I am completely against the idea that you have different defined
    semantics for different builds. Run-time errors in a debug/test
    build and undefined behaviour in release mode is fine - defining the
    behaviour of overflow in release mode (other than possibly to the
    same run-time checking) is wrong.



    On the one hand, the Rust manual says that integer overflow in release mode wraps. On the other hand, it says that "Relying on integer overflow’s wrapping behavior is considered an error."
    It does not sound particularly consistent, and is rather close to the worst of
    both worlds.

    Yes. If it is not behaviour you can rely on (and if it is a run-time
    error in debug mode, you certainly can't rely on it!) then the compiler
    should be able to optimise ignoring the possibility of it happening,
    when checks are not enabled.


    However, on the more important issue of out-of-bounds array access, Rust is consistent.

    I think it is difficult to determine the relative importance of
    out-of-bounds array access and contradictory documentation about basic arithmetic!

  • From MitchAlsup1@21:1/5 to David Brown on Mon Sep 16 17:51:44 2024
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general. It is highly unlikely to be necessary or even beneficial from the hardware viewpoint, but really inconvenient on the software side (whether you use bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,
    ..

  • From MitchAlsup1@21:1/5 to Michael S on Mon Sep 16 17:57:49 2024
    On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:

    On Mon, 16 Sep 2024 14:48:50 +0200
    David Brown <[email protected]> wrote:

    It's not less efficient. usize in Rust is approximately the same as
    size_t in C. With one exception that usize overflow panics under debug
    build.

    One can and should argue that::

    #p++;

    should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.

  • From Niklas Holsti@21:1/5 to Michael S on Mon Sep 16 21:06:48 2024
    On 2024-09-16 18:58, Michael S wrote:
    On Mon, 16 Sep 2024 11:39:55 -0400
    EricP <[email protected]> wrote:

    David Brown wrote:
    On 16/09/2024 15:04, Michael S wrote:

    With one exception that usize overflow panics under debug
    build.


    I'm quite happy with unsigned types that are not allowed to
    overflow, as long as there is some other way to get efficient
    wrapping on the rare occasions when you need it.

    But I am completely against the idea that you have different
    defined semantics for different builds. Run-time errors in a
    debug/test build and undefined behaviour in release mode is fine -
    defining the behaviour of overflow in release mode (other than
    possibly to the same run-time checking) is wrong.

    In the compilers that do checking which I have worked with
    there was always a distinction between checked builds and debug
    builds. In my C code I have Assert() and AssertDbg(). Assert() checks stay in
    the production code; AssertDbg() checks are only in the debug builds.

    Debug builds disable optimizations and spill all variable updates
    to memory to make life easier for the debugger.
    One usually compiles debug builds with no-optimize and all checks
    enabled.

    But debug, optimize, and checking are separate controls.

    In the compilers for checking languages I've worked with,
    checking and optimization are compatible.
    For example, if the compiler uses an AddFaultOverflow x = x + 1
    instruction to increment 'x' then it knows no overflow is possible
    and then can make all the other optimizations that C assumes are true.

    And on those compilers checks can be controlled with quite fine
    resolution. Checks can be enabled/disabled based on kind of check,
    eg scalar overflow, array bounds,
    for a compilation unit, a routine, a section of code,
    a particular data type, a particular object.

    This was all standard on DEC Ada85 so if Rust compilers do not
    do this now they may in the near future.


    If the ability to control compiler checks was standard on DEC Ada, then it
    made DEC Ada non-standard.


    No, it means that DEC Ada could be used as a standard-conforming Ada
    compiler or as a non-conforming compiler, to a user-chosen extent.

    The recommended approach today (for applications where it matters) is to
    use static analysis of the Ada code (e.g. SPARK or other tools) to prove
    that run-time errors cannot happen, which then makes it possible to omit
    the corresponding run-time checks while staying compliant.

    I don't know if Rust code can be analysed as easily and completely as
    Ada code can. But Ada compilers usually allow fine-grained control over
    which checks are applied where, not just a single choice between "debug"
    and "production" builds.

  • From David Brown@21:1/5 to All on Mon Sep 16 20:08:42 2024
    On 16/09/2024 19:57, MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:

    On Mon, 16 Sep 2024 14:48:50 +0200
    David Brown <[email protected]> wrote:

    It's not less efficient. usize in Rust is approximately the same as
    size_t in C. With one exception that usize overflow panics under debug
    build.

    One can and should argue that::

        #p++;

    should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.

    That is outside the scope of C, which has no concept of address space boundaries, or even an OS (other than as something that makes the
    standard library functions work).

    Of course it is perfectly fine if, on any given implementation, trying
    to access through an invalid pointer (including beyond the end of an
    array) results in some kind of panic, crash, OS exception, or other
    error. Those are all valid for UB. But it is not possible or practical
    to specify or require such action from a language. At best, a language
    could say that some kind of run-time error handling must be supported
    and that it is triggered by certain kinds of out of bounds accesses
    (defined by the language, not by address space boundaries). Even then,
    you are not going to be able to detect all invalid pointer uses while maintaining low-level and efficient direct pointer usage.
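One way a language can define out-of-bounds checks in terms of objects rather than address spaces is to carry the length with the pointer. The sketch below is a hypothetical C "fat pointer" (the int_slice type and slice_get are invented here). It costs one compare per access and catches bad indices only for accesses made through the struct; raw pointers that bypass it are not checked at all.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fat pointer: the bounds travel with the base pointer. */
typedef struct {
    const int *base;
    size_t     len;
} int_slice;

/* Bounds-checked read: returns false instead of touching memory
   outside [0, len); *out is left unchanged in that case. */
static bool slice_get(int_slice s, size_t i, int *out)
{
    if (i >= s.len)
        return false;
    *out = s.base[i];
    return true;
}
```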

  • From David Brown@21:1/5 to All on Mon Sep 16 20:11:20 2024
    On 16/09/2024 19:51, MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general.  It is highly unlikely to be necessary or even
    beneficial from the hardware viewpoint, but really inconvenient on the
    software side (whether you use bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,
    ..

    The folks designing those register setups had a choice, and made a bad
    choice from the viewpoint of software (whether it be C, assembly, or any
    other language).

    It's conceivable that it was the right choice on balance, considering
    many factors. And it's certainly more believable that it was an
    appropriate choice when sizes were smaller. It is less believable that
    there is an overwhelming need to cross a 64-bit boundary.

  • From Scott Lurndal@21:1/5 to [email protected] on Mon Sep 16 18:22:40 2024
    [email protected] (MitchAlsup1) writes:
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general. It is highly unlikely to be necessary or even
    beneficial from the hardware viewpoint, but really inconvenient on the
    software side (whether you use bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,

    I'm not aware of any PCIe device where a field straddles the
    boundary between two 64-bit registers. There are many devices
    that split a 64-bit address across two 32-bit registers, including
    the BAR registers in the configuration space, DMA addresses, etc.

    Our CSR tool explicitly forbids field definitions that cross 64-bit
    boundaries. If necessary, the logic designer will instead define
    two smaller fields that software is required to combine explicitly.
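The split-address pattern described above means software joins or splits the halves itself. A minimal C sketch follows; the helper names are made up, and the BAR helper assumes a 64-bit memory BAR, whose low four bits hold the memory-space, type and prefetchable flags rather than address bits.

```c
#include <stdint.h>

/* Combine LO/HI 32-bit register values into one 64-bit address. */
static inline uint64_t join_u32_pair(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit (e.g. DMA) address for writing to LO/HI registers. */
static inline void split_u64(uint64_t addr, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)addr;
    *hi = (uint32_t)(addr >> 32);
}

/* 64-bit memory BAR: mask off flag bits [3:0] of the low dword first. */
static inline uint64_t bar64_base(uint32_t bar_lo, uint32_t bar_hi)
{
    return join_u32_pair(bar_lo & ~UINT32_C(0xF), bar_hi);
}
```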

  • From Dombo@21:1/5 to David Brown on Mon Sep 16 22:15:35 2024
    On 16-09-2024 13:12, David Brown wrote:
    On 16/09/2024 02:00, BGB wrote:
    On 9/15/2024 2:09 PM, David Brown wrote:
    This is mostly for the crowd still messing around with a few older
    systems:
       Commodore 64/128
       Apple II / II/C / II/E
       Apple IIGS
       NES and SNES
       ...

    It is not a "crowd" - it's a small group of oddballs and enthusiasts.  I fully support them, and playing with these things is a great hobby.  I
    would maybe be doing that too, if I had twice as many hours in the week.
     But talking about "popular compilers like gcc and CC65" is like
    talking about "popular sports like football and Inuit ear pulling
    contests".


    Also, some newer projects, like the "Commander X16" are also using
    CC65 (it was based around a 65C816 being used in a 6502 compatibility
    mode).


    Where, AFAIK, GCC proper has little interest in these targets.


    The GCC community would be quite happy to support such targets, but
    someone would need to make the port.  And the architecture of the gcc compiler suite is best suited to processors with reasonably regular and orthogonal ISAs with plenty of registers and at least 16-bit width -
    getting good results for a cpu like the 6502 from gcc would be an extraordinary level of effort.

    I wouldn't be surprised if someone had a go at creating a 6502 back-end
    for GCC. There is a 6502 back-end for LLVM (https://llvm-mos.org), it
    appears that someone put a serious amount of effort into this. I've
    played a little with it (using the Compiler Explorer site -
    https://godbolt.org/). Considering the limitations of the 6502 it seemed
    to be able to produce relatively decent code from C++ code when
    optimizations are enabled. As with AVR-GCC, you do need to keep in mind
    that the target is an 8-bitter, and respect its limitations, to get
    somewhat reasonable code out of it.

  • From Bill Findlay@21:1/5 to Niklas Holsti on Mon Sep 16 22:40:33 2024
    On 16 Sep 2024, Niklas Holsti wrote
    (in article <[email protected]>):
    ...
    The recommended approach today (for applications where it matters) is to
    use static analysis of the Ada code (e.g. SPARK or other tools) to prove
    that run-time errors cannot happen, which then makes it possible to omit
    the corresponding run-time checks while staying compliant.

    I don't know if Rust code can be analysed as easily and completely as
    Ada code can. But Ada compilers usually allow fine-grained control over
    which checks are applied where, not just a single choice between "debug"
    and "production" builds.

    I find, without using SPARK or any analysis (other than that done
    by the compiler) that going from all Ada language-defined checks
    ON to all OFF gains < 5% in speed.

    So all checks are left ON in "production" builds.

    --
    Bill Findlay

  • From Thomas Koenig@21:1/5 to David Brown on Mon Sep 16 20:15:59 2024
    David Brown <[email protected]> schrieb:
    On 16/09/2024 09:17, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> schrieb:

    In many cases int is slower now than long -- which violates the notion of int from K&R days.

    That's a designer's choice, I think. It is possible to add 32-bit
    instructions which should be as fast (or possibly faster) than
    64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not so
    easy without either making rather complicated instructions or having
    assembly instructions with undefined behaviour (imagine the terror that
    would bring to some people!).

    It has happened, see the illegal (but sometimes useful)
    6502 instructions, or the recent RISC-V implementation snafu
    (GhostWrite).

    I have seen plenty of undefined behaviour in ISAs over the years. (A
    very common case is that instruction encodings that are not specified
    are left as UB so that later extensions to the ISA can use them.)

    A much better idea is to raise an exception, that way you can
    be sure that nobody uses it for nefarious purposes.

    I was
    just thinking of the reactions you'd get if you made an ISA where
    attempting to overflow signed integer arithmetic was UB at the hardware level, so that you could get faster and simpler instructions.

    Hard to see how this would be possible... but I realize this
    is a hypothetical example.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Bill Findlay on Mon Sep 16 20:00:34 2024
    Bill Findlay wrote:
    On 16 Sep 2024, Niklas Holsti wrote
    (in article <[email protected]>):
    ....
    The recommended approach today (for applications where it matters) is to
    use static analysis of the Ada code (e.g. SPARK or other tools) to prove
    that run-time errors cannot happen, which then makes it possible to omit
    the corresponding run-time checks while staying compliant.

    I don't know if Rust code can be analysed as easily and completely as
    Ada code can. But Ada compilers usually allow fine-grained control over
    which checks are applied where, not just a single choice between "debug"
    and "production" builds.

    I find, without using SPARK or any analysis (other than that done
    by the compiler) that going from all Ada language-defined checks
    ON to all OFF gains < 5% in speed.

    So all checks are left ON in "production" builds.

    I found the same 5% performance cost in my tests with DEC Ada85.
    Most code was pretty optimal too.

    The one case where I found DEC's compiler made a complete pig's breakfast
    of the generated code was scanning a character string backwards:

    function TrimBlanks (str : in string) return Natural is
       n : natural;
    begin
       for n in reverse str'range loop
          if str(n) /= ' ' then
             return n;
          end if;
       end loop;
       return 0;
    end;

    Godbolt x86-64 gnat Ada 14.2 -O3 gives

    # Compilation provided by Compiler Explorer at https://godbolt.org/
    .LC0:
    .ascii "example.adb"
    .zero 1
    _ada_trimblanks:
    mov ecx, DWORD PTR [rsi] # _1, str$P_BOUNDS_12->LB0
    movsx rax, DWORD PTR [rsi+4] #, str$P_BOUNDS_12->UB0
    cmp ecx, eax # _1, _3
    jg .L5 #,
    movsx rdx, ecx # _2, _1
    add rax, 1 # I.0_8,
    sub rdi, rdx # _22, _2
    jmp .L4 #
    .L3:
    cmp rdx, rax # _2, I.0_8
    je .L5 #,
    .L4:
    sub rax, 1 # I.0_8,
    cmp BYTE PTR [rdi+rax], 32 # MEM <character>
    je .L3 #,
    mov edx, eax # <retval>, I.0_8
    test ecx, eax # _1, I.0_8
    jns .L1 #,
    push rax #
    mov esi, 6 #,
    mov edi, OFFSET FLAT:.LC0 #,
    call __gnat_rcheck_CE_Range_Check #
    .L5:
    xor edx, edx # <retval>
    .L1:
    mov eax, edx #, <retval>
    ret

    Gnat doesn't realize that the subtype of string's index is Positive,
    range 1..Integer'last and therefore inside the range of return
    subtype Natural, range 0..Integer'last, and therefore the "return n"
    cannot fail, and this code that tests n in that range and throws
    an exception is unnecessary:

    test ecx, eax # _1, I.0_8
    jns .L1 #,
    push rax #
    mov esi, 6 #,
    mov edi, OFFSET FLAT:.LC0 #,
    call __gnat_rcheck_CE_Range_Check #

    Also, it didn't need to use edx at all.

    The code it should have generated is

    _ada_trimblanks:
    movsx rcx, DWORD PTR [rsi] # _1, str$P_BOUNDS_12->LB0
    movsx rax, DWORD PTR [rsi+4] #, str$P_BOUNDS_12->UB0
    cmp rcx, rax # _1, _3
    jg .L5 # null range?
    sub rdi, rcx # bias data pointer so [rdi+n] addresses str(n)
    .L4:
    cmp BYTE PTR [rdi+rax], 32 # MEM <character>
    jne .L1 # non-blank found: return n
    dec rax
    cmp rcx, rax
    jle .L4 # if rax >= rcx loop
    .L5:
    xor eax, eax # <retval>
    .L1:
    ret

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Niklas Holsti on Mon Sep 16 20:04:15 2024
    Niklas Holsti wrote:
    On 2024-09-16 18:58, Michael S wrote:
    On Mon, 16 Sep 2024 11:39:55 -0400
    EricP <[email protected]> wrote:

    David Brown wrote:
    On 16/09/2024 15:04, Michael S wrote:

    With one exception that usize overflow panics under debug
    build.


    I'm quite happy with unsigned types that are not allowed to
    overflow, as long as there is some other way to get efficient
    wrapping on the rare occasions when you need it.

    But I am completely against the idea that you have different
    defined semantics for different builds. Run-time errors in a
    debug/test build and undefined behaviour in release mode is fine -
    defining the behaviour of overflow in release mode (other than
    possibly to the same run-time checking) is wrong.

    In the compilers that do checking which I have worked with
    there was always a distinction between checked builds and debug
    builds. In my C code I have Assert() and AssertDbg(). Assert stay in
    the production code, AssertDbg are only in the debug builds.

    Debug builds disable optimizations and spill all variable updates
    to memory to make life easier for the debugger.
    One usually compiles debug builds with no-optimize and all checks
    enabled.

    But debug, optimize, and checking are separate controls.
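    EricP's Assert()/AssertDbg() split can be sketched in C as follows. This
    is a hypothetical reconstruction, not his code: the helper assert_fail()
    and the example function first_index_at_least() are made up for
    illustration, and the sketch assumes the release build is selected by
    defining the standard NDEBUG macro, which removes the standard assert()
    that AssertDbg() forwards to.

```c
#include <assert.h>   /* standard assert(): removed when NDEBUG is defined */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical failure handler: report the failed check and stop. */
static void assert_fail(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
    abort();
}

/* Assert(): compiled into every build, including production. */
#define Assert(e) ((e) ? (void)0 : assert_fail(#e, __FILE__, __LINE__))

/* AssertDbg(): debug builds only. It forwards to the standard assert(),
   so a release build compiled with -DNDEBUG drops it entirely and the
   expression is not even evaluated. */
#define AssertDbg(e) assert(e)

/* Hypothetical example: the cheap argument check stays in production,
   the O(n) sortedness check runs only in debug builds. Returns the index
   of the first element >= key, or -1 if there is none. */
static int first_index_at_least(const int *a, int n, int key)
{
    Assert(n >= 0 && (a != NULL || n == 0));
    for (int i = 1; i < n; i++)
        AssertDbg(a[i - 1] <= a[i]);
    for (int i = 0; i < n; i++)
        if (a[i] >= key)
            return i;
    return -1;
}
```

    Note that this only separates checking from debugging; the optimization
    level is still chosen independently by the compiler flags.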

    In the compilers for checking languages I've worked with,
    checking and optimization are compatible.
    For example, if the compiler uses an AddFaultOverflow x = x + 1
    instruction to increment 'x' then it knows no overflow is possible
    and then can make all the other optimizations that C assumes are true.

    And on those compilers checks can be controlled with quite fine
    resolution. Checks can be enabled/disabled based on kind of check,
    eg scalar overflow, array bounds,
    for a compilation unit, a routine, a section of code,
    a particular data type, a particular object.

    This was all standard on DEC Ada85 so if Rust compilers do not
    do this now they may in the near future.


    If the ability to control compiler checks was standard on DEC Ada, then it
    made DEC Ada non-standard.


    No, it means that DEC Ada could be used as a standard-conforming Ada
    compiler or as a non-conforming compiler, to a user-chosen extent.

    The recommended approach today (for applications where it matters) is to
    use static analysis of the Ada code (e.g. SPARK or other tools) to prove
    that run-time errors cannot happen, which then makes it possible to omit
    the corresponding run-time checks while staying compliant.

    DEC Ada did that too. It seems to me that this optimization is a relatively straightforward "propagation of constants" type of problem.
    Most subtypes have a constant range

    subtype Sub1T is integer range 1..100;

    For ones with dynamic range

    subtype Sub2T is integer range x..y;

    then the worst-case range can be inferred from the range attributes
    of the subtypes of x and y:

    Sub2T'first = min (x'first, y'first)
    Sub2T'last = max (x'last, y'last)

    And there is a check for a null range, where upper bound is less than
    lower bound.

    Then all these constant attribute values propagate onto the variables
    declared with that subtype.

    That should allow most checks to evaporate as compares of constant values.
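    A minimal C sketch of this kind of range propagation, assuming a 32-bit
    Integer so that Natural is 0..INT_MAX and Positive is 1..INT_MAX. The
    names Range, dynamic_subtype() and check_needed() are hypothetical; a
    real compiler would track ranges per expression in its IR rather than
    per declaration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Static bounds of a subtype, as the compiler knows them (lo..hi). */
typedef struct {
    long long lo;
    long long hi;
} Range;

/* Ada's Natural and Positive, assuming a 32-bit Integer. */
static const Range NATURAL  = { 0, INT_MAX };
static const Range POSITIVE = { 1, INT_MAX };

/* Worst-case bounds of a dynamic subtype "range x .. y", computed from the
   static ranges of x and y with the min/max rule (a safe over-approximation). */
static Range dynamic_subtype(Range x, Range y)
{
    assert(x.lo <= x.hi && y.lo <= y.hi);
    Range r;
    r.lo = x.lo < y.lo ? x.lo : y.lo;   /* min (x'first, y'first) */
    r.hi = x.hi > y.hi ? x.hi : y.hi;   /* max (x'last, y'last) */
    return r;
}

/* Converting a value of subtype 'from' to subtype 'to' needs a run-time
   range check only if some value of 'from' could fall outside 'to'. */
static bool check_needed(Range from, Range to)
{
    return from.lo < to.lo || from.hi > to.hi;
}
```

    With these, the "return n" in TrimBlanks is a Positive-to-Natural
    conversion, so check_needed(POSITIVE, NATURAL) is false and the
    __gnat_rcheck_CE_Range_Check call could be dropped at compile time.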

    I don't know if Rust code can be analysed as easily and completely as
    Ada code can. But Ada compilers usually allow fine-grained control over
    which checks are applied where, not just a single choice between "debug"
    and "production" builds.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to David Brown on Tue Sep 17 01:36:46 2024
    David Brown <[email protected]> wrote:
    On 16/09/2024 19:51, MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general.  It is highly unlikely to be necessary or even
    beneficial from the hardware viewpoint, but really inconvenient on the
    software side (whether you use bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,
    ..

    The folks designing those register setups had a choice, and made a bad
    choice from the viewpoint of software (whether it be C, assembly, or any other language).

    It's conceivable that it was the right choice on balance, considering
    many factors. And it's certainly more believable that it was an
    appropriate choice when sizes were smaller. It is less believable that
    there is an overwhelming need to cross a 64-bit boundary.

    Several pieces of software discovered that "bad" smaller data
    structures led to faster execution. Simply, smaller data structures
    lead to better utilization of caches and buses, and the effect due to
    this was larger than the cost of the extra instructions. So the need to
    cross a 64-bit boundary may be rare, but there will be cases when it is
    the best choice.
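    To make the software-side cost concrete, here is a hedged C sketch of
    reading a field that may straddle two 64-bit words. The layout (w[0]
    holds bits 0..63, w[1] holds bits 64..127) and the name get_field128()
    are assumptions for illustration. The straddling case needs two loads,
    two shifts and an OR, and it cannot be done as a single atomic access.

```c
#include <assert.h>
#include <stdint.h>

/* Extract a 'width'-bit field starting at bit 'pos' (bit 0 = LSB of w[0])
   from a 128-bit value held as two 64-bit words, w[0] low and w[1] high. */
static uint64_t get_field128(const uint64_t w[2], unsigned pos, unsigned width)
{
    assert(width >= 1 && width <= 64 && pos + width <= 128);
    uint64_t mask = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);

    if (pos >= 64)                    /* entirely in the high word */
        return (w[1] >> (pos - 64)) & mask;
    if (pos + width <= 64)            /* entirely in the low word */
        return (w[0] >> pos) & mask;

    /* Straddles the boundary: here 0 < pos < 64, so both shifts are valid.
       Low part comes from the top of w[0], high part from the bottom of w[1]. */
    return ((w[0] >> pos) | (w[1] << (64 - pos))) & mask;
}
```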

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Sep 17 01:35:17 2024
    On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:

    Bill Findlay wrote:
    I found the same 5% performance cost in my tests with DEC Ada85.
    Most code was pretty optimal too.

    The one thing I found DEC's compiler made a complete pigs breakfast
    of the generated code was scanning a character string backwards:

    Bacon, sausage, and ham.

    Sounds yummy. Code not so much.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Mon Sep 16 19:51:26 2024
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> wrote:

    If the loop variable
    represents degrees C or F, or some other naturally signed measure it
    should be signed (or maybe floating point).

    The first one is a bad idea because temperature is a continuous
    physical quantity.

    That doesn't mean that a quantity representing degrees C or
    degrees F in a computer program always has to be a continuous
    measure. Sometimes a signed integer for degrees is what's
    needed. It depends on circumstances.

    The second has bad implications for constructs like

    DO R = 0.0, 1.0, 0.1

    where it will depend on details of floating-point arithmetic whether the
    number of loop trips is 10 or 11.

    You can argue that people can write

    DO R=0.0, 1.05, 0.1

    but this construct was error-prone enough that it was deleted
    from the Fortran standards.

    What kind of loop it
    is, whether ascending or descending, or what the increment is, etc,
    is secondary; a more important factor is what sort of value is
    being represented, and in almost all cases that is what should
    determine the type used.

    Not for floating point numbers. For that, you should simply do

    DO I=0,10
    R = I * 0.1
    ...
    END DO

    or

    R = 0.0
    DO I=0,10
    ...
    R = R + 0.1
    END DO

    whichever rounding error you prefer.

    In cases like these I mean R as the loop variable. The extra
    stuff is incidental scaffolding there only to make sure R
    takes on all the appropriate values.
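    The difference between the two loop forms shows up just as well in C
    with IEEE-754 doubles. In this hypothetical sketch, trips_accumulated()
    steps the loop variable by repeated addition, while trips_integer()
    counts with an integer and derives R = I * step on each trip. On typical
    x86-64 or AArch64 hardware, stepping from 0.0 to 0.3 by 0.1 with
    accumulation stops after 3 trips instead of the mathematically expected
    4, because the third addition yields 0.30000000000000004, which compares
    greater than 0.3.

```c
#include <assert.h>

/* Counts trips of a loop whose variable is stepped by repeated addition,
   the C analogue of DO R = start, stop, step with a REAL loop variable. */
static int trips_accumulated(double start, double stop, double step)
{
    assert(step > 0.0);
    int n = 0;
    for (double r = start; r <= stop; r += step)
        n++;
    return n;
}

/* Integer-controlled loop: when first <= last the trip count is exactly
   last - first + 1, and R is recomputed as I * step on every trip, so
   rounding errors do not accumulate. Stores the final R through last_r. */
static int trips_integer(int first, int last, double step, double *last_r)
{
    int n = 0;
    for (int i = first; i <= last; i++) {
        *last_r = i * step;   /* R = I * 0.1 */
        n++;
    }
    return n;
}
```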

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to [email protected] on Mon Sep 16 19:34:44 2024
    [email protected] (MitchAlsup1) writes:

    On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

    I didn't see any content from you in this last posting
    of yours.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Sep 16 19:33:22 2024
    Michael S <[email protected]> writes:

    On Sun, 15 Sep 2024 18:47:06 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sun, 15 Sep 2024 20:13:44 +0200
    David Brown <[email protected]> wrote:

    struct Bar {
    char x[8];
    int y;
    } bar;


    int foo(int i) {
    bar.y = 1234;
    bar.x[i] = 42;
    return bar.y;
    }

    It generates:

    foo:
    movslq %edi,%rdi
    movl $1234, %eax
    movl $1234, bar+8(%rip)
    movb $42, bar(%rdi)
    ret

    That is, y is /not/ reloaded after bar.x[i] is set.

    No other compiler on godbolt is doing it, except possibly gcc
    clones. Not even clang, whose former leader wrote "Nasal Manifest".

    Test runs on two different Ubuntu machines (gcc 7.4.0 and gcc 8.4.0)
    both show bar.y not being overwritten (optimization levels -O1 or -O2)
    when foo() is called.

    I didn't mean to say that gcc3 is the only gcc version that returns the non-overwritten value.
    I meant to say that all gcc versions are in one camp and the rest of the compilers represented on Godbolt are in the other camp.

    Okay.

    Please note that I didn't mean to dispute your statement,
    which is about compilers on godbolt. I meant only to give
    an isolated data point that might be related.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Tue Sep 17 08:07:44 2024
    David Brown wrote:
    On 16/09/2024 10:37, Terje Mathisen wrote:
    This becomes much simpler in Rust where usize is the only legal index
    type:

    Yeah, you have to actually write it as

      y = p[x];
      x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is an improvement for the programmer.  (In general, I dislike trying to do too much in a single expression or statement, but some C constructs are
    common enough that I am happy with them.  It would be hard to formulate concrete rules here.)

    And the resulting object code is less efficient than you get with signed
    int and "y = p[x++];" (or "y = p[x]; x++;") in C.

    Is that true? I'll have to check godbolt myself if that is really the case!

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Sep 17 08:43:22 2024
    Stephen Fuld wrote:
    On 9/16/2024 4:12 AM, David Brown wrote:

    snip

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    I resemble that remark!  :-)

    Ditto, probably...

    I'm 67 (but not yet retired), I taught myself the Trachtenberg
    algorithms for mental arithmetic when I was around 12 (was reminded of
    this last night when I watched Gifted on netflix), I mail ordered what
    was probably the first Rubik's cube to get to Norway. (And developed
    three different algorithms to solve it, but I only remember the last one
    now which I had optimized for simplicity, not speed.)

    Those, along with high school chess and orienteering mapping should
    count as nerdy pursuits, right?

    Winning the County Yo-Yo championship would be less so?

    Regards to all the regulars here, I do consider many of you friends that
    I just haven't met yet.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Sep 17 08:46:07 2024
    MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 13:04:02 +0000, Michael S wrote:

    On Mon, 16 Sep 2024 14:48:50 +0200
    David Brown <[email protected]> wrote:

    It's not less efficient. usize in Rust is approximately the same as
    size_t in C. With one exception that usize overflow panics under debug
    build.

    One can and should argue that::

        #p++;

    should panic if p++ crosses an address space boundary (user->OS, or OS->HyperVisor,...) as no array is allowed to cross such a boundary.

    I'm pretty sure you meant *p++; since the hash mark (#) is a comment
    separator in many languages. :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to EricP on Tue Sep 17 08:20:15 2024
    EricP wrote:
    These double-width bit-field straddle operations show up at 32-bits.
    Various FP64 formats (DEC's middle-endian FP being the worst example),
    Intel page table entries and segment/gate descriptors, come to mind.

    Lots of them in 32-bit code!

    It's just going to take a while for double-width things to show up
    at the 64-bit level. But if FP128 becomes a reality...

    If???

    Codecs likely have to deal with double-width straddles a lot, whatever
    the register word size. So for them it likely happens at 64-bits already.

    Nothing likely about it: LZ4 is pretty much the only compression
    algorithm/lossless codec that never straddles; all the rest tend to
    treat the source data as a single bitstream of arbitrary length, except

    The core of the algorithm always starts with knowing the endianness,
    then picking up 32 or 64-bit chunks of input data (byte-flipping if
    needed) and then extracting the next N bits either from the top or bottom
    of the buffer register.

    Almost by definition, this is not code that a compiler is set up to help
    you get correct.
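    A minimal LSB-first bit reader in C along these lines. It is a sketch
    under simplifying assumptions: it refills one byte at a time instead of
    loading and byte-flipping whole 32- or 64-bit words, which avoids the
    endianness question at some cost in speed, and it feeds zero bits once
    the input is exhausted. BitReader, br_init(), br_refill() and br_get()
    are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LSB-first bit reader: bits are consumed from the bottom of a 64-bit buffer. */
typedef struct {
    const uint8_t *p;     /* next input byte */
    const uint8_t *end;   /* one past the last input byte */
    uint64_t buf;         /* pending bits, next bit in bit 0 */
    unsigned count;       /* number of valid bits in buf */
} BitReader;

static void br_init(BitReader *br, const uint8_t *data, size_t len)
{
    br->p = data;
    br->end = data + len;
    br->buf = 0;
    br->count = 0;
}

/* Top up the buffer to at least 57 valid bits. Assembling bytes explicitly
   makes the result independent of host endianness; past the end of the
   input, zero bytes are fed in. */
static void br_refill(BitReader *br)
{
    while (br->count <= 56) {
        uint64_t byte = (br->p < br->end) ? *br->p++ : 0;
        br->buf |= byte << br->count;
        br->count += 8;
    }
}

/* Return the next n bits (1..32). A value may straddle input bytes freely,
   because the buffer always holds enough bits before extraction. */
static uint32_t br_get(BitReader *br, unsigned n)
{
    assert(n >= 1 && n <= 32);
    if (br->count < n)
        br_refill(br);
    uint32_t v = (uint32_t)(br->buf & ((UINT64_C(1) << n) - 1));
    br->buf >>= n;
    br->count -= n;
    return v;
}
```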


    I added a bunch of instructions for dealing with double-width operations.
    The main ISA design decision is whether to have register pair specifiers,
    R0, R2, R4,... or two separate {r_high,r_low} registers.
    In either case the main uArch issue is that now instructions have an extra source register and two dest registers, which has a number of consequences. But once you bite the bullet on that it simplifies a lot of things,
    like how to deal with carry or overflow without flags,
    full width multiplies, divide producing both quotient and remainder.

    Very nice!

    This means that you can do integer IMAC(), right?

    (hi, lo) = imac(a, b, c); // == a*b+c

    The only thing even nicer from the perspective of writing arbitrary
    precision library code would be IMAA, i.e. a*b+c+d since that is the
    largest combination which is guaranteed to never overflow the double
    register target field.
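    The bound is easy to check: with w-bit operands,
    (2^w - 1)^2 + 2(2^w - 1) = 2^(2w) - 1, so a*b+c+d always fits in a
    double-width result. The C sketch below uses 32-bit limbs so that the
    double-width type is a portable uint64_t (with 64-bit limbs one would
    need a 128-bit type such as the GCC/Clang unsigned __int128 extension).
    imaa32() and addmul_1() are hypothetical names; addmul_1() is the inner
    loop of schoolbook multiplication, where each step is exactly one IMAA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a*b + c + d for 32-bit inputs never overflows 64 bits:
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static uint64_t imaa32(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (uint64_t)a * b + c + d;
}

/* acc[0..n) += x[0..n) * m, with 32-bit limbs, least significant first.
   Each step is one IMAA: x[i]*m + acc[i] + carry. Returns the carry out. */
static uint32_t addmul_1(uint32_t *acc, const uint32_t *x, size_t n, uint32_t m)
{
    assert(n == 0 || (acc != NULL && x != NULL));
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = imaa32(x[i], m, acc[i], carry);
        acc[i] = (uint32_t)t;           /* low half stays in this limb */
        carry = (uint32_t)(t >> 32);    /* high half carries to the next */
    }
    return carry;
}
```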

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to [email protected] on Tue Sep 17 11:12:16 2024
    On Tue, 17 Sep 2024 01:35:17 +0000
    [email protected] (MitchAlsup1) wrote:

    On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:

    Bill Findlay wrote:
    I found the same 5% performance cost in my tests with DEC Ada85.
    Most code was pretty optimal too.

    The one thing I found DEC's compiler made a complete pigs breakfast
    of the generated code was scanning a character string backwards:

    Bacon, sausage, and ham.

    Sounds yummy. Code not so much.

    It seems that you and EricP give different (not to say opposite)
    meanings to the phrase "complete pigs breakfast".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Tue Sep 17 11:21:14 2024
    On Tue, 17 Sep 2024 08:20:15 +0200
    Terje Mathisen <[email protected]> wrote:

    EricP wrote:
    These double-width bit-field straddle operations show up at 32-bits. Various FP64 formats (DEC's middle-endian FP being the worst
    example), Intel page table entries and segment/gate descriptors,
    come to mind.

    Lots of them in 32-bit code!

    Lots of what in 32-bit code?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Niklas Holsti on Tue Sep 17 01:38:03 2024
    Niklas Holsti <[email protected]> writes:

    On 2024-09-16 10:25, Thomas Koenig wrote:

    Tim Rentsch <[email protected]> wrote:

    [attribution lost]

    Bringing it back to "architecture" Like Anton Ertl has said, LP64
    for C/C++ is a mistake. It should always have been ILP64, and
    this nonsense would go away. Any new architecture should make C
    ILP64 (looking at you RISC-V, missing yet another opportunity to
    not make the same mistakes as everyone else).

    I believe this view is shortsighted. The big mistake is
    developers hardcoding types everywhere - especially int, but
    also long, and their unsigned variants. It's almost never a
    good idea to hardcode a specific width (eg, uint32_t) in a type
    name used for parameters or local variables, but that is by far
    a very common practice.

    I agree. This issue guided the design of the scalar type system
    in Ada.

    C programmers can use typedef to get part way there, but not all
    the way because typedefs are still weakly typed.

    I don't agree with this characterization. There are different kinds
    of concerns here, but they don't form a linear progression.
    Granted, C has a limited type system, but typedef is not part of
    the type system, and it's important not to confuse the two. My
    comment is only about what names of types are used, not about the
    nature of type systems. As it happens I don't think the Ada type
    system is where type systems should be heading, but that is a
    separate discussion from my earlier comment.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Tue Sep 17 11:15:36 2024
    On 16/09/2024 22:15, Thomas Koenig wrote:
    David Brown <[email protected]> wrote:
    On 16/09/2024 09:17, Thomas Koenig wrote:
    David Brown <[email protected]> wrote:
    On 14/09/2024 21:26, Thomas Koenig wrote:
    MitchAlsup1 <[email protected]> wrote:

    In many cases int is slower now than long -- which violates the notion
    of int from K&R days.

    That's a designer's choice, I think. It is possible to add 32-bit
    instructions which should be as fast as (or possibly faster than)
    64-bit instructions, as AMD64 and ARM have shown.


    For some kinds of instructions, that's true - for others, it's not so
    easy without either making rather complicated instructions or having
    assembly instructions with undefined behaviour (imagine the terror that
    would bring to some people!).

    It has happened, see the illegal (but sometimes useful)
    6502 instructions, or the recent RISC-V implementation snafu
    (GhostWrite).

    I have seen plenty of undefined behaviour in ISAs over the years. (A
    very common case is that instruction encodings that are not specified
    are left as UB so that later extensions to the ISA can use them.)

    A much better idea is to raise an exception, that way you can
    be sure that nobody uses it for nefarious purposes.

    Sure. But not all processors are big enough to support such exceptions
    - many of those I have used are really small. (An "unimplemented
    instruction" exception also lets you use it for non-nefarious purposes,
    such as supporting binary compatibility with other members of the
    processor family, or as convenient user extensions.)


    I was
    just thinking of the reactions you'd get if you made an ISA where
    attempting to overflow signed integer arithmetic was UB at the hardware
    level, so that you could get faster and simpler instructions.

    Hard to see how this would be possible... but I realize this
    is a hypothetical example.

    Yes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Tue Sep 17 11:21:38 2024
    On 17/09/2024 08:07, Terje Mathisen wrote:
    David Brown wrote:
    On 16/09/2024 10:37, Terje Mathisen wrote:
    This becomes much simpler in Rust where usize is the only legal index
    type:

    Yeah, you have to actually write it as

      y = p[x];
      x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is an
    improvement for the programmer.  (In general, I dislike trying to do
    too much in a single expression or statement, but some C constructs
    are common enough that I am happy with them.  It would be hard to
    formulate concrete rules here.)

    And the resulting object code is less efficient than you get with
    signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

    Is that true? I'll have to check godbolt myself if that is really the case!


    It is not true - or at least, it shouldn't be true. I had thought the
    Rust code was using the equivalent of a C "unsigned int" here, which
    would require extra code for wrapping semantics. But that was just my misunderstanding of Rust and its types - with a 64-bit unsigned type, it
    should give the same results as C. However, there's no harm in checking
    it and letting us know.

    (I've previously shown how "y = p[x++];" in C is less efficient on
    x86-64 if x is "unsigned int", compared to "int" or 64-bit types for x.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Waldek Hebisch on Tue Sep 17 11:29:15 2024
    On 17/09/2024 03:36, Waldek Hebisch wrote:
    David Brown <[email protected]> wrote:
    On 16/09/2024 19:51, MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general.  It is highly unlikely to be necessary or even
    beneficial from the hardware viewpoint, but really inconvenient on the
    software side (whether you use bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,
    ..

    The folks designing those register setups had a choice, and made a bad
    choice from the viewpoint of software (whether it be C, assembly, or any
    other language).

    It's conceivable that it was the right choice on balance, considering
    many factors. And it's certainly more believable that it was an
    appropriate choice when sizes were smaller. It is less believable that
    there is an overwhelming need to cross a 64-bit boundary.

    Several pieces of software discovered that "bad" smaller data
    structures led to faster execution. Simply, smaller data structures
    lead to better utilization of caches and buses, and the effect due to
    this was larger than the cost of the extra instructions. So the need to
    cross a 64-bit boundary may be rare, but there will be cases when it is
    the best choice.


    It is possible, but I think it is rare.

    Perhaps my perception is biased from working with microcontrollers,
    where you often don't have caches and instruction speeds are not nearly
    as much faster than ram access speeds as you see in modern x86 systems.

    The other thing I don't like about split bit-fields is that there is
    typically no way to do atomic updates, which can mean you need extra
    care to keep things correct.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to BGB on Tue Sep 17 11:39:43 2024
    On 16/09/2024 21:46, BGB wrote:
    On 9/16/2024 4:27 AM, David Brown wrote:
    On 16/09/2024 09:18, BGB wrote:
    On 9/15/2024 12:46 PM, Anton Ertl wrote:
    Michael S <[email protected]> writes:
    Padding is another thing that should be Implementation Defined.

    It is.  It's defined in the ABI, so when the compiler documents to
    follow some ABI, you automatically get that ABI's structure layout.
    And if a compiler does not follow an ABI, it is practically useless.


    Though, there also isn't a whole lot of freedom of choice here
    regarding layout.

    If member ordering or padding differs from typical expectations, then
    any code which serializes structures to files is liable to break, and
    this practice isn't particularly uncommon.


    Your expectations here should match up with the ABI - otherwise things
    are going to go wrong pretty quickly.  But I think most ABIs will have
    fairly sensible choices for padding and alignments.


    Yeah. It is "almost fixed", as there are a lot of programs that are
    liable to break if these assumptions differ.


    Say, typical pattern:
    Members are organized in the same order they appear in the source code;

    That is required by the C standards.  (A compiler can re-arrange the
    order if that does not affect any observable behaviour.  gcc used to
    have an optimisation option that allowed it to re-arrange struct
    ordering when it was safe to do so, but it was removed as it was
    rarely used and a serious PITA to support with LTO.)


    OK.


    If the current position is not a multiple of the member's alignment,
    it is padded to an offset that is a multiple of the member's alignment;

    That is a requirement in the C standards.

    The only implementation-defined option is whether or not there is
    /extra/ padding - and I have never seen that in practice.  (And there
    are more implementation-defined options for bit-fields.)


    Extra padding seems like it wouldn't have much benefit.

    No, generally not - which is why it would be a really strange
    implementation if it had extra padding. It's possible that extra
    padding at the end of a struct could lead to more efficient array access
    by aligning to cache line sizes, but I think such things are better left
    to the programmer (possibly with the aid of compiler extensions) rather
    than attempting to specify them in the ABI.


    Albeit, types like _Bool in my implementation are padded to a full byte
    (it is treated as an "unsigned char" that is assumed to always hold
    either 0 or 1).

    That's the usual way to handle them.



    For primitive types, the alignment is equal to the size, which is
    also a power of 2;

    That is the norm, up to the maximum appropriate alignment for the
    architecture.  A 16-bit cpu has nothing to gain by making 32-bit types
    32-bit aligned.


    This comes up as an issue in some Windows file formats, where one can't
    just naively use a struct with 32-bit fields because some 32-bit members
    only have 16-bit alignment.

    Ah, the joys of using ancient formats with new systems!

    My comment above was in reference to data remaining on the system,
    rather than moving off-system.

    If I am making a format that is accessible externally - a file format, a network packet, etc., - I generally make sure all types are "naturally"
    aligned up to at least 8-byte types, even if the processor's maximum
    useful alignment is much smaller.


    If needed, the total size of the struct is padded to a multiple of
    the largest alignment of the struct members.

    That is required by the C standards.
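    The C layout rules listed above can be checked directly with offsetof
    (from <stddef.h>) and the C11 _Alignof operator. This hedged sketch
    applies the rules by hand to a small hypothetical struct S and compares
    the prediction with what the compiler actually does; on the common ABIs
    they agree, although the standard does not forbid extra padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct used to illustrate the rules. */
struct S {
    char c;   /* offset 0: members keep declaration order */
    int  i;   /* padded up to a multiple of _Alignof(int) */
    char d;   /* follows i directly: char needs no padding */
};

/* Round off up to a multiple of align (a power of two). */
static size_t align_up(size_t off, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~(align - 1);
}

/* Apply the rules by hand: pad each member to its alignment, then pad the
   total size to a multiple of the struct's alignment. Returns the
   predicted sizeof(struct S) and stores the predicted member offsets. */
static size_t predict_layout_S(size_t *off_i, size_t *off_d)
{
    size_t off = sizeof(char);                   /* c occupies offset 0 */
    *off_i = align_up(off, _Alignof(int));
    off = *off_i + sizeof(int);
    *off_d = off;                                /* char: alignment 1 */
    off += sizeof(char);
    return align_up(off, _Alignof(struct S));    /* tail padding */
}
```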




    For C++ classes, it is more chaotic (and more compiler dependent), but:

    Not really, no.  Apart from a few hidden bits such as pointers to
    handle virtual methods and virtual inheritance, the data fields are
    ordered, padded and aligned just like in C structs.  And these hidden
    pointers follow the same rules as any other pointer.

    The only other special bit is empty base class optimisation, and
    that's pretty simple too.


    For simple cases, they may match up, like a POD class may look just like
    an equivalent struct, or single-inheritance classes with virtual methods
    like a struct with a vtable, etc... But in more complex cases there may
    be compiler differences (along with differences in things like name
    mangling, etc).

    I've never seen or heard of a case where there is anything
    unexpected here.

    Sure, different C++ implementations or ABIs might have different details
    around these hidden pointers and the way they organise their vtables.
    But they are still hidden /pointers/, and these are aligned and padded
    like any other pointer. Even if the hidden data contained a bunch of
    extra bits, flags, etc., to handle complicated inheritance setups, these
    would still be padded and aligned like any other structs with bits,
    flags, etc.
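    To make the "hidden pointer" point concrete, here is a C sketch of roughly what single inheritance with one virtual function reduces to, assuming the common layout where the vtable pointer is the first member of the object. All names are hypothetical, and real C++ vtables hold more than function slots, but the pointer itself is aligned and padded like any other pointer member.

```c
#include <stddef.h>

struct shape;

/* Hypothetical vtable: one slot per virtual function. */
struct shape_vtbl {
    double (*area)(const struct shape *self);
};

/* "Base class": the hidden pointer is just an ordinary pointer member. */
struct shape {
    const struct shape_vtbl *vptr;
    int id;              /* aligned and padded exactly as in any struct */
};

/* "Derived class": base subobject first, then the new members. */
struct rect {
    struct shape base;
    double w, h;
};

static double rect_area(const struct shape *self)
{
    /* Valid because base is the first member of struct rect. */
    const struct rect *r = (const struct rect *)self;
    return r->w * r->h;
}

static const struct shape_vtbl rect_vtbl = { rect_area };

/* A "virtual call": load the vptr, then call through the slot. */
static double shape_area(const struct shape *s)
{
    return s->vptr->area(s);
}
```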


    Though, unlike with structs, programs seem less inclined to rely on the memory layout specifics of class instances.


    Of course they shouldn't be relying on such details!

  • From Terje Mathisen@21:1/5 to Michael S on Tue Sep 17 11:42:47 2024
    Michael S wrote:
    On Tue, 17 Sep 2024 08:20:15 +0200
    Terje Mathisen <[email protected]> wrote:

    EricP wrote:
    These double-width bit-field straddle operations show up at 32-bits.
    Various FP64 formats (DEC's middle-endian FP being the worst
    example), Intel page table entries and segment/gate descriptors,
    come to mind.

    Lots of them in 32-bit code!

    Lots of what in 32-bit code?


    Pretty much any 64-bit container with non-regular contents, with the
    suggested double / fp64 as the classic example?
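    To illustrate such a straddle, assuming IEEE 754 binary64: 32-bit code sees a double as two 32-bit words, and the 52-bit fraction field uses all of the low word plus the bottom 20 bits of the high word, so extracting it means combining pieces of both halves. A sketch with hypothetical helper names (`memcpy` does the type pun; the 64-bit shift in `double_halves` is only for brevity):

```c
#include <stdint.h>
#include <string.h>

/* Split a double into 32-bit halves, as 32-bit code would see it. */
static void double_halves(double d, uint32_t *hi, uint32_t *lo)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* well-defined type pun */
    *hi = (uint32_t)(bits >> 32);
    *lo = (uint32_t)bits;
}

/* Fields entirely inside the high word: one shift and mask each. */
static unsigned double_sign(uint32_t hi)     { return hi >> 31; }
static unsigned double_exponent(uint32_t hi) { return (hi >> 20) & 0x7FFu; }

/* The 52-bit fraction straddles the word boundary:
   20 bits from hi, 32 bits from lo. */
static uint64_t double_fraction(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)(hi & 0xFFFFFu) << 32) | lo;
}
```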

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to David Brown on Tue Sep 17 11:48:12 2024
    David Brown wrote:
    On 17/09/2024 08:07, Terje Mathisen wrote:
    David Brown wrote:
    On 16/09/2024 10:37, Terje Mathisen wrote:
    This becomes much simpler in Rust where usize is the only legal
    index type:

    Yeah, you have to actually write it as

        y = p[x];
        x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is an
    improvement for the programmer. (In general, I dislike trying to do too much in a single expression or statement, but some C constructs
    are common enough that I am happy with them.  It would be hard to
    formulate concrete rules here.)

    And the resulting object code is less efficient than you get with
    signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

    Is that true? I'll have to check godbolt myself if that is really the
    case!


    It is not true - or at least, it shouldn't be true.  I had thought the
    Rust code was using the equivalent of a C "unsigned int" here, which
    would require extra code for wrapping semantics.  But that was just my misunderstanding of Rust and its types - with a 64-bit unsigned type, it should give the same results as C.  However, there's no harm in checking
    it and letting us know.

    No need to check this particular point, Rust's usize was obviously
    designed to be an unsigned type large enough to index into the entire addressable memory range, so on a 64-bit platform it has to be 64 bits.

    (I've previously shown how "y = p[x++];" in C is less efficient on
    x86-64 if x is "unsigned int", compared to "int" or 64-bit types for x.)

    That's actually surprising to me, I would have guessed any 32-bit index
    would be less efficient than a full-width type, but if the idiom is
    very, very common in C code, then it makes sense to make it fast.

    Doing so would typically require either sign- or zero-extending all
    32-bit variables when loaded into a 64-bit register, right?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Tue Sep 17 12:52:48 2024
    On Tue, 17 Sep 2024 11:42:47 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 17 Sep 2024 08:20:15 +0200
    Terje Mathisen <[email protected]> wrote:

    EricP wrote:
    These double-width bit-field straddle operations show up at
    32-bits. Various FP64 formats (DEC's middle-endian FP being the
    worst example), Intel page table entries and segment/gate
    descriptors, come to mind.

    Lots of them in 32-bit code!

    Lots of what in 32-bit code?


    Pretty much any 64-bit container with non-regular contents, with the
    suggested double / fp64 as the classic example?

    Terje


    You mean
    struct { int a; double b; } where on a 32-bit target we expect that b is
    not padded?
    And then the mantissa of b crosses a 64-bit boundary?
    But the mantissa of b is not accessed as a bit-field in a typical program.

  • From Michael S@21:1/5 to David Brown on Tue Sep 17 13:15:36 2024
    On Tue, 17 Sep 2024 11:29:15 +0200
    David Brown <[email protected]> wrote:

    On 17/09/2024 03:36, Waldek Hebisch wrote:
    David Brown <[email protected]> wrote:
    On 16/09/2024 19:51, MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the
    rather distant past could do these rather efficiently {360
    SRDL,...}

    Anyone who designs a data structure with a bit-field that spans
    two 64-bit parts of a struct is probably ignorant of C
    bit-fields and software in general.  It is highly unlikely to be
    necessary or even beneficial from the hardware viewpoint, but
    really inconvenient on the software side (whether you use
    bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,
    ..

    The folks designing those register setups had a choice, and made a
    bad choice from the viewpoint of software (whether it be C,
    assembly, or any other language).

    It's conceivable that it was the right choice on balance,
    considering many factors. And it's certainly more believable that
    it was an appropriate choice when sizes were smaller. It is less
    believable that there is an overwhelming need to cross a 64-bit
    boundary.

    Several pieces of software discovered that "bad" smaller data
    structures lead to faster execution. Simply put, smaller data
    structures lead to better utilization of caches and busses, and the
    effect of this was larger than the cost of the extra instructions. So the
    need to cross a 64-bit boundary may be rare, but there will be cases
    when it is the best choice.


    It is possible, but I think it is rare.

    Perhaps my perception is biased from working with microcontrollers,
    where you often don't have caches and instruction speeds are not
    nearly as much faster than ram access speeds as you see in modern x86 systems.

    On the other hand, with MCUs it's quite common to be limited by the size of
    data storage (SRAM), while the size of program storage (flash) is bigger
    than one will ever want. Plus, quite often, speed is of less concern.
    In such [common] situations densely packed [arrays of] structures could
    be desirable.
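    A sketch of that trade-off, using a hypothetical log record: packing a 20-bit timestamp and a 12-bit reading into one 32-bit word typically halves the SRAM per element compared with separate naturally aligned fields, at the cost of shift and mask instructions on every access. Bit-field layout (and `uint32_t` as a bit-field type) is implementation-defined, so the sizes in the comments are what common ABIs give, not guarantees.

```c
#include <stdint.h>

/* Naturally aligned fields: typically 8 bytes per record. */
struct sample_wide {
    uint32_t timestamp;
    uint16_t value;
};

/* Packed with bit-fields: typically 4 bytes per record. */
struct sample_packed {
    uint32_t timestamp : 20;
    uint32_t value     : 12;
};

#define LOG_LEN 256
static struct sample_packed log_buf[LOG_LEN];

static void log_store(unsigned idx, uint32_t t, uint16_t v)
{
    /* Each assignment compiles to a read-modify-write of the word. */
    log_buf[idx % LOG_LEN].timestamp = t & 0xFFFFFu;
    log_buf[idx % LOG_LEN].value     = v & 0xFFFu;
}
```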


    The other thing I don't like about split bit-fields is that there is typically no way to do atomic updates, which can mean you need extra
    care to keep things correct.


    In the common case, on common ISAs an atomic RMW update of a bit-field is
    impossible even when the field does not cross a word boundary.

    If you mean a write-only update (i.e. the values of adjacent fields are
    known in advance and not expected to change), what you say may or may
    not be correct, depending on the availability of unaligned stores and on
    what exactly one considers 'atomic'.

  • From Michael S@21:1/5 to Terje Mathisen on Tue Sep 17 13:27:16 2024
    On Tue, 17 Sep 2024 11:48:12 +0200
    Terje Mathisen <[email protected]> wrote:

    David Brown wrote:
    On 17/09/2024 08:07, Terje Mathisen wrote:
    David Brown wrote:
    On 16/09/2024 10:37, Terje Mathisen wrote:
    This becomes much simpler in Rust where usize is the only legal
    index type:

    Yeah, you have to actually write it as

        y = p[x];
        x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is
    an improvement for the programmer.  (In general, I dislike
    trying to do too much in a single expression or statement, but
    some C constructs are common enough that I am happy with them.
    It would be hard to formulate concrete rules here.)

    And the resulting object code is less efficient than you get with
    signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

    Is that true? I'll have to check godbolt myself if that is really
    the case!


    It is not true - or at least, it shouldn't be true.  I had thought
    the Rust code was using the equivalent of a C "unsigned int" here,
    which would require extra code for wrapping semantics.  But that
    was just my misunderstanding of Rust and its types - with a 64-bit
    unsigned type, it should give the same results as C.  However,
    there's no harm in checking it and letting us know.

    No need to check this particular point, Rust's usize was obviously
    designed to be an unsigned type large enough to index into the entire addressable memory range, so on a 64-bit platform it has to be 64
    bits.

    (I've previously shown how "y = p[x++];" in C is less efficient on
    x86-64 if x is "unsigned int", compared to "int" or 64-bit types
    for x.)
    That's actually surprising to me, I would have guessed any 32-bit
    index would be less efficient than a full-width type, but if the
    idiom is very, very common in C code, then it makes sense to make it
    fast.

    Doing so would typically require either sign- or zero-extending all
    32-bit variables when loaded into a 64-bit register, right?

    Terje


    Taken in isolation, on something like x86-64 or aarch64, where the result
    of a 32-bit addition is zero-extended by default, there is no difference
    between 32-bit and 64-bit unsigned x.
    However, when the statement shown above is part of a sequence, even a
    short one, a 64-bit x allows compiler optimizations that are impossible
    with a 32-bit one.
    E.g.
    y1 = p[x++];
    y2 = p[x++];

    On x86-64 with a 64-bit x the second load can be implemented as
    mov dstreg, [rcx+rdx*4+4]
    On aarch64 with a 64-bit x both loads can be folded into a single 'load
    pair' instruction.
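    The three index types can be compared side by side on Compiler Explorer (godbolt.org) at `-O2`. A sketch; the function names are made up, and whether the two loads actually get combined depends on the compiler version and target:

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit index: p[x] and p[x+1] are always adjacent, so the compiler
   may use base+offset addressing or a load-pair instruction. */
int64_t sum_pair_size_t(const int32_t *p, size_t x)
{
    int32_t y1 = p[x++];
    int32_t y2 = p[x++];
    return (int64_t)y1 + y2;
}

/* 32-bit unsigned index: x+1 is computed modulo 2^32, so p[x+1] need
   not follow p[x] in memory, and each index must be re-extended. */
int64_t sum_pair_unsigned(const int32_t *p, unsigned x)
{
    int32_t y1 = p[x++];
    int32_t y2 = p[x++];
    return (int64_t)y1 + y2;
}

/* 32-bit signed index: overflow is undefined, so the compiler may
   assume x+1 does not wrap and treat the loads as adjacent. */
int64_t sum_pair_int(const int32_t *p, int x)
{
    int32_t y1 = p[x++];
    int32_t y2 = p[x++];
    return (int64_t)y1 + y2;
}
```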

  • From Bill Findlay@21:1/5 to EricP on Tue Sep 17 15:13:43 2024
    On 17 Sep 2024, EricP wrote
    (in article <4J3GO.141734$[email protected]>):

    Niklas Holsti wrote:
    On 2024-09-16 18:58, Michael S wrote:
    On Mon, 16 Sep 2024 11:39:55 -0400
    EricP <[email protected]> wrote:

    David Brown wrote:
    On 16/09/2024 15:04, Michael S wrote:

    With one exception that usize overflow panics under debug
    build.

    I'm quite happy with unsigned types that are not allowed to
    overflow, as long as there is some other way to get efficient wrapping on the rare occasions when you need it.

    But I am completely against the idea that you have different
    defined semantics for different builds. Run-time errors in a debug/test build and undefined behaviour in release mode is fine - defining the behaviour of overflow in release mode (other than possibly to the same run-time checking) is wrong.

    In the compilers that do checking which I have worked with
    there was always a distinction between checked builds and debug
    builds. In my C code I have Assert() and AssertDbg(). Assert stay in the production code, AssertDbg are only in the debug builds.
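    One way to get that split in C, as a sketch: the macro names follow the post, but their definitions, the `assert_fail` helper and the `DEBUG_CHECKS` switch are assumptions, not EricP's actual code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical failure handler shared by both macros. */
static _Noreturn void assert_fail(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
    abort();
}

/* Always compiled in, including production builds. */
#define Assert(e) \
    ((e) ? (void)0 : assert_fail(#e, __FILE__, __LINE__))

/* Compiled in only when DEBUG_CHECKS is defined (e.g. -DDEBUG_CHECKS). */
#ifdef DEBUG_CHECKS
#define AssertDbg(e) Assert(e)
#else
#define AssertDbg(e) ((void)0)
#endif

static int checked_div(int a, int b)
{
    Assert(b != 0);               /* cheap, kept in production code */
    AssertDbg(a >= 0 && b > 0);   /* stricter, debug builds only */
    return a / b;
}
```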

    Debug builds disable optimizations and spill all variable updates
    to memory to make life easier for the debugger.
    One usually compiles debug builds with no-optimize and all checks enabled.

    But debug, optimize, and checking are separate controls.

    In the compilers for checking languages I've worked with,
    checking and optimization are compatible.
    For example, if the compiler uses an AddFaultOverflow x = x + 1 instruction to increment 'x' then it knows no overflow is possible
    and then can make all the other optimizations that C assumes are true.
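    The nearest C analogue of a faulting add is probably the GCC/Clang builtin `__builtin_add_overflow` (a compiler extension, not standard C). A sketch with hypothetical function names: once `checked_inc` has returned, the increment is known not to have wrapped, just as after a trapping add instruction.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Increment that faults instead of wrapping, like a trapping add. */
static int checked_inc(int x)
{
    int r;
    if (__builtin_add_overflow(x, 1, &r))
        abort();   /* stands in for the hardware overflow trap */
    return r;
}

/* Non-trapping variant: reports overflow to the caller instead. */
static bool try_add(int a, int b, int *out)
{
    return !__builtin_add_overflow(a, b, out);
}
```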

    And on those compilers checks can be controlled with quite fine resolution. Checks can be enabled/disabled based on kind of check,
    eg scalar overflow, array bounds,
    for a compilation unit, a routine, a section of code,
    a particular data type, a particular object.

    This was all standard on DEC Ada85 so if Rust compilers do not
    do this now they may in the near future.

    If the ability to control compiler checks was standard on DEC Ada, then it made DEC Ada non-standard.

    No, it means that DEC Ada could be used as a standard-conforming Ada compiler or as a non-conforming compiler, to a user-chosen extent.

    The recommended approach today (for applications where it matters) is to use static analysis of the Ada code (e.g. SPARK or other tools) to prove that run-time errors cannot happen, which then makes it possible to omit the corresponding run-time checks while staying compliant.

    DEC Ada did that too. It seems to me this optimization is a relatively straightforward "propagation of constants" type of problem.
    Not just that, many language forms actually preclude the need for checks,
    e.g.:

    for i in this_array'Range loop
    ... this_array(i) ...
    end loop;

    cannot fail on access to this_array(i), and:

    this_array := that_array;

    cannot fail in any of the ways that are endlessly debated
    here in relation to *mem* C routines.
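    A rough C analogue, as a sketch: wrapping an array in a struct puts the length into the type, so plain assignment copies exactly the right amount, whereas `memcpy` takes the byte count as a separate argument that can be wrong. A loop bound derived from the array itself plays the role of `'Range`, although unlike Ada, C does not check `s->v[i]` at run time. All names are illustrative.

```c
#include <stddef.h>
#include <string.h>

#define N_ITEMS(a) (sizeof(a) / sizeof((a)[0]))

/* The length is part of the type, much like a constrained Ada array. */
struct samples { int v[16]; };

static void copy_by_assignment(struct samples *dst, const struct samples *src)
{
    *dst = *src;   /* size comes from the type: nothing to get wrong */
}

static void copy_by_memcpy(int *dst, const int *src, size_t count)
{
    /* The caller must get count (and the element size) right. */
    memcpy(dst, src, count * sizeof *src);
}

static long sum_samples(const struct samples *s)
{
    long total = 0;
    /* Bound taken from the array itself, like this_array'Range. */
    for (size_t i = 0; i < N_ITEMS(s->v); i++)
        total += s->v[i];
    return total;
}
```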

    --
    Bill Findlay

  • From David Brown@21:1/5 to Terje Mathisen on Tue Sep 17 15:37:37 2024
    On 17/09/2024 08:43, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 9/16/2024 4:12 AM, David Brown wrote:

    snip

    With all respect to the regulars here, most people in technical
    Usenet groups are either old, unusually nerdy, or both.

    I resemble that remark!  :-)

    Ditto, probably...


    Of course my comment was not meant very seriously, though there is a lot
    of truth in it. Most regulars in technical Usenet groups have been in
    those groups for a long time - very few twenty year olds can hold a conversation about Fortran and S390 mainframes! And most of us are
    fairly nerdy - this stuff is not just a job, it's also an interest. But
    that does not mean any of us are /too/ old, or have only nerdy interests.

    I'm 67 (but not yet retired), I taught myself the Trachtenberg
    algorithms for mental arithmetic when I was around 12 (was reminded of
    this last night when I watched Gifted on netflix), I mail ordered what
    was probably the first Rubik's cube to get to Norway. (And developed
    three different algorithms to solve it, but I only remember the last one
    now which I had optimized for simplicity, not speed.)


    I would have been about 9 or 10 when I got my first Rubik's cube. A mathematician colleague of my father's and I put together a solution
    algorithm based on a few bits he had remembered from a lecture by David Singmaster. When the rest of the class played football at break, I
    stood in the goals practising the Rubik's cube - I believe that counts
    as nerdy!

    Those, along with high school chess and orienteering mapping should
    count as nerdy pursuits, right?


    Orienteering is too physical to be nerdy, isn't it? I teach judo to
    kids - so none of us are perfect :-)


    Winning the County Yo-Yo championship would be less so?

    It is still a /bit/ nerdy...


    Regards to all the regulars here, I do consider many of you friends that
    I just haven't met yet.


    That is a fine attitude. I like to think that even with the people I
    regularly disagree with in technical groups, if we were to sit down with
    a coffee or a beer, rather than a screen and keyboard, we'd have a very pleasant evening.

  • From David Brown@21:1/5 to Michael S on Tue Sep 17 15:53:16 2024
    On 17/09/2024 12:27, Michael S wrote:
    On Tue, 17 Sep 2024 11:48:12 +0200
    Terje Mathisen <[email protected]> wrote:

    David Brown wrote:
    On 17/09/2024 08:07, Terje Mathisen wrote:
    David Brown wrote:
    On 16/09/2024 10:37, Terje Mathisen wrote:
    This becomes much simpler in Rust where usize is the only legal
    index type:

    Yeah, you have to actually write it as

        y = p[x];
        x += 1;

    instead of a single line, but this makes zero difference to the
    compiler, right?


    I don't care much about the compiler - but I don't think this is
    an improvement for the programmer.  (In general, I dislike
    trying to do too much in a single expression or statement, but
    some C constructs are common enough that I am happy with them.
    It would be hard to formulate concrete rules here.)

    And the resulting object code is less efficient than you get with
    signed int and "y = p[x++];" (or "y = p[x]; x++;") in C.

    Is that true? I'll have to check godbolt myself if that is really
    the case!


    It is not true - or at least, it shouldn't be true.  I had thought
    the Rust code was using the equivalent of a C "unsigned int" here,
    which would require extra code for wrapping semantics.  But that
    was just my misunderstanding of Rust and its types - with a 64-bit
    unsigned type, it should give the same results as C.  However,
    there's no harm in checking it and letting us know.

    No need to check this particular point, Rust's usize was obviously
    designed to be an unsigned type large enough to index into the entire
    addressable memory range, so on a 64-bit platform it has to be 64
    bits.

    (I've previously shown how "y = p[x++];" in C is less efficient on
    x86-64 if x is "unsigned int", compared to "int" or 64-bit types
    for x.)
    That's actually surprising to me, I would have guessed any 32-bit
    index would be less efficient than a full-width type, but if the
    idiom is very, very common in C code, then it makes sense to make it
    fast.

    Doing so would typically require either sign- or zero-extending all
    32-bit variables when loaded into a 64-bit register, right?

    Terje


    Taken in isolation, on something like x86-64 or aarch64, where the result
    of a 32-bit addition is zero-extended by default, there is no difference
    between 32-bit and 64-bit unsigned x.
    However, when the statement shown above is part of a sequence, even a
    short one, a 64-bit x allows compiler optimizations that are impossible
    with a 32-bit one.
    E.g.
    y1 = p[x++];
    y2 = p[x++];

    On x86-64 with a 64-bit x the second load can be implemented as
    mov dstreg, [rcx+rdx*4+4]
    On aarch64 with a 64-bit x both loads can be folded into a single 'load
    pair' instruction.


    That's it, yes. It's not the access that is slower for 32-bit x, it's
    using it later after the increment because the increment has to be wrapped.

    These things are always complicated by surrounding code, but consider
    Michael's example here (which is the same as I discussed in another
    post), assuming a 64-bit system with some common addressing modes :

    y1 = p[x++];
    y2 = p[x++];
    ...

    When x is a 64-bit type, this can be implemented (where "r?" are general-purpose 64-bit registers) as :

    r1 = p + x;
    y1 = *r1++;
    y2 = *r1++;
    ...

    For a 32-bit x with defined wrapping, it might be implemented as :


    r1 = x; // Zero or sign extend as appropriate
    y1 = *(p + r1);
    r1 += 1;
    r1 &= 0xffffffff;
    y2 = *(p + r1);
    r1 += 1;
    r1 &= 0xffffffff;
    ...

    There might be a single instruction for adding 1 with 32-bit wrapping,
    but it is still bigger.

    For 32-bit x with undefined overflow, it will be :

    r1 = x; // Zero or sign extend as appropriate
    r2 = p + r1;
    y1 = *r2++;
    y2 = *r2++;
    ...

    So with a 32-bit index, you are probably going to have to have a sign or
    zero extension somewhere. But key to the efficiency of signed int
    compared to unsigned int is that the compiler can assume there is no
    overflow, and does not need to implement wrapping.
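    The edge case behind the wrapping version can be made explicit. In this sketch (hypothetical helper names, 4-byte elements assumed) `load2_wrapping` spells out what the compiler must do for a wrapping 32-bit index: if x is 0xFFFFFFFF, the next index is 0 rather than 0x100000000, so p[x] and p[x+1] are not adjacent and cannot be fetched with one fixed offset or a load-pair.

```c
#include <stdint.h>

/* Next index under 32-bit unsigned (wrapping) arithmetic. */
static uint32_t next_index_u32(uint32_t x) { return x + 1u; }

/* Next index under 64-bit arithmetic. */
static uint64_t next_index_u64(uint64_t x) { return x + 1u; }

/* Byte offset of element idx, assuming 4-byte elements. */
static uint64_t byte_offset(uint64_t idx) { return idx * 4u; }

/* What "y1 = p[x++]; y2 = p[x++];" must compute when x is a wrapping
   32-bit unsigned index: each new index is truncated to 32 bits before
   it is zero-extended for addressing, so the loads cannot be merged. */
static int64_t load2_wrapping(const int32_t *p, uint32_t x)
{
    uint64_t r1 = x;                    /* zero-extend */
    int32_t y1 = p[r1];
    r1 = (r1 + 1) & 0xFFFFFFFFu;        /* 32-bit wrap */
    int32_t y2 = p[r1];
    return (int64_t)y1 + y2;
}
```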

  • From David Brown@21:1/5 to Michael S on Tue Sep 17 16:00:48 2024
    On 17/09/2024 12:15, Michael S wrote:
    On Tue, 17 Sep 2024 11:29:15 +0200
    David Brown <[email protected]> wrote:

    On 17/09/2024 03:36, Waldek Hebisch wrote:
    David Brown <[email protected]> wrote:
    On 16/09/2024 19:51, MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the
    rather distant past could do these rather efficiently {360
    SRDL,...}

    Anyone who designs a data structure with a bit-field that spans
    two 64-bit parts of a struct is probably ignorant of C
    bit-fields and software in general.  It is highly unlikely to be
    necessary or even beneficial from the hardware viewpoint, but
    really inconvenient on the software side (whether you use
    bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,
    ..

    The folks designing those register setups had a choice, and made a
    bad choice from the viewpoint of software (whether it be C,
    assembly, or any other language).

    It's conceivable that it was the right choice on balance,
    considering many factors. And it's certainly more believable that
    it was an appropriate choice when sizes were smaller. It is less
    believable that there is an overwhelming need to cross a 64-bit
    boundary.

    Several pieces of software discovered that "bad" smaller data
    structures lead to faster execution. Simply put, smaller data
    structures lead to better utilization of caches and busses, and the
    effect of this was larger than the cost of the extra instructions. So the
    need to cross a 64-bit boundary may be rare, but there will be cases
    when it is the best choice.


    It is possible, but I think it is rare.

    Perhaps my perception is biased from working with microcontrollers,
    where you often don't have caches and instruction speeds are not
    nearly as much faster than ram access speeds as you see in modern x86
    systems.

    On the other hand, with MCUs it's quite common to be limited by the size of
    data storage (SRAM), while the size of program storage (flash) is bigger
    than one will ever want. Plus, quite often, speed is of less concern.
    In such [common] situations densely packed [arrays of] structures could
    be desirable.

    That can also be true. (The smallest device I ever used had 1 KB of
    flash, and that was still plenty for the task I had!)

    But in many embedded systems, speed is of some concern at least - if you
    can do the task in fewer clock cycles, maybe you can use a slower device
    (which might be cheaper, or have easier EMC requirements), or you can
    spend more time in sleep modes for reduced average power. Run-time
    efficiency isn't always about shorter wall-clock times.

    The main thing about embedded development, however, is that the answer
    is always "it depends". There are few hard and fast rules!



    The other thing I don't like about split bit-fields is that there is
    typically no way to do atomic updates, which can mean you need extra
    care to keep things correct.


    In the common case, on common ISAs an atomic RMW update of a bit-field is impossible even when the field does not cross a word boundary.

    If you mean a write-only update (i.e. the values of adjacent fields are
    known in advance and not expected to change), what you say may or may
    not be correct, depending on the availability of unaligned stores and on
    what exactly one considers 'atomic'.


    Yes, these are all possibilities. But it is not uncommon that the key
    thing is to avoid partial updates where you have changed one half of a
    hardware register but not yet changed the other half.
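    A common software pattern for this, sketched with C11 atomics: compute the complete new word and publish it with one full-width compare-and-swap, so no observer sees one half changed without the other. The field position (bits 24..43, crossing bit 32) and all names are hypothetical; on 32-bit targets `_Atomic uint64_t` may not be lock-free, and a real memory-mapped register would normally be written with a single volatile 64-bit store rather than CAS.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 20-bit field occupying bits 24..43, crossing bit 32. */
#define FIELD_SHIFT 24u
#define FIELD_MASK  (((uint64_t)1 << 20) - 1)

static _Atomic uint64_t shared_word;

/* Replace the field in one atomic step: other bits are preserved and
   no observer ever sees one half updated without the other. */
static void set_field_atomic(_Atomic uint64_t *w, uint32_t value)
{
    uint64_t old = atomic_load(w);
    uint64_t desired;
    do {
        desired = (old & ~(FIELD_MASK << FIELD_SHIFT))
                | (((uint64_t)value & FIELD_MASK) << FIELD_SHIFT);
        /* On failure, old is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(w, &old, desired));
}

static uint32_t get_field(uint64_t w)
{
    return (uint32_t)((w >> FIELD_SHIFT) & FIELD_MASK);
}
```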

  • From EricP@21:1/5 to Michael S on Tue Sep 17 10:57:49 2024
    Michael S wrote:
    On Tue, 17 Sep 2024 08:20:15 +0200
    Terje Mathisen <[email protected]> wrote:

    EricP wrote:
    These double-width bit-field straddle operations show up at 32-bits.
    Various FP64 formats (DEC's middle-endian FP being the worst
    example), Intel page table entries and segment/gate descriptors,
    come to mind.
    Lots of them in 32-bit code!

    Lots of what in 32-bit code?

    On 32-bit cpus, bit-fields that straddle 32-bit boundaries inside
    larger structures like a 64-bit FP or PTE.
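    The segment descriptors mentioned above are a concrete case. In an 8-byte x86 segment descriptor the 32-bit base sits in bits 16..39 and 56..63 and the 20-bit limit in bits 0..15 and 48..51, so on a 32-bit CPU the base straddles the two 32-bit words. A sketch that reassembles both fields (function names are hypothetical):

```c
#include <stdint.h>

/* lo/hi are the low and high 32-bit words of the descriptor. */
static uint32_t desc_base(uint32_t lo, uint32_t hi)
{
    return (lo >> 16)                    /* base 15..0  : lo bits 31..16 */
         | ((hi & 0x000000FFu) << 16)    /* base 23..16 : hi bits 7..0   */
         | (hi & 0xFF000000u);           /* base 31..24 : hi bits 31..24 */
}

static uint32_t desc_limit(uint32_t lo, uint32_t hi)
{
    return (lo & 0x0000FFFFu)            /* limit 15..0  : lo bits 15..0  */
         | (hi & 0x000F0000u);           /* limit 19..16 : hi bits 19..16 */
}

/* The same base from the whole descriptor, as 64-bit code would do it. */
static uint32_t desc_base64(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFFu) | ((d >> 32) & 0xFF000000u));
}
```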

  • From MitchAlsup1@21:1/5 to Tim Rentsch on Tue Sep 17 16:23:40 2024
    On Tue, 17 Sep 2024 2:34:44 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

    I didn't see any content from you in this last posting
    of yours.

    I had started to make a comment after hitting quote, and
    while re-reading what you wrote I had nothing to add and
    nothing to modify or complain about. While thinking it all
    over I ended up hitting the Post Article button without any
    text.

    There was no way to retrieve the post, so I let it lie.

  • From Bill Findlay@21:1/5 to Stefan Monnier on Tue Sep 17 18:32:35 2024
    On 17 Sep 2024, Stefan Monnier wrote
    (in article<[email protected]>):

    With all respect to the regulars here, most people in technical Usenet groups are either old, unusually nerdy, or both.

    I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
    true for more than 20 years).

    Stefan

    Hi Stefan!
    At least equally nerdy, I should think, but 50 years older.
    (Older, not old!)

    --
    Bill Findlay

  • From Stefan Monnier@21:1/5 to All on Tue Sep 17 12:27:52 2024
    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
    true for more than 20 years).


    Stefan

  • From MitchAlsup1@21:1/5 to Michael S on Tue Sep 17 16:34:10 2024
    On Tue, 17 Sep 2024 8:12:16 +0000, Michael S wrote:

    On Tue, 17 Sep 2024 01:35:17 +0000
    [email protected] (MitchAlsup1) wrote:

    On Tue, 17 Sep 2024 0:00:34 +0000, EricP wrote:

    Bill Findlay wrote:
    I found the same 5% performance cost in my tests with DEC Ada85.
    Most code was pretty optimal too.

    The one thing I found DEC's compiler made a complete pig's breakfast
    of the generated code was scanning a character string backwards:

    Bacon, sausage, and ham.

    Sounds yummy. Code not so much.

    It seems that you and EricP give different (not to say opposite)
    meanings to the phrase "complete pig's breakfast".

    I had never heard or seen the phrase before. So I just made that up
    on the spot.

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Sep 17 16:32:08 2024
    On Tue, 17 Sep 2024 6:20:15 +0000, Terje Mathisen wrote:

    EricP wrote:

    I added a bunch of instructions for dealing with double-width operations.
    The main ISA design decision is whether to have register pair specifiers,
    R0, R2, R4,... or two separate {r_high,r_low} registers.
    In either case the main uArch issue is that now instructions have an extra
    source register and two dest registers, which has a number of consequences.
    But once you bite the bullet on that it simplifies a lot of things,
    like how to deal with carry or overflow without flags,
    full width multiplies, divide producing both quotient and remainder.

    Very nice!

    This means that you can do integer IMAC(), right?

    (hi, lo) = imac(a, b, c); // == a*b+c

    CARRY Rc,{{OI}}
    MUL Rd,Ra,Rb
    gives
    {Rc,Rd} = product128(Ra,Rb)+Rc

    where all registers are 64-bits.

    The only thing even nicer from the perspective of writing arbitrary
    precision library code would be IMAA, i.e. a*b+c+d since that is the
    largest combination which is guaranteed to never overflow the double
    register target field.

    CARRY Rc,{{OI}{OI}}
    MUL Re,Ra,Rb
    ADD Re,Re,Rd
    gives
    {Rc,Re} = product128(Ra,Rb) + Rc + Rd
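    The no-overflow claim checks out: with 64-bit operands, (2^64 - 1)^2 + 2*(2^64 - 1) = 2^128 - 1, exactly the largest 128-bit value. A C sketch of both operations, assuming the `unsigned __int128` extension of GCC and Clang on 64-bit targets; `imac`/`imaa` follow the naming above, and `mul_add_limbs` is a hypothetical bignum helper showing the carry chain that makes a*b+c+d the useful primitive.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* (hi, lo) = a*b + c */
static void imac(uint64_t a, uint64_t b, uint64_t c,
                 uint64_t *hi, uint64_t *lo)
{
    u128 r = (u128)a * b + c;
    *hi = (uint64_t)(r >> 64);
    *lo = (uint64_t)r;
}

/* (hi, lo) = a*b + c + d: always fits in 128 bits. */
static void imaa(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
                 uint64_t *hi, uint64_t *lo)
{
    u128 r = (u128)a * b + c + d;
    *hi = (uint64_t)(r >> 64);
    *lo = (uint64_t)r;
}

/* out[] = in[] * m + addend[], little-endian limbs; returns the carry.
   Each step adds both an addend limb and the incoming carry. */
static uint64_t mul_add_limbs(uint64_t *out, const uint64_t *in,
                              const uint64_t *addend, size_t n, uint64_t m)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++)
        imaa(in[i], m, addend[i], carry, &carry, &out[i]);
    return carry;
}
```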

    Terje

  • From MitchAlsup1@21:1/5 to Bill Findlay on Tue Sep 17 18:18:51 2024
    On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

    On 17 Sep 2024, Stefan Monnier wrote
    (in article<[email protected]>):

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
    true for more than 20 years).

    Stefan

    Hi Stefan!
    At least equally nerdy, I should think, but 50 years older.
    (Older, not old!)

    At 71 real years old I still operate as if I were <let's say> 21.

  • From Thomas Koenig@21:1/5 to [email protected] on Tue Sep 17 19:52:38 2024
    MitchAlsup1 <[email protected]> schrieb:
    On Tue, 17 Sep 2024 2:34:44 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

    I didn't see any content from you in this last posting
    of yours.

    I had started to make a comment after hitting quote, and
    while re-reading what you wrote I had nothing to add and
    nothing to modify or complain about. While thinking it all
    over I ended up hitting the Post Article button without any
    text.

    There was no way to retrieve the post, so I let it lie.

    Same thing happens to me on occasion.

    With slrn, it is possible to cancel the post, and Eternal September
    will honor the cancel.

  • From Tim Rentsch@21:1/5 to [email protected] on Tue Sep 17 12:48:31 2024
    [email protected] (MitchAlsup1) writes:

    On Tue, 17 Sep 2024 2:34:44 +0000, Tim Rentsch wrote:

    [email protected] (MitchAlsup1) writes:

    On Sun, 15 Sep 2024 19:51:04 +0000, Tim Rentsch wrote:

    I didn't see any content from you in this last posting
    of yours.

    I had started to make a comment after hitting quote, and
    while re-reading what you wrote I had nothing to add and
    nothing to modify or complain about. While thinking it all
    over I ended up hitting the Post Article button without any
    text.

    There was no way to retrieve the post, so I let it lie.

    Okay, thank you.

  • From Thomas Koenig@21:1/5 to BGB on Tue Sep 17 19:53:31 2024
    BGB <[email protected]> schrieb:

    Another option would be for adjacent _Bool values to merge similar to bitfields...

    How would you manage a pointer to a _Bool?

  • From Brett@21:1/5 to David Brown on Tue Sep 17 20:00:13 2024
    David Brown <[email protected]> wrote:
    On 17/09/2024 03:36, Waldek Hebisch wrote:
    David Brown <[email protected]> wrote:
    On 16/09/2024 19:51, MitchAlsup1 wrote:
    On Mon, 16 Sep 2024 8:34:19 +0000, David Brown wrote:

    On 15/09/2024 21:13, MitchAlsup1 wrote:

    As to HW sadism:: this is not <realistically> any harder than
    misaligned DW accesses from the cache. Many ISAs from the rather distant
    past could do these rather efficiently {360 SRDL,...}


    Anyone who designs a data structure with a bit-field that spans two
    64-bit parts of a struct is probably ignorant of C bit-fields and
    software in general.  It is highly unlikely to be necessary or even
    beneficial from the hardware viewpoint, but really inconvenient on the
    software side (whether you use bit-fields or not).

    Sometimes you don't have a choice::
    x86-64 segment registers.
    PCIe MMI/O registers,
    ..

    The folks designing those register setups had a choice, and made a bad
    choice from the viewpoint of software (whether it be C, assembly, or any
    other language).

    It's conceivable that it was the right choice on balance, considering
    many factors. And it's certainly more believable that it was an
    appropriate choice when sizes were smaller. It is less believable that
    there is an overwhelming need to cross a 64-bit boundary.

    Several pieces of software discovered that "bad" smaller data
    structures lead to faster execution. Simply, smaller data structures
    lead to better utilization of caches and busses, and the effect due to
    this was larger than the cost of extra instructions. So the need to cross
    a 64-bit boundary may be rare, but there will be cases when it is the best
    choice.


    It is possible, but I think it is rare.

    Perhaps my perception is biased from working with microcontrollers,
    where you often don't have caches and instruction speeds are not nearly
    as much faster than ram access speeds as you see in modern x86 systems.

    I personally got lots of 20% speedups by restructuring data on PlayStation
    2 code.

    The C rules for data structure layout are stupid: a programmer would add an
    int in front of a vector and never wonder why his structure grew by 16
    bytes. Never mind that he used that 4-byte int to hold a value that had a
    max of 15.

    Had to annotate the data structures with 16 byte comment boundaries to stop endless stupidity.
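    A small C illustration of the 16-byte growth follows. It assumes a
    16-byte-aligned vector type; vec4 and the struct names are hypothetical
    stand-ins for a platform SIMD type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte vector; _Alignas(16) stands in for a platform
   128-bit SIMD type. */
typedef struct { _Alignas(16) float v[4]; } vec4;

struct vec_only    { vec4 pos; };                   /* 16 bytes */
struct int_then_vec { int32_t count; vec4 pos; };   /* 4 used + 12 padding + 16 */

/* Grouping small fields lets them share the slot the padding would waste. */
struct four_ints_then_vec { int32_t a, b, c, d; vec4 pos; };
```

    Moving count after pos does not help on its own, because the struct size
    is still rounded up to a multiple of 16. The padding is only recovered
    when several small fields share the same 16-byte slot.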

    The other thing I don't like about split bit-fields is that there is typically no way to do atomic updates, which can mean you need extra
    care to keep things correct.

  • From MitchAlsup1@21:1/5 to BGB on Tue Sep 17 20:11:55 2024
    On Tue, 17 Sep 2024 19:51:19 +0000, BGB wrote:

    On 9/17/2024 4:39 AM, David Brown wrote:
    On 16/09/2024 21:46, BGB wrote:
    On 9/16/2024 4:27 AM, David Brown wrote:

    Albeit, types like _Bool in my implementation are padded to a full
    byte (it is treated as an "unsigned char" that is assumed to always
    hold either 0 or 1).

    That's the usual way to handle them.

    Smallest C container is 1 byte
    __BOOL can use as small a container as C can address



    Another option would be for adjacent _Bool values to merge similar to bitfields...
    Though, seems that simply turning it into a byte is the typical option.

    One can do ATOMIC stuff on a __BOOL
    one cannot do ATOMIC stuff on struct { unsigned __bool: 1};
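    To make the atomicity point concrete: a standalone _Bool is an addressable
    object (normally one byte), so C11 atomics apply to it. A 1-bit bit-field
    has no address and no atomic form. The following is a minimal sketch using
    <stdatomic.h>. The flag names are hypothetical, and packing flags by hand
    into an atomic word is one common workaround, not the only one.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit-fields share a storage unit: &f.ready would not compile, and there
   is no atomic bit-field type. */
struct flags_bits { unsigned ready : 1; unsigned done : 1; };

/* A standalone _Bool is addressable, so it can be made atomic. */
static atomic_bool ready_flag;

/* Sets the flag and returns its previous value, atomically. */
static bool set_ready(void)
{
    return atomic_exchange(&ready_flag, true);
}

/* Workaround for packed flags: explicit masks in an atomic word
   (hypothetical flag values). */
#define FLAG_READY 1u
#define FLAG_DONE  2u
static atomic_uint flags_word;

/* Atomically ORs in `mask` and returns the previous word. */
static unsigned set_flag(unsigned mask)
{
    return atomic_fetch_or(&flags_word, mask);
}
```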



    This comes up as an issue in some Windows file formats, where one
    can't just naively use a struct with 32-bit fields because some 32-bit
    members only have 16-bit alignment.
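    One portable way to handle such members is to parse them byte by byte
    instead of overlaying a struct. The sketch below uses the BMP file header
    as an example: it is 14 bytes on disk, with 32-bit fields at offsets 2
    and 10. The helper names are hypothetical.

```c
#include <stdint.h>

/* Little-endian loads at any byte offset, independent of host alignment
   and host endianness. */
static uint16_t rd_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | p[1] << 8);
}

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* In-memory form; the compiler is free to pad this. */
struct bmp_file_header {
    uint16_t type;
    uint32_t size;
    uint16_t reserved1, reserved2;
    uint32_t off_bits;
};

/* Parse the 14-byte on-disk layout field by field. */
static struct bmp_file_header parse_bmp_file_header(const uint8_t raw[14])
{
    struct bmp_file_header h;
    h.type      = rd_le16(raw + 0);
    h.size      = rd_le32(raw + 2);   /* only 16-bit aligned in the file */
    h.reserved1 = rd_le16(raw + 6);
    h.reserved2 = rd_le16(raw + 8);
    h.off_bits  = rd_le32(raw + 10);
    return h;
}
```

    On typical ABIs the in-memory struct is 16 bytes, because the compiler
    pads after type to align size. That is exactly why reading the file
    straight into it goes wrong.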

    Ah, the joys of using ancient formats with new systems!


    I was around when this stuff was still newish.

    Some are essentially frozen in time with their misaligned members.

    In HW the packing and unpacking of multi-container single variables
    is easy--it's just wires.

    Still better than:
    "Well, initial field wasn't big enough";
    "Repurpose those bytes from over there, and glue them on".

    Really NOT a problem in HW--understandably low efficiency in SW.


    There would need to be a mechanism in the ISA to select between these
    modes though (probably a "magic branch" scheme different from the one
    used for Inter-ISA branches).

    Modes make testing significantly harder. Each mode adds 1 to the
    exponent of how many test cases it takes to adequately test a part.

    This would likely include an RV64 encoding for "Branch to/from CoEx",
    and an encoding within this ISA to jump between CoEx and "Native" mode.

    Magic branches make sense mostly as any such mode switch is going to
    require a pipeline flush.

    This is assuming an implementation that would want to be able to support
    both this ISA and also RV64GC.

    One possibility could be (in native RV notation):
    RV64 (Branches if supported, NOP if not):
    LBU X0, Xs, Disp12s //Dest=RV64GC
    LWU X0, Xs, Disp12s //Dest=CoEx
    LHU X0, Xs, Disp12s //Dest=Native
    New ISA:
    LBU X0, Xs, Disp10s //Dest=RV64GC
    LWU X0, Xs, Disp10s //Dest=CoEx
    LHU X0, Xs, Disp10s //Dest=Native

    This only gives 36-bits (top) or 30-bits (bottom) of range. What you are
    going to want is 64-bits of range -- especially when switching modes--
    you PROBABLY want to use an entirely different sub-tree of the
    translation table trees.

  • From MitchAlsup1@21:1/5 to BGB on Tue Sep 17 23:04:52 2024
    On Tue, 17 Sep 2024 22:15:12 +0000, BGB wrote:

    On 9/17/2024 3:11 PM, MitchAlsup1 wrote:

    Modes make testing significantly harder. Each mode adds 1 to the
    exponent of how many test cases it takes to adequately test a part.

    Possibly.

    But, modes are kinda unavoidable here:
    CPU only runs RV64GC or similar:
    Doomed to relative slowness;
    CPU only does CoEx:
    Closes off the ability to run binaries that assume RV64GC.
    CPU only does new ISA:
    Well, then it can't run RISC-V code, making all this kinda moot.

    My 66000 does not have modes (at least yet); it even comes out of
    RESET with the MMUs turned on.
    -----------
    This is assuming an implementation that would want to be able to support
    both this ISA and also RV64GC.

    One possibility could be (in native RV notation):
    RV64 (Branches if supported, NOP if not):
       LBU X0, Xs, Disp12s  //Dest=RV64GC
       LWU X0, Xs, Disp12s  //Dest=CoEx
       LHU X0, Xs, Disp12s  //Dest=Native
    New ISA:
       LBU X0, Xs, Disp10s  //Dest=RV64GC
       LWU X0, Xs, Disp10s  //Dest=CoEx
       LHU X0, Xs, Disp10s  //Dest=Native

    This only gives 36-bits (top) or 30-bits (bottom) of range. What you are
    going to want is 64-bits of range -- especially when switching modes--
    you PROBABLY want to use an entirely different sub-tree of the
    translation table trees.

    Idea here is that 'Xs' will give the base address for the target.

    On the RISC-V side, this would mean, say:
    AUIPC X7, disp
    LWU X0, X7, disp
    Similar to a normal JALR.

    Still limited to 32-bit displacement from IP.

    How would you perform the following call::
    current IP = 0x0000000000001234
    target IP = 0x7FFFFFFF00001234

    This is a single (2-word) instruction in my ISA, assuming GOT is
    32-bit displaceable and 64-bit entries.

    I could almost interpret X0 as PC, except that on a "standard" RISC-V
    CPU, the non-supported case would be, likely: "program crashes trying to access a NULL pointer", which is less useful.


    Branches in the new ISA would likely be encoded using jumbo prefixes.

    Well, partly because the new ISA lacks AUIPC, but the new ISA can encode
    it more directly as, essentially:
    LWU X0, PC, Disp33s

    AUIPC is (and remains) a crutch (like LUI from MIPS)
    a) it consumes an instruction (space and time)
    b) it consumes a register unnecessarily
    c) it consumes power that direct delivery of the constant would not

  • From David Brown@21:1/5 to All on Wed Sep 18 10:13:12 2024
    On 17/09/2024 20:18, MitchAlsup1 wrote:
    On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

    On 17 Sep 2024, Stefan Monnier wrote
    (in article<[email protected]>):

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
    true for more than 20 years).

    Stefan

    Hi Stefan!
    At least equally nerdy, I should think, but 50 years older.
    (Older, not old!)

    At 71 real years old I still operate as if I were <let's say> 21.

    You are not 71, you are merely 0x47 :-)

  • From MitchAlsup1@21:1/5 to David Brown on Wed Sep 18 13:37:14 2024
    On Wed, 18 Sep 2024 8:13:12 +0000, David Brown wrote:

    On 17/09/2024 20:18, MitchAlsup1 wrote:
    On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

    On 17 Sep 2024, Stefan Monnier wrote
    (in article<[email protected]>):

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
    true for more than 20 years).

    Stefan

    Hi Stefan!
    At least equally nerdy, I should think, but 50 years older.
    (Older, not old!)

    At 71 real years old I still operate as if I were <let's say> 21.

    You are not 71, you are merely 0x47 :-)

    It is only 27 in base 32.

  • From EricP@21:1/5 to Terje Mathisen on Wed Sep 18 10:10:21 2024
    Terje Mathisen wrote:
    EricP wrote:

    Codecs likely have to deal with double-width straddles a lot, whatever
    the register word size. So for them it likely happens at 64-bits already.

    Nothing likely about it: LZ4 is pretty much the only compression
    algorithm/lossless codec that never straddles; all the rest tend to
    treat the source data as a single bitstream of arbitrary length, except
    for some built-in chunking mechanism which simplifies faster scanning.

    The core of the algorithm always starts with knowing the endianness,
    then picking up 32 or 64-bit chunks of input data (byte-flipping if
    needed) and then extracting the next N bits either from the top or bottom
    of the buffer register.

    Almost by definition, this is not code that a compiler is set up to help
    you get correct.
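    The chunk-and-extract loop described above can be sketched in C as an
    MSB-first bit reader. This is an illustration under simple assumptions:
    big-endian bit order, reads of 1 to 57 bits, and at least 8 readable bytes
    past the current position. It is not taken from any particular codec, and
    the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf;   /* needs >= 8 readable bytes past the last bit read */
    size_t bitpos;        /* next bit to read, counted MSB-first */
} bitreader;

/* Big-endian 64-bit load: the first byte becomes the most significant,
   which is the "byte-flip if needed" step done portably. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Read n bits; a field may straddle byte and chunk boundaries freely. */
static uint64_t br_read(bitreader *br, unsigned n)
{
    assert(n >= 1 && n <= 57);   /* up to 7 bits of the chunk may be stale */
    uint64_t chunk = load_be64(br->buf + (br->bitpos >> 3));
    chunk <<= (br->bitpos & 7);  /* drop bits already consumed in byte 0 */
    br->bitpos += n;
    return chunk >> (64 - n);    /* take the next n bits from the top */
}
```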


    I added a bunch of instructions for dealing with double-width operations.
    The main ISA design decision is whether to have register pair specifiers,
    R0, R2, R4,... or two separate {r_high,r_low} registers.
    In either case the main uArch issue is that now instructions have an
    extra source register and two dest registers, which has a number of
    consequences.
    But once you bite the bullet on that it simplifies a lot of things,
    like how to deal with carry or overflow without flags,
    full width multiplies, divide producing both quotient and remainder.

    Very nice!

    This means that you can do integer IMAC(), right?

    (hi, lo) = imac(a, b, c); // == a*b+c

    The only thing even nicer from the perspective of writing arbitrary
    precision library code would be IMAA, i.e. a*b+c+d since that is the
    largest combination which is guaranteed to never overflow the double
    register target field.

    Terje


    I thought about IMAC but it was a bit too much.
    And unlike FMA there is no precision gain in IMAC, just convenience.
    IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
    care about overflow/carry on the accumulate.
    2-wide = 2-wide + narrow * narrow
    It needs 7 registers, 3 dest and 4 source if you want overflow/carry
    on the accumulate.
    3-wide = 2-wide + narrow * narrow

    I wanted to support checked arithmetic which means full width multiplies.
    And I was always bothered by the risc approach of MULL (low part) and
    MULH (high part) where they do most of the multiply then toss half away
    just because they won't have 2 dest registers.

    So what else can I do with 2 dest registers? Wide add and sub.
    Various wide Add,Sub forms solve the missing carry/overflow flags problems.

    FMA already requires 3 source registers.
    Besides Add,Sub,Mul, what else can one do with 3 source and 2 dest registers?
    Wide shifts and wide bit-field extract and insert.

    I went with two (r_hi,r_lo) register specifiers because it gave programmers
    more flexibility. I played a bit with even register pairs (R0, R2, R4...)
    and found one had to do extra MOVs just to form a pair.
    (r_hi,r_lo) costs a longer instruction format, but I have variable-length
    instructions, so it's mostly wider fetch and decode pathways to handle
    the worst-case instruction size.

    W = Wide = (hi,lo) register pair, N = Narrow = one register.

    Add forms:
    Add N = N + N // No carry out
    Add3 N = N + N + N // No carry out
    Addw2 W = N + N // Generate carry
    Addw3 W = N + N + N // Generate + propagate carry
    Addw1 W = W + N // Propagate carry

    Same for subtract wide.
    The three Add forms are chosen to make multi-precision integer
    multiply easier. See below.

    Muluw W = N * N
    Mulsw W = N * N

    Divuw (quo,rem) = N / N
    Divsw (quo,rem) = N / N

    Shllw W = W << size // Shift left logical
    Shlaw W = W << size // Shift left arithmetic, fault on signed overflow
    Shrlw W = W >> size // Shift right logical
    Shraw W = W >> size // Shift right arithmetic, sign extend
    Shrnw W = W >> size // Shift right numeric, round -1 to zero

    Bfextu N = extract (W, size, position) // Bit-field extract, zero extend
    Bfexts N = extract (W, size, position) // Bit-field extract, sign extend
    Bfins W = insert (W, N, size, position) // Bit-field insert
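    As a rough model of what Bfextu/Bfexts compute, here is a C sketch with a
    64-bit narrow register and a (hi,lo) pair as the wide operand. The
    funnel-shift case handles a field that straddles the two halves. The
    function names mirror EricP's mnemonics, but the semantics are assumptions:
    bits are numbered from the low end of lo, and the size limits are as
    asserted.

```c
#include <assert.h>
#include <stdint.h>

/* Model of Bfextu: zero-extended field of `size` bits starting at bit `pos`
   of the 128-bit pair (hi,lo), where bit 0 is the least significant bit of lo. */
static uint64_t bfextu(uint64_t hi, uint64_t lo, unsigned size, unsigned pos)
{
    assert(size >= 1 && size <= 64 && pos + size <= 128);
    uint64_t v;
    if (pos == 0)
        v = lo;
    else if (pos < 64)
        v = (lo >> pos) | (hi << (64 - pos));   /* funnel shift across the halves */
    else
        v = hi >> (pos - 64);
    return size == 64 ? v : v & (((uint64_t)1 << size) - 1);
}

/* Model of Bfexts: same field, sign-extended. Size is limited to 63 here
   to keep the conversion to int64_t well defined. */
static int64_t bfexts(uint64_t hi, uint64_t lo, unsigned size, unsigned pos)
{
    assert(size >= 1 && size <= 63);
    uint64_t v = bfextu(hi, lo, size, pos);
    uint64_t sign = (uint64_t)1 << (size - 1);
    return (int64_t)(v ^ sign) - (int64_t)sign;
}
```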

    =====================================
    Example unsigned 128 * 128 => 256 multiply:

    // Unsigned Multiply 128*128 => 256
    // (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
    // Uses r4,r5,r6,r7,r8 as temp registers
    //
    muluw r5,r4 = r3*r0
    muluw r6,r0 = r2*r0
    muluw r8,r7 = r2*r1
    muluw r3,r2 = r3*r1
    addw3 r4,r1 = r4+r6+r7
    addw3 r5,r2 = r5+r8+r2
    addw2 r4,r2 = r2+r4
    add3 r3 = r3+r5+r4

    The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
    than the even number register pairs R0,R2,R4... is because the above
    sequence would require extra moves to form the even-numbered pairs.
    With separate pairs one can select registers so that everything lands
    in the right dest at the right time.
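    The listing can be checked by modelling each instruction in C. This sketch
    assumes 32-bit narrow registers, so it computes 64x64 => 128 with the same
    dataflow as the 128x128 => 256 case. It uses hypothetical helper
    functions named after the mnemonics, and the comments map each line to
    the original register assignment.

```c
#include <stdint.h>

typedef struct { uint32_t hi, lo; } wide;   /* W = (hi,lo) register pair */

static wide muluw(uint32_t a, uint32_t b)                /* W = N * N */
{
    uint64_t p = (uint64_t)a * b;
    return (wide){ (uint32_t)(p >> 32), (uint32_t)p };
}

static wide addw3(uint32_t a, uint32_t b, uint32_t c)    /* W = N + N + N */
{
    uint64_t s = (uint64_t)a + b + c;
    return (wide){ (uint32_t)(s >> 32), (uint32_t)s };
}

static wide addw2(uint32_t a, uint32_t b)                /* W = N + N */
{
    return addw3(a, b, 0);
}

static uint32_t add3(uint32_t a, uint32_t b, uint32_t c) /* N = N + N + N */
{
    return a + b + c;                                    /* no carry out */
}

/* (a1,a0) * (b1,b0) => r[3..0], r[0] least significant.
   Original names: (r3,r2) = (a1,a0), (r1,r0) = (b1,b0). */
static void mul_2x2(uint32_t a1, uint32_t a0, uint32_t b1, uint32_t b0,
                    uint32_t r[4])
{
    wide hl = muluw(a1, b0);                  /* muluw r5,r4 = r3*r0    */
    wide ll = muluw(a0, b0);                  /* muluw r6,r0 = r2*r0    */
    wide lh = muluw(a0, b1);                  /* muluw r8,r7 = r2*r1    */
    wide hh = muluw(a1, b1);                  /* muluw r3,r2 = r3*r1    */
    wide s1 = addw3(hl.lo, ll.hi, lh.lo);     /* addw3 r4,r1 = r4+r6+r7 */
    wide s2 = addw3(hl.hi, lh.hi, hh.lo);     /* addw3 r5,r2 = r5+r8+r2 */
    wide s3 = addw2(s2.lo, s1.hi);            /* addw2 r4,r2 = r2+r4    */
    r[0] = ll.lo;
    r[1] = s1.lo;
    r[2] = s3.lo;
    r[3] = add3(hh.hi, s2.hi, s3.hi);         /* add3 r3 = r3+r5+r4     */
}
```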

  • From MitchAlsup1@21:1/5 to BGB on Wed Sep 18 14:27:28 2024
    On Wed, 18 Sep 2024 4:00:43 +0000, BGB wrote:

    On 9/17/2024 6:04 PM, MitchAlsup1 wrote:

    Still limited to 32-bit displacement from IP.

    How would you perform the following call::
    current IP = 0x0000000000001234
    target  IP = 0x7FFFFFFF00001234

    This is a single (2-word) instruction in my ISA, assuming GOT is
    32-bit displaceable and 64-bit entries.


    Granted, but in plain RISC-V, there is no real better option.

    If one wants to generate a 64-bit displacement, and doesn't want to load a
    constant from memory:
    LUI X6, Disp20Hi //20 bits
    ADDI X6, X6, Disp12Hi //12 bits
    AUIPC X7, Disp20Lo
    ADDI X7, X7, Disp12Lo
    SLLI X6, X6, 32
    ADD X7, X7, X6
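    The awkward part of building a large displacement from pieces is that each
    low part is sign-extended. The high part must therefore be rounded up
    whenever the low part's top bit is set. Here is a C sketch of that split
    arithmetic, which uses the same rounding the RISC-V assembler applies for
    %hi/%lo. The function names are hypothetical, and the sketch does not
    model RV64's sign extension of LUI/AUIPC results.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of x (1 <= bits <= 63). */
static int64_t sext_low(uint64_t x, unsigned bits)
{
    uint64_t sign = (uint64_t)1 << (bits - 1);
    x &= (sign << 1) - 1;                      /* keep only the low `bits` bits */
    return (int64_t)(x ^ sign) - (int64_t)sign;
}

/* Split value so that (hi << lo_bits) + lo == value (mod 2^64), where lo is
   a sign-extended lo_bits-wide immediate. Because lo may be negative, hi is
   rounded up whenever bit (lo_bits - 1) of value is set. */
static void split_hi_lo(uint64_t value, unsigned lo_bits,
                        uint64_t *hi, int64_t *lo)
{
    assert(lo_bits >= 1 && lo_bits <= 63);
    *lo = sext_low(value, lo_bits);
    *hi = (value - (uint64_t)*lo) >> lo_bits;
}
```

    For Mitch's example (target 0x7FFFFFFF00001234 from 0x1234) the
    displacement 0x7FFFFFFF00000000 splits cleanly with a zero low half. A
    displacement like 0x180000000 needs the high half bumped from 1 to 2.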

    How very much simpler is::

    MEM Rd,[IP,Ri<<s,DISP64]

    1 instruction, 3 words, 1 decode cycle, no forwarding, shorter latency.

    Which is sort of the whole reason I am considering hacking around it
    with an alternate encoding scheme.

    Just put in real constants.

    New encoding scheme can in theory do:
    LEA X7, PC, Disp64
    In a single 96-bit instruction.

    Where is the indexing register?

    ------------

    AUIPC is (and remains) a crutch (like LUI from MIPS)
    a) it consumes an instruction (space and time)
    b) it consumes a register unnecessarily
    c) it consumes power that direct delivery of the constant would not

    Yeah, pretty much.
    LUI + AUIPC + JAL, eat nearly 27 bits of encoding space.


  • From MitchAlsup1@21:1/5 to BGB on Wed Sep 18 18:42:19 2024
    On Wed, 18 Sep 2024 17:55:34 +0000, BGB wrote:

    On 9/18/2024 9:27 AM, MitchAlsup1 wrote:
    On Wed, 18 Sep 2024 4:00:43 +0000, BGB wrote:

    On 9/17/2024 6:04 PM, MitchAlsup1 wrote:

    Still limited to 32-bit displacement from IP.

    How would you perform the following call::
    current IP = 0x0000000000001234
    target  IP = 0x7FFFFFFF00001234

    This is a single (2-word) instruction in my ISA, assuming GOT is
    32-bit displaceable and 64-bit entries.


    Granted, but in plain RISC-V, there is no real better option.

    If one wants to generate a 64-bit displacement, and doesn't want to load a
    constant from memory:
       LUI X6, Disp20Hi       //20 bits
       ADDI X6, X6, Disp12Hi  //12 bits
       AUIPC X7, Disp20Lo
       ADDI X7, X7, Disp12Lo
       SLLI X6, X6, 32
       ADD X7, X7, X6

    How very much simpler is::

        MEM    Rd,[IP,Ri<<s,DISP64]

    1 instruction, 3 words, 1 decode cycle, no forwarding, shorter latency.


    It is simpler, but N/E in RV64G...

    This is the whole issue of the idea:
    Remain backwards compatible with RV64G / RV64GC (in a binary sense).

    So, you like sailing with an albatross tied around your neck:: Check.

    *and* try to allow extending it in a way such that performance can be
    less poor...

    I should remind you that if you eliminate the compressed parts of
    RISC-V you can fit the entire My 66000 ISA in the space remaining.
    All the constants, all transcendentals, all the far-control transfers,
    the efficient context switching, overhead free world switching,...
    ---------

    Which is sort of the whole reason I am considering hacking around it
    with an alternate encoding scheme.

    Just put in real constants.

    New encoding scheme can in theory do:
       LEA X7, PC, Disp64
    In a single 96-bit instruction.

    Where is the indexing register?

    Generally the use of a displacement and index register are mutually
    exclusive (and, cases that can make use of Disp AND Index are much less common than Disp OR Index).

    COMMON /alpha/ a(100,100), b(300,300),

    ..

    x = a(i,j)*b(j,i);

    I see large displacements with indexing all the time from ASM out
    of Brian's compiler.
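    The Fortran case produces exactly this pattern. b lives at a fixed, large
    offset inside the COMMON block, and the subscripts form a scaled index.
    Here is a C sketch of the address arithmetic, assuming DOUBLE PRECISION
    elements and no padding in the block. struct alpha and b_offset are
    hypothetical.

```c
#include <stddef.h>

/* Hypothetical C view of COMMON /alpha/ a(100,100), b(300,300).
   Fortran is column-major, so b(j,i) maps to b[i-1][j-1]. */
struct alpha {
    double a[100][100];
    double b[300][300];
};

/* Byte offset of b(j,i) from the start of the block: a large constant
   displacement (the size of a) plus a scaled index. */
static size_t b_offset(size_t j, size_t i)
{
    return offsetof(struct alpha, b)
         + ((i - 1) * 300 + (j - 1)) * sizeof(double);
}
```

    The displacement here is 80000 bytes before any index is applied, far
    beyond a 12-bit field. The -1 terms from the 1-based subscripts can be
    folded into that same constant.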

    I may still consider defining an encoding for this, but not yet. It is
    in a similar boat as auto-increment. Both add resource cost with
    relatively little benefit in terms of overall performance.
    Auto-increment because if one has superscalar, the increment can usually
    be co-executed. And, full [Rb+Ri*Sc+Disp], because it is just too
    infrequent to really justify the extra cost of a 3-way adder even if
    limited mostly to the low-order bits...

    Myopathy--look it up.

  • From Brett@21:1/5 to EricP on Wed Sep 18 21:15:55 2024
    EricP <[email protected]> wrote:
    Terje Mathisen wrote:
    EricP wrote:

    Codecs likely have to deal with double-width straddles a lot, whatever
    the register word size. So for them it likely happens at 64-bits already.
    Nothing likely about it: LZ4 is pretty much the only compression
    algorithm/lossless codec that never straddles; all the rest tend to
    treat the source data as a single bitstream of arbitrary length, except
    for some built-in chunking mechanism which simplifies faster scanning.

    The core of the algorithm always starts with knowing the endianness,
    then picking up 32 or 64-bit chunks of input data (byte-flipping if
    needed) and then extracting the next N bits either from the top or bottom
    of the buffer register.

    Almost by definition, this is not code that a compiler is set up to help
    you get correct.


    I added a bunch of instructions for dealing with double-width operations.
    The main ISA design decision is whether to have register pair specifiers,
    R0, R2, R4,... or two separate {r_high,r_low} registers.
    In either case the main uArch issue is that now instructions have an
    extra source register and two dest registers, which has a number of
    consequences.
    But once you bite the bullet on that it simplifies a lot of things,
    like how to deal with carry or overflow without flags,
    full width multiplies, divide producing both quotient and remainder.

    Very nice!

    This means that you can do integer IMAC(), right?

    (hi, lo) = imac(a, b, c); // == a*b+c

    The only thing even nicer from the perspective of writing arbitrary
    precision library code would be IMAA, i.e. a*b+c+d since that is the
    largest combination which is guaranteed to never overflow the double
    register target field.

    Terje


    I thought about IMAC but it was a bit too much.
    And unlike FMA there is no precision gain in IMAC, just convenience.
    IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
    care about overflow/carry on the accumulate.
    2-wide = 2-wide + narrow * narrow
    It needs 7 registers, 3 dest and 4 source if you want overflow/carry
    on the accumulate.
    3-wide = 2-wide + narrow * narrow

    I wanted to support checked arithmetic which means full width multiplies.
    And I was always bothered by the risc approach of MULL (low part) and
    MULH (high part) where they do most of the multiply then toss half away
    just because they won't have 2 dest registers.

    I always assumed that MULH just grabbed the part that would have been
    thrown away. And that is how at least one RISC-V core does it:

    https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

    They claim 5 cycles, should be six, five for the multiply and one more for
    the second result, unless the next instruction does not need a write port,
    and does not use the result. You can get a throughput of 5 cycles with
    smart coding, but that rarely happens without effort.

    So what else can I do with 2 dest registers? Wide add and sub.
    Various wide Add,Sub forms solve the missing carry/overflow flags problems.

    FMA already requires 3 source registers.
    Besides Add,Sub,Mul, what else can one do with 3 source and 2 dest registers?
    Wide shifts and wide bit-field extract and insert.

    I went with two (r_hi,r_lo) register specifiers because it gave programmers
    more flexibility. I played a bit with even register pairs (R0, R2, R4...)
    and found one had to do extra MOVs just to form a pair.
    (r_hi,r_lo) costs a longer instruction format, but I have variable-length
    instructions, so it's mostly wider fetch and decode pathways to handle
    the worst-case instruction size.

    W = Wide = (hi,lo) register pair, N = Narrow = one register.

    Add forms:
    Add N = N + N // No carry out
    Add3 N = N + N + N // No carry out
    Addw2 W = N + N // Generate carry
    Addw3 W = N + N + N // Generate + propagate carry
    Addw1 W = W + N // Propagate carry

    Same for subtract wide.
    The three Add forms are chosen to make multi-precision integer
    multiply easier. See below.

    Muluw W = N * N
    Mulsw W = N * N

    Divuw (quo,rem) = N / N
    Divsw (quo,rem) = N / N

    Shllw W = W << size // Shift left logical
    Shlaw W = W << size // Shift left arithmetic, fault on signed overflow
    Shrlw W = W >> size // Shift right logical
    Shraw W = W >> size // Shift right arithmetic, sign extend
    Shrnw W = W >> size // Shift right numeric, round -1 to zero

    Bfextu N = extract (W, size, position) // Bit-field extract, zero extend
    Bfexts N = extract (W, size, position) // Bit-field extract, sign extend
    Bfins W = insert (W, N, size, position) // Bit-field insert

    =====================================
    Example unsigned 128 * 128 => 256 multiply:

    // Unsigned Multiply 128*128 => 256
    // (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
    // Uses r4,r5,r6,r7,r8 as temp registers
    //
    muluw r5,r4 = r3*r0
    muluw r6,r0 = r2*r0
    muluw r8,r7 = r2*r1
    muluw r3,r2 = r3*r1
    addw3 r4,r1 = r4+r6+r7
    addw3 r5,r2 = r5+r8+r2
    addw2 r4,r2 = r2+r4
    add3 r3 = r3+r5+r4

    The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
    than the even number register pairs R0,R2,R4... is because the above
    sequence would require extra moves to form the even-numbered pairs.
    With separate pairs one can select registers so that everything lands
    in the right dest at the right time.

  • From MitchAlsup1@21:1/5 to Brett on Thu Sep 19 00:35:03 2024
    On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:

    EricP <[email protected]> wrote:
    Terje Mathisen wrote:
    EricP wrote:

    I always assumed that MULH just grabbed the part that would have been
    thrown away. And that is how at least one RISC-V core does it:

    https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

    They claim 5 cycles, should be six, five for the multiply and one more
    for the second result, unless the next instruction does not need a write
    port, and does not use the result. You can get a throughput of 5 cycles
    with smart coding, but that rarely happens without effort.

    It is easy enough in the decoder to recognize a MUL followed by MULH
    (and vice versa) as using the multiplier tree once and delivering 2
    results. So the first result is 6 cycles, the second result on the 6th
    cycle. {you ALMOST have to do this to avoid large wastes in power.}
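    For reference, here is how both halves fall out of one set of partial
    products, written in C with 32x32 => 64 building blocks. It is a software
    sketch, not a description of any particular multiplier array.
    umul64_wide is a hypothetical name.

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit unsigned multiply from four 32x32 -> 64 partial
   products. *lo is what MUL returns and *hi is what MULHU returns; both
   come from the same partial products. */
static void umul64_wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi;   /* weight 2^64 */

    /* Middle column: at most 3 * (2^32 - 1), so it cannot overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```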

  • From Terje Mathisen@21:1/5 to EricP on Thu Sep 19 08:34:19 2024
    EricP wrote:
    Terje Mathisen wrote:

    Very nice!

    This means that you can do integer IMAC(), right?

    (hi, lo) = imac(a, b, c); // == a*b+c

    The only thing even nicer from the perspective of writing arbitrary
    precision library code would be IMAA, i.e. a*b+c+d since that is the
    largest combination which is guaranteed to never overflow the double
    register target field.


    I thought about IMAC but it was a bit too much.
    And unlike FMA there is no precision gain in IMAC, just convenience.
    IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
    care about overflow/carry on the accumulate.
      2-wide = 2-wide + narrow * narrow

    No, no! IMAC is three in, two out, so in your syntax:

    W = N*N+N

    or

    (rhi, rlo) = imac(r0,r1,r2)

    It needs 7 registers, 3 dest and 4 source if you want overflow/carry
    on the accumulate.
      3-wide = 2-wide + narrow * narrow

    Otoh, if you do have all the wide add forms you outlined below,
    including the "full adder" with three inputs and a wide/pair output,
    then the carry propagations do become easier, and just doing

    (a,b) = muluw(e,f)
    (a,b) = addw1(a,b,g)

    would do the same as my suggested

    (a,b) = imac(e,f,g)
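    Here is a quick C model of that equivalence, assuming 32-bit narrow
    registers. muluw, addw1 and imac are hypothetical functions named after
    the mnemonics in the thread.

```c
#include <stdint.h>

typedef struct { uint32_t hi, lo; } wide;   /* W = (hi,lo) register pair */

static wide muluw(uint32_t a, uint32_t b)   /* W = N * N */
{
    uint64_t p = (uint64_t)a * b;
    return (wide){ (uint32_t)(p >> 32), (uint32_t)p };
}

static wide addw1(wide w, uint32_t n)       /* W = W + N, carry into hi */
{
    uint64_t s = (((uint64_t)w.hi << 32) | w.lo) + n;
    return (wide){ (uint32_t)(s >> 32), (uint32_t)s };
}

/* Fused form W = N*N + N; the worst case 2^64 - 2^32 still fits the pair. */
static wide imac(uint32_t a, uint32_t b, uint32_t c)
{
    uint64_t s = (uint64_t)a * b + c;
    return (wide){ (uint32_t)(s >> 32), (uint32_t)s };
}
```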

    Anyway, very nice!

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to Brett on Thu Sep 19 11:07:11 2024
    Brett wrote:
    EricP <[email protected]> wrote:

    I wanted to support checked arithmetic which means full width multiplies.
    And I was always bothered by the risc approach of MULL (low part) and
    MULH (high part) where they do most of the multiply then toss half away
    just because they won't have 2 dest registers.

    I always assumed that MULH just grabbed the part that would have been
    thrown away. And that is how at least one RISC-V core does it:

    https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

    They claim 5 cycles, should be six, five for the multiply and one more for the second result, unless the next instruction does not need a write port, and does not use the result. You can get a throughput of 5 cycles with
    smart coding, but that rarely happens without effort.

    That article is ignoring multiplier pipelining.
    If the multiplier is pipelined with a latency of 5 and throughput of 1,
    then MULL takes 5 cycles and MULL,MULH takes 6.

    But those two multiplies still are tossing away 50% of their work.
    And if it does fuse them then the internal uArch cost is the same as if
    you had designed it optimally from the start, except now you have
    to pay for a fuser.

    <sound of soap box being dragged out>
    This idea that macro-op fusion is some magic solution is bullshit.
    1) It's not free.
    2) It only works where Decode can see *all* the required lookahead
    instructions, which means you have to pay for an N-lane decoder
    but only get 1 lane.
    3) It's probabilistic as it depends on how the fetch buffers get loaded.
    Eg if the fetch buffer contains a valid instruction but does not have
    a next instruction, do you stall Decode to see if a fuser might arrive
    or dispatch it anyway?
    4) It gets exponentially expensive if you start doing multiple instruction
    lanes because decode has to deal with all the permutations of
    fusion possibilities.
    5) Any fused instructions leave (multiple) bubbles that should be
    compacted out or there wasn't much point to doing the fusion.

    In my opinion it is better to have an ISA that is optimal by design
    rather than being patched up by fusion later.

    Some of this inefficiency is caused by clinging to now 40 year old
    risc design *guidelines* (ie not even rules) that:
    - instructions have at most 1 dest and 2 source registers
    - register specifier fields are either source or dest, never both
    - instructions should take at most 1 clock (they never did)

    These self imposed design restrictions cause ISA designers to miss
    some possible more optimal solutions. The result is things like
    RISC-V's memory reference linkage structures taking 6 instructions
    to build a 64-bit PC-relative address. And I'm pretty sure we won't
    see any 6 instruction fusers for quite some time.

    <sound of soap box being dragged back to cupboard>

  • From MitchAlsup1@21:1/5 to EricP on Thu Sep 19 16:01:48 2024
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

    Brett wrote:
    EricP <[email protected]> wrote:

    They claim 5 cycles, should be six, five for the multiply and one more
    for the second result, unless the next instruction does not need a write
    port, and does not use the result. You can get a throughput of 5 cycles
    with smart coding, but that rarely happens without effort.

    That article is ignoring multiplier pipelining.
    If the multiplier is pipelined with a latency of 5 and throughput of 1,
    then MULL takes 5 cycles and MULL,MULH takes 6.

    But those two multiplies still are tossing away 50% of their work.
    And if it does fuse them then the internal uArch cost is the same as if
    you had designed it optimally from the start, except now you have
    to pay for a fuser.

    You failed to recognize the critical part of my comment on this::

    When the IMUL function unit sees MULL and MULH back to back AND
    when both operands are the same for both instructions; it KNOWS
    that the second multiply has the same result as the first and
    thereby that the second multiply can be suppressed and the first
    multiply used twice. {{In pure CMOS, if you drop the same operands
    twice into the multiplier tree, the multiplier tree burns no power
    in any event, just the operand delivery power.}}

    You may call this fusion, but it is the very lowest level of it
    and was not called such when first used.
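    A minimal behavioural sketch of the point above, written in Python as a bit-exact model and assuming 64-bit unsigned operands: one trip through the multiplier produces the full 128-bit product, and MULL and MULH are simply its low and high halves, so a same-operand pair needs only one multiply. The function names are illustrative. On RISC-V the matching instructions would be MUL and MULHU; RISC-V's MULH is the signed high half.

```python
MASK64 = (1 << 64) - 1  # model 64-bit registers


def multiply_tree(a, b):
    """One trip through the (modelled) multiplier: the full 128-bit product."""
    return (a & MASK64) * (b & MASK64)


def mull(a, b):
    """Low 64 bits of the product (what MULL keeps)."""
    return multiply_tree(a, b) & MASK64


def mulhu(a, b):
    """High 64 bits of the unsigned product (what an unsigned MULH keeps)."""
    return multiply_tree(a, b) >> 64


def mull_mulh_pair(a, b):
    """Back-to-back MULL/MULH on the same operands: multiply once, deliver twice."""
    product = multiply_tree(a, b)
    return product & MASK64, product >> 64
```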

    <sound of soap box being dragged out>
    This idea that macro-op fusion is some magic solution is bullshit.
    Agreed
    1) It's not free.
    Far from it.
    2) It only works where Decode can see *all* the required lookahead
    instructions, which means you have to pay for an N-lane decoder
    but only get 1 lane.
    I think it is but a crutch for a misdesigned ISA
    3) It's probabilistic as it depends on how the fetch buffers get loaded.
    E.g. if the fetch buffer contains a valid instruction but does not have
    a next instruction, do you stall Decode to see if a fuser might arrive,
    or dispatch it anyway?
    It can be worse than that
    4) It gets exponentially expensive if you start doing multiple instruction
    lanes because decode has to deal with all the permutations of
    fusion possibilities.
    All the more reason to have a better ISA
    5) Any fused instructions leave (multiple) bubbles that should be
    compacted out or there wasn't much point to doing the fusion.

    One of the interesting things I have noticed with my ISA is that
    when one has a properly designed higher level ISA, one gets rid
    of so many of the "easy to schedule" instructions that one ends
    up with 30 FMAC instructions in a row, with no other instruction
    to occupy any of the other function units.

    In my opinion it is better to have an ISA that is optimal by design
    rather than being patched up by fusion later.

    Indeed.

    Some of this inefficiency is caused by clinging to now 40-year-old
    RISC design *guidelines* (i.e. not even rules) that:
    - instructions have at most 1 dest and 2 source registers
    Makes FMAC hard
    - register specifier fields are either source or dest, never both
    I happen to be wishy-washy on this
    - instructions should take at most 1 clock (they never did)
    This never worked for floating point anyway...and many consider
    branches and memory references as not fitting that tenet either.

    What is required is that each instruction can be decoded in a single
    cycle and delivered to whichever function unit in one cycle.

    These self-imposed design restrictions cause ISA designers to miss
    some potentially better solutions. The result is things like
    RISC-V's memory reference linkage structures taking 6 instructions
    to build a 64-bit PC-relative address. And I'm pretty sure we won't
    see any 6 instruction fusers for quite some time.

    And it is just "so unnecessary".

    I suspect that RISC-V will end up choosing AUIPC-LD-JMP instead,
    losing the PIC nature of flow control.

    Doing it right the first time is so much easier for everyone now
    and down the line.

    <sound of soap box being dragged back to cupboard>

  • From EricP@21:1/5 to All on Thu Sep 19 11:29:08 2024
    MitchAlsup1 wrote:
    On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:

    EricP <[email protected]> wrote:
    Terje Mathisen wrote:
    EricP wrote:

    I always assumed that MULH just grabbed the part that would have been
    thrown away. And that is how at least one RISC-V core does it:

    https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit


    They claim 5 cycles, should be six, five for the multiply and one more
    for the second result, unless the next instruction does not need a write
    port, and does not use the result. You can get a throughput of 5 cycles
    with smart coding, but that rarely happens without effort.

    It is easy enough in the decoder to recognize a MUL followed by MULH
    (and vice versa) as using the multiplier tree once and delivering 2
    results. So the first result is 6 cycles, the second result on the 6th
    cycle. {you ALMOST have to do this to avoid large wastes in power.}

    Yes, but then you *require* a macro-op fuser to function efficiently. Probably... assuming it works.

    OR one can give up the cherished 1-dest,2-source self-imposed ISA design limitation and have a 32-bit instruction with four 5-bit registers,
    2 source, 2 dest, leaving 12 bits for opcode and function code
    that you know will calculate multiply once, and can write back
    the result in 1 clock if it has two write ports (which it needs
    anyway if it wants any hope of catching up after a stall bubble).

    Also in the case of Alpha they only had unsigned MUL,MULH and
    for signed multiply it had to use branchy code (pre-CMOV) to
    do the signed correction subtracts, so fusion would be too complex.
    That design decision is as baffling as HP-PA originally leaving
    a MUL instruction out entirely because "it violated the 1-clock per
    instruction design philosophy". (HP quickly fixed it, but still...)
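    A sketch of the signed correction described above, assuming only an unsigned high multiply (Alpha's UMULH) is available and modelling 64-bit registers in Python: the signed high half equals the unsigned high half, minus b when a is negative and minus a when b is negative, modulo 2^64. The two `if` tests stand in for the branches that pre-CMOV code needed. Function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def to_signed64(x):
    """Interpret a 64-bit pattern as two's complement."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def umulh(a, b):
    """Unsigned high half of the 128-bit product (UMULH-style)."""
    return ((a & MASK64) * (b & MASK64)) >> 64


def smulh_via_umulh(a, b):
    """Signed high half built from UMULH plus conditional correction subtracts."""
    a &= MASK64
    b &= MASK64
    hi = umulh(a, b)
    if a >> 63:   # a is negative as a signed value: subtract b
        hi -= b
    if b >> 63:   # b is negative as a signed value: subtract a
        hi -= a
    return hi & MASK64


def smulh_reference(a, b):
    """Direct signed multiply, used only to check the correction."""
    return ((to_signed64(a) * to_signed64(b)) >> 64) & MASK64
```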

  • From Thomas Koenig@21:1/5 to EricP on Thu Sep 19 18:46:04 2024
    EricP <[email protected]> schrieb:
    And I'm pretty sure we won't
    see any 6 instruction fusers for quite some time.

    That would probably blow a fuse.

    SCNR,

  • From Brett@21:1/5 to [email protected] on Thu Sep 19 19:12:41 2024
    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

    Brett wrote:
    EricP <[email protected]> wrote:

    They claim 5 cycles, should be six, five for the multiply and one more
    for the second result, unless the next instruction does not need a write
    port, and does not use the result. You can get a throughput of 5 cycles
    with smart coding, but that rarely happens without effort.

    That article is ignoring multiplier pipelining.
    If the multiplier is pipelined with a latency of 5 and throughput of 1,
    then MULL takes 5 cycles and MULL,MULH takes 6.

    But those two multiplies still are tossing away 50% of their work.
    And if it does fuse them then the internal uArch cost is the same as if
    you had designed it optimally from the start, except now you have
    to pay for a fuser.

    You failed to recognize the critical part of my comment on this::

    When the IMUL function unit sees MULL and MULH back to back AND
    when both operands are the same for both instructions; it KNOWS
    that the second multiply has the same result as the first and
    thereby that the second multiply can be suppressed and the first
    multiply used twice. {{In pure CMOS, if you drop the same operands
    twice into the multiplier tree, the multiplier tree burns no power
    in any event, just the operand delivery power.}}

    You may call this fusion, but it is the very lowest level of it
    and was not called such when first used.

    <sound of soap box being dragged out>


    - register specifier fields are either source or dest, never both

    I happen to be wishy-washy on this


    This is deeply interesting. Can you expound on why it is fine for a
    register field to be shared by loads and stores, and sometimes used as
    both source and destination, as in x86?

    Classic RISC says the loads are critical, but no one is one wide today, so stores matter for deconfliction…. And does stuff just fall out right to
    allow both?

  • From Brett@21:1/5 to EricP on Thu Sep 19 19:12:42 2024
    EricP <[email protected]> wrote:
    MitchAlsup1 wrote:
    On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:

    EricP <[email protected]> wrote:
    Terje Mathisen wrote:
    EricP wrote:

    I always assumed that MULH just grabbed the part that would have been
    thrown away. And that is how at least one RISC-V core does it:

    https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit



    They claim 5 cycles, should be six, five for the multiply and one more
    for the second result, unless the next instruction does not need a write
    port, and does not use the result. You can get a throughput of 5 cycles
    with smart coding, but that rarely happens without effort.

    It is easy enough in the decoder to recognize a MUL followed by MULH
    (and vice versa) as using the multiplier tree once and delivering 2
    results. So the first result is 6 cycles, the second result on the 6th
    cycle. {you ALMOST have to do this to avoid large wastes in power.}

    Yes, but then you *require* a macro-op fuser to function efficiently. Probably... assuming it works.

    OR one can give up the cherished 1-dest,2-source self-imposed ISA design limitation and have a 32-bit instruction with four 5-bit registers,
    2 source, 2 dest, leaving 12 bits for opcode and function code
    that you know will calculate multiply once, and can write back
    the result in 1 clock if it has two write ports (which it needs
    anyway if it wants any hope of catching up after a stall bubble).

    You already have 2 source, 2 dest if you have load with address update.
    A low end CPU is going to have a shared INT/FPU pipeline so you have the hardware to do three sources for MAC. You might as well do 3 source 2 dest
    on the int side as well. And ARM does Add with Shift which is 3 sources,
    though one is a constant if you want one cycle uncracked throughput in most designs.

    Also in the case of Alpha they only had unsigned MUL,MULH and
    for signed multiply it had to use branchy code (pre-CMOV) to
    do the signed correction subtracts, so fusion would be too complex.
    That design decision is as baffling as HP-PA originally leaving
    a MUL instruction out entirely because "it violated the 1-clock per instruction design philosophy". (HP quickly fixed it, but still...)

  • From MitchAlsup1@21:1/5 to Brett on Thu Sep 19 20:21:20 2024
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishy-washy on this


    This is deeply interesting. Can you expound on why it is fine for a
    register field to be shared by loads and stores, and sometimes used as
    both source and destination, as in x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    |  BC  | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    |  LD  | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    |  ST  | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds |    whatever    |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.
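    A minimal sketch of the "fixed field, fixed port" property, assuming a hypothetical layout that mirrors the field widths in the diagram above (major opcode in bits 31..26, the Rd/cnd/Rs0 field in 25..21, Rs1/Rb in 20..16, a 16-bit immediate below) and made-up opcode values; it is not the actual My 66000 encoding. The register fields come out with fixed shifts before the opcode is examined; only the decision about how the first field is used depends on the opcode.

```python
# Hypothetical 32-bit layout (an assumption, not the real My 66000 encoding):
#   [31:26] major  [25:21] Rd / cnd / Rs0  [20:16] Rs1 / Rb  [15:0] immediate
OP_ALU, OP_BC, OP_LD, OP_ST = 0x01, 0x18, 0x20, 0x28  # made-up opcodes


def encode(major, field_a, field_b, imm16):
    """Pack the four fields into one 32-bit instruction word."""
    return (((major & 0x3F) << 26) | ((field_a & 0x1F) << 21)
            | ((field_b & 0x1F) << 16) | (imm16 & 0xFFFF))


def register_fields(word):
    """Fixed positions: these bits can go straight to RF/renamer ports."""
    return (word >> 21) & 0x1F, (word >> 16) & 0x1F


def field_a_role(word):
    """Opcode-dependent enable: decides how field A is used, later in decode."""
    major = (word >> 26) & 0x3F
    if major == OP_ST:
        return "store data (read)"
    if major == OP_BC:
        return "branch condition"
    return "destination (write)"
```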

    Classic RISC says the loads are critical, but no one is one wide today,

    SiFive disagrees with you.

    so
    stores matter for deconfliction…. And does stuff just fall out right to allow both?

    Can you restate what you wanted to say using different words or perhaps
    give an example ??

  • From MitchAlsup1@21:1/5 to Brett on Thu Sep 19 20:30:38 2024
    On Thu, 19 Sep 2024 19:12:42 +0000, Brett wrote:

    EricP <[email protected]> wrote:
    MitchAlsup1 wrote:

    It is easy enough in the decoder to recognize a MUL followed by MULH
    (and vice versa) as using the multiplier tree once and delivering 2
    results. So the first result is 6 cycles, the second result on the 6th
    cycle. {you ALMOST have to do this to avoid large wastes in power.}

    Yes, but then you *require* a macro-op fuser to function efficiently.
    Probably... assuming it works.

    OR one can give up the cherished 1-dest,2-source self-imposed ISA design
    limitation and have a 32-bit instruction with four 5-bit registers,
    2 source, 2 dest, leaving 12 bits for opcode and function code
    that you know will calculate multiply once, and can write back
    the result in 1 clock if it has two write ports (which it needs
    anyway if it wants any hope of catching up after a stall bubble).

    You already have 2 source, 2 dest if you have load with address update.
    A low end CPU is going to have a shared INT/FPU pipeline so you have the hardware to do three sources for MAC. You might as well do 3 source 2
    dest on the int side as well. And ARM does Add with Shift which is 3
    sources, though one is a constant if you want one cycle uncracked
    throughput in most designs.

    Once you bite off on a shared INT/FP multiplier, and that the FP
    multiplier has to do FMAC, you HAVE 3-operand busses leaving the
    decoder stage.

    Those 3-operand busses give you [Rbase,Rindex<<scale,#displacement]
    memory reference address mode. You can say you only use it 2%
    of the time, but every time you need it and can't use it, it costs
    1-2 additional instructions, multiplying the 2% into the 5% range,
    making it worthwhile even if you only save ICache misses.
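    To make the instruction count concrete, here is a small sketch assuming 64-bit wrap-around addresses: the three-operand mode computes Rbase + (Rindex << scale) + displacement in one address generation, while a strictly two-source ISA spends a shift and an add before the load's own base+displacement step. RISC-V's Zba extension folds that shift and add into one sh1add/sh2add/sh3add, which is the one-instruction end of the 1-2 range. Function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def ea_indexed(rbase, rindex, scale, disp):
    """[Rbase, Rindex<<scale, #disp]: one AGEN step fed by three operands."""
    return (rbase + (rindex << scale) + disp) & MASK64


def ea_two_source(rbase, rindex, scale, disp):
    """Same address on a 2-source ISA, one operation per line."""
    t = (rindex << scale) & MASK64   # extra instruction 1: shift left
    t = (t + rbase) & MASK64         # extra instruction 2: add
    return (t + disp) & MASK64       # the load's own base + displacement
```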

    So, if you do FMAC, you have the bussing to do efficient Mem Refs.

    In addition:: if you have a pipelined FMAC unit, why NOT use it
    for integer Multiplication ??

    Additionally:: if you have a high-performance FDIV unit, you can
    borrow it for integer division at little cost, no matter if it
    is in the FMAC unit or if it is a separate unit from FMAC.

    Given the 3-operand busses:: one can have 128/64 division (a 128-bit
    dividend over a 64-bit divisor) at virtually no cost of calculation.
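    A sketch of what 128/64 division means at the register level, assuming unsigned operands: the dividend arrives as a (hi, lo) register pair plus the divisor, three operands in all, and the result is a 64-bit quotient and remainder. As with x86-64 `DIV`, the quotient only fits in 64 bits when hi < divisor, so the sketch rejects the other case. The function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def divu_128_by_64(hi, lo, divisor):
    """Unsigned (hi:lo) / divisor -> (quotient, remainder), 64 bits each."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if hi >= divisor:
        # dividend >= divisor * 2**64, so the quotient needs more than 64 bits
        raise OverflowError("quotient overflows 64 bits")
    dividend = ((hi & MASK64) << 64) | (lo & MASK64)
    return divmod(dividend, divisor)
```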

    THEREFORE: once you have 3-operand busses to support FMAC you
    should get all the bang out of them that you paid for.

  • From Brett@21:1/5 to [email protected] on Fri Sep 20 00:12:48 2024
    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishy-washy on this


    This is deeply interesting. Can you expound on why it is fine for a
    register field to be shared by loads and stores, and sometimes used as
    both source and destination, as in x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    |  BC  | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    |  LD  | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    |  ST  | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds |    whatever    |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.

    Classic RISC says the loads are critical, but no one is one wide today,

    SiFive disagrees with you.

    so
    stores matter for deconfliction…. And does stuff just fall out right to
    allow both?

    Can you restate what you wanted to say using different words or perhaps
    give an example ??


    A series of adds to the same register in a four wide design.

    A = A + 1
    A = A + B
    A = A + C
    A = A + D

  • From MitchAlsup1@21:1/5 to Brett on Fri Sep 20 01:05:34 2024
    On Fri, 20 Sep 2024 0:12:48 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishy-washy on this


    This is deeply interesting. Can you expound on why it is fine for a
    register field to be shared by loads and stores, and sometimes used as
    both source and destination, as in x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    |  BC  | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    |  LD  | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    |  ST  | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds |    whatever    |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.

    Classic RISC says the loads are critical, but no one is one wide today,

    SiFive disagrees with you.

    so
    stores matter for deconfliction…. And does stuff just fall out right to
    allow both?

    Can you restate what you wanted to say using different words or perhaps
    give an example ??


    A series of adds to the same register in a four wide design.

    A = A + 1
    A = A + B
    A = A + C
    A = A + D

    Which any good compiler should emit as::

    T1 = A + B
    T2 = C + D
    A = LEA( T1, T2, #1 )

    With a 2 cycle latency instead of 4.
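    A quick check of the rewrite above, assuming wrap-around 64-bit addition (unsigned, or signed with -fwrapv), under which reassociation is exact: the serial chain is four dependent adds deep, while the tree's two adds are independent and feed one three-input LEA-style add. Function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def serial_chain(a, b, c, d):
    """A = A+1; A = A+B; A = A+C; A = A+D: four dependent adds."""
    a = (a + 1) & MASK64
    a = (a + b) & MASK64
    a = (a + c) & MASK64
    a = (a + d) & MASK64
    return a


def reassociated(a, b, c, d):
    """T1 = A+B and T2 = C+D are independent; LEA(T1, T2, #1) finishes."""
    t1 = (a + b) & MASK64
    t2 = (c + d) & MASK64
    return (t1 + t2 + 1) & MASK64  # models the 3-input LEA
```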

  • From Brett@21:1/5 to [email protected] on Fri Sep 20 03:31:36 2024
    MitchAlsup1 <[email protected]> wrote:
    On Fri, 20 Sep 2024 0:12:48 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishy-washy on this


    This is deeply interesting. Can you expound on why it is fine for a
    register field to be shared by loads and stores, and sometimes used as
    both source and destination, as in x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    |  BC  | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    |  LD  | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    |  ST  | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds |    whatever    |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.

    Classic RISC says the loads are critical, but no one is one wide today,

    SiFive disagrees with you.

    so
    stores matter for deconfliction…. And does stuff just fall out right to
    allow both?

    Can you restate what you wanted to say using different words or perhaps
    give an example ??


    A series of adds to the same register in a four wide design.

    A = A + 1
    A = A + B
    A = A + C
    A = A + D

    Which any good compiler should emit as::

    T1 = A + B
    T2 = C + D
    A = LEA( T1, T2, #1 )

    With a 2 cycle latency instead of 4.

    The point was that you have three renames of A, so you can’t just blindly
    read the first A for all the instructions. This takes gate time to
    determine, so you can’t ignore the store field until later.
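    A behavioural sketch of the problem Brett describes, assuming a renamer that reads the mapping table for every slot in parallel straight from the register fields and then corrects any source that an older slot in the same group has just redefined. The nested comparison loop stands for the comparator logic that costs gate delay. All register and physical-register names are hypothetical.

```python
def rename_group(group, rat, free_list):
    """Rename one decode group the way a wide renamer might.

    group: list of (dest, srcs) using architectural register names
    rat: dict, architectural -> physical mapping before this group
    free_list: free physical registers, handed out in slot order
    Returns [(pdest, psrcs), ...] and leaves rat holding the new mappings.
    """
    # 1) All slots read the table at once, straight from the register fields.
    early = [tuple(rat[s] for s in srcs) for _, srcs in group]
    pdests = [free_list.pop(0) for _ in group]

    # 2) Intra-group check: compare each source with every older slot's dest.
    renamed = []
    for i, (_, srcs) in enumerate(group):
        psrcs = list(early[i])
        for k, src in enumerate(srcs):
            for j in range(i):            # older slots; the youngest match wins
                if group[j][0] == src:
                    psrcs[k] = pdests[j]
        renamed.append((pdests[i], tuple(psrcs)))

    # 3) The table keeps the youngest mapping of each destination.
    for (dest, _), pdest in zip(group, pdests):
        rat[dest] = pdest
    return renamed


# Brett's example: A = A+1; A = A+B; A = A+C; A = A+D in one 4-wide group.
rat = {"A": "p1", "B": "p2", "C": "p3", "D": "p4"}
group = [("A", ("A",)), ("A", ("A", "B")), ("A", ("A", "C")), ("A", ("A", "D"))]
print(rename_group(group, rat, ["p10", "p11", "p12", "p13"]))
# -> [('p10', ('p1',)), ('p11', ('p10', 'p2')), ('p12', ('p11', 'p3')), ('p13', ('p12', 'p4'))]
```

    Every source of A after the first comes from step 2, not from the early table read, which is why the destination fields of older slots matter before rename can finish.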

  • From Terje Mathisen@21:1/5 to Chris M. Thomasson on Fri Sep 20 09:40:58 2024
    Chris M. Thomasson wrote:
    On 9/19/2024 12:15 PM, BGB wrote:
    On 9/19/2024 2:04 AM, Robert Finch wrote:
    On 2024-09-18 10:30 p.m., BGB wrote:
    On 9/18/2024 2:29 PM, Chris M. Thomasson wrote:
    On 9/18/2024 1:13 AM, David Brown wrote:
    On 17/09/2024 20:18, MitchAlsup1 wrote:
    On Tue, 17 Sep 2024 16:32:35 +0000, Bill Findlay wrote:

    On 17 Sep 2024, Stefan Monnier wrote
    (in article<[email protected]>):

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    I plead guilty to nerdy, but as for old, I'm still 27 (and that's been
    true for more than 20 years).

    Stefan

    Hi Stefan!
    At least equally nerdy, I should think, but 50 years older.
    (Older, not old!)

    At 71 real years old I still operate as if I were <let's say> 21.

    You are not 71, you are merely 0x47 :-)


    LOL! :^)

    Not going to say my exact age, but if I wrote my age in hex I could
    almost try to pass myself off as an early Zoomer (rather than as a
    millennial...).

    ...

    I think I am early GenX. 59 and still learning loads of stuff.
    Old enough to remember tube TVs and radios. Transistorized pocket
    radios were a big thing.


    In my case, my childhood was mostly in the era of Win 3.x and Win 9x
    PCs, and early dial-up internet (unlike most Zoomers, I remember a
    time before YouTube).

    [...]

    I remember way back wrt CompuServe. :^)


    BIX (Byte Information eXchange)!

    I believe my id/mail was terjem (@bix.com), but it could have been tma
    or terje.

    I had some wonderful discussions with Mike Abrash and other x86 asm
    programmers there.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Brett on Fri Sep 20 10:02:55 2024
    Brett wrote:
    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishy-washy on this


    This is deeply interesting. Can you expound on why it is fine for a
    register field to be shared by loads and stores, and sometimes used as
    both source and destination, as in x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    |  BC  | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    |  LD  | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    |  ST  | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds |    whatever    |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.

    Classic RISC says the loads are critical, but no one is one wide today,

    SiFive disagrees with you.

    so
    stores matter for deconfliction…. And does stuff just fall out right to
    allow both?

    Can you restate what you wanted to say using different words or perhaps
    give an example ??


    A series of adds to the same register in a four wide design.

    A = A + 1
    A = A + B
    A = A + C
    A = A + D


    That's a compiler issue, not a HW architecture problem imho:

    lea rega,[rega+regb+1]
    lea temp,[regc+regd]

    add rega,temp

    is 2 cycles, using two ports for the first cycle.

    If you have an add3 opcode, then you can do it in a single lane.

    Please note that I'm assuming either -fwrapv wrapping signed or just
    regular unsigned adds.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to All on Fri Sep 20 09:52:32 2024
    MitchAlsup1 wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishy-washy on this


    This is deeply interesting. Can you expound on why it is fine for a
    register field to be shared by loads and stores, and sometimes used as
    both source and destination, as in x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    |  BC  | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    |  LD  | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    |  ST  | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds | whatever       |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.
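    To make the fixed-field property concrete, here is a minimal Python sketch. The bit widths (6-bit major opcode, two 5-bit register fields, 16-bit immediate) are an assumption read off the column widths of the box diagram, not the actual My 66000 or Mc88100 bit assignment. It shows that every register field can be sliced out before the opcode is known, so the slices can go straight to register-file or renamer ports:

```python
# Hypothetical 32-bit layout, assumed from the box diagram:
#   bits 31..26 major | 25..21 field A | 20..16 field B | 15..0 immediate
# Field A holds Rd, cnd, or Rs0 (store data) depending on the opcode;
# field B holds Rs1 or Rb.

def encode(major, a, b, imm16):
    """Pack the four fields into one 32-bit instruction word."""
    return (major << 26) | (a << 21) | (b << 16) | (imm16 & 0xFFFF)

def fields(insn):
    """Slice the fixed fields; no opcode decode is needed to do this."""
    return {
        "major": (insn >> 26) & 0x3F,
        "a": (insn >> 21) & 0x1F,   # wired straight to the Rd/Rs0 port
        "b": (insn >> 16) & 0x1F,   # wired straight to the Rs1/Rb port
        "imm16": insn & 0xFFFF,
    }
```

    For example, `fields(encode(0x21, 3, 30, 0x40))` returns the four values it was built from; decode only decides later which of the slices were meaningful for that opcode.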

    I assume in your examples that you want to start your register file
    read access and/or rename register lookup access in the decode stage,
    and not wait to start at the end of the decode stage.
    Effectively pipelining those accesses.
    That's fine.

    But that's my point - it doesn't make a difference because in both
    cases you can wire the reg fields to the reg file or rename directly
    and start the access ASAP.
    In both cases the enable signal determining what to do shows up
    later after decode has done its thing. And the critical path for
    that decode enable signal is the same both ways.

    And if you are not doing this early access start but the traditional
    approach of latching the decode output THEN starting your RegRd or
    Rename access, it makes no timing difference at all.

    By allowing the opcode-Rds style instructions to be *CONSIDERED*
    it opens an avenue to potential instructions that cost little or
    nothing extra in terms of logic or performance.

    And this is particularly useful with fixed width 32-bit instructions
    where one is trying to pack as much function into a fixed size space as
    possible. Even more so with 16-bit compact instructions.

    For example, a 32-bit fixed format instruction with four 5-bit registers
    could do a full width integer multiply wide-accumulate

    IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2

    with little more logic than the existing MULL,MULH approach.
    It still only needs 2 read ports because Rs1,Rs2 are read first to start
    the multiply, then (Rsd_hi,Rsd_lo) second as they aren't needed until
    late in the multiply-accumulate.
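    The semantics of the proposed IMAC can be sketched in Python as below. The sketch assumes unsigned 64-bit registers and a 128-bit accumulator that wraps; the post does not say whether the operation is signed. The comments mark the read ordering described above: Rs1 and Rs2 feed the multiplier first, and the accumulator pair is only needed at the end.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def imac(rsd_hi, rsd_lo, rs1, rs2):
    """(Rsd_hi, Rsd_lo) = (Rsd_hi, Rsd_lo) + Rs1 * Rs2, unsigned, wrapping at 128 bits."""
    product = (rs1 & MASK64) * (rs2 & MASK64)             # first read pair starts the multiply
    acc = ((rsd_hi & MASK64) << 64) | (rsd_lo & MASK64)   # second read pair, needed late
    acc = (acc + product) & MASK128
    return acc >> 64, acc & MASK64                        # new (Rsd_hi, Rsd_lo)
```

    A carry out of the low half, as in `imac(0, 2**64 - 1, 2, 3)`, propagates into Rsd_hi and gives `(1, 5)`.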

  • From MitchAlsup1@21:1/5 to EricP on Fri Sep 20 17:39:34 2024
    On Fri, 20 Sep 2024 13:52:32 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishywashy on this


    This is deeply interesting, can you expound on why it is fine that a register
    field can be shared by loads and stores, and sometimes serve as both source
    and destination, like x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 | whatever       |
    +------+-----+-----+----------------+
    | BC   | cnd | Rs1 | label offset   |
    +------+-----+-----+----------------+
    | LD   | Rd  | Rb  | displacement   |
    +------+-----+-----+----------------+
    | ST   | Rs0 | Rb  | displacement   |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds | whatever       |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.

    I assume in your examples that you want to start your register file
    read access and/or rename register lookup access in the decode stage,
    and not wait to start at the end of the decode stage.
    Effectively pipelining those accesses.
    That's fine.

    But that's my point - it doesn't make a difference because in both
    cases you can wire the reg fields to the reg file or rename directly
    and start the access ASAP.

    Not when a source field and a destination field are the same
    field sometimes but not always. Your thought train adds a
    register specifier mux between the destination field and
    the overused source field in front of the destination
    rename port. It is not a BIG hindrance, but it is not
    insignificant if you are doing a "balls to the walls"
    design.
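    The extra mux can be shown in a few lines. In this Python sketch the opcode number, and the idea that one opcode reuses source field B as its destination, are hypothetical. The point is only that the destination specifier now depends on a decode result, which in hardware is a 2:1 mux in front of the destination rename port:

```python
# Hypothetical opcode that reuses source field B as its destination ("Rds" style).
RDS_STYLE_OPCODES = {0x3A}   # illustrative value, not taken from any real ISA

def dest_specifier(major, field_a, field_b):
    """Pick the destination register specifier for the rename port."""
    # With a fixed Rd position this would simply be `return field_a`;
    # the opcode-dependent choice below is the extra mux.
    return field_b if major in RDS_STYLE_OPCODES else field_a
```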

    In both cases the enable signal determining what to do shows up
    later after decode has done its thing. And the critical path for
    that decode enable signal is the same both ways.

    And if you are not doing this early access start but the traditional
    approach of latching the decode output THEN starting your RegRd or
    Rename access, it makes no timing difference at all.

    By allowing the opcode-Rds style instructions to be *CONSIDERED*
    it opens an avenue to potential instructions that cost little or
    nothing extra in terms of logic or performance.

    The actual calculations are easy, it is the routing of data
    to and from the calculation that is hard.

    And this is particularly useful with fixed width 32-bit instructions
    where one is trying to pack as much function into a fixed size space as
    possible. Even more so with 16-bit compact instructions.

    RISC-V, because of where the various fields ARE, has a mux between
    every source field and every register port--simply because their
    positions move between non-compressed and compressed.

    I agree with the position that if the mux is already there
    that one should use it often and greatly.

    Where I disagree is that the mux HAS to be there.
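    The RISC-V field movement can be checked against the published encodings. In 32-bit instructions rs1 is bits 19..15. In the compressed CR format the rd/rs1 field is bits 11..7, and in the CL format a 3-bit rs1' field at bits 9..7 names x8..x15. A minimal Python sketch follows. It assumes the decoder has already classified the format (passed in as `fmt`, a name chosen here for illustration), and that classification is itself part of the decode cost:

```python
def rs1_of(insn, fmt):
    """Return the rs1 register number for an already-classified instruction."""
    if fmt == "R32":      # 32-bit base encoding: rs1 in bits 19..15
        return (insn >> 15) & 0x1F
    if fmt == "CR":       # compressed CR format: rd/rs1 in bits 11..7
        return (insn >> 7) & 0x1F
    if fmt == "CL":       # compressed CL format: rs1' in bits 9..7, maps to x8..x15
        return 8 + ((insn >> 7) & 0x7)
    raise ValueError("format not handled in this sketch: " + fmt)
```

    The three test words are `add a0, a0, a1` (0x00B50533), `c.add a0, a1` (0x952E) and `c.lw a0, 0(a1)` (0x4188). Their rs1 comes from three different bit positions, so a mux sits in front of the rs1 read port.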

    For example, a 32-bit fixed format instruction with four 5-bit registers could do a full width integer multiply wide-accumulate

    IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2

    This violates the RISC tenet where each calculation instruction
    produces exactly 1 result. I get around this with the mechanical
    definition of the CARRY instruction. The MUL instruction produces
    its result, CARRY captures the other, and deposits it in RF when
    possible.

    with little more logic than the existing MULL,MULH approach.
    It still only needs 2 read ports because Rs1,Rs2 are read first to start
    the multiply, then (Rsd_hi,Rsd_lo) second as they aren't needed until
    late in the multiply-accumulate.

  • From MitchAlsup1@21:1/5 to BGB on Fri Sep 20 20:34:01 2024
    On Fri, 20 Sep 2024 2:09:35 +0000, BGB wrote:

    On 9/18/2024 1:42 PM, MitchAlsup1 wrote:

    One simple option would be to assume an instruction looks like:
    [Prefix Bytes]
    [REX byte]
    OP_Byte | 0F+OP_Byte
    Mod/RM + SIB + ...

    No, the simple option is that an instruction looks like:

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |     imm16      |
    +------+-----+-----+----------------+
    | mem  | Rd  | Rb  |     disp16     |
    +------+-----+-----+----------------+
    | Bcnd | cnd | Rs1 |     disp18     |
    +------+-----+-----+----------------+
    | 2OP  | Rd  | Rs1 |mods| 2op | Rs2 |
    +------+-----+-----+----------------+
    | 3OP  | Rd  | Rs1 | Rs3 | 3op| Rs2 |
    +------+-----+-----+----------------+

    And then use a heuristic to try to guess how to interpret the
    instruction stream based on "looks better" (more likely to be aligned
    with the instruction stream vs random unaligned garbage).

    Though, such a "looks good" heuristic could itself risk skewing the
    results.


    I may still consider defining an encoding for this, but not yet. It is
    in a similar boat as auto-increment. Both add resource cost with
    relatively little benefit in terms of overall performance.
    Auto-increment because if one has superscalar, the increment can usually
    be co-executed. And, full [Rb+Ri*Sc+Disp], because it is just too
    infrequent to really justify the extra cost of a 3-way adder even if
    limited mostly to the low-order bits...

    Myopathy--look it up.


    OK.

    Not sure how that is related (a medical condition involving muscle defects...).

    Myopathy is NEAR SIGHTEDNESS.

    You are not looking far enough into the future to avoid problems in your
    ISA and architecture. {I did the same in my youth. Almost everyone
    does.}


    Can also note that a worthwhile design goal is to not add significant
    cost over what would be needed for a plain RV64GC implementation, but,
    could define a [Rb+Ri*Sc+Disp] encoding or similar if it would likely be beneficial enough to justify its existence.

    486 showed that "[Rbase+Rindex<<scale+displacement]:segment" could all
    be performed in a single cycle at a frequency competitive with the RISC processors available at the time.
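    The effective-address computation in question is a single three-input add. A minimal Python sketch, assuming 64-bit wrapping addresses and x86-style scale factors of 1, 2, 4 or 8, with the segment base left out. In hardware the three terms would typically pass through one carry-save (3:2) stage ahead of a single carry-propagate adder:

```python
MASK64 = (1 << 64) - 1

def agen(rbase, rindex, scale, disp):
    """Effective address = Rbase + (Rindex << scale) + disp, wrapping at 64 bits."""
    if scale not in (0, 1, 2, 3):        # shift amounts for scale factors 1, 2, 4, 8
        raise ValueError("scale must be 0..3")
    return (rbase + (rindex << scale) + disp) & MASK64
```

    For example, `agen(0x1000, 5, 3, -8)` gives 0x1000 + 40 - 8 = 0x1020.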

  • From Niklas Holsti@21:1/5 to All on Sat Sep 21 10:45:47 2024
    On 2024-09-20 23:34, MitchAlsup1 wrote:

    Myopathy is NEAR SIGHTEDNESS.


    Perhaps you meant "myopia", https://en.wikipedia.org/wiki/Myopia.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Sep 22 22:19:15 2024
    On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:

    On 9/19/24 11:07 AM, EricP wrote:
    [snip]
    If the multiplier is pipelined with a latency of 5 and throughput of 1,
    then MULL takes 5 cycles and MULL,MULH takes 6.

    But those two multiplies still are tossing away 50% of their work.

    I do not remember how multipliers are actually implemented — and
    am not motivated to refresh my memory at the moment — but I
    thought a multiply low would not need to generate the upper bits,
    so I do not understand where your "50% of their work" is coming
    from.

        +-----------+   +------------+
        \  mplier  /     \   mcand  /        Big input mux
         +--------+       +--------+
              |                |
              |      +--------------+
              |     /               /
              |    /               /
              +-- /               /
                 /     Tree      /
                /               /--+
               /               /   |
              /               /    |
             +---------------+-----------+
                   hi             low        Products

    two n-bit operands are multiplied into a 2×n-bit result.
    {{All the rest is HOW not what}}

    The high result needs the low result carry-out but not the rest of
    the result. (An approximate multiply high for multiply by
    reciprocal might be useful, avoiding the low result work. There
    might also be ways that a multiplier could be configured to also
    provide bit mixing similar to middle result for generating a
    hash?)
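    The point about the carry-out can be illustrated in Python. Summing only the partial-product bits that land in the upper n columns, and discarding the lower columns with their carries, generally gives the wrong high half. This is a sketch of the arithmetic only, not of any particular tree; n = 8 keeps the numbers readable:

```python
def exact_product_halves(a, b, n):
    """Full 2n-bit product of two n-bit values, split into (hi, lo)."""
    p = a * b
    return p >> n, p & ((1 << n) - 1)

def hi_without_low_columns(a, b, n):
    """High half estimated from partial products alone, with each partial
    product's bits below column n (and so their carries) thrown away."""
    total = 0
    for i in range(n):
        if (b >> i) & 1:                 # partial product for multiplier bit i
            total += (a << i) >> n       # keep only its columns >= n
    return total & ((1 << n) - 1)
```

    For 0xFF x 0xFF the exact high byte is 0xFE, but the truncated sum gives 0xF7. A full-precision MULH therefore needs the low columns' carries even though it does not need the low result itself.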

    I seem to recall a PowerPC implementation did semi-pipelined 32-
    bit multiplication 16-bits at a time. This presumably saved area
    and power

    You save 1/2 of the tree area, but ultimately consume more power.

    while also facilitating early out for small
    multiplicands,

    Dadda showed that doubling the size of the tree only adds one
    4-2 compressor delay to the whole calculation.
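    Under an idealized model the effect attributed to Dadda can be tallied in Python. The model assumes each level of 4:2 compressors halves the number of partial-product rows and that the final two rows go to a carry-propagate adder. It ignores Booth recoding, the 3:2 counters used for odd row counts, and wire delay, so treat it as a sketch:

```python
def reduction_levels(rows):
    """Levels of 4:2 compressors to reduce `rows` partial products to 2,
    assuming each level halves the row count (rounding up)."""
    levels = 0
    while rows > 2:
        rows = (rows + 1) // 2
        levels += 1
    return levels
```

    In this model 32 rows need 4 levels and 64 rows need 5, so doubling the operand width adds one compressor delay.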

    at the cost of some latency and substantial
    throughput compared to a fully pipelined multiplication.

    Throughput that the rest of the engine could not use.

    If I remember correctly, this produced a result for 16-bit by 32-bit
    multiplication, which is different from generating a low or high
    result.
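    A 32 x 32 multiply done 16 bits of the multiplier at a time, with an early out, can be sketched in Python as below. It assumes, as recalled above, that only one operand is checked (here `b`). The step count stands in for passes through a 32 x 16 array and is not a model of any specific PowerPC implementation:

```python
MASK32 = (1 << 32) - 1

def mul32_by_halves(a, b):
    """32 x 32 -> 64-bit product using a 32 x 16 step, with early out.
    Returns (product, steps)."""
    a &= MASK32
    b &= MASK32
    product = a * (b & 0xFFFF)                # step 1: low 16 bits of b
    steps = 1
    if b >> 16:                               # early out if b's high half is zero
        product += (a * (b >> 16)) << 16      # step 2: high 16 bits of b
        steps = 2
    return product, steps
```

    A small multiplier such as `mul32_by_halves(123456, 1000)` finishes in one step, while full-width operands take two.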

    And if it does fuse them then the internal uArch cost is the same as if
    you had designed it optimally from the start, except now you have
    to pay for a fuser.

    <sound of soap box being dragged out>
    This idea that macro-op fusion is some magic solution is bullshit.

    The argument is, at best, of Academic Quality, made by a student
    at the time as a way to justify RISC-V not having certain easy
    for HW to perform calculations.

    1) It's not free.

    Neither is increasing the number of opcodes or providing extender
    prefixes. If one wants binary compatibility, non-fusing
    implementations would work.

    I did neither and avoided both.

    (I tend to favor providing a translation layer between software
    distribution format and instruction cache format, which reduces
    the binary compatibility constraint.)

    2) It only works where Decode can see *all* the required lookahead
       instructions, which means you have to pay for an N-lane decoder
       but only get 1 lane.

    Most fusion is for two adjacent instructions, which significantly
    limits the complexity.

    To quadratic {BigO( instruction-OpCode-bits ** 2)}

    The fusable patterns are also a subset of
    all pairs of two instructions, so complete two-way decoding may
    not be needed.

    There may also be optimization opportunities from looking ahead.
    Mitch Alsup proposed such for branch handling in a scalar
    implementation.

    I use this, to be clear, as a means to eliminate any need of the
    branch delay slot in smaller narrow machines.

    Apart from fusion, there might be advantages for
    avoiding bank conflicts in a banked register file. I.e., the cost
    of lookahead might be shared by multiple techniques/optimizations.

    I tend to agree that fusion tends to be a workaround for sub-
    optimal instruction encoding, but it seems that encoding involves
    a lot of tradeoffs.

    3) It's probabilistic as it depends on how the fetch buffers get loaded.
       E.g., if the fetch buffer contains a valid instruction but does not have
       a next instruction, do you stall Decode to see if a fuser might arrive,
       or dispatch it anyway?

    This is also somewhat true for variable length encodings that
    cross fetch boundaries.

    In My 1-wide machine, the only time this comes up is when a
    long instruction crosses into a new cache line (or page) and
    the cache (or TLB) takes a miss.

    In general a boundary-crossing instruction
    would probably stall even if such was not strictly necessary
    (e.g., if the missing information is opcode refinement, which is not
    related to instruction routing, or an immediate or even a
    register source identifier specifying a value that can have
    delayed use, such as the value of a store or the addend of an FMADD).

    In my case, immediate data for a ST is not needed until the ST
    has retired, so a) it is placed last, and b) a delay as long as
    the pipeline depth can be tolerated.

    This does seem a weakness, but fusion does not involve only negative
    factors.
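    Points 2 and 3 can be made concrete with a small Python sketch of adjacent-pair fusion inside one decode window. The instruction tuples and the single LUI + ADDI pattern are assumptions for illustration (that pair is one of the fusion candidates usually mentioned for RISC-V); a real fuser checks many patterns and more operand conditions:

```python
# Hypothetical instruction tuples: (mnemonic, rd, rs1, imm)

def is_fusable_pair(first, second):
    """Example pattern: lui rd, hi ; addi rd, rd, lo."""
    return (first[0] == "lui" and second[0] == "addi"
            and second[1] == first[1] and second[2] == first[1])

def decode_window(window):
    """Greedy fusion of adjacent pairs within one decode window only.
    A pair split across two windows is never seen together, so it is not fused."""
    out, i = [], 0
    while i < len(window):
        if i + 1 < len(window) and is_fusable_pair(window[i], window[i + 1]):
            out.append(("fused", window[i], window[i + 1]))
            i += 2
        else:
            out.append(window[i])
            i += 1
    return out
```

    When the two halves arrive in different windows, `decode_window([lui]) + decode_window([addi])` returns them unfused, which is the probabilistic behaviour described in point 3.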

    4) It gets exponentially expensive if you start doing multiple instruction
       lanes because decode has to deal with all the permutations of
       fusion possibilities.

    Fusion in an already variable length RISC ISA is already exponential.

    This is also a factor in mere superscalar decode/execute.
    Detecting that an instruction is dependent on another would
    normally stall the execution of that instruction.

    (I feel that encoding some of the dependency information could
    be useful to avoid some of this work. In theory, common
    dependency detection could also be more broadly useful; e.g.,
    operand availability detection and execution/operand routing.)

    So useful that it is encoded directly in My 66000 ISA.

    5) Any fused instructions leave (multiple) bubbles that should be
       compacted out or there wasn't much point to doing the fusion.

    Even with reduced operations per cycle, fusion could still provide
    a net energy benefit.

    Here I disagree:: but for a different reason::

    In order for RISC-V to use a 64-bit constant as an operand, it has
    to execute either:: AUIPC-LD to an area of memory containing the
    64-bit constant, or a 6-7 instruction stream to build the constant
    inline. While an ISA that directly supports 64-bit constants in ISA
    does not execute any of those.

    Thus, while fusion may save power when seen at the "it's my ISA" level,
    when seen from the perspective of "it is directly supported in my ISA"
    it wastes power.

    There is NO less power expensive way to deliver a constant into
    execution as from the instruction stream directly to the function
    unit performing the calculation.
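    To make the constant-building cost concrete, here is a naive Python sketch of the usual RV64 pattern for building a constant in a single register. A value that fits in 32 signed bits gets LUI plus ADDIW; larger values get an SLLI/ADDI pair for each further chunk of low bits. It is a simplification under stated assumptions (one destination register, no constant pool, none of the extra tricks production compilers use), so its instruction counts are illustrative rather than what any particular compiler emits. `run` is a tiny interpreter included only to check that each sequence rebuilds the constant:

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def materialize(v):
    """Naive RV64 constant build into one register, as a list of (op, imm)."""
    v = sext(v, 64)
    if -2048 <= v < 2048:
        return [("addi", v)]                      # addi rd, x0, v
    lo = sext(v, 12)                              # low 12 bits as ADDI sees them
    if -(1 << 31) <= v < (1 << 31):
        seq = [("lui", ((v - lo) >> 12) & 0xFFFFF)]
        return seq + ([("addiw", lo)] if lo else [])
    hi, shift = (v - lo) >> 12, 12
    while hi & 1 == 0:                            # fold trailing zeros into the shift
        hi, shift = hi >> 1, shift + 1
    return materialize(hi) + [("slli", shift)] + ([("addi", lo)] if lo else [])

def run(seq):
    """Interpret the four opcodes above to check a sequence."""
    r = 0
    for op, imm in seq:
        if op == "addi":
            r = (r + imm) & MASK64
        elif op == "lui":
            r = sext(imm << 12, 32) & MASK64
        elif op == "addiw":
            r = sext(r + imm, 32) & MASK64
        elif op == "slli":
            r = (r << imm) & MASK64
    return sext(r, 64)
```

    For 0x123456789ABCDEF0 this version produces eight instructions (LUI, ADDIW and three SLLI/ADDI pairs). The AUIPC + LD alternative trades those for a load from a constant pool.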

    In my opinion it is better to have an ISA that is optimal by design
    rather than being patched up by fusion later.

    Fusion is mostly presented for "patching up", but there are also considerations of diverse microarchitectures. With pre-fused
    instructions, an implementation might need to crack some of those instructions. Software optimized for such an implementation might
    also prefer more flexible compile-time scheduling of pre-cracked
    operations.

    Agreed:: there is a cost of implementing a means by which large
    constants can be used in the instruction. I argue that this is
    a) only apparent in the smallest implementations, and b) smaller
    than the cost in cycles and power that fusion requires.

    A load-op instruction is perhaps particularly difficult because
    one needs frequent stalls, a skewed (or second chance) pipeline to
    hide the load latency, out-of-order execution, or some other stall
    avoidance mechanism.

    There are also constraints in encoding granularity.

    Some of this inefficiency is caused by clinging to now-40-year-old
    RISC design *guidelines* (i.e., not even rules) that:
    - instructions have at most 1 dest and 2 source registers

    FMADD seems to have mostly killed the 2-source limit. AArch64's
    paired load removes the 2 destination limit. (Paired destinations
    were common for early double precision implementations.)

    FMAD also provides the operand bussing to support the::
    mem rd,[Rbase+Rindex<<scale+disp]
    addressing mode.

    But this was already possible since "disp" always comes from the
    instruction, and only goes to the AGEN unit.

    FMAD just got rid of all the other excuses not to do the right
    thing.

    - register specifier fields are either source or dest, never both

    This seems mostly a code density consideration. I think using a
    single name for both a source and a destination is not so
    horrible, but I am not a hardware guy.

    All we HW guys want is that wherever the field is specified,
    it is specified in exactly 1 field in the instruction. So, if
    field<a..b> is used to specify Rd in one instruction, there is
    no other field<!a..!b> that specifies the Rd register. RISC-V blew
    this "requirement".

    - instructions should take at most 1 clock (they never did)

    That was clearly overconstraining.

    These self imposed design restrictions cause ISA designers to miss
    some possible more optimal solutions. The result is things like
    RISC-V's memory reference linkage structures taking 6 instructions
    to build a 64-bit PC-relative address. And I'm pretty sure we won't
    see any 6 instruction fusers for quite some time.

    I very much doubt a compiler would generate such outside of some
    real-time application where the time constancy might justify the
    code bloat.

    <sound of soap box being dragged back to cupboard>

    I do not mean my response to be heckling. Your points are very
    true. However, I think fusion is a technique — like cracking —
    that is a natural part of an architect's toolbox.

  • From Stephen Fuld@21:1/5 to David Brown on Mon Sep 23 11:45:16 2024
    On 9/16/2024 4:12 AM, David Brown wrote:

    big snip

    With all respect to the regulars here, most people in technical Usenet
    groups are either old, unusually nerdy, or both.

    Of course, that is true, but it raises some questions.

    Are there fewer younger people interested in computer architecture? I
    guess this is possible, since the number of new architectures seems to
    be declining, thus interest might be too.

    Are the younger people discussing computer architecture in the way we
    do, but are doing it in other places? If so, where? I know that web
    based forums are more "user friendly" than Usenet, but does that explain
    the difference? Do wherever they are going provide the same quality of discussion that comp.arch does?



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to BGB-Alt on Tue Sep 24 00:12:43 2024
    On Mon, 23 Sep 2024 23:16:08 +0000, BGB-Alt wrote:

    On 9/22/2024 3:43 PM, Paul A. Clayton wrote:
    On 9/19/24 11:07 AM, EricP wrote:

    I tend to agree that fusion tends to be a workaround for sub-
    optimal instruction encoding, but it seems that encoding involves
    a lot of tradeoffs.


    Yeah...

    However, the cost of doing fusion is higher than having longer-form variable-length instructions via prefixes...

    If one wants a cheapish way to do prefixes on a 1-wide machine, they
    could transpose the instruction words during fetch, and then only need a single decoder.

    So:
    WordA
    PrefixA WordB
    PrefixA PrefixB WordC

    Is presented to the decoder as:
    WordA
    WordB PrefixA
    WordC PrefixB PrefixA

    So, the decoder doesn't move...

    Exactly my reasoning wrt constants

    INSTA
    INSTB DISP32 DISP64 SDATA32 SDATA64
    INSTC SDATA32
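    The fetch-time transposition BGB describes can be sketched in Python as below. How a prefix word is recognized is an assumption here (`is_prefix` is a caller-supplied predicate), and real hardware would do this with muxes on fixed-size fetch slots rather than lists. The output matches the ordering in his example:

```python
def transpose_for_decode(words, is_prefix):
    """Reorder a fetched word stream so each base word is presented first,
    followed by its prefixes nearest-first (reverse of memory order).
    Returns (groups, pending); `pending` holds trailing prefixes whose base
    word has not been fetched yet."""
    groups, pending = [], []
    for word in words:
        if is_prefix(word):
            pending.append(word)              # hold prefixes until their base word arrives
        else:
            groups.append([word] + pending[::-1])
            pending = []
    return groups, pending
```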

    Possibly, a similar trick could be used for 2-wide with limited variable-length, but would get more complicated.
    <snip>

    FMADD seems to have mostly killed the 2-source limit. AArch64's
    paired load removes the 2 destination limit. (Paired destinations
    were common for early double precision implementations.)


    IMHO:
    RISC-V not having register-index load/store, while having things like
    FMADD, is kinda stupid. Having advanced features while taking a big hit
    on the lack of cheap features is not ideal.


    I had recently been working on getting BGBCC to target RISC-V (generated
    code still not fully working, but the compiler is now able to do the
    compiler thing at least).

    However, with all of the limits that RISC-V imposes, BGBCC is currently generating output that is around 43% bigger in RISC-V mode than BJX2-XG2
    mode (or around 56% bigger than baseline mode).

    My 66000 tends to use only 72% of the instructions needed by RISC-V
    1/0.72 = 39% more instructions for RISC-V. Almost the same number.

    This is kinda terrible...

    "kinda" is unwarranted in that statement.
    <snip>
    So, say, 6 instructions for a 64-bit constant load, or around 4
    instructions to load/store a global variable (relative to GP), 4
    instructions whenever the 12-bit displacement fails, ...

    0 instructions in My 66000. Constants are simply operands fed from
    the instruction stream.


  • From Thomas Koenig@21:1/5 to BGB-Alt on Tue Sep 24 05:23:08 2024
    BGB-Alt <[email protected]> schrieb:
    On 9/22/2024 3:43 PM, Paul A. Clayton wrote:
    On 9/19/24 11:07 AM, EricP wrote:
    [snip]
    If the multiplier is pipelined with a latency of 5 and throughput of 1,
    then MULL takes 5 cycles and MULL,MULH takes 6.

    But those two multiplies still are tossing away 50% of their work.

    I do not remember how multipliers are actually implemented — and
    am not motivated to refresh my memory at the moment — but I
    thought a multiply low would not need to generate the upper bits,
    so I do not understand where your "50% of their work" is coming
    from.

    The high result needs the low result carry-out but not the rest of
    the result. (An approximate multiply high for multiply by
    reciprocal might be useful, avoiding the low result work. There
    might also be ways that a multiplier could be configured to also
    provide bit mixing similar to middle result for generating a
    hash?)


    I guess it might be interesting if one made a bigger multiplier out of
    4-bit multipliers, in a way similar to a 4-bit shift-add.

    If you look through the old TTL handbooks by TI, you will find how
    people did multipliers in the bit-slice age. They had 4 bit *
    4 bit->8 bit multipliers (74274) or Booth recoding with a 74261
    and then summed up the partial products using the 74275.
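    The bit-slice approach can be mimicked in Python. A 4 x 4 -> 8-bit multiplier block, in the role given to the 74274 above, is replicated sixteen times, and the shifted block outputs are summed. The plain Python addition stands in for the adder tree (the 74275 in the description above); no Booth recoding is modelled:

```python
def mul4x4(a, b):
    """Stand-in for a 4-bit x 4-bit -> 8-bit multiplier block."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must be 4-bit")
    return a * b

def mul16x16(a, b):
    """16 x 16 -> 32-bit product built from sixteen 4 x 4 blocks.
    Block (i, j) multiplies nibble i of a by nibble j of b; its 8-bit
    output is weighted by 2**(4*(i+j)) before summation."""
    total = 0
    for i in range(4):
        for j in range(4):
            nib_a = (a >> (4 * i)) & 0xF
            nib_b = (b >> (4 * j)) & 0xF
            total += mul4x4(nib_a, nib_b) << (4 * (i + j))
    return total
```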

  • From EricP@21:1/5 to Paul A. Clayton on Fri Sep 27 13:31:13 2024
    Paul A. Clayton wrote:
    On 9/22/24 6:19 PM, MitchAlsup1 wrote:
    On 9/19/24 11:07 AM, EricP wrote:
    <sound of soap box being dragged out>
    This idea that macro-op fusion is some magic solution is bullshit.

    The argument is, at best, of Academic Quality, made by a student
    at the time as a way to justify RISC-V not having certain easy
    for HW to perform calculations.

    The RISC-V published argument for fusion is not great, but fusion
    (and cracking/fission) seem natural architectural mechanisms *if*
    one is stuck with binary compatibility.

    As far as I know there are only 3 published articles on RV fusion.

    The Renewed Case for the Reduced Instruction Set Computer
    Avoiding ISA Bloat with Macro-Op Fusion for RISC-V, 2016 http://people.eecs.berkeley.edu/~krste/papers/EECS-2016-130.pdf

    is an academic paper that proposes some fusion and compares compiler
    outputs but does not consider hardware cost.

    Exploring Instruction Fusion Opportunities in
    General Purpose Processors, 2022 https://webs.um.es/aros/papers/pdfs/ssingh-micro22.pdf

    looks at a much more difficult fusion:
    "In this paper, we propose and study techniques to increase the number of
    fused memory instructions, notably nonconsecutive and non-contiguous fusion. Non-ConSecutive Fusion (NCSF) is the operation of fusing two (or more) μ-ops that are not consecutive in the dynamic execution stream of the program. Non-ConTiguous Fusion (NCTF) is the operation of fusing two (or more)
    memory μ-ops that access non-contiguous memory bytes."

    There is a very recent paper that I have not read as it is paywalled.

    [paywalled]
    Evaluating and Enhancing Performance through Macro-Op Fusion Optimization
    with RISC-V, 2024
    https://dl.acm.org/doi/abs/10.1145/3677333.3678150

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Fri Sep 27 18:01:40 2024
    On Wed, 25 Sep 2024 2:49:07 +0000, Paul A. Clayton wrote:

    On 9/22/24 6:19 PM, MitchAlsup1 wrote:
    On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:

    On 9/19/24 11:07 AM, EricP wrote:
    [snip]
    If the multiplier is pipelined with a latency of 5 and throughput of 1,
    then MULL takes 5 cycles and MULL,MULH takes 6.

    But those two multiplies still are tossing away 50% of their work.

    I do not remember how multipliers are actually implemented — and
    am not motivated to refresh my memory at the moment — but I
    thought a multiply low would not need to generate the upper bits,
    so I do not understand where your "50% of their work" is coming
    from.

        +-----------+   +------------+
        \  mplier  /     \   mcand  /        Big input mux
         +--------+       +--------+
              |                |
              |      +--------------+
              |     /               /
              |    /               /
              +-- /               /
                 /     Tree      /
                /               /--+
               /               /   |
              /               /    |
             +---------------+-----------+
                   hi             low        Products

    two n-bit operands are multiplied into a 2×n-bit result.
    {{All the rest is HOW not what}}

    So are you saying the high bits come for free? This seems
    contrary to the conception of sums of partial products, where
    some of the partial products are only needed for the upper bits
    and so could (it seems to me) be uncalculated if one only wanted
    the lower bits.

    The high order bits are free WRT gates of delay, but consume as much
    area as the lower order bits. I was answering the question of
    "I do not remember how multipliers are actually implemented".

    The high result needs the low result carry-out but not the rest of
    the result. (An approximate multiply high for multiply by
    reciprocal might be useful, avoiding the low result work. There
    might also be ways that a multiplier could be configured to also
    provide bit mixing similar to middle result for generating a
    hash?)

    I seem to recall a PowerPC implementation did semi-pipelined 32-
    bit multiplication 16-bits at a time. This presumably saved area
    and power

    You save 1/2 of the tree area, but ultimately consume more power.

    The power consumption would seem to depend on how frequently both
    multiplier and multiplicand are larger than 16 bits. (However, I
    seem to recall that the mentioned implementation only checked one
    operand.) I suspect that for a lot of code, small values are
    common.

    It is 100% of the time in FP codes, and generally unknowable in
    integer codes.
    <snip>

    My 66000's CARRY and PRED are "extender prefixes", admittedly
    included in the original architecture so compensating for encoding constraints (e.g., not having 36-bit instruction parcels) rather
    than microarchitectural or architectural variation.

    Since they cast extra bits over a number of instructions, and
    while they precede the instructions they modify, they are not
    classical prefixes--so I use the term Instruction-modifier instead.

    [snip]
    (I feel that encoding some of the dependency information could
    be useful to avoid some of this work. In theory, common
    dependency detection could also be more broadly useful; e.g.,
    operand availability detection and execution/operand routing.)

    So useful that it is encoded directly in My 66000 ISA.

    How so? My 66000 does not provide any explicit declaration what
    operation will be using a result (or where an operand is being
    sourced from). Register names express the dependencies so the
    dataflow graph is implicit.

    I was talking about how operand routing is explicitly described
    in ISA--which is mainly about how constants override register
    file reads by the time operands get to the calculation unit.

    I was speculating that _knowing_ when an operand will be available
    and where a result should be sent (rather than broadcasting) could
    be useful information.

    It is easier to record which FU will deliver a result; the when
    part is simply a pipeline sequencer from the end of a FU to the
    entries in the reservation station.


    Even with reduced operations per cycle, fusion could still provide
    a net energy benefit.

    Here I disagree:: but for a different reason::

    In order for RISC-V to use a 64-bit constant as an operand, it has
    to execute either:: AUIPC-LD to an area of memory containing the
    64-bit constant, or a 6-7 instruction stream to build the constant
    inline. While an ISA that directly supports 64-bit constants in ISA
    does not execute any of those.

    Thus, while fusion may save power when seen at the "it's my ISA"
    level, when seen from the perspective of "it is directly supported
    in my ISA" it wastes power.
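    To make the cost of "computing" constants concrete: even a 32-bit
    constant takes two RISC-V instructions, LUI for the upper 20 bits and
    ADDI for the low 12. Because ADDI sign-extends its immediate, the
    upper part must be rounded up whenever bit 11 is set. The sketch below
    computes that split; the type and function names are illustrative, not
    from any toolchain. A full 64-bit constant combines this pattern with
    shifts and further ADDIs, which is roughly where the longer sequences
    come from.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names: the two immediates of the RV32 sequence
       lui  rd, hi20
       addi rd, rd, lo12                                           */
typedef struct {
    uint32_t hi20;  /* LUI immediate, 0..0xFFFFF */
    int32_t  lo12;  /* ADDI immediate, -2048..2047 */
} LuiAddi;

/* What the two instructions leave in rd on RV32 (arithmetic mod 2^32). */
static uint32_t lui_addi_result(LuiAddi r) {
    return (r.hi20 << 12) + (uint32_t)r.lo12;
}

/* ADDI sign-extends its 12-bit immediate, so when bit 11 of the value is
   set, lo12 is negative and hi20 must be one larger to compensate. */
static LuiAddi split_lui_addi(uint32_t value) {
    LuiAddi r;
    r.lo12 = (int32_t)((value & 0xFFFu) ^ 0x800u) - 0x800;  /* sign-extend low 12 bits */
    r.hi20 = ((value - (uint32_t)r.lo12) >> 12) & 0xFFFFFu;  /* LUI supplies the rest */
    assert(lui_addi_result(r) == value);
    return r;
}
```

    For example, 0x12345FFF splits into hi20 = 0x12346 and lo12 = -1.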

    Yes, but "computing" large immediates is obviously less efficient
    (except for compression), since the computation part is known to be
    unnecessary. Fusing a comparison and a branch may be a consequence
    of bad ISA design in not properly estimating how much work an
    instruction can do (and be encoded in available space), and there
    is excess decode overhead with separate instructions, but the
    individual operations seem to be doing actual work.

    I suspect there can be cases where different microarchitectures
    would benefit from different amounts of instruction/operation
    complexity such that cracking and/or fusion may be useful even in
    an optimally designed generic ISA.

    [snip]
    - register specifier fields are either source or dest, never both

    This seems mostly a code density consideration. I think using a
    single name for both a source and a destination is not so
    horrible, but I am not a hardware guy.

    All we HW guys want is that wherever a register is specified, it
    is specified in exactly one field position in the instruction. So,
    if field<a..b> is used to specify Rd in one instruction, no other
    field<!a..!b> specifies the Rd register in any instruction. RISC-V
    blew this "requirement".
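    The field-placement point can be made concrete by looking at where a
    decoder finds the destination register. The field positions below
    follow the RISC-V unprivileged spec; the function names are
    hypothetical. In the 32-bit base formats rd, rs1 and rs2 always sit in
    the same bits, while in the 16-bit C formats the destination moves
    between three different fields depending on the format.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit base formats: rd, rs1 and rs2 are always in the same bits, so
   they can be extracted before the opcode is decoded. */
static unsigned rv_rd(uint32_t insn)  { return (insn >> 7)  & 0x1Fu; }
static unsigned rv_rs1(uint32_t insn) { return (insn >> 15) & 0x1Fu; }
static unsigned rv_rs2(uint32_t insn) { return (insn >> 20) & 0x1Fu; }

/* 16-bit C extension: where the destination lives depends on the format.
   CR/CI:  rd in bits 11:7        (any of x0..x31)
   CIW/CL: rd' in bits 4:2        (3-bit name for x8..x15)
   CA:     rd'/rs1' in bits 9:7   (3-bit name for x8..x15)             */
static unsigned rvc_rd_11_7(uint16_t insn) { return (insn >> 7) & 0x1Fu; }
static unsigned rvc_rd_4_2(uint16_t insn)  { return 8u + ((insn >> 2) & 0x7u); }
static unsigned rvc_rd_9_7(uint16_t insn)  { return 8u + ((insn >> 7) & 0x7u); }
```

    For instance, `add x1, x2, x3` encodes as 0x003100B3, and the three
    base-format extractors return 1, 2 and 3 without knowing the opcode;
    a C-extension decoder must first know the format to pick the right
    extractor.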

    Only with the Compressed extension, I think. The Compressed
    extension was somewhat rushed and, in my opinion, philosophically
    flawed by being redundant (i.e., every C instruction can be
    expanded to a non-C instruction). Things like My 66000's ENTER
    provide code density benefits but are contrary to the simplicity
    emphasis. Perhaps a Rho (density) extension would have been
    better.☺ (The extension letter idea was interesting for an
    academic ISA but has been clearly shown to be seriously flawed.)

    The R in RISC-V does not represent REDUCED.

    16-bit instructions could have kept the same register field
    placements with masking/truncation for two-register-field
    instructions.

    The whole layout of the ISA is sloppy...

    Even a non-destructive form might be provided by
    different masking or bit inversion for the destination. However,
    providing three register fields seems to require significant
    irregularity in extracting register names. (Another technique
    would be using opcode bits for specifying part or all of a
    register name. Some special purpose registers or groups of
    registers may not be horrible for compiler register allocation,
    but such seems rather funky/clunky.)

    It is interesting that RISC-V chose to split the immediate field
    for store instructions so that source register names would be in
    the same place for all (non-C) instructions.

    Lipstick on a pig.

    Comparing an ISA design to RISC-V is not exactly the same as
    comparing to "best in class".

    I don't even know if My 66000 can or should be termed RISC, since
    it is a bit closer to VAX, though it did not go so far as to allow
    all operands to be constants (just one). The memory unit has a
    sequencer to perform ENTER, EXIT, LDM, STM, MM, MS; the FPU has a
    sequencer to do FDIV, SQRT, Log-family, exp-family, sin-family,
    arc-family and pow; the flow-control unit has a sequencer to do PIC
    switch-case; all while allowing other FUs to process instructions
    while those sequencers run.

    I postulate that My 66000 ISA is RISC because it actually IS a
    Reduced instruction set computer--currently standing at 64
    instructions including SIMD and vectors.

  • From Michael S@21:1/5 to BGB on Sun Oct 13 11:30:52 2024
    On Thu, 5 Sep 2024 20:08:23 -0500
    BGB <[email protected]> wrote:

    On 9/3/2024 3:40 AM, Michael S wrote:
    On Tue, 3 Sep 2024 05:55:14 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Tim Rentsch <[email protected]> wrote:

    My suggestion is not to implement a language extension, but to
    implement a compiler conforming to C as it is now,

    Sure, that was also what I was suggesting - define things that
    are currently undefined behavior.

    with
    additional guarantees for what happens in cases that are
    undefined behavior.

    Guarantees or specifications - no difference there.

    Moreover the additional guarantees are
    always in effect unless explicitly and specifically requested
    otherwise (most likely by means of a #pragma or _Pragma).
    Documentation needs to be written for the #pragmas, but no other
    documentation is required (it might be nice to describe the
    additional guarantees but that is not required by the C
    standard).

    It's the other way around - you need to describe first what the
    actual behavior in absence of any pragmas is, and this needs to be
    a firm specification, so the programmer doesn't need to read your
    mind (or the source code to the compiler) to find out what you
    meant. "But it is clear that..." would not be a specification;
    what is clear to you may absolutely not be clear to anybody else.

    This is also the only chance you'll have of getting this
    implemented in one of the current compilers (and let's face it, if
    you want high-quality code, you would need that; both LLVM and GCC
    have taken an enormous amount of effort up to now, and duplicating
    that is probably not going to happen).

    The point is to change the behavior of the compiler but
    still conform to the existing ISO C standard.

    I understood that - defining things that are currently undefined.
    But without a specification, that falls down.

    So, let's try something that causes some grief - what should
    be the default behavior (in the absence of pragmas) for integer
    overflow? More specifically, can the compiler set the condition
    to false in

    int a;

    ...

    if (a > a + 1) {
    }

    and how would you specify this in an unambiguous manner?

    I'd start much earlier, by declaration of "Homogeneity and
    Exclusion". It would state that "more defined C" does not pretend
    to cover all targets covered by existing C language.
    Specifically, following target characteristics are required:
    - byte-addressable machine with 8-bit bytes
    - two's-complement integer types
    - if float type is supported it has to be IEEE-754 binary32
    - if double type is supported it has to be IEEE-754 binary64
    - if long double type is supported it has to be IEEE-754 binary128
    - storage order for multibyte types should be either LE or BE,
    consistently for all built-in types
    - flat address space
    That part should be specified in a more formal manner.
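    Part of such a target profile can already be approximated mechanically.
    The sketch below uses C11 `_Static_assert` for the byte, integer and
    floating-point requirements and a run-time probe for "consistent byte
    order across built-in types". It is an assumption-laden approximation,
    not a formal specification: matching sizes and mantissa/exponent
    ranges is necessary but not sufficient for IEEE-754 formats, and
    `long double` is left unchecked because on x86-64 it is usually the
    80-bit x87 format, which the binary128 rule would exclude. Helper names
    are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

_Static_assert(CHAR_BIT == 8, "8-bit bytes required");
_Static_assert((-1 & 3) == 3, "two's-complement integers required");
_Static_assert(sizeof(float) == 4 && FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128,
               "float must look like IEEE-754 binary32");
_Static_assert(sizeof(double) == 8 && DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024,
               "double must look like IEEE-754 binary64");

/* True if the n bytes at b hold the values 1..n in little- or big-endian order. */
static int bytes_match(const unsigned char *b, size_t n, int little) {
    for (size_t i = 0; i < n; i++) {
        unsigned expected = little ? (unsigned)(n - i) : (unsigned)(i + 1);
        if (b[i] != expected) return 0;
    }
    return 1;
}

/* Run-time check: all built-in integer widths must use the same byte order. */
static int byte_order_is_consistent(void) {
    const uint16_t u16 = 0x0102u;
    const uint32_t u32 = 0x01020304u;
    const uint64_t u64 = 0x0102030405060708u;
    unsigned char b16[2], b32[4], b64[8];
    memcpy(b16, &u16, sizeof b16);
    memcpy(b32, &u32, sizeof b32);
    memcpy(b64, &u64, sizeof b64);
    for (int little = 0; little <= 1; little++)
        if (bytes_match(b16, 2, little) && bytes_match(b32, 4, little) &&
            bytes_match(b64, 8, little))
            return 1;
    return 0;
}
```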

    I might add a few things.

    ALU:
    If integer types overflow, they wrap, with any internal sign or zero extension consistent with the declared type;
    If a multiply overflows, the result will contain the low-order bits
    of the product, sign or zero extended according to the declared types;
    If a variable is shifted left, it will behave as-if it were sign or
    zero extended in a way consistent with the type;
    If a signed value is shifted right, its high order bits will remain consistent with the original sign bit.


    So, in the above example, one could see:
    if (a > a + 1) { }
    As a hypothetical:
    if (a > SignExtend32(a + 1)) { }
    Where SignExtend32 returns the input value sign-extended from 32 bits
    (a+1 always incrementing the value, but may conceptually either wrap
    or go outside the allowed range for 'int', with the sign extension
    always returning it to its canonical form, seen as twos complement).
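    The SignExtend32 reading can be written out in today's C without any
    overflow by doing the addition in a wider type. This is a minimal
    sketch assuming a 32-bit `int`; `SignExtend32` is the hypothetical name
    from the text, and `gt_next_wrapping` is likewise an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper named after the SignExtend32 in the text: keep the low
   32 bits of v and reinterpret them as a two's-complement value. Only
   int64_t arithmetic is used, so no intermediate step overflows. */
static int64_t SignExtend32(int64_t v) {
    return ((v & 0xFFFFFFFF) ^ 0x80000000) - 0x80000000;
}

/* `a > a + 1` under the proposed rule: a + 1 is formed in a wider type,
   then folded back into the canonical 32-bit range. */
static int gt_next_wrapping(int32_t a) {
    return (int64_t)a > SignExtend32((int64_t)a + 1);
}
```

    Under this reading the condition is true exactly when `a` is INT_MAX,
    so a compiler following the rule could not fold it to false.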


    I will not define the behavior of shifts by amounts greater than or
    equal to the width of the integer type, or of negative shifts, as
    there isn't a consistent behavior here across targets.

    However, I will note that for shifts in a constant expression, it
    does seem to be the case that the shift behaves as if the width were
    unbounded, and a negative shift as a shift in the opposite direction,
    with the result then being sign or zero extended in accordance with
    the type.

    Say, for example, zigzag sign folding:
    int32_t i, j, k;
    i=somevalue;
    j=(i<<1)^(i>>31); //fold sign into LSB
    k=((j>>1)&0x7FFFFFFF)^((j<<31)>>31); //unfold; the mask makes j>>1 a logical shift
    assert(k==i);
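    For comparison, the same zigzag fold can be written so that it is well
    defined in today's ISO C, by doing all shifts on `uint32_t`. Note that
    decoding needs a logical right shift of `j`: an arithmetic shift of a
    signed `j` gives wrong results once |i| reaches 2^30. This is a sketch;
    the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Zigzag folding written so every shift is on uint32_t: nothing depends on
   signed-shift or signed-overflow behavior. */
static uint32_t zigzag_encode(int32_t i) {
    uint32_t u = (uint32_t)i;                 /* modulo 2^32, well defined */
    uint32_t sign = (uint32_t)0 - (u >> 31);  /* all ones if i < 0, else 0 */
    return (u << 1) ^ sign;                   /* fold sign into LSB */
}

static int32_t zigzag_decode(uint32_t j) {
    /* Logical right shift, then undo the sign fold. */
    uint32_t u = (j >> 1) ^ ((uint32_t)0 - (j & 1u));
    /* Map back to int32_t without an implementation-defined conversion. */
    if (u <= (uint32_t)INT32_MAX)
        return (int32_t)u;
    return -(int32_t)~u - 1;
}
```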


    Memory:
    One may freely cast pointers to different types and dereference them, regardless of types or alignment of said pointers;
    Pointers will behave as-if the memory space were a linear array of
    bytes, with each value as one or more contiguous bytes in memory;
    Structs are normally laid out with each member stored sequentially in
    memory, with each member padded to its natural alignment, and the
    overall struct, if needed, padded to a multiple of the largest member
    alignment; The natural alignment for primitive types is equal to the
    size of said primitive type;
    The address taken of any variable will have an in-memory layout
    consistent with the declared type;
    ...
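    The struct-layout rule can be checked with `offsetof` on a concrete
    example. This sketch assumes natural alignment equals size for
    primitive types, which holds on common ABIs such as x86-64 System V and
    AArch64 but is not required by ISO C. On the i386 System V ABI, for
    example, a `double` inside a struct is only 4-byte aligned, so the last
    two checks would fail there. `struct Example` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct Example {
    char    c;  /* offset 0 */
    int32_t i;  /* padded up to offset 4 */
    char    d;  /* offset 8 */
    double  x;  /* padded up to offset 16 */
};              /* total 24: a multiple of the largest alignment, 8 */

_Static_assert(offsetof(struct Example, i) == 4,  "int32_t naturally aligned");
_Static_assert(offsetof(struct Example, d) == 8,  "members stored in declaration order");
_Static_assert(offsetof(struct Example, x) == 16, "double naturally aligned");
_Static_assert(sizeof(struct Example) == 24,      "tail padding to largest alignment");
```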

    Implicitly:
    Any memory store may potentially alias with any other memory access,
    unless: One or both pointers has the restrict keyword;
    It can be reasonably proven that the pointed-to memory locations do
    not alias;
    A compiler may assume an access is aligned if it can be verified that
    no operation has caused the address to become misaligned (though, as
    a reservation, if a variable is declared restrict, it may also be
    assumed to be properly aligned for its type).


    Granted, there are targets where pointers are assumed aligned by
    default and must be explicitly declared unaligned, but there is no
    standard way in C to declare an unaligned pointer, and there is code
    that assumes the ability to freely dereference pointers regardless
    of alignment.
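    In ISO C as it stands, the portable way to get "just load the bytes"
    semantics for a misaligned or type-punned address is `memcpy`. GCC and
    Clang usually compile a small fixed-size `memcpy` like this into a
    single load or store on targets that permit unaligned access. A
    minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a uint32_t from any address, aligned or not, without type punning. */
static uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Store a uint32_t to any address, aligned or not. */
static void store_u32(void *p, uint32_t v) {
    memcpy(p, &v, sizeof v);
}
```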

    Though, a less conservative option would be to assume that any normal
    pointer variable is aligned by default, but may become unaligned if
    it accepts a value created by casting from a type of smaller
    alignment (or is assigned a value from a pointer holding such a
    value).

    char *cs;
    int *pi, *pj;
    ...
    pi=(int *)cs; //taints pi with unaligned status.
    ...
    pj=pi; //taints pj with unaligned status via pi

    This would still leave it as UB to pass or return a misaligned
    pointer across function boundaries (if the pointer is then
    de-referenced), or similar for putting them in struct members.

    May leave a partial exception for "void *", which may be cast to
    another type without causing the result to become unaligned.

    ...

    Misc:
    A missing return value is required to still return as normal;
    However, the nature and contents of the value returned will be
    undefined (it will be "probably random garbage").


    But I would make some reservations:
    The relative location and alignment of global variables remains
    undefined;
    The relative location and alignment of automatic variables remains
    undefined;
    The nature or the storage of any global or automatic variable whose
    address has not been taken remains undefined;
    The nature or identity of any temporary variables created within an
    expression remains undefined;
    Calling a function with a missing prototype will remain undefined,
    except if the argument and return types are all primitive types,
    the argument types are an exact match and either pointer or integer
    types, and the return type is a small integer;
    ...


    Similarly, one likely can't (yet) require that targets be little
    endian, but one can make a working assumption that the target is
    probably little endian.

    ...


    I agree with the great majority of it.

    Rules for shifts could be formulated better. I think they are
    formulated better in the gcc manual, in the section on
    implementation-defined behavior.

    For functions without arguments, I'd prefer mandatory prototypes, even
    at the cost of breaking existing code.
    I'd also be more draconian about both a missing return type and a
    missing return statement in a non-void function.

    About endianness, I think that my definition in the post above is the
    most practical, i.e., BE is allowed, but inconsistent byte orders are
    prohibited. Plus, of course, a standardized name for a preprocessor
    built-in for easy compile-time detection of endianness.
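    Until such a name is universal, a sketch of compile-time detection with
    a run-time fallback: GCC and Clang predefine `__BYTE_ORDER__`,
    `__ORDER_LITTLE_ENDIAN__` and `__ORDER_BIG_ENDIAN__`, and C23 has since
    standardized similar macros (`__STDC_ENDIAN_NATIVE__` and friends) in
    `<stdbit.h>`. The enum and macro names below are hypothetical;
    `ORDER_OTHER` is the mixed-order case the proposal would exclude.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum byte_order { ORDER_LITTLE, ORDER_BIG, ORDER_OTHER };

/* Run-time probe: inspect how a known 32-bit pattern is laid out in memory. */
static enum byte_order runtime_byte_order(void) {
    const uint32_t probe = 0x01020304u;
    unsigned char b[4];
    memcpy(b, &probe, sizeof b);
    if (b[0] == 4 && b[1] == 3 && b[2] == 2 && b[3] == 1) return ORDER_LITTLE;
    if (b[0] == 1 && b[1] == 2 && b[2] == 3 && b[3] == 4) return ORDER_BIG;
    return ORDER_OTHER;
}

/* Prefer the compiler's predefined macros; fall back to the probe. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define NATIVE_BYTE_ORDER ORDER_LITTLE
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define NATIVE_BYTE_ORDER ORDER_BIG
#else
#define NATIVE_BYTE_ORDER runtime_byte_order()
#endif
```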
