• Is Vax addressing sane today?

    From Brett@21:1/5 to All on Thu Sep 5 21:03:37 2024
    Is Vax addressing sane today?

    I am not talking indirect addressing, that is stupid.

    It has been determined from trusted sources that add from memory and add to memory as used in x86 are sane, and not much of a problem.

    But Vax allows all three arguments to be in memory with different pointers.

    Is this sane, just a natural progression if you allow memory operands?

    Packing three offsets in an instruction that can be decoded reasonably is a whole other problem…

    Heads and tails encoding could actually do this reasonably, and the code
    density would actually be better than most competitors. Heads and tails
    is not that easy, but it’s not x86 difficult.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Brett on Thu Sep 5 21:15:20 2024
    Brett <[email protected]> writes:


    Is Vax addressing sane today?

    By what definition of sanity? The question doesn't make sense.


    I am not talking indirect addressing, that is stupid.

    That seems rather judgmental. Although you may wish to further
    define 'indirect addressing' to make clear exactly what you're
    referring to (PDP-8 style, with auto-increment? B3500 style
    where there is potentially infinite indirection? x86 where
    indirection through a register is common?)

    They may not make sense in the context of a modern RISC processor,
    but that doesn't make them "stupid".


    It has been determined from trusted sources that add from memory and add to memory as used in x86 are sane, and not much of a problem.

    "Sanity" isn't an attribute associated with hardware architecture.


    But Vax allows all three arguments to be in memory with different pointers.

    That's not an addressing issue, it is simply a natural form of
    three-operand instruction when the processor supports memory to
    memory instructions.

    Now, you may reframe your question as to the desirability
    of memory-to-memory instructions vis-à-vis performance and/or
    optimal code, and there you might find various opinions.

    I think the folks on this group spend an inordinate amount of
    time discussing minutiae such as how many address/offset bits
    can be encoded in an instruction, or code density, which in the
    real world with modern high-performance processors aren't
    particularly significant or interesting to the typical
    programmer.

  • From John Dallman@21:1/5 to Brett on Thu Sep 5 22:55:00 2024
    In article <vbd6b9$g147$[email protected]>, [email protected] (Brett) wrote:

    It has been determined from trusted sources that add from memory
    and add to memory as used in x86 are sane, and not much of a problem.

    But Vax allows all three arguments to be in memory with different
    pointers.

    Is this sane, just a natural progression if you allow memory
    operands?

    Memory-to-memory instructions, in general, are hard to get to run fast
    with today's processors and memory, simply because memory access times
    are long enough for many register-to-register instructions to execute. A
    lot of that time can be hidden with good caches and prefetchers, but if
    your memory access patterns are complicated, those speedups can fail to
    work.

    One reason for memory-to-memory instructions was to avoid the need to
    dedicate registers to operands, but that's not much of a problem these
    days, since we have space in the CPU for lots of registers and rename
    systems for them.

    VAX was designed when heavy use of microcoding seemed like a good idea to
    make a CPU at an economical price, and memory wasn't much slower than
    registers. It was a backward-looking design in some ways, being a much
    better computer for the 1970s, rather than looking ahead to new concepts.
    VMS was the last large operating system written in assembly language (and
    Bliss, which is somewhat higher-level, but not much).

    DEC spent a lot of time and money trying to keep VAX competitive and took
    too long to accept that was impractical. That was one of the seeds of
    their downfall.

    John

  • From MitchAlsup1@21:1/5 to Brett on Thu Sep 5 23:32:49 2024
    On Thu, 5 Sep 2024 21:03:37 +0000, Brett wrote:



    Is Vax addressing sane today?

    I am not talking indirect addressing, that is stupid.

    It has been determined from trusted sources that add from memory and add
    to memory as used in x86 are sane, and not much of a problem.

    But Vax allows all three arguments to be in memory with different
    pointers.

    With modern compiler technology 88% of instructions need only 1
    constant--thus VAX provides too many, along with providing address
    modes that require sequential decoding.

    Most ISAs do not provide "enough" constants, VAX provides too many.
    Where "enough" covers::

    SLA R9,#1,R17 // this is 1 instruction
    DIV R9,#24,R17 // ibid
    FDIV R8,#3.14159265358928,R17

    Is this sane, just a natural progression if you allow memory operands?

    Having watched this in real time:: in 1970 we needed more/better
    constants, then PDP-11 came around and we liked it, then at the end
    of the decade VAX came along and we loved it, only later recognizing
    that it had fallen for the second system syndrome--becoming overly
    complicated without benefit--the address space was definitely needed,
    the address modes not so much.

    Packing three offsets in an instruction that can be decoded reasonably
    is a whole other problem…

    Realistically, modern compilers have advanced to the point where
    anything more than 1 constant per instruction is overkill--
    harder to build and delivering no more performance.

    Heads and tails encoding could actually do this reasonably, and the code
    density would actually be better than most competitors. Heads and
    tails is not that easy, but it’s not x86 difficult.

    Another encoding scheme is segmenting the OpCode into 2 components
    1) goes to the function unit to convey the kind of calculation
    to be performed,
    2) goes to the forwarding logic to convey how to route bits into
    calculation.

    Some might consider the concatenation of both to be the OpCode
    but that obscures what to do with when to do it.

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Sep 6 00:39:30 2024
    On Thu, 5 Sep 2024 22:55 +0100 (BST), John Dallman wrote:

    VMS was the last large operating system written in assembly language
    (and Bliss, which is somewhat higher-level, but not much).

    Bliss could, I think, have been just as portable as C. But it mainly found favour inside DEC.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Fri Sep 6 00:38:31 2024
    On Thu, 05 Sep 2024 21:15:20 GMT, Scott Lurndal wrote:

    "Sanity" isn't an attribute associated with hardware architecture.

    Says someone posting to a group full of hardware experts ...

  • From Anton Ertl@21:1/5 to Brett on Fri Sep 6 05:38:01 2024
    Brett <[email protected]> writes:
    But Vax allows all three arguments to be in memory with different pointers.

    Is this sane, just a natural progression if you allow memory operands?

    In combination with supporting unaligned accesses (but excluding
    indirect addressing), it means that an instruction can access 6 pages,
    and so the TLB (and/or TLB loader) has to be designed to support that. Likewise, the OS has to be designed to load all 6 pages into physical
    RAM without evicting one of these pages again. So this kind of
    architecture increases the design complexity. And I don't see a
    benefit from this design.
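
    To make the six-page figure concrete, here is a minimal sketch that counts the distinct pages touched by the data operands of one three-operand instruction. It assumes the VAX page size of 512 bytes; the operand addresses and the helper names are made up for illustration, and the pages holding the instruction bytes themselves are not counted.

```python
PAGE_SIZE = 512  # VAX page size in bytes


def pages_touched(addr, size, page_size=PAGE_SIZE):
    """Return the set of page numbers covered by `size` bytes starting at `addr`."""
    first = addr // page_size
    last = (addr + size - 1) // page_size
    return set(range(first, last + 1))


def instruction_pages(operands, page_size=PAGE_SIZE):
    """Union of the pages touched by all (addr, size) operands of one instruction."""
    pages = set()
    for addr, size in operands:
        pages |= pages_touched(addr, size, page_size)
    return pages


# Hypothetical addresses: three 4-byte operands, each starting 2 bytes
# before a page boundary, so every operand straddles two pages.
worst_case = [(0x1000 - 2, 4), (0x5000 - 2, 4), (0x9000 - 2, 4)]
# The same three operands, naturally aligned.
aligned = [(0x1000, 4), (0x5000, 4), (0x9000, 4)]

print(len(instruction_pages(worst_case)))  # 6
print(len(instruction_pages(aligned)))     # 3
```

    With aligned operands the same instruction needs only three data pages resident at once; it is the combination with unaligned accesses that doubles the requirement.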

    Heads and tails encoding could actually do this reasonably, and the code density would actually be better than most competitors.

    Would it? Please present empirical data. Certainly people claim that instruction sets with one-memory-address load-and-op and
    read-modify-write instructions have better code density, but when you
    look at the data, there are load-store instruction sets with better
    code density (and by quite a lot). From <[email protected]>:

       bash    grep   gzip
     595204  107636  46744  armhf     16 regs  load/store      32-bit
     599832  101102  46898  riscv64   32 regs  load/store      64-bit
     796501  144926  57729  amd64     16 regs  ld-op ld-op-st  64-bit
     829776  134784  56868  arm64     32 regs  load/store      64-bit
     853892  152068  61124  i386       8 regs  ld-op ld-op-st  32-bit
     891128  158544  68500  armel     16 regs  load/store      32-bit
     892688  168816  64664  s390x     16 regs  ld-op ld-op-st  64-bit
    1020720  170736  71088  mips64el  32 regs  load/store      64-bit
    1168104  194900  83332  ppc64el   32 regs  load/store      64-bit
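
    One way to read the table is to normalise each column to its smallest entry. The sketch below does that with the numbers exactly as posted; the dictionary layout and the function name are only illustrative.

```python
# Sizes as posted, in column order (bash, grep, gzip).
SIZES = {
    "armhf":    (595204, 107636, 46744),
    "riscv64":  (599832, 101102, 46898),
    "amd64":    (796501, 144926, 57729),
    "arm64":    (829776, 134784, 56868),
    "i386":     (853892, 152068, 61124),
    "armel":    (891128, 158544, 68500),
    "s390x":    (892688, 168816, 64664),
    "mips64el": (1020720, 170736, 71088),
    "ppc64el":  (1168104, 194900, 83332),
}
COLUMNS = ("bash", "grep", "gzip")


def relative_sizes(sizes, column):
    """Scale one column so that the smallest architecture is 1.0."""
    smallest = min(row[column] for row in sizes.values())
    return {arch: row[column] / smallest for arch, row in sizes.items()}


for index, program in enumerate(COLUMNS):
    ratios = relative_sizes(SIZES, index)
    line = "  ".join(f"{arch}={ratio:.2f}" for arch, ratio in ratios.items())
    print(f"{program}: {line}")
```

    For bash, for example, amd64 comes out at about 1.34 times the armhf size, which is the sense in which load/store instruction sets lead this particular comparison.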

    What is "heads and tails encoding"?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Sep 6 07:08:51 2024
    On Fri, 06 Sep 2024 06:05:35 GMT, Anton Ertl wrote:

    ... they failed to stick to VAX for the few more years until
    they would have developed an OoO implementation, which would have
    leveled the playing field again (see Pentium Pro).

    It takes a whole lot of extra transistors (and consequent die area) to
    keep a CISC architecture comparable in performance to RISC. Back about
    when Intel finally caught up with PowerPC, I remember their chip packages
    were huge -- about the size of a VHS videocassette.

    Intel were probably spending 10× what Apple-IBM-Motorola were putting into each generation of chip development. But then, the x86 world had 10× the revenue coming in, so Intel could afford it. That’s how they regained the lead over RISC.

    Nowadays, I don’t think the revenue advantage is quite what it once was. That, and the even greater increases in chip complexity (and hence
    development cost), has tilted the playing field more in favour of RISC architectures, notably ARM and RISC-V.

  • From Anton Ertl@21:1/5 to John Dallman on Fri Sep 6 06:05:35 2024
    [email protected] (John Dallman) writes:
    Memory-to-memory instructions, in general, are hard to get to run fast
    with today's processors and memory, simply because memory access times
    are long enough for many register-to-register instructions to execute.

    Given modern OoO technology, even VAX can fly. It does not matter
    whether, say,

    *a++ = *b++ + *c++;

    is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
    RISC-V instructions, what goes on inside the OoO engine is pretty
    similar in all cases, and so is the performance.
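
    For readers who want to see where the counts 1, 4 and 7 come from, here is one plausible set of instruction sequences, held as plain data. It assumes 32-bit int elements and an arbitrary register assignment (a, b, c in R1/R2/R3, x0/x1/x2 and a0/a1/a2 respectively); a real compiler may well pick different registers or schedules.

```python
# Candidate encodings of:  *a++ = *b++ + *c++;   (int elements assumed)
SEQUENCES = {
    "VAX": [
        "ADDL3 (R2)+, (R3)+, (R1)+",   # two autoincrement sources, one destination
    ],
    "ARM A64": [
        "ldr w3, [x1], #4",            # load *b, post-increment b
        "ldr w4, [x2], #4",            # load *c, post-increment c
        "add w3, w3, w4",
        "str w3, [x0], #4",            # store to *a, post-increment a
    ],
    "RISC-V": [
        "lw   t0, 0(a1)",
        "lw   t1, 0(a2)",
        "add  t0, t0, t1",
        "sw   t0, 0(a0)",
        "addi a0, a0, 4",              # no post-increment addressing in the base ISA
        "addi a1, a1, 4",
        "addi a2, a2, 4",
    ],
}

for isa, seq in SEQUENCES.items():
    print(f"{isa}: {len(seq)} instruction(s)")
```

    In every case the work is the same two loads, one add, one store and three pointer updates; only the packaging into instructions differs.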

    In recent years a number of implementations have 0-cycle store-to-load forwarding, so the misconception that a memory operand is as cheap as
    a register operand if only the instruction set has memory operands of
    operate instructions is a little bit closer to reality. It is still a misconception, because such an implementation can read and write
    several times as many registers per cycle as memory operands.

    A lot of that time can be hidden with good caches and prefetchers, but if
    your memory access patterns are complicated, those speedups can fail to
    work.

    Whether operate instructions in an instruction set have 0, 1, or 3
    memory operands makes little difference in that case.

    One reason for memory-to-memory instructions was to avoid the need to
    dedicate registers to operands, but that's not much of a problem these
    days, since we have space in the CPU for lots of registers and rename
    systems for them.

    That may have been a consideration in the NOVA or the 6800, but in
    case of the VAX with its 16 registers, that corresponds to a load/store-architecture with 18 registers, so for the VAX this is just
    a minor issue.

    Some time ago I thought a bit about which kind of architecture to
    design with the transistor budget of the 6502, but with the RISC
    lessons under the belt. One problem with a big RISC-like register set
    is the instruction bandwidth. You really want to stick to 8-bit
    instructions if you only have an 8-bit data bus. With a register
    architecture that means 2-bits for register operands, and that means
    you would need a lot of loads and stores in a load/store architecture.
    So the narrow instruction word almost forces you to use implicit
    register operands or small special-purpose register sets (e.g., 2
    accumulators and 4 index registers, as in the 6809) rather than
    general-purpose registers.
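
    A small sketch of the bit budget behind this argument: with 8-bit instructions and two register operands of 2 bits each, only 4 bits are left for the opcode and only 4 registers are addressable. The field layout below (opcode in the high nibble, then destination, then source) is purely hypothetical.

```python
OPCODE_BITS = 4   # hypothetical split of the 8-bit instruction word
REG_BITS = 2      # 2-bit register fields give 4 addressable registers


def encode(opcode, rd, rs):
    """Pack opcode and two register numbers into one 8-bit instruction."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 4 bits")
    if not (0 <= rd < (1 << REG_BITS) and 0 <= rs < (1 << REG_BITS)):
        raise ValueError("register number does not fit in 2 bits")
    return (opcode << (2 * REG_BITS)) | (rd << REG_BITS) | rs


def decode(byte):
    """Split an 8-bit instruction back into (opcode, rd, rs)."""
    reg_mask = (1 << REG_BITS) - 1
    return byte >> (2 * REG_BITS), (byte >> REG_BITS) & reg_mask, byte & reg_mask


insn = encode(0b1010, 3, 1)
print(hex(insn), decode(insn))  # 0xad (10, 3, 1)
```

    Sixteen opcodes and four registers leave no room for immediates or addresses in the same byte, which is why such a design drifts toward implicit operands.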

    However, the VAX 11/780 does not have these restrictions. It has a
    wider memory bus and it has a cache.

    DEC spent a lot of time and money trying to keep VAX competitive and took
    too long to accept that was impractical. That was one of the seeds of
    their downfall.

    Either that, or they failed to stick to VAX for the few more years
    until they would have developed an OoO implementation, which would
    have leveled the playing field again (see Pentium Pro). The Alpha
    came out in 1992, the Pentium Pro in 1995, so if DEC had stuck to the
    VAX and managed a timely OoO implementation, they would have needed to
    survive just 3 years. And it seems that they lost a lot of customers
    in the transition from VAX to Alpha.

    Of course, the question is if the customers would have stayed with DEC
    if they had continued with the VAX. The vibe at the time was that
    CISCs are doomed. OTOH, Intel stuck with IA-32 and won with the P6,
    and IBM stuck with the S390. But VAX customers are not S390
    customers, and maybe they would have defected to Intel even if the VAX
    had been there.

    From what I read, the VAX 9000 was a big nail in the DEC coffin. In
    hindsight they should have canceled the project early, but that does
    not mean that they could not have continued with VAX (they could even
    have competed with the IBM mainframes, which took quite long to gain superscalar and OoO implementations).

    - anton

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Fri Sep 6 14:54:24 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 05 Sep 2024 21:15:20 GMT, Scott Lurndal wrote:

    "Sanity" isn't an attribute associated with hardware architecture.

    Says someone posting to a group full of hardware experts ...

    That someone has been doing architecture since 1983; from
    mainframes to high-end SoCs.

  • From Brett@21:1/5 to Anton Ertl on Fri Sep 6 18:19:02 2024
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    But Vax allows all three arguments to be in memory with different pointers.
    Is this sane, just a natural progression if you allow memory operands?

    In combination with supporting unaligned accesses (but excluding
    indirect addressing), it means that an instruction can access 6 pages,
    and so the TLB (and/or TLB loader) has to be designed to support that. Likewise, the OS has to be designed to load all 6 pages into physical
    RAM without evicting one of these pages again. So this kind of
    architecture increases the design complexity. And I don't see a
    benefit from this design.

    The memory system is pipelined, once you load the first of the three
    values, you do not care if that cache line is evicted while you load the second.

    Caches are 16 way today, one does not worry about cache line evictions, it
    just works.

    Heads and tails encoding could actually do this reasonably, and the code
    density would actually be better than most competitors.

    Would it? Please present empirical data. Certainly people claim that instruction sets with one-memory-address load-and-op and
    read-modify-write instructions have better code density, but when you
    look at the data, there are load-store instruction sets with better
    code density (and by quite a lot). From <[email protected]>:

       bash    grep   gzip
     595204  107636  46744  armhf     16 regs  load/store      32-bit
     599832  101102  46898  riscv64   32 regs  load/store      64-bit
     796501  144926  57729  amd64     16 regs  ld-op ld-op-st  64-bit
     829776  134784  56868  arm64     32 regs  load/store      64-bit
     853892  152068  61124  i386       8 regs  ld-op ld-op-st  32-bit
     891128  158544  68500  armel     16 regs  load/store      32-bit
     892688  168816  64664  s390x     16 regs  ld-op ld-op-st  64-bit
    1020720  170736  71088  mips64el  32 regs  load/store      64-bit
    1168104  194900  83332  ppc64el   32 regs  load/store      64-bit

    What is "heads and tails encoding"?

    128 bit or larger packets with the fixed size opcodes on the front, and the
    variable sized data and offsets packing in from the end. You get
    variable-length instruction density with easier, faster wide decoding. And
    also using memory operands gives you another density bonus on top.

    The downside is that it makes your one and two wide implementations
    bigger.
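
    A toy model of the packing described above, to show why the heads can be found without decoding anything else: heads sit at fixed positions counted from the front of a 128-bit bundle, while the variable-size tails fill in from the end. Everything about the layout here is an assumption made for illustration (16-bit heads, a 12-bit non-zero op field, a 4-bit tail length in bytes, an all-zero op marking an empty slot); it is not the format from any published design.

```python
BUNDLE_BITS = 128
HEAD_BITS = 16        # hypothetical fixed head size
TAIL_LEN_BITS = 4     # low bits of each head: tail length in bytes
HEAD_MASK = (1 << HEAD_BITS) - 1
LEN_MASK = (1 << TAIL_LEN_BITS) - 1


def pack_bundle(instrs):
    """Pack (op, tail) pairs into one integer; op is 12-bit non-zero, tail is bytes."""
    bundle = 0
    head_pos = BUNDLE_BITS   # heads grow downward from the front (high bits)
    tail_pos = 0             # tails grow upward from the end (low bits)
    for op, tail in instrs:
        if not 0 < op < (1 << (HEAD_BITS - TAIL_LEN_BITS)):
            raise ValueError("op must be a non-zero 12-bit value")
        if len(tail) > LEN_MASK:
            raise ValueError("tail too long for the length field")
        tail_bits = 8 * len(tail)
        if head_pos - HEAD_BITS < tail_pos + tail_bits:
            raise ValueError("bundle overflow: heads and tails would collide")
        head_pos -= HEAD_BITS
        bundle |= ((op << TAIL_LEN_BITS) | len(tail)) << head_pos
        bundle |= int.from_bytes(tail, "little") << tail_pos
        tail_pos += tail_bits
    return bundle


def unpack_bundle(bundle):
    """Recover the (op, tail) pairs; head k is always at the same bit position."""
    instrs = []
    tail_pos = 0
    for slot in range(BUNDLE_BITS // HEAD_BITS):
        head_pos = BUNDLE_BITS - HEAD_BITS * (slot + 1)
        if head_pos < tail_pos:          # slot is occupied by tail bytes
            break
        head = (bundle >> head_pos) & HEAD_MASK
        op, n = head >> TAIL_LEN_BITS, head & LEN_MASK
        if op == 0:                      # empty slot ends the bundle
            break
        tail = ((bundle >> tail_pos) & ((1 << (8 * n)) - 1)).to_bytes(n, "little")
        tail_pos += 8 * n
        instrs.append((op, tail))
    return instrs


example = [(0x123, b"\x10"), (0x456, b""), (0x789, b"\x00\x01\x02\x03")]
assert unpack_bundle(pack_bundle(example)) == example
```

    Locating a tail still needs the sum of the preceding tail lengths, but those lengths are all in fixed-position heads, so a wide decoder can read them in parallel.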


  • From Brett@21:1/5 to [email protected] on Fri Sep 6 18:21:17 2024
    MitchAlsup1 <[email protected]> wrote:
    On Thu, 5 Sep 2024 21:03:37 +0000, Brett wrote:



    Is Vax addressing sane today?

    I am not talking indirect addressing, that is stupid.

    It has been determined from trusted sources that add from memory and add
    to memory as used in x86 are sane, and not much of a problem.

    But Vax allows all three arguments to be in memory with different
    pointers.

    With modern compiler technology 88% of instructions need only 1 constant--thus VAX provides too many, along with providing address
    modes that require sequential decoding.

    Most ISAs do not provide "enough" constants, VAX provides too many.
    Where "enough" covers::

    SLA R9,#1,R17 // this is 1 instruction
    DIV R9,#24,R17 // ibid
    FDIV R8,#3.14159265358928,R17

    In C++ game code there are places where you are loading from two structures
    and storing into a third structure. Three offsets are needed and used.

    Most commonly you need two offsets as you are building a new structure from
    the old one. An example is building the polygon display-list
    structures from your source structures, which contain X,Y,Z and R,G,B,A and
    weights, and other info.

    The benchmarks you are using are out of date, using arrays instead of structures. Arrays are rare in game code, it’s all structures.

    Not Fortran arrays, C++ structure spaghetti.

    Is this sane, just a natural progression if you allow memory operands?

    Having watched this in real time:: in 1970 we needed more/better
    constants, then PDP-11 came around and we liked it, then at the end
    of the decade VAX came along and we loved it, only later recognizing
    that it had fallen for the second system syndrome--becoming overly
    complicated without benefit--the address space was definitely needed,
    the address modes not so much.

    Packing three offsets in an instruction that can be decoded reasonably
    is a whole other problem…

    Realistically, modern compilers have advanced to the point where
    anything more than 1 constant per instruction is overkill--
    harder to build and delivering no more performance.

    Heads and tails encoding could actually do this reasonably, and the code
    density would actually be better than most competitors. Heads and
    tails is not that easy, but it’s not x86 difficult.

    Another encoding scheme is segmenting the OpCode into 2 components
    1) goes to the function unit to convey the kind of calculation
    to be performed,
    2) goes to the forwarding logic to convey how to route bits into
    calculation.

    Some might consider the concatenation of both to be the OpCode
    but that obscures what to do with when to do it.


  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Sep 6 18:31:05 2024
    On Fri, 6 Sep 2024 6:05:35 +0000, Anton Ertl wrote:

    [email protected] (John Dallman) writes:
    Memory-to-memory instructions, in general, are hard to get to run fast
    with today's processors and memory, simply because memory access times
    are long enough for many register-to-register instructions to execute.

    Given modern OoO technology, even VAX can fly. It does not matter
    whether, say,

    *a++ = *b++ + *c++;

    is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
    RISC-V instructions, what goes on inside the OoO engine is pretty
    similar in all cases, and so is the performance.

    In recent years a number of implementations have 0-cycle store-to-load forwarding, so the misconception that a memory operand is as cheap as
    a register operand if only the instruction set has memory operands of
    operate instructions is a little bit closer to reality. It is still a misconception, because such an implementation can read and write
    several times as many registers per cycle as memory operands.

    A lot of that time can be hidden with good caches and prefetchers, but if
    your memory access patterns are complicated, those speedups can fail to
    work.

    Whether operate instructions in an instruction set have 0, 1, or 3
    memory operands makes little difference in that case.

    One reason for memory-to-memory instructions was to avoid the need to
    dedicate registers to operands, but that's not much of a problem these
    days, since we have space in the CPU for lots of registers and rename
    systems for them.

    That may have been a consideration in the NOVA or the 6800, but in
    case of the VAX with its 16 registers, that corresponds to a load/store-architecture with 18 registers, so for the VAX this is just
    a minor issue.

    Some time ago I thought a bit about which kind of architecture to
    design with the transistor budget of the 6502, but with the RISC
    lessons under the belt. One problem with a big RISC-like register set
    is the instruction bandwidth. You really want to stick to 8-bit
    instructions if you only have an 8-bit data bus. With a register architecture that means 2-bits for register operands, and that means
    you would need a lot of loads and stores in a load/store architecture.
    So the narrow instruction word almost forces you to use implicit
    register operands or small special-purpose register sets (e.g., 2
    accumulators and 4 index registers, as in the 6809) rather than
    general-purpose registers.

    However, the VAX 11/780 does not have these restrictions. It has a
    wider memory bus and it has a cache.

    DEC spent a lot of time and money trying to keep VAX competitive and took
    too long to accept that was impractical. That was one of the seeds of
    their downfall.

    Either that, or they failed to stick to VAX for the few more years
    until they would have developed an OoO implementation, which would
    have leveled the playing field again (see Pentium Pro). The Alpha
    came out in 1992, the Pentium Pro in 1995, so if DEC had stuck to the
    VAX and managed a timely OoO implementation, they would have needed to survive just 3 years. And it seems that they lost a lot of customers
    in the transition from VAX to Alpha.

    Of course, the question is if the customers would have stayed with DEC
    if they had continued with the VAX. The vibe at the time was that
    CISCs are doomed. OTOH, Intel stuck with IA-32 and won with the P6,
    and IBM stuck with the S390. But VAX customers are not S390
    customers, and maybe they would have defected to Intel even if the VAX
    had been there.

    In my opinion, DEC was caught at an ugly time for them. They did not
    have the transistor budget for a GBOoO implementation at exactly the
    time they also needed a clean transition to 64-bits (even more
    transistors). DEC did have the transistors for a medium OoO implementation
    but probably not for the 64-bit transition as well.

    From what I read, the VAX 9000 was a big nail in the DEC coffin. In hindsight they should have canceled the project early, but that does
    not mean that they could not have continued with VAX (they could even
    have competed with the IBM mainframes, which took quite long to gain superscalar and OoO implementations).


  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Sep 6 18:33:12 2024
    On Fri, 6 Sep 2024 5:38:01 +0000, Anton Ertl wrote:

    From
    <[email protected]>:

       bash    grep   gzip
     595204  107636  46744  armhf     16 regs  load/store      32-bit
     599832  101102  46898  riscv64   32 regs  load/store      64-bit
     796501  144926  57729  amd64     16 regs  ld-op ld-op-st  64-bit
     829776  134784  56868  arm64     32 regs  load/store      64-bit
     853892  152068  61124  i386       8 regs  ld-op ld-op-st  32-bit
     891128  158544  68500  armel     16 regs  load/store      32-bit
     892688  168816  64664  s390x     16 regs  ld-op ld-op-st  64-bit
    1020720  170736  71088  mips64el  32 regs  load/store      64-bit
    1168104  194900  83332  ppc64el   32 regs  load/store      64-bit

    Is there source code freely available so these could be compiled
    for My 66000 ISA and placed in the list?

  • From Brett@21:1/5 to Brett on Fri Sep 6 23:49:27 2024
    Brett <[email protected]> wrote:
    Anton Ertl <[email protected]> wrote:

    Here is a PDF on heads and tails:

    http://scale.eecs.berkeley.edu/papers/hat-cases2001.pdf

    They went for maximum density, which is stupid. The timing-critical part is
    the source registers, and in a wide implementation the dest registers are
    also critical. Opcodes and data/offsets only matter far later in the
    pipeline.

    I would do three registers and enough opcode bits to get an idea of opcode
    type and size. For one and two register instructions you pack in more
    opcode.

  • From Anton Ertl@21:1/5 to [email protected] on Sat Sep 7 05:04:34 2024
    [email protected] (MitchAlsup1) writes:
    On Fri, 6 Sep 2024 5:38:01 +0000, Anton Ertl wrote:

    From
    <[email protected]>:

       bash    grep   gzip
     595204  107636  46744  armhf     16 regs  load/store      32-bit
     599832  101102  46898  riscv64   32 regs  load/store      64-bit
     796501  144926  57729  amd64     16 regs  ld-op ld-op-st  64-bit
     829776  134784  56868  arm64     32 regs  load/store      64-bit
     853892  152068  61124  i386       8 regs  ld-op ld-op-st  32-bit
     891128  158544  68500  armel     16 regs  load/store      32-bit
     892688  168816  64664  s390x     16 regs  ld-op ld-op-st  64-bit
    1020720  170736  71088  mips64el  32 regs  load/store      64-bit
    1168104  194900  83332  ppc64el   32 regs  load/store      64-bit

    Is there source code freely available so these could be compiled
    for My 66000 ISA and placed in the list?

    Yes. I measured the binaries of the Debian packages
    bash_5.2.21-2_$arch.deb, grep_3.11-4~exp1_$arch.deb, and
    gzip_1.12-1_$arch.deb (sometimes with an extra suffix: bash_5.2.21-2+b1_$arch.deb gzip_1.12-1+b2_$arch.deb). So look up
    these packages, and then get the corresponding source packages.

    - anton

  • From Anton Ertl@21:1/5 to [email protected] on Sat Sep 7 05:37:57 2024
    [email protected] (MitchAlsup1) writes:
    In my opinion, DEC was caught at an ugly time for them. They did not
    have the transistor budget for a GBOoO implementation at exactly the
    time they also needed a clean transition to 64-bits (even more
    transistors). DEC did have the transistors for a medium OoO implementation
    but probably not for the 64-bit transition as well.

    For the K8 the switch from 32-bit to 64-bit was reported to have cost
    5%. You were there. Are the reports wrong?

    The Pentium Pro has a die size of 306mm^2 in 0.5um and 196mm^2 in
    0.35um according to <https://de.wikipedia.org/wiki/Intel_Pentium_Pro>
    (it's interesting that Intel produced 0.6um, 0.5um, and 0.35um
    versions; apparently locking a chip into a specific process came
    only later).

    The 64-bit OoO R10000 has a die size of 298mm^2 in 0.35um (but it was
    a RISC).

    DEC could fabricate the 299mm^2 21164 in 0.5um, and then the 21164a
    with 209mm^2 in 0.35um (in 1996), and the 21264 with 314mm^2 in 0.35um
    (in 1998).

    An OoO VAX should be possible in 0.35um, maybe not 4-wide like the
    R10000 and the 21264, but supporting three simple instructions or
    three uops from one complex instruction per cycle, like the Pentium Pro.
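
    As a rough cross-check of the die sizes quoted above, the following sketch applies the textbook assumption that an ideal optical shrink scales area with the square of the feature size. Real shrinks do not scale that cleanly, and the 21164a is a revised design rather than a pure shrink, so this is only an order-of-magnitude comparison; the function name is illustrative.

```python
def ideal_shrink_area(area_mm2, old_um, new_um):
    """Area after an ideal shrink: scales with the square of the feature size."""
    return area_mm2 * (new_um / old_um) ** 2


# (area at 0.5um, reported area at 0.35um), both in mm^2, as quoted in the post.
CHIPS = {
    "Pentium Pro": (306, 196),
    "21164 -> 21164a": (299, 209),
}

for name, (area_05, area_035) in CHIPS.items():
    ideal = ideal_shrink_area(area_05, 0.5, 0.35)
    print(f"{name}: ideal {ideal:.1f} mm^2, reported {area_035} mm^2")
```

    Both reported 0.35um dies are noticeably larger than the ideal figure of roughly 150 mm^2, so the shrinks were far from perfect, yet they still sit well below the roughly 300 mm^2 that the R10000 and 21264 occupied in the same process generation.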

    - anton

  • From Anton Ertl@21:1/5 to Brett on Sat Sep 7 05:17:18 2024
    Brett <[email protected]> writes:
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    But Vax allows all three arguments to be in memory with different pointers.
    Is this sane, just a natural progression if you allow memory operands?

    In combination with supporting unaligned accesses (but excluding
    indirect addressing), it means that an instruction can access 6 pages,
    and so the TLB (and/or TLB loader) has to be designed to support that.
    Likewise, the OS has to be designed to load all 6 pages into physical
    RAM without evicting one of these pages again. So this kind of
    architecture increases the design complexity. And I don't see a
    benefit from this design.

    The memory system is pipelined, once you load the first of the three
    values, you do not care if that cache line is evicted while you load the second.

    Caches are 16 way today, one does not worry about cache line evictions, it just works.

    I did not write about caches, but yes, for TLBs a (the?) solution is
    to make the ITLB at least 6-way.

    It's unclear how pipelining should help. The VAX 11/780 was not much
    pipelined and can also do the memory accesses one after the other;
    this did not protect it from the complexity coming from x memory
    accesses in a single instruction. E.g., all the pages accessed by an
    instruction have to be in physical memory, or the implementation has to
    support interruptible instructions; in any case, there is complexity.

    Heads and tails encoding could actually do this reasonably, and the code density would actually be better than most competitors.
    ...
    What is "heads and tails encoding"?

    128 bit or larger packets with the fixed size opcodes on the front, and the
    variable sized data and offsets packing in from the end. You get
    variable-length instruction density with easier, faster wide decoding. And
    also using memory operands gives you another density bonus on top.

    The only reason for VAX-style instructions is if you want to implement
    the VAX instruction set, because you want to run software for the VAX
    (and that reason started to vanish three decades ago and is now almost
    gone). Also, decoding variable-length instructions is a solved
    problem: Intel's P-cores and AMD's Zen-Zen5 cores solve it with
    micro-op caches, and Intel's recent E-cores (Tremont, Gracemont,
    Crestmont, Skymont) solve it by having 2-3 3-wide decoders.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sat Sep 7 07:42:07 2024
    On Sat, 07 Sep 2024 05:04:34 GMT, Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:

    Is there source code freely available so these could be compiled in My >>66000 ISA and placed in the list ??

    So look up these packages, and then get the corresponding source
    packages.

    Debian provides the “apt-get source” command for this purpose.

  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Sep 7 08:44:40 2024
    [email protected] (Anton Ertl) writes:
    [email protected] (MitchAlsup1) writes:
    In my opinion, DEC was caught at an ugly time for them. They did not
    have the transistor budget for a GBOoO implementation at exactly the
    time they also needed a clean transition to 64-bits (even more
    transistors). DEC did have the transistors for a medium OoO implementation
    but unlikely the 64-bit transition.

    For the K8 the switch from 32-bit to 64-bit was reported to have cost
    5%. You were there. Are the reports wrong?

    In addition, VAX is a 32-bit architecture. It was not necessary to
    extend it to 64 bits and do OoO at the same time. IBM stuck with its
    31-bit addresses in s390 until 2000 when it was extended to the 64-bit z/Architecture (and the first implementation, the z900 was scalar, not
    even in-order superscalar; they got superscalar in the z990 in 2003
    and apparently out-of-order with the z196 in 2011; but then, IBM's
    customers are probably less performance-sensitive than DEC's customers
    used to be). Intel only delivered Merced (IA-64) in 2001 and
    delivered Nocona (AMD64) in 2004.

    Sure, there was marketing pressure to deliver 64-bit architectures
    early, but I think that a competitive 32-bit OoO VAX in 1996 with an
    announcement of a future 64-bit extension would have been fine
    wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they shrank
    the 21264 to 0.25um in 1999) would have certainly made good on that
    promise.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Dallman@21:1/5 to Anton Ertl on Sat Sep 7 16:31:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Sure, there was marketing pressure to deliver 64-bit architectures
    early, but I think that a competitive 32-bit OoO VAX in 1996 with an
    announcement of a future 64-bit extension would have been fine
    wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they
    shrank the 21264 to 0.25um in 1999) would have certainly made good
    on that promise.

    VAX had initially been very successful for the late 1970s and early 1980s
    in technical computing, because it was performance-competitive and had a
    better operating system than any of the other superminis of the time.

    Then multiple RISCs with Unix came along, which were cheaper, had equal
    or better performance, and a satisfactory operating system. Those ate
    DEC's technical computing market quite fast, but its business IT market
    lasted longer.

    The technical computing market was /much/ more interested in 64-bit than
    the business IT market. When I got involved at a software supplier for technical computing in 1995, VAX was not performance-competitive and was
    on the way out, but Alpha was the fastest thing around until Pentium Pro, stayed competitive for a couple more years, and didn't die out completely
    until 2002 or so.

    DEC seem to have concluded in 1988 that they could not keep VAX
    performance competitive with the RISCs of the time at a competitive price. Also, 64-bit ASAP was necessary to retain their part of the technical
    computing market and try to win some of it back.

    Trying to hold on with VAX, in the hope technology would emerge that
    would make it practical, without a clear idea of when or what that would
    be is not something that shareholders will tolerate for very long.

    John

  • From Anton Ertl@21:1/5 to John Dallman on Sat Sep 7 15:52:17 2024
    [email protected] (John Dallman) writes:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Sure, there was marketing pressure to deliver 64-bit architectures
    early, but I think that a competitive 32-bit OoO VAX in 1996 with an
    announcement of a future 64-bit extension would have been fine
    wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they
    shrank the 21264 to 0.25um in 1999) would have certainly made good
    on that promise.

    VAX had initially been very successful for the late 1970s and early 1980s
    in technical computing, because it was performance-competitive and had a
    better operating system than any of the other superminis of the time.

    Then multiple RISCs with Unix came along, which were cheaper, had equal
    or better performance, and a satisfactory operating system. Those ate
    DEC's technical computing market quite fast, but its business IT market
    lasted longer.

    The technical computing market was /much/ more interested in 64-bit than
    the business IT market. When I got involved at a software supplier for
    technical computing in 1995, VAX was not performance-competitive and was
    on the way out, but Alpha was the fastest thing around until Pentium Pro,
    stayed competitive for a couple more years, and didn't die out completely
    until 2002 or so.

    DEC seem to have concluded in 1988 that they could not keep VAX
    performance competitive with the RISCs of the time at a competitive price.

    Not really. They were still burning lots of money on the VAX 9000, a
    dead end.

    They stopped doing new VAX designs after the NVAX/NVAX+ was introduced
    in 1991 (Alpha was introduced in 1992). There was the NVAX++ shrink
    that improved the clock rate.

    Also, 64-bit ASAP was necessary to retain their part of the technical
    computing market and try to win some of it back.

    Trying to hold on with VAX, in the hope technology would emerge that
    would make it practical, without a clear idea of when or what that would
    be is not something that shareholders will tolerate for very long.

    They tolerated it for Intel and for IBM. Ok, IBM introduced Power for
    the technical market, maybe that would have been the way to go for
    DEC: continue with MIPS for the technical market, and continue with
    VAX for the business market.

    The Alpha with VMS and with translation to run VAX (and DecStation)
    software sounds plausible, but somehow it did not work, neither for
    keeping the technical nor the business customers, even though Alpha
    was very competitive until the late 1990s.

    Maybe the business customers would not have defected without the
    VAX->Alpha transition, or maybe they would still have defected (they
    were DEC customers instead of IBM customers for a reason).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to Anton Ertl on Sat Sep 7 16:37:00 2024
    [email protected] (Anton Ertl) writes:
    Brett <[email protected]> writes:

    I did not write about caches, but yes, for TLBs a (the?) solution is
    to have the ITLB to be at least 6-way.

    It's unclear how pipelining should help. The VAX 11/780 was not much
    pipelined and can also do the memory accesses one after the other;
    this did not protect it from the complexity coming from x memory
    accesses in a single instruction. E.g., all the pages accessed by an
    instruction have to be in physical memory, or maybe support
    interruptable instructions; in any case, there is complexity.

    MOVC3/MOVC5 were interruptable; specifically to handle page faults
    (and to reduce interrupt latency).

    Given the arguments were in registers, interruptibility in that case
    was just "restart with current register values". Similar to x86
    REP string instructions.

  • From John Levine@21:1/5 to All on Sat Sep 7 21:17:42 2024
    According to Anton Ertl <[email protected]>:
    Given modern OoO technology, even VAX can fly. It does not matter
    whether, say,

    *a++ = *b++ + *c++;

    is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
    RISC-V instructions, what goes on inside the OoO engine is pretty
    similar in all cases, and so is the performance.

    It is my impression that unwinding all the side effects if the reference to "c" causes a
    page fault was painful. Particularly keeping in mind that b and c could be the same
    register, and if the code were this:

    *a++ = *b++ - *b++

    the order of increments and fetches matters.

    If you split it into four ARM instructions, a fault just has to restart one of those
    instructions which will have at most one register to fix up.

    It is my impression that even when the Vax was designed, it was already becoming evident
    that the Vax's super dense super encoded instruction set was not going to be a long term
    winner. The IBM 801 project was well along in 1975 when they started designing the Vax.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Sat Sep 7 23:38:45 2024
    On Sat, 07 Sep 2024 16:37:00 GMT, Scott Lurndal wrote:

    MOVC3/MOVC5 were interruptable ...

    Given the arguments were in registers ...

    The operands were in the perfectly general descriptor format, same as with nearly every other instruction.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Sep 8 00:52:25 2024
    On Fri, 6 Sep 2024 6:05:35 +0000, Anton Ertl wrote:

    [email protected] (John Dallman) writes:
    Memory-to-memory instructions, in general, are hard to get to run fast
    with today's processors and memory, simply because memory access times
    are long enough for many register-to-register instructions to execute.

    Given modern OoO technology, even VAX can fly. It does not matter
    whether, say,

    *a++ = *b++ + *c++;

    is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
    RISC-V instructions, what goes on inside the OoO engine is pretty
    similar in all cases, and so is the performance.

    When I faced a similar set of desires, I had my movememory MM
    instruction do::

    for( control_register = 0;
         control_register < Rs3;
         control_register += size )
        rd[control_register] = rs1[control_register];

    where size is determined by alignment, memory type (from PTE).
    Thus, when a page fault or interrupt cuts the instruction in
    the middle I don't have to recover any of the registers.

    This also allows the instruction to be thrown over to the
    Memory Function Unit and be performed in parallel with other
    calculation instructions.
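
    The restart property described above can be sketched in software. This is only an illustration under simplifying assumptions: it copies one byte per step (the real step size would depend on alignment and memory type), and `MMState`, `mm_run` and the `budget` parameter are hypothetical names, not part of My 66000.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the control register: bytes copied so far. */
typedef struct {
    size_t done;
} MMState;

/* Copies count bytes from src to dst, resuming at cr->done. The budget
   models an interrupt or page fault arriving after that many bytes.
   Returns true once the whole copy has completed. The pointer and
   count arguments are never modified, so a restart just calls again. */
bool mm_run(uint8_t *dst, const uint8_t *src, size_t count,
            MMState *cr, size_t budget)
{
    for (size_t moved = 0; moved < budget && cr->done < count; moved++) {
        dst[cr->done] = src[cr->done];
        cr->done++;
    }
    return cr->done == count;
}
```

    Because all progress lives in the one control register, an interrupted copy is resumed by re-issuing the same call with the same arguments; nothing has to be unwound.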

    Getting back to the originating:: It is faster these days to
    write::
    a[i] = b[i] + c[i];i++;
    than the pre/post increment/decrement style of PDP-11.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 8 02:03:44 2024
    On Sun, 8 Sep 2024 00:52:25 +0000, MitchAlsup1 wrote:

    Getting back to the originating:: It is faster these days to write::
    a[i] = b[i] + c[i];i++;
    than the pre/post increment/decrement style of PDP-11.

    That’s how I normally write the code in a high-level language, and have
    done so since the 1980s.

    I figured any decent compiler would be able to attend to the details I couldn’t be bothered thinking about. ;)

  • From Anton Ertl@21:1/5 to John Levine on Sun Sep 8 13:55:11 2024
    John Levine <[email protected]> writes:
    According to Anton Ertl <[email protected]>:
    Given modern OoO technology, even VAX can fly. It does not matter
    whether, say,

    *a++ = *b++ + *c++;

    is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
    RISC-V instructions, what goes on inside the OoO engine is pretty
    similar in all cases, and so is the performance.

    It is my impression that unwinding all the side effects if the
    reference to "c" causes a page fault was painful.

    Yes, that was certainly a problem when using the implementation
    techniques of the day. With an OoO implementation, if any of the
    operations of the instruction causes an exception, none of the
    results of any of the operations are committed. Problem solved.

    Or almost: I expect that it's more complex to implement a reorder
    buffer that deals with such monster instructions than one that deals
    just with RISC-V instructions.

    Particularly
    keeping in mind that b and c could be the same register, and if the
    code were this:

    *a++ = *b++ - *b++

    the order of increments and fetches matters.

    Yes, but the decoder produces operations as defined by the
    architecture. I don't know how VAX specifies the order, but a simple translation could be

    # at the start, b is in p1, and a is in p6
    p0 = *p1 #*b
    p2 = p1+4 #b++
    p3 = *p2 #*b
    p4 = p2+4 #b++
    p5 = p0-p3 #*b - *b
    *p6= p5 #*a = ...
    p7 = p6+4 #a++
    #at the end, b is in p4 and a is in p7

    where p0..p7 are physical registers. If there is an exception in any
    of the operations, b stays in p1 and a stays in p6.
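
    The commit-or-discard behaviour can be imitated in a toy model. This is a sketch under stated assumptions, not a description of any real core: memory is word-addressed, a per-word `mapped` flag stands in for page faults, local variables play the role of the physical registers, and the first-fetch-minus-second order is assumed rather than taken from the VAX specification. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MEM_WORDS = 16 };

typedef struct {
    int32_t mem[MEM_WORDS];     /* toy word-addressed memory        */
    bool    mapped[MEM_WORDS];  /* false means the access faults    */
} Machine;

/* Models  *a++ = *b++ - *b++  with a and b as word indices.
   On success the store and both register updates are committed and
   true is returned. On a fault nothing architectural changes. */
bool sub_autoinc(Machine *m, size_t *a, size_t *b)
{
    size_t p1 = *b, p6 = *a;            /* architectural values at start */
    if (p1 >= MEM_WORDS - 1 || p6 >= MEM_WORDS)
        return false;
    if (!m->mapped[p1]) return false;
    int32_t p0 = m->mem[p1];            /* first *b  */
    size_t  p2 = p1 + 1;                /* b++       */
    if (!m->mapped[p2]) return false;
    int32_t p3 = m->mem[p2];            /* second *b */
    size_t  p4 = p2 + 1;                /* b++       */
    int32_t p5 = p0 - p3;
    if (!m->mapped[p6]) return false;
    /* Commit point: everything becomes visible together. */
    m->mem[p6] = p5;
    *a = p6 + 1;
    *b = p4;
    return true;
}
```

    Every early return happens before the commit point, which is the software analogue of b staying in p1 and a staying in p6 when an exception occurs.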

    It is my impression that even when the Vax was designed, it was
    already becoming evident that the Vax's super dense super encoded
    instruction set was not going to be a long term winner. The IBM 801
    project was well along in 1975 when they started designing the Vax.

    The question is how much was known about the IBM 801 at the time.
    According to <https://en.wikipedia.org/wiki/OpenVMS>, the VAX project
    started in April 1975. Data General's Fountainhead project (FHP)
    started in July 1975. Intel started the iAPX 432 in 1975 or 1976,
    Zilog started the Z8000 after recruiting Bernard Peuto in March 1976 <https://thechipletter.substack.com/p/captain-zilog-crushed-the-story-of>. Motorola started the 68000 project in late 1976, and National
    Semiconductor obviously knew about the VAX when they designed the
    32016 (they originally wanted to implement the VAX instruction set,
    but in the end did something incompatible for legal reasons). All
    these projects used CISCy designs rather than RISCy designs. FHP was
    a bit special in making the writable control store an architectural
    feature (so it did not have just one instruction set); the thinking
    behind it is the "closing the semantic gap" idea that gave us
    architectures like the VAX.

    The first commercial RISCs were delivered in 1986 (including from IBM
    itself). Apparently the industry took that long to absorb the ideas
    from the IBM 801 and turn them into a commercial product.

    It would be interesting to take a time machine to, say, 1976, to go to
    any of these companies and try to convince them to do a RISCy CPU.
    How hard would it be to convince them? Would technical arguments be sufficient, or would one have to wave with money (as a customer or
    investor)? And how would such a CPU do in the marketplace?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to All on Sun Sep 8 17:56:55 2024
    The problem with VAX was NOT that one could not put a lot of
    work in a single instruction;

    no,

    The problem with VAX is that it made putting too much work
    in a single instruction easy.

  • From Brett@21:1/5 to Anton Ertl on Sun Sep 8 19:15:25 2024
    Anton Ertl <[email protected]> wrote:
    John Levine <[email protected]> writes:
    According to Anton Ertl <[email protected]>:
    Given modern OoO technology, even VAX can fly. It does not matter
    whether, say,

    *a++ = *b++ + *c++;

    is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
    RISC-V instructions, what goes on inside the OoO engine is pretty
    similar in all cases, and so is the performance.

    It is my impression that unwinding all the side effects if the
    reference to "c" causes a page fault was painful.

    Yes, that was certainly a problem when using the implementation
    techniques of the day. With an OoO implementation, if any of the
    operations of the instruction causes an exception, none of the
    results of any of the operations are committed. Problem solved.

    Or almost: I expect that it's more complex to implement a reorder
    buffer that deals with such monster instructions than one that deals
    just with RISC-V instructions.

    Particularly
    keeping in mind that b and c could be the same register, and if the
    code were this:

    *a++ = *b++ - *b++

    the order of increments and fetches matters.

    Yes, but the decoder produces operations as defined by the
    architecture. I don't know how VAX specifies the order, but a simple translation could be

    # at the start, b is in p1, and a is in p6
    p0 = *p1 #*b
    p2 = p1+4 #b++
    p3 = *p2 #*b
    p4 = p2+4 #b++
    p5 = p0-p3 #*b - *b
    *p6= p5 #*a = ...
    p7 = p6+4 #a++
    #at the end, b is in p4 and a is in p7

    where p0..p7 are physical registers. If there is an exception in any
    of the operations, b stays in p1 and a stays in p6.

    It is my impression that even when the Vax was designed, it was
    already becoming evident that the Vax's super dense super encoded
    instruction set was not going to be a long term winner. The IBM 801
    project was well along in 1975 when they started designing the Vax.

    The question is how much was known about the IBM 801 at the time.
    According to <https://en.wikipedia.org/wiki/OpenVMS>, the VAX project
    started in April 1975. Data General's Fountainhead project (FHP)
    started in July 1975. Intel started the iAPX 432 in 1975 or 1976,
    Zilog started the Z8000 after recruiting Bernard Peuto in March 1976 <https://thechipletter.substack.com/p/captain-zilog-crushed-the-story-of>. Motorola started the 68000 project in late 1976, and National
    Semiconductor obviously knew about the VAX when they designed the
    32016 (they originally wanted to implement the VAX instruction set,
    but in the end did something incompatible for legal reasons). All
    these projects used CISCy designs rather than RISCy designs. FHP was
    a bit special in making the writable control store an architectural
    feature (so it did not have just one instruction set); the thinking
    behind it is the "closing the semantic gap" idea that gave us
    architectures like the VAX.

    The first commercial RISCs were delivered in 1986 (including from IBM itself). Apparently the industry took that long to absorb the ideas
    from the IBM 801 and turn them into a commercial product.

    It would be interesting to take a time machine to, say, 1976, to go to
    any of these companies and try to convince them to do a RISCy CPU.
    How hard would it be to convince them? Would technical arguments be sufficient, or would one have to wave with money (as a customer or
    investor)? And how would such a CPU do in the marketplace?

    The IBM 801 was boring and did not have a patent moat protecting it.

    We have talent, we can build something more complex that keeps out
    competition that does not have our talent.

    They had no idea that complexity doubling every two years was going to
    crush all those complex ideas. Instead they thought more transistors would
    keep letting them add ever more complex features.

    They had no idea that they would crash into the clock speed wall, and if
    they did that argues for more complexity in the same clock time to get more done.

    They had no idea that they would be building eight wide designs.
    This is the critical idea that made RISC popular. They figured out that
    they had been too smart for their own good.

    We are post RISC now and adding complexity that gets more work done per operation, with less tracking. Three sources and two destinations will be
    the rule. Load with address update, add with shift, three way add with
    logical operations is next. The FPU already has MAC.

  • From Lawrence D'Oliveiro@21:1/5 to Brett on Sun Sep 8 21:01:50 2024
    On Sun, 8 Sep 2024 19:15:25 -0000 (UTC), Brett wrote:

    We are post RISC now and adding complexity that gets more work done per operation, with less tracking. Three sources and two destinations will
    be the rule. Load with address update, add with shift, three way add
    with logical operations is next. The FPU already has MAC.

    Does sound rather like another variant on Ivan Sutherland’s Wheel of Reincarnation, doesn’t it?

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun Sep 8 21:03:31 2024
    On Sun, 8 Sep 2024 20:20:26 +0000, Thomas Koenig wrote:

    Brett <[email protected]> schrieb:

    [VAX]

    They had no idea that they would be building eight wide designs.
    This is the critical idea that made RISC popular.

    Nope.

    The early RISC designs aimed for one instruction per cycle, achieved
    maybe 0.7.

    But they were competing against processors with 4+ clocks per
    instruction.

    S.E.L 32/87 had an IBM 360-like ISA and also achieved 0.7 I/C largely
    because it was NOT microcoded for 95% of the instructions executed,
    but well pipelined. When it encountered an instruction that required
    microcode to complete, the HW did the first cycle and then let micro-
    code take over, and was ready to switch back to HW-control without
    wasting a clock.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 8 21:09:39 2024
    On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

    The problem with VAX was NOT that one could not put a lot of work in a
    single instruction;

    no,

    The problem with VAX is that it made putting too much work in a single instruction easy.

    Perhaps there is also the issue of the wildly-variable instruction length.
    A single VAX operand descriptor could be up to 6 bytes; I think the
    instruction with the most general-format operands could have 6 of them:
    so, plus opcode, such an instruction could be 37 bytes long.

    While the shortest instruction could be just 1 byte.
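
    The 37-byte figure is just the arithmetic from the numbers quoted above, spelled out here as a tiny sketch. It takes the post's figures at face value (1 opcode byte, 6 operands, 6 bytes per specifier, read as index prefix + mode byte + 32-bit displacement) and does not account for longer immediates; the function name is made up.

```c
#include <stddef.h>

/* Worst-case instruction length from the figures quoted in the post. */
size_t max_insn_bytes(size_t opcode_bytes, size_t operands,
                      size_t max_spec_bytes)
{
    return opcode_bytes + operands * max_spec_bytes;
}
```

    With `max_insn_bytes(1, 6, 1 + 1 + 4)` this gives 1 + 6 * 6 = 37, against a minimum of 1 byte.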

    Even those who are talking about “post-RISC” are, I think, still in favour of RISC-style fixed instruction lengths.

  • From Thomas Koenig@21:1/5 to Brett on Sun Sep 8 20:20:26 2024
    Brett <[email protected]> schrieb:

    [VAX]

    They had no idea that they would be building eight wide designs.
    This is the critical idea that made RISC popular.

    Nope.

    The early RISC designs aimed for one instruction per cycle, achieved
    maybe 0.7.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Sep 9 00:27:39 2024
    On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

    The problem with VAX was NOT that one could not put a lot of work in a
    single instruction;

    no,

    The problem with VAX is that it made putting too much work in a single
    instruction easy.

    Perhaps there is also the issue of the wildly-variable instruction
    length.
    A single VAX operand descriptor could be up to 6 bytes; I think the instruction with the most general-format operands could have 6 of them:
    so, plus opcode, such an instruction could be 37 bytes long.

    I have not heard an argument that the complex things in VAX ISA are
    a) desirable
    b) performance helpful

    I (sort of) think VAX ISA as a grown up PDP-11, ignoring all the
    dastardly complicated instructions it inflicted upon itself. AND
    it did inflict those things upon itself.

    Restricting a new-VAX-like ISA to 1-2-3 Operand and 1-result with
    at most 1 exception would result in a MUCH cleaner and easier to
    build machine.

    While the shortest instruction could be just 1 byte.

    Even those who are talking about “post-RISC” are, I think, still in favour of RISC-style fixed instruction lengths.

    I, for the record, am in favor of a fixed-length instruction-specifier
    followed by constants: the entirety is the instruction, while the
    fixed-length specifier minimizes your ability to shoot yourself in the foot.

  • From Brett@21:1/5 to Thomas Koenig on Mon Sep 9 04:38:42 2024
    Thomas Koenig <[email protected]> wrote:
    Brett <[email protected]> schrieb:

    [VAX]

    They had no idea that they would be building eight wide designs.
    This is the critical idea that made RISC popular.

    Nope.

    The early RISC designs aimed for one instruction per cycle, achieved
    maybe 0.7.

    The next step up for a CPU has one ALU and one load/store unit, giving
    above one IPC. This is what one of the PlayStation CPUs did.

    The yellow brick road to eight way was apparent with the first RISC architecture, even if not fully implemented due to time and die size.

  • From Lawrence D'Oliveiro@21:1/5 to Brett on Mon Sep 9 06:21:08 2024
    On Mon, 9 Sep 2024 04:38:42 -0000 (UTC), Brett wrote:

    Thomas Koenig <[email protected]> wrote:

    The early RISC designs aimed for one instruction per cycle, achieved
    maybe 0.7.

    The next step up for a CPU has one ALU and one load/store unit, giving
    above one IPC. This is what one of the PlayStation CPUs did.

    Those were the ones using PowerPC chips in the 1990s, I think it was.
    IBM’s POWER claimed superscalar performance right from its launch in, what was it, 1989.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Sep 9 06:50:17 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 9 Sep 2024 04:38:42 -0000 (UTC), Brett wrote:
    The next step up for a CPU has one ALU and one load/store unit, giving
    above one IPC. This is what one of the PlayStation CPUs did.

    Those were the ones using PowerPC chips in the 1990s, I think it was.

    The first PlayStation used a 33MHz R3000 (single-issue).

    The PS2 released in 2000 used a 299MHz MIPS R5900-based core, two-way in-order superscalar.

    The PS3 released in 2006 used the PowerPC-based Cell broadband engine.

    IBM’s POWER claimed superscalar performance right from its launch in, what
    was it, 1989.

    1990.

    It's interesting that it took so long to go to dual-issue with the
    same number of functional units. I guess that the early RISCs were bandwidth-limited, and only once the L1 cache(s) came on-chip, was
    there enough bandwidth to make superscalarity actually pay off.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Sep 9 08:03:00 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    Perhaps there is also the issue of the wildly-variable instruction length.
    A single VAX operand descriptor could be up to 6 bytes; I think the
    instruction with the most general-format operands could have 6 of them:
    so, plus opcode, such an instruction could be 37 bytes long.

    The regularity of the VAX operand formats may actually help build the
    decoder: Decode your byte stream as possible operands, and then let
    the instruction decoder pick the real operands from the potential
    operands.
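
    A rough sketch of that idea follows: compute a candidate specifier length at every byte offset (each one independently, which is what hardware could do in parallel), then walk the chain starting after the opcode. The mode table is the VAX operand specifier encoding as I understand it (mode in the high nibble, register in the low nibble) and should be checked against the architecture manual. The fixed `opsize` for immediates and all function names are simplifying assumptions, and architecturally invalid combinations are not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of a non-indexed specifier whose first byte is spec,
   or 0 for an index prefix (handled by the caller). */
static size_t base_len(uint8_t spec, size_t opsize)
{
    unsigned mode = spec >> 4, reg = spec & 0x0Fu;
    switch (mode) {
    case 0: case 1: case 2: case 3: return 1;   /* short literal        */
    case 5: case 6: case 7:         return 1;   /* Rn, (Rn), -(Rn)      */
    case 8:  return reg == 15 ? 1 + opsize : 1; /* (Rn)+ or immediate   */
    case 9:  return reg == 15 ? 5 : 1;          /* @(Rn)+ or absolute   */
    case 10: case 11: return 2;                 /* byte displacement    */
    case 12: case 13: return 3;                 /* word displacement    */
    case 14: case 15: return 5;                 /* long displacement    */
    default: return 0;                          /* mode 4: index prefix */
    }
}

/* Candidate specifier length at buf[pos], or 0 if it cannot fit. */
size_t spec_len(const uint8_t *buf, size_t n, size_t pos, size_t opsize)
{
    size_t len;
    if (pos >= n)
        return 0;
    if ((buf[pos] >> 4) == 4) {                 /* index prefix [Rx] */
        if (pos + 1 >= n)
            return 0;
        size_t b = base_len(buf[pos + 1], opsize);
        if (b == 0)
            return 0;
        len = 1 + b;
    } else {
        len = base_len(buf[pos], opsize);
    }
    return (pos + len <= n) ? len : 0;
}

/* Total instruction length, or 0 if the bytes do not hold a complete
   instruction. Only the first 64 bytes of buf are examined. */
size_t insn_len(const uint8_t *buf, size_t n, size_t opcode_bytes,
                size_t noperands, size_t opsize)
{
    size_t cand[64];
    if (n > 64)
        n = 64;
    for (size_t i = 0; i < n; i++)              /* speculative, per offset */
        cand[i] = spec_len(buf, n, i, opsize);
    size_t pos = opcode_bytes;
    for (size_t k = 0; k < noperands; k++) {    /* pick the real operands  */
        if (pos >= n || cand[pos] == 0)
            return 0;
        pos += cand[pos];
    }
    return pos;
}
```

    For example, the bytes C1 81 A2 04 53 (ADDL3 with an autoincrement, a byte-displacement and a register operand) come out as 5 bytes, and C1 8F 78 56 34 12 51 52 (a longword immediate as first operand) as 8.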

    Even those who are talking about “post-RISC” are, I think, still in favour
    of RISC-style fixed instruction lengths.

    Even among the RISCs, fixed instruction lengths are not universal: ARM
    T32 has two widths, as has RV64GC (and RISC-V has provisions for
    additional lengths, but AFAIK nobody uses them yet); there was also
    ROMP and MIPS16.

    Interestingly, despite their ample experience with T32, ARM went
    fixed-length with A64, but then the market for A64 is probably not as
    code-size sensitive as that for T32.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 12:32:34 2024
    On Mon, 09 Sep 2024 08:03:00 GMT
    [email protected] (Anton Ertl) wrote:


    Interestingly, despite their ample experience with T32, ARM went
    fixed-length with A64, but then the market for A64 is probably not as code-size sensitive as that for T32.

    - anton

    ARMv9-M is still T32 which probably should tell us something.
    Or maybe not.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Sep 9 09:52:53 2024
    On Mon, 09 Sep 2024 08:03:00 GMT, Anton Ertl wrote:

    ... ARM T32 has two widths, as has RV64GC (and RISC-V has provisions for additional lengths, but AFAIK nobody uses them yet); there was also ROMP
    and MIPS16.

    That’s all very well. But none of them go to the extremes that VAX did:
    37:1, remember.

  • From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 11:33:37 2024
    Michael S <[email protected]> writes:
    ARMv9-M is still T32 which probably should tell us something.

    It tells me that ARM sees a market (covered by the M profile) where
    4GB of address space is sufficient and where code size is relevant.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 15:15:59 2024
    On Mon, 09 Sep 2024 11:33:37 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    ARMv9-M is still T32 which probably should tell us something.

    It tells me that ARM sees a market (covered by the M profile) where
    4GB of address space is sufficient and where code size is relevant.

    - anton

    I think that the sad (for Arm) truth is that ARMv8-M was unnecessary
    except for tiny niches, and ARMv9-M even more so. ARMv7-M works fine for
    the overwhelming majority of users.
    So even if they somehow invent a fixed-width 32-bit architecture that
    matches T32 in code density and then implement it in a core that matches
    Cortex-M4 in performance per clock, but occupies 10% less area and
    clocks 10% faster on the same process, their major licensees (STMicro,
    TI, NXP) wouldn't be willing to pay 1 cent more for that core than what
    they are currently paying for M4.

  • From Brett@21:1/5 to [email protected] on Mon Sep 9 19:38:52 2024
    MitchAlsup1 <[email protected]> wrote:
    On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

    The problem with VAX was NOT that one could not put a lot of work in a
    single instruction;

    no,

    The problem with VAX is that it made putting too much work in a single
    instruction easy.

    Perhaps there is also the issue of the wildly-variable instruction
    length.
    A single VAX operand descriptor could be up to 6 bytes; I think the
    instruction with the most general-format operands could have 6 of them:
    so, plus opcode, such an instruction could be 37 bytes long.

    I have not heard an argument that the complex things in VAX ISA are
    a) desirable
    b) performance helpful

    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.

    I (sort of) think of the VAX ISA as a grown-up PDP-11, ignoring all the
    dastardly complicated instructions it inflicted upon itself. AND
    it did inflict those things upon itself.

    Restricting a new-VAX-like ISA to 1-2-3 Operand and 1-result with
    at most 1 exception would result in a MUCH cleaner and easier to
    build machine.

    While the shortest instruction could be just 1 byte.

    Even those who are talking about “post-RISC” are, I think, still in
    favour of RISC-style fixed instruction lengths.

    I, for the record, am in favor of a fixed-length instruction-specifier
    followed by constants; the entirety is the instruction, while the
    former minimizes your ability to shoot yourself in the foot.

  • From John Levine@21:1/5 to All on Mon Sep 9 20:34:39 2024
    According to Anton Ertl <[email protected]>:
    Lawrence D'Oliveiro <[email protected]d> writes:
    Perhaps there is also the issue of the wildly-variable instruction length.
    A single VAX operand descriptor could be up to 6 bytes; I think the
    instruction with the most general-format operands could have 6 of them:
    so, plus opcode, such an instruction could be 37 bytes long.

    The regularity of the VAX operand formats may actually help build the
    decoder: Decode your byte stream as possible operands, and then let
    the instruction decoder pick the real operands from the potential
    operands.

    Urrgh. Some of those bogus operands are indirect indexed auto-increment, so you are going to be throwing away a whole lot of work.

    Compare that to zSeries, where even after 50 years of sticking new instructions into the holes in the S/360 instruction set, it can still tell the length of the
    instruction from the first two bits and the operands from the first byte.
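
    As a rough illustration of that length rule: in the S/360 family the two
    high-order bits of the first opcode byte give the instruction length
    (00 is 2 bytes, 01 and 10 are 4 bytes, 11 is 6 bytes). The Python sketch
    below models only that rule. The function name is made up for this
    example and nothing here is taken from a real decoder.

```python
def s360_instruction_length(first_byte):
    """Return the instruction length in bytes, given the first opcode byte.

    Hypothetical helper: it models only the two-bit length rule and does
    not check whether the opcode is actually defined.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must be in the range 0..255")
    top_two_bits = first_byte >> 6
    # 00 -> 2 bytes, 01 and 10 -> 4 bytes, 11 -> 6 bytes
    return (2, 4, 4, 6)[top_two_bits]


if __name__ == "__main__":
    # AR (0x1A), A (0x5A) and MVC (0xD2) are classic S/360 opcodes.
    for opcode in (0x1A, 0x5A, 0xD2):
        print(hex(opcode), s360_instruction_length(opcode))
```

    The point of the sketch is that the length falls out of a two-bit table
    lookup, with no need to look at any operand bytes first.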

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Brett on Mon Sep 9 20:44:00 2024
    On Mon, 9 Sep 2024 19:38:52 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

    The problem with VAX was NOT that one could not put a lot of work in a
    single instruction;

    no,

    The problem with VAX is that it made putting too much work in a single
    instruction easy.

    Perhaps there is also the issue of the wildly-variable instruction
    length.
    A single VAX operand descriptor could be up to 6 bytes; I think the
    instruction with the most general-format operands could have 6 of them:
    so, plus opcode, such an instruction could be 37 bytes long.

    I have not heard an argument that the complex things in VAX ISA are
    a) desirable
    b) performance helpful

    Speaking of complex things, have you looked at Swift output, as it
    checks all operations for overflow?

    You could add an exception type for that, saving huge numbers of
    correctly predicted branch instructions.

    Unlike RISC-V and many others, My 66000 has maskable integer exceptions.
    An exception can be routed directly to a signal handler of the current application (without a trip through GuestOS). GuestOS just has to
    configure where exceptions are delivered.

    The future of programming languages is type safe with checks, you need
    to get on that bandwagon early.

    This would/will happen faster when type-safe with checks are well
    represented in benchmarks used to measure various architectural
    things, and the exceptions and checks are actually utilized showing
    performance degradation of lesser endowed architectures.

  • From Anton Ertl@21:1/5 to Brett on Tue Sep 10 07:43:53 2024
    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly
    predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.

    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.
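
    To make the difference between the two flavours concrete, here is a
    small Python emulation of 32-bit addition. It is only a sketch: the
    function names are invented for this example, Python's OverflowError
    stands in for the hardware trap, and both operands are assumed to
    already be valid signed 32-bit values.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def add_modulo(a, b):
    """addu-style add: wrap around modulo 2**32, never trap."""
    return to_signed32(a + b)


def add_trapping(a, b):
    """add-style add: raise (our stand-in for a trap) on signed overflow."""
    exact = a + b
    if not INT32_MIN <= exact <= INT32_MAX:
        raise OverflowError("signed 32-bit overflow")
    return exact


if __name__ == "__main__":
    print(add_modulo(INT32_MAX, 1))      # wraps around to INT32_MIN
    try:
        add_trapping(INT32_MAX, 1)
    except OverflowError as exc:
        print("trap:", exc)
```

    Without the trapping form, a checking language has to follow each
    modulo add with an explicit compare-and-branch to get the same effect.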

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.

    RISC-V, another descendent of MIPS, has an add instruction that
    corresponds to MIPS' addu, and no instruction that corresponds to
    MIPS' add. They obviously don't think that there's a bandwagon. Note
    that RISC-V was designed after Swift was introduced.

    IA-32 got on that bandwagon early. It has a single-byte instruction
    INTO that traps if the overflow flag is set. The AMD64 instruction
    set is very similar to the IA-32 instruction set, but one of the few
    differences is that the INTO instruction was eliminated (its opcode
    is invalid in 64-bit mode). The AMD64 architects obviously
    don't think that there is a bandwagon.

    Apple has been designing their own silicon for a while, and they
    started developing Swift in 2010 (it was released in 2014). Yet they have not
    switched to an architecture like MIPS or Alpha, nor have they designed
    their own architecture or architecture extension that includes
    instructions like Alpha's addv or IA-32's INTO. Instead, they
    switched to ARM A64, which does not have such features, after
    starting work on Swift in 2010. They obviously don't think that there is
    such a bandwagon, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Niklas Holsti@21:1/5 to All on Tue Sep 10 11:13:11 2024
    On 2024-09-09 23:44, MitchAlsup1 wrote:
    On Mon, 9 Sep 2024 19:38:52 +0000, Brett wrote:

    [snip]

    The future of programming languages is type safe with checks, you need
    to get on that bandwagon early.

    This would/will happen faster when type-safe with checks are well
    represented in benchmarks used to measure various architectural
    things, and the exceptions and checks are actually utilized showing performance degradation of lesser endowed architectures.


    Not all the type-safe, checking languages are equal in that respect. In
    some languages, and I am thinking of Ada, the language design and the
    favored programming styles work to reduce the number of run-time checks required.

    In the Ada case, the ability to declare array types with
    programmer-chosen index types with bounded range, such as range-bounded integers or enumerations, means that the compiler can avoid indexing
    checks when the (sub)type of the index is known at compile time to fit
    within the index range of the array.

    It is also helpful that loop counters in Ada are also (sub)typed in the
    same way, which provides compile-time information on their range with
    respect to the index range of arrays accessed in the loop, even if the
    bounds of the range are not known at compile time.

    As a result of these language features, the matching programming styles,
    and the attention paid to these issues by the compilers, the number of
    run-time checks executed by an Ada program and their effect on execution
    time are often surprisingly small.

    So, to demonstrate the usefulness of HW support for checks, the
    benchmarks should use languages that require checks but do not have the features that let programmers reduce the number of checks by suitable programming styles. If Ada were used for the benchmarks, the programmers
    would have to use an abnormal, pessimal style that defeats the
    compiler's ability to elide checks.

  • From Anton Ertl@21:1/5 to John Levine on Tue Sep 10 08:05:07 2024
    John Levine <[email protected]> writes:
    According to Anton Ertl <[email protected]>:
    The regularity of the VAX operand formats may actually help build the
    decoder: Decode your byte stream as possible operands, and then let
    the instruction decoder pick the real operands from the potential
    operands.

    Urrgh. Some of those bogus operands are indirect indexed auto-increment, so you
    are going to be throwing away a whole lot of work.

    Yes, AFAIK that's how multi-instruction decoding for variable-width
    instruction sets works these days: Decode at every potential
    instruction start, then select those decoded instructions that are at
    actual instruction boundaries, and throw the others away.
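
    A toy model of that two-phase scheme, in Python. Everything here is an
    assumption made for illustration: the "ISA" takes its instruction
    length from the top two bits of the first byte, every byte offset is
    treated as a potential start, and the sequential loops stand in for
    what hardware would do in parallel.

```python
def toy_length(first_byte):
    """Length rule of the made-up ISA: top two bits select 2, 4, 4 or 6 bytes."""
    return (2, 4, 4, 6)[first_byte >> 6]


def decode_all_offsets(code):
    """Phase 1: speculatively 'decode' at every byte offset.

    Returns a list in which entry i is the length an instruction would
    have if one started at offset i. Most entries will be thrown away.
    """
    return [toy_length(byte) for byte in code]


def select_real_instructions(code, candidates):
    """Phase 2: keep only the candidates that sit on real boundaries."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        offset += candidates[offset]
    return starts


if __name__ == "__main__":
    code = bytes([0x1A, 0x12,
                  0x5A, 0x10, 0x20, 0x00,
                  0xD2, 0x03, 0x10, 0x00, 0x20, 0x00])
    candidates = decode_all_offsets(code)
    starts = select_real_instructions(code, candidates)
    print("real starts:", starts)
    print("discarded decodes:", len(candidates) - len(starts))
```

    In this example 12 speculative decodes yield 3 real instructions, which
    is the kind of wasted work being objected to above. The selection in
    phase 2 is the inherently serial part.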

    Compare that to zSeries, where even after 50 years of sticking new instructions
    into the holes in the S/360 instruction set, it can still tell the length of the
    instruction from the first two bits and the operands from the first byte.

    Good for sequential decoding, and maybe it makes parallel decoding
    cheaper (but OTOH, the first superscalar S/360 descendent came out in
    2000, 7 years after the superscalar Pentium, and the first OoO S/360
    descendent lagged the Pentium Pro by 14 years or so), but as the IIRC
    6-wide decoder of Alder Lake demonstrates, hardware designers are able
    to deal with instruction sets that do not have such nice properties:
    an AMD64 instruction can have a large number of prefixes, and I think
    that the encoding of indexed addressing is not announced in the first non-prefix instruction byte, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Tue Sep 10 12:08:40 2024
    On Tue, 10 Sep 2024 07:43:53 GMT
    [email protected] (Anton Ertl) wrote:

    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it
    checks all operations for overflow?

    You could add an exception type for that, saving huge numbers of
    correctly predicted branch instructions.

    The future of programming languages is type safe with checks, you
    need to get on that bandwagon early.

    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic).

    Trapping variants were deprecated in Release 6 of MIPS ISA.

    It has been abandoned and replaced by RISC-V several
    years ago.

    I don't think that "replaced by RISC-V" is a correct description of proceedings.


    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.

    RISC-V, another descendent of MIPS, has an add instruction that
    corresponds to MIPS' addu, and no instruction that corresponds to
    MIPS' add. They obviously don't think that there's a bandwagon. Note
    that RISC-V was designed after Swift was introduced.

    IA-32 got on that bandwagon early. It has a single-byte instruction
    INTO that traps if the overflow flag is set. The AMD64 instruction
    set is very similar to the IA-32 instruction set, but one of the few
    differences is that the INTO instruction was eliminated (its opcode
    is invalid in 64-bit mode). The AMD64 architects obviously
    don't think that there is a bandwagon.

    Apple has been designing their own silicon for a while, and they
    started developing Swift in 2010 (it was released in 2014). Yet they have not
    switched to an architecture like MIPS or Alpha, nor have they designed
    their own architecture or architecture extension that includes
    instructions like Alpha's addv or IA-32's INTO. Instead, they
    switched to ARM A64, which does not have such features, after
    starting work on Swift in 2010. They obviously don't think that there is
    such a bandwagon, either.

    - anton


    How does Intel MPX fit in your picture?

  • From Michael S@21:1/5 to Anton Ertl on Tue Sep 10 12:35:51 2024
    On Tue, 10 Sep 2024 08:05:07 GMT
    [email protected] (Anton Ertl) wrote:

    John Levine <[email protected]> writes:

    Compare that to zSeries, where even after 50 years of sticking new
    instructions into the holes in the S/360 instruction set, it can
    still tell the length of the instruction from the first two bits and
    the operands from the first byte.

    Good for sequential decoding, and maybe it makes parallel decoding
    cheaper (but OTOH, the first superscalar S/360 descendent came out in
    2000, 7 years after the superscalar Pentium, and the first OoO S/360 descendent lagged the Pentium Pro by 14 years or so),

    Wikipedia says that ES/9000 Model 900 had superscalar OoO CPU in 1991.
    This line was abandoned in favor of a simpler 'CMOS' line in the mid 90s, but
    according to the same Wiki article, the CMOS line didn't match the Model 900
    in performance until the 9672-RY5 near the end of 1997.

    but as the IIRC
    6-wide decoder of Alder Lake demonstrates, hardware designers are able
    to deal with instruction sets that do not have such nice properties:
    an AMD64 instruction can have a large number of prefixes, and I think
    that the encoding of indexed addressing is not announced in the first non-prefix instruction byte, either.

    - anton

    1. The longest AMD64 instruction is much shorter than the longest VAX instruction
    2. On AMD64 instruction length information is continuous. Yes, there
    could be multiple prefixes and it makes things ugly, but I would think
    that in practice you very rarely need to look at more than 5 leading
    bytes in order to figure out the length of the tail. And in practice
    it's probably o.k. when instructions with more than 3 prefixes are decoded
    slowly.

  • From Anton Ertl@21:1/5 to Michael S on Tue Sep 10 16:32:05 2024
    Michael S <[email protected]> writes:
    On Tue, 10 Sep 2024 08:05:07 GMT
    [email protected] (Anton Ertl) wrote:
    Good for sequential decoding, and maybe it makes parallel decoding
    cheaper (but OTOH, the first superscalar S/360 descendent came out in
    2000, 7 years after the superscalar Pentium

    Correction: The first superscalar CMOS S/360 descendent, the z990 came
    out in 2003, a decade after the Pentium, but see below about bipolar
    CPUs.

    and the first OoO S/360
    descendent lagged the Pentium Pro by 14 years or so),

    Wikipedia says that ES/9000 Model 900 had superscalar OoO CPU in 1991.

    Reading up on this, the article says even more:

    |models 900 and 820 had full out-of-order execution for both integer
    |and floating-point units, with precise exception handling, and a fully
    |superscalar pipeline.

    So these are probably the first proper OoO processors (while the
    360-91 was an interesting prototype, it was too limited to count as
    proper OoO CPU). The 900 ran at 111MHz, and the 1994-vintage 9X2 ran
    at 141MHz and was rated at 468MIPS (for 10 CPUs), i.e. each 141MHz CPU
    at 47MIPS. So that would be an IPC of 1/3, which is somewhat
    disappointing even for an early superscalar OoO machine. But then I
    don't know how IBM produces its MIPS ratings.
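
    The arithmetic behind that IPC estimate, spelled out as a small Python
    sketch. The helper names are made up for this example, and the result
    is only as meaningful as the MIPS ratings that go into it.

```python
def per_cpu_mips(total_mips, cpus):
    """Split a multi-CPU MIPS rating evenly across the CPUs."""
    return total_mips / cpus


def instructions_per_cycle(mips, clock_mhz):
    """MIPS and MHz are both 'millions per second', so the ratio is IPC."""
    return mips / clock_mhz


if __name__ == "__main__":
    # Figures quoted above for the 9X2: 468 MIPS for 10 CPUs at 141 MHz.
    mips_9x2 = per_cpu_mips(468, 10)
    print(f"per-CPU MIPS: {mips_9x2:.1f}")
    print(f"IPC estimate: {instructions_per_cycle(mips_9x2, 141):.2f}")
```

    This assumes the 10-CPU rating scales linearly down to one CPU, which
    is itself a guess.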

    This line was abandoned in favor of a simpler 'CMOS' line in the mid 90s, but
    according to the same Wiki article, the CMOS line didn't match the Model 900
    in performance until the 9672-RY5 near the end of 1997.

    A single-issue in-order CPU running at 370MHz with comparable per-CPU performance (and also 1-10CPUs); apparently 49MIPS for one CPU and
    447MIPS for 10. Again, the IPC seems abysmal, but who knows how IBM
    measures MIPS. Still, I expect that a contemporaneous Pentium II
    outperforms this 9672 by a lot on, say, SPEC95, just because of the
    basic technology and clock rate.

    It seems that during the late 1990s, IBM was not particularly
    interested in mainframe per-CPU performance.

    1. The longest AMD64 instruction is much shorter than the longest VAX
    instruction
    2. On AMD64 instruction length information is continuous. Yes, there
    could be multiple prefixes and it makes things ugly, but I would think
    that in practice you very rarely need to look at more than 5 leading
    bytes in order to figure out the length of the tail. And in practice
    it's probably o.k. when instructions with more than 3 prefixes are decoded
    slowly.

    Yes, you can always choose to take slow paths on rare cases, but you
    can also do that for a VAX decoder. I don't expect that the 37 bytes
    (or whatever it is) is the common case.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Michael S on Tue Sep 10 15:42:25 2024
    Michael S <[email protected]> writes:
    On Tue, 10 Sep 2024 07:43:53 GMT
    [email protected] (Anton Ertl) wrote:

    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it
    checks all operations for overflow?

    You could add an exception type for that, saving huge numbers of
    correctly predicted branch instructions.

    The future of programming languages is type safe with checks, you
    need to get on that bandwagon early.

    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic).

    Trapping variants were deprecated in Release 6 of MIPS ISA.

    Interesting. So they abandoned the supposed bandwagon in 2014, after
    Swift was introduced.

    What they did add in the same release are branch instructions that
    check whether the sum of two signed integers overflows. That's useful
    for languages with arbitrarily large integers (also known as Big
    Integers or Bignums), while the trapping adds are too cumbersome for
    that purpose.
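
    A sketch of why a branch suits that use better than a trap: the common
    case stays on a fixed-width fast path, and only an overflowing add
    takes the branch to the slow arbitrary-precision path. This is a Python
    model, not real runtime code. The 64-bit fixnum width and all names are
    assumptions for illustration, and Python's own integers play the role
    of the bignum library.

```python
FIXNUM_BITS = 64                       # assumed machine word width
MASK = (1 << FIXNUM_BITS) - 1
SIGN = 1 << (FIXNUM_BITS - 1)


def wrap(value):
    """Reduce value to a signed FIXNUM_BITS-wide two's-complement integer."""
    value &= MASK
    return value - (1 << FIXNUM_BITS) if value & SIGN else value


def add_or_promote(a, b):
    """Add two fixnums; fall back to an exact bignum add on overflow.

    Returns a (kind, value) pair so that the path taken is visible.
    """
    wrapped = wrap(a + b)              # what the machine adder would produce
    # Signed overflow: both operands differ in sign from the result.
    overflowed = ((a ^ wrapped) & (b ^ wrapped)) < 0
    if overflowed:
        return ("bignum", a + b)       # slow path, exact result
    return ("fixnum", wrapped)         # fast path


if __name__ == "__main__":
    print(add_or_promote(1, 2))
    print(add_or_promote((1 << 63) - 1, 1))
```

    With a trapping add, the slow path would have to be reached through a
    trap handler instead, which is the cumbersome part.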

    And it seems to me that Swift with its trapping arithmetic is a blast
    from the past (with Algol, Pascal etc. usually erroring out on
    overflow, and Ada raising an exception (with famously explosive
    consequences for the Ariane 5)), and that the trend in safe languages
    is to eliminate integer overflow by allowing arbitrarily large
    integers.

    It has been abandoned and replaced by RISC-V several
    years ago.

    I don't think that "replaced by RISC-V" is a correct description of
    proceedings.

    I don't know anything about the proceedings, just what Wikipedia tells
    me:

    |In March 2021, MIPS announced that the development of the MIPS
    |architecture had ended as the company is making the transition to
    |RISC-V.

    Sounds like a replacement to me.

    How does Intel MPX fit in your picture?

    I don't know anything about MPX beyond what Wikipedia says, which
    includes:

    |In practice, there have been too many flaws discovered in the design
    |for it to be useful, and support has been deprecated or removed from
    |most compilers and operating systems.

    Maybe a less flawed concept would have been more successful, but
    apparently MPX has had no such successor.

    Overall, languages that perform bounds checking seem on the rise,
    unlike languages that trap on signed integer overflow, so the window
    of opportunity for architectural support gets bigger.

    However, the question is if there is architectural support that is significantly better than what can be done with the current
    architectural features. SPARC has architectural tagging support for
    LISP, yet a comp.arch poster who worked on a major LISP implementation
    (Franz LISP IIRC) reported that their LISP implementation does not use
    these instructions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Niklas Holsti@21:1/5 to Anton Ertl on Tue Sep 10 20:55:37 2024
    On 2024-09-10 18:42, Anton Ertl wrote:

    And it seems to me that Swift with its trapping arithmetic is a
    blast from the past


    The dominance of C and its descendants has corrupted the world of
    programming on this point. :-(

    Fortunately, among up-and-coming new languages Rust is in the
    overflow-checking camp, at least in DEBUG-mode compilations.


    (with Algol, Pascal etc. usually erroring out on
    overflow, and Ada raising an exception (with famously explosive
    consequences for the Ariane 5)),


    A bit misleading, as so often when the Ariane 501 incident is brought up.

    The Ariane 501 failure was a HW trap on an instruction converting a floating-point value into a 16-bit integer, not an Ada exception.

    As I understand it, the analogous C code could have used the same
    instruction and failed in the same way (an example of Undefined Behavior).

    The original designers of that Ada SW had carefully analysed the
    possible ranges of the numbers and correctly concluded that an overflow
    could not happen if the HW operated correctly. Correctly, that is, for
    the Ariane 4, but not for the Ariane 5 where the SW was sloppily reused
    through multiple process skimps and failures.

    Several other similar conversions were protected with programmed range
    checks and suitable alternative code paths, but the analysis showed that
    this particular conversion did not need such checks for the Ariane 4.

    One of the process failures was that the SW was never tested with the
    Ariane 5 launch trajectory, which would have revealed the error.

    If the SW had really used Ada exceptions (difficult as the processor was
    quite maxed out) a reasonable SW designer would have added an exception
    handler and could have made this part of the SW fail gracefully. But
    the mission would probably not have been saved because the failure investigation found other potentially fatal flaws in the systems,
    pointing to more process failures.


    and that the trend in safe languages is to eliminate integer overflow
    by allowing arbitrarily large integers.


    That is not practical in a real-time, resource-limited context, at least
    not without a large over-provision of computing resources. And sending
    the resulting over-large integer to a HW register will still fail in
    some way if the value is too large for the HW to accept.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Sep 10 23:46:18 2024
    On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

    It seems that during the late 1990s, IBM was not particularly interested
    in mainframe per-CPU performance.

    Mainframes were never about CPU performance. They were about high I/O throughput for efficient batch operations.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Sep 10 23:51:20 2024
    On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

    [MIPS] has been abandoned and replaced by RISC-V several years ago.

    I’m not so sure the MIPS architecture has been “abandoned”. Last I heard, it was still shipping hundreds of millions of chips per year. Also those Chinese supers run LoongArch, which is some sort of MIPS derivative.

    It is true that there is no more money to be made from licensing any “MIPS IP”, which is why Imagination Tech, the inheritors of whatever was left of MIPS the commercial operation, have switched to being a RISC-V-centric
    company now.

  • From Lawrence D'Oliveiro@21:1/5 to Niklas Holsti on Tue Sep 10 23:47:53 2024
    On Tue, 10 Sep 2024 11:13:11 +0300, Niklas Holsti wrote:

    Not all the type-safe, checking languages are equal in that respect. In
    some languages, and I am thinking of Ada, the language design and the
    favored programming styles work to reduce the number of run-time checks required.

    True, and this was also demonstrated with Pascal before Ada.

    It’s a point that those who are accustomed to programming in C seem to find
    difficult to appreciate.

  • From John Levine@21:1/5 to Anton Ertl on Wed Sep 11 01:56:16 2024
    It appears that Anton Ertl <[email protected]> said:
    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.

    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    S/360 had signed and unsigned adds in the 1960s, with optional
    trapping for signed overflow. OS/360 let you catch the traps and z
    still does but it is not my impression that many programs did or do.


    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Thomas Koenig@21:1/5 to John Levine on Wed Sep 11 05:36:46 2024
    John Levine <[email protected]> schrieb:
    It appears that Anton Ertl <[email protected]> said:
    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.

    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    S/360 had signed and unsigned adds in the 1960s, with optional
    trapping for signed overflow. OS/360 let you catch the traps and z
    still does but it is not my impression that many programs did or do.

    With trapping, I understand. Without trapping - what is the
    difference on a two's complement machine?

  • From Anton Ertl@21:1/5 to Thomas Koenig on Wed Sep 11 06:24:55 2024
    Thomas Koenig <[email protected]> writes:
    John Levine <[email protected]> schrieb:
    S/360 had signed and unsigned adds in the 1960s, with optional
    trapping for signed overflow. OS/360 let you catch the traps and z
    still does but it is not my impression that many programs did or do.

    With trapping, I understand. Without trapping - what is the
    difference on a two's complement machine?

    Possibly in the flags set. The S/360 has a pretty perverse flags
    architecture.
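
    To illustrate the general point (not the S/360 condition code itself,
    whose exact encoding is not modelled here): on a two's-complement
    machine a signed and an unsigned add produce the same result bits, and
    only the derived status differs. The Python sketch below uses a
    made-up function name and generic carry and overflow flags.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000


def add32(a_bits, b_bits):
    """Add two 32-bit patterns.

    Returns (result_bits, carry_out, signed_overflow). The result bits do
    not depend on whether the operands are viewed as signed or unsigned.
    """
    total = a_bits + b_bits
    result = total & MASK32
    carry_out = total > MASK32          # what an unsigned add cares about
    # Signed overflow: operands share a sign that the result does not have.
    signed_overflow = bool(~(a_bits ^ b_bits) & (a_bits ^ result) & SIGN32)
    return result, carry_out, signed_overflow


if __name__ == "__main__":
    # Largest positive signed value plus one: overflow, but no carry.
    print(add32(0x7FFFFFFF, 0x00000001))
    # All ones plus one: carry, but no signed overflow (it is -1 + 1).
    print(add32(0xFFFFFFFF, 0x00000001))
```

    So a separate unsigned add only makes sense if it reports something
    different about the same sum.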

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Wed Sep 11 11:07:53 2024
    On Tue, 10 Sep 2024 15:42:25 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:

    How does Intel MPX fit in your picture?

    I don't know anything about MPX beyond what Wikipedia says, which
    includes:

    |In practice, there have been too many flaws discovered in the design
    |for it to be useful, and support has been deprecated or removed from
    |most compilers and operating systems.

    Maybe a less flawed concept would have been more successful, but
    apparently MPX has had no such successor.

    Overall, languages that perform bounds checking seem on the rise,
    unlike languages that trap on signed integer overflow, so the window
    of opportunity for architectural support gets bigger.


    Yes, I posted my questions without sufficient thinking.
    Intel MPX is about array bound checks which is a separate issue from
    catching signed overflow.
    The only commonality is that in both cases there is a potential for
    significant saving in code size if checks are handled by exception
    instead of conditional branch.

    However, the question is if there is architectural support that is significantly better than what can be done with the current
    architectural features. SPARC has architectural tagging support for
    LISP, yet a comp.arch poster who worked on a major LISP implementation
    (Franz LISP IIRC) reported that their LISP implementation does not use
    these instructions.

    - anton

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed Sep 11 11:54:25 2024
    On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

    [MIPS] has been abandoned and replaced by RISC-V several years ago.


    I’m not so sure the MIPS architecture has been “abandoned”. Last I heard, it was still shipping hundreds of millions of chips per year.

    Care to point to the source of this claim? Two main suppliers of MIPS
    silicon in this century are Microchip and Cavium (now owned by Marvell).

    According to my understanding Microchip's MIPS-based PIC32 line was
    never as popular as their other offerings.

    In case of Marvell, I no longer see MIPS-based Octeon III chips in the
    product section of their Web site. Which, I'd guess, means that in order
    to buy one you have to be an existing customer. Since the market that
    Octeon III was playing in is rather dynamic, I don't expect that those
    existing customers buy very old chips in tens of millions. Likely not
    even in single-digit millions.

    Also those Chinese supers run LoongArch, which is some sort of MIPS derivative.


    Sort of.
    And majority of my FPGA designs run Nios2 soft cores that are also
    'sort of MIPS'. But they are *not* MIPS.

    It is true that there is no more money to be made from licensing any
    “MIPS IP”, which is why Imagination Tech, the inheritors of whatever
    was left of MIPS the commercial operation, have switched to being a RISC-V-centric company now.

  • From David Brown@21:1/5 to Michael S on Wed Sep 11 11:40:23 2024
    On 11/09/2024 10:54, Michael S wrote:
    On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

    [MIPS] has been abandoned and replaced by RISC-V several years ago.


    I’m not so sure the MIPS architecture has been “abandoned”. Last I
    heard, it was still shipping hundreds of millions of chips per year.

    Care to point to the source of this claim? Two main suppliers of MIPS
    silicon in this century are Microchip and Cavium (now owned by Marvell).

    According to my understanding Microchip's MIPS-based PIC32 line was
    never as popular as their other offerings.

    IMHO a major reason for that is Microchip's insane licensing policy for
    their development tools - although their compilers are just a minor modification of standard gcc, you have to pay huge amounts if you want
    to use the full features of the compiler. (At least now you can enable
    /some/ optimisation without a paid license.) It is not even possible to
    see from the release notes or documentation what version of gcc is
    provided, though my guess is that it is pretty old (the documentation
    describes "-std" options up to C++14).

    The other reason, of course, was the name - "PIC" is associated with
    brain-dead microcontrollers with terrible C tools and which many people
    program in assembly. They are also renowned for being very solid,
    coming in relatively amateur-friendly packages, and for never going out
    of production.

    For some time now, Microchip's PIC32 line has all had ARM Cortex-M cores
    in new devices, based on SAM parts they got when they purchased Atmel.
    But they still sell the existing MIPS microcontrollers.


    Sort of.
    And majority of my FPGA designs run Nios2 soft cores that are also
    'sort of MIPS'. But they are *not* MIPS.

    I thought the NIOS 2 was more "MIPS inspired" than "sort of MIPS". (And
    the original NIOS was "SPARC inspired".) They have now jumped to NIOS V,
    which is RISC-V (actual RISC-V).

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Sep 11 09:32:04 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

    It seems that during the late 1990s, IBM was not particularly interested
    in mainframe per-CPU performance.

    Mainframes were never about CPU performance.

    The S/360 Model 91 and the Model 195 certainly were about the maximum
    CPU performance. And I doubt that IBM would have spent all the effort
    with ECL and a superscalar OoO implementation for some of the ES/9000
    machines if CPU performance was considered unimportant at the time.

    It's an interesting question why they did not follow up their
    superscalar OoO ECL implementations with a superscalar OoO CMOS
    implementation in addition to the scalar in-order 9672. Here are
    three speculations of what happened:

    1) They had such a project and it did not work out, and the "never
    about CPU performance" spin is a sour-grapes type rationalization of
    the result.

    2) They expected their mainframe market to be eaten up by the Unix
    and/or WNT markets, and did not want to invest a lot into the
    development of mainframe CPUs. Again, the "never about CPU
    performance" spin is a sour-grapes type rationalization of the result.

    3) They had decided that they had a captive market in the mainframes,
    with software that was written for lower-powered CPUs, that the rapid
    CMOS advances in the 1990s would give them enough of a performance
    push to satisfy the needs of this software, so no more sophisticated
    CPU designs than the 9672 were necessary (and the G5 and G6 of the 9672
    indeed gave them more CPU power than ever). The "never about CPU
    performance" reflected their position at the time and also served to
    placate anyone who pointed out that the per-CPU performance was
    inferior to that of other CPUs of the time, including IBM's own
    RS/6000 line.

    Eventually they seem to have decided that per-CPU performance is
    important after all, with the superscalar z990 in 2003 and the OoO
    z196 in 2010. But of course Dennard scaling was slowing down around
    2003, so they needed to increase IPC to increase per-CPU performance.
    And even if they don't need more per-CPU performance than other
    architectures, they apparently do need advances over earlier
    generations of their own machines and maybe to discourage competition
    from emulators or startups.

    They were about high I/O
    throughput for efficient batch operations.

    Batch operations? I wonder how much CPU time on mainframes in the
    1990s and today is spent on that compared to interactive applications
    such as online transaction processing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From jseigh@21:1/5 to Anton Ertl on Wed Sep 11 07:48:36 2024
    On 9/11/2024 2:24 AM, Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    John Levine <[email protected]> schrieb:
    S/360 had signed and unsigned adds in the 1960s, with optional
    trapping for signed overflow. OS/360 let you catch the traps and z
    still does but it is not my impression that many programs did or do.

    With trapping, I understand. Without trapping - what is the
    difference on a two's complement machine?

    Possibly in the flags set. The S/360 has a pretty perverse flags architecture.

    In the EC PSW from my yellow card
    bit 20 fixed-point overflow mask
    bit 21 decimal overflow mask
    bit 22 exponent underflow mask
    bit 23 significance mask

    BC mode was bits 36 thru 39.
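
    A rough sketch of how those EC-mode mask bits could be pulled out of
    a 64-bit PSW value, assuming IBM bit numbering (bit 0 is the most
    significant bit). The helper names psw_bit and enabled_program_masks
    are invented for illustration. Bit 22 is labelled "exponent
    underflow" here, which is how I recall the Principles of Operation
    naming it; treat that label as an assumption:

```python
PSW_WIDTH = 64

# EC-mode program mask bits, keyed by IBM bit number (bit 0 = MSB).
PROGRAM_MASK_BITS = {
    20: "fixed-point overflow",
    21: "decimal overflow",
    22: "exponent underflow",  # assumed label, see the note above
    23: "significance",
}


def psw_bit(psw, bit):
    # IBM numbering counts from the left, so bit n sits at shift 63 - n.
    return (psw >> (PSW_WIDTH - 1 - bit)) & 1


def enabled_program_masks(psw):
    # Names of the program-mask bits that are set in this PSW value.
    return [name for bit, name in PROGRAM_MASK_BITS.items() if psw_bit(psw, bit)]
```

    For example, a PSW with only bit 20 set corresponds to the integer
    1 << 43, and enabled_program_masks reports just the fixed-point
    overflow mask for it.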

    Joe Seigh

  • From John Levine@21:1/5 to All on Wed Sep 11 15:53:06 2024
    According to Thomas Koenig <[email protected]>:
    John Levine <[email protected]> schrieb:
    It appears that Anton Ertl <[email protected]> said:
    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.

    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    S/360 had signed and unsigned adds in the 1960s, with optional
    trapping for signed overflow. OS/360 let you catch the traps and z
    still does but it is not my impression that many programs did or do.

    With trapping, I understand. Without trapping - what is the
    difference on a two's complement machine?

    Different condition codes. Signed add:

    0 Sum is zero
    1 Sum is less than zero
    2 Sum is greater than zero
    3 Overflow

    Unsigned add:

    0 Sum is zero (no carry)
    1 Sum is not zero (no carry)
    2 Sum is zero (carry)
    3 Sum is not zero (carry)
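
    To make the point concrete, here is a small sketch that models the
    two tables above for 32-bit operands. The function names
    (cc_add_signed, cc_add_logical) are hypothetical, and this is only a
    model of the condition-code rules listed above, not of any actual
    implementation. The result bits are identical in both cases; only
    the condition code differs:

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000


def to_signed32(x):
    # Interpret the low 32 bits of x as a two's complement value.
    x &= MASK32
    return x - (1 << 32) if x & SIGN32 else x


def cc_add_signed(a, b):
    # Returns (result_bits, cc) following the signed-add table above.
    true_sum = to_signed32(a) + to_signed32(b)
    result = true_sum & MASK32
    if not -(1 << 31) <= true_sum < (1 << 31):
        return result, 3  # overflow
    if true_sum == 0:
        return result, 0  # sum is zero
    return result, (1 if true_sum < 0 else 2)


def cc_add_logical(a, b):
    # Returns (result_bits, cc) following the unsigned-add table above.
    full = (a & MASK32) + (b & MASK32)
    result = full & MASK32
    carry = full >> 32  # 0 or 1
    return result, 2 * carry + (1 if result != 0 else 0)
```

    Adding 0x7FFFFFFF and 1 gives the bit pattern 0x80000000 either way,
    but the signed add reports condition code 3 (overflow) while the
    unsigned add reports 1 (not zero, no carry).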

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Wed Sep 11 16:21:19 2024
    According to Anton Ertl <[email protected]>:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

    It seems that during the late 1990s, IBM was not particularly interested
    in mainframe per-CPU performance.

    Mainframes were never about CPU performance.

    The S/360 Model 91 and the Model 195 certainly were about the maximum
    CPU performance. And I doubt that IBM would have spent all the effort
    with ECL and a superscalar OoO implementation for some of the ES/9000
    machines if CPU performance was considered unimportant at the time.

    It's an interesting question why they did not follow up their
    superscalar OoO ECL implementations with a superscalar OoO CMOS
    implementation in addition to the scalar in-order 9672. ...

    IBM definitely cared about maximum performance in the 1950s and early 1960s.

    The goal of STRETCH was specifically to make the fastest possible computer. It sort of
    succeeded, late and over budget and not as fast as they hoped, but still the fastest
    computer in the world for a while. It was a success in that they reused a lot of the
    technology like the fast core memory in later computers.

    The 360/91 was also intended to be the fastest possible computer, which again it sort of
    was, late and over budget. One thing that STRETCH and the /91 shared was that they were
    extremely complicated. STRETCH had variable sized bytes and addressing modes that I
    never entirely figured out. The /91 had an instruction queue with loop mode and out of
    order operations and register renaming and imprecise interrupts. When the CDC 6600 came
    out, a much simpler design from a tiny company that was nonetheless faster than the /91,
    they knew they had a problem. The /95 and /195 were minor upgrades of the /91 but that was
    the end of their supercomputer efforts.

    The point of a mainframe is balanced performance. The CPU of a 360/30 was extremely slow
    but it was fast enough to drive a disk or two and a printer and card read/punch and get a
    lot of useful work done. Mainframes have had channels since the 709 in the late 1950s so
    they have a lot of I/O capacity. Modern ones have terabytes of RAM and exabyte of disk.

    They also care deeply about reliability. Modern mainframes have multiple kinds of error
    checking and standby CPUs that can take over from a failed CPU, restart a failed
    instruction, and the program doesn't notice. I think you'll find a pattern since the
    CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
    the error checking and recovery features so the systems keep running for years at a time.



    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Brett@21:1/5 to Anton Ertl on Wed Sep 11 16:39:23 2024
    Anton Ertl <[email protected]> wrote:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

    It seems that during the late 1990s, IBM was not particularly interested
    in mainframe per-CPU performance.

    Mainframes were never about CPU performance.

    The S/360 Model 91 and the Model 195 certainly were about the maximum
    CPU performance. And I doubt that IBM would have spent all the effort
    with ECL and a superscalar OoO implementation for some of the ES/9000 machines if CPU performance was considered unimportant at the time.

    It's an interesting question why they did not follow up their
    superscalar OoO ECL implementations with a superscalar OoO CMOS implementation in addition to the scalar in-order 9672. Here are
    three speculations of what happened:

    1) They had such a project and it did not work out, and the "never
    about CPU performance" spin is a sour-grapes type rationalization of
    the result.

    2) They expected their mainframe market to be eaten up by the Unix
    and/or WNT markets, and did not want to invest a lot into the
    development of mainframe CPUs. Again, the "never about CPU
    performance" spin is a sour-grapes type rationalization of the result.

    3) They had decided that they had a captive market in the mainframes,
    with software that was written for lower-powered CPUs, that the rapid
    CMOS advances in the 1990s would give them enough of a performance
    push to satisfy the needs of this software, so no more sophisticated
    CPU designs than the 9672 were necessary (and the G5 and G6 of the 9672
    indeed gave them more CPU power than ever). The "never about CPU
    performance" reflected their position at the time and also served to
    placate anyone who pointed out that the per-CPU performance was
    inferior to that of other CPUs of the time, including IBM's own
    RS/6000 line.

    IBM had huge caches that PCs could not match, and smart I/O processors to
    handle much of the load that PCs, being cheap, had to handle with the CPU.

    You could go into this as my knowledge is mostly SWAG based off marketing
    bull and what little I know.

    Then there is the issue of cheap PCs that fail, while mainframes have a
    higher level of redundancy and failover. Failed business transactions can
    cost millions, more than the machine is worth, so saving pennies on
    hardware is stupid.

    Eventually they seem to have decided that per-CPU performance is
    important after all, with the superscalar z990 in 2003 and the OoO
    z196 in 2010. But of course Dennard scaling was slowing down around
    2003, so they needed to increase IPC to increase per-CPU performance.
    And even if they don't need more per-CPU performance than other architectures, they apparently do need advances over earlier
    generations of their own machines and maybe to discourage competition
    from emulators or startups.

    They were about high I/O
    throughput for efficient batch operations.

    Batch operations? I wonder how much CPU time on mainframes in the
    1990s and today is spent on that compared to interactive applications
    such as online transaction processing.

    - anton

  • From Thomas Koenig@21:1/5 to David Brown on Wed Sep 11 17:02:26 2024
    David Brown <[email protected]> schrieb:
    On 11/09/2024 10:54, Michael S wrote:
    On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

    [MIPS] has been abandoned and replaced by RISC-V several years ago.


    I’m not so sure the MIPS architecture has been “abandoned”. Last I
    heard, it was still shipping hundreds of millions of chips per year.

    Care to point to the source of this claim? Two main suppliers of MIPS
    silicon in this century are Microchip and Cavium (now owned by Marvell).

    According to my understanding Microchip's MIPS-based PIC32 line was
    never as popular as their other offerings.

    IMHO a major reason for that is Microchip's insane licensing policy for
    their development tools - although their compilers are just a minor modification of standard gcc, you have to pay huge amounts if you want
    to use the full features of the compiler. (At least now you can enable /some/ optimisation without a paid license.) It is not even possible to
    see from the release notes or documentation what version of gcc is
    provided, though my guess is that it is pretty old (the documentation describes "-std" options up to C++14).

    Sounds like a violation of the GPL. Do they provide the sources?

  • From John Levine@21:1/5 to All on Wed Sep 11 17:50:43 2024
    According to Stephen Fuld <[email protected]d>:
    IBM definitely cared about maximum performance in the 1950s and early
    1960s.

    Yes. And remember, one of the goals of S/360 was to provide an
    architecture that could handle both scientific (i.e. compute bound) and
    business (i.e. I/O bound) workloads.

    I don't think anyone would have foreseen how quickly scientific computing
    moved to mini and micro computers with fast CPUs and weak peripherals.
    Perhaps once the RAM is big enough to hold all the data the I/O performance
    is not a big deal.

    they knew they had a problem. The /95 and /195 were minor upgrades of
    the /91 but that was the end of their supercomputer efforts.

    Mostly true, except for the 3090 vector facility.

    I suppose. A review from the USDOE said:

    The IBM 3090 with Vector Facility is an extremely interesting machine
    because it combines very good scaler performance with enhanced vector
    and multitasking performance. For many IBM installations with a large
    scientific workload, the 3090/vector/MTF combination may be an ideal
    means of increasing throughput at minimum cost. However, neither the
    vector nor multitasking capabilities are sufficiently developed to
    make the 3090 competitive with our current worker machines for our
    large-scale scientific codes.

    https://www.osti.gov/biblio/5039931

    instruction, and the program doesn't notice. I think you'll find a
    pattern since the
    CDC shock of making CPUs fast enough to keep the RAM and I/O devices
    busy while having
    the error checking and recovery features so the systems keep running
    for years at a time.

    Yes, but they also have to keep producing faster and faster CPUs so they
    can entice current customers to upgrade and thus meet their revenue goals.

    The memories and disks keep getting bigger so it's not totally silly to
    think that the CPUs need to get faster, too. They keep increasing the
    number of CPUs, with z16 topping out at 200, all sharing up to 40TB of RAM.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Stephen Fuld@21:1/5 to John Levine on Wed Sep 11 10:18:44 2024
    On 9/11/2024 9:21 AM, John Levine wrote:
    According to Anton Ertl <[email protected]>:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

    It seems that during the late 1990s, IBM was not particularly
    interested
    in mainframe per-CPU performance.

    Mainframes were never about CPU performance.

    The S/360 Model 91 and the Model 195 certainly were about the maximum
    CPU performance. And I doubt that IBM would have spent all the effort
    with ECL and a superscalar OoO implementation for some of the ES/9000
    machines if CPU performance was considered unimportant at the time.

    It's an interesting question why they did not follow up their
    superscalar OoO ECL implementations with a superscalar OoO CMOS
    implementation in addition to the scalar in-order 9672. ...

    IBM definitely cared about maximum performance in the 1950s and early
    1960s.

    Yes. And remember, one of the goals of S/360 was to provide an
    architecture that could handle both scientific (i.e. compute bound) and business (i.e. I/O bound) workloads.


    The goal of STRETCH was specifically to make the fastest possible
    computer. It sort of
    succeeded, late and over budget and not as fast as they hoped, but
    still the fastest
    computer in the world for a while. It was a success in that they
    reused a lot of the
    technology like the fast core memory in later computers.

    The 360/91 was also intended to be the fastest possible computer,
    which again it sort of
    was, late and over budget. One thing that STRETCH and the /91 shared
    was that they were
    extremely complicated. STRETCH had variable sized bytes and
    addressing modes that I
    never entirely figured out. The /91 had an instruction queue with
    loop mode and out of
    order operations and register renaming and imprecise interrupts. When
    the CDC 6600 came
    out, a much simpler design from a tiny company that was nonetheless
    faster than the /91,
    they knew they had a problem. The /95 and /195 were minor upgrades of
    the /91 but that was
    the end of their supercomputer efforts.


    Mostly true, except for the 3090 vector facility.


    The point of a mainframe is balanced performance. The CPU of a 360/30
    was extremely slow
    but it was fast enough to drive a disk or two and a printer and card
    read/punch and get a
    lot of useful work done. Mainframes have had channels since the 709
    in the late 1950s so
    they have a lot of I/O capacity. Modern ones have terabytes of RAM
    and exabyte of disk.

    Yes.


    They also care deeply about reliability. Modern mainframes have
    multiple kinds of error
    checking and standby CPUs that can take over from a failed CPU,
    restart a failed
    instruction, and the program doesn't notice. I think you'll find a
    pattern since the
    CDC shock of making CPUs fast enough to keep the RAM and I/O devices
    busy while having
    the error checking and recovery features so the systems keep running
    for years at a time.

    Yes, but they also have to keep producing faster and faster CPUs so they
    can entice current customers to upgrade and thus meet their revenue goals.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From David Brown@21:1/5 to Thomas Koenig on Wed Sep 11 19:57:30 2024
    On 11/09/2024 19:02, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:
    On 11/09/2024 10:54, Michael S wrote:
    On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

    [MIPS] has been abandoned and replaced by RISC-V several years ago.


    I’m not so sure the MIPS architecture has been “abandoned”. Last I
    heard, it was still shipping hundreds of millions of chips per year.

    Care to point to the source of this claim? Two main suppliers of MIPS
    silicon in this century are Microchip and Cavium (now owned by Marvell).

    According to my understanding Microchip's MIPS-based PIC32 line was
    never as popular as their other offerings.

    IMHO a major reason for that is Microchip's insane licensing policy for
    their development tools - although their compilers are just a minor
    modification of standard gcc, you have to pay huge amounts if you want
    to use the full features of the compiler. (At least now you can enable
    /some/ optimisation without a paid license.) It is not even possible to
    see from the release notes or documentation what version of gcc is
    provided, though my guess is that it is pretty old (the documentation
    describes "-std" options up to C++14).

    Sounds like a violation of the GPL. Do they provide the sources?

    Yes.

    It's perfectly fine to take the gcc sources, add in some code that
    checks for a paid license of some sort, and distribute that as a binary
    - as long as you also provide the source for it. So you /could/ take
    the source and compile it yourself (or just get the original gcc source,
    or another binary build of gcc MIPS).

    But the license for their header files, SDKs, libraries, IDE (which,
    IIRC, was basically NetBeans) and other tools says you can only use them
    with an unmodified binary that they provide. And writing your own
    header files for a big microcontroller is not a quick and easy job.

    I believe there was no legal violation of the GPL, but there was no
    doubt that it was trashing the spirit of it.

    And as far as I could see from a look at their website, they are still
    at the same game (though you can now enable /some/ optimisations), now
    with the ARM core PIC32 devices as well.

    Every other ARM or RISC V based microcontroller manufacturer I have seen provides free gcc and/or clang tools, along with a free IDE (Eclipse or
    MS Code). They will also provide support for paid tools like ARM's
    development tools, or IAR, or maybe Green Hills - these are expensive,
    but that's fair enough. What is not fair, even if it is legal, is
    taking something that they get for free and charging multiple
    kilodollars for people to use it.

  • From Stephen Fuld@21:1/5 to John Levine on Wed Sep 11 11:25:48 2024
    On 9/11/2024 10:50 AM, John Levine wrote:
    According to Stephen Fuld <[email protected]d>:
    IBM definitely cared about maximum performance in the 1950s and early
    1960s.

    Yes. And remember, one of the goals of S/360 was to provide an
    architecture that could handle both scientific (i.e. compute bound) and
    business (i.e. I/O bound) workloads.

    I don't think anyone would have foreseen how quickly scientific computing
    moved to mini and micro computers with fast CPUs and weak peripherals.

    Agreed, plus the development of CDC/Cray supercomputers that took the
    high end scientific market away from IBM.



    Perhaps once the RAM is big enough to hold all the data the I/O performance is not a big deal.

    they knew they had a problem. The /95 and /195 were minor upgrades of
    the /91 but that was the end of their supercomputer efforts.

    Mostly true, except for the 3090 vector facility.

    I suppose. A review from the USDOE said:

    The IBM 3090 with Vector Facility is an extremely interesting machine
    because it combines very good scaler performance with enhanced vector
    and multitasking performance. For many IBM installations with a large
    scientific workload, the 3090/vector/MTF combination may be an ideal
    means of increasing throughput at minimum cost. However, neither the
    vector nor multitasking capabilities are sufficiently developed to
    make the 3090 competitive with our current worker machines for our
    large-scale scientific codes.

    https://www.osti.gov/biblio/5039931

    I didn't claim that the 3090VF was successful, just that IBM was
    interested enough in the scientific market to spend money developing it
    after the 370/195.




    instruction, and the program doesn't notice. I think you'll find a
    pattern since the
    CDC shock of making CPUs fast enough to keep the RAM and I/O devices
    busy while having
    the error checking and recovery features so the systems keep running
    for years at a time.

    Yes, but they also have to keep producing faster and faster CPUs so they
    can entice current customers to upgrade and thus meet their revenue goals.

    The memories and disks keep getting bigger so it's not totally silly to
    think that the CPUs need to get faster, too.

    Agreed, of course.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Stephen Fuld@21:1/5 to Anton Ertl on Wed Sep 11 11:36:46 2024
    On 9/11/2024 2:32 AM, Anton Ertl wrote:
    Lawrence D'Oliveiro <[email protected]d> writes:

    snip

    They were about high I/O
    throughput for efficient batch operations.

    Batch operations? I wonder how much CPU time on mainframes in the
    1990s and today is spent on that compared to interactive applications
    such as online transaction processing.

    Perhaps it would have been better stated as being about balanced
    performance (CPU and I/O) for business applications, which at the time
    were primarily batch, but have migrated over time to transactions, but
    which still are more I/O bound than scientific workloads.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Thomas Koenig@21:1/5 to John Levine on Wed Sep 11 19:36:59 2024
    John Levine <[email protected]> schrieb:
    According to Stephen Fuld <[email protected]d>:
    IBM definitely cared about maximum performance in the 1950s and early
    1960s.

    Yes. And remember, one of the goals of S/360 was to provide an
    architecture that could handle both scientific (i.e. compute bound) and
    business (i.e. I/O bound) workloads.

    I don't think anyone would have foreseen how quickly scientific computing
    moved to mini and micro computers with fast CPUs and weak peripherals.
    Perhaps once the RAM is big enough to hold all the data the I/O performance
    is not a big deal.

    they knew they had a problem. The /95 and /195 were minor upgrades of
    the /91 but that was the end of their supercomputer efforts.

    Don't forget the ACS.

    Looking at (if that is to be believed) https://people.computing.clemson.edu/~mark/acs_performance.html
    this seems to have been quite an amazing machine for its time,
    with projected 160 MFlops and around five concurrent instructions
    using OoO.

    Had this been realized, it would have been as fast as the Cray-I.
    But it never reached the market, so...

    Mostly true, except for the 3090 vector facility.

    I suppose. A review from the USDOE said:

    We had that in our IBM 3090 at the computer center. Compared to the
    Fujitsu VP machine sitting next to it, it was not impressive at
    all (which can also be read in the report you quoted).

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Sep 11 21:12:34 2024
    On Wed, 11 Sep 2024 9:32:04 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

    It seems that during the late 1990s, IBM was not particularly interested
    in mainframe per-CPU performance.

    Mainframes were never about CPU performance.

    The S/360 Model 91 and the Model 195 certainly were about the maximum
    CPU performance. And I doubt that IBM would have spent all the effort
    with ECL and a superscalar OoO implementation for some of the ES/9000 machines if CPU performance was considered unimportant at the time.

    The /91 was Current-Mode Logic (CML). I don't know about the /195.
    CML had all of the speed, all of the electrical, and all of the heat
    problems that ECL had.

  • From MitchAlsup1@21:1/5 to John Levine on Wed Sep 11 21:15:29 2024
    On Wed, 11 Sep 2024 16:21:19 +0000, John Levine wrote:

    According to Anton Ertl <[email protected]>:
    Lawrence D'Oliveiro <[email protected]d> writes:

    The 360/91 was also intended to be the fastest possible computer, which
    again it sort of was, late and over budget. One thing that STRETCH and
    the /91 shared was that they were extremely complicated. STRETCH had
    variable sized bytes and addressing modes that I never entirely figured
    out. The /91 had an instruction queue with loop mode and out of order
    operations and register renaming and imprecise interrupts. When the CDC
    6600 came out, a much simpler design from a tiny company that was
    nonetheless faster than the /91, they knew they had a problem. The /95
    and /195 were minor upgrades of the /91 but that was the end of their
    supercomputer efforts.

    You forgot to mention that the /91 had a 60 ns clock while the 6600 had
    a 100 ns clock.
    Here was a case where parallelism beat out pipelining.

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Wed Sep 11 21:51:54 2024
    Thomas Koenig <[email protected]> writes:
    John Levine <[email protected]> schrieb:
    According to Stephen Fuld <[email protected]d>:
    IBM definitely cared about maximum performance in the 1950s and early 1960s.

    Yes. And remember, one of the goals of S/360 was to provide an architecture that could handle both scientific (i.e. compute bound) and business (i.e. I/O bound) workloads.

    I don't think anyone would have foreseen how quickly scientific computing
    moved to mini and micro computers with fast CPUs and weak peripherals.
    Perhaps once the RAM is big enough to hold all the data the I/O performance is not a big deal.

    they knew they had a problem. The /95 and /195 were minor upgrades of the /91 but that was the end of their supercomputer efforts.

    Don't forget the ACS.

    Looking at (if that is to be believed) https://people.computing.clemson.edu/~mark/acs_performance.html
    this seems to have been quite an amazing machine for its time,
    with projected 160 MFlops and around five concurrent instructions
    using OoO.

    The (cancelled) Burroughs Scientific Processor was as fast as
    a Cray-I.

    The Burroughs Scientific Processor (BSP), a high-performance
    computer system, performed the Department of Energy LLL loops
    at roughly the speed of the CRAY-1. The BSP combined parallelism
    and pipelining, performing memory-to-memory operations. Seventeen
    memory units and two crossbar switch data alignment networks
    provided conflict-free access to most indexed arrays. Fast linear
    recurrence algorithms provided good performance on constructs that
    some machines execute serially. A system manager computer ran the
    operating system and a vectorizing Fortran compiler. An MOS file
    memory system served as a high bandwidth secondary memory.

    https://ieeexplore.ieee.org/document/1676014
    https://en.wikipedia.org/wiki/Parallel_Element_Processing_Ensemble

  • From Lars Poulsen@21:1/5 to John Levine on Wed Sep 11 17:44:58 2024
    On 9/11/2024 9:21 AM, John Levine wrote:
    I think you'll find a pattern since the
    CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
    the error checking and recovery features so the systems keep running for years at a time.

    So do these systems not require security patches?
    Or do they apply PTFs to the running system? (reliably?)

  • From John Levine@21:1/5 to All on Thu Sep 12 02:05:14 2024
    According to Lars Poulsen <[email protected]>:
    On 9/11/2024 9:21 AM, John Levine wrote:
    I think you'll find a pattern since the
    CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
    the error checking and recovery features so the systems keep running for years at a time.

    So do these systems not require security patches?
    Or do they apply PTFs to the running system? (reliably?)

    They don't just update the software, they swap out entire hardware subsystems while the
    overall system keeps running.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Thu Sep 12 04:28:00 2024
    On Wed, 11 Sep 2024 11:36:46 -0700, Stephen Fuld wrote:

    Perhaps it would have been better stated as being about balanced
    performance (CPU and I/O) for business applications, which at the time
    were primarily batch, but have migrated over time to transactions, but
    which still are more I/O bound than scientific workloads.

    Depending on the kind of transactions: online interactive stuff requires
    low latencies as opposed to high throughput. Mainframes are optimized for
    high throughput.

  • From Lawrence D'Oliveiro@21:1/5 to Brett on Thu Sep 12 04:26:52 2024
    On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

    Then there is the issue of cheap PCs that fail, and mainframes have a higher level of redundancy and failover. Failed business transactions
    can cost millions, more than the machine is worth, so saving pennies on hardware is stupid.

    You solve that by having multiple units of the cheap machines to achieve
    the same level of redundancy, or even more. That ends up being more cost-effective than the mainframe.

  • From Lawrence D'Oliveiro@21:1/5 to All on Thu Sep 12 04:30:05 2024
    On Wed, 11 Sep 2024 21:12:34 +0000, MitchAlsup1 wrote:

    91 was Current-Mode-Logic CML. ...
    CML had all of the speed and all of the electrical and all of the heat problems ECL had.

    IBM over-promising and under-delivering, again.

    The ’90, or was it the ’91, or the ’92, was the machine IBM promised to deliver to those customers who were looking to buy a CDC machine. It
    remained vapourware for close to two years I think it was, and was underwhelming when it did finally appear. But it still managed to cost CDC
    a lot of sales in the meantime.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu Sep 12 04:31:49 2024
    On Wed, 11 Sep 2024 16:21:19 -0000 (UTC), John Levine wrote:

    They also care deeply about reliability. Modern mainframes have multiple kinds of error checking and standby CPUs that can take over from a
    failed CPU, restart a failed instruction, and the program doesn't
    notice.

    This “mainframe reliability” seems to be a persistent myth. There is a document at Bitsavers, dating from 1986, which says that, if you want to
    turn daylight saving on or off on an IBM mainframe, you really should
    reboot.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu Sep 12 04:33:09 2024
    On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:

    They don't just update the software, they swap out entire hardware
    subsystems while the overall system keeps running.

    Xen Orchestra (open-source) can do that on commodity PC hardware.

  • From Terje Mathisen@21:1/5 to John Levine on Thu Sep 12 10:50:33 2024
    John Levine wrote:
    According to Stephen Fuld <[email protected]d>:
    IBM definitely cared about maximum performance in the 1950s and early
    1960s.

    Yes. And remember, one of the goals of S/360 was to provide an
    architecture that could handle both scientific (i.e. compute bound) and
    business (i.e. I/O bound) workloads.

    I don't think anyone would have foreseen how quickly scientific computing moved to mini and micro computers with fast CPUs and weak peripherals. Perhaps once the RAM is big enough to hold all the data the I/O performance is not a big deal.

    Back around 1986 or so I stated that all programming tasks will migrate
    down to the lowest/cheapest architecture which is large enough to handle
    the task. This meant that I was sure both minis and mainframes would go
    away, so I was in fact only 99.9% correct. :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Sep 12 14:30:27 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:

    They don't just update the software, they swap out entire hardware
    subsystems while the overall system keeps running.

    Xen Orchestra (open-source) can do that on commodity PC hardware.

    The 3leaf hypervisor supported hot-plug memory, hot-plug CPU, and
    hot-plug PCI 15 years ago with commodity Linux guests.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Sep 12 14:31:26 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Wed, 11 Sep 2024 16:21:19 -0000 (UTC), John Levine wrote:

    They also care deeply about reliability. Modern mainframes have multiple
    kinds of error checking and standby CPUs that can take over from a
    failed CPU, restart a failed instruction, and the program doesn't
    notice.

    This “mainframe reliability” seems to be a persistent myth.

    Where do you come up with this nonsense?

    Have you not heard of Tandem, or Stratus?

  • From David Schultz@21:1/5 to David Brown on Wed Sep 11 14:06:23 2024
    On 9/11/24 4:40 AM, David Brown wrote:
    The other reason, of course, was the name - "PIC" is associated with brain-dead microcontrollers with terrible C tools and which many people program in assembly. They are also renowned for being very solid,
    coming in relatively amateur-friendly packages, and for never going out
    of production.

    After having written code for a PIC I agree with "brain-dead". The small
    sized memory pages were bad enough but the total lack of an add with
    carry instruction drove me mad.
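
    A minimal sketch of the usual workaround, emulated in Python rather than written in PIC assembly: a 16-bit add built from plain 8-bit adds plus a test of the carry flag, which is roughly what has to be spelled out by hand when there is no add-with-carry instruction. The helper names add8 and add16_without_adc are hypothetical and only serve the illustration.

```python
def add8(a, b):
    """Add two 8-bit values; return (8-bit result, carry flag)."""
    total = a + b
    return total & 0xFF, total > 0xFF


def add16_without_adc(a, b):
    """Add two 16-bit values using only 8-bit adds and a carry test."""
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF

    # Step 1: add the low bytes and remember the carry.
    lo, carry = add8(a_lo, b_lo)

    # Step 2: add the high bytes with a plain add.
    hi, carry_out = add8(a_hi, b_hi)

    # Step 3: "skip if carry clear, else increment" on the high byte.
    if carry:
        hi, wrapped = add8(hi, 1)
        carry_out = carry_out or wrapped

    return (hi << 8) | lo, carry_out
```

    With an add-with-carry instruction, steps 2 and 3 collapse into one instruction per byte, which is why its absence hurts most on wide arithmetic.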

    So I swore them off and the introduction of a MIPS based system did
    nothing to change that.

    --
    http://davesrocketworks.com
    David Schultz

  • From John Levine@21:1/5 to All on Thu Sep 12 20:10:43 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

    Then there is the issue of cheap PCs that fail, and mainframes have a higher level of redundancy and failover. Failed business transactions
    can cost millions, more than the machine is worth, so saving pennies on
    hardware is stupid.

    You solve that by having multiple units of the cheap machines to achieve
    the same level of redundancy, or even more. That ends up being more cost-effective than the mainframe.

    That's fine for workloads that work that way.

    Airline reservation systems historically ran on mainframes because when they were invented
    that's all there was (original SABRE ran on two 7090s) and they are business critical so
    they need to be very reliable.

    About 30 years ago some guys at MIT realized that route and fare search, which are some of
    the most demanding things that CRS do, are easy to parallelize and don't have to be
    particularly reliable -- if your search system crashes and restarts and reruns the search
    and the result is a couple of seconds late, that's OK. So they started ITA software which
    used racks of PC servers running parallel applications written in Lisp (they were from
    MIT) and blew away the competition.

    However, that's just the search part. Actually booking the seats and selling tickets stays
    on a mainframe or an Oracle system because double booking or giving away free tickets would
    be really bad.

    There's also a rule of thumb about databases that says one system of performance 100 is
    much better than 100 systems of performance 1 because those 100 systems will spend all
    their time contending for database locks.




    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu Sep 12 22:20:23 2024
    On Thu, 12 Sep 2024 20:10:43 -0000 (UTC), John Levine wrote:

    Actually booking the seats and
    selling tickets stays on a mainframe or an Oracle system because double booking or giving away free tickets would be really bad.

    Fun fact: double-booking happens all the time.

    I don’t think either mainframes or Oracle are still dominant in this sort
    of business. After all, Paul Graham’s Orbitz was doing this sort of thing over 20 years ago ... in LISP.

    <https://paulgraham.com/carl.html>

  • From Terje Mathisen@21:1/5 to John Levine on Fri Sep 13 09:28:06 2024
    John Levine wrote:
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

    Then there is the issue of cheap PCs that fail, and mainframes have a
    higher level of redundancy and failover. Failed business transactions
    can cost millions, more than the machine is worth, so saving pennies on
    hardware is stupid.

    You solve that by having multiple units of the cheap machines to achieve
    the same level of redundancy, or even more. That ends up being more
    cost-effective than the mainframe.

    That's fine for workloads that work that way.

    Airline reservation systems historically ran on mainframes because when they were invented
    that's all there was (original SABRE ran on two 7090s) and they are business critical so
    they need to be very reliable.

    About 30 years ago some guys at MIT realized that route and fare search, which are some of
    the most demanding things that CRS do, are easy to parallelize and don't have to be
    particularly reliable -- if your search system crashes and restarts and reruns the search
    and the result is a couple of seconds late, that's OK. So they started ITA software which
    used racks of PC servers running parallel applications written in Lisp (they were from
    MIT) and blew away the competition.

    However, that's just the search part. Actually booking the seats and selling tickets stays
    on a mainframe or an Oracle system because double booking or giving away free tickets would
    be really bad.

    You could replicate much of that part as well, for most of the time, by setting aside chunks of seats to parallel servers, so that they can
    book/sell within that chunk until they start to run out. This way the expensive system is mostly needed only when the front ends run out?
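
    A minimal sketch of that idea, assuming a single central inventory that leases blocks of seats to front-end servers; the class names CentralInventory and FrontEnd and the chunk size are hypothetical. Each front end sells from its own block without touching the central system, and only goes back to it when the block is empty.

```python
class CentralInventory:
    """The expensive, authoritative system: hands out disjoint seat blocks."""

    def __init__(self, total_seats):
        self._free = list(range(1, total_seats + 1))

    def lease_chunk(self, size):
        # Seats leave the central pool exactly once, so two front ends
        # can never be handed the same seat.
        chunk, self._free = self._free[:size], self._free[size:]
        return chunk


class FrontEnd:
    """A cheap server that sells only from the block it currently holds."""

    def __init__(self, central, chunk_size):
        self.central = central
        self.chunk_size = chunk_size
        self.local = []
        self.central_calls = 0

    def book(self):
        if not self.local:
            # Only now does the central system get involved.
            self.central_calls += 1
            self.local = self.central.lease_chunk(self.chunk_size)
        if not self.local:
            return None  # flight is sold out
        return self.local.pop()
```

    The sketch leaves out the awkward end game: seats stranded in one front end's block while another has run dry would need to be handed back or rebalanced, and that is where the central system earns its keep.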


    There's also a rule of thumb about databases that says one system of performance 100 is
    much better than 100 systems of performance 1 because those 100 systems will spend all
    their time contending for database locks.

    10-15 years ago I talked to another speaker at a conference, he told me
    that he was working on high-end open source LDAP software using _very_
    large memory DBs: Their system allowed one US cell phone company to keep
    every SIM card (~100M) on a single system, while a similar-size
    competitor had been forced to fall back on 17-way sharding (presumably
    using a hash of the SIM id).
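
    For illustration, 17-way sharding by a hash of the SIM id might look like the sketch below. This is only a guess at the general shape, not the competitor's actual scheme; the function name shard_for and the choice of SHA-256 are assumptions.

```python
import hashlib

N_SHARDS = 17  # the shard count mentioned above


def shard_for(sim_id, n_shards=N_SHARDS):
    """Map a SIM id string to a shard number in the range 0..n_shards-1."""
    # A stable hash is used on purpose: Python's built-in hash() for
    # strings is randomized per process and so cannot be used for routing.
    digest = hashlib.sha256(sim_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards
```

    Every lookup then starts with a routing step, and any operation touching SIMs on different shards becomes a distributed one, which is the cost the single large-memory system avoids.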

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Scott Lurndal on Fri Sep 13 09:15:52 2024
    Scott Lurndal wrote:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:

    They don't just update the software, they swap out entire hardware
    subsystems while the overall system keeps running.

    Xen Orchestra (open-source) can do that on commodity PC hardware.

    The 3leaf hypervisor supported hot-plug memory, hot-plug CPU, and
    hot-plug PCI 15 years ago with commodity Linux guests.

    Novell's System Fault Tolerant NetWare 386 (around 1990) supported two
    complete servers acting like one, so that any hardware component could
    fail and the system would keep running, with nothing noticed by the
    clients, even those that were in the middle of an update/write request.

    Worked with a private high-speed link between the two servers, so that
    all requests were mirrored from master to slave. This way the slave
    would do the requested operation in sync with the master, maintaining
    the exact same state so it was ready to take over at any point.
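
    The mirroring scheme can be sketched as a toy model, assuming deterministic request handling so that applying the same requests in the same order gives the same state on both machines. The names FileServer and MirroredPair are hypothetical and this is not NetWare's actual protocol.

```python
class FileServer:
    """One server: its state is fully determined by the requests applied."""

    def __init__(self):
        self.files = {}

    def apply(self, request):
        op, name, data = request
        if op == "write":
            self.files[name] = data
        elif op == "delete":
            self.files.pop(name, None)
        else:
            raise ValueError("unknown operation: %r" % (op,))


class MirroredPair:
    """Two servers kept in lockstep by mirroring every request."""

    def __init__(self):
        self.active = FileServer()
        self.standby = FileServer()

    def handle(self, request):
        # Mirror the request first, then apply it locally, so the standby
        # never lags behind what the client has been told.
        if self.standby is not None:
            self.standby.apply(request)
        self.active.apply(request)

    def fail_active(self):
        # The standby already holds identical state and simply takes over.
        if self.standby is None:
            raise RuntimeError("no standby left to take over")
        self.active, self.standby = self.standby, None
```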

    BTW, since the pair naturally had separate network connections, they
    could also be connected to separate LAN segments, and this worked
    transparently because every server (single or SFT) maintained a LAN
    segment inside the server: This way the two server connections just
    looked like redundant routing paths.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to John Levine on Fri Sep 13 06:23:47 2024
    John Levine <[email protected]> writes:
    According to Lawrence D'Oliveiro <[email protected]d>:
    You solve that by having multiple units of the cheap machines to achieve the same level of redundancy, or even more. That ends up being more cost-effective than the mainframe.

    That's fine for workloads that work that way.

    Airline reservation systems historically ran on mainframes because
    when they were invented that's all there was (original SABRE ran on
    two 7090s) and they are business critical so they need to be very
    reliable.

    About 30 years ago some guys at MIT realized that route and fare
    search, which are some of the most demanding things that CRS do, are
    easy to parallelize and don't have to be particularly reliable -- if
    your search system crashes and restarts and reruns the search and the
    result is a couple of seconds late, that's OK. So they started ITA
    software which used racks of PC servers running parallel applications
    written in Lisp (they were from MIT) and blew away the competition.

    However, that's just the search part. Actually booking the seats and
    selling tickets stays on a mainframe or an Oracle system because
    double booking or giving away free tickets would be really bad.

    Booking flights or seats can easily be distributed: each flight is
    assigned to one computer. To avoid double booking or free tickets
    even in case of a computer crash, you use the usual transaction
    processing approach, and report completion of the booking only when
    the booking has reached persistent memory. For persistent memory you
    use SSDs with power-loss protection.
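
    A minimal sketch of "report completion only once the booking is persistent", assuming one append-only log file per flight on the machine that owns that flight. The class name FlightLedger and the JSON record format are hypothetical, and a torn record at the end of the log after a crash is not handled here.

```python
import json
import os
import tempfile


class FlightLedger:
    """Bookings for one flight, acknowledged only after reaching stable storage."""

    def __init__(self, path, capacity):
        self.capacity = capacity
        self.booked = set()
        # Recovery: rebuild the in-memory state by replaying the log.
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    self.booked.add(json.loads(line)["seat"])
        self.log = open(path, "a")

    def book(self, seat, passenger):
        if seat in self.booked or not (1 <= seat <= self.capacity):
            return False
        record = json.dumps({"seat": seat, "passenger": passenger})
        self.log.write(record + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # wait until the OS reports it durable
        self.booked.add(seat)
        return True                  # only now is the booking confirmed

    def close(self):
        self.log.close()


def demo():
    """Book, 'crash' by closing, reopen, and confirm no double booking."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "flight.log")
        ledger = FlightLedger(path, capacity=2)
        results = [ledger.book(1, "alice"), ledger.book(1, "bob")]
        ledger.close()
        ledger = FlightLedger(path, capacity=2)  # restart and replay
        results += [ledger.book(1, "carol"), ledger.book(2, "carol")]
        ledger.close()
        return results
```

    Because each flight lives on exactly one machine, the check against the set of booked seats needs no cross-machine locking.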

    These SSDs, ECC RAM, RAID-1, redundant power supplies and UPSs protect
    against most hardware failures, but availability is still a concern
    (e.g., motherboard or CPU failure; that normally does not affect data
    integrity if the other measures are taken, but it affects
    availability). To increase availability, you can use e.g., DRBD
    (distributed replicated block device) to get the data on multiple
    machines.

    Concerning "real bad": Airlines overbook their flights as a matter of
    policy to increase their revenue. If they had a booking system that double-booked, say, 1ppm of all bookings, they probably would not even
    notice, and would deal with it in the same way they deal with the
    cases when the overbooking actually results in too many passengers
    arriving for the flight. Likewise, free tickets are not an issue if
    they occur rarely enough. Do they want to spend a million on a
    mainframe to avoid a revenue loss of $100k? But in any case, that's
    not the problem with cheap hardware.

    The problems are: When the persistent storage fails, you lose all
    transactions since the latest backup. To avoid that, RAID-1 helps, or
    a redundant distributed storage like DRBD, or a redundant distributed transaction system. You may also want more availability than a single
    system with RAID-1 (with a spare system standing by) provides, then
    you have to go for one of the redundant distributed approaches.

    However, my impression from booking flights online is that reliability
    of the booking platform is not at all a concern for the airlines. And
    as a customer, I find little difference between the booking front-end
    erroring out or the transaction back-end being unavailable.

    There's also a rule of thumb about databases that says one system of performance 100 is much better than 100 systems of performance 1
    because those 100 systems will spend all their time contending for
    database locks.

    If you handle each flight on one system, the contention for locks is
    only within that one system. And I expect that there is not that much contention. How many people book the same flight within the same
    millisecond (or however long the lock is held)?

    Concerning performance 100 vs. performance 1, about what systems are
    you thinking? z17 will have 32*8=256 cores (of unknown performance
    that is likely to be disappointing, or IBM would not disallow
    publishing benchmark results), compared to similar numbers of cores on
    servers with AMD or Intel CPUs, or 16-24 cores on systems based on
    desktop chips (with Intel you pay a heavy premium these days if you
    want ECC memory, however).

    Interestingly, with increasing number of cores per socket in recent
    years, the number of sockets is going down. E.g., the successor for
    the HPE Superdome Flex with up to 32 sockets (up to 32*28=896 cores)
    is the HPE Compute Scale-Up Server 3200 with up to 16 sockets
    (16*60=960 cores). Either there is little demand for single systems
    with more cores, or there are technical difficulties (probably both).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Fri Sep 13 08:24:19 2024
    On Fri, 13 Sep 2024 09:15:52 +0200, Terje Mathisen wrote:

    Novell's System Fault Tolerant NetWare 386 (around 1990) supported two complete servers acting like one, so that any hardware component could
    fail and the system would keep running, with nothing noticed by the
    clients, even those that were in the middle of an update/write request.

    Just so long as it wasn’t the network connection between them that failed.

    See also, “CAP Theorem”.

  • From Michael S@21:1/5 to John Levine on Fri Sep 13 12:22:17 2024
    On Thu, 12 Sep 2024 20:10:43 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

    Then there is the issue of cheap PCs that fail, and mainframes
    have a higher level of redundancy and failover. Failed business
    transactions can cost millions, more than the machine is worth, so
    saving pennies on hardware is stupid.

    You solve that by having multiple units of the cheap machines to
    achieve the same level of redundancy, or even more. That ends up
    being more cost-effective than the mainframe.

    That's fine for workloads that work that way.

    Airline reservation systems historically ran on mainframes because
    when they were invented that's all there was (original SABRE ran on
    two 7090s) and they are business critical so they need to be very
    reliable.

    About 30 years ago some guys at MIT realized that route and fare
    search, which are some of the most demanding things that CRS do, are
    easy to parallelize and don't have to be particularly reliable -- if
    your search system crashes and restarts and reruns the search and the
    result is a couple of seconds late, that's OK. So they started ITA
    software which used racks of PC servers running parallel applications
    written in Lisp (they were from MIT) and blew away the competition.

    However, that's just the search part. Actually booking the seats and
    selling tickets stays on a mainframe or an Oracle system because
    double booking or giving away free tickets would be really bad.

    There's also a rule of thumb about databases that says one system of performance 100 is much better than 100 systems of performance 1
    because those 100 systems will spend all their time contending for
    database locks.





    How many transactions per minute does the world's biggest company need at
    peak hours? Is this number not small relative to the capabilities of
    even a 15-year-old dual-Xeon server with a few dozen spinning-rust disks?

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Fri Sep 13 12:02:27 2024
    Lawrence D'Oliveiro wrote:
    On Fri, 13 Sep 2024 09:15:52 +0200, Terje Mathisen wrote:

    Novell's System Fault Tolerant NetWare 386 (around 1990) supported two
    complete servers acting like one, so that any hardware component could
    fail and the system would keep running, with nothing noticed by the
    clients, even those that were in the middle of an update/write request.

    Just so long as it wasn’t the network connection between them that failed.

    See also, “CAP Theorem”.

    If that failed, normal network routing would apply and the master-slave traffic would go out and back in over the normal network cards, but of
    course giving slightly reduced performance.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Fri Sep 13 11:20:06 2024
    Terje Mathisen <[email protected]> schrieb:

    10-15 years ago I talked to another speaker at a conference, he told me
    that he was working on high-end open source LDAP software using _very_
    large memory DBs: Their system allowed one US cell phone company to keep every SIM card (~100M) on a single system, while a similar-size
    competitor had been forced to fall back on 17-way sharding (presumably
    using a hash of the SIM id).

    Keeping databases in memory is definitely a thing now... see SAP HANA.

    Any architectural implications for this?

    Browsing through the SAP pages, it seems they used Intel's Optane
    persistent memory, but that is no longer manufactured (?). But
    having fast, persistent storage is definitely an advantage for
    databases.

    Large memory: Of course.

    On the ISA level... these databases run on x86, so that seems to
    be good enough.

    Anything else?

  • From Michael S@21:1/5 to Thomas Koenig on Fri Sep 13 14:55:39 2024
    On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Terje Mathisen <[email protected]> schrieb:

    10-15 years ago I talked to another speaker at a conference, he
    told me that he was working on high-end open source LDAP software
    using _very_ large memory DBs: Their system allowed one US cell
    phone company to keep every SIM card (~100M) on a single system,
    while a similar-size competitor had been forced to fall back on
    17-way sharding (presumably using a hash of the SIM id).

    Keeping databases in memory is definitely a thing now... see SAP HANA.

    Any architectural implications for this?

    Browsing through the SAP pages, it seems they used Intel's Optane
    persistent memory, but that is no longer manufactured (?). But
    having fast, persistent storage is definitely an advantage for
    databases.

    Large memory: Of course.

    On the ISA level... these databases run on x86, so that seems to
    be good enough.

    Anything else?


    Another thing that SAP HANA seems to use more intensely than anybody
    else is Intel TSX. TSX (at least the RTM part, I am not sure about the HLE
    part) is still present in the latest Xeon generation, but is strongly de-emphasized.

  • From John Dallman@21:1/5 to Michael S on Fri Sep 13 13:35:00 2024
    In article <[email protected]>, [email protected] (Michael S) wrote:

    How many transactions per minute does the world's biggest company need
    at peak hours?

    One very painful case is credit card spending in the run-up to major
    holidays, such as Christmas, where the credit card companies feel the
    need for central authorisation of all transactions to reduce fraud. Fraud
    is, naturally, at its peak at these times. The price of wrongly refused transactions is also high, because it means customers march out of shops, having wasted retail staff time.

    Is this number not small relative to the capabilities of even a
    15-year-old dual-Xeon server with a few dozen spinning-rust disks?

    This does not seem to be the case.

    John

  • From Michael S@21:1/5 to John Dallman on Fri Sep 13 15:50:12 2024
    On Fri, 13 Sep 2024 13:35 +0100 (BST)
    [email protected] (John Dallman) wrote:

    In article <[email protected]>,
    [email protected] (Michael S) wrote:

    How many transactions per minute does the world's biggest company need
    at peak hours?

    One very painful case is credit card spending in the run-up to major holidays, such as Christmas, where the credit card companies feel the
    need for central authorisation of all transactions to reduce fraud.
    Fraud is, naturally, at its peak at these times. The price of wrongly
    refused transactions is also high, because it means customers march
    out of shops, having wasted retail staff time.

    Is this number not small relative to the capabilities of even a 15-year-old dual-Xeon server with a few dozen spinning-rust disks?

    This does not seem to be the case.

    John

    My post was specifically about flight reservations.

  • From John Dallman@21:1/5 to Michael S on Fri Sep 13 16:18:00 2024
    In article <[email protected]>, [email protected] (Michael S) wrote:

    My post was specifically about flight reservations.

    Ah, sorry. In that field, the airlines have found it best to collect into
    large groups with shared reservation systems. If a travel agent has to
    use a different system for each airline, then booking becomes very
    inefficient and capacity gets wasted.

    <https://en.wikipedia.org/wiki/Computer_reservation_system>

    So there's real demand for systems with huge capacity. Not very many of
    them, but they have large budgets.

    John

  • From David Brown@21:1/5 to David Schultz on Fri Sep 13 17:36:10 2024
    On 11/09/2024 21:06, David Schultz wrote:
    On 9/11/24 4:40 AM, David Brown wrote:
    The other reason, of course, was the name - "PIC" is associated with
    brain-dead microcontrollers with terrible C tools and which many
    people program in assembly.  They are also renowned for being very
    solid, coming in relatively amateur-friendly packages, and for never
    going out of production.

    After having written code for a PIC I agree with "brain-dead". The small sized memory pages were bad enough but the total lack of an add with
    carry instruction drove me mad.


    If you pretend it is a sort of microcode rather than assembly, so that
    you need several PIC assembly instructions to do the work of a single
    assembly instruction on a "normal" 8-bit CISC microcontroller, it feels
    less bad. And with enough complicated macros, it is possible to keep
    paging a bit more under control and automated.

    It also helps to use macros to give instructions better names - such as
    "IfBit" and "IfNBit" rather than "btfsc" and "btfss". (The fact that I
    can remember these after 25 years or so is a sign of the amount of
    cognitive effort it took to work with these things!)

    So I swore them off and the introduction of a MIPS based system did
    nothing to change that.


  • From EricP@21:1/5 to Anton Ertl on Fri Sep 13 11:50:44 2024
    Anton Ertl wrote:
    John Levine <[email protected]> writes:
    According to Lawrence D'Oliveiro <[email protected]d>:
    You solve that by having multiple units of the cheap machines to achieve
    the same level of redundancy, or even more. That ends up being more
    cost-effective than the mainframe.
    That's fine for workloads that work that way.

    Airline reservation systems historically ran on mainframes because
    when they were invented that's all there was (original SABRE ran on
    two 7090s) and they are business critical so they need to be very
    reliable.

    About 30 years ago some guys at MIT realized that route and fare
    search, which are some of the most demanding things that CRS do, are
    easy to parallelize and don't have to be particularly reliable -- if
    your search system crashes and restarts and reruns the search and the
    result is a couple of seconds late, that's OK. So they started ITA
    software which used racks of PC servers running parallel applications
    written in Lisp (they were from MIT) and blew away the competition.

    However, that's just the search part. Actually booking the seats and
    selling tickets stays on a mainframe or an Oracle system because
    double booking or giving away free tickets would be really bad.

    Booking flights or seats can easily be distributed: each flight is
    assigned to one computer. To avoid double booking or free tickets
    even in case of a computer crash, you use the usual transaction
    processing approach, and report completion of the booking only when
    the booking has reached persistent memory. For persistent memory you
    use SSDs with power-loss protection.

    These SSDs, ECC RAM, RAID-1, redundant power supplies and UPSs protect against most hardware failures, but availability is still a concern
    (e.g., motherboard or CPU failure; that normally does not affect data integrity if the other measures are taken, but it affects
    availability). To increase availability, you can use e.g., DRBD
    (distributed replicated block device) to get the data on multiple
    machines.

    Concerning "real bad": Airlines overbook their flights as a matter of
    policy to increase their revenue. If they had a booking system that double-booked, say, 1ppm of all bookings, they probably would not even notice, and would deal with it in the same way they deal with the
    cases when the overbooking actually results in too many passengers
    arriving for the flight. Likewise, free tickets are not an issue if
    they occur rarely enough. Do they want to spend a million on a
    mainframe to avoid a revenue loss of $100k? But in any case, that's
    not the problem with cheap hardware.

    You would not want the computer or agents overbooking randomly.
    You would want this controlled by policy and done behind the agents'
    and customers' backs (or they would just cancel).

    So overbooking is just another kind of normal transaction
    and not an accident of fate.

    The problems are: When the persistent storage fails, you lose all transactions since the latest backup. To avoid that, RAID-1 helps, or
    a redundant distributed storage like DRBD, or a redundant distributed transaction system. You may also want more availability than a single
    system with RAID-1 (with a spare system standing by) provides, then
    you have to go for one of the redundant distributed approaches.

    However, my impression from booking flights online is that reliability
    of the booking platform is not at all a concern for the airlines. And
    as a customer, I find little difference between the booking front-end erroring out or the transaction back-end being unavailable.

    There's also a rule of thumb about databases that says one system of
    performance 100 is much better than 100 systems of performance 1
    because those 100 systems will spend all their time contending for
    database locks.

    If you handle each flight on one system, the contention for locks is
    only within that one system. And I expect that there is not that much contention. How many people book the same flight within the same
    millisecond (or however long the lock is held)?

    Unlike debit/credit or stock trading transactions which are self contained,
    the problem with airline reservation style transactions is they are
    interactive in the middle and a traditional DB record/row locking
    mechanism is insufficient.

    In the interactive transaction case, to do this properly (I don't know
    if airline systems actually do this) one needs to apply a timed reserve
    lock to a 3-seat row to give the agent a chance to talk to the customer
    and find out if the seat is acceptable. This creates a context that must
    be maintained for a period of time.

    Also typical SQL DB's do not have a way to read a row with lock
    and fail if already locked. They stall the request, which is not
    what you want a long duration interactive transaction to ever do.
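
    A minimal sketch of the timed reserve lock described above, kept
    independent of any particular database. Everything here is hypothetical
    and for illustration only: the SeatHolds class, the seat and agent
    identifiers, and the 120-second hold are assumptions, not a description
    of how any airline system works. The point is that try_hold() returns
    False immediately when the seat is taken instead of stalling the caller,
    and that an unconfirmed hold simply expires.

```python
import time


class SeatHolds:
    """In-memory table of timed seat holds (hypothetical, single process)."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.holds = {}  # seat -> (agent, expiry time)
        self.sold = {}   # seat -> agent

    def try_hold(self, seat, agent):
        """Reserve a seat for a limited time; return False at once if taken."""
        now = self.clock()
        if seat in self.sold:
            return False
        holder = self.holds.get(seat)
        if holder is not None and holder[1] > now and holder[0] != agent:
            return False  # fail fast instead of stalling the caller
        self.holds[seat] = (agent, now + self.hold_seconds)
        return True

    def confirm(self, seat, agent):
        """Turn a still-valid hold into a sale."""
        holder = self.holds.get(seat)
        if holder is None or holder[0] != agent or holder[1] <= self.clock():
            return False
        del self.holds[seat]
        self.sold[seat] = agent
        return True
```

    The clock is injectable so that expiry can be exercised without
    waiting; a real system would also have to persist the holds and share
    them between machines, which this sketch does not attempt.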

  • From MitchAlsup1@21:1/5 to Michael S on Fri Sep 13 17:05:35 2024
    On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:


    How many transactions per minute does world's biggest company need at
    peak hours? Is not this number small relatively to capabilities of
    even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?

    A SWAG::

    8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.

    So we have only 3B potential transactions, and a single person will
    not average more than 1 transaction every 15 minutes over an hour.

    So: 3B/15 = 200M T/m
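
    The same back-of-envelope estimate as a small sketch, with the
    per-second rate added for comparison with figures quoted elsewhere in
    the thread. The function name and parameters are made up for
    illustration, and the inputs are the guesses above (roughly 3B people
    in a position to transact, one transaction per person per 15 minutes),
    not measurements.

```python
def peak_rate(active_people, minutes_per_transaction):
    """Return (transactions per minute, transactions per second)."""
    per_minute = active_people / minutes_per_transaction
    return per_minute, per_minute / 60


per_minute, per_second = peak_rate(3_000_000_000, 15)
print(f"{per_minute:,.0f} transactions/minute")
print(f"{per_second:,.0f} transactions/second")
```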

  • From Lynn Wheeler@21:1/5 to John Levine on Fri Sep 13 08:45:46 2024
    John Levine <[email protected]> writes:
    They also care deeply about reliability. Modern mainframes have multiple kinds of error
    checking and standby CPUs that can take over from a failed CPU, restart a failed
    instruction, and the program doesn't notice. I think you'll find a pattern since the
    CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
    the error checking and recovery features so the systems keep running for years at a time.

    shortly after joining IBM, I got pulled into effort to multithread
    370/195 ... 195 didn't have branch prediction or speculative execution
    so conditional branches drained pipeline and most codes ran at half
    rated throughput. Two (simulated, "red/black") instruction streams
    running at half speed would achieve rated throughput.

    They also claimed that the main difference between 360/195 and 370/195
    was introduction of ("370") hardware retry (masking all sorts of
    transient hardware errors). Some vague recall mention that 360/195 mean
    time between some hardware check was three hrs (combination of number of circuits and how fast they were running).

    Then decision was made to add virtual memory to all 370s and it was
    decided that the difficulty in adding virtual memory to 370/195 wasn't justified ... and all new work on machine was dropped.

    Account of end of ACS/360 ... Amdahl had won the battle to make ACS, 360 compatible ... but folklore is then executives were afraid that it would advance state-of-the-art too fast and IBM would lose control of the
    market ... includes some references to multithreading
    patents&disclosures.
    https://people.computing.clemson.edu/~mark/acs_end.html

    also mentions some of ACS/360 features show up more than 20yrs later
    with ES/9000.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Lynn Wheeler@21:1/5 to John Levine on Fri Sep 13 09:05:33 2024
    John Levine <[email protected]> writes:
    I suppose. A review from the USDOE said:

    The IBM 3090 with Vector Facility is an extremely interesting machine
    because it combines very good scalar performance with enhanced vector
    and multitasking performance. For many IBM installations with a large
    scientific workload, the 3090/vector/MTF combination may be an ideal
    means of increasing throughput at minimum cost. However, neither the
    vector nor multitasking capabilities are sufficiently developed to
    make the 3090 competitive with our current worker machines for our
    large-scale scientific codes.

    1st part of 70s, IBM had FS effort, was totally different than 370
    and was to completely replace 370 (internal politics was killing
    off 370 efforts).
    http://www.jfsowa.com/computer/memo125.htm

    when FS finally implodes, there is a mad rush to get stuff back into the
    370 product pipelines, including kicking off 3033&3081 efforts in
    parallel.

    I got sucked into work on 16-processor 370 SMP and we con the 3033
    processor engineers into working on it in their spare time (lot more interesting than remapping 168 logic for 20% faster chips). Everybody
    thot it was great until somebody tells the head of POK lab that it could
    be decades before the POK favorite son operating system (batch MVS) has (effective) 16-processor support (at the time MVS documentation claimed
    that its 2-processor throughput had 1.2-1.5 times the throughput of
    single processor). The head of POK then invites some of us to never visit
    POK again and the 3033 processor engineers, heads down and no
    distractions (although I was invited to sneak back into POK to work with
    them). POK doesn't ship a 16-processor machine until after the turn of
    the century, more than two decades later.

    Once the 3033 was out the door, the processor engineers start on
    trout/3090. When vector was announced they complained about it being
    purely marketing stunt ... that they had so speeded up 3090 scalar that
    it ran at memory bus saturation (and vector would unlikely make
    throughput much better).

    I had also started pontificating the relative disk throughput had gotten
    an order of magnitude slower (disks got 3-5 times faster while systems
    got 40-50 times faster) since 360 announce. Disk division executive took exception and directed division performance group to refute the claims,
    after a couple weeks they came back and said I had slightly understated
    the problem. They respun the analysis on how to configure disks to
    improve system throughput for a user group presentation (16Aug1984,
    SHARE 63, B874).

    I was doing some work with disk engineers and they had been
    directed to use a very slow processor for the 3880 disk controller
    follow-on to the 3830 ... while it handled 3mbyte/sec 3380 disks, it
    otherwise seriously drove up channel busy. 3090 originally assumed that
    3880 would be like previous 3830 but with 3mbyte/sec transfer ... when
    they found out how bad things actually were, they realized they would
    have to seriously increase the number of (3mbyte/sec) channels (to
    achieve target throughput). Marketing then respins the significant
    increase in channels as being wonderful I/O machine. Trivia: the
    increase in channels required an extra TCM and the 3090 group
    semi-facetiously claimed they would bill the 3880 group for increase in
    3090 manufacturing cost.

    I was also doing some work with Clementi https://en.wikipedia.org/wiki/Enrico_Clementi
    E&S lab in IBM Kingston ... had boatload of Floating Point Systems boxes https://en.wikipedia.org/wiki/Floating_Point_Systems
    that had 40mbyte/sec disk arrays for keeping the FPS boxes fed.

    In 1980, I had been con'ed into doing channel-extender implementation
    for IBM STL (since renamed SVL), they were moving 300 people and 3270
    terminal to offsite bldg with dataprocessing back to STL datacenter.
    They had tried "remote 3270" but found human factors unacceptable. Channel-extender allowed "channel-attached" 3270 controllers to be placed
    at offsite bldg with no human factors difference between offsite and in
    STL. Side-effect was that it increased system throughput by 10-15%. They
    had previously spread 3270 controllers across all the same channels with
    disks, the channel-extender work significantly reduced 3270 terminal I/O channel busy, increasing disk I/O and system throughput (they were
    considering moving all 3270 controllers to channel-extender, even those physically inside STL). Then there was an attempt to get my support released to customers, but there was a group in POK playing with some serial stuff
    that got it vetoed; they were afraid if it was in the market, it would
    make it harder to release their stuff.

    In 1988, the IBM branch office asks if I could help LLNL get some serial
    stuff they were playing with, standardized ... which quickly becomes fibre-channel standard ("FCS", initially 1gbit/sec, full-duplex,
    aggregate 200mbyte/sec). The POK serial stuff finally gets released in
    the 90s with ES/9000 as ESCON (when it is already obsolete,
    17mbytes/sec). Then some POK engineers become involved with FCS and
    define a heavy-weight protocol that significantly reduces throughput, eventually released as FICON. The latest, public benchmark I've found is
    z196 "Peak I/O", getting 2M IOPS using 104 FICON. About the same time, a
    FCS is announced for E5-2600 blade claiming over million IOPS (two such
    FCS has higher throughput than 104 FICON). Also IBM docs had SAPs
    (system assist processors that do actual I/O) kept to 70% cpu (more like
    1.5M IOPS), also no IBM CKD DASD have been made for decades all being
    simulated on industry standard fixed-block disks.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Lynn Wheeler@21:1/5 to Terje Mathisen on Fri Sep 13 09:54:45 2024
    Terje Mathisen <[email protected]> writes:
    Novell's System Fault Tolerant NetWare 386 (around 1990) supported two complete servers acting like one, so that any hardware component could
    fail and the system would keep running, with nothing noticed by the
    clients, even those that were in the middle of an update/write
    request.

    late 80s, get HA/6000 project, originally for NYTimes to move their
    newspaper system (ATEX) off VAXCluster to RS/6000. I then rename it
    HA/CMP when I start doing technical/scientific scale-up with national
    labs and commercial scale-up with RDBMS vendors (Oracle, Sybase,
    Informix, Ingres) that had VAXCluster support in same source base with
    Unix (I do distributed lock manager that supported VAXCluster semantics
    to ease ports). https://en.wikipedia.org/wiki/IBM_High_Availability_Cluster_Multiprocessing

    IBM had been marketing S/88, rebranded fault tolerant. Then the S/88
    product administer starts taking us around to their customers. https://en.wikipedia.org/wiki/Stratus_Technologies
    Also has me write a section for the corporate continuous availability
    strategy document ... however, it gets pulled when both Rochester
    (AS/400, I-systems) and POK (mainframe) complain that they couldn't meet
    the requirements.

    Early Jan92 in meeting with Oracle CEO, AWD/Hester tells Ellison that we
    would have 16processor clusters by mid92 and 128processor clusters by
    ye92. Within a couple weeks (end jan92), cluster scale-up is transferred
    for announce as IBM Supercomputer (scientific/technical *ONLY*) and we
    are told we can't work on anything with more than four processors (we
    leave IBM a few months later).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Lynn Wheeler@21:1/5 to John Levine on Fri Sep 13 10:38:40 2024
    John Levine <[email protected]> writes:
    That's fine for workloads that work that way.

    Airline reservation systems historically ran on mainframes because when they were invented
    that's all there was (original SABRE ran on two 7090s) and they are business critical so
    they need to be very reliable.

    About 30 years ago some guys at MIT realized that route and fare search, which are some of
    the most demanding things that CRS do, are easy to parallelize and don't have to be
    particularly reliable -- if your search system crashes and restarts and reruns the search
    and the result is a couple of seconds late, that's OK. So they started ITA software which
    used racks of PC servers running parallel applications written in Lisp (they were from
    MIT) and blew away the competition.

    However, that's just the search part. Actually booking the seats and selling tickets stays
    on a mainframe or an Oracle system because double booking or giving away free tickets would
    be really bad.

    There's also a rule of thumb about databases that says one system of performance 100 is
    much better than 100 systems of performance 1 because those 100 systems will spend all
    their time contending for database locks.

    after leaving IBM was brought into largest airline res system to look at
    ten impossible things they can't do. Got started with "ROUTES" (about
    25% of the mainframe workload), they gave me a full softcopy of OAG (all scheduled commercial flt segments in the world) ... couple weeks later
    came back with ROUTES that implemented their impossible things.
    Mainframe had tech trade-offs from the 60s and starting from scratch
    could make totally different tech trade-offs, initially ran 100 times
    faster, then implementing the impossible stuff and still ran ten times
    faster (than their mainframe systems). Showed that ten rs6000/990 could
    handle workload for every flt and every airline in the world.

    Part of the issue was that they extensively massaged the data on a
    mainframe MVS/IMS system and then in sunday night, rebuilt the mainframe
    "TPF" (limited datamanagement services) system from the MVS/IMS
    system. That was all eliminated.

    Fare search was harder because it started being "tuned" by some real
    time factors.

    Could move all to RS/6000 - HA/CMP. Then some very non-technical issues kicked-in (like large staff involved in the data massaging). trivia: I
    had done a bunch of sleight of hand for HA/CMP RDBMS distributed lock
    manager scaleup for 128-processor clusters.


    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Thomas Koenig@21:1/5 to Michael S on Fri Sep 13 21:43:06 2024
    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Terje Mathisen <[email protected]> schrieb:

    10-15 years ago I talked to another speaker at a conference, he
    told me that he was working on high-end open source LDAP software
    using _very_ large memory DBs: Their system allowed one US cell
    phone company to keep every SIM card (~100M) on a single system,
    while a similar-size competitor had been forced to fall back on
    17-way sharding (presumably using a hash of the SIM id).

    Keeping databases in memory is definitely a thing now... see SAP HANA.

    Any architectural implications for this?

    Browsing through the SAP pages, it seems they used Intel's Optane
    persistent memory, but that is no longer manufactured (?). But
    having fast, persistent storage is definitely an advantage for
    databases.

    Large memory: Of course.

    On the ISA level... these databases run on x86, so that seems to
    be good enough.

    Anything else?


    Another thing that SAP HANA seems to use more intensely than anybody
    else is Intel TSX. TSX (at least RTM part, I am not sure about HLE
    part) still present in the latest Xeon generation, but is strongly de-emphasized.

    Sounds like a market niche... Mitch, how good is your ESM for
    in-memory databases?

  • From Lawrence D'Oliveiro@21:1/5 to Lynn Wheeler on Fri Sep 13 22:14:11 2024
    On Fri, 13 Sep 2024 09:05:33 -1000, Lynn Wheeler wrote:

    I had also started pontificating the relative disk throughput had gotten
    an order of magnitude slower (disks got 3-5 times faster while systems
    got 40-50 times faster) since 360 announce.

    Out of curiosity, did you have figures on how closely the filesystem could
    get to using all of theoretical disk I/O bandwidth?

    I ask because, in the Unix world, this was pretty terrible until
    Berkeley’s FFS (“Fast File System”) came along.

  • From John Levine@21:1/5 to [email protected] on Fri Sep 13 21:50:15 2024
    It appears that Michael S <[email protected]> said:
    There's also a rule of thumb about databases that says one system of
    performance 100 is much better than 100 systems of performance 1
    because those 100 systems will spend all their time contending for
    database locks.

    How many transactions per minute does world's biggest company need at
    peak hours?

    Ten years ago Visa could process 56,000 messages/second. It must be a
    lot more now. I think a transaction is two or four messages depending
    on the transaction type.

    Is not this number small relatively to capabilities of
    even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?

    Uh, no, it is not.


    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Sep 13 23:12:24 2024
    On Fri, 13 Sep 2024 21:43:06 +0000, Thomas Koenig wrote:

    Michael S <[email protected]> schrieb:
    On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Anything else?


    Another thing that SAP HANA seems to use more intensely than anybody
    else is Intel TSX. TSX (at least RTM part, I am not sure about HLE
    part) still present in the latest Xeon generation, but is strongly
    de-emphasized.

    Sounds like a market niche... Mitch, how good is your ESM for
    in-memory databases?

    I do not think the in-memory part has anything to do with ESM
    ATOMIC behavior.

    I have no actual data, all I have is mental analyses.

    The real thing about ESM is that it allows one to code in such a way
    as to need FEWER ATOMIC events--because each event can do more work;
    so, thereby one needs fewer events.

    1) You can acquire several cache lines and perform a single event
    that would take a more typical ISA multiple ATOMIC instructions.
    This attacks the exponent of how rapidly things degrade under
    contention.

    2) secondly if a higher privilege thread contends with a lower thread
    the higher privileged thread wins.

    3) amongst equally privileged threads the one(s) that have made more
    forward progress succeed while those just getting started fail.

    4) There are ways for SW to get a count of the amount of interference
    and each thread choose more wisely such that contention is reduced
    on subsequent tries. There are some ATOMIC things for which this takes
    a BigO( n**3 ) and makes it BigO( 3 ) {yes constant time}. A more
    typical use with new contenders coming and going randomly goes from
    BigO( n**3 ) to between BigO( n*ln(ln(n)) ) and BigO( n*ln(n) ).

    HOWEVER:: if one uses ESM to simply implement locking behavior; only
    part 1) above applies. That is, if one uses ESM to create your standard {test&set, test&test&set, LoadLocked-StoreConditional, CAS, DCAS,
    DCADS, TCADS,...} to get a performing kernel that depends on how
    the SW is written, not necessarily how HW performs ESM.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Sat Sep 14 01:47:03 2024
    On Fri, 13 Sep 2024 12:22:17 +0300, Michael S wrote:

    How many transactions per minute does world's biggest company need at
    peak hours?

    A few years ago, I read an article about Facebook’s setup. At the time,
    they had about a billion users who were active at least once a month. So
    that would have been over 300 postings per second, sustained.

    They were using MySQL with memcached, and I think they already had HHVM
    (their custom PHP implementation) then as well.

    Mainframes? Never heard of them.

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Sat Sep 14 01:44:14 2024
    On Fri, 13 Sep 2024 11:20:06 -0000 (UTC), Thomas Koenig wrote:

    Keeping databases in memory is definitely a thing now... see SAP HANA.

    memcached might have been there before SAP.

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Sat Sep 14 01:48:26 2024
    On Fri, 13 Sep 2024 16:18 +0100 (BST), John Dallman wrote:

    So there's real demand for systems with huge capacity. Not very many of
    them, but they have large budgets.

    Did somebody say “cloud” ... ?

  • From Anton Ertl@21:1/5 to John Levine on Sat Sep 14 09:21:46 2024
    John Levine <[email protected]> writes:
    It appears that Michael S <[email protected]> said:
    There's also a rule of thumb about databases that says one system of
    performance 100 is much better than 100 systems of performance 1
    because those 100 systems will spend all their time contending for
    database locks.

    How many transactions per minute does world's biggest company need at
    peak hours?

    Ten years ago Visa could process 56,000 messages/second. It must be a
    lot more now. I think a transaction is two or four messages depending
    on the transaction type.

    Is not this number small relatively to capabilities of
    even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?

    Uh, no, it is not.

    The way I would design this for a machine with that little IOPS is as
    an in-memory database, with transactions written to a log on RAID-1
    (on two or three of the HDDs), and a snapshot of the in-memory
    database written to disk repeatedly, with copy-on-write to get a
    consistent snapshot. The 8 cores of a 2009-vintage dual-Xeon
    machine should be easily capable of doing it, but the question is if
    the machine has enough RAM for the database. Our dual-Xeon system
    from IIRC 2007 has 24GB of RAM, not sure how big it could be
    configured; OTOH, we have a single-Xeon system from 2009 or so with
    32GB of RAM (and there were bigger Xeons in the market at the time).
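
    A minimal sketch of the log-then-apply part of this design, under
    loud assumptions: the LoggedStore class, the JSON record format and the
    account string are invented for illustration, the log is a plain file
    (RAID-1 would sit underneath, at the block level), and the periodic
    copy-on-write snapshot is left out, so recovery replays the whole log.
    It also does not handle a torn final record after a crash.

```python
import json
import os
import tempfile


class LoggedStore:
    """Balances kept in RAM; every update is logged to disk before it is applied."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.balances = {}
        if os.path.exists(log_path):
            self._replay()
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the in-memory table from the log after a restart."""
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self._apply(rec["account"], rec["delta"])

    def _apply(self, account, delta):
        self.balances[account] = self.balances.get(account, 0) + delta

    def transact(self, account, delta):
        """Log first, force to stable storage, then update RAM and report."""
        self.log.write(json.dumps({"account": account, "delta": delta}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # completion is reported only after this
        self._apply(account, delta)
        return self.balances[account]

    def close(self):
        self.log.close()


def demo():
    """Write two transactions, restart, and recover the table from the log."""
    path = os.path.join(tempfile.mkdtemp(), "txn.log")
    store = LoggedStore(path)
    store.transact("4000-0001", 500)
    store.transact("4000-0001", -120)
    store.close()  # stand-in for a crash and restart
    recovered = LoggedStore(path)
    balances = dict(recovered.balances)
    recovered.close()
    return balances
```

    One fsync per transaction is the simplest choice but would be far too
    slow on spinning disks at the rates discussed here; a real design would
    presumably batch many transactions into each log write.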

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Sep 14 09:59:58 2024
    Anton Ertl <[email protected]> schrieb:

    [in-memory database]

    but the question is if
    the machine has enough RAM for the database. Our dual-Xeon system
    from IIRC 2007 has 24GB of RAM, not sure how big it could be
    configured; OTOH, we have a single-Xeon system from 2009 or so with
    32GB of RAM (and there were bigger Xeons in the market at the time).

    The minimum requirement of SAP HANA is 64 GB of memory, but typical
    ranges are from 256GB to 1TB.

    Interestingly enough, it will run on selected systems, which only
    have Intel processors, and little-endian POWER 8 to 10. No AMD,
    no ARM, no zSystem.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sat Sep 14 09:42:00 2024
    On Fri, 13 Sep 2024 21:50:15 -0000 (UTC), John Levine wrote:

    Ten years ago Visa could process 56,000 messages/second.

    That maybe sounds better than it is. After all, most of those transactions would tend to be geographically localized.

  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Sep 14 10:45:34 2024
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:

    [in-memory database]

    but the question is if
    the machine has enough RAM for the database. Our dual-Xeon system
    from IIRC 2007 has 24GB of RAM, not sure how big it could be
    configured; OTOH, we have a single-Xeon system from 2009 or so with
    32GB of RAM (and there were bigger Xeons in the market at the time).

    The minimum requirement of SAP HANA is 64 GB of memory, but typical
    ranges are from 256GB to 1TB.

    What is the relevance of SAP HANA for the topic at hand?

    The question was if the RAM can hold the data. For each account they
    would have to keep the current balance (64 bits should be enough for
    that), the account number (64 bits for the up to 19 digits of a Visa
    card) for verifying that we are at the correct entry in the hash table
    and probably some account status information (64 bits should be
    plenty?).

    There is also the sequence of transactions (a 64-bit transaction
    offset in the log per transaction should be enough for that). The
    sequence of transactions may be useful for fraud detection, but I
    don't know enough about that to know how to scale the system, so I'll
    just say that fraud detection is done by a bigger system before the
    transaction goes through to the transaction processing computer.

    The sequence of transactions is also needed for generating the reports
    and for dealing with customer complaints, but again, that's not
    processing the transactions themselves (and is basically read-only,
    except that the customer-complaint processing may result in additional transactions).

    So, with 24 bytes needed for each account on the
    transaction-processing server, 32GB with, say 8GB left for
    copy-on-write and other administrative purposes should be good for
    about 900M accounts at a hash table load factor of 84%. I guess that
    Visa has more accounts, so one would need a box with more RAM.
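
    The estimate above can be checked with a few lines of integer
    arithmetic. This is only a sketch: the function name accounts_that_fit
    is made up for the example, and "GB" is taken to mean 2^30 bytes.

```c
#include <stdint.h>

/* Number of accounts that fit in a hash table, given total RAM, RAM held
   back for other purposes, bytes per entry, and load factor in percent. */
uint64_t accounts_that_fit(uint64_t ram_bytes, uint64_t reserved_bytes,
                           uint64_t entry_bytes, uint64_t load_percent)
{
    uint64_t slots = (ram_bytes - reserved_bytes) / entry_bytes;
    return slots * load_percent / 100;
}
```

    Calling accounts_that_fit(32ull << 30, 8ull << 30, 24, 84) gives a
    little over 900 million, in line with the figure quoted above.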

    A single core of the Xeon should easily be able to handle all the 56K transactions per second, both the logging and the update of the hash
    table, and in that case no locking is needed. But that first needs a
    sequence of transactions coming in.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Sep 14 11:46:24 2024
    Anton Ertl <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:

    [in-memory database]

    but the question is if
    the machine has enough RAM for the database. Our dual-Xeon system
    from IIRC 2007 has 24GB of RAM, not sure how big it could be
    configured; OTOH, we have a single-Xeon system from 2009 or so with
    32GB of RAM (and there were bigger Xeons in the market at the time).

    The minimum requirement of SAP HANA is 64 GB of memory, but typical
    ranges are from 256GB to 1TB.

    What is the relevance of SAP HANA for the topic at hand?

    It is something that is implemented, unlike what you were discussing.

  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Sep 14 12:48:33 2024
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    The minimum requirement of SAP HANA is 64 GB of memory, but typical
    ranges are from 256GB to 1TB.

    What is the relevance of SAP HANA for the topic at hand?

    It is something that is implemented, unlike what you were discussing.

    So what? Linux is also implemented, and it runs on a 32GB machine.

    Neither Linux nor SAP HANA satisfy even the most basic requirement
    that I outlined (keeping balances) without additional implementation
    work. And I doubt that if you give a 15 year old Dual-Xeon even with
    64GB of RAM and a bunch of HDDs to a typical SAP developer, he will
    implement a system that manages to keep the balance on 1.8M (double
    the number for double RAM capacity) credit cards at 56K transactions
    per second on that system. What I described is relatively
    straightforward to implement on top of Linux.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Sep 14 13:41:53 2024
    [email protected] (Anton Ertl) writes:
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    The minimum requirement of SAP HANA is 64 GB of memory, but typical
    ranges are from 256GB to 1TB.

    What is the relevance of SAP HANA for the topic at hand?

    It is something that is implemented, unlike what you were discussing.

    So what? Linux is also implemented, and it runs on a 32GB machine.

    Neither Linux nor SAP HANA satisfy even the most basic requirement
    that I outlined (keeping balances) without additional implementation
    work. And I doubt that if you give a 15 year old Dual-Xeon even with
    64GB of RAM and a bunch of HDDs to a typical SAP developer, he will
    implement a system that manages to keep the balance on 1.8M (double
    the number for double RAM capacity) credit cards at 56K transactions
    per second on that system. What I described is relatively
    straightforward to implement on top of Linux.

    That should be 1.8G credit cards. I guess that a typical SAP HANA
    client developer will be able to handle 1.8M credit cards in 64GB.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to [email protected] on Sat Sep 14 22:17:51 2024
    On Fri, 13 Sep 2024 17:05:35 +0000
    [email protected] (MitchAlsup1) wrote:

    On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:


    How many transactions per minute does world's biggest company need
    at peak hours? Is not this number small relatively to capabilities
    of even 15 y.o. dual-Xeon server with few dozens of spinning rust
    disks?

    A SWAG::

    8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.

    So we have only 3B potential transactions, and a single person will
    not average more than 1 transaction every 15 minutes over an hour.

    So: 3B/15 = 200M T/m

    I don't know about you, but I personally don't book flights 8 hours per
    day. Even less so in the biggest company in the world, which, I
    suppose, does not account for more than 5-7% of world's flights.

  • From Michael S@21:1/5 to John Levine on Sat Sep 14 23:32:36 2024
    On Fri, 13 Sep 2024 21:50:15 -0000 (UTC)
    John Levine <[email protected]> wrote:

    It appears that Michael S <[email protected]> said:
    There's also a rule of thumb about databases that says one system
    of performance 100 is much better than 100 systems of performance 1
    because those 100 systems will spend all their time contending for
    database locks.

    How many transactions per minute does world's biggest company need at
    peak hours?

    Ten years ago Visa could process 56,000 messages/second. It must be a
    lot more now. I think a transaction is two or four messages depending
    on the transaction type.

    Is not this number small relatively to capabilities of
    even 15 y.o. dual-Xeon server with few dozens of spinning rust
    disks?

    Uh, no, it is not.



    I probably was not clear enough. I have no doubts that there exist jobs
    for which the machine that I mentioned above is insufficient.
    I just don't believe that flight reservation is one of such jobs.

    BTW, the tpc.org site is down, so I cannot find OLTP benchmarks for the
    sort of machines that I mentioned in my previous post. Quite possibly
    they (scores) do not exist, because in 2009 people had already stopped
    benchmarking OLTP with rotating media.
    For reference, the best TPC-C score for 2-way Xeon in 2010 is 803,068
    tpmC, but that score uses SSDs. IIRC, that's ~80% of the world's
    absolutely fastest non-clustered score of 2003.

    I am posting a link in the hope that tpc.org will be up tomorrow.
    http://www.tpc.org/results/individual_results/HP/hp_DL380_TPCC_051110_ES.pdf

  • From MitchAlsup1@21:1/5 to Michael S on Sat Sep 14 20:42:31 2024
    On Sat, 14 Sep 2024 19:17:51 +0000, Michael S wrote:

    On Fri, 13 Sep 2024 17:05:35 +0000
    [email protected] (MitchAlsup1) wrote:

    On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:


    How many transactions per minute does world's biggest company need
    at peak hours? Is not this number small relatively to capabilities
    of even 15 y.o. dual-Xeon server with few dozens of spinning rust
    disks?

    A SWAG::

    8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.

    So we have only 3B potential transactions, and a single person will
    not average more than 1 transaction every 15 minutes over an hour.

    So: 3B/15 = 200M T/m

    I don't know about you, but I personally don't book flights 8 hours per
    day. Even less so in the biggest company in the world, which, I
    suppose, does not account for more than 5-7% of world's flights.

    It is a number that the total of all kinds of transactions worldwide,
    for a total population of 8B people, will not exceed.
    {Not just airline, but every transaction over the whole world.}

  • From Lynn Wheeler@21:1/5 to John Levine on Sat Sep 14 10:57:57 2024
    John Levine <[email protected]> writes:
    Ten years ago Visa could process 56,000 messages/second. It must be a
    lot more now. I think a transaction is two or four messages depending
    on the transaction type.

    card associations were originally to promote brand acceptance/uptake/advertising and network interconnecting the acquiring/merchant card transaction processors with the issuing/consumer
    card transaction processors (issuer processor doing the real-time authorization/"auth" transaction).

    late 90s, internet/micropayments was looking at card transaction
    processors being able to handle micropayments ... but required
    significantly higher transaction rate than card processors were capable
    of. They turned to cellphone operations that were using "in-memory" DBMS
    capable of ten times the transaction rate (that card processors were
    doing).

    Some of the cellphone companies were enticed to get into micropayments
    but got out after a few years, turns out they lacked the significant
    fraud handling capability (they were absorbing cellphone calling fraud
    because it was their own resources, but in case of micropayments fraud,
    it involved actually transferring real money to other entities).

    As an aside, card association interconnect network was flavor of VAN
    (value added networks) that was prevalent at the time, but were in the
    process of being obsoleted by the internet. As an aside, at the turn
    of the century, 90% of all acquiring&issuing card transactions were
    being handled by six datacenters having their own private, dedicated,
    non-association interconnect ... big litigation between card
    associations and those processors (the card association network had been
    charging fee for each transaction that flowed through their network, and
    association still wanted that fee paid them whether or not the
    transaction actually flowed through their network).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 15 00:57:12 2024
    On Sun, 15 Sep 2024 0:40:31 +0000, Lawrence D'Oliveiro wrote:

    On Sat, 14 Sep 2024 10:57:57 -1000, Lynn Wheeler wrote:

    Some of the cellphone companies were enticed to get into micropayments
    but got out after a few years, turns out they lacked the significant
    fraud handling capability ...

    Meanwhile, the Kenyans have figured out how to run a successful online micropayments system mediated via text messages (M-Pesa).

    Another reason not to have ANY apps on your cell phone.

  • From Lawrence D'Oliveiro@21:1/5 to Lynn Wheeler on Sun Sep 15 00:40:31 2024
    On Sat, 14 Sep 2024 10:57:57 -1000, Lynn Wheeler wrote:

    Some of the cellphone companies were enticed to get into micropayments
    but got out after a few years, turns out they lacked the significant
    fraud handling capability ...

    Meanwhile, the Kenyans have figured out how to run a successful online micropayments system mediated via text messages (M-Pesa).

  • From John Levine@21:1/5 to All on Sun Sep 15 02:02:07 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    Some of the cellphone companies were enticed to get into micropayments
    but got out after a few years, turns out they lacked the significant
    fraud handling capability ...

    Meanwhile, the Kenyans have figured out how to run a successful online
    micropayments system mediated via text messages (M-Pesa).

    M-Pesa isn't micropayments, typical transfers are on the order of
    a dollar. It was only possible because there was a dominant
    government-owned mobile carrier in Kenya and M-Pesa is
    basically sending around prepaid mobile phone credits.

    It is certainly a success, providing banking services to vast numbers
    of poor people who'd never be able to use a normal bank.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lynn Wheeler@21:1/5 to All on Sat Sep 14 17:00:59 2024
    before and after turn of century we would periodically have threads on
    the "bank fraud blame game" (in financial industry mailing lists);
    interchange fees that financial institutions charge merchants is base
    plus fraud surcharge ... adjusted for the fraud rate for kind of
    transactions .... internet transactions can have highest surcharge (with
    many banks' profit from fraud surcharge reaching major percentage of
    their bottom line)

    right after turn of century, several "safe" transaction products were
    presented to major online merchants (representing 80% of total internet
    payment transactions) which saw high acceptance ... expecting that the
    fraud surcharge would be eliminated. Then the cognitive dissonance set
    in, they were told that instead of eliminating the fraud surcharge, a
    new large "safe" surcharge would be added on top of the existing fraud
    surcharge ... and all the interest evaporated.

    I had co-authored financial industry transaction protocols as well as
    done "safe" transaction chip design (that was one of the "safe"
    products) ... was one of panel giving talk at standing room only large
    ballroom, semi-facetiously saying I was taking $500 milspec chip and
    aggressively cost reducing by more than two orders of magnitude while
    increasing its security:
    https://csrc.nist.gov/pubs/conference/1998/10/08/proceedings-of-the-21st-nissc-1998/final
    got prototype chips after turn of the century and gave talk in assurance
    panel in the trusted computing track at 2001 IDF
    https://web.archive.org/web/20011109072807/http://www.intel94.com/idf/spr2001/sessiondescription.asp?id=stp%2bs13
    the guy running trusted-computing TPM chip was in front row and I chided
    him that it was nice to see his chip was starting to look more like
    mine; his response was that I didn't have a committee of 200 people
    helping me with design.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Kent Dickey@21:1/5 to Anton Ertl on Fri Sep 20 18:35:26 2024
    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly
    predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.

    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.

    [ More details about architectures without trapping overflow instructions ]

    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values. If you take Rust's approach, and only
    detect overflow in debug builds, then you already don't care about
    performance.

    If you want to do almost anything at all other than core dump on
    overflow, you need to branch to recovery code. And although it's
    theoretically possible to recover from the trap, it's worse than any
    other approach. So it's added hardware that's HARDER for software to
    use. No surprise it's gone away.

    IA64 went down this road--trapping on speculation failures. It was a
    huge disaster--trying to recover through an exception handler mechanism
    is slow and painful, for the reasons I'll lay out for overflow
    exceptions.

    Let's look at how you might want to handle overflows when they happen:

    1) Your language supports seamlessly transitioning to BigInts on
    overflow. Then each operation that could overflow needs to call
    a special bit of code to change to BigInt and then continue the
    calculation. This code must exist, even if a trapping
    instruction doesn't need an explicit branch to it. Some
    mechanism is needed to call this code.

    2) You need to call an exception handler, and the routine with the overflow
    is ended. We need to know which exception handler to call.

    3) You want to clamp the value to a reasonable range and continue. The
    reasonable values need to be looked up somewhere.

    4) You just want to crash the program. If a debugger is attached, it can
    say where the overflow occurred.

    Trapping on overflow instructions really are only useful for #4. Let's
    look at how the other cases could be handled, with a) meaning using
    branches, and b) meaning using a trapping instruction.

    1a) (BigInt): After doing an operation which could overflow, use a
    conditional branch to jump to code to convert to BigInt, which
    then jumps back. Overhead is basically the branch-on-overflow
    instruction.

    1b) (BigInt with traps). Hardware traps to the OS, which needs to prepare
    the required structures describing the exception (all regs and
    the address), and then call the signal handler. The signal
    handler needs to look up the address of the trap with a table
    describing what to do for this particular operation which
    overflowed. Each table entry needs to describe, in detail, what
    registers are involved (the sources and the dest), and where to
    return once the BigInt has been created. This requires massive
    changes to the compiler (and possibly linker) to prepare these
    tables. The compiler must guarantee that changing the dest
    register to a pointer to BigInt works properly (otherwise,
    special code needs to be emitted for each potentially trapping
    instruction to try to recover).

    2a) (Try/Catch): After doing an operation which could overflow, use a
    conditional branch to jump to the catch block.

    2b) (Try/Catch with traps). Repeat all the OS work and call the signal
    handler. Now, it just needs a table entry describing where to
    jump to to enter the catch block. Almost all the complexity of
    1b), but without needing the register details.

    3a) (Clamp): After doing an operation which could overflow, use a
    conditional branch to do the MIN/MAX operations to bring it back
    within range and then jump back.

    3b) (Clamp with trap): Basically the same as 1b), but there's an alternative
    if the clamps are global (MAX_INT/MIN_INT). The exception handler
    can read the instruction which trapped, figure out the source and
    dest registers, re-do the calculation, and clamp the destination
    to MIN or MAX, and return to just after the instruction which
    trapped.

    4a) (Crash): Every operation that could overflow needs a conditional branch
    after it to branch to a crashing instruction (or a branch over
    an undefined instruction if there's no overflow).

    4b) (Crash with trap): Use operations which trap on overflow. This takes
    no new instructions and costs no performance.

    Basically, all a) cases are:

        op_with_might_overflow();
        if (overflow_happened) {
            handle the overflow
        }
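
    The branch-based a) cases can be sketched in C roughly as follows. This
    is an illustration only: the function names are invented for the
    example, and __builtin_add_overflow is a GCC/Clang extension (it
    returns true when the mathematically correct sum does not fit in the
    result type). Case 1a would look like add_checked with the failure path
    calling into a BigInt library.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* 2a) Report failure to the caller, which branches to its "catch" code.
   Returns true on success and stores the sum in *sum. */
bool add_checked(int a, int b, int *sum)
{
    return !__builtin_add_overflow(a, b, sum);
}

/* 3a) Clamp to the global limits INT_MIN/INT_MAX and continue.
   Signed addition can only overflow when both operands have the same
   sign, so the sign of b tells us which limit was crossed. */
int add_clamped(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        return (b > 0) ? INT_MAX : INT_MIN;
    return sum;
}

/* 4a) Crash: branch to an instruction that terminates the program. */
int add_or_abort(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        abort();
    return sum;
}
```

    In each variant the cost on the non-overflowing path is the add plus
    one normally not-taken conditional branch.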

    Trapping-on-overflow instructions are clearly useless for languages
    which care about overflow. To save one branch instruction, an entry is
    needed to describe how to handle the overflow, which is certainly larger
    than a branch instruction. And the code to "handle the overflow" is
    needed in any case. And this assumes some sort of instant lookup--of the
    1000 overflow instructions, we need a hash table to look up the address,
    which is more overhead.
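
    To make the size argument concrete, here is a hypothetical shape for
    one such table entry and its lookup. Everything in it (trap_entry, its
    fields, lookup_trap) is an assumption for illustration, not any real
    ABI. It also swaps the hash table mentioned above for a table sorted by
    address and searched with the standard bsearch, simply because that is
    shorter to show.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* What a trap handler would need to know about one trapping instruction. */
typedef struct {
    uintptr_t trap_pc;    /* address of the instruction that may trap */
    uintptr_t resume_pc;  /* where to continue once recovery is done */
    uint8_t   src1, src2; /* source register numbers */
    uint8_t   dst;        /* destination register number */
} trap_entry;

/* bsearch comparator: key is a pointer to the faulting address. */
static int compare_pc(const void *key, const void *elem)
{
    uintptr_t pc = *(const uintptr_t *)key;
    const trap_entry *e = elem;
    return (pc > e->trap_pc) - (pc < e->trap_pc);
}

/* Finds the entry for a faulting address, or NULL if there is none.
   The table must be sorted by ascending trap_pc. */
const trap_entry *lookup_trap(const trap_entry *table, size_t n, uintptr_t pc)
{
    return bsearch(&pc, table, n, sizeof *table, compare_pc);
}
```

    On a typical 64-bit target this entry is padded to 24 bytes, several
    times the size of the branch instruction it replaces, and the lookup
    runs only after the OS has already delivered the signal.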

    Trapping on overflow instructions are useful as a debug aid for
    languages which don't care about overflow--but then you're optimizing
    something nearly useless. It also might be helpful if global clamping
    to MIN/MAX was useful (and I don't think it is).

    Instruction sets which make detecting overflow difficult (say, RISC-V),
    would do well to make branch-on-overflow efficient and easy. But adding trap-on-overflow instructions is a waste of effort.

    Note that using traps on data access violations which are "fixed" by
    signal handlers CAN work out. They are slow, but as long as the
    exception handler can fix the access violation and return right to the instruction which failed (without needing to know ANYTHING about that instruction in particular), this can work fine. But integer overflow
    doesn't work like that--it's generally not possible to figure out
    in the trap handler what to do without more information.

    Kent

  • From MitchAlsup1@21:1/5 to Kent Dickey on Fri Sep 20 22:00:28 2024
    On Fri, 20 Sep 2024 18:35:26 +0000, Kent Dickey wrote:

    In article <[email protected]>,

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.

    [ More details about architectures without trapping overflow
    instructions ]

    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values. If you take Rust's approach, and only
    detect overflow in debug builds, then you already don't care about performance.

    If you want to do almost anything at all other than core dump on
    overflow, you need to branch to recovery code. And although it's theoretically possible to recover from the trap, it's worse than any
    other approach. So it's added hardware that's HARDER for software to
    use. No surprise it's gone away.

    Note: Linux does not even have an "Integer Overflow" signal, while
    it does have a "FP exception" signal.

    But then IEEE 754 exception semantics make even less sense than
    Linux signals. ...

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 01:09:43 2024
    On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:

    But then IEEE 754 exception semantics make even less sense than Linux signals. ...

    Note that what IEEE 754 calls an “exception” is just a bunch of status
    bits reporting on the current state of the computation: there is no
    implication of some transfer of control elsewhere.

  • From Lawrence D'Oliveiro@21:1/5 to Kent Dickey on Sat Sep 21 01:12:11 2024
    On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

    3) You want to clamp the value to a reasonable range and continue. The
    reasonable values need to be looked up somewhere.

    This won’t work. The values outside the range are by definition
    non-representable, so comparisons against them are useless.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Sep 21 01:52:32 2024
    On Sat, 21 Sep 2024 1:09:43 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:

    But then IEEE 754 exception semantics make even less sense than Linux
    signals. ...

    Note that what IEEE 754 calls an “exception” is just a bunch of status bits reporting on the current state of the computation: there is no implication of some transfer of control elsewhere.

    Then how do you implement the alternate exception model ???
    which IS part of 754-2008 and 754-2019

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Sep 21 01:51:21 2024
    On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

    3) You want to clamp the value to a reasonable range and continue. The
    reasonable values need to be looked up somewhere.

    This won’t work. The values outside the range are by definition
    non-representable, so comparisons against them are useless.

    When a range is 0..10 both -1 and 11 are representable in
    the arithmetic of ALL computers, just not in the language
    specifying the range.

    So you are talking a language issue not a computer arithmetic
    issue.
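
    A small sketch of that point, with invented names (range_status,
    check_subrange, clamp_to_subrange): the subrange 0..10 is a language
    level constraint, while the machine's int holds -1 and 11 without any
    trouble, so they can be compared against the bounds and clamped.

```c
typedef enum { IN_RANGE, BELOW_RANGE, ABOVE_RANGE } range_status;

/* Classifies a value against a language-level subrange lo..hi. */
range_status check_subrange(int value, int lo, int hi)
{
    if (value < lo) return BELOW_RANGE;
    if (value > hi) return ABOVE_RANGE;
    return IN_RANGE;
}

/* Clamps a value into lo..hi. */
int clamp_to_subrange(int value, int lo, int hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}
```

    For the range 0..10, check_subrange reports -1 as BELOW_RANGE and 11
    as ABOVE_RANGE, and clamp_to_subrange maps them to 0 and 10.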

  • From Niklas Holsti@21:1/5 to All on Sat Sep 21 10:56:24 2024
    On 2024-09-21 4:51, MitchAlsup1 wrote:
    On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

    3) You want to clamp the value to a reasonable range and continue. The
    reasonable values need to be looked up somewhere.

    This won’t work. The values outside the range are by definition non-
    representable, so comparisons against them are useless.

    When a range is 0..10 both -1 and 11 are representable in
    the arithmetic of ALL computers, just not in the language
    specifying the range.


    For "11" I agree, for "-1" disagree.

    If the program was written (in whatever language) with the assumption
    that the data type in question is unsigned, then it cannot represent -1
    in the program's view of the bits. The bits that represent -1 in a
    signed two's complement view represent a large positive value in the
    unsigned view that the code uses.

    Now if the error condition that was trapped or detected was an attempt
    to produce a negative value like -1 for an unsigned data type, that
    error condition is of course representable separately; it does not have
    to be encoded by an out-of-range value in the data type itself.
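
    Both halves of that argument can be shown in a short C sketch. The
    function names are made up for the example, and __builtin_sub_overflow
    is a GCC/Clang extension that reports whether the true result fits in
    the destination type.

```c
#include <stdbool.h>
#include <stdint.h>

/* The bit pattern of a signed value as seen through an unsigned type. */
uint32_t unsigned_view(int32_t x)
{
    return (uint32_t)x;
}

/* Unsigned subtraction that reports "the result would be negative" as a
   separate boolean instead of encoding it in the value.
   Returns true on success; false if a < b. */
bool sub_unsigned(uint32_t a, uint32_t b, uint32_t *diff)
{
    return !__builtin_sub_overflow(a, b, diff);
}
```

    Here unsigned_view(-1) is 4294967295, a large positive value, while
    sub_unsigned(0, 1, &d) returns false, carrying the error condition
    outside the data type.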

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 08:17:17 2024
    On Sat, 21 Sep 2024 01:52:32 +0000, MitchAlsup1 wrote:

    On Sat, 21 Sep 2024 1:09:43 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:

    But then IEEE 754 exception semantics make even less sense than Linux
    signals. ...

    Note that what IEEE 754 calls an “exception” is just a bunch of status
    bits reporting on the current state of the computation: there is no
    implication of some transfer of control elsewhere.

    Then how do you implement the alternate exception model ??? which IS
    part of 754-2008 and 754-2019

    Section 8.3 of the 2008 spec says:

    NOTE 2 — Immediate alternate exception handling for an exception
    can be implemented by traps or, for exceptions listed in Clause 7
    other than underflow, by testing status flags after each operation
    or at the end of the associated block. Thus for exceptions listed
    in Clause 7 other than underflow, immediate exception handling can
    be implemented with the same mechanism as delayed exception
    handling, if no better implementation mechanism is available.

    So explicit testing of flag bits is permitted. Note that the special case
    for underflow mentioned is that the exception signalled is “inexact”, not “underflow”.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 08:18:06 2024
    On Sat, 21 Sep 2024 01:51:21 +0000, MitchAlsup1 wrote:

    On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

    3) You want to clamp the value to a reasonable range and continue.
    The reasonable values need to be looked up somewhere.

    This won’t work. The values outside the range are by definition non-
    representable, so comparisons against them are useless.

    When a range is 0..10 both -1 and 11 are representable in the arithmetic
    of ALL computers, just not in the language specifying the range.

    That’s an “out of subrange” error, not an “overflow” error.

  • From EricP@21:1/5 to Kent Dickey on Sat Sep 21 13:05:02 2024
    Kent Dickey wrote:
    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly
    predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.
    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.

    [ More details about architectures without trapping overflow instructions ]

    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values. If you take Rust's approach, and only
    detect overflow in debug builds, then you already don't care about performance.

    Those automatic software correctness checks, of which signed integer
    overflow detection is one of many, went away because most code was
    being written in C/C++ and those two languages don't require them.

    That just makes it more expensive in code size and performance to effect
    such checks. This overhead leads some to conclude it justifies eliminating
    the error checks.

    Eliminating the error event detectors doesn't make errors go away,
    just your knowledge of them.

    I gather portions of 16-bit Windows 3.1 were written in Pascal.
    When Microsoft developed 32-bit WinNT, if instead of C they had
    switched their official development language from Pascal to Modula-2
    which does require signed and unsigned, checked and modulo arithmetic,
    and array bounds checks, the world would have been a much safer place.

    But they didn't so it isn't.

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    If you want to do almost anything at all other than core dump on
    overflow, you need to branch to recovery code. And although it's theoretically possible to recover from the trap, it's worse than any
    other approach. So it's added hardware that's HARDER for software to
    use. No surprise it's gone away.

    The reason it traps is because YOU ASKED IT TO DETECT CERTAIN EVENTS!
    An exception is just a method to deliver notification of an event.
    What makes such event detections efficient, in code size and performance,
    is that they ARE automatic and in the background.

    What makes those events errors is that you DIDN'T handle them.
    If you did handle them, then they wouldn't be errors,
    just automatically detected events.

    The reason most code does not have exception handlers, and most exceptions
    are fatal, is that the code has no way to recover from fundamental
    programming errors that are never supposed to occur.

    Eliminating the error event detectors doesn't make errors go away,
    just your knowledge of them.

    The only exception handler I have is a last-chance handler at the top
    of the thread stack, which dumps a stack traceback to a log file,
    and attempts a clean shutdown.

    IA64 went down this road--trapping on speculation failures. It was a
    huge disaster--trying to recover through an exception handler mechanism
    is slow and painful, for the reasons I'll lay out for overflow
    exceptions.

    That has nothing to do with the occurrence of errors, software or hardware. That notification of such events was painful on IA64, such was its nature.

    Let's look at how you might want to handle overflows when they happen:

    1) Your language supports seamlessly transitioning to BigInts on
    overflow. Then each operation that could overflow needs to call
    a special bit of code to change to BigInt and then continue the
    calculation. This code must exist, even if a trapping
    instruction doesn't need an explicit branch to it. Some
    mechanism is needed to call this code.

    BigInt is a variable sized signed integer type that, by definition,
    does not overflow. A BigInt library will make its own decisions on
    how to efficiently implement that behavior.

    I do not want integers that I declared as fixed sized types to be
    changed to variable sized BigInts, thank you.

    2) You need to call an exception handler, and the routine with the overflow
    is ended. We need to know which exception handler to call.

    3) You want to clamp the value to a reasonable range and continue. The
    reasonable values need to be looked up somewhere.

    I do not want integers that I declared as (normal) linear types to be
    changed to saturating types, thank you.

    4) You just want to crash the program. If a debugger is attached, it can
    say where the overflow occurred.

    If I _asked_ for signed overflow event detection on some expressions
    and one occurs, and the detection and delivery mechanism uses exceptions,
    then yes that is what happens.

    And yes, I crash by virtue of not having an exception handler for such
    an event because it is never supposed to occur and I have no way to
    correct the situation. If I could correct it I would and it wouldn't
    be an error, just an event. But I didn't so it was.

    Trapping on overflow instructions really are only useful for #4. Let's
    look at how the other cases could be handled, with a) meaning using
    branches, and b) meaning using a trapping instruction.

    Yep, pretty much. But that is why I, the programmer, chose fixed size
    signed integer using checked arithmetic at that point in my program.
    Because I want that behavior.

    1a) (BigInt): After doing an operation which could overflow, use a
    conditional branch to jump to code to convert to BigInt, which
    then jumps back. Overhead is basically the branch-on-overflow
    instruction.

    That looks like a good way to implement BigInt.
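
    A minimal C sketch of case 1a), assuming GCC or Clang: __builtin_add_overflow
    is a compiler builtin rather than ISO C, and checked_add is a hypothetical
    name. The slow path is only signalled here; a real runtime would build the
    BigInt at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Case 1a): add, then branch on overflow. Returns true when *sum holds the
   exact result and false when the caller must switch to its BigInt path. */
bool checked_add(int64_t a, int64_t b, int64_t *sum)
{
    assert(sum != NULL);
    if (__builtin_add_overflow(a, b, sum)) {
        /* Rarely taken branch; *sum holds the wrapped value here. */
        return false;
    }
    return true;
}
```

    Compilers typically lower the builtin to the add itself plus a test of the
    overflow flag, which is the per-operation overhead described in 1a).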

    1b) (BigInt with traps). Hardware traps to the OS, which needs to prepare
    the required structures describing the exception (all regs and
    the address), and then call the signal handler. The signal
    handler needs to look up the address of the trap with a table
    describing what to do for this particular operation which
    overflowed. Each table entry needs to describe, in detail, what
    registers are involved (the sources and the dest), and where to
    return once the BigInt has been created. This requires massive
    changes to the compiler (and possibly linker) to prepare these
    tables. The compiler must guarantee that changing the dest
    register to a pointer to BigInt works properly (otherwise,
    special code needs to be emitted for each potentially trapping
    instruction to try to recover).

    That looks like an expensive way to implement BigInt.

    2a) (Try/Catch): After doing an operation which could overflow, use a
    conditional branch to jump to the catch block.

    That looks like a more expensive way, in code size and performance,
    than automatic by hardware to detect an event that should never occur.

    However if this overflow event might regularly occur at that point
    in your code, and you do have a way of handling it, then yes by all
    means do it programmatically as that is less expensive than a trip
    through the OS. But both detection methods get the job done.

    2b) (Try/Catch with traps). Repeat all the OS work and call the signal
    handler. Now, it just needs a table entry describing where to
    jump to to enter the catch block. Almost all the complexity of
    1b), but without needing the register details.

    That looks like a less expensive way, in code size and performance,
    than manual by software, to detect an event that should never occur.

    3a) (Clamp): After doing an operation which could overflow, use a
    conditional branch to do the MIN/MAX operations to bring it back
    within range and then jump back.

    I do not want integers that I declared as (normal) linear types to be
    changed to saturating types, thank you.

    3b) (Clamp with trap): Basically the same as 1b), but there's an alternative
    if the clamps are global (MAX_INT/MIN_INT). The exception handler
    can read the instruction which trapped, figure out the source and
    dest registers, re-do the calculation, and clamp the destination
    to MIN or MAX, and return to just after the instruction which
    trapped.

    I do not want integers that I declared as (normal) linear types to be
    changed to saturating types, thank you.

    If I declare a saturating type then overflow exceptions are an expensive
    way to implement them considering that the C code is:

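    #include <limits.h>

    long LongSatAdd (long left, long right)
    {
        /* Add as unsigned: wraparound is well defined, unlike signed overflow. */
        long result = (long)((unsigned long)left + (unsigned long)right);
        if (((result ^ left) & (result ^ right)) < 0)      // Signed overflow?
            result = (result < 0) ? LONG_MAX : LONG_MIN;   // Saturate high or low
        return result;
    }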

    Of course, a hardware instruction that does exactly this is preferred.

    4a) (Crash): Every operation that could overflow needs a conditional branch
    after it to branch to a crashing instruction (or a branch over
    an undefined instruction if there's no overflow).

    Good choice for uncorrectable errors, poor choice for handleable events.

    4b) (Crash with trap): Use operations which trap on overflow. This takes
    no new instructions and costs no performance.

    Good choice for uncorrectable errors, poor choice for handleable events.

    Basically, all a) cases are:

    op_with_might_overflow();
    if(overflow_happened) {
    handle the overflow
    }

    Trapping-on-overflow instructions are clearly useless for languages
    which care about overflow.

    This conclusion is completely wrong.
    Exceptions are an event detection and notification delivery mechanism.
    It is very efficient if those events are rarely or never supposed to occur.

    It may not be a good tool choice for something that happens frequently,
    as might happen when implementing your BigInt library,
    as it can have a large overhead. But as the programmer, that decision is
    your responsibility to make, and it is why you get the big bucks.

    To save one branch instruction, an entry is
    needed to describe how to handle the overflow, which is certainly larger
    than a branch instruction. And the code to "handle the overflow" is
    needed in any case. And this assumes some sort of instant lookup--of the
    1000 overflow instructions, we need a hash table to look up the address,
    which is more overhead.
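
    To make the table cost concrete, here is a hedged C sketch of the lookup a
    trap handler would need. The trap_entry layout and find_recovery are
    hypothetical, and a sorted table with binary search stands in for the hash
    table mentioned above. On a 64-bit target each entry is two pointers (16
    bytes), compared with a few bytes for a branch instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table entry: one per potentially trapping instruction. */
typedef struct {
    uintptr_t trap_pc;      /* address of the instruction that may trap */
    uintptr_t recovery_pc;  /* where the handler should resume */
} trap_entry;

/* bsearch comparator: key is a uintptr_t, element is a trap_entry. */
int cmp_trap_entry(const void *key, const void *elem)
{
    uintptr_t pc = *(const uintptr_t *)key;
    const trap_entry *e = elem;
    return (pc > e->trap_pc) - (pc < e->trap_pc);
}

/* Returns the recovery address for trap_pc, or 0 when the table has no
   entry for it. The table must be sorted by trap_pc. */
uintptr_t find_recovery(const trap_entry *table, size_t n, uintptr_t trap_pc)
{
    if (n == 0)
        return 0;
    assert(table != NULL);
    const trap_entry *e =
        bsearch(&trap_pc, table, n, sizeof *table, cmp_trap_entry);
    return e ? e->recovery_pc : 0;
}
```

    The compiler and linker would have to emit and sort this table, which is
    the extra toolchain work described under 1b) and 2b).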

    Trapping on overflow instructions are useful as a debug aid for
    languages which don't care about overflow--but then you're optimizing something nearly useless. It also might be helpful if global clamping
    to MIN/MAX was useful (and I don't think it is).

    Instruction sets which make detecting overflow difficult (say, RISC-V),
    would do well to make branch-on-overflow efficient and easy. But adding trap-on-overflow instructions is a waste of effort.

    No they are a very useful tool for those who need such a tool
    because the manual alternative is significantly more expensive
    for both size and performance.

    "I have one example where overflow exceptions would be a poor implementation choice" does not imply "therefore no one should have them as an option".

    And remember, you ASKED to be informed of overflow when YOU selected
    a checked data type. Now if your language or compiler of choice doesn't
    allow you to choose checked vs unchecked then your gripe is with the
    language or compiler.

    It has nothing to do with whether an ISA should include trapping arithmetic
    as it is the most efficient way to deliver this functionality to
    THOSE WHO ASK FOR IT.

    Note that using traps on data access violations which are "fixed" by
    signal handlers CAN work out. They are slow, but as long as the
    exception handler can fix the access violation and return right to the instruction which failed (without needing to know ANYTHING about that instruction in particular), this can work fine. But integer overflow
    doesn't work like that--it's generally not possible to figure out
    in the trap handler what to do without more information.

    Kent

    No, actually signed overflow, like many other exceptions, works like
    that because, like access violations, or divide by zero, or array bounds violations, or illegal instructions, or invalid float operands,
    they are never supposed to occur. And if they could occur you should
    have checked for the potential error first.

  • From MitchAlsup1@21:1/5 to EricP on Sat Sep 21 20:39:38 2024
    On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:

    Kent Dickey wrote:

    Basically, all a) cases are:

    op_with_might_overflow();
    if(overflow_happened) {
    handle the overflow
    }

    Trapping-on-overflow instructions are clearly useless for languages
    which care about overflow.

    This conclusion is completely wrong.

    In the days before <good> branch prediction having a conditional branch
    after each instruction that could have an execution problem was an
    extremely poor choice. Thus, exceptions were invented (circa 1958).

    Now with good branch prediction, having a branch after each instruction
    which could suffer an execution problem is simply a bad way to blow
    up the size of the executable.

    Exceptions are an event detection and notification delivery mechanism.

    Exceptions are a free (easy to predict) branch.

    It is very efficient if those events are rarely or never supposed to
    occur.

    Many (most, nearly all) processor architectures have notoriously
    bad exception delivery to a point of control that can deal with
    the problem at hand.

    But it is not necessary for that mechanism to be bad !!

    Some of the things that minimize the "badness" of taking an exception::

    a) deliver control to user signal handler without taking an
    excursion through GuestOS. (think 10 cycles)
    b) when control arrives, receiving thread is already reentrant.
    c) when control arrives, the instruction (bits) and its operand
    values are delivered to the exception handler. So, the exception
    handler has what it needs to deal with the problem at hand.
    d) when control returns, the result (R0) is delivered back to the
    destination register.
    e) (b, c, d) are performed without handler needing to understand
    how. Handler is just a subroutine that receives arguments (c)
    fixes the problem, and returns a non-excepting value, or abort.
    f) return has a way to re-execute the instruction or to skip the
    instruction under control of handler without having access
    to excepting-IP and without knowing the length of the
    instruction.
    g) during (a..f) nobody ever has to disable interrupts or
    exceptions or re-enable them later. Priority and privilege
    are inherited automatically from excepting thread.
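
    As a rough illustration of (c) through (e), a handler under such a scheme
    could look like an ordinary C function. Everything here is invented for the
    sketch (the exc_args layout, the field names, the function name); it is not
    the calling convention of any real ISA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument block for property (c): what the hardware hands
   to the handler. */
typedef struct {
    uint32_t inst_bits;  /* raw bits of the excepting instruction */
    int64_t  src1;       /* first operand value */
    int64_t  src2;       /* second operand value */
} exc_args;

/* Properties (d) and (e): the handler is a plain subroutine whose return
   value is delivered to the destination register. This one chooses to
   saturate a signed add that overflowed. */
int64_t on_add_overflow(const exc_args *args)
{
    assert(args != NULL);
    /* Operands of an overflowing add share a sign, so src1 picks the rail. */
    return (args->src1 < 0) ? INT64_MIN : INT64_MAX;
}
```

    The handler never sees the excepting IP or the instruction length, in
    keeping with (f).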


    I know of only 1 ISA with these properties....

  • From John Levine@21:1/5 to All on Sat Sep 21 22:14:12 2024
    According to MitchAlsup1 <[email protected]>:
    In the days before <good> branch prediction having a conditional branch
    after each instruction that could have an execution problem was an
    extremely poor choice. Thus, exceptions were invented (circa 1958).

    Oh, it was worse than that. There were instructions like "Divide or
    Halt" which stopped the computer with an error light on a zero divide.

    Many (most, nearly all) processor architectures have notoriously
    bad exception delivery to a point of control that can deal with
    the problem at hand.

    Some of us remember imprecise interrupts and the OS/360 S0C0
    completion code.

    But you are in general right, it makes more sense to keep the computer
    running in the normal case and provide slow ways to recover from
    failures and do something else.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to John Levine on Sun Sep 22 01:41:47 2024
    On Sat, 21 Sep 2024 22:14:12 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to MitchAlsup1 <[email protected]>:
    In the days before <good> branch prediction having a conditional
    branch after each instruction that could have an execution problem
    was an extremely poor choice. Thus, exceptions were invented (circa
    1958).

    Oh, it was worse than that. There were instructions like "Divide or
    Halt" which stopped the computer with an error light on a zero divide.

    Many (most, nearly all) processor architectures have notoriously
    bad exception delivery to a point of control that can deal with
    the problem at hand.

    Some of us remember imprecise interrupts and the OS/360 S0C0
    completion code.

    But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
    failures and do something else.


    Where is Nick to tell you that any attempt of recovery is a Bad Idea.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 23:29:01 2024
    On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

    In the days before <good> branch prediction having a conditional branch
    after each instruction that could have an execution problem was an
    extremely poor choice. Thus, exceptions were invented (circa 1958).

    So all that does is push the conditional branch into the microcode. And
    make the instruction more complicated. Why should that be faster?

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sat Sep 21 23:29:40 2024
    On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

    But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
    failures and do something else.

    Aren’t branches that are not taken supposed to be fast?

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun Sep 22 01:24:05 2024
    On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

    On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
    On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

    But you are in general right, it makes more sense to keep the computer
    running in the normal case and provide slow ways to recover from
    failures and do something else.

    Aren’t branches that are not taken supposed to be fast?

    Well, they are not taken, so they should be faster... ;^)

    It is NOT the speed, it is the code bloat.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 01:23:35 2024
    On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:

    On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

    In the days before <good> branch prediction having a conditional branch
    after each instruction that could have an execution problem was an
    extremely poor choice. Thus, exceptions were invented (circa 1958).

    So all that does is push the conditional branch into the microcode. And
    make the instruction more complicated. Why should that be faster?

    It pushes the branch into the mispredict-recovery path and does not
    occupy any code space.

    There is no microcode outside of Z-system these days.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 22 02:09:29 2024
    On Sun, 22 Sep 2024 01:24:05 +0000, MitchAlsup1 wrote:

    On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

    On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:

    On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

    But you are in general right, it makes more sense to keep the
    computer running in the normal case and provide slow ways to recover
    from failures and do something else.

    Aren’t branches that are not taken supposed to be fast?

    Well, they are not taken, so they should be faster... ;^)

    It is NOT the speed, it is the code bloat.

    That’s an argument against RISC though, isn’t it?

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 22 02:10:40 2024
    On Sun, 22 Sep 2024 01:23:35 +0000, MitchAlsup1 wrote:

    On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:

    On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

    In the days before <good> branch prediction having a conditional
    branch after each instruction that could have an execution problem was
    an extremely poor choice. Thus, exceptions were invented (circa 1958).

    So all that does is push the conditional branch into the microcode. And
    make the instruction more complicated. Why should that be faster?

    It pushes the branch into the mispredict-recovery path and does not
    occupy any code space.

    There is no microcode outside of Z-system these days.

    It occupies some space, either microcode or circuit logic, or both.

    And why should that be faster?

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 02:26:33 2024
    On Sun, 22 Sep 2024 2:10:40 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 22 Sep 2024 01:23:35 +0000, MitchAlsup1 wrote:

    On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:

    On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

    In the days before <good> branch prediction having a conditional
    branch after each instruction that could have an execution problem was
    an extremely poor choice. Thus, exceptions were invented (circa 1958).

    So all that does is push the conditional branch into the microcode. And
    make the instruction more complicated. Why should that be faster?

    It pushes the branch into the mispredict-recovery path and does not
    occupy any code space.

    There is no microcode outside of Z-system these days.

    It occupies some space, either microcode or circuit logic, or both.

    It has sequencers, but none of them are in ROM or PLA form.

    And why should that be faster?

    It is faster if for no other reason than that it did not fetch the branch
    that is always predicted non-taken. ICache and Fetch argument. It is
    of lower power because it did not fetch, decode, or execute the branch.

    If every calculation instruction had to be followed by a conditional
    branch, then the code would be 150% its original size (or worse).

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 02:23:12 2024
    On Sun, 22 Sep 2024 2:09:29 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 22 Sep 2024 01:24:05 +0000, MitchAlsup1 wrote:

    On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

    On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:

    On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

    But you are in general right, it makes more sense to keep the
    computer running in the normal case and provide slow ways to recover
    from failures and do something else.

    Aren’t branches that are not taken supposed to be fast?

    Well, they are not taken, so they should be faster... ;^)

    It is NOT the speed, it is the code bloat.

    That’s an argument against RISC though, isn’t it?

    Yes, and that is why my RISC ISA has VAX instruction count
    while having the pipelineability of MIPS.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 22 07:11:53 2024
    On Sun, 22 Sep 2024 02:26:33 +0000, MitchAlsup1 wrote:

    It is faster if for no other reason that it did not fetch the branch
    that is always predicted non-taken.

    But architectures like POWER were able to do that sort of thing in zero effective cycles, decades ago.

    If every calculation instruction had to be followed by a conditional
    branch, then the code would be 150% its original size (or worse).

    Not every one. There are ways to do the checks only at crucial points,
    after sequences of instructions. This is how IEEE754 “exceptions” work,
    for example.
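
    The same "check at crucial points" idea can be sketched for integers in C.
    This is only an analogy to the IEEE 754 sticky flags, assuming the GCC/Clang
    overflow builtins; dot_checked and its sticky variable are hypothetical
    names, and whether the compiler keeps the loop body branch-free is up to
    the compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sums a[i]*b[i]. Overflow is accumulated in a sticky flag and tested once
   after the loop. Returns true and stores the sum on success. */
bool dot_checked(const int32_t *a, const int32_t *b, size_t n, int32_t *out)
{
    assert(out != NULL);
    bool sticky = false;   /* plays the role of a sticky status bit */
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t prod;
        sticky |= __builtin_mul_overflow(a[i], b[i], &prod);
        sticky |= __builtin_add_overflow(acc, prod, &acc);
    }
    if (sticky)
        return false;      /* single check after the whole sequence */
    *out = acc;
    return true;
}
```

    The cost is one flag update per operation and one conditional branch per
    sequence, rather than one branch per operation.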

  • From Anton Ertl@21:1/5 to [email protected] on Sun Sep 22 09:25:30 2024
    [email protected] (MitchAlsup1) writes:
    There is no microcode outside of Z-system these days.

    Every AMD64 processor has microcode. E.g., on an Alder Lake system
    "perf list" lists the following events that have "microcode" in their description:

    machine_clears.fp_assist
    [Counts the number of floating point operations retired that required
    microcode assist. Unit: cpu_atom]
    assists.fp
    [Counts all microcode FP assists. Unit: cpu_core]
    machine_clears.slow
    [Counts the number of machine clears that flush the pipeline and
    restart the machine with the use of microcode due to SMC,
    MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and
    FPC_VIRTUAL_TRAP. Unit: cpu_atom]
    topdown_be_bound.serialization
    [Counts the number of issue slots every cycle that were not consumed by
    the backend due to scoreboards from the instruction queue (IQ), jump
    execution unit (JEU), or microcode sequencer (MS). Unit: cpu_atom]
    topdown_fe_bound.cisc
    [Counts the number of issue slots every cycle that were not delivered
    by the frontend due to the microcode sequencer (MS). Unit: cpu_atom]
    assists.any
    [Number of occurrences where a microcode assist is invoked by hardware.
    Unit: cpu_core]
    tma_microcode_sequencer
    [This metric represents fraction of slots the CPU was retiring uops
    fetched by the Microcode Sequencer (MS) unit. Unit: cpu_core]
    IpAssist
    [Instructions per a microcode Assist invocation. See Assists tree node
    for details (lower number means higher occurrence rate). Unit:
    cpu_core]
    tma_heavy_operations
    [This metric represents fraction of slots where the CPU was retiring
    heavy-weight operations -- instructions that require two or more uops
    or microcoded sequences. Unit: cpu_core]
    tma_cisc
    [Counts the number of issue slots that were not delivered by the
    frontend due to the microcode sequencer (MS). Unit: cpu_atom]
    tma_serialization
    [Counts the number of issue slots that were not consumed by the backend
    due to scoreboards from the instruction queue (IQ), jump execution
    unit (JEU), or microcode sequencer (MS). Unit: cpu_atom]
    tma_microcode_sequencer_group:
    tma_assists
    [This metric estimates fraction of slots the CPU retired uops delivered
    by the Microcode_Sequencer as a result of Assists. Unit: cpu_core]

    tma_microcode_sequencer, IpAssist, tma_heavy_operations, and tma_cisc
    occur several times in different sections.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Dallman@21:1/5 to [email protected] on Sun Sep 22 12:30:00 2024
    In article <[email protected]>, [email protected] (MitchAlsup1) wrote:
    On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:
    On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
    Aren't branches that are not taken supposed to be fast?
    Well, they are not taken, so they should be faster... ;^)
    It is NOT the speed, it is the code bloat.

    Yup. Bigger code is always a potential problem, not so much because it
    takes up RAM nowadays, but because it takes up memory bandwidth and cache space. Using up cache space is always bad, because bigger caches are
    slower, and instructions seem naturally smaller than cache blocks.

    Wanting smaller code isn't an argument against RISC, but an argument
    against poorly optimised ISA design. Variable-length CISC makes it easier
    to get smaller average instruction sizes but has other drawbacks.

    For the stuff I work on, ARM64 code is consistently smaller than x86-64,
    although the factor varies by platform.

    John

  • From Terje Mathisen@21:1/5 to John Levine on Sun Sep 22 13:45:28 2024
    John Levine wrote:
    According to MitchAlsup1 <[email protected]>:
    In the days before <good> branch prediction having a conditional branch
    after each instruction that could have an execution problem was an
    extremely poor choice. Thus, exceptions were invented (circa 1958).

    Before (good) branch predictors and exceptions, we did have the ability
    to fall through any forward branches, and assume backwards branches were
    taken, right?

    With that approach, compilers were free to place all recovery code after
    the function itself, in which case something like

    add ax,bx
    jo overflow_detected
    ;; ax is now good

    would work quite well, i.e. costing "only" 8 cycles for the load of the
    two JO instruction bytes.


    Oh, it was worse than that. There were instructions like "Divide or
    Halt" which stopped the computer with an error light on a zero divide.

    Sort of like "Halt_And_Catch_Fire"?

    Many (most, nearly all) processor architectures have notoriously
    bad exception delivery to a point of control that can deal with
    the problem at hand.

    Some of us remember imprecise interrupts and the OS/360 S0C0
    completion code.

    But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
    failures and do something else.

    This was the idea behind the single-byte INTO (Interrupt on Overflow) opcode:

    add ax,bx
    into

    would cost just 4 clock cycles vs the 8 needed for the forward exception handler.

    OTOH, you did need to reload the INTO vector (same location as
    INT 4) every time you needed a possibly new/different handler.

    I never saw any compiler using it and I never used it in my own asm code.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lars Poulsen@21:1/5 to John Levine on Sun Sep 22 09:14:04 2024
    On 9/21/2024 3:14 PM, John Levine wrote:
    According to MitchAlsup1 <[email protected]>:
    In the days before <good> branch prediction having a conditional branch
    after each instruction that could have an execution problem was an
    extremely poor choice. Thus, exceptions were invented (circa 1958).

    Oh, it was worse than that. There were instructions like "Divide or
    Halt" which stopped the computer with an error light on a zero divide.

    Many (most, nearly all) processor architectures have notoriously
    bad exception delivery to a point of control that can deal with
    the problem at hand.

    Some of us remember imprecise interrupts and the OS/360 S0C0
    completion code.

    But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
    failures and do something else.

    From a programmer's perspective, VAX exception handling was very nice.
    It may have been high overhead, though.

  • From Thomas Koenig@21:1/5 to Kent Dickey on Sun Sep 22 16:59:24 2024
    Kent Dickey <[email protected]> schrieb:

    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values.

    I disagree.

    Look at the sanitizer libraries, which insert runtime checks for
    integer overflow - having less overhead for these would definitely
    be a plus.

    See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
    or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

  • From MitchAlsup1@21:1/5 to John Dallman on Sun Sep 22 18:40:31 2024
    On Sun, 22 Sep 2024 11:30:00 +0000, John Dallman wrote:

    In article <[email protected]>, [email protected] (MitchAlsup1) wrote:
    On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:
    On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
    Aren't branches that are not taken supposed to be fast?
    Well, they are not taken, so they should be faster... ;^)
    It is NOT the speed, it is the code bloat.

    Yup. Bigger code is always a potential problem, not so much because it
    takes up RAM nowadays, but because it takes up memory bandwidth and cache
    space. Using up cache space is always bad, because bigger caches are
    slower, and instructions seem naturally smaller than cache blocks.

    Wanting smaller code isn't an argument against RISC, but an argument
    against poorly optimised ISA design. Variable-length CISC makes it
    easier to get smaller average instruction sizes but has other drawbacks.

    Variable length RISC makes it easier, too.

    For the stuff I work on, ARM64 code is consistently smaller than x86-64,
    although the factor varies by platform.

    John

  • From Stefan Monnier@21:1/5 to All on Mon Sep 23 17:51:26 2024
    Let's look at how you might want to handle overflows when they happen:

    1) Your language supports seamlessly transitioning to BigInts on
    overflow. Then each operation that could overflow needs to call
    a special bit of code to change to BigInt and then continue the
    calculation. This code must exist, even if a trapping
    instruction doesn't need an explicit branch to it. Some
    mechanism is needed to call this code.

    IOW, this is the case where the program thinks it's manipulating some "mathematical number ∈ Z" or something like it.

    2) You need to call an exception handler, and the routine with the overflow
    is ended. We need to know which exception handler to call.

    Not sure when that would be useful, other than for low-level coding.

    3) You want to clamp the value to a reasonable range and continue. The
    reasonable values need to be looked up somewhere.

    Here overflow detection can be useful only to the extent that the
    wrap-around may hide the range-error (or confuse its nature between over<->under).

    I'm not familiar with code needing this, but I heard such needs are
    common in some fields.

    4) You just want to crash the program. If a debugger is attached, it can
    say where the overflow occurred.

    Beside debugging, this corresponds to the case where the code thinks
    it's manipulating some "mathematical number ∈ Z" or something like it
    yet we have a good reason to think that this number will "never"
    overflow, so we never want to switch to bigint because an overflow means
    our "good reason" was crap and we're in trouble anyway.

    IME, this shows up typically for integers related to the size of some data-structure, where the limited address space "guarantees" that we'll
    never go beyond some limit. They're *very* common (think array indices
    and things like that) and using overflow-trapping operations for them
    would make sense. At the same time, they *should* never overflow, and
    if some error somewhere means one of them can overflow (e.g. mistakenly
    using (u)int32 for array indices in a 64bit system), there's a good
    chance that the error will cause other trouble which may prevent us from
    even reaching the overflowing operation.
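
    To make the cases above concrete, here is a minimal C sketch. It assumes
    GCC or Clang, since it relies on the __builtin_add_overflow builtin; the
    function names are made up for illustration. add_checked only reports the
    overflow, leaving the caller to switch to a bigint or raise an exception
    (cases 1 and 2); add_clamped saturates (case 3); index_add_or_die treats
    overflow as a broken invariant and stops (case 4).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Cases 1 and 2: report overflow and let the caller pick the slow path
 * (promote to a bigint, raise a language-level exception, ...). */
bool add_checked(int32_t a, int32_t b, int32_t *out)
{
    /* The builtin returns true when the mathematical sum does not fit. */
    return !__builtin_add_overflow(a, b, out);
}

/* Case 3: clamp to the representable range instead of wrapping. */
int32_t add_clamped(int32_t a, int32_t b)
{
    int32_t sum;
    if (__builtin_add_overflow(a, b, &sum)) {
        /* A signed add only overflows when both operands share a sign,
         * so the sign of either operand gives the direction. */
        return (a < 0) ? INT32_MIN : INT32_MAX;
    }
    return sum;
}

/* Case 4: an index that "cannot" overflow; if it does, just stop. */
size_t index_add_or_die(size_t index, size_t step)
{
    size_t next;
    if (__builtin_add_overflow(index, step, &next))
        abort();
    return next;
}
```

    In all three the common path is one add plus a branch that is never
    taken; only what sits at the end of that branch differs.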


    Stefan

  • From Kent Dickey@21:1/5 to [email protected] on Mon Sep 23 21:57:08 2024
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly
    predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.
    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.

    [ More details about architectures without trapping overflow instructions ]
    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values. If you take Rust's approach, and only
    detect overflow in debug builds, then you already don't care about
    performance.

    Those automatic software correctness checks, of which signed integer
    overflow detection is one of many, went away because most code was
    being written in C/C++ and those two languages don't require them.

    That just makes it more expensive in code size and performance to effect
    such checks. This overhead leads some to conclude it justifies eliminating
    the error checks.

    Eliminating the error event detectors doesn't make errors go away,
    just your knowledge of them.

    I gather portions of 16-bit Windows 3.1 were written in Pascal.
    When Microsoft developed 32-bit WinNT, if instead of C they had
    switched their official development language from Pascal to Modula-2
    which does require signed and unsigned, checked and modulo arithmetic,
    and array bounds checks, the world would have been a much safer place.

    But they didn't so it isn't.

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow instruction
    (or a mode for existing ALU instructions) is useless for anything OTHER
    than as a debug aid where you crash the program on overflow (you can
    have a general exception handler to shut down gracefully, but "patching
    things up and continuing" doesn't work). I gave details of reasons folks
    might want to try to use trap-on-overflow instructions, and showed how the
    other cases don't make sense.

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother
    having trap-on-overflow instructions.

    You then went on to discuss how you want trap-on-overflow instructions
    for stuff like C code, so you can detect code bugs and shut down gracefully.

    And my response to that is we already know compilers don't use it. x86
    has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
        return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument to detect signed overflows and crash).

    For x86-64 clang 19.1.0:

    add2:
            add     edi, esi
            jo      .LBB0_1
            mov     eax, edi
            ret
    .LBB0_1:
            ud1     eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.
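
    For reference, roughly the same thing written by hand in C. This is a
    sketch that assumes GCC or Clang (both provide __builtin_add_overflow and
    __builtin_trap); the name add2_trapping is made up for illustration.

```c
/* Hand-written equivalent of an add followed by branch-on-overflow to a
 * trapping instruction. Assumes GCC or Clang builtins. */
int add2_trapping(int num, int other)
{
    int sum;
    if (__builtin_add_overflow(num, other, &sum)) {
        /* Does not return; the compiler emits an instruction that faults. */
        __builtin_trap();
    }
    return sum;
}
```

    With optimization on, I would expect this to compile to much the same
    add / jo / trap shape as above, though I have not checked every compiler
    version.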

    But x86 has an instruction to trap on overflow directly: INTO. It's one byte. And it doesn't use it.

    GCC x86-64 14.2 is even worse:

    add2:
            sub     rsp, 8
            call    __addvsi3
            add     rsp, 8
            ret

    It calls a routine to do all additions which might overflow, and that
    routine calls abort() if an overflow occurs.
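
    A sketch of what such an out-of-line helper amounts to. This is not the
    libgcc source; checked_add_helper is a hypothetical name, and the sketch
    assumes int is narrower than 64 bits so the sum can be formed exactly in
    an int64_t.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a runtime helper that every possibly
 * overflowing add is routed through. */
int checked_add_helper(int a, int b)
{
    /* Form the exact sum in a wider type (assumes int is under 64 bits). */
    int64_t wide = (int64_t)a + (int64_t)b;
    if (wide < INT_MIN || wide > INT_MAX)
        abort();        /* overflow: terminate the program */
    return (int)wide;
}
```

    The check itself is cheap; the cost is the call and return around every
    add, plus the lost chance to keep values in registers across it.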

    The CPU has a trap-on-overflow instruction exactly for this case (to crash
    on detecting an overflow), and compilers don't even use it.

    So even on architectures which have a trap-on-overflow instruction,
    compilers don't use it.

    So why should any hardware include an instruction to trap-on-overflow?

    Trap-on-overflow instructions have a hardware cost, of varying severity.
    If the ISA isn't already trapping on ALU instructions (such as
    divide-by-0), it adds a new class of operations which can take
    exceptions. An ALU functional unit that cannot take exceptions doesn't
    have to save "unwinding" info (at minimum, info to recover the PC, and
    possibly rollback state), and not needing this can be a nice
    simplification. Branches and LD/ST always need this info, but not
    needing it on ALU ops can be a nice simplification of logic, and makes
    it easier to have multiple ALU functional units. Note that x86 INTO can
    be treated as a branch, so it doesn't have the cost of an instruction
    like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
    traps if it overflows. ADDTO is particularly what I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take exceptions.

    Instruction sets which make detecting overflow difficult (say, RISC-V),
    would do well to make branch-on-overflow efficient and easy. But adding
    trap-on-overflow instructions is a waste of effort.
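
    To show what "difficult" means on a machine with no flags register, here
    is a C sketch of the compare-based check such an ISA ends up with: one
    add, two compares and a branch. The function name is made up, and the
    conversion back to int32_t assumes the modular behaviour that GCC and
    Clang document for out-of-range conversions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed-add overflow detection without a flags register.
 * Stores the wrapped sum in *sum_out and returns true on overflow. */
bool add_overflows_flagless(int32_t a, int32_t b, int32_t *sum_out)
{
    /* Do the wrapping add in unsigned arithmetic, which C defines. */
    uint32_t usum = (uint32_t)a + (uint32_t)b;
    /* Assumption: out-of-range conversion wraps (true for GCC and Clang). */
    int32_t sum = (int32_t)usum;
    *sum_out = sum;
    /* Adding a negative b must make the sum smaller than a, and adding a
     * non-negative b must not. Disagreement means the add overflowed. */
    return (b < 0) != (sum < a);
}
```

    On a flagless ISA each of those compares is its own instruction, which is
    the overhead a cheap branch-on-overflow would remove.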

    No they are a very useful tool for those who need such a tool
    because the manual alternative is significantly more expensive
    for both size and performance.

    "I have one example where overflow exceptions would be a poor implementation
    choice" does not imply "therefore no one should have them as an option".

    Can you share what language, compiler, and hardware you are using which implements overflow checks using a trap-on-overflow instruction?

    Kent

  • From Stefan Monnier@21:1/5 to All on Mon Sep 23 18:00:36 2024
    Some of the things that minimize the "badness" of taking an exception::

    a) deliver control to user signal handler without taking an
    excursion through GuestOS. (think 10 cycles)
    b) when control arrives, receiving thread is already reentrant.
    c) when control arrives, the instruction (bits) and its operand
    values are delivered to the exception handler. So, the exception
    handler has what it needs to deal with the problem at hand.
    d) when control returns, the result (R0) is delivered back to the
    destination register.
    e) (b, c, d) are performed without handler needing to understand
    how. Handler is just a subroutine that receives arguments (c)
    fixes the problem, and returns a non-excepting value, or abort.
    f) return has a way to re-execute the instruction or to skip the
    instruction under control of handler without having access
    to excepting-IP and without knowing the length of the
    instruction.
    g) during (a..f) nobody ever has to disable interrupts or
    exceptions or re-enable them later. Priority and privilege
    are inherited automatically from excepting thread.

    Note that in the case where you want the overflow exception to jump to
    some alternate code path (a language-level exception handler, or a code
    path that continues with a bigint instead of a register-sized integer),
    (d) is useless because you don't want to return to the overflowing
    instruction (nor to the immediately following instruction). Instead you
    usually want to look up a side table indexed with the address of the
    overflowing instruction to find the "exception handler" to "return" to.
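
    A minimal sketch of that side-table lookup in C. The struct layout and
    names are hypothetical; a real runtime would take the table from its
    unwind or exception metadata. It assumes the table is sorted by the
    address of the faulting instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical side-table entry: where the overflow happened, and where
 * execution should resume (the "exception handler" for that site). */
struct overflow_landing {
    uintptr_t fault_pc;
    uintptr_t handler_pc;
};

/* Binary search over a table sorted by fault_pc.
 * Returns NULL when the faulting address has no registered handler. */
const struct overflow_landing *
find_landing(const struct overflow_landing *table, size_t count, uintptr_t pc)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].fault_pc == pc)
            return &table[mid];
        if (table[mid].fault_pc < pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

    The trap handler would call this with the excepting address and resume at
    handler_pc, so nothing is ever written back to the destination register.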

    (a) (b) and (c) are still very welcome, of course.


    Stefan

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Sep 23 22:23:51 2024
    On Sun, 22 Sep 2024 9:25:30 +0000, Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    There is no microcode outside of Z-system these days.

    Every AMD64 processor has microcode.

    Yes, yes, my brain was not working...........sigh

  • From MitchAlsup1@21:1/5 to Kent Dickey on Mon Sep 23 22:17:02 2024
    On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow instruction
    (or a mode for existing ALU instructions) is useless for anything OTHER
    than as a debug aid where you crash the program on overflow (you can
    have a general exception handler to shut down gracefully, but "patching things
    up and continuing" doesn't work). I gave details of reasons folks might
    want to try to use trap-on-overflow instructions, and show how the
    other cases don't make sense.

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.
    <snip>
    So why should any hardware include an instruction to trap-on-overflow?

    Trap-on-overflow instructions have a hardware cost, of varying severity.
    If the ISA isn't already trapping on ALU instructions (such as
    divide-by-0), it adds a new class of operations which can take
    exceptions. An ALU functional unit that cannot take exceptions doesn't
    have to save "unwinding" info (at minimum, info to recover the PC, and
    possibly rollback state), and not needing this can be a nice
    simplification. Branches and LD/ST always need this info, but not
    needing it on ALU ops can be a nice simplification of logic, and makes
    it easier to have multiple ALU functional units. Note that x86 INTO can
    be treated as a branch, so it doesn't have the cost of an instruction
    like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
    traps if it overflows. ADDTO is particularly what I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take
    exceptions.

    You argue that trap-on-overflow as an instruction is unnecessary
    AND
    You argue that overflow detection is worthwhile
    AND
    You argue that ALU should not raise overflow exceptions

    I am at a loss for how to take all 3 arguments together at the
    same time !?! Can you explain ??

    Kent

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Mon Sep 23 22:22:17 2024
    On Mon, 23 Sep 2024 22:00:36 +0000, Stefan Monnier wrote:

    Some of the things that minimize the "badness" of taking an exception::

    a) deliver control to user signal handler without taking an
    excursion through GuestOS. (think 10 cycles)
    b) when control arrives, receiving thread is already reentrant.
    c) when control arrives, the instruction (bits) and its operand
    values are delivered to the exception handler. So, the exception
    handler has what it needs to deal with the problem at hand.
    d) when control returns, the result (R0) is delivered back to the
    destination register.
    e) (b, c, d) are performed without handler needing to understand
    how. Handler is just a subroutine that receives arguments (c)
    fixes the problem, and returns a non-excepting value, or abort.
    f) return has a way to re-execute the instruction or to skip the
    instruction under control of handler without having access
    to excepting-IP and without knowing the length of the
    instruction.
    g) during (a..f) nobody ever has to disable interrupts or
    exceptions or re-enable them later. Priority and privilege
    are inherited automatically from excepting thread.

    Note that in the case where you want the overflow exception to jump to
    some alternate code path (a language-level exception handler, or a code
    path that continues with a bigint instead of a register-sized integer),
    (d) is useless because you don't want to return to the overflowing
    instruction (nor to the immediately following instruction). Instead you
    usually want to look up a side table indexed with the address of the
    overflowing instruction to find the "exception handler" to "return" to.

    longjump() returns in such a way that integer ADD code path is never
    executed again.

    (a) (b) and (c) are still very welcome, of course.

    (d) is for the case where the exception handler fixes the problem
    and calculates the desired result and skips the instruction on
    return (completion) whereas your typical page fault is retried
    upon return.
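
    As a sketch of (c) through (e) seen from the handler's side: the handler
    is an ordinary subroutine that is handed the operation and its operand
    values and returns the value to be written to the destination. Everything
    here (the enum, the function name, the choice to saturate) is
    hypothetical and only illustrates the shape of the interface.

```c
#include <stdint.h>

/* Hypothetical encoding of the excepting operation, as in item (c). */
enum alu_op { ALU_ADD, ALU_SUB };

/* Hypothetical fix-up handler, only entered after a signed overflow.
 * Its return value is what lands in the destination register (item d). */
int64_t saturating_overflow_handler(enum alu_op op, int64_t lhs, int64_t rhs)
{
    (void)lhs;  /* the sign of rhs alone gives the direction of overflow */
    switch (op) {
    case ALU_ADD:
        /* Overflowing add: both operands share a sign. */
        return (rhs < 0) ? INT64_MIN : INT64_MAX;
    case ALU_SUB:
        /* Overflowing subtract: operands have opposite signs. */
        return (rhs < 0) ? INT64_MAX : INT64_MIN;
    }
    return 0;   /* not reached for the operations listed above */
}
```

    The handler never sees the excepting IP or the instruction length; it
    only computes a replacement result.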


    Stefan

  • From Lawrence D'Oliveiro@21:1/5 to Lars Poulsen on Tue Sep 24 00:41:39 2024
    On Sun, 22 Sep 2024 09:14:04 -0700, Lars Poulsen wrote:

    From a programmer's perspective, VAX exception handling was very nice.
    It may have been high overhead, though.

    Very high overhead. But it was also language-independent, and integrated
    into the procedure-calling convention, which also managed to be
    language-independent.

    There is an internal memo on Bitsavers somewhere, critiquing a proposal to
    adopt the MIPS architecture (which DEC did, for just one machine, the
    DECstation 3000 if I recall rightly), and one of the points against MIPS
    was that it didn’t have language-independent exception handling. But then
    no other architecture, before the VAX or since, has been able to do that.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Sep 24 00:37:25 2024
    On Mon, 23 Sep 2024 22:17:02 +0000, MitchAlsup1 wrote:

    You argue that trap-on-overflow as an instruction is unnecessary
    AND
    You argue that overflow detection is worthwhile
    AND
    You argue that ALU should not raise overflow exceptions

    I am at a loss for how to take all 3 arguments together at the same time
    !?! Can you explain ??

    The answer is pretty obvious: explicit instruction to branch on overflow detection.

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue Sep 24 00:43:40 2024
    On Sun, 22 Sep 2024 13:45:28 +0200, Terje Mathisen wrote:

    There were instructions like "Divide or
    Halt" which stopped the computer with an error light on a zero divide.

    Sort of like "Halt_And_Catch_Fire"?

    Imagine if it was a design feature that, if the error light came on too
    much, it would overheat and set the machine on fire? ;)

    Would that encourage programmers to have fewer bugs in their
    programs ... ?

    “The explosions will continue until morale improves.”

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Sep 24 03:02:20 2024
    On Tue, 24 Sep 2024 2:52:09 +0000, Chris M. Thomasson wrote:

    On 9/23/2024 5:43 PM, Lawrence D'Oliveiro wrote:

    Imagine if it was a design feature that, if the error light came on too
    much, it would overheat and set the machine on fire? ;)

    Would that encourage programmers to have fewer bugs in their
    programs ... ?

    In the first computer controlled aircraft, the programmers were
    told that 10% of them would be on the first flight.

    Engine start, takeoff, and flying went perfectly, then as the plane
    neared the ground it rolled 90º just 100 feet from the ground. The
    pilot flicked off the computer, saved the plane, and landed.

    Later investigation showed that the ailerons had their positions
    initialized while the pilot was doing his flight control maneuvers.

    Many of the programmers came out of the plane sickened to their
    stomachs.

  • From Terje Mathisen@21:1/5 to Kent Dickey on Tue Sep 24 07:58:24 2024
    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly
    predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.
    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.

    [ More details about architectures without trapping overflow instructions ]
    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values. If you take Rust's approach, and only
    detect overflow in debug builds, then you already don't care about
    performance.

    Those automatic software correctness checks, of which signed integer
    overflow detection is one of many, went away because most code was
    being written in C/C++ and those two languages don't require them.

    That just makes it more expensive in code size and performance to effect
    such checks. This overhead leads some to conclude it justifies eliminating
    the error checks.

    Eliminating the error event detectors doesn't make errors go away,
    just your knowledge of them.

    I gather portions of 16-bit Windows 3.1 were written in Pascal.
    When Microsoft developed 32-bit WinNT, if instead of C it they had
    switched their official development language from Pascal to Modula-2
    which does require signed and unsigned, checked and modulo arithmetic,
    and array bounds checks, the world would have been a much safer place.

    But they didn't so it isn't.

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow instruction
    (or a mode for existing ALU instructions) is useless for anything OTHER
    than as a debug aid where you crash the program on overflow (you can
    have a general exception handler to shut down gracefully, but "patching things
    up and continuing" doesn't work). I gave details of reasons folks might
    want to try to use trap-on-overflow instructions, and show how the
    other cases don't make sense.

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.

    You then went on to discuss how you want trap-on-overflow instructions
    for stuff like C code, so you can detect code bugs and shut down gracefully.

    And my response to that is we already know compilers don't use it. x86
    has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
        return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument to detect signed overflows and crash).

    For x86-64 clang 19.1.0:

    add2:
            add     edi, esi
            jo      .LBB0_1
            mov     eax, edi
            ret
    .LBB0_1:
            ud1     eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    But x86 has an instruction to trap on overflow directly: INTO. It's one byte.
    And it doesn't use it.

    GCC x86-64 14.2 is even worse:

    add2:
            sub     rsp, 8
            call    __addvsi3
            add     rsp, 8
            ret

    It calls a routine to do all additions which might overflow, and that
    routine calls abort() if an overflow occurs.

    The CPU has a trap-on-overflow instruction exactly for this case (to crash
    on detecting an overflow), and compilers don't even use it.

    So even on architectures which have a trap-on-overflow instruction,
    compilers don't use it.

    You can only compile in INTO opcodes if you can guarantee that the INT 4
    (INTO) trap vector will always be set to a proper handler, and since
    that isn't part of the ABI, compilers can't depend on it?

    I do agree that it would be nice if it did work, barring that clang is
    doing the best possible alternative, at close to zero cost except for
    the useless branch predictor table entry wastage.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to All on Tue Sep 24 08:02:23 2024
    MitchAlsup1 wrote:
    On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

    In article <O2DHO.184073$[email protected]>,
    EricP  <[email protected]> wrote:

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow instruction
    (or a mode for existing ALU instructions) is useless for anything OTHER
    than as a debug aid where you crash the program on overflow (you can
    have a general exception handler to shut down gracefully, but "patching
    things up and continuing" doesn't work). I gave details of reasons folks
    might want to try to use trap-on-overflow instructions, and show how the
    other cases don't make sense.

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else.  Just that CPUs should not bother
    having trap-on-overflow instructions.
    <snip>
    So why should any hardware include an instruction to trap-on-overflow?

    Trap-on-overflow instructions have a hardware cost, of varying severity.
    If the ISA isn't already trapping on ALU instructions (such as
    divide-by-0), it adds a new class of operations which can take
    exceptions. An ALU functional unit that cannot take exceptions doesn't
    have to save "unwinding" info (at minimum, info to recover the PC, and
    possibly rollback state), and not needing this can be a nice
    simplification. Branches and LD/ST always need this info, but not
    needing it on ALU ops can be a nice simplification of logic, and makes
    it easier to have multiple ALU functional units. Note that x86 INTO can
    be treated as a branch, so it doesn't have the cost of an instruction
    like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
    traps if it overflows. ADDTO is particularly what I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take
    exceptions.

    You argue that trap-on-overflow as an instruction is unnecessary
    AND
    You argue that overflow detection is worthwhile
    AND
    You argue that ALU should not raise overflow exceptions

    I am at a loss for how to take all 3 arguments together at the
    same time !?! Can you explain ??

    Maybe all add/sub/etc opcodes that are immediately followed by an INTO
    could be fused into a single ADDO/SUBO/etc version that takes zero extra
    cycles as long as the trap part isn't hit?

    Personally I'm happy with the clang approach.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Tue Sep 24 11:00:20 2024
    On Tue, 24 Sep 2024 00:41:39 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 22 Sep 2024 09:14:04 -0700, Lars Poulsen wrote:

    From a programmer's perspective, VAX exception handling was very
    nice. It may have been high overhead, though.

    Very high overhead. But it was also language-independent, and
    integrated into the procedure-calling convention, which also managed
    to be language-independent.

    There is an internal memo on Bitsavers somewhere, critiquing a
    proposal to adopt the MIPS architecture (which DEC did, for just one
    machine, the DECstation 3000 if I recall rightly),

    Much more than one machine.
    https://en.wikipedia.org/wiki/DECstation
    4 ranges, 13 models.

    DEC was quite successful both with MIPS and with x86.
    I'd guess, their CPU designers didn't like it.

    and one of the
    points against MIPS was that it didn’t have language-independent
    exception handling. But then no other architecture, before the VAX or
    since, has been able to do that.

  • From Michael S@21:1/5 to Terje Mathisen on Tue Sep 24 11:37:03 2024
    On Tue, 24 Sep 2024 08:02:23 +0200
    Terje Mathisen <[email protected]> wrote:

    MitchAlsup1 wrote:
    On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

    In article <O2DHO.184073$[email protected]>,
    EricP  <[email protected]> wrote:

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow
    instruction (or a mode for existing ALU instructions) is useless
    for anything OTHER than as a debug aid where you crash the program
    on overflow (you can have a general exception handler to shut down
    gracefully, but "patching things
    up and continuing" doesn't work).  I gave details of reasons folks
    might want to try to use trap-on-overflow instructions, and show
    how the other cases don't make sense.

    In no way was I ever arguing that checking for overflow was a bad
    idea, or a language issue, or anything else.  Just that CPUs
    should not bother having trap-on-overflow instructions.
    <snip>
    So why should any hardware include an instruction to
    trap-on-overflow?

    Trap-on-overflow instructions have a hardware cost, of varying
    severity. If the ISA isn't already trapping on ALU instructions
    (such as divide-by-0), it adds a new class of operations which can
    take exceptions.  An ALU functional unit that cannot take
    exceptions doesn't have to save "unwinding" info (at minimum, info
    to recover the PC, and possibly rollback state), and not needing
    this can be a nice simplification.  Branches and LD/ST always
    needs this info, but not needing it on ALU ops can be a nice
    simplification of logic, and makes it
    easier to have multiple ALU functional units.  Note that x86 INTO
    can be treated as a branch, so it doesn't have the cost of an
    instruction like "ADDTO r1,r2,r3" which is a normal ADD but where
    the ADD itself traps if it overflows.  ADDTO is particularly what
    I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take
    exceptions.

    You argue that trap-on-overflow as an instruction is unnecessary
    AND
    You argue that overflow detection is worthwhile
    AND
    You argue that ALU should not raise overflow exceptions

    I am at a loss for how to take all 3 arguments together at the
    same time !?! Can you explain ??

    Maybe all add/sub/etc opcodes that are immediately followed by an
    INTO could be fused into a single ADDO/SUBO/etc version that takes
    zero extra cycles as long as the trap part isn't hit?

    Personally I'm happy with the clang approach.


    A couple of questions:
    1. Which code would you put at the destination of the jo branch?
    2. In your code generator, would every jo in the code (or in the module,
    or in the function) jump to the same destination, or would each have
    a destination of its own?

    It would be interesting if you answer before looking at what clang does,
    then take a look and comment again.

    Terje


  • From Terje Mathisen@21:1/5 to Michael S on Tue Sep 24 10:44:22 2024
    Michael S wrote:
    On Tue, 24 Sep 2024 08:02:23 +0200
    Terje Mathisen <[email protected]> wrote:

    MitchAlsup1 wrote:
    On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

    In article <O2DHO.184073$[email protected]>,
    EricP  <[email protected]> wrote:

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow
    instruction (or a mode for existing ALU instructions) is useless
    for anything OTHER than as a debug aid where you crash the problem
    on overflow (you can have a general exception handler to shut down
    gracefully, but "patching things
    up and continuing" doesn't work).  I gave details of reasons folks
    might want to try to use trap-on-overflow instructions, and show
    how the other cases don't make sense.

    In no way was I ever arguing that checking for overflow was a bad
    idea, or a language issue, or anything else.  Just that CPUs
    should not bother having trap-on-overflow instructions.
    <snip>
    So why should any hardware include an instruction to
    trap-on-overflow?

    Trap-on-overflow instruction have a hardware cost, of varying
    severity. If the ISA isn't already trapping on ALU instructions
    (such as divide-by-0), it adds a new class of operations which can
    take exceptions.  An ALU functional unit that cannot take
    exceptions doesn't have to save "unwinding" info (at minimum, info
    to recover the PC, and possibly rollback state), and not needing
    this can be a nice simplification.  Branches and LD/ST always
    needs this info, but not needing it on ALU ops can be a nice
    simplification of logic, and makes it
    easier to have multiple ALU functional units.  Note that x86 INTO
    can be treated as a branch, so it doesn't have the cost of an
    instruction like "ADDTO r1,r2,r3" which is a normal ADD but where
    the ADD itself traps if it overflows.  ADDTO is particularly what
    I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take
    exceptions.

    You argue that trap-on-overflow as an instruction is unnecessary
    AND
    You argue that overflow detection is worthwhile
    AND
    You argue that ALU should not raise overflow exceptions

    I am at a loss for how to take all 3 arguments together at the
    same time !?! Can you explain ??

    Maybe all add/sub/etc opcodes that are immediately followed by an
    INTO could be fused into a single ADDO/SUBO/etc version that takes
    zero extra cycles as long as the trap part isn't hit?

    Personally I'm happy with the clang approach.


    Couple of questions:
    1. Which code would you put at destination of jo branch?
    2. In your code generator would every jo in the code (or in the module,
    or in the function) jump to the same destination or each will have destination of its own.

    It would be interesting if you answer before looking at what clang does,
    then take a look and comment again.

    If the handler consists of terminating the program, then every function,
    or small group of functions depending upon total code size, can have a
    common target, just so that all the JO opcodes can use the short-form
    two-byte encoding of a forward branch. I.e. leaving just 127 bytes
    available for mainline code.

    If you want separate handling for each overflow, i.e. switch to bigint
    and resume, then you do need one target per JO, in order to pick up the originating instruction address (and place it on the stack for a
    subsequent RET?) before jumping to a common handler.
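    The two layouts can be approximated in C. This is only a sketch,
    assuming GCC or Clang (`__builtin_add_overflow` is a compiler builtin,
    not standard C). The names `overflow_count`, `last_site`,
    `on_overflow_common`, `add_common`, `on_overflow_at` and `ADD_PER_SITE`
    are hypothetical, and the handlers only record the event so the example
    can be observed; a real handler would terminate or recover.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical bookkeeping so the sketch can be observed without aborting. */
static int overflow_count = 0;
static const char *last_site = NULL;

/* Layout 1: one shared target. Every check funnels into the same handler,
   which therefore cannot tell which add overflowed. */
static void on_overflow_common(void) {
    overflow_count++;
}

static int add_common(int a, int b) {
    int sum;
    /* On x86 this check is typically compiled to an add followed by jo. */
    if (__builtin_add_overflow(a, b, &sum))
        on_overflow_common();
    return sum;  /* the builtin stores the wrapped result on overflow */
}

/* Layout 2: one target per check. Each site passes its own identity,
   standing in for the originating instruction address. */
static void on_overflow_at(const char *site) {
    overflow_count++;
    last_site = site;
}

#define STR2(x) #x
#define STR(x) STR2(x)

/* Evaluates to 1 on overflow (after reporting the site), 0 otherwise. */
#define ADD_PER_SITE(a, b, out) \
    (__builtin_add_overflow((a), (b), (out)) \
        ? (on_overflow_at(__FILE__ ":" STR(__LINE__)), 1) : 0)
```

    The per-site form costs a distinct string (or, in machine code, a
    distinct branch target) for every checked operation, which is the code
    size price discussed in this thread.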

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Tue Sep 24 13:18:55 2024
    On Tue, 24 Sep 2024 10:44:22 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 24 Sep 2024 08:02:23 +0200
    Terje Mathisen <[email protected]> wrote:

    MitchAlsup1 wrote:
    On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

    In article <O2DHO.184073$[email protected]>,
    EricP  <[email protected]> wrote:

    The x86 designers might then have had an incentive to make all
    the checks as efficient as possible, and rather than eliminate
    them, they might have enhanced and more tightly integrated
    them.

    OK, my post was about how having a hardware trap-on-overflow
    instruction (or a mode for existing ALU instructions) is useless
    for anything OTHER than as a debug aid where you crash the
    problem on overflow (you can have a general exception handler to
    shut down gracefully, but "patching things
    up and continuing" doesn't work).  I gave details of reasons
    folks might want to try to use trap-on-overflow instructions,
    and show how the other cases don't make sense.

    In no way was I ever arguing that checking for overflow was a bad
    idea, or a language issue, or anything else.  Just that CPUs
    should not bother having trap-on-overflow instructions.
    <snip>
    So why should any hardware include an instruction to
    trap-on-overflow?

    Trap-on-overflow instruction have a hardware cost, of varying
    severity. If the ISA isn't already trapping on ALU instructions
    (such as divide-by-0), it adds a new class of operations which
    can take exceptions.  An ALU functional unit that cannot take
    exceptions doesn't have to save "unwinding" info (at minimum,
    info to recover the PC, and possibly rollback state), and not
    needing this can be a nice simplification.  Branches and LD/ST
    always needs this info, but not needing it on ALU ops can be a
    nice simplification of logic, and makes it
    easier to have multiple ALU functional units.  Note that x86
    INTO can be treated as a branch, so it doesn't have the cost of
    an instruction like "ADDTO r1,r2,r3" which is a normal ADD but
    where the ADD itself traps if it overflows.  ADDTO is
    particularly what I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take
    exceptions.

    You argue that trap-on-overflow as an instruction is unnecessary
    AND
    You argue that overflow detection is worthwhile
    AND
    You argue that ALU should not raise overflow exceptions

    I am at a loss for how to take all 3 arguments together at the
    same time !?! Can you explain ??

    Maybe all add/sub/etc opcodes that are immediately followed by an
    INTO could be fused into a single ADDO/SUBO/etc version that takes
    zero extra cycles as long as the trap part isn't hit?

    Personally I'm happy with the clang approach.


    Couple of questions:
    1. Which code would you put at destination of jo branch?
    2. In your code generator would every jo in the code (or in the
    module, or in the function) jump to the same destination or each
    will have destination of its own.

    It would be interesting if you answer before looking at what clang
    does, then take a look and comment again.

    If the handler consists of terminating the program, then every
    function, or small group of functions depending upon total code size,
    can have a common target, just so that all the JO opcodes can use the
    short-form two-byte encoding of a forward branch. I.e. leaving just
    127 bytes available for mainline code.


    In the case of clang (on Win7+msys2; I didn't test other targets) the
    handler terminates the program with no useful info printed.
    Still, clang has a different target for each jo. At the target location
    it places an invalid instruction.
    So it lays the infrastructure for a better handler and pays the price in
    code size, but does not take advantage of it.

    If you want separate handling for each overflow, i.e. switch to
    bigint and resume, then you do need one target per JO, in order to
    pick up the originating instruction address (and place it on the
    stack for a subsequent RET?) before jumping to a common handler.

    Terje


    I want the address of the originating instruction in the handler.
    I want it not for a switch to bigint, which would not be in the spirit
    of non-dynamic compiled languages, but in order to get a useful
    termination printout.
    With JO, getting what I want would cost a significant increase in
    code size.

  • From Kent Dickey@21:1/5 to [email protected] on Tue Sep 24 15:47:21 2024
    In article <vcpidc$29e51$[email protected]>,
    Thomas Koenig <[email protected]> wrote:
    Kent Dickey <[email protected]> schrieb:

    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values.

    I disagree.

    Look at the sanitizer libraries, which insert runtime checks for
    integer overflow - having less overhead for these would definitely
    be a plus.

    See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
    or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

    Not valuing something just means no one is spending a lot of time/effort
    on it. Decimal math is not valued--but you can still do it, it just
    has no special instructions on most architectures to make it fast/easy.
    And as I've pointed out, trapping on integer overflow is clearly not
    valued--on x86, where INTO exists, GCC and Clang do not use it.

    Kent

  • From Stephen Fuld@21:1/5 to Terje Mathisen on Tue Sep 24 10:06:27 2024
    On 9/23/2024 11:02 PM, Terje Mathisen wrote:

    snip

    Maybe all add/sub/etc opcodes that are immediately followed by an INTO
    could be fused into a single ADDO/SUBO/etc version that takes zero extra cycles as long as the trap part isn't hit?

    If you are going to do that, why not make it an optional prefix byte?
    That way, no fusion needed, no extra cycles, yet the same amount of code
    space.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Michael S on Tue Sep 24 17:05:28 2024
    On Tue, 24 Sep 2024 10:18:55 +0000, Michael S wrote:

    I want address of originating instruction in the handler.
    I want it not for switch to bigint that would not be in spirit of
    non-dynamic compiled languages, but in order to get useful termination printout.
    With JO in order to get what I want I'd have to pay by significant
    increase in code size.

    0XADDRESS OPCODE Rd,RS1,Rs2 // Had OVERFLOW using Rs1=0x12345678
    and Rs2=0xFEDCBA09
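
    A handler that is handed the faulting address and the operand values
    could produce a line of that shape. A minimal sketch in C follows;
    `format_overflow_report` and its parameters are hypothetical, and the
    field layout simply mirrors the example line above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats one diagnostic line into buf.
   Returns what snprintf returns: the length the full line needs. */
static int format_overflow_report(char *buf, size_t len, uintptr_t addr,
                                  const char *opcode,
                                  uint32_t rs1, uint32_t rs2) {
    return snprintf(buf, len,
                    "0x%08lX %s // Had OVERFLOW using Rs1=0x%08lX and Rs2=0x%08lX",
                    (unsigned long)addr, opcode,
                    (unsigned long)rs1, (unsigned long)rs2);
}
```

    With a JO-based scheme the address argument has to be materialized per
    check site; with a trapping instruction the hardware already has it.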

  • From Bill Findlay@21:1/5 to Kent Dickey on Tue Sep 24 18:38:49 2024
    On 24 Sep 2024, Kent Dickey wrote
    (in article <vcumu9$38iv2$[email protected]>):

    In article<vcpidc$29e51$[email protected]>,
    Thomas Koenig <[email protected]> wrote:
    Kent Dickey <[email protected]> schrieb:

    Trapping on overflow is basically useless other than as a debug aid, which clearly nobody values.

    I disagree.

    Look at the sanitizer libraries, which insert runtime checks for
    integer overflow - having less overhead for these would definitely
    be a plus.

    See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
    or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

    Not valuing something just means no one is spending a lot of time/effort
    on it. Decimal math is not valued--but you can still do it, it just
    has no special instructions on most architectures to make it fast/easy.
    And as I've pointed out, trapping on integer overflow is clearly not valued--on x86, where INTO exists, GCC and Clang do not use it.

    To quote Nick: sigh.
    --
    Bill Findlay

  • From Thomas Koenig@21:1/5 to Kent Dickey on Tue Sep 24 17:40:48 2024
    Kent Dickey <[email protected]> schrieb:
    In article <vcpidc$29e51$[email protected]>,
    Thomas Koenig <[email protected]> wrote:
    Kent Dickey <[email protected]> schrieb:

    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values.

    I disagree.

    Look at the sanitizer libraries, which insert runtime checks for
    integer overflow - having less overhead for these would definitely
    be a plus.

    See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
    or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

    Not valuing something just means no one is spending a lot of time/effort
    on it.

    So writing and maintaining these libraries is not a lot of effort?

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Sep 24 17:57:51 2024
    On Tue, 24 Sep 2024 17:06:27 +0000, Stephen Fuld wrote:

    On 9/23/2024 11:02 PM, Terje Mathisen wrote:

    snip

    Maybe all add/sub/etc opcodes that are immediately followed by an INTO
    could be fused into a single ADDO/SUBO/etc version that takes zero extra
    cycles as long as the trap part isn't hit?

    If you are going to do that, why not make it an optional prefix byte?
    That way, no fusion needed, no extra cycles, yet the same amount of code space.

    Realistically, what is the difference if INTO is a prefix
    byte or a postfix byte ?

  • From Stephen Fuld@21:1/5 to Niklas Holsti on Wed Sep 25 09:54:17 2024
    On 9/10/2024 1:13 AM, Niklas Holsti wrote:

    In the Ada case, the ability to declare array types with
    programmer-chosen index types with bounded range, such as range-bounded
    integers or enumerations, means that the compiler can avoid indexing
    checks when the (sub)type of the index is known at compile time to fit
    within the index range of the array.

    I have always liked the idea of variable ranges able to be specified in
    the language. Besides the advantages you mentioned, it provides more
    human "comprehensibility" (if the ranges are reasonably named) i.e.
    better internal documentation, and it makes responding to specification
    changes required later in the program life cycle easier and less error
    prone, i.e. if the range has to change, you change it in one place and
    don't risk missing making the change in some obscure part of the program
    you forgot about.
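
    C has no range-bounded subtypes, but the "change it in one place" idea
    can be approximated. The sketch below is an illustration only: the
    names `SENSOR_FIRST`, `SENSOR_LAST`, `sensor_index`,
    `sensor_index_from_int` and `reading_at` are hypothetical, and unlike
    Ada the C compiler does not enforce that every `sensor_index` went
    through the checked conversion.

```c
#include <stdbool.h>

/* Hypothetical named range: change the bounds here and nowhere else. */
enum { SENSOR_FIRST = 1, SENSOR_LAST = 16 };
#define SENSOR_COUNT (SENSOR_LAST - SENSOR_FIRST + 1)

/* A distinct wrapper type, so a bare int is not accepted by accident. */
typedef struct { int value; } sensor_index;

/* Checked conversion: the only place a range test is written. */
static bool sensor_index_from_int(int raw, sensor_index *out) {
    if (raw < SENSOR_FIRST || raw > SENSOR_LAST)
        return false;
    out->value = raw;
    return true;
}

static int readings[SENSOR_COUNT];

/* Indexing with an already validated sensor_index needs no further check. */
static int *reading_at(sensor_index i) {
    return &readings[i.value - SENSOR_FIRST];
}
```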


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to Stephen Fuld on Wed Sep 25 20:07:45 2024
    On Wed, 25 Sep 2024 09:54:17 -0700
    Stephen Fuld <[email protected]d> wrote:

    On 9/10/2024 1:13 AM, Niklas Holsti wrote:

    In the Ada case, the ability to declare array types with
    programmer-chosen index types with bounded range, such as
    range-bounded integers or enumerations, means that the compiler can
    avoid indexing checks when the (sub)type of the index is known at
    compile time to fit within the index range of the array.

    I have always liked the idea of variable ranges able to be specified
    in the language. Besides the advantages you mentioned, it provides
    more human "comprehensibility" (if the ranges are reasonably named)
    i.e. better internal documentation, and it makes responding to
    specification changes required later in the program life cycle easier
    and less error prone, i.e. if the range has to change, you change it
    in one place and don't risk missing making the change in some obscure
    part of the program you forgot about.



    The problem here is that arrays with fixed bounds were common when
    Ada was conceived back in the mid-1970s. On general-purpose (as opposed
    to embedded) computers they were already much rarer when Ada shipped
    in 1983. By the late 1990s, arrays with fixed bounds were the rare
    exception rather than the rule.
    Except, of course, for many types of embedded computers. But even that
    is gradually changing. Very gradually.

  • From EricP@21:1/5 to All on Wed Sep 25 12:16:59 2024
    MitchAlsup1 wrote:
    On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:

    It is very efficient if those events are rarely or never supposed to
    occur.

    Many (most, nearly all) processor architectures have notoriously
    bad exception delivery to a point of control that can deal with
    the problem at hand.

    But it is not necessary for that bad mechanism to be necessary !!

    Some of the things that minimize the "badness" of taking an exception::

    a) deliver control to user signal handler without taking an
    excursion through GuestOS. (think 10 cycles)
    b) when control arrives, receiving thread is already reentrant.
    c) when control arrives, the instruction (bits) and its operand
    values are delivered to the exception handler. So, the exception
    handler has what it needs to deal with the problem at hand.
    d) when control returns, the result (R0) is delivered back to the
    destination register.
    e) (b, c, d) are performed without handler needing to understand
    how. Handler is just a subroutine that receives arguments (c)
    fixes the problem, and returns a non-excepting value, or abort.
    f) return has a way to re-execute the instruction or to skip the
    instruction under control of handler without having access
    to excepting-IP and without knowing the length of the
    instruction.
    g) during (a..f) nobody ever has to disable interrupts or
    exceptions or re-enable them later. Priority and privilege
    are inherited automatically from excepting thread.


    I know of only 1 ISA with these properties....
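
    Items (c) through (e) describe a handler that looks like an ordinary
    subroutine. The C sketch below is a pure software model of that idea,
    not the interface of any real ISA: `excepting_op`, `handler_action`,
    `saturating_add_handler` and `add_with_handler` are hypothetical names,
    and the GCC/Clang builtins `__builtin_add_overflow` and
    `__builtin_trap` stand in for the hardware detection and abort paths.

```c
#include <stdint.h>

/* Hypothetical description of the excepting operation, as in item (c). */
typedef struct {
    uint32_t instruction_bits;  /* raw encoding, opaque in this sketch */
    int32_t  operand1;
    int32_t  operand2;
} excepting_op;

typedef enum { RESUME_WITH_VALUE, ABORT } handler_action;

/* The handler receives the operands and either supplies a replacement
   result or asks for an abort. This example policy saturates. */
static handler_action saturating_add_handler(const excepting_op *op,
                                             int32_t *result) {
    /* A signed add only overflows when both operands share a sign,
       so the sign of either operand picks the saturation bound. */
    *result = (op->operand1 < 0) ? INT32_MIN : INT32_MAX;
    return RESUME_WITH_VALUE;
}

/* Software stand-in for the hardware path: detect, call handler, deliver. */
static int32_t add_with_handler(int32_t a, int32_t b) {
    int32_t sum;
    if (!__builtin_add_overflow(a, b, &sum))
        return sum;                 /* common case: no handler involved */
    excepting_op op = { 0u, a, b };
    int32_t fixed;
    if (saturating_add_handler(&op, &fixed) == ABORT)
        __builtin_trap();
    return fixed;                   /* item (d): value reaches the destination */
}
```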

    It all depends on the frequency that exceptions occur.
    It used to be that Page Fault was the only one that occurred with any
    frequency, and the code path for the page fault handler was long enough
    that any HW overhead was lost in the noise. In all other cases they
    indicated a fatal error so the HW cost was the least of your problems.

    But then processors, mostly RISC ones, started using exceptions for
    housekeeping: SPARC for register window sliding, Alpha for byte, word
    and misaligned memory access, MIPS and Alpha for software TLB-miss
    handling.
    And suddenly the exceptional becomes the normal.

    The solution for Alpha was to add back the byte and word instructions,
    and add misaligned access support to all memory ops.
    SPARC stuck with traps for register windows.
    No one else used software-managed TLBs.

    Then virtual machines come along using exceptions to trigger
    trap-and-emulate code, and now the normal becomes frequent.
    Not 1 or 10 exceptions per second, but 100,000 or 200,000.

    The solution for VMs is to add the ISA features necessary so that
    most exceptions are rare, and when they do happen they are cheap.
    Worst case it should cost the same as a branch mispredict pipeline drain.

  • From EricP@21:1/5 to Kent Dickey on Wed Sep 25 12:54:18 2024
    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly
    predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.
    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.
    [ More details about architectures without trapping overflow instructions ]
    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values. If you take Rust's approach, and only
    detect overflow in debug builds, then you already don't care about
    performance.
    Those automatic software correctness checks, of which signed integer
    overflow detection is one of many, went away because most code was
    being written in C/C++ and those two languages don't require them.

    That just makes it more expensive in code size and performance to effect
    such checks. This overhead leads some to conclude it justifies eliminating
    the error checks.

    Eliminating the error event detectors doesn't make errors go away,
    just your knowledge of them.

    I gather portions of 16-bit Windows 3.1 were written in Pascal.
    When Microsoft developed 32-bit WinNT, if instead of C they had
    switched their official development language from Pascal to Modula-2
    which does require signed and unsigned, checked and modulo arithmetic,
    and array bounds checks, the world would have been a much safer place.

    But they didn't so it isn't.

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow instruction
    (or a mode for existing ALU instructions) is useless for anything OTHER
    than as a debug aid where you crash the problem on overflow (you can
    have a general exception handler to shut down gracefully, but "patching things
    up and continuing" doesn't work). I gave details of reasons folks might
    want to try to use trap-on-overflow instructions, and show how the
    other cases don't make sense.

    For me error detection of all kinds is useful. It just happens
    to not be conveniently supported in C so no one tries it in C.

    GCC's -ftrapv option is not useful for a variety of reasons.
    1) it's slow, about a 50% performance hit
    2) it's always on for a compilation unit, which is not what programmers
    need, as it triggers many false positives, so people turn it off.

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.

    I understand, and I disagree with this conclusion.
    I think all forms of software error detection are useful and
    HW should make them simple and eliminate cost when possible.

    You then went on to discuss how you want trap-on-overflow instructions
    for stuff like C code, so you can detect code bugs and shut down gracefully.

    And my response to that is we already know compilers don't use it. x86
    has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.

    Well, there is a bunch of things to unpack here.

    First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
    that opcode to be for other instructions. On x64 the JO (jump overflow) instruction does overflow detection.

    The reason AMD could reassign INTO was because it wasn't being used by C/C++. But this is a side effect of C's widespread use, not the cause.
    Programmers write in C because it is widely used and supported,
    and as a consequence of that choice they get unchecked arithmetic.
    But they are not choosing C to get unchecked arithmetic.
    Had these same usage tests been done on other languages the results
    would likely be quite different.

    Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes
    respectively. In the JO case it has to branch to a ThrowOverflow() call,
    so that's 5 more bytes per ADD or SUB if you want error traceability.
    With overflow trapping instructions there is NO runtime or code size cost.

    Third, on many RISC ISAs like RISC-V there are no flags, so no JO
    instruction is even possible. Either they must use the branchless
    overflow idiom or the branching version, adding more to the cost of
    error detection.

    *OR* it has an Add Fault Overflow instruction which has NO RUNTIME COST
    ADDFO rd = rs1 + rs2
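
    For reference, one common form of the branchless idiom can be written
    in C as below. This is a sketch, not what any particular compiler
    emits; `add_overflows_branchless` is a hypothetical name, and the
    conversion of the wrapped sum back to `int32_t` assumes the usual
    two's-complement behaviour of mainstream compilers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed overflow happened iff both inputs have the same sign and the
   sign of the sum differs from it. The add itself is done in unsigned
   arithmetic, where wraparound is well defined in C. */
static bool add_overflows_branchless(int32_t a, int32_t b, int32_t *sum) {
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t us = ua + ub;
    *sum = (int32_t)us;
    /* Bit 31 of (ua ^ us) & (ub ^ us) is set exactly on overflow. */
    return (((ua ^ us) & (ub ^ us)) >> 31) != 0;
}
```

    On a flagless ISA this is the add plus roughly two XORs, an AND and a
    shift or sign test per checked operation, which is the extra cost
    being referred to.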

    Fourth, it sounds like what you want is a risc (no flags) ADD instruction
    that returns both a result and an overflow flag so you can do the
    equivalent of the x64 JO branch test.
    ADDO (ro,rd) = rs1 + rs2
    where rd is dest and ro is a register to receive a 0/1 overflow flag.
    Once one allows multiple dest registers ADDO is trivial to support.

    But that does not invalidate the usefulness of ADDFO.
    I would also have ADDFC Add Fault Carry for unsigned overflow,
    plus other instructions for checking signed overflow and unsigned carry.
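
    The intended semantics of these proposed instructions can be modelled
    in C. This is a sketch under assumptions: 64-bit registers are assumed,
    the function names simply reuse the mnemonics above, `addo_result` is a
    hypothetical type, and the GCC/Clang builtins `__builtin_add_overflow`
    and `__builtin_trap` stand in for the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the two-result form: ADDO (ro,rd) = rs1 + rs2. */
typedef struct {
    int64_t rd;  /* wrapped sum */
    bool    ro;  /* 1 if signed overflow occurred, else 0 */
} addo_result;

static addo_result addo(int64_t rs1, int64_t rs2) {
    addo_result r;
    r.ro = __builtin_add_overflow(rs1, rs2, &r.rd);
    return r;
}

/* Model of the faulting form: ADDFO rd = rs1 + rs2, faults on signed
   overflow and otherwise behaves like a plain add. */
static int64_t addfo(int64_t rs1, int64_t rs2) {
    int64_t rd;
    if (__builtin_add_overflow(rs1, rs2, &rd))
        __builtin_trap();
    return rd;
}

/* Model of ADDFC: the unsigned counterpart, faults on carry out. */
static uint64_t addfc(uint64_t rs1, uint64_t rs2) {
    uint64_t rd;
    if (__builtin_add_overflow(rs1, rs2, &rd))
        __builtin_trap();
    return rd;
}
```

    With ADDO the caller still spends a branch on `ro`; with ADDFO and
    ADDFC the non-overflowing path contains nothing but the add.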

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
        return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument for detect signed overflows and crash).

    For x86-64 clang 19.1.0:

    add2:
            add     edi, esi
            jo      .LBB0_1
            mov     eax, edi
            ret
    .LBB0_1:
            ud1     eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    Yes, this is all for x64 which has no INTO instruction.
    GCC's -ftrapv redirects signed arithmetic to the overflow trapping library
    which used to (a) have a reputation for bugs and (b) cause a 50% slow down.
    I believe they fixed the bugs eventually but the performance hit remains.

    But x86 has an instruction to trap on overflow directly: INTO. It's one byte.
    And it doesn't use it.

    GCC x86-64 14.2 is even worse:

    add2:
            sub     rsp, 8
            call    __addvsi3
            add     rsp, 8
            ret

    It calls a routine to do all additions which might overflow, and that
    routine calls assert() if an overflow occurs.

    The CPU has a trap-on-overflow instruction exactly for this case (to crash
    on detecting an overflow), and compilers don't even use it.

    So even on architectures which have a trap-on-overflow instruction,
    compilers don't use it.

    This is for x64 which has no INTO instruction.
    And __addvsi3 didn't use JO either. It uses the branching test,
    not the branchless idiom.

    https://blog.weghos.com/llvm/llvm/compiler-rt/lib/builtins/addvsi3.c.html

    So why should any hardware include an instruction to trap-on-overflow?

    Because ALL the negative speed and code size consequences do not occur.

    Trap-on-overflow instruction have a hardware cost, of varying severity.
    If the ISA isn't already trapping on ALU instructions (such as
    divide-by-0), it adds a new class of operations which can take
    exceptions. An ALU functional unit that cannot take exceptions doesn't
    have to save "unwinding" info (at minimum, info to recover the PC, and possibly rollback state), and not needing this can be a nice
    simplification. Branches and LD/ST always needs this info, but not
    needing it on ALU ops can be a nice simplification of logic, and makes it easier to have multiple ALU functional units. Note that x86 INTO can
    be treated as a branch, so it doesn't have the cost of an instruction
    like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
    traps if it overflows. ADDTO is particularly what I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take exceptions.

    Not really. It's a flag in the uOp indicating HasException and a union
    of fields to hold exception status and RIP, all of which needs to be
    there for other instructions like load/store.

    Instruction sets which make detecting overflow difficult (say, RISC-V),
    would do well to make branch-on-overflow efficient and easy. But adding
    trap-on-overflow instructions is a waste of effort.
    No, they are a very useful tool for those who need such a tool
    because the manual alternative is significantly more expensive
    for both size and performance.

    "I have one example where overflow exceptions would be a poor implementation
    choice" does not imply "therefore no one should have them as an option".

    Can you share what language, compiler, and hardware you are using which implements overflow checks using a trap-on-overflow instruction?

    Kent

    On DEC VAX the Overflow Enable flag was in the Program Status Word.
    IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
    Ada, Cobol, and disabled by default for C. But it could be toggled with
    a runtime library call.

    For a variety of reasons having Overflow Enable in the status register is
    A Bad Idea. On Alpha it was a compile switch which selects different instructions ADD vs ADDV, and also controlled by pragmas.
    If you wanted to manually test for overflow then you used
    one of the idioms, whatever language you worked in.

  • From MitchAlsup1@21:1/5 to Michael S on Wed Sep 25 17:30:46 2024
    On Wed, 25 Sep 2024 17:07:45 +0000, Michael S wrote:

    On Wed, 25 Sep 2024 09:54:17 -0700
    Stephen Fuld <[email protected]d> wrote:

    On 9/10/2024 1:13 AM, Niklas Holsti wrote:

    In the Ada case, the ability to declare array types with
    programmer-chosen index types with bounded range, such as
    range-bounded integers or enumerations, means that the compiler can
    avoid indexing checks when the (sub)type of the index is known at
    compile time to fit within the index range of the array.

    I have always liked the idea of variable ranges able to be specified
    in the language. Besides the advantages you mentioned, it provides
    more human "comprehensibility" (if the ranges are reasonably named)
    i.e. better internal documentation, and it makes responding to
    specification changes required later in the program life cycle easier
    and less error prone, i.e. if the range has to change, you change it
    in one place and don't risk missing making the change in some obscure
    part of the program you forgot about.



    The problem here is that arrays with fixed bounds were common when
    Ada was conceived back in the mid 1970s. On general-purpose (as opposed
    to embedded) computers they were already much rarer when Ada was shipped
    in 1983. By late 1990s arrays with fixed bounds were rare exception
    rather than rule.

    It sounds like variable ranges (array indexes) would be becoming more
    common, also.

    Where "variable range" is a variable that is defined to have a
    specified range, but from run to run the upper and lower bounds
    can be modified without re-compilation.
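
    A range whose bounds are data rather than compile-time constants can
    be sketched in C as follows. The names `range`, `range_init` and
    `ranged_assign` are hypothetical; the point is only that the bounds
    could come from a configuration file or command line at startup.

```c
#include <stdbool.h>

/* Hypothetical run-time range: the bounds are ordinary data, so changing
   them between runs needs no recompilation. */
typedef struct {
    long lo;
    long hi;
} range;

/* Set the bounds; rejects an empty (inverted) range. */
static bool range_init(range *r, long lo, long hi) {
    if (lo > hi)
        return false;
    r->lo = lo;
    r->hi = hi;
    return true;
}

/* Assign to a ranged variable; out-of-range values leave it unchanged. */
static bool ranged_assign(const range *r, long *var, long v) {
    if (v < r->lo || v > r->hi)
        return false;
    *var = v;
    return true;
}
```

    Because the bounds are not known at compile time, every assignment
    carries a check here, which a compiler could elide for fixed bounds.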

    Except, of course, for many types of embedded computers. But even that
    is gradually changing. Very gradually.

  • From Scott Lurndal@21:1/5 to EricP on Wed Sep 25 17:46:46 2024
    EricP <[email protected]> writes:
    MitchAlsup1 wrote:

    But then, risc processors mostly, started using exceptions for housekeeping
    - SPARC for register window sliding, Alpha for byte, word and misaligned
    memory access, MIPS and Alpha for software TLB-miss handling.
    And suddenly the exceptional becomes the normal.

    Yet all four of those have been relegated to the mists of history,
    for good reason.


    Then virtual machines come along using exceptions to trigger
    trap-and-emulate code, and now the normal becomes frequent.

    And then SVM (AMD) and VT-X (Intel) subsequently added
    hardware support that significantly reduced the need for
    trap-and-emulate.

  • From EricP@21:1/5 to Terje Mathisen on Wed Sep 25 13:56:40 2024
    Terje Mathisen wrote:
    Kent Dickey wrote:

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
    return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument to detect signed overflows and crash).

    For x86-64 clang 19.1.0:

    add2:
    add edi, esi
    jo .LBB0_1
    mov eax, edi
    ret
    .LBB0_1:
    ud1 eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    But x86 has an instruction to trap on overflow directly: INTO. It's
    one byte.
    And it doesn't use it.

    GCC x86-64 14.2 is even worse:

    add2:
    sub rsp, 8
    call __addvsi3
    add rsp, 8
    ret

    It calls a routine to do all additions which might overflow, and that
    routine calls assert() if an overflow occurs.

    The CPU has a trap-on-overflow instruction exactly for this case (to
    crash on detecting an overflow), and compilers don't even use it.

    So even on architectures which have a trap-on-overflow instruction,
    compilers don't use it.

    You can only compile in INTO opcodes if you can guarantee that the INT 4 (INTO) trap vector will always be set to a proper handler, and since
    that isn't part of the ABI, compilers can't depend on it?

    I do agree that it would be nice if it did work, barring that clang is
    doing the best possible alternative, at close to zero cost except for
    the useless branch predictor table entry wastage.

    Terje

    On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
    One must use JO to detect signed overflow.
    Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.
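
    As a rough illustration of the JO route in 64-bit mode, the sketch below
    uses the GCC/Clang builtin __builtin_add_overflow, which on x86-64 is
    typically compiled to an ADD followed by a JO to a cold path. The helper
    names add_or_abort and add_checked are made up for this example.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: add two ints and abort on signed overflow.
 * __builtin_add_overflow is a GCC/Clang builtin; it returns true when
 * the mathematically exact sum does not fit in *sum. */
static int add_or_abort(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        abort();            /* cold path, reached only on overflow */
    return sum;
}

/* Non-aborting variant: returns false on overflow so the caller decides. */
static bool add_checked(int a, int b, int *sum)
{
    return !__builtin_add_overflow(a, b, sum);
}
```

    Unlike -ftrapv, this makes the check explicit at each call site, so
    checked and wrapping additions can be mixed freely in one function.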

  • From MitchAlsup1@21:1/5 to EricP on Wed Sep 25 17:49:13 2024
    On Wed, 25 Sep 2024 16:16:59 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:

    It is very efficient if those events are rarely or never supposed to
    occur.

    Many (most, nearly all) processor architectures have notoriously
    bad exception delivery to a point of control that can deal with
    the problem at hand.

    But it is not necessary for that bad mechanism to be necessary !!

    Some of the things that minimize the "badness" of taking an exception::

    a) deliver control to user signal handler without taking an
    excursion through GuestOS. (think 10 cycles)
    b) when control arrives, receiving thread is already reentrant.
    c) when control arrives, the instruction (bits) and its operand
    values are delivered to the exception handler. So, the exception
    handler has what it needs to deal with the problem at hand.
    d) when control returns, the result (R0) is delivered back to the
    destination register.
    e) (b, c, d) are performed without handler needing to understand
    how. Handler is just a subroutine that receives arguments (c)
    fixes the problem, and returns a non-excepting value, or abort.
    f) return has a way to re-execute the instruction or to skip the
    instruction under control of handler without having access
    to excepting-IP and without knowing the length of the
    instruction.
    g) during (a..f) nobody ever has to disable interrupts or
    exceptions or re-enable them later. Priority and privilege
    are inherited automatically from excepting thread.
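
    A software-only sketch of items (c) through (e) above, assuming a handler
    that is an ordinary subroutine: it receives the instruction bits and
    operand values and either returns a replacement result or asks for an
    abort. All names here (fault_info, overflow_handler, add_with_handler)
    and the zero opcode encoding are hypothetical; this simulates the idea in
    C and does not describe any real ISA's delivery mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the excepting instruction, as in item (c). */
typedef struct {
    uint32_t opcode_bits;   /* instruction bits (made-up encoding) */
    int32_t  operand1;
    int32_t  operand2;
} fault_info;

typedef enum { FIX_RESULT, ABORT_THREAD } fault_action;

/* The handler is a plain subroutine: operands in, fixed-up value out. */
typedef fault_action (*overflow_handler)(const fault_info *info,
                                         int32_t *result);

/* Example policy: saturate. On signed add overflow both operands have
 * the same sign, so the sign of operand1 picks the saturation bound. */
static fault_action saturate_on_overflow(const fault_info *info,
                                         int32_t *result)
{
    *result = (info->operand1 < 0) ? INT32_MIN : INT32_MAX;
    return FIX_RESULT;
}

/* Stand-in for a faulting ADD. Returns false if the handler aborts. */
static bool add_with_handler(int32_t a, int32_t b,
                             overflow_handler handler, int32_t *dest)
{
    int32_t sum;
    if (!__builtin_add_overflow(a, b, &sum)) {
        *dest = sum;        /* common case: the handler is never involved */
        return true;
    }
    fault_info info = { 0u, a, b };
    int32_t fixed;
    if (handler(&info, &fixed) == ABORT_THREAD)
        return false;
    *dest = fixed;          /* item (d): result goes to the destination */
    return true;
}
```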


    I know of only 1 ISA with these properties....

    It all depends on the frequency that exceptions occur.
    It used to be that Page Fault was the only one that occurred with any frequency, and the code path for the page fault handler was long enough
    that any HW overhead was lost in the noise. In all other cases they
    indicated a fatal error so the HW cost was the least of your problems.

    But then, risc processors mostly, started using exceptions for
    housekeeping - SPARC for register window sliding, Alpha for byte, word
    and misaligned memory access, MIPS and Alpha for software TLB-miss handling.
    And suddenly the exceptional becomes the normal.

    The solution for Alpha was to add back the byte and word instructions,
    and add misaligned access support to all memory ops.
    Sparc stuck with traps for register windows.
    No one else used software managed TLB's.

    Then virtual machines come along using exceptions to trigger
    trap-and-emulate code, and now the normal becomes frequent.
    Not 1 or 10 exceptions per second, but 100,000 or 200,000.

    High amounts of trap-and-emulate code originate from a "I am
    a CPU and I decide everything" mindset of ISAs developed
    before VMs came around. Too many control registers being
    touched too often, and don't get me started on interrupt
    "routing". What a VM wants is "I am a virtual CPU and I
    make choices for virtual me; I do not own interrupts or
    devices, or control the system--nor am I controlled by
    a sea of control registers".

    My 66000 architecture has only 1 instruction with any notion
    of privilege and when you touch a control register with it
    you end up changing a cache line size of control register
    state--dropping the amount of touches nearly an order of
    magnitude:: 100,000/sec -> 10,000/second.

    Privilege has been specified in such a way that a p-thread
    thread can context switch to another p-thread thread within
    a single application with a single non-privilege-violating
    instruction. New stack, new translation tables, save/restore
    the register files, change the ASID, change which exceptions
    are recognized/ignored, ...

    In addition, there is no SW overhead wrt interrupt routing
    when performing a "world switch" making a world switch
    have the same 10-20 cycle overhead as a context switch.

    The solution for VM's is to add the ISA features necessary so that
    most exceptions are rare, and when they do happen they are cheap.

    Exceptions should be rare AND cheap.

    But I argue against adding instructions to mask the deficiencies
    of the ISAs that got it wrong oh so long ago. But no-one will
    listen. I argue to fix the problem at its source:: the notion
    of how the machine is controlled, and interrupted; rather than
    adding zillions of helper instructions to mask the real problem.

    Worst case it should cost the same as a branch mispredict pipeline
    drain.

  • From Scott Lurndal@21:1/5 to EricP on Wed Sep 25 18:00:28 2024
    EricP <[email protected]> writes:
    Terje Mathisen wrote:


    You can only compile in INTO opcodes if you can guarantee that the INT 4
    (INTO) trap vector will always be set to a proper handler, and since
    that isn't part of the ABI, compilers can't depend on it?

    I do agree that it would be nice if it did work, barring that clang is
    doing the best possible alternative, at close to zero cost except for
    the useless branch predictor table entry wastage.

    Terje

    On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
    One must use JO to detect signed overflow.
    Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.

    It seems more flexible to have language facilities to handle overflow directly rather than using trap & emulate to fake it.

    COMPUTE A = B + C ON OVERFLOW DO SOMETHING.

    A sticky processor state "overflow" bit is sufficient to support
    such languages.
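
    A minimal sketch of how a sticky overflow bit could back such a
    statement, with the bit simulated by a global variable. The names
    sticky_overflow, add_sticky, mul_sticky and compute_or_fallback are
    invented for illustration; real hardware would set the bit as a side
    effect of the arithmetic instructions.

```c
#include <stdbool.h>

/* Hypothetical software stand-in for a sticky overflow status bit. */
static bool sticky_overflow = false;

static int add_sticky(int a, int b)
{
    int sum;
    /* The builtin stores the wrapped result even when it overflows. */
    if (__builtin_add_overflow(a, b, &sum))
        sticky_overflow = true;     /* set, never cleared here */
    return sum;
}

static int mul_sticky(int a, int b)
{
    int product;
    if (__builtin_mul_overflow(a, b, &product))
        sticky_overflow = true;
    return product;
}

/* Roughly: COMPUTE A = B * C + D ON OVERFLOW use the fallback value. */
static int compute_or_fallback(int b, int c, int d, int fallback)
{
    sticky_overflow = false;                    /* clear once up front */
    int a = add_sticky(mul_sticky(b, c), d);    /* no per-op branching */
    return sticky_overflow ? fallback : a;      /* test once at the end */
}
```

    The point of the sticky bit is that the whole expression is tested once,
    rather than after every operation.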

  • From Niklas Holsti@21:1/5 to All on Wed Sep 25 21:12:50 2024
    On 2024-09-25 20:30, MitchAlsup1 wrote:
    On Wed, 25 Sep 2024 17:07:45 +0000, Michael S wrote:

    On Wed, 25 Sep 2024 09:54:17 -0700
    Stephen Fuld <[email protected]d> wrote:

    On 9/10/2024 1:13 AM, Niklas Holsti wrote:

    In the Ada case, the ability to declare array types with
    programmer-chosen index types with bounded range, such as
    range-bounded integers or enumerations, means that the compiler can
    avoid indexing checks when the (sub)type of the index is known at
    compile time to fit within the index range of the array.

    I have always liked the idea of variable ranges able to be specified
    in the language.  Besides the advantages you mentioned, it provides
    more human "comprehensibility" (if the ranges are reasonably named)
    i.e. better internal documentation, and it makes responding to
    specification changes required later in the program life cycle easier
    and less error prone, i.e. if the range has to change, you change it
    in one place and don't risk missing making the change in some obscure
    part of the program you forgot about.



    The problem here is that arrays with fixed bounds were common when
    Ada was conceived back in the mid 1970s. On general-purpose (as opposed
    to embedded) computers they were already much rarer when Ada was shipped
    in 1983. By the late 1990s arrays with fixed bounds were the rare exception
    rather than the rule.

    It sounds like variable ranges (array indexes) would be becoming more
    common, also.

    Where "variable range" is a variable that is defined to have a
    specified range, but from run to run the upper and lower bounds
    can be modified without re-compilation.


    Ada subtypes can do that, but the underlying type, set at compile time,
    for example the standard Integer type, will put an upper bound on the
    range of the subtype. The number of bits in the numbers cannot change
    without recompilation.

    Michael S says that arrays with variable bounds are becoming more
    common. I assume he means indexable containers, often called vectors.
    Ada has several such containers in the standard library, all defined as
    generic in the index subtype, which means that the compiler can check
    that the type of a vector index is correct.

    If a vector is allocated (sized) with a certain (dynamically defined)
    length, instead of growing element by element, and if that length
    matches the (dynamically defined) range of the index subtype, the same
    static checking methods/proofs can be applied as for traditional arrays
    of fixed size. I think that the current tools don't do that for vector
    containers by default, but I believe they can be persuaded to do it by
    writing the corresponding preconditions for the indexing operations.
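
    For readers without Ada at hand, the idea can be approximated in C, with
    the caveat that C cannot enforce it the way an Ada index subtype does.
    In this sketch the range check happens once, when the index value is
    created, and the indexing operation itself carries no check. The bounds
    10..19 and the names buf_index, make_index and buf_get are arbitrary
    choices for the example.

```c
#include <stdbool.h>

enum { BUF_FIRST = 10, BUF_LAST = 19, BUF_LEN = BUF_LAST - BUF_FIRST + 1 };

/* Hypothetical wrapper type: by convention, values are produced only by
 * make_index(), so they are known to lie in BUF_FIRST..BUF_LAST. */
typedef struct { int value; } buf_index;

static bool make_index(int raw, buf_index *out)
{
    if (raw < BUF_FIRST || raw > BUF_LAST)
        return false;       /* the single range check */
    out->value = raw;
    return true;
}

/* No bounds check here: the index type already carries the guarantee. */
static int buf_get(const int buf[BUF_LEN], buf_index i)
{
    return buf[i.value - BUF_FIRST];
}
```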

  • From MitchAlsup1@21:1/5 to EricP on Wed Sep 25 18:11:32 2024
    On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:

    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:

    Well, there is a bunch of things to unpack here.

    First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
    that opcode to be for other instructions. On x64 the JO (jump overflow) instruction does overflow detection.

    The reason AMD could reassign INTO was because it wasn't being used by
    C/C++.
    But this is a side effect of C's widespread use, not the cause.
    Programmers write in C because it is widely used and supported,

    Free compilers

    and as a consequence of that choice they get unchecked arithmetic.
    But they are not choosing C to get unchecked arithmetic.
    Had this same usage tests been done on other languages the results
    would likely be quite different.

    Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes respectively. In JO case it has to branch to a ThrowOverflow () call
    so that's 5 more bytes per ADD or SUB if you want error traceability.
    With overflow trapping instructions there is NO runtime or code size
    cost.

    Third, on many risc ISA like RISC-V there are no flags so no JO instruction
    is even possible. Either they must use the branchless overflow idiom or the
    branching version, adding more to the cost of error detection.

    *OR*its has an Add Fault Overflow instruction which has NO RUNTIME COST
    ADDFO rd = rs1 + rs2

    *OR"its ??? can you translate that into comp.arch language.

    Fourth, it sounds like what you want is a risc (no flags) ADD
    instruction
    that returns both a result and an overflow flag so you can do the
    equivalent of the x64 JO branch test.
    ADDO (ro,rd) = rs1 + rs2
    where rd is dest and ro is a register to receive a 0/1 overflow flag.
    Once one allows multiple dest registers ADDO is trivial to support.

    But that does not invalidate the usefulness of ADDFO.
    I would also have ADDFC Add Fault Carry for unsigned overflow,

    Which will be used two orders of magnitude less than ADDFO. First
    because unsigned is used less often than signed, secondly much/most
    unsigned arithmetic is specified to wrap rather than check.

    plus other instructions for checking signed overflow and unsigned carry.

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
    return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument to detect signed overflows and crash).

    For x86-64 clang 19.1.0:

    add2:
    add edi, esi
    jo .LBB0_1
    mov eax, edi
    ret
    .LBB0_1:
    ud1 eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    Yes, this is all for x64 which has no INTO instruction.

    s/x64/x86-64/g

    It is still an x86 with all the benefits and detriments.
    <snip>

    So why should any hardware include an instruction to trap-on-overflow?

    Because ALL the negative speed and code size consequences do not occur.

    No because an EFFICIENT trap-on-overflow has no performance consequences
    when no overflow is created. Efficient means 10-20 cycles to arrive at exception handler--already in a reentrant state with exceptions and
    interrupts still enabled. Just because x86 is so horrible in this regard
    does not mean every architecture has to be at least that bad.

    Trap-on-overflow instructions have a hardware cost, of varying severity.
    If the ISA isn't already trapping on ALU instructions (such as
    divide-by-0), it adds a new class of operations which can take
    exceptions. An ALU functional unit that cannot take exceptions doesn't
    have to save "unwinding" info (at minimum, info to recover the PC, and
    possibly rollback state), and not needing this can be a nice
    simplification. Branches and LD/ST always needs this info, but not
    needing it on ALU ops can be a nice simplification of logic, and makes it
    easier to have multiple ALU functional units. Note that x86 INTO can
    be treated as a branch, so it doesn't have the cost of an instruction
    like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
    traps if it overflows. ADDTO is particularly what I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take
    exceptions.

    Not really. It's a flag in the uOp indicating HasException and a union
    of fields to hold exception status and RIP, all of which needs to be
    there for other instructions like load/store.

    Agreed, the overhead of recording "Overflow" and whether to do something
    about it is so small that other considerations sway the arguments.

    Instruction sets which make detecting overflow difficult (say, RISC-V),
    would do well to make branch-on-overflow efficient and easy. But adding
    trap-on-overflow instructions is a waste of effort.
    No they are a very useful tool for those who need such a tool
    because the manual alternative is significantly more expensive
    for both size and performance.

    "I have one example where overflow exceptions would be a poor
    implementation
    choice" does not imply "therefore no one should have them as an option".

    Can you share what language, compiler, and hardware you are using which
    implements overflow checks using a trap-on-overflow instruction?

    Kent

    On DEC VAX the Overflow Enable flag was in the Program Status Word.

    On My 66000 Overflow enable bit is part of the thread-status-line.

    IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
    Ada, Cobol, and disabled by default for C. But it could be toggled with
    a runtime library call.

    Similar--but library call does not have to "gain privilege" to flip the
    bit's state.

    For a variety of reasons having Overflow Enable in the status register
    is A Bad Idea.

    Can you expand. It seems to me if the unprivileged application using
    the instructions at hand (Header Register instruction) can access
    and write those exception control bits without needing privilege--
    that most of the "A Bad Idea™" disappear. At the same time there
    are significant amounts of state that do require privilege to
    access in thread-status-line, and HR obeys such a distinction.

    On Alpha it was a compile switch which selects different instructions ADD vs ADDV, and also controlled by pragmas.
    If you wanted to manually test for overflow then you used
    one of the idioms, whatever language you worked in.

  • From Stephen Fuld@21:1/5 to All on Wed Sep 25 12:30:02 2024
    On 9/24/2024 10:57 AM, MitchAlsup1 wrote:
    On Tue, 24 Sep 2024 17:06:27 +0000, Stephen Fuld wrote:

    On 9/23/2024 11:02 PM, Terje Mathisen wrote:

    snip

    Maybe all add/sub/etc opcodes that are immediately followed by an INTO
    could be fused into a single ADDO/SUBO/etc version that takes zero extra
    cycles as long as the trap part isn't hit?

    If you are going to do that, why not make it an optional prefix byte?
    That way, no fusion needed, no extra cycles, yet the same amount of code
    space.

    Realistically, what is the difference if INTO is a prefix
    byte or a postfix byte ?

    Of course, IANAHG, but my guess was that not having to do instruction
    fusion was worth something, and my suggestion has zero cost. If this is
    wrong, then there is no difference.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From EricP@21:1/5 to All on Thu Sep 26 13:13:02 2024
    MitchAlsup1 wrote:
    On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:

    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:

    Well, there is a bunch of things to unpack here.

    First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
    that opcode to be for other instructions. On x64 the JO (jump overflow)
    instruction does overflow detection.

    The reason AMD could reassign INTO was because it wasn't being used by
    C/C++.
    But this is a side effect of C's widespread use, not the cause.
    Programmers write in C because it is widely used and supported,

    Free compilers

    I've always paid for mine. My first C compiler came with the
    WinNT 3.5 beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    and as a consequence of that choice they get unchecked arithmetic.
    But they are not choosing C to get unchecked arithmetic.
    Had this same usage tests been done on other languages the results
    would likely be quite different.

    Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes
    respectively. In JO case it has to branch to a ThrowOverflow () call
    so that's 5 more bytes per ADD or SUB if you want error traceability.
    With overflow trapping instructions there is NO runtime or code size
    cost.

    Third, on many risc ISA like RISC-V there are no flags so no JO instruction
    is even possible. Either they must use the branchless overflow idiom or the
    branching version, adding more to the cost of error detection.

    *OR*its has an Add Fault Overflow instruction which has NO RUNTIME COST
    ADDFO rd = rs1 + rs2

    *OR"its ??? can you translate that into comp.arch language.

    *OR* an ISA has an Add Fault Overflow instruction which has NO RUNTIME COST
    ADDFO rd = rs1 + rs2

    Fourth, it sounds like what you want is a risc (no flags) ADD
    instruction
    that returns both a result and an overflow flag so you can do the
    equivalent of the x64 JO branch test.
    ADDO (ro,rd) = rs1 + rs2
    where rd is dest and ro is a register to receive a 0/1 overflow flag.
    Once one allows multiple dest registers ADDO is trivial to support.
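
    What ADDO would return can be emulated in C with the branchless overflow
    idiom mentioned above: do the add in unsigned (modulo) arithmetic, then
    derive the flag from the sign bits. The struct addo_result and function
    addo are hypothetical names mirroring the (ro,rd) notation; the final
    conversion back to int32_t relies on the wrapping behaviour that GCC and
    Clang document for out-of-range conversions.

```c
#include <stdint.h>

/* Hypothetical pair of results mirroring "ADDO (ro,rd) = rs1 + rs2". */
typedef struct {
    int32_t  rd;    /* wrapped sum */
    uint32_t ro;    /* 1 if signed overflow occurred, else 0 */
} addo_result;

static addo_result addo(int32_t rs1, int32_t rs2)
{
    uint32_t a = (uint32_t)rs1;
    uint32_t b = (uint32_t)rs2;
    uint32_t sum = a + b;   /* modulo 2^32, no undefined behaviour */

    /* Signed overflow happened iff both inputs have the same sign and the
     * sum has the opposite sign: bit 31 of (a ^ sum) & (b ^ sum). */
    uint32_t overflow = ((a ^ sum) & (b ^ sum)) >> 31;

    /* Wraps on GCC/Clang; implementation-defined before C23. */
    addo_result result = { (int32_t)sum, overflow };
    return result;
}
```

    On a flagless ISA this idiom costs several extra ALU operations per add,
    which is the overhead a single ADDO or ADDFO instruction would remove.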

    But that does not invalidate the usefulness of ADDFO.
    I would also have ADDFC Add Fault Carry for unsigned overflow,

    Which will be used two orders of magnitude less than ADDFO. First
    because unsigned is used less often than signed, secondly much/most
    unsigned arithmetic is specified to wrap rather than check.

    Unsigned checked arithmetic is just another data type but with the
    unsigned range of 0..2^n-1. A programmer should select it for situations
    where unsigned values are never supposed to wrap at zero.

    Note that "checked arithmetic" is more than just checks for overflow on arithmetic ops. It also checks on assignments with down casts that no
    overflow occurs, that you are not trying to put 10 pounds in a 5 pound bag,
    and conversions between signed and unsigned checked types.
    It can also include range checks on subtypes.
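
    A small sketch of the non-arithmetic checks listed above, written as
    plain C helpers: a checked down cast, a checked signed-to-unsigned
    conversion, and an unsigned subtract that refuses to wrap at zero. The
    function names are made up for this example; a language with checked
    types would generate the equivalent tests implicitly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Checked down cast: the "10 pounds in a 5 pound bag" test. */
static bool narrow_i32_to_i16(int32_t wide, int16_t *narrow)
{
    if (wide < INT16_MIN || wide > INT16_MAX)
        return false;
    *narrow = (int16_t)wide;
    return true;
}

/* Checked conversion from a signed to an unsigned checked type. */
static bool convert_i32_to_u32(int32_t value, uint32_t *out)
{
    if (value < 0)
        return false;
    *out = (uint32_t)value;
    return true;
}

/* Unsigned checked subtract: fails instead of wrapping below zero. */
static bool sub_u32_checked(uint32_t a, uint32_t b, uint32_t *diff)
{
    if (b > a)
        return false;
    *diff = a - b;
    return true;
}
```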

    plus other instructions for checking signed overflow and unsigned carry.

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
    return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument to detect signed overflows and crash).

    For x86-64 clang 19.1.0:

    add2:
    add edi, esi
    jo .LBB0_1
    mov eax, edi
    ret
    .LBB0_1:
    ud1 eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    Yes, this is all for x64 which has no INTO instruction.

    s/x64/x86-64/g

    It is still an x86 with all the benefits and detriments.
    <snip>

    So why should any hardware include an instruction to trap-on-overflow?

    Because ALL the negative speed and code size consequences do not occur.

    No because an EFFICIENT trap-on-overflow has no performance consequences
    when no overflow is created. Efficient means 10-20 cycles to arrive at exception handler--already in a reentrant state with exceptions and interrupts still enabled. Just because x86 is so horrible in this regard
    does not mean every architecture has to be at least that bad.

    It has zero cost when no overflow occurs.
    And if one does occur it leaves the RIP pointing at the problem instruction.

    Trap-on-overflow instructions have a hardware cost, of varying severity.
    If the ISA isn't already trapping on ALU instructions (such as
    divide-by-0), it adds a new class of operations which can take
    exceptions. An ALU functional unit that cannot take exceptions doesn't
    have to save "unwinding" info (at minimum, info to recover the PC, and
    possibly rollback state), and not needing this can be a nice
    simplification. Branches and LD/ST always needs this info, but not
    needing it on ALU ops can be a nice simplification of logic, and makes it
    easier to have multiple ALU functional units. Note that x86 INTO can
    be treated as a branch, so it doesn't have the cost of an instruction
    like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
    traps if it overflows. ADDTO is particularly what I am arguing against--
    it is just a bad idea for the ISA to have ALU instructions take
    exceptions.

    Not really. It's a flag in the uOp indicating HasException and a union
    of fields to hold exception status and RIP, all of which needs to be
    there for other instructions like load/store.

    Agreed, the overhead of recording "Overflow" and whether to do something
    about it is so small that other considerations sway the arguments.

    Instruction sets which make detecting overflow difficult (say,
    RISC-V),
    would do well to make branch-on-overflow efficient and easy. But
    adding
    trap-on-overflow instructions is a waste of effort.
    No they are a very useful tool for those who need such a tool
    because the manual alternative is significantly more expensive
    for both size and performance.

    "I have one example where overflow exceptions would be a poor
    implementation
    choice" does not imply "therefore no one should have them as an
    option".

    Can you share what language, compiler, and hardware you are using which
    implements overflow checks using a trap-on-overflow instruction?

    Kent

    On DEC VAX the Overflow Enable flag was in the Program Status Word.

    On My 66000 Overflow enable bit is part of the thread-status-line.

    IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
    Ada, Cobol, and disabled by default for C. But it could be toggled with
    a runtime library call.

    Similar--but library call does not have to "gain privilege" to flip the
    bit's state.

    The library routine didn't need a privilege change.
    This was just to give high level languages access to the PSW.
    Same as floating point unit control routines.

    For a variety of reasons having Overflow Enable in the status register
    is A Bad Idea.

    Can you expand. It seems to me if the unprivileged application using
    the instructions at hand (Header Register instruction) can access
    and write those exception control bits without needing privilege--
    that most of the "A Bad Idea™" disappear. At the same time there
    are significant amounts of state that do require privilege to
    access in thread-status-line, and HR obeys such a distinction.

    On Alpha it was a compile switch which selects different
    instructions ADD vs ADDV, and also controlled by pragmas.
    If you wanted to manually test for overflow then you used
    one of the idioms, whatever language you worked in.

    The problem is mostly due to the fact that expressions are
    *mixtures of signed and unsigned, checked and modulo arithmetic*.
    If overflow checks are enabled by status flag then the program has to
    keep switching between modes for individual arithmetic operations.
    This leads to a slew of enable and disable instructions which could
    be serializing.

    This is because array index value expressions are calculated using signed, checked arithmetic, then the result is range checked against the bounds. However addresses are calculated using modulo arithmetic.
    Since most OS define address 0 to be the start of user space,
    and locate the OS at FF...FF, addresses are unsigned modulo numbers.

    If checks are not disabled for the address calculation
    and the address not calculated using modulo arithmetic,
    it is easy to trigger false overflow exceptions with arrays
    that do not have base-0 or base-1 array bounds as many compilers
    use bias-base array buffer pointers.

    This is why one wants separate instructions for ADD and ADDV - there is
    no overhead to switching between modulo and checked linear arithmetic.
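
    To make the false-overflow case concrete, the sketch below uses a
    made-up 32-bit address space and an array whose lower bound is 2000.
    The biased base (base minus lower bound times element size) wraps below
    zero, yet every element address computed from it is correct, provided
    the arithmetic is modulo. The names addr_t, biased_base, elem_addr and
    biased_base_would_fault are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Made-up 32-bit address type so the wrap-around is easy to see. */
typedef uint32_t addr_t;

/* Biased base: base - lo * elem_size, computed modulo 2^32. */
static addr_t biased_base(addr_t base, uint32_t lo, uint32_t elem_size)
{
    return base - lo * elem_size;
}

/* Element address: biased + index * elem_size, again modulo 2^32. */
static addr_t elem_addr(addr_t biased, uint32_t index, uint32_t elem_size)
{
    return biased + index * elem_size;
}

/* Reports whether a *checked* unsigned subtract would have faulted while
 * forming the biased base, even though the final addresses are valid. */
static bool biased_base_would_fault(addr_t base, uint32_t lo,
                                    uint32_t elem_size)
{
    addr_t unused;
    return __builtin_sub_overflow(base, lo * elem_size, &unused);
}
```

    With base 0x1000, lower bound 2000 and 4-byte elements, the biased base
    is 0xFFFFF0C0, and element 2000 still lands at 0x1000. A checked
    subtract would have raised a spurious overflow on the way there.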

    Second, if there is a control register, it becomes part of the ABI.
    It can either be
    - undefined on calls, in which case each routine must save the current
    flags state on *each* entry and set a value, and restore the original
    state on return,
    - or defined to have a particular enable/disable value on call and
    callees are required to toggle it if needed but restore it to default
    for all calls and returns.
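
    The second ABI option above can be sketched with the enable bit
    simulated as a global variable. Here the assumed convention is "enabled
    on call", so a callee that wants modulo arithmetic has to switch the
    mode off and put it back before returning. The names overflow_enable,
    mode_writes, set_overflow_enable and hash_step are invented for the
    illustration; mode_writes only exists to count the extra work.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an overflow-enable bit in a status register. */
static bool overflow_enable = true;     /* assumed ABI default on call */
static unsigned mode_writes = 0;        /* counts writes to the "register" */

/* Writes the mode bit and returns the previous setting. */
static bool set_overflow_enable(bool on)
{
    bool previous = overflow_enable;
    overflow_enable = on;
    mode_writes++;
    return previous;
}

/* A callee that needs wrapping arithmetic (one step of a simple hash). */
static uint32_t hash_step(uint32_t hash, uint32_t value)
{
    bool saved = set_overflow_enable(false);    /* switch to modulo mode */
    uint32_t result = hash * 31u + value;       /* intended to wrap */
    set_overflow_enable(saved);                 /* restore the ABI default */
    return result;
}
```

    Every call pays two mode writes. With separate checked and modulo
    instructions, the same function would need none.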

    Third, there is no reason to have overflow as a dynamic enable/disable
    because the kind of arithmetic, modulo or linear, is fixed by what the programmer writes and does not change dynamically.

    Dynamic overflow enable results in a continuous overhead managing it
    which does not occur with explicit fault testing instructions.

  • From MitchAlsup1@21:1/5 to EricP on Thu Sep 26 18:11:56 2024
    On Thu, 26 Sep 2024 17:13:02 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:

    IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
    Ada, Cobol, and disabled by default for C. But it could be toggled with
    a runtime library call.

    Similar--but library call does not have to "gain privilege" to flip the
    bit's state.

    The library routine didn't need a privilege change.

    Most architectures do not allow the unprivileged to access or modify
    PSW (except the IP via branch instructions and CC via arithmetic).

    My 66000 ISA allows unprivileged to access and modify PSW state
    that only affects how the application acts. So, unprivileged can
    modify:
    a) IEEE flags
    b) exception enablement
    c) rounding mode
    but not
    d) Translation Tables
    e) ASID
    f) Dispatcher
    g) call stack pointer
    ..
    Even though they are stored in the same cache line.
    <snip>

    The problem is mostly due to the fact that expressions are
    *mixtures of signed and unsigned, checked and modulo arithmetic*.
    If overflow checks are enabled by status flag then the program has to
    keep switching between modes for individual arithmetic operations.
    This leads to a slew of enable and disable instructions which could
    be serializing.

    And often are.

    This is because array index value expressions are calculated using signed,
    checked arithmetic, then the result is range checked against the bounds.
    However addresses are calculated using modulo arithmetic.
    Since most OS define address 0 to be the start of user space,
    and locate the OS at FF...FF, addresses are unsigned modulo numbers.

    If checks are not disabled for the address calculation
    and the address not calculated using modulo arithmetic,
    it is easy to trigger false overflow exceptions with arrays
    that do not have base-0 or base-1 array bounds as many compilers
    use bias-base array buffer pointers.

    This is why one wants separate instructions for ADD and ADDV - there is
    no overhead to switching between modulo and checked linear arithmetic.

    Second, if there is a control register, it becomes part of the ABI.
    It can either be
    - undefined on calls, in which case each routine must save the current
    flags state on *each* entry and set a value, and restore the original
    state on return,
    - or defined to have a particular enable/disable value on call and
    callees are required to toggle it if needed but restore it to default
    for all calls and returns.

    Third, there is no reason to have overflow as a dynamic enable/disable because the kind of arithmetic, modulo or linear, is fixed by what the programmer writes and does not change dynamically.

    Dynamic overflow enable results in a continuous overhead managing it
    which does not occur with explicit fault testing instructions.

    Where it consumes valuable OpCode space which is sometimes not available {{3-operand 1-result instructions are notoriously "tight" in encoding:
    ± on 2 operands, int/float/double, FMAC<->INSert, attached
    constant,...}}

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Sat Sep 28 11:04:34 2024
    Lawrence D'Oliveiro wrote:
    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than MSVC,
    the money may not matter to your business, but the quality of the product will.

    GCC is a compiler collection, not an integrated development kit for Windows.
    I have no knowledge of what state GCC was in in 1992 but it likely
    did not support the MS enhancements for Win32 programming:
    structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned integers,
    and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and DLL's, supporting applications, documentation.
    You know... what a product looks like.

  • From David Brown@21:1/5 to George Neuner on Sat Sep 28 17:20:56 2024
    On 28/09/2024 01:52, George Neuner wrote:
    On Wed, 25 Sep 2024 12:54:18 -0400, EricP
    <[email protected]> wrote:

    For me error detection of all kinds is useful. It just happens
    to not be conveniently supported in C so no one tries it in C.

    GCC's -trapv option is not useful for a variety of reasons.
    1) its slow, about 50% performance hit

    Trapping or other overflow detection makes it extremely difficult to
    optimise arithmetic. All kinds of re-arrangements, strength reduction
    and constant propagation become problematic if you need to flag
    overflow. That comes in addition to any actual overflow detection.

    I've just tried a quick test on godbolt - when you use -ftrapv, it seems
    that in at least some cases, the trapping arithmetic is done using a
    library function (like "__mulvsi3") rather than direct code. This will
    have significant performance implications.

    2) its always on for a compilation unit which is not what programmers need
    as it triggers for many false positives so people turn it off.

    GCC lets you turn "-ftrapv" on and off with:

    #pragma GCC optimize("-ftrapv")

    and

    #pragma GCC optimize("-fno-trapv")

    It also lets you specify it for a single function at a time with

    __attribute__((optimize("-ftrapv")))


    Changing these options does have some limitations, such as disabling
    inlining into functions with different options. But you can happily
    apply it to only some functions in a translation unit.
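
    A small sketch of mixing the two policies in one translation unit,
    assuming GCC and the optimize attribute quoted above. The function
    names are hypothetical. Whether the trap actually fires for a given
    operation may depend on the GCC version and target, and compilers that
    do not know the attribute will typically just warn and ignore it, so
    treat this as an illustration rather than a guarantee.

```c
#include <assert.h>
#include <limits.h>

/* Signed overflow in this function is meant to trap (GCC-specific
   attribute; assumed to behave like compiling the function with -ftrapv). */
__attribute__((optimize("-ftrapv")))
int add_trapping(int a, int b)
{
    return a + b;
}

/* Wrapping add in the same translation unit. The sum is formed in
   unsigned arithmetic, which is well defined with or without -ftrapv.
   Converting back to int is implementation-defined; GCC and Clang wrap. */
int add_wrapping(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}

/* Usage example: the two policies side by side. */
void trapv_demo(void)
{
    assert(add_trapping(2, 3) == 5);              /* no overflow, no trap */
    assert(add_wrapping(INT_MAX, 1) == INT_MIN);  /* wraps, never traps */
}
```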


    Things like that are why some companies have a code policy that allows
    just one function per file.


    I have never heard of such a policy, and I think it would be an
    extremely silly one - code would be completely unmanageable, and the
    results would be significantly poorer when using modern compilers (i.e., anything this century).

    Still a problem if you need <whatever the relevant flag does> only in
    one or a few places.

    There are gcc flags that are only controllable for compiler invocations,
    rather than with pragmas or attributes, and of course not every compiler
    has the flexibility of gcc or clang. But this is not nearly the level
    you seem to think it is.

  • From George Neuner@21:1/5 to [email protected] on Sat Sep 28 13:20:57 2024
    On Sat, 28 Sep 2024 02:25:21 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than MSVC,
    the money may not matter to your business, but the quality of the product will.

    The main reason to use MSVC is Windows GUI programming - and the code
    quality is fine for most applications.

  • From George Neuner@21:1/5 to [email protected] on Sat Sep 28 13:58:18 2024
    On Sat, 28 Sep 2024 17:20:56 +0200, David Brown
    <[email protected]> wrote:

    On 28/09/2024 01:52, George Neuner wrote:
    On Wed, 25 Sep 2024 12:54:18 -0400, EricP
    <[email protected]> wrote:

    For me error detection of all kinds is useful. It just happens
    to not be conveniently supported in C so no one tries it in C.

    GCC's -trapv option is not useful for a variety of reasons.
    1) its slow, about 50% performance hit

    :

    2) its always on for a compilation unit which is not what programmers need
    as it triggers for many false positives so people turn it off.

    :

    Changing these options does have some limitations, such as disabling
    inlining into functions with different options. But you can happily
    apply it to only some functions in a translation unit.


    Things like that are why some companies have a code policy that allows
    just one function per file.


    I have never heard of such a policy, and I think it would be an
    extremely silly one - code would be completely unmanageable, and the
    results would be significantly poorer when using modern compilers (i.e.,
    anything this century).

    It is often the case when software modeling tools are in use because
    the tools tend to produce one source file per model 'object' or
    relation. And it DOES tend to result in poor(er) code.

    Most modeling tools do have the option to place code in specified
    files - so generally it is possible to have better control over the
    compilation and linking of the executables ... but typically it isn't
    done: often because the company has a policy to not interfere with the
    tool.

    Some people - managers mostly - feel that if it runs, and the code
    quality is 'acceptable' (for some definition), then "better is the
    enemy of 'good enough'".


    I have seen it firsthand.

    But for the record: I refuse to use modeling tools for code or for
    project management. However, I do a fair amount of DBMS work these
    days, and I have found some DBMS modeling tools to be useful for
    creating schema /documentation/.


    Still a problem if you need <whatever the relevant flag does> only in
    one or a few places.

    There are gcc flags that are only controllable for compiler invocations,
    rather than with pragmas or attributes, and of course not every compiler
    has the flexibility of gcc or clang. But this is not nearly the level
    you seem to think it is.

  • From John Dallman@21:1/5 to D'Oliveiro on Sat Sep 28 19:37:00 2024
    In article <vd7peh$12kpl$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    Using GCC (or Clang) for Windows programming is OK, and can be great,
    provided other organisations don't need to program against APIs you
    provide. If they do, the amount of FUD generated and explaining required overwhelms the gains.

    John

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Sat Sep 28 22:02:31 2024
    On Sat, 28 Sep 2024 11:04:34 -0400, EricP wrote:

    GCC is a compiler collection not a integrated development kit for
    Windows.

    GCC is cross-platform and IDE-independent. If you can’t use it with your
    IDE, the limitation is with your IDE, not with GCC.

    I have no knowledge of what state GCC was in in 1992 but it likely did
    not support the MS enhancements for Win32 programming:

    That’s a long time to go without checking on what’s been happening in the computing scene lately. Are you still doing your programming to 32-bit
    APIs? Isn’t there a “Win64” yet?

    Also, Microsoft’s development tools still assume a Windows-centric world,
    and are not suited for cross-development. If you haven’t noticed, a lot of corporate work is deployed in the cloud now, which is Linux-based. And
    then there is embedded work, which is also heavily Linux-based.

    Plus come with necessary API headers, various link libraries and DLL's, supporting applications, documentation.
    You know... what a product looks like.

    Those are all parts of the relevant SDKs, separate from GCC. E.g.
    something as basic as

    ldo@theon:~> dpkg-query -S /usr/include/stdlib.h
    libc6-dev:amd64: /usr/include/stdlib.h

    is part of the POSIX/C runtime library SDK, not part of GCC itself.

  • From Tim Rentsch@21:1/5 to EricP on Sat Sep 28 23:59:23 2024
    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    GCC is a compiler collection not a integrated development kit for Windows.
    I have no knowledge of what state GCC was in in 1992 but it likely
    did not support the MS enhancements for Win32 programming:
    structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned integers,
    and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and DLL's, supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed
    something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Sun Sep 29 07:15:51 2024
    Tim Rentsch <[email protected]> schrieb:
    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    GCC is a compiler collection not a integrated development kit for Windows.
    I have no knowledge of what state GCC was in in 1992 but it likely
    did not support the MS enhancements for Win32 programming:
    structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned integers,
    and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and DLL's,
    supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Personally, I like Cygwin best because it gives you access to the
    usual UNIX tools like make or emacs, and you can immediately run
    the executable. I just add -static-libgfortran for Fortran code
    to avoid the hassle of distributing a DLL with it.

    Even gdb works.

  • From Michael S@21:1/5 to EricP on Sun Sep 29 13:13:26 2024
    On Wed, 25 Sep 2024 13:56:40 -0400
    EricP <[email protected]> wrote:

    Terje Mathisen wrote:
    Kent Dickey wrote:

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
        return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument for detect signed overflows and
    crash).

    For x86-64 clang 19.1.0:

    add2:
    add edi, esi
    jo .LBB0_1
    mov eax, edi
    ret
    .LBB0_1:
    ud1 eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    But x86 has an instruction to trap on overflow directly: INTO.
    It's one byte.
    And it doesn't use it.

    GCC x86-64 14.2 is even worse:

    add2:
    sub rsp, 8
    call __addvsi3
    add rsp, 8
    ret

    It calls a routine to do all additions which might overflow, and
    that routine calls assert() if an overflow occurs.

    The CPU has a trap-on-overflow instruction exactly for this case
    (to crash
    on detecting an overflow), and compilers don't even use it.

    So even on architectures which have a trap-on-overflow instruction,
    compilers don't use it.

    You can only compile in INTO opcodes if you can guarantee that the
    INT 4 (INTO) trap vector will always be set to a proper handler,
    and since that isn't part of the ABI, compilers can't depend on it?

    I do agree that it would be nice if it did work, barring that clang
    is doing the best possible alternative, at close to zero cost
    except for the useless branch predictor table entry wastage.

    Terje

    On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
    One must use JO to detect signed overflow.
    Others were repurposed, 1-byte INC and DEC 40..4F became the REX
    prefix.



    The single-byte form of INTO was reassigned. The two-byte form (CD 04) is still there.
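
    For reference, the add-then-branch pattern discussed above can be
    requested explicitly in C. This is only a sketch, assuming GCC or Clang
    for the __builtin_add_overflow builtin (C23 offers ckd_add in
    <stdckdint.h> for the same purpose). The function names are
    hypothetical. On x86-64 such code typically compiles to an ADD followed
    by a conditional branch on the overflow flag, but that is up to the
    compiler.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns true if a + b fits in int. *sum receives the result; on
   overflow it receives the wrapped value and false is returned. */
bool add_checked(int a, int b, int *sum)
{
    assert(sum != NULL);
    return !__builtin_add_overflow(a, b, sum);  /* GCC/Clang builtin */
}

/* Trap-style variant: terminate instead of returning a status, roughly
   what the clang output above does with its undefined instruction. */
int add_or_abort(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        abort();
    return sum;
}
```

    Unlike -ftrapv, this puts the check only where the programmer asks for
    it, at the cost of writing it out by hand.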

  • From Michael S@21:1/5 to Tim Rentsch on Sun Sep 29 13:55:11 2024
    On Sat, 28 Sep 2024 23:59:23 -0700
    Tim Rentsch <[email protected]> wrote:

    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the
    WinNT 3.5 beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    GCC is a compiler collection not a integrated development kit for
    Windows. I have no knowledge of what state GCC was in in 1992 but
    it likely did not support the MS enhancements for Win32 programming: structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned
    integers, and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and
    DLL's, supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed
    something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

    Both of your problems have no [MSVC] solution right now.

    In case of 128-bit integer, there is a chance that MSVC will support it
    in the future.

    In case of 80-bit long double, there is no chance. If MSVC ever
    supports binary floating point wider than 64-bit on x86-64 platform
    then it would be IEEE binary128 implemented in software. But even then
    they would not use name 'long double' for a new type, because it would
    break existing programs.
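
    A quick way to see what a given compiler actually provides is to ask
    it. The sketch below uses only <float.h> macros and the
    __SIZEOF_INT128__ predefined macro (defined by GCC and Clang on targets
    with __int128). The function names are made up for illustration.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* The bit counts below only make sense for binary floating point. */
static_assert(FLT_RADIX == 2, "sketch assumes binary floating point");

/* Significand bits of long double. Typical values: 53 when it is the
   same format as double (as with MSVC), 64 for the x87 80-bit format,
   113 for IEEE binary128. */
int long_double_mantissa_bits(void)
{
    return LDBL_MANT_DIG;
}

/* True when long double carries more precision than double. */
bool long_double_is_extended(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}

/* True when the compiler offers the __int128 extension. */
bool has_int128(void)
{
#ifdef __SIZEOF_INT128__
    return true;
#else
    return false;
#endif
}
```

    Code being ported can call these once at startup, or use the same
    macros in #if tests, to fail early instead of silently losing
    precision.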


    But if all you want is the program running on Windows, then the
    solution is easy - use a different compiler.
    MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
    After you have msys2 installed, do
    pacman -Sy
    pacman -S mingw-w64-ucrt-x86_64-gcc

    Several hundred MB more and you have gcc14.
    Possibly you'd have to install make separately, i.e.
    pacman -S make

  • From David Brown@21:1/5 to Thomas Koenig on Sun Sep 29 14:18:54 2024
    On 29/09/2024 09:15, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:
    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    GCC is a compiler collection not a integrated development kit for Windows.
    I have no knowledge of what state GCC was in in 1992 but it likely
    did not support the MS enhancements for Win32 programming:
    structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned integers,
    and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and DLL's,
    supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to to, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Personally, I like Cygwin best because it gives you access to the
    usual UNIX tools like make or emacs, and you can immediately run
    the executable. I just add -static-libgfortran for Fortran code
    to avoid the hassle of distributing a DLL with it.


    Personally, I prefer msys2 because it gives you access to the usual *nix
    tools like make - and does so far better than Cygwin. (Here "better"
    means more native-like file access, and more efficient usage.) And you
    don't get the DLL hell of Cygwin.

    I think Cygwin is useful if you need more advanced or accurate POSIX
    semantics - such as "fork" calls. But for most uses, msys2 is much
    simpler to work with. Msys2 also has a more friendly license for many
    people.

    However, I haven't had to do much compilation of any kind targetting
    Windows.


    Even gdb works.

  • From Thomas Koenig@21:1/5 to David Brown on Sun Sep 29 12:34:53 2024
    David Brown <[email protected]> schrieb:
    On 29/09/2024 09:15, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:
    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    GCC is a compiler collection not a integrated development kit for Windows.
    I have no knowledge of what state GCC was in in 1992 but it likely
    did not support the MS enhancements for Win32 programming:
    structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned integers,
    and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and DLL's,
    supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to to, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Personally, I like Cygwin best because it gives you access to the
    usual UNIX tools like make or emacs, and you can immediately run
    the executable. I just add -static-libgfortran for Fortran code
    to avoid the hassle of distributing a DLL with it.


    Personally, I prefer msys2 because it gives you access to the usual *nix tools like make - and does so far better than Cygwin. (Here "better"
    means more native-like file access, and more efficient usage.) And you
    don't get the DLL hell of Cygwin.

    Just one remark - I was referring to running the mingw compiler
    under Cygwin, for which you don't get the DLL issues.

  • From David Brown@21:1/5 to Thomas Koenig on Sun Sep 29 15:49:32 2024
    On 29/09/2024 14:34, Thomas Koenig wrote:
    David Brown <[email protected]> schrieb:
    On 29/09/2024 09:15, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:
    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    GCC is a compiler collection not a integrated development kit for Windows.
    I have no knowledge of what state GCC was in in 1992 but it likely
    did not support the MS enhancements for Win32 programming:
    structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned integers,
    and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and DLL's,
    supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to to, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Personally, I like Cygwin best because it gives you access to the
    usual UNIX tools like make or emacs, and you can immediately run
    the executable. I just add -static-libgfortran for Fortran code
    to avoid the hassle of distributing a DLL with it.


    Personally, I prefer msys2 because it gives you access to the usual *nix
    tools like make - and does so far better than Cygwin. (Here "better"
    means more native-like file access, and more efficient usage.) And you
    don't get the DLL hell of Cygwin.

    Just one remark - I was referring to running the mingw compiler
    under Cygwin, for which you don't get the DLL issues.

    Ah, okay.

    But you don't need Cygwin here. You don't even need a msys2
    environment, unless you want to have things in the same place as on a
    *nix system or use programs that expect other files in those places.
    Almost all of the compilations I do under Windows (and most of those
    that I do under Linux) are cross-compilations.
    For Windows, those are mingw hosted gcc's. And as long as things like
    make, cp, rm, sed, and a few other common utilities are on the path,
    they can be used fine without a full msys2 environment. The same goes
    for other command-line utilities I use all the time like ssh, grep,
    less, etc.

    The only call I've had for Cygwin is for building software with more complicated or old-fashioned styles, like big .config arrangements, or
    for code that needs more complete POSIX emulation. I'm not sure I have
    used it since the days of building my own gcc 3 cross-compilers.

  • From Terje Mathisen@21:1/5 to Tim Rentsch on Sun Sep 29 16:19:38 2024
    Tim Rentsch wrote:
    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality of
    the product will.

    GCC is a compiler collection not a integrated development kit for Windows.
    I have no knowledge of what state GCC was in in 1992 but it likely
    did not support the MS enhancements for Win32 programming:
    structured exception handling, various ABI's, inline assembler,
    defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned integers,
    and most important: integration with the GUI source code debugger.

    Plus come with necessary API headers, various link libraries and DLL's,
    supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed
    something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

    I seem to remember finding something like __int128_t and __uint128_t
    inside MSVC?

    And that by casting uint64_t parameters to the u128 variant, the
    compiler would generate the obvious MUL RDI and save RDX:RAX as the
    128-bit result:

    uint128_t mulw(uint64_t a, uint64_t b)
    {
        return (uint128_t) a * (uint128_t) b;
    }

    I.e. no subroutine call/zero overhead.

    OTOH, getting optimal wide integer accumulators is a bit harder, needing compiler intrinsics to access the widening add with carry opcodes. (ADDX)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Sun Sep 29 18:00:26 2024
    On Sun, 29 Sep 2024 16:19:38 +0200
    Terje Mathisen <[email protected]> wrote:

    Tim Rentsch wrote:
    EricP <[email protected]> writes:

    Lawrence D'Oliveiro wrote:

    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the
    WinNT 3.5 beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than
    MSVC, the money may not matter to your business, but the quality
    of the product will.

    GCC is a compiler collection not a integrated development kit for
    Windows. I have no knowledge of what state GCC was in in 1992 but
    it likely did not support the MS enhancements for Win32
    programming: structured exception handling, various ABI's, inline
    assembler, defined behavior for some of C's undefined behavior,
    later first-class-type support for 64-bit signed and unsigned
    integers, and most important: integration with the GUI source
    code debugger.

    Plus come with necessary API headers, various link libraries and
    DLL's, supporting applications, documentation.
    You know... what a product looks like.

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

    I seem to remember finding something like __int128_t and __uint128_t
    inside MSVC?


    Very unlikely.
    Most likely you are thinking about C++ rather than C.
    Newer versions of Microsoft's STL appear to feature intentionally
    undocumented 128-bit integer classes std::_Signed128 and
    std::_Unsigned128.

    And that by casting uint64_t parameters to the u128 variant, the
    compiler would generate the obvious MUL RDI and save RDX:RAX as the
    128-bit result:

    uint128_t mulw(uint64_t a, uint64_t b)
    {
    return (uint128_t) a * (uint128_t) b;
    }

    I.e. no subroutine call/zero overhead.


    That is not MSVC.
    For this specific case, MSVC has a [better] solution in the form of
    the intrinsic _umul128. But it wouldn't help Tim, because he is not
    allowed to modify the sources.
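
    For illustration only, a minimal sketch of a widening 64x64 => 128
    multiply that builds with MSVC as well as with gcc/clang. The names
    u128_parts and mul64_wide are made up for this example; _umul128 is
    the MSVC x64 intrinsic and unsigned __int128 is the gcc/clang
    extension. It assumes the sources may be changed, which is not the
    case for the code being ported.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   /* declares _umul128 on MSVC x64 */
#endif

/* Hypothetical helper type: a 128-bit product split into two halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Hypothetical helper: full 64x64 -> 128-bit unsigned multiply. */
static u128_parts mul64_wide(uint64_t a, uint64_t b)
{
    u128_parts r;
#if defined(_MSC_VER) && defined(_M_X64)
    /* MSVC: the intrinsic returns the low half and stores the high half. */
    r.lo = _umul128(a, b, &r.hi);
#elif defined(__SIZEOF_INT128__)
    /* gcc/clang: use the unsigned __int128 extension. */
    unsigned __int128 p = (unsigned __int128)a * b;
    r.lo = (uint64_t)p;
    r.hi = (uint64_t)(p >> 64);
#else
    /* Portable fallback: schoolbook multiply on 32-bit halves. */
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* Sum of the three 32-bit columns that land in bits 32..63. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
#endif
    return r;
}
```

    With gcc/clang on x86-64 the middle branch would normally be
    expected to become a single MUL, as in the mulw() example quoted
    above.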


    OTOH, getting optimal wide integer accumulators is a bit harder,
    needing compiler intrinsics to access the widening add with carry
    opcodes. (ADDX)

    Terje


    That's correct about intrinsics, but incorrect about ADCX/ADOX.
    The latter can be moderately helpful in special situations, esp.
    128b * 128b => 256b multiplication, but they are never necessary,
    and for addition/subtraction they are not needed at all.
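
    To illustrate the point about addition: a wide accumulator only
    needs an ordinary carry chain. Below is a minimal sketch in portable
    C, where the carry out of the low word is recovered with an unsigned
    comparison; compilers will often, though not always, turn this into
    ADD/ADC. The names acc128 and acc128_add are hypothetical. On x86-64
    the _addcarry_u64 intrinsic expresses the same carry chain more
    directly.

```c
#include <stdint.h>

/* Hypothetical 128-bit accumulator built from two 64-bit words. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} acc128;

/* Add a 64-bit value into the accumulator, propagating the carry. */
static void acc128_add(acc128 *acc, uint64_t x)
{
    acc->lo += x;
    /* Unsigned wrap-around happened exactly when the sum is below x. */
    acc->hi += (acc->lo < x);
}
```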

  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Sep 29 09:28:48 2024
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:
    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Thank you for these suggestions. I have started to explore
    mingw but not yet the others. Is there a difference between
    mingw and mingw-w64, do you know? Also do you know if mingw
    is compatible with MSVC, as long as long double is not used?
    (The code being ported would never call MSVC with a long double
    or 128-bit integer argument.)

    Personally, I like Cygwin best because it gives you access to the
    usual UNIX tools like make or emacs, and you can immediately run
    the executable. I just add -static-libgfortran for Fortran code
    to avoid the hassle of distributing a DLL with it.

    Running make and emacs in MS Windows... I like it!

    Even gdb works.

    That says a lot about the effort that went into Cygwin.

  • From Tim Rentsch@21:1/5 to Michael S on Sun Sep 29 09:39:27 2024
    Michael S <[email protected]> writes:

    On Sat, 28 Sep 2024 23:59:23 -0700
    Tim Rentsch <[email protected]> wrote:

    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed
    something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

    Both of your problems have no [MSVC] solution right now.

    In case of 128-bit integer, there is a chance that MSVC will support
    it in the future.

    In case of 80-bit long double, there is no chance. If MSVC ever
    supports binary floating point wider than 64-bit on x86-64 platform
    then it would be IEEE binary128 implemented in software. But even
    then they would not use name 'long double' for a new type, because it
    would break existing programs.

    Thank you, this is helpful.
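
    A minimal sketch of how the two limitations can be probed from C
    before committing to a toolchain. The helper names are hypothetical;
    LDBL_MANT_DIG comes from <float.h> and __SIZEOF_INT128__ is the
    macro gcc and clang predefine when __int128 is available. It assumes
    a binary floating-point format (FLT_RADIX == 2).

```c
#include <float.h>

/* Hypothetical helper: number of significand bits in long double.
   53 means long double is the same format as double (as with MSVC);
   64 means the x87 80-bit extended format. */
static int long_double_mantissa_bits(void)
{
    return LDBL_MANT_DIG;
}

/* Hypothetical helper: 1 if the compiler offers a 128-bit integer type
   through the gcc/clang __int128 extension, 0 otherwise. */
static int has_int128(void)
{
#if defined(__SIZEOF_INT128__)
    return 1;
#else
    return 0;
#endif
}
```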

    But if all you want is the program running on Windows, then the
    solution is easy - use a different compiler.
    MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
    After you have msys2 installed, do
    pacman -Sy
    pacman -S mingw-w64-ucrt-x86_64-gcc

    Several hundred MB more and you have gcc14.
    Possibly, make has to be installed separately, i.e.
    pacman -S make

    Do I understand this right, that msys2 is to be installed
    on Windows? And that the pacman commands are to be run
    within msys2 on the MS Windows system?

  • From Tim Rentsch@21:1/5 to Terje Mathisen on Sun Sep 29 09:56:52 2024
    Terje Mathisen <[email protected]> writes:

    Tim Rentsch wrote:
    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed
    something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

    I seem to remember finding something like __int128_t and __uint128_t
    inside MSVC?

    Apparently VScode has added or is going to add support for
    the pre-defined type names __int128_t and __uint128_t. That
    is good to know even though it doesn't bear directly on my
    question.

    And that by casting uint64_t parameters to the u128 variant, the
    compiler would generate the obvious MUL RDI and save RDX:RAX as the
    128-bit result:

    uint128_t mulw(uint64_t a, uint64_t b)
    {
    return (uint128_t) a * (uint128_t) b;
    }

    I.e. no subroutine call/zero overhead.

    OTOH, getting optimal wide integer accumulators is a bit harder,
    needing compiler intrinsics to access the widening add with carry
    opcodes. (ADDX)

    This information doesn't affect what I am hoping to accomplish,
    because any arithmetic on 128-bit types would already be held
    in 128-bit values. Thank you though for the information.

  • From Michael S@21:1/5 to Tim Rentsch on Sun Sep 29 19:45:16 2024
    On Sun, 29 Sep 2024 09:39:27 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sat, 28 Sep 2024 23:59:23 -0700
    Tim Rentsch <[email protected]> wrote:

    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed
    something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

    Both of your problems have no [MSVC] solution right now.

    In case of 128-bit integer, there is a chance that MSVC will support
    it in the future.

    In case of 80-bit long double, there is no chance. If MSVC ever
    supports binary floating point wider than 64-bit on x86-64 platform
    then it would be IEEE binary128 implemented in software. But even
    then they would not use name 'long double' for a new type, because
    it would break existing programs.

    Thank you, this is helpful.

    But if all you want is the program running on Windows, then the
    solution is easy - use a different compiler.
    MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
    After you have msys2 installed, do
    pacman -Sy
    pacman -S mingw-w64-ucrt-x86_64-gcc

    Several hundred MB more and you have gcc14.
    Possibly, make has to be installed separately, i.e.
    pacman -S make

    Do I understand this right, that msys2 is to be installed
    on Windows? And that the pacman commands are to be run
    within msys2 on the MS Windows system?

    Yes and yes.
    pacman has to be run from msys2 terminal window (bash).

  • From Tim Rentsch@21:1/5 to Michael S on Sun Sep 29 10:32:06 2024
    Michael S <[email protected]> writes:

    On Sun, 29 Sep 2024 09:39:27 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sat, 28 Sep 2024 23:59:23 -0700
    Tim Rentsch <[email protected]> wrote:

    [...]

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Are there any MSVC folks here who can help with these problems?
    I am not an MSVC expert by any means and easily could have missed
    something.

    I should mention that the code is written in C, not C++, and that
    is not something I am at liberty to change.

    Both of your problems have no [MSVC] solution right now.

    In case of 128-bit integer, there is a chance that MSVC will support
    it in the future.

    In case of 80-bit long double, there is no chance. If MSVC ever
    supports binary floating point wider than 64-bit on x86-64 platform
    then it would be IEEE binary128 implemented in software. But even
    then they would not use name 'long double' for a new type, because
    it would break existing programs.

    Thank you, this is helpful.

    But if all you want is the program running on Windows, then the
    solution is easy - use a different compiler.
    MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
    After you have msys2 installed, do
    pacman -Sy
    pacman -S mingw-w64-ucrt-x86_64-gcc

    Several hundred MB more and you have gcc14.
    Possibly, make has to be installed separately, i.e.
    pacman -S make

    Do I understand this right, that msys2 is to be installed
    on Windows? And that the pacman commands are to be run
    within msys2 on the MS Windows system?

    Yes and yes.
    pacman has to be run from msys2 terminal window (bash).

    Okay, thank you again.

  • From Thomas Koenig@21:1/5 to Tim Rentsch on Sun Sep 29 19:30:08 2024
    Tim Rentsch <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:
    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Thank you for these suggestions. I have started to explore
    mingw but not yet the others. Is there a difference between
    mingw and mingw-w64, do you know?

    One is a fork of the other, I believe.

    Also do you know if mingw
    is compatible with MSVC, as long as long double is not used?

    I believe that Mingw-w64 uses the Windows ABI, but that is a
    belief, not something I know first-hand; I haven't looked
    at the assembly.

  • From John Dallman@21:1/5 to D'Oliveiro on Sun Sep 29 20:51:00 2024
    In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    Are you still doing your programming
    to 32-bit APIs? Isn't there a _Win64_ yet?

    "Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.

    John

  • From Thomas Koenig@21:1/5 to John Dallman on Sun Sep 29 19:57:22 2024
    John Dallman <[email protected]> schrieb:
    In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    Are you still doing your programming
    to 32-bit APIs? Isn't there a _Win64_ yet?

    "Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.

    "This cannot be explained logically, only chronologically."

  • From Michael S@21:1/5 to Thomas Koenig on Sun Sep 29 23:37:57 2024
    On Sun, 29 Sep 2024 19:30:08 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Tim Rentsch <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:
    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Thank you for these suggestions. I have started to explore
    mingw but not yet the others. Is there a difference between
    mingw and mingw-w64, do you know?

    One is a fork of the other, I believe.

    Also do you know if mingw
    is compatible with MSVC, as long as long double is not used?

    I believe that Mingw-w64 uses the Windows ABI, but that is a
    belief, not something I know first-hand; I haven't looked
    at the assembly.

    mingw64 tools are mostly compatible with Windows x64 ABI, but long
    double is an exception.

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Mon Sep 30 01:28:42 2024
    On Sun, 29 Sep 2024 20:51 +0100 (BST), John Dallman wrote:

    In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    Are you still doing your programming to 32-bit APIs? Isn't there a
    _Win64_ yet?

    "Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.

    Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Sep 30 11:15:05 2024
    On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 29 Sep 2024 20:51 +0100 (BST), John Dallman wrote:

    In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    Are you still doing your programming to 32-bit APIs? Isn't there a
    _Win64_ yet?

    "Win32" covers both 32-bit and 64-bit APIs. The reasons for this
    silly nomenclature are complicated and lie deep in the past.

    Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

    They are entirely 64-bit. Every user-supplied buffer can be anywhere in
    user's address space.
    Possibly you are confusing Windows with VMS.

  • From Tim Rentsch@21:1/5 to Michael S on Mon Sep 30 03:36:26 2024
    Michael S <[email protected]> writes:

    On Sun, 29 Sep 2024 19:30:08 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Tim Rentsch <[email protected]> schrieb:

    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:

    [...]

    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Thank you for these suggestions. I have started to explore
    mingw but not yet the others. Is there a difference between
    mingw and mingw-w64, do you know?

    One is a fork of the other, I believe.

    Also do you know if mingw
    is compatible with MSVC, as long as long double is not used?

    I believe that Mingw-w64 uses the Windows ABI, but that is a
    belief, not something I know first-hand; I haven't looked
    at the assembly.

    mingw64 tools are mostly compatible with Windows x64 ABI, but long
    double is an exception.

    That was my impression but it's nice to have it confirmed.

    My thanks again to both you and Thomas.

  • From David Brown@21:1/5 to Thomas Koenig on Mon Sep 30 14:07:47 2024
    On 29/09/2024 21:30, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:
    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Thank you for these suggestions. I have started to explore
    mingw but not yet the others. Is there a difference between
    mingw and mingw-w64, do you know?

    One is a fork of the other, I believe.


    mingw-w64 was started as a fork of mingw, initially created to support generating 64-bit binaries and because of disagreements with the pace of development in mingw.


    Also do you know if mingw
    is compatible with MSVC, as long as long double is not used?

    I believe that Mingw-w64 uses the Windows ABI, but that is a
    belief, not something I know first-hand; I haven't looked
    at the assembly.

    There is a reasonably defined ABI for 64-bit Windows, so I think there
    will be compatibility for most things in C. C++ is more complicated and
    much more likely to have incompatibilities.

    There are approximately a hundred and one different C ABI's and calling conventions for 32-bit Windows, since MS never actually defined one, so
    things are a bit of a mess there. (DLL calling conventions are clearer.)

    I believe the two most popular ways of running "Linux-like" software and
    gcc on Windows are using WSL (which is more of a virtualisation layer),
    and mingw-w64 for the compiler target (with either gcc or clang) and
    msys2 as an environment and source of *nix utilities and libraries.
    mingw/msys is considered old and limited (32-bit only), while Cygwin is
    considered slow and clunky by many.

    At least, that is my understanding.

  • From John Dallman@21:1/5 to Michael S on Mon Sep 30 14:49:00 2024
    In article <[email protected]>, [email protected] (Michael S) wrote:
    On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    Also the fact that those _64-bit_ APIs are not entirely _64-bit_

    They are entirely 64-bit. Every user-supplied buffer can be
    anywhere in user's address space. Possibly you are confusing
    Windows with VMS.

    Windows NT's VMS heritage does not extend to that particular VMS
    misfeature. The lack of 64-bit versions of some APIs on VMS is simply due
    to shortcuts taken by DEC to get 64-bit VMS out faster, which have never
    been caught up. The rather elaborate VMS API definitions, in terms of
    memory block sizes rather than calls in a programming language, made it
    harder to create 64-bit APIs.

    John

  • From EricP@21:1/5 to Michael S on Mon Sep 30 09:24:03 2024
    Michael S wrote:
    On Wed, 25 Sep 2024 13:56:40 -0400
    EricP <[email protected]> wrote:

    Terje Mathisen wrote:
    Kent Dickey wrote:
    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
    return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument to detect signed overflows and
    crash).

    For x86-64 clang 19.1.0:

    add2:
    add edi, esi
    jo .LBB0_1
    mov eax, edi
    ret
    .LBB0_1:
    ud1 eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    But x86 has an instruction to trap on overflow directly: INTO.
    It's one byte.
    And it doesn't use it.

    GCC x86-64 14.2 is even worse:

    add2:
    sub rsp, 8
    call __addvsi3
    add rsp, 8
    ret

    It calls a routine to do all additions which might overflow, and
    that routine calls assert() if an overflow occurs.

    The CPU has a trap-on-overflow instruction exactly for this case
    (to crash on detecting an overflow), and compilers don't even use it.

    So even on architectures which have a trap-on-overflow instruction,
    compilers don't use it.
    You can only compile in INTO opcodes if you can guarantee that the
    INT 4 (INTO) trap vector will always be set to a proper handler,
    and since that isn't part of the ABI, compilers can't depend on it?

    I do agree that it would be nice if it did work, barring that clang
    is doing the best possible alternative, at close to zero cost
    except for the useless branch predictor table entry wastage.

    Terje

    On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
    One must use JO to detect signed overflow.
    Others were repurposed, 1-byte INC and DEC 40..4F became the REX
    prefix.



    The single-byte form of INTO is reassigned. The two-byte form (CD 04) is still there.

    The INTO (CE) instruction conditionally generates an exception when
    overflow flag is set to exception vector 4.

    INT 4 (CD 04) unconditionally generates an exception to vector 4.

    To get the same behavior in x64 you would have to

    ADD blah
    JNO Ok1
    INT 4
    Ok1:

    which is fewer code bytes than jumping to a call to an error routine
    but has the issue that now every ADD/SUB is followed by a short rel8
    forward conditional branch that will be mispredicted (predicted not taken)
    the first time through every code section.

    Intel documents x64 Jcc branch hint prefixes 2E = not taken, 3E = taken.
    However AMD does not. On x86 2E and 3E are segment override prefixes
    and AMD documents them as ignored on x64.

    If there is an overflow this leaves the RIP pointing right after the problem.

    Or generate a JO long rel32 branch forward to a INT 4 which will be
    correctly predicted as not taken unless there is an overflow.
    If there is an overflow this leaves the RIP pointing far away from the
    problem and you would have to trace the code paths backwards to find
    where it occurred.

    And all this crap goes away if an ISA has ADDV Add Trap Signed Overflow.
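
    For comparison, a minimal sketch of doing the check explicitly in C
    rather than relying on -ftrapv. The helper name checked_add is
    hypothetical; __builtin_add_overflow is the gcc/clang builtin, which
    on x86-64 typically compiles to an ADD followed by a test of the
    overflow flag (JO or SETO). The #else branch is a portable pre-check
    for compilers without the builtin.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores a + b in *out and returns true,
   or returns false if the signed addition would overflow. */
static bool checked_add(int a, int b, int *out)
{
#if defined(__GNUC__) || defined(__clang__)
    /* The builtin returns true when the addition overflowed. */
    return !__builtin_add_overflow(a, b, out);
#else
    /* Test the operands first so that no overflowing add is executed. */
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
#endif
}
```

    The caller decides what to do on failure (abort, report, saturate),
    which also sidesteps the question of which trap vector is wired up.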

  • From Michael S@21:1/5 to David Brown on Mon Sep 30 17:32:47 2024
    On Mon, 30 Sep 2024 14:07:47 +0200
    David Brown <[email protected]> wrote:

    On 29/09/2024 21:30, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:
    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Thank you for these suggestions. I have started to explore
    mingw but not yet the others. Is there a difference between
    mingw and mingw-w64, do you know?

    One is a fork of the other, I believe.


    mingw-w64 was started as a fork of mingw, initially created to
    support generating 64-bit binaries and because of disagreements with
    the pace of development in mingw.


    Also do you know if mingw
    is compatible with MSVC, as long as long double is not used?

    I believe that Mingw-w64 uses the Windows ABI, but that is a
    belief, not something I know first-hand; I haven't looked
    at the assembly.

    There is a reasonably defined ABI for 64-bit Windows, so I think
    there will be compatibility for most things in C.

    For "most things" - yes. For 'long double' - no.
    In case of 'long double' mingw64 tools use their own conventions that
    differ both from SysV and from Win64. But at C level behavior is
    identical to x86-64 Linux.

    C++ is more
    complicated and much more likely to have incompatibilities.

    There are approximately a hundred and one different C ABI's and
    calling conventions for 32-bit Windows, since MS never actually
    defined one, so things are a bit of a mess there. (DLL calling
    conventions are clearer.)

    I believe the two most popular ways of running "Linux-like" software
    and gcc on Windows are using WSL (which is more of a virtualisation
    layer),

    WSL (now often referred to as WSL1) is not a virtualization layer.
    WSL2 is indeed a Linux running in a virtual machine +
    integration features for convenience.

    WSL1 is the worst possible place to run Linux programs that depend on
    long double having higher precision. That's because when the WSL1 kernel
    starts a new process it sets the precision of the x87 co-processor to
    53 bits, which is different from the default setting on just about any
    other x86-64 Linux. Of course, the process can change the setting, but
    for that the programmer would have to be aware that the problem exists.
    Which is rare.

    WSL2 doesn't have this problem, but it is supported only on relatively
    new versions of Windows.

    So, for older versions of Windows, if one wants to run Linux binaries
    'as is' and to get the same behavior of long double as in the original,
    then one is advised to run Linux in less-integrated VMs, like VirtualBox
    or MS's own Hyper-V.
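
    A minimal sketch of a run-time probe that would at least make the
    problem visible. The helper name is hypothetical. It checks whether
    1 + LDBL_EPSILON can be told apart from 1 when the addition is
    actually executed; with the x87 precision control reduced as
    described above, the sum would be expected to round back to 1 and
    the probe to return 0. The volatile qualifiers are there to keep the
    compiler from folding the addition at compile time.

```c
#include <float.h>

/* Hypothetical helper: returns 1 if arithmetic on long double is
   carried out at the precision that <float.h> advertises, 0 if the
   run-time environment silently rounds to something narrower. */
static int long_double_precision_intact(void)
{
    volatile long double one = 1.0L;
    volatile long double eps = LDBL_EPSILON;
    volatile long double sum = one + eps;
    return sum != one;
}
```

    A program that depends on extended precision could call this once at
    startup and refuse to run, or warn, when it returns 0.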

    and mingw-w64 for the compiler target (with either gcc or
    clang) and msys2 as an environment and source of *nix utilities and
    libraries. mingw/msys is considered old and limited (32-bit only),
    while Cygwin is considered slow and clunky by many.


    And cygwin console is quite inconvenient.

    At least, that is my understanding.



  • From David Brown@21:1/5 to Michael S on Mon Sep 30 17:01:01 2024
    On 30/09/2024 16:32, Michael S wrote:
    On Mon, 30 Sep 2024 14:07:47 +0200
    David Brown <[email protected]> wrote:

    On 29/09/2024 21:30, Thomas Koenig wrote:
    Tim Rentsch <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:

    Tim Rentsch <[email protected]> schrieb:
    [...]
    I am currently in the position of needing to take some code
    written for Linux/Unix and get it running in MS Windows.

    My attempts to use MSVC have been frustrating, because of some
    limitations of that environment. The two most prominent are
    these: long double is only 64 bits, and there are no integer
    types of 128 bits that I could find.

    Depending on what you need to do, you can give MinGW-w64 a try.
    It works either as a cross-compiler from Linux or on Windows using
    msys2 or Cygwin.

    Thank you for these suggestions. I have started to explore
    mingw but not yet the others. Is there a difference between
    mingw and mingw-w64, do you know?

    One is a fork of the other, I believe.


    mingw-w64 was started as a fork of mingw, initially created to
    support generating 64-bit binaries and because of disagreements with
    the pace of development in mingw.


    Also do you know if mingw
    is compatible with MSVC, as long as long double is not used?

    I believe that Mingw-w64 uses the Windows ABI, but that is a
    belief, not something I know first-hand; I haven't looked
    at the assembly.

    There is a reasonably defined ABI for 64-bit Windows, so I think
    there will be compatibility for most things in C.

    For "most things" - yes. For 'long double' - no.
    In case of 'long double' mingw64 tools use their own conventions that
    differ both from SysV and from Win64.

    I did not know that - thanks for that information. (Though hopefully
    I'll not have to do enough C programming on Windows to find the
    information useful!)

    But at C level behavior is
    identical to x86-64 Linux.

    C++ is more
    complicated and much more likely to have incompatibilities.

    There are approximately a hundred and one different C ABI's and
    calling conventions for 32-bit Windows, since MS never actually
    defined one, so things are a bit of a mess there. (DLL calling
    conventions are clearer.)

    I believe the two most popular ways of running "Linux-like" software
    and gcc on Windows are using WSL (which is more of a virtualisation
    layer),

    WSL (now often referred to as WSL1) is not a virtualization layer.
    WSL2 is indeed a Linux running in a virtual machine +
    integration features for convenience.


    I was thinking of current WSL, which would presumably be WSL2. I have
    not used it myself, but I have a customer who does, and who simply calls
    it WSL. But perhaps I should use "WSL2" to be clear.

    WSL1 is the worst possible place to run Linux programs that depend on
    long double having higher precision. That's because when the WSL1 kernel
    starts a new process it sets the precision of the x87 co-processor to
    53 bits, which is different from the default setting on just about any
    other x86-64 Linux. Of course, the process can change the setting, but
    for that the programmer would have to be aware that the problem exists.
    Which is rare.

    WSL2 doesn't have this problem, but it is supported only on relatively
    new versions of Windows.

    So, for older versions of Windows, if one wants to run Linux binaries
    'as is' and to get the same behavior of long double as in the original,
    then one is advised to run Linux in less-integrated VMs, like VirtualBox
    or MS's own Hyper-V.


    I just run them on Linux :-)

    But while none of this affects me, it might affect customers or others
    that I deal with, so it is good to know.

    and mingw-64 for the compiler target (with either gcc or
    clang) and msys2 as an environment and source of *nix utilities and
    libraries. mingw/msys is considered old and limited (32-bit only),
    while Cygwin is considered slow and clunky by many.


    And the Cygwin console is quite inconvenient.

    At least, that is my understanding.





  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Oct 1 00:33:01 2024
    On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

    On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

    They are entirely 64-bit.

    <https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

    Another example; Win32 has a function for getting the size of a
    file. File sizes on Windows are limited to 2^64 bytes, and so they
    need a 64-bit integer to be expressed easily. But the API call to
    get the size of a file doesn't give you a 64-bit value. Instead,
    it gives you a pair of 32-bit values that have to be combined in a
    particular way. For 32-bit Windows, that's sort of understandable;
    32-bit Windows is, well, 32-bit, so you might not expect to be
    able to use 64-bit integers. But if you use the same API in 64-bit
    Windows, it still gives you the pair of numbers, rather than just
    a nice simple 64-bit number. While this made some kind of sense on
    32-bit Windows, it makes no sense at all on 64-bit Windows, since
    64-bit Windows can, by definition, use 64-bit numbers.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Oct 1 03:53:01 2024
    On Tue, 1 Oct 2024 0:33:01 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

    On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

    They are entirely 64-bit.

    <https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

    Another example; Win32 has a function for getting the size of a
    file. File sizes on Windows are limited to 2^64 bytes, and so they
    need a 64-bit integer to be expressed easily. But the API call to
    get the size of a file doesn't give you a 64-bit value. Instead,
    it gives you a pair of 32-bit values that have to be combined in a
    particular way.

    As long as you can embed the API function in a macro that performs
    said combining, it's all OK.

    uint64_t filesize = GetFileSize64( whatever );
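
    A wrapper along those lines could be sketched as below. It is plain C
    with no Windows headers, so the two halves are passed in as arguments;
    on Windows they would come from GetFileSize(), which returns the low
    32 bits and stores the high 32 bits through an optional pointer. The
    name combine_file_size is invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: joins the low and high 32-bit halves of a file
 * size into one 64-bit value. The cast must happen before the shift;
 * shifting a 32-bit value left by 32 is undefined behaviour in C. */
uint64_t combine_file_size(uint32_t low, uint32_t high)
{
    uint64_t size = ((uint64_t)high << 32) | (uint64_t)low;

    /* The two halves must round-trip unchanged. */
    assert((uint32_t)size == low && (uint32_t)(size >> 32) == high);
    return size;
}
```

    For example, combine_file_size(0, 1) is 4294967296, i.e. exactly 4 GiB.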

    For 32-bit Windows, that's sort of understandable;
    32-bit Windows is, well, 32-bit, so you might not expect to be
    able to use 64-bit integers. But if you use the same API in 64-bit
    Windows, it still gives you the pair of numbers, rather than just
    a nice simple 64-bit number. While this made some kind of sense on
    32-bit Windows, it makes no sense at all on 64-bit Windows, since
    64-bit Windows can, by definition, use 64-bit numbers.

    Why would you want a 32-bit application to be able to use files
    of 2^64 bytes in size ???

  • From Terje Mathisen@21:1/5 to All on Tue Oct 1 07:35:17 2024
    MitchAlsup1 wrote:
    On Tue, 1 Oct 2024 0:33:01 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

    On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    Also the fact that those “64-bit” APIs are not entirely
    “64-bit” ...

    They are entirely 64-bit.

    <https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

        Another example; Win32 has a function for getting the size of a
        file. File sizes on Windows are limited to 2^64 bytes, and so they
        need a 64-bit integer to be expressed easily. But the API call to
        get the size of a file doesn't give you a 64-bit value. Instead,
        it gives you a pair of 32-bit values that have to be combined in a
        particular way.

    As long as you can embed the API function in a macro that performs
    said combining, it's all OK.

        uint64_t filesize = GetFileSize64( whatever );

    As I wrote in my other post, the API is in fact directly usable as-is on
    any compiler with 64-bit support.

        For 32-bit Windows, that's sort of understandable;
        32-bit Windows is, well, 32-bit, so you might not expect to be
        able to use 64-bit integers. But if you use the same API in 64-bit
        Windows, it still gives you the pair of numbers, rather than just
        a nice simple 64-bit number. While this made some kind of sense on
        32-bit Windows, it makes no sense at all on 64-bit Windows, since
        64-bit Windows can, by definition, use 64-bit numbers.

    Why would you want a 32-bit application to be able to use files
    of 2^64 bytes in size ???

    Why not? I had both partition and DVD image files larger than 4 GB well
    before I had Win64, but I still wanted to be able to get my du (disk
    use) utility to work.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue Oct 1 06:04:17 2024
    On Tue, 1 Oct 2024 07:30:45 +0200, Terje Mathisen wrote:

    The first issue here is that the original API defined the return value
    as 32-bit ...

    The POSIX effort started around 1988, and one of its defining
    characteristics is the use of symbolic type names like “size_t”, “off_t”
    and “time_t”, without assuming particular sizes for them. Windows NT had
    plenty of time to learn from that example; why didn’t it?
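
    For comparison, this is roughly what the POSIX style looks like: a
    sketch that gets a file size through stat() and the off_t-typed
    st_size field, so the code never spells out how wide the size is. The
    helper name file_size_bytes is made up; stat(), struct stat and off_t
    are standard POSIX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: returns the size of the file at `path` in bytes,
 * or -1 if stat() fails. st_size has type off_t, whose width is chosen
 * by the platform (and, on some 32-bit systems, by build options). */
intmax_t file_size_bytes(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;              /* errno says why */

    off_t size = st.st_size;    /* no assumption about 32 vs 64 bits */
    return (intmax_t)size;      /* widened only so it prints with %jd */
}
```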

    Turns out every single Win32-system in existence/in regular use is
    little endian ...

    Of which there are not many. Wasn’t Windows NT supposed to be some kind
    of “portable” OS? Wasn’t it supposed to run on big-endian architectures
    too, like POWER, MIPS and SPARC?

    Only all those ports failed. In fact, every single non-x86 port of Windows
    has failed.

    Yeah, not too pretty, but also not a real/important problem.

    “Death of a thousand cuts”. And now Microsoft is left scrambling,
    desperately trying to turn Windows into Linux.

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Oct 1 07:30:45 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

    On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

    They are entirely 64-bit.

    <https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

    Another example; Win32 has a function for getting the size of a
    file. File sizes on Windows are limited to 2^64 bytes, and so they
    need a 64-bit integer to be expressed easily. But the API call to
    get the size of a file doesn't give you a 64-bit value. Instead,
    it gives you a pair of 32-bit values that have to be combined in a
    particular way. For 32-bit Windows, that's sort of understandable;
    32-bit Windows is, well, 32-bit, so you might not expect to be
    able to use 64-bit integers. But if you use the same API in 64-bit
    Windows, it still gives you the pair of numbers, rather than just
    a nice simple 64-bit number. While this made some kind of sense on
    32-bit Windows, it makes no sense at all on 64-bit Windows, since
    64-bit Windows can, by definition, use 64-bit numbers.

    The first issue here is that the original API defined the return value
    as 32-bit, with an optional pointer to another variable to receive the
    high part, but they came up with the GetFileSizeEx() function decades
    ago, and that one gets the file size as a LARGE_INTEGER. Nobody uses
    anything else afair.

    The second potential issue is with the definition of LARGE_INTEGER:

    It is, as you say, defined as a pair of 32-bit values, overlaid with a
    LONGLONG, which can only work on a little-endian CPU since the low part
    is followed by the high, right?

    Turns out every single Win32-system in existence/in regular use is
    little endian, so that is much less of a problem, and the docs tell you:

    "The LARGE_INTEGER structure is actually a union. If your compiler has
    built-in support for 64-bit integers, use the QuadPart member to store
    the 64-bit integer. Otherwise, use the LowPart and HighPart members to
    store the 64-bit integer."
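
    To make the layout concrete, here is a cut-down sketch of that kind of
    union. It is not copied from the Windows headers; the type and member
    names (large_integer_sketch, low_part, high_part, quad_part) are
    stand-ins, and the low-then-high field order is the one described
    above.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a LARGE_INTEGER-style union: two 32-bit halves overlaid
 * with one 64-bit integer. Low half first, as described in the post. */
typedef union {
    struct {
        uint32_t low_part;
        int32_t  high_part;
    } u;
    int64_t quad_part;
} large_integer_sketch;

/* Returns 1 if the halves line up with quad_part on this machine, which
 * is only the case when the CPU stores the low half first
 * (little-endian). Reading a union member other than the one last
 * written is permitted in C (not in C++). */
int halves_match_quad(void)
{
    large_integer_sketch li;

    assert(sizeof li == sizeof li.quad_part);  /* no padding expected */
    li.quad_part = 0x0000000100000002LL;       /* high = 1, low = 2 */
    return li.u.low_part == 2u && li.u.high_part == 1;
}

/* Run-time endianness probe, for comparison with the result above. */
int is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```

    On a little-endian machine both functions return 1; on a big-endian
    one both return 0, which is the failure mode being discussed.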

    Yeah, not too pretty, but also not a real/important problem.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue Oct 1 06:09:53 2024
    On Tue, 1 Oct 2024 07:35:17 +0200, Terje Mathisen wrote:

    I had both partition and DVD image files larger than 4 GB well
    before I had Win64 ...

    DVD is an interesting case. If you look at the file structure,
    individual .VOB files typically don’t exceed about 1GiB in size, but
    successive segments of a title are required to be physically laid out next
    to each other on the disc, so a player can actually forget about the
    filesystem once it gets started, and just read successive physical
    sectors.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Oct 1 06:05:36 2024
    On Tue, 1 Oct 2024 03:53:01 +0000, MitchAlsup1 wrote:

    Why would you want a 32-bit application to be able to use files of
    2^64 bytes in size ???

    Because multi-gigabyte files were becoming commonplace 20-25 years ago,
    before 64-bit CPUs did the same.

  • From Michael S@21:1/5 to Terje Mathisen on Tue Oct 1 10:57:25 2024
    On Tue, 1 Oct 2024 07:30:45 +0200
    Terje Mathisen <[email protected]> wrote:

    Lawrence D'Oliveiro wrote:
    On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

    On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    Also the fact that those “64-bit” APIs are not entirely
    “64-bit” ...

    They are entirely 64-bit.

    <https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

    Another example; Win32 has a function for getting the size of a
    file. File sizes on Windows are limited to 2^64 bytes, and so
    they need a 64-bit integer to be expressed easily. But the API call
    to get the size of a file doesn't give you a 64-bit value. Instead,
    it gives you a pair of 32-bit values that have to be combined
    in a particular way. For 32-bit Windows, that's sort of
    understandable; 32-bit Windows is, well, 32-bit, so you might not
    expect to be able to use 64-bit integers. But if you use the same
    API in 64-bit Windows, it still gives you the pair of numbers,
    rather than just a nice simple 64-bit number. While this made some
    kind of sense on 32-bit Windows, it makes no sense at all on 64-bit
    Windows, since 64-bit Windows can, by definition, use 64-bit
    numbers.

    The first issue here is that the original API defined the return
    value as 32-bit, with an optional pointer to another variable to
    receive the high part, but they came up with the GetFileSizeEx()
    function decades ago, and that one gets the file size as a
    LARGE_INTEGER. Nobody uses anything else afair.

    The second potential issue is with the definition of LARGE_INTEGER:

    It is, as you say, defined as a pair of 32-bit values, overlaid
    with a LONGLONG, which can only work on a little-endian CPU since the
    low part is followed by the high, right?


    It seems to me that back when Windows still supported Big Endian
    targets (PPC) there was an #ifdef in the relevant header, so on PPC the
    order of the low and high parts was the opposite of that on the other
    targets. The remnants of that are more likely to be found in DDK
    headers than in Windows SDK headers.
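
    The following is only a guess at the general shape of such an #ifdef,
    not the real header. It relies on the __BYTE_ORDER__ macros that GCC
    and Clang predefine; all other names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Guess at the shape of an endian-aware 64-bit union; not taken from
 * any Windows header. __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__ are
 * predefined by GCC and Clang. */
typedef union {
    struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        int32_t  high_part;   /* big-endian: most significant half first */
        uint32_t low_part;
#else
        uint32_t low_part;    /* little-endian: least significant first */
        int32_t  high_part;
#endif
    } u;
    int64_t quad_part;
} portable_large_integer;

/* With the field order chosen per target, the halves agree with
 * quad_part on either byte order. */
int64_t make_quad(uint32_t low, int32_t high)
{
    portable_large_integer li;

    assert(sizeof li == sizeof(int64_t));   /* no padding expected */
    li.u.low_part = low;
    li.u.high_part = high;
    return li.quad_part;
}
```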

    Turns out every single Win32-system in existence/in regular use is
    little endian, so that is much less of a problem, and the docs tell
    you to

    "The LARGE_INTEGER structure is actually a union. If your compiler
    has built-in support for 64-bit integers, use the QuadPart member to
    store the 64-bit integer. Otherwise, use the LowPart and HighPart
    members to store the 64-bit integer."

    Yeah, not too pretty, but also not a real/important problem.

    Terje


  • From John Dallman@21:1/5 to D'Oliveiro on Tue Oct 1 10:12:00 2024
    In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
    it supposed to run on big-endian architectures too, like POWER, MIPS
    and SPARC?

    It did. I have no experience with Windows NT on SPARC or PowerPC, but the
    OS ran fine on MIPS. It was a commercial failure, because MIPS didn't
    keep up with the performance growth of x86.

    PowerPC did for a while, but the company interested in NT on PowerPC was
    IBM, and their hardware prices were a /lot/ higher than x86 prices. They
    didn't see that as a problem, but all the potential customers did.

    John

  • From Michael S@21:1/5 to John Dallman on Tue Oct 1 12:34:26 2024
    On Tue, 1 Oct 2024 10:12 +0100 (BST)
    [email protected] (John Dallman) wrote:

    In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
    it supposed to run on big-endian architectures too, like POWER, MIPS
    and SPARC?

    It did. I have no experience with Windows NT on SPARC or PowerPC,

    Did WinNT on SPARC ever ship? I don't think so.

    but
    the OS ran fine on MIPS. It was a commercial failure, because MIPS
    didn't keep up with the performance growth of x86.


    Wasn't MIPS edition of WinNT Little Endian?

    PowerPC did for a while, but the company interested in NT on PowerPC
    was IBM, and their hardware prices were a /lot/ higher than x86
    prices. They didn't see that as a problem, but all the potential
    customers did.

    John

    Now I wonder what endianness was used by the PowerPC variant of WinNT.
    In theory, PPC/POWER could run in Little Endian mode, but before v3 of
    the POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
    mistaken, the difference was that in LE mode there was no support for
    unaligned memory accesses.

  • From John Dallman@21:1/5 to Michael S on Tue Oct 1 12:28:00 2024
    In article <[email protected]>, [email protected] (Michael S) wrote:

    On Tue, 1 Oct 2024 10:12 +0100 (BST)
    [email protected] (John Dallman) wrote:
    [email protected]d (Lawrence D'Oliveiro) wrote:

    Wasn't Windows NT supposed to be some kind of _portable_ OS?
    Wasn't it supposed to run on big-endian architectures too, like
    POWER, MIPS and SPARC?
    It did. I have no experience with Windows NT on SPARC or PowerPC,

    Did WinNT on SPARC ever ship? I don't think so.

    No. Intergraph worked on a port, but it never shipped, and neither did
    Intergraph's SPARC-based hardware.

    Wasn't MIPS edition of WinNT Little Endian?

    Yes.

    Now I wonder what endianness was used by the PowerPC variant of WinNT.
    In theory, PPC/POWER could run in Little Endian mode, but before v3 of
    the POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
    mistaken, the difference was that in LE mode there was no support for
    unaligned memory accesses.

    WinNT ran it little-endian according to <https://en.wikipedia.org/wiki/PowerPC#Endian_modes>

    I neglected the bi-endianness of PowerPC and MIPS. SPARC was purely
    big-endian until SPARC V9.

    John

  • From Michael S@21:1/5 to Thomas Koenig on Tue Oct 1 19:08:05 2024
    On Tue, 1 Oct 2024 15:31:55 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    John Dallman <[email protected]> schrieb:
    In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    Wasn't Windows NT supposed to be some kind of _portable_ OS?
    Wasn't it supposed to run on big-endian architectures too, like
    POWER, MIPS and SPARC?

    It did. I have no experience with Windows NT on SPARC or PowerPC,
    but the OS ran fine on MIPS.

    There was also a Windows for Alpha. A German computer chain, Vobis,
    tried to sell two models with that, but it flopped.

    Alpha is Little Endian.
    It seems that SPARC stands out as the only strictly Big Endian
    architecture for which there was a serious attempt to port WinNT.
    But not serious enough, it seems.

    Now, thinking about it, I have a question.
    Did Intergraph really try to port WinNT to SPARC V8, which was strictly
    BE, or were they porting to the emerging SPARC V9? The latter supports
    LE data access. It seems that at that moment (~1993) there were no
    production SPARC V9 chips, but the V9 ISA spec was already published.

  • From Thomas Koenig@21:1/5 to John Dallman on Tue Oct 1 15:31:55 2024
    John Dallman <[email protected]> schrieb:
    In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
    it supposed to run on big-endian architectures too, like POWER, MIPS
    and SPARC?

    It did. I have no experience with Windows NT on SPARC or PowerPC, but the
    OS ran fine on MIPS.

    There was also a Windows for Alpha. A German computer chain, Vobis,
    tried to sell two models with that, but it flopped.

  • From Anton Ertl@21:1/5 to Michael S on Tue Oct 1 16:26:25 2024
    Michael S <[email protected]> writes:
    On Tue, 1 Oct 2024 10:12 +0100 (BST)
    [email protected] (John Dallman) wrote:
    PowerPC did for a while, but the company interested in NT on PowerPC
    was IBM, and their hardware prices were a /lot/ higher than x86
    prices. They didn't see that as a problem, but all the potential
    customers did.

    The idea behind ARC (MIPS) and PowerPC (which was not just IBM) was that
    they would succeed the IA-32-based PC. Given the assumed (and, around
    1990, actual) performance superiority of RISCs over IA-32, this looked
    plausible. However, it did not work out, even for Alpha, which was often
    superior in performance throughout the 1990s, and for which there were
    cheap offerings (but without the performance edge). E.g., I once was
    playing with the idea of buying a 21164PC-based PC164SX system, where the
    CPU+board (with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998, but I
    went with a K6-2, because I played some DOS games :-). The cheap 164SX
    offer may have been a clearance sale, however.

    In any case, the performance advantage of the RISCs vanished during
    the 1990s, the RISCs never had wide ISV support, and so WNT on RISCs
    flopped.

    Now I wonder what endianness was used by the PowerPC variant of WinNT.
    In theory, PPC/POWER could run in Little Endian mode, but before v3 of
    the POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
    mistaken, the difference was that in LE mode there was no support for
    unaligned memory accesses.

    Given that MIPS and Alpha require natural alignment, little-endian
    PowerPC at the time was as full-featured as the other RISCs.

    Alignment issues may have been a problem with the RISC ports, though.
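
    As an illustration of the kind of issue meant here: on a CPU that
    requires natural alignment, casting a byte pointer to uint32_t * and
    dereferencing it can trap when the address is not a multiple of four.
    A common portable workaround is to copy through memcpy, sketched
    below. The function name load_u32_unaligned is invented for the
    example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a 32-bit value in host byte order from an
 * address with no alignment guarantee. memcpy has no alignment
 * requirement, and compilers typically turn this into a single load on
 * targets where misaligned loads are allowed. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```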

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Oct 1 18:15:31 2024
    On Tue, 1 Oct 2024 16:26:25 +0000, Anton Ertl wrote:

    Michael S <[email protected]> writes:
    On Tue, 1 Oct 2024 10:12 +0100 (BST)
    [email protected] (John Dallman) wrote:
    PowerPC did for a while, but the company interested in NT on PowerPC
    was IBM, and their hardware prices were a /lot/ higher than x86
    prices. They didn't see that as a problem, but all the potential
    customers did.

    The idea behind ARC (MIPS) and PowerPC (which was not just IBM) was that
    they would succeed the IA-32-based PC. Given the assumed (and, around
    1990, actual) performance superiority of RISCs over IA-32, this looked
    plausible. However, it did not work out, even for Alpha, which was often
    superior in performance throughout the 1990s, and for which there were
    cheap offerings (but without the performance edge). E.g., I once was
    playing with the idea of buying a 21164PC-based PC164SX system, where the
    CPU+board (with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998, but I
    went with a K6-2, because I played some DOS games :-). The cheap 164SX
    offer may have been a clearance sale, however.

    In any case, the performance advantage of the RISCs vanished during
    the 1990s, the RISCs never had wide ISV support, and so WNT on RISCs
    flopped.

    Pentium Pro sounded the death knell for 1st gen RISC architectures.
    Only ARM found a new and expanding market.

    Now I wonder what endianness was used by the PowerPC variant of WinNT.
    In theory, PPC/POWER could run in Little Endian mode, but before v3 of
    the POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
    mistaken, the difference was that in LE mode there was no support for
    unaligned memory accesses.

    Given that MIPS and Alpha require natural alignment, little-endian
    PowerPC at the time was as full-featured as the other RISCs.

    Alignment issues may have been a problem with the RISC ports, though.

    One of the reasons to support misaligned accesses in HW.

    - anton

  • From jseigh@21:1/5 to Anton Ertl on Tue Oct 1 19:19:14 2024
    On 10/1/24 12:26, Anton Ertl wrote:
    Michael S <[email protected]> writes:
    On Tue, 1 Oct 2024 10:12 +0100 (BST)
    [email protected] (John Dallman) wrote:
    PowerPC did for a while, but the company interested in NT on PowerPC
    was IBM, and their hardware prices were a /lot/ higher than x86
    prices. They didn't see that as a problem, but all the potential
    customers did.

    The idea behind ARC (MIPS) and PowerPC (which was not just IBM) was that
    they would succeed the IA-32-based PC. Given the assumed (and, around
    1990, actual) performance superiority of RISCs over IA-32, this looked
    plausible. However, it did not work out, even for Alpha, which was often
    superior in performance throughout the 1990s, and for which there were
    cheap offerings (but without the performance edge). E.g., I once was
    playing with the idea of buying a 21164PC-based PC164SX system, where the
    CPU+board (with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998, but I
    went with a K6-2, because I played some DOS games :-). The cheap 164SX
    offer may have been a clearance sale, however.


    My impression at the time was that, given that the PPC cost 1/3 as much
    as the Intel processors, it would have destroyed Intel. OS/2 for the
    PPC was a no-show, and IBM didn't believe AIX for PPC would be popular
    enough, so they canned it. That kind of left Apple holding the bag,
    because I believe the pitch to Apple was that the PPC was going to be
    produced in such volume that unit prices would have dropped
    considerably. One of the many stories of IBM snatching defeat from the
    jaws of victory.

    Joe Seigh

  • From John Dallman@21:1/5 to Michael S on Wed Oct 2 14:58:00 2024
    In article <[email protected]>, [email protected] (Michael S) wrote:

    Did Intergraph really try to port WinNT to SPARC V8, which was
    strictly BE, or were they porting to the emerging SPARC V9?

    I found the press releases:

    <http://ftp.lanet.lv/ftp/sun-info/sunflash/1993/Jul/55.11-Sun-Intergraph:- SPARC-and-Windows-NT>

    PALO ALTO, Calif., July 7, 1993 -- Sun Microsystems Computer
    Corporation (SMCC) and Intergraph Corporation announced today that they
    have signed a development agreement that will accelerate delivery of
    future generations of SPARC microprocessors. In addition, Intergraph
    will port Microsoft Corporation's Windows NT operating system to future
    SPARC microprocessors.

    Under terms of the agreement, Intergraph's Advanced Processor Division
    (APD), located here, will develop high-end 64-bit SPARC microprocessors
    jointly with SMCC's SPARC Technology Business (STB). Intergraph and
    SMCC both have the right to use these processors in their system level
    products, while STB will make these components available to the open
    market.

    As part of the agreement, APD will assume responsibility for porting
    Microsoft's Windows NT to Intergraph systems using future versions of
    SPARC processors. The APD port will support the "little-endian" byte
    ordering feature to be included in future SPARC implementations. This
    means that Windows NT itself and Windows NT applications will
    transition easily to the SPARC architecture. Solaris will continue to
    support "big-endian" byte ordering, as defined in current and future
    versions of the SPARC architecture, which to date runs more than 7500
    hardware and software solutions.

    There's more, but it is clear that they were going to use little-endian
    for Windows NT. At the time, Intergraph were selling their (ex-Fairchild)
    Clipper CPU in Unix CAD workstations with some degree of success, so this
    agreement wasn't a crazy idea. They had also ported Windows NT to Clipper,
    so they had some idea of what they were about.

    I discovered today that a contact who works for a product that started
    life at Intergraph was working there during the relevant period. They
    didn't work on this, but they're asking around for anyone who did.

    It seems as if NT has only ever been little-endian.

    John

  • From Michael S@21:1/5 to John Dallman on Wed Oct 2 17:05:53 2024
    On Wed, 2 Oct 2024 14:58 +0100 (BST)
    [email protected] (John Dallman) wrote:

    <snip>

    It seems as if NT has only ever been little-endian.

    John

    Thank you.

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Thu Oct 3 00:07:17 2024
    On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

    [Windows NT on MIPS] was a commercial failure, because MIPS
    didn't keep up with the performance growth of x86.

    It was NT that was the commercial failure, not MIPS. MIPS found a niche in
    the embedded world, and went on to outsell x86 by a factor of 3:1 or so.

    We know this because a lot of those embedded devices ran Linux.

    PowerPC did for a while, but the company interested in NT on PowerPC was
    IBM, and their hardware prices were a /lot/ higher than x86 prices. They
    didn't see that as a problem, but all the potential customers did.

    PowerPC got rolled back into POWER, near as I can tell. And that continues
    to sell today -- you see some POWER machines not far from the top of the
    current Top500 list of the world’s most powerful supercomputers. That
    shows there is a viable market for the products.

    And of course they, too, run Linux.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu Oct 3 00:11:02 2024
    On Tue, 01 Oct 2024 16:26:25 GMT, Anton Ertl wrote:

    In any case, the performance advantage of the RISCs vanished during the
    1990s ...

    Only for as long as Intel could afford to spend 10× as much on developing
    each chip generation as the RISC vendors could. It could because it could
    reap 10× the profits in return, but it can’t any more. Which is why you
    see ARM coming to the fore, and RISC-V appearing as the upstart
    challenger.

    It’s a whole new ballgame now, and x86 is starting to look a little long
    in the tooth. Which is why even Microsoft recognizes it needs to spread
    its eggs outside that one basket, with its ongoing attempts to promote
    Windows on ARM (without much success, so far).

    ... the RISCs never had wide ISV support, and so WNT on RISCs
    flopped.

    As I said above, RISC is still around and dominating the computing world.
    They’re not running Windows, because it was Windows that could not adapt
    well to them. Instead, they are running Linux.

  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 09:13:21 2024
    On 03/10/2024 02:07, Lawrence D'Oliveiro wrote:
    On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

    [Windows NT on MIPS] was a commercial failure, because MIPS
    didn't keep up with the performance growth of x86.

    It was NT that was the commercial failure, not MIPS. MIPS found a niche in
    the embedded world, and went on to outsell x86 by a factor of 3:1 or so.


    The key markets for MIPS were network devices (managed switches,
    routers, small Wifi/NAT routers, etc.) and multimedia devices (smart
    TVs, Bluray players, set-top boxes, etc.).

    These have mostly been overtaken by ARM these days.

    We know this because a lot of those embedded devices ran Linux.

    Most of these ran Linux, a few had RTOS's.


    PowerPC did for a while, but the company interested in NT on PowerPC was
    IBM, and their hardware prices were a /lot/ higher than x86 prices. They
    didn't see that as a problem, but all the potential customers did.

    PowerPC got rolled back into POWER, near as I can tell. And that continues
    to sell today -- you see some POWER machines not far from the top of the
    current Top500 list of the world’s most powerful supercomputers. That
    shows there is a viable market for the products.

    And of course they, too, run Linux.

    PowerPC also moved into the embedded world, especially in the automotive
    industry and networking, as a replacement for m68k and ColdFire for
    Motorola (then Freescale, now part of NXP). PowerPC-based
    microcontrollers are still a big part of NXP's high reliability and
    safety oriented lineups for things like engine control. Those things,
    of course, do /not/ run Linux. (But they most certainly don't run
    Windows :-) )

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 06:57:54 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 01 Oct 2024 16:26:25 GMT, Anton Ertl wrote:

    In any case, the performance advantage of the RISCs vanished during the
    1990s ...

    Only for as long as Intel could afford to spend 10× as much on developing each chip generation as the RISC vendors could. It could because it could reap 10× the profits in return, but it can’t any more.

    NexGen and AMD were smaller companies than Intel, DEC, Sun, HP, or the
    AIM companies, and yet managed to produce CPUs that were competitive
    with Intel's CPUs despite suffering from the CISC baggage. If the
    RISC companies failed to keep up, they only have themselves to blame.
    It seems to me that a number of RISC companies had difficulties with
    managing the larger projects that the growing die areas allowed.

    Another issue was the marketing. The RISC companies did not want to
    damage their existing high-priced workstation and server business by
    providing cheap CPUs for the masses, and yet had to do that in order
    to displace Intel, AMD, and Cyrix. AMD and Cyrix did not have that
    problem.

    Which is why you
    see ARM coming to the fore, and RISC-V appearing as the upstart
    challenger.

    ARM did not have the marketing problem, either, because they were not
    competing in the workstation/server market. They developed their
    business model of selling cores (and more) for SoCs for portable
    computing, and expanded from that.

    ... the RISCs never had wide ISV support, and so WNT on RISCs
    flopped.

    As I said above, RISC is still around and dominating the computing world. They’re not running Windows, because it was Windows that could not adapt well to them. Instead, they are running Linux.

    Dominating? In the smartphone and tablet world, yes. In the embedded
    world, too. In laptops, desktops and servers, no.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 15:55:57 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

    [Windows NT on MIPS] was a commercial failure, because MIPS
    didn't keep up with the performance growth of x86.

    It was NT that was the commercial failure, not MIPS. MIPS found a niche in the embedded world, and went on to outsell x86 by a factor of 3:1 or so.

    Note that MIPS CPUs were used in SGI supercomputers and high-end
    graphics workstations.

  • From Thomas Koenig@21:1/5 to Scott Lurndal on Thu Oct 3 17:18:22 2024
    Scott Lurndal <[email protected]> schrieb:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

    [Windows NT on MIPS] was a commercial failure, because MIPS
    didn't keep up with the performance growth of x86.

    It was NT that was the commercial failure, not MIPS. MIPS found a niche in the embedded world, and went on to outsell x86 by a factor of 3:1 or so.

    Note that MIPS CPUs were used in SGI supercomputers and high-end
    graphics workstations.

    I worked on one of the SGI machines for a time; it was the
    successor of the Cray, which had been decommissioned before I started
    work at the company.

    It wasn't economical to keep around, so it got decommissioned
    when the next big reorganization (and big crisis) hit, and
    the staff then retired.

    For a time, the only machine for CFD applications at that company
    was an HP Itanium box on my desk...

  • From George Neuner@21:1/5 to [email protected] on Fri Sep 27 19:52:58 2024
    On Wed, 25 Sep 2024 12:54:18 -0400, EricP
    <[email protected]> wrote:

    For me error detection of all kinds is useful. It just happens
    to not be conveniently supported in C so no one tries it in C.

    GCC's -ftrapv option is not useful for a variety of reasons.
    1) it's slow, about 50% performance hit
    2) it's always on for a compilation unit, which is not what programmers need,
    as it triggers many false positives, so people turn it off.

    Things like that are why some companies have a code policy that allows
    just one function per file.

    Still a problem if you need <whatever the relevant flag does> only in
    one or a few places.
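
    One way to get overflow detection "only in one or a few places" without a per-file flag is a checked helper called just where it matters. The sketch below is an illustration, not something proposed in the thread: it assumes GCC or Clang, whose __builtin_add_overflow builtin reports overflow instead of trapping, and the helper names checked_add_int and saturating_add_int are made up for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: add two ints and report whether the result fits.
 * Returns true and stores the sum on success, false on overflow.
 * Relies on the GCC/Clang builtin __builtin_add_overflow. */
static bool checked_add_int(int a, int b, int *result)
{
    assert(result != NULL);
    /* The builtin returns true when the mathematical sum does not fit. */
    return !__builtin_add_overflow(a, b, result);
}

/* Hypothetical helper: clamp to INT_MAX or INT_MIN instead of wrapping. */
static int saturating_add_int(int a, int b)
{
    int sum;
    if (checked_add_int(a, b, &sum))
        return sum;
    /* Overflow needs both operands to have the same sign, so the sign
     * of b tells us which bound was crossed. */
    return b > 0 ? INT_MAX : INT_MIN;
}
```

    Code that does not call these helpers keeps ordinary wrapping or undefined-overflow semantics and pays no cost, which is the selectivity a whole-unit flag cannot give.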

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Sat Sep 28 02:25:21 2024
    On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

    I've always paid for mine. My first C compiler came with the WinNT 3.5
    beta in 1992 for $99 and came with the development kit,
    editor, source code debugger, tools, documentation.
    A few hundred bucks is not going to hurt my business.

    Given that GCC offers more features and generates better code than MSVC,
    the money may not matter to your business, but the quality of the product
    will.

  • From John Dallman@21:1/5 to David Brown on Thu Oct 3 23:49:00 2024
    In article <vdlg6h$3kq50$[email protected]>, [email protected]
    (David Brown) wrote:
    On 03/10/2024 02:07, Lawrence D'Oliveiro wrote:
    It was NT that was the commercial failure, not MIPS. MIPS found a
    niche in the embedded world, and went on to outsell x86 by a
    factor of 3:1 or so.

    The key markets for MIPS were network devices (managed switches,
    routers, small Wifi/NAT routers, etc.) and multimedia devices
    (smart TVs, Bluray players, set-top boxes, etc.).

    These have mostly been overtaken by ARM these days.

    And MIPS, the company, has abandoned its own architecture in favour of
    RISC-V. <https://mips.com/>

    John

  • From John Dallman@21:1/5 to Anton Ertl on Thu Oct 3 23:49:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    If the RISC companies failed to keep up, they only have themselves to
    blame. It seems to me that a number of RISC companies had difficulties
    with managing the larger projects that the growing die areas allowed.

    Another contributing factor was Itanium, which was quite successful at disrupting the development cycles of the RISC architectures. Of the five
    that I worked with:

    Alpha suffered from DEC's mis-management, which led to DEC being taken
    over by Compaq. They killed Alpha when Itanium first began to work, and
    before it was clear that it was a turkey.

    PA-RISC was intended by HP to be replaced by Itanium. They managed that,
    but their success was limited because Linux on x86-64 was so much more cost-effective.

    IBM kept POWER development going through the Itanium period, which is a significant reason why it's still going.

    SGI went into Itanium hard and neglected MIPS development, which never recovered. It had been losing in the performance race anyway.

    Sun kept SPARC development going, but made a different mistake, by
    spreading their development resources over too many projects. The ones
    that succeeded did so too slowly, and they fell behind. Also, Linux ate
    their web-infrastructure market rather quickly.

    Linux could not have had the success it did without the large range of
    powerful and cheap hardware designed to run Windows.

    John

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Oct 4 00:48:43 2024
    On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

    If the RISC companies failed to keep up, they only have themselves to
    blame.

    That’s all past history, anyway. RISC very much rules today, and it is x86 that is struggling to keep up.

    Another issue was the marketing. The RISC companies did not want to
    damage their existing high-priced workstation and server business by providing cheap CPUs for the masses ...

    There was one RISC family that did indeed provide cheap CPUs for the
    masses, even more so than x86, and that was ARM.

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Oct 4 00:56:01 2024
    On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

    Another contributing factor was Itanium, which was quite successful at disrupting the development cycles of the RISC architectures. Of the five
    that I worked with [that all failed, except one] ...

    That’s a pretty depressing list. ;) Except

    IBM kept POWER development going through the Itanium period, which is a significant reason why it's still going.

    Given all of IBM’s missteps, it’s mildly surprising they got that one right. Even a stopped clock is right once a day ...

    SGI went into Itanium hard and neglected MIPS development, which never recovered. It had been losing in the performance race anyway.

    SGI decided to embrace the platform that was eating their market, and try
    to sell Windows NT boxes. Trouble is, those NT boxes, while only a
    fraction of the cost of an IRIX-based product, still cost about 3× what
    other NT machines were going for.

    Sun kept SPARC development going, but made a different mistake, by
    spreading their development resources over too many projects. The ones
    that succeeded did so too slowly, and they fell behind. Also, Linux ate
    their web-infrastructure market rather quickly.

    They could still have sold SPARC hardware running Linux. I can remember comments saying Linux ran better on that hardware than Sun’s own SunOS/Solaris did.

    Linux could not have had the success it did without the large range of powerful and cheap hardware designed to run Windows.

    Linux succeeded by not having all its eggs in one basket. It ran on
    everything.

    Which is why it is now running rings around Windows, as Microsoft
    struggles to dig itself out of the x86 dead-end niche.

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Oct 4 00:46:55 2024
    On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

    And MIPS, the company, has abandoned its own architecture in favour of RISC-V. <https://mips.com/>

    And that will never run Windows, either.

    But it does run Linux.

  • From George Neuner@21:1/5 to [email protected] on Fri Oct 4 00:23:12 2024
    On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

    If the RISC companies failed to keep up, they only have themselves to
    blame.

    That’s all past history, anyway. RISC very much rules today, and it is x86 that is struggling to keep up.

    You are, of course, aware that the complex "x86" instruction set is an
    illusion and that the hardware essentially has been a load-store RISC
    with a complex decoder on the front end since the Pentium Pro landed
    in 1995.


    Another issue was the marketing. The RISC companies did not want to
    damage their existing high-priced workstation and server business by
    providing cheap CPUs for the masses ...

    There was one RISC family that did indeed provide cheap CPUs for the
    masses, even more so than x86, and that was ARM.

  • From Anton Ertl@21:1/5 to George Neuner on Fri Oct 4 07:05:34 2024
    George Neuner <[email protected]> writes:
    You are, of course, aware that the complex "x86" instruction set is an illusion and that the hardware essentially has been a load-store RISC
    with a complex decoder on the front end since the Pentium Pro landed
    in 1995.

    Repeating nonsense does not make it any truer, and this nonsense has
    been repeated since at least the Pentium Pro (1995), maybe already
    since the 486 (1989). CISC and RISC are about the instruction set,
    not about the implementation. And even if you look at the
    implementation, it's not true: The P6 has microinstructions that are
    ~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
    The K7 has load-store microinstructions; RISCs don't have that.

    In more recent CPUs, AMD tends to work with macro-instructions between
    the decoder and the reorder buffer (i.e., in the part that in the
    Pentium Pro may have been used as the justification for the RISC
    claim); these macro instructions are load-and-op and read-modify-write instructions.

    John Mashey has written about the difference between CISC and RISC
    repeatedly <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>, and
    he gives good criteria for classifying instruction sets as RISC or
    CISC, and by his criteria the 80286 and IA-32 instruction sets of the
    Pentium Pro clearly both are CISCs. I have recently <[email protected]> used his criteria on
    instruction sets that Mashey did not classify (mostly because they
    were done after his table), and by these criteria AMD64 is clearly a
    CISC, while ARM A64 and RISC-V are clearly RISCs.

    In searching for whether he has written something specific about
    IA-32, I found <https://yarchive.net/comp/vax.html>, which is an
    earlier instance of the recent discussion of whether it would have
    been better for DEC to stick with VAX, do an OoO implementation and
    extend the architecture to 64 bits, like Intel has done. He also discusses the problems
    of IA-32 there, but mainly in pointing out how much smaller they were
    than the VAX ones.

    I don't agree with all of that, however. E.g., when discussing a VAX instruction similar to IA-32's REP MOVS, he considers it to be a big
    advantage that the operands of REP MOVS are in registers. That
    appears wrong to me: either you have to keep REP MOVS in decoding (and
    thus stop decoding any later instructions) until you know the value of
    that register coming out of the OoO engine, making REP MOVS a mostly serializing instruction, or you have separate OoO logic for REP
    MOVS that keeps generating loads and stores inside the OoO engine. If
    you have the latter in the VAX, it does not make much difference whether
    the operand is in a register or in memory. The possibility of trapping
    during REP MOVS (or the VAX variant) complicates things, though: the
    first part of the REP MOVS has to be committed, and the registers
    written to the architectural state, and then execution has to start
    again with the REP MOVS. That does not seem much harder on the VAX to me,
    however.
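
    The restart behaviour described above (commit the part already done, write the registers back, then re-execute the same instruction) can be modelled in a few lines of C. This is only a software sketch of the architectural semantics, not of any real implementation: the struct movs_regs type, the rep_movsb_model function and its fault_after parameter are invented for the illustration, and the direction flag is ignored (forward copy only).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the architectural registers a REP MOVSB-style
 * copy uses: source pointer, destination pointer, remaining count. */
struct movs_regs {
    const unsigned char *src;
    unsigned char *dst;
    size_t count;
};

/* Copy bytes until the count reaches zero, or until fault_after bytes
 * have been moved in this invocation (a simulated trap).
 * Returns true when the copy is complete, false when it "trapped".
 * On a trap the registers already describe the remaining work, so the
 * caller can handle the fault and simply invoke the function again. */
static bool rep_movsb_model(struct movs_regs *r, size_t fault_after)
{
    size_t moved = 0;

    assert(r != NULL);
    while (r->count != 0) {
        if (moved == fault_after)
            return false;           /* progress so far is committed */
        *r->dst++ = *r->src++;      /* one element: load, then store */
        r->count--;
        moved++;
    }
    return true;
}
```

    Because all progress lives in the three "registers", resuming needs no hidden state; that is the property that makes the instruction restartable after a page fault in this model.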

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to John Dallman on Fri Oct 4 15:07:17 2024
    [email protected] (John Dallman) writes:
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    If the RISC companies failed to keep up, they only have themselves to
    blame. It seems to me that a number of RISC companies had difficulties
    with managing the larger projects that the growing die areas allowed.

    Another contributing factor was Itanium, which was quite successful at disrupting the development cycles of the RISC architectures.

    That's the question. It seems to me that many struggled even before,
    and jumped ship to IA-64 ASAP.

    Alpha suffered from DEC's mis-management, which led to DEC being taken
    over by Compaq. They killed Alpha when Itanium first began to work, and before it was clear that it was a turkey.

    Alpha suffered before. The 21264 was late, and did not keep up in the
    clock race. While they had higher clock rates than the competition up
    to the EV56 (1996), the OoO EV6 appeared with a lower clock rate than
    the in-order EV56 (while the OoO Pentium Pro had a higher clock rate
    than the in-order Pentium available at the same time), and did not
    scale as well with smaller processes as the Intel and AMD CPUs, which
    were making huge strides in those years. Intel then had the 2000MHz
    Pentium 4, and AMD the 1200MHz Athlon in 2000 (and 1400MHz by the time
    Alpha was canceled); unfortunately, release dates for EV6 variants at
    different clock rates are not documented on Wikipedia, so
    I cannot make a table of Alpha vs. Intel and AMD clock
    rates by year.

    PA-RISC was intended by HP to be replaced by Itanium. They managed that,
    but their success was limited because Linux on x86-64 was so much more cost-effective.

    Reportedly they thought early on that they could not afford to keep
    their own line competitive, so they started the IA-64 project with
    Intel. Interestingly, they also designed the OoO PA-8000, which was
    introduced at the same time as the Pentium Pro, and they used the same microarchitecture until they introduced the PA-8900 almost 10 years
    later, which showed a more evolutionary approach than most others used
    in those years.

    IBM kept POWER development going through the Itanium period, which is a significant reason why it's still going.

    With the Power 4+ (2003) it also got competitive clock rates
    (although, judging by the PowerPC 970, I wonder what the IPC was).

    SGI went into Itanium hard and neglected MIPS development, which never recovered. It had been losing in the performance race anyway.

    The follow-on project "Beast" for the R10000 failed (was canceled), and
    then SGI management was happy to jump ship to Itanium, and in the
    meantime they only respun the R10000 into R12000, R14000, R16000.

    Sun kept SPARC development going, but made a different mistake, by
    spreading their development resources over too many projects. The ones
    that succeeded did so too slowly, and they fell behind.

    Intel, HP, SGI and AMD went to OoO in 1995/1996, Alpha in 1998, Power
    at the latest with Power3 in 1998, only Sun kept doing in-order stuff,
    and took until 2011 to finally get an OoO CPU out the door in the form
    of the SPARC T4 (their Rock project was also OoO, but was canceled).
    They also had relatively low clock rates before that (which changed
    with the SPARC T5). Fujitsu managed better, introducing the OoO
    SPARC64 V in 2002, and also with competitive clock rates.

    Also, Linux ate
    their web-infrastructure market rather quickly.

    Well, SPARC survived much longer than most others, despite being
    technically a lot behind.

    Power still survives, maybe only because it has a common basis with
    iSeries (or whatever it is called now). Similarly, s390x survives
    because of its software legacy.

    Linux could not have had the success it did without the large range of powerful and cheap hardware designed to run Windows.

    It was first developed on a 386, and many of the early co-developers
    also had IA-32 machines. But the 386 certainly was not designed to
    run Windows. The 386 project was finished before Windows 1.0 was
    released in November 1985, and nobody used Windows 1.0 or 2.0, so why
    would anybody design a processor for those? Windows only became
    popular with 3.0 in 1990 (after the release of the 486, which was
    therefore not designed for Windows, either). When I bought my first
    PC (with a 486) in 1993, it ran DOS (for games) and Linux (for
    everything else).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Fri Oct 4 19:44:40 2024
    Anton Ertl <[email protected]> schrieb:

    Alpha suffered before. The 21264 was late, and did not keep up in the
    clock race.

    https://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf
    gives the clock rate as varying between 466 and 600 MHz, and
    Wikipedia gives the clock frequency of the Pentium Pro as between
    150 and 200 MHz. The Pentium II Overdrive, according to Wikipedia,
    had up to 333 MHz.

    Is this information wrong?

  • From John Dallman@21:1/5 to D'Oliveiro on Fri Oct 4 21:53:00 2024
    In article <vdnef0$3uaeh$[email protected]>, [email protected]d (Lawrence
    D'Oliveiro) wrote:

    On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

    Given all of IBM's missteps, it's mildly surprising they got that
    one right. Even a stopped clock is right once a day ...

    IBM doesn't often repeat a mistake. They've made all the ordinary ones,
    so nowadays they usually invent new ones.

    SGI decided to embrace the platform that was eating their market,
    and try to sell Windows NT boxes. Trouble is, those NT boxes, while
    only a fraction of the cost of an IRIX-based product, still cost
    about 3× what other NT machines were going for.

    SGI had a lengthy internal conflict about Windows NT. One group of pro-NT people left and founded NetPower, whose idea was to build really fast workstations running NT on MIPS. We had one for a while, and were
    persuading Microsoft to fix a bug from the MIPSPro code generator for the
    third time (we'd also had it on DEC MIPS/Ultrix, and SGI Irix) when the
    Pentium Pro was released, and NetPower suddenly went very quiet.

    Then there were the SGI Visual Workstations, which ran NT on x86. The
    first generation of them were quite nice, but needed a very custom HAL,
    and hence couldn't be upgraded to later versions of Windows once SGI
    abandoned them.

    The later generations were ordinary PCs - the one I had as a deskside for
    a while was made by Mitsubishi - with an Nvidia graphics card. The only
    SGI added value was their OpenGL driver, and that didn't seem to justify
    the price if you were buying them.

    By this time, SGI had a department of downsizing, whose job was to get
    rid of departments and sites. Being an American company, this department
    fought for power and budget share, and nobody inside the company seemed
    to think that this would spell doom for SGI.

    They could still have sold SPARC hardware running Linux. I can
    remember comments saying Linux ran better on that hardware than
    Sun's own SunOS/Solaris did.

    They would not have faced up to that. There was an interesting incident
    with Solaris on x86. Since the Linux and Solaris kernel interfaces are
    somewhat similar, somebody at Sun decided to try making the Solaris
    kernel capable of acting as a Linux kernel, so that they could run a
    Linux userland and applications on the same machine as the Solaris
    userland and applications.

    So they hired some Linux people, but they didn't get good ones. A year
    later, their Linux people came back to Sun with a huge set of patches
    that amounted to patching a lot of the Linux kernel into Solaris, and
    didn't do it at all well. The Solaris kernel people looked at it a bit
    and said "Hell, no! This will destabilise Solaris!" They weren't
    exaggerating.

    So that year was wasted, and the project was restarted with some of the
    Solaris people involved, to explain how their kernel worked. Quite a
    while later you could install the 32-bit Red Hat Enterprise Linux 3.0
    userland and most applications would run, but not all. This was not a
    success, and was dropped.

    John

  • From Scott Lurndal@21:1/5 to John Dallman on Fri Oct 4 21:37:25 2024
    [email protected] (John Dallman) writes:
    In article <vdnef0$3uaeh$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

    On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

    Then there were the SGI Visual Workstations, which ran NT on x86. The
    first generation of them were quite nice, but needed a very custom HAL,
    and hence couldn't be upgraded to later versions of Windows once SGI abandoned them.

    I left in early 2000, just after it was introduced. I was using
    a 2P Octane at the time (with the 24" Sony monitor).


    By this time, SGI had a department of downsizing, whose job was to get
    rid of departments and sites. Being an American company, this department fought for power and budget share, and nobody inside the company seemed
    to think that this would spell doom for SGI.

    They [SGI ed.] could still have sold SPARC hardware running Linux. I can
    remember comments saying Linux ran better on that hardware than
    Sun's own SunOS/Solaris did.

    They would not have faced up to that.

    Some of the SGI engineers were fond of loud noises, and one day took
    a Sun pizza box into the parking lot with some M-80s. Got
    a visit a bit later from the Secret Service, as AF-1 was next
    door at Moffett that day.

  • From Anton Ertl@21:1/5 to Thomas Koenig on Fri Oct 4 21:48:12 2024
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:

    Alpha suffered before. The 21264 was late, and did not keep up in the
    clock race.

    https://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf
    gives the clock rate as varying between 466 and 600 MHz, and
    Wikipedia gives the clock frequency of the Pentium Pro as between
    150 and 200 MHz. The Pentium II Overdrive, according to Wikipedia,
    had up to 333 MHz.

    Is this information wrong?

    No, but it misses context: The Pentium Pro was available in late 1995.
    The 21264 was officially available in 1998, but when we ordered a
    machine with a 500MHz 21264 (and needed it delivered before the end of
    the year for budget reasons), they delivered a machine with a 21164a,
    and then in the next year upgraded it to the 21264 (which probably
    meant replacing the motherboard, not just the CPU package).

    Intel released a 450MHz Pentium II in 1998, and the 500MHz Pentium III
    on February 28, 1999. AMD released the 600MHz Athlon on June 23,
    1999, and won the GHz race with the 1000MHz Athlon on March 6, 2000,
    with Intel's Pentium III following on March 8. Meanwhile, the Alphas
    could not keep up in MHz numbers, but I have no firm dates, only
    memories from that time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Oct 4 22:49:26 2024
    On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

    George Neuner <[email protected]> writes:
    <snipping>

    I don't agree with all of that, however. E.g., when discussing a VAX instruction similar to IA-32's REP MOVS, he considers it to be a big advantage that the operands of REP MOVS are in registers. That
    appears wrong to me: either you have to keep REP MOVS in decoding (and
    thus stop decoding any later instructions) until you know the value of
    that register coming out of the OoO engine, making REP MOVS a mostly serializing instruction, or you have separate OoO logic for REP
    MOVS that keeps generating loads and stores inside the OoO engine. If
    you have the latter in the VAX, it does not make much difference whether
    the operand is in a register or in memory. The possibility of trapping
    during REP MOVS (or the VAX variant) complicates things, though: the
    first part of the REP MOVS has to be committed, and the registers
    written to the architectural state, and then execution has to start
    again with the REP MOVS. That does not seem much harder on the VAX to me, however.

    My 66000 has a MemMove instruction consisting of a 1 word instruction,
    that leaves DECODE and enters into one MEMory unit, where it proceeds
    to AGEN and Read, AGEN and Write, leaving the rest of the function
    units proceeding to whatever is next.

    One thing I did different, here, none of the 3 registers is modified,
    yet I retain the ability to take exception and re-play the instruction
    from where it left off {in state never visible to the instruction
    stream except via DECODE stage.}

    - anton

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Fri Oct 4 22:54:55 2024
    On Fri, 4 Oct 2024 19:36:41 +0000, Chris M. Thomasson wrote:

    On 10/3/2024 11:36 PM, Chris M. Thomasson wrote:
    On 10/3/2024 9:23 PM, George Neuner wrote:
    On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

    If the RISC companies failed to keep up, they only have themselves to blame.

    That’s all past history, anyway. RISC very much rules today, and it
    is x86
    that is struggling to keep up.

    You are, of course, aware that the complex "x86" instruction set is an
    illusion and that the hardware essentially has been a load-store RISC
    with a complex decoder on the front end since the Pentium Pro landed
    in 1995.

    Yeah. Wrt memory barriers, one is allowed to release a spinlock on "x86"
    with a simple store.

    The fact that one can release a spinlock using a simple store means that
    it's basically load-acquire release-store.

    So a load will do a load then have an implied acquire barrier.

    A store will do an implied release barrier then perform the store.

    How does the store know it needs to do this when the locking
    instruction is more than a pipeline depth away from the
    store release ?? So, Locked LD (or something) happens at
    1,000,000 cycles, and the corresponding store happens at
    10,000,000 cycles (9,000,000 locked).

    This release behavior is okay for releasing a spinlock with a simple
    store, MOV.

    It may be OK to SW but it causes all kinds of grief to HW.
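
    For readers who want to see the software side of the "release with a simple store" point, here is a minimal C11 sketch. It is an illustration under assumptions, not code from anyone in the thread: the demo_spinlock type and the demo_* function names are hypothetical, and the remarks about which x86 instructions a compiler emits describe typical behaviour rather than a guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock; false means unlocked. */
struct demo_spinlock {
    atomic_bool locked;
};

/* Try once to take the lock; returns true if we now own it.
 * The atomic exchange uses acquire ordering; on x86 this is
 * typically an XCHG, which is implicitly locked. */
static bool demo_trylock(struct demo_spinlock *l)
{
    return !atomic_exchange_explicit(&l->locked, true,
                                     memory_order_acquire);
}

/* Spin until the lock is taken. */
static void demo_lock(struct demo_spinlock *l)
{
    while (!demo_trylock(l)) {
        /* busy-wait; a real lock would add a pause/backoff here */
    }
}

/* Release with a store that has release ordering. On x86, where
 * ordinary stores already have release semantics, compilers
 * generally emit a plain MOV for this. */
static void demo_unlock(struct demo_spinlock *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

    The unlock path contains no read-modify-write operation; the ordering requirement is expressed only through memory_order_release, which is the software view of the property being debated above.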

  • From Scott Lurndal@21:1/5 to [email protected] on Fri Oct 4 23:30:03 2024
    [email protected] (MitchAlsup1) writes:
    On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

    George Neuner <[email protected]> writes:
    <snipping>


    My 66000 has a MemMove instruction consisting of a 1 word instruction,
    that leaves DECODE and enters into one MEMory unit, where it proceeds
    to AGEN and Read, AGEN and Write, leaving the rest of the function
    units proceeding to whatever is next.

    One thing I did different, here, none of the 3 registers is modified,
    yet I retain the ability to take exception and re-play the instruction
    from where it left off {in state never visible to the instruction
    stream except via DECODE stage.}

    What happens if the exception handler reschedules the CPU to
    a different task before returning from the exception?

  • From George Neuner@21:1/5 to Anton Ertl on Sat Oct 5 00:13:24 2024
    On Fri, 04 Oct 2024 07:05:34 GMT, [email protected]
    (Anton Ertl) wrote:

    George Neuner <[email protected]> writes:
    You are, of course, aware that the complex "x86" instruction set is an illusion and that the hardware essentially has been a load-store RISC
    with a complex decoder on the front end since the Pentium Pro landed
    in 1995.

    Repeating nonsense does not make it any truer, and this nonsense has
    been repeated since at least the Pentium Pro (1995), maybe already
    since the 486 (1989). CISC and RISC are about the instruction set,
    not about the implementation. And even if you look at the
    implementation, it's not true: The P6 has microinstructions that are
    ~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
    The K7 has load-store microinstructions; RISCs don't have that.

    Anton, you know very well that the hardware does not execute the "x86" instruction set but only /emulates/ it. The decoder translates x86 instructions into sequences of microinstructions that perform the
    equivalent operations. The fact that some simple instructions
    translate one to one does not change this.


    In more recent CPUs, AMD tends to work with macro-instructions between
    the decoder and the reorder buffer (i.e., in the part that in the
    Pentium Pro may have been used as the justification for the RISC
    claim); these macro instructions are load-and-op and read-modify-write
    instructions.

    John Mashey has written about the difference between CISC and RISC
    repeatedly <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>, and
    he gives good criteria for classifying instruction sets as RISC or
    CISC, and by his criteria the 80286 and IA-32 instruction sets of the
    Pentium Pro clearly both are CISCs. I have recently
    <[email protected]> used his criteria on
    instruction sets that Mashey did not classify (mostly because they
    were done after his table), and by these criteria AMD64 is clearly a
    CISC, while ARM A64 and RISC-V are clearly RISCs.

    In searching for whether he has written something specific about
    IA-32, I found <https://yarchive.net/comp/vax.html>, which is an
    earlier instance of the recent discussion of whether it would have
    been better for DEC to stick with VAX, do an OoO implementation and
    extend the architecture to 64 bits, like Intel has done:
    <https://yarchive.net/comp/vax.html>. He also discusses the problems
    of IA-32 there, but mainly in pointing out how much smaller they were
    than the VAX ones.

    I don't agree with all of that, however. E.g., when discussing a VAX
    instruction similar to IA-32's REP MOVS, he considers it to be a big
    advantage that the operands of REP MOVS are in registers. That
    appears wrong to me; you either have to keep REP MOVS in decoding (and
    thus stop decoding any later instructions) until you know the value of
    that register coming out of the OoO engine, making REP MOVS a mostly
    serializing instruction. Or you have a separate OoO logic for REP
    MOVS that keeps generating loads and stores inside the OoO engine. If
    you have the latter in the VAX, it does not make much difference if
    the operand is in a register or in memory. The possibility of trapping
    during REP MOVS (or the VAX variant) complicates things, though: the
    first part of the REP MOVS has to be committed, and the registers
    written to the architectural state, and then execution has to start
    again with the REP MOVS. Does not seem much harder on the VAX to me,
    however.
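
    The restart behaviour described above (commit the first part, write the
    registers back, then start the same instruction again) can be sketched
    as a toy model. This assumes byte-wise REP MOVSB with the direction flag
    clear; the register dict, the Fault class and the fault_at_addr
    parameter are made up for illustration and model no real API.

```python
class Fault(Exception):
    """Simulated page fault raised part-way through the copy."""


def rep_movsb(mem, regs, fault_at_addr=None):
    """Toy REP MOVSB (DF=0): the registers are updated after every byte."""
    while regs["rcx"] != 0:
        if regs["rsi"] == fault_at_addr:
            # The registers already describe the remaining work, so the
            # handler can simply return to the same instruction.
            raise Fault(regs["rsi"])
        mem[regs["rdi"]] = mem[regs["rsi"]]
        regs["rsi"] += 1
        regs["rdi"] += 1
        regs["rcx"] -= 1


mem = bytearray(b"abcdefgh" + bytes(8))
regs = {"rsi": 0, "rdi": 8, "rcx": 8}
try:
    rep_movsb(mem, regs, fault_at_addr=5)   # first part done, then fault
except Fault:
    regs_at_fault = dict(regs)              # architectural state at the trap
rep_movsb(mem, regs)                        # re-execute the same instruction
```

    In this model the only state needed for the restart is in the three
    registers, which is the property Mashey counts as an advantage.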

    - anton

  • From Anton Ertl@21:1/5 to George Neuner on Sat Oct 5 08:01:23 2024
    George Neuner <[email protected]> writes:
    On Fri, 04 Oct 2024 07:05:34 GMT, [email protected]
    (Anton Ertl) wrote:

    George Neuner <[email protected]> writes:
    You are, of course, aware that the complex "x86" instruction set is an
    illusion and that the hardware essentially has been a load-store RISC
    with a complex decoder on the front end since the Pentium Pro landed
    in 1995.

    Repeating nonsense does not make it any truer, and this nonsense has
    been repeated since at least the Pentium Pro (1995), maybe already
    since the 486 (1989). CISC and RISC are about the instruction set,
    not about the implementation. And even if you look at the
    implementation, it's not true: The P6 has microinstructions that are
    ~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
    The K7 has load-store microinstructions; RISCs don't have that.

    Anton, you know very well that the hardware does not execute the "x86"
    instruction set but only /emulates/ it. The decoder translates x86
    instructions into sequences of microinstructions that perform the
    equivalent operations. The fact that some simple instructions
    translate one to one does not change this.

    I know that the hardware does not execute the "x86" instruction set,
    because there is no "x86" instruction set. There is the 80286
    instruction set, the IA-32 instruction set, and the AMD64 instruction
    set (and the boundary between 286 and IA-32 is squishy, but that
    between those and AMD64 is hard).

    As for the point you are trying to make, I know quite a bit about how
    the instruction execution is implemented on various IA-32 and AMD64 implementations. Whether you call it execution or emulation, IA-32
    and AMD64 are still the instruction sets of all of them, and there is
    no way to execute (or emulate) other instruction sets, and no way to
    run programs written in macro-ops, micro-ops, ROPs, or whatever they
    may be called. That's even true for the Transmeta implementations
    (although doing other instruction sets would have been possible there
    and IIRC was demonstrated once). Moreover, these
    implementation-specific things change from one implementation to the
    next, and that includes the implementations by Transmeta.

    For the 6502 or the MIPS R2000 we don't consider the instruction set
    to be emulated, either, and they have a decoder that translates the instructions into sequences of signals to various units (i.e., microinstructions), too.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Sat Oct 5 15:06:39 2024
    "Paul A. Clayton" <[email protected]> writes:
    On 10/4/24 7:30 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

    George Neuner <[email protected]> writes:
    <snipping>


    My 66000 has a MemMove instruction consisting of a 1 word instruction,
    that leaves DECODE and enters into one MEMory unit, where it proceeds
    to AGEN and Read, AGEN and Write, leaving the rest of the function
    units proceeding to whatever is next.

    One thing I did different, here, none of the 3 registers is modified,
    yet I retain the ability to take exception and re-play the instruction
    from where it left off {in state never visible to the instruction
    stream except via DECODE stage.}

    What happens if the exception handler reschedules the CPU to
    a different task before returning from the exception?

    I ass-me that like the PREDicate instruction modifier, there is
    _implicit_ state that is saved on context switches. I.e., there is
    extra storage space in the context store for such data.

    My 66000 uses hardware context saving, so software can be ignorant
    of such (aside from reserving enough storage).

    I got the impression that it wasn't so much context saving,
    as context switching (i.e. storage per 'process/thread');
    yet if that storage needs to be saved to DRAM on any
    exception, just in case the OS switches to a different
    thread context, then I don't see how he can get his
    claimed context switch times.

  • From John Dallman@21:1/5 to Anton Ertl on Sat Oct 5 17:20:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    [email protected] (John Dallman) writes:
    Linux could not have had the success it did without the large
    range of powerful and cheap hardware designed to run Windows.

    It was first developed on a 386, and many of the early co-developers
    also had IA-32 machines. But the 386 certainly was not designed to
    run Windows. The 386 project was finished before Windows 1.0 was
    released in November 1985, and nobody used Windows 1.0 or 2.0, so
    why would anybody design a processor for those? ...

    OK, "designed to run MS-DOS, and later Windows"?

    John

  • From Anton Ertl@21:1/5 to John Dallman on Sat Oct 5 17:10:47 2024
    [email protected] (John Dallman) writes:
    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:

    [email protected] (John Dallman) writes:
    Linux could not have had the success it did without the large
    range of powerful and cheap hardware designed to run Windows.

    It was first developed on a 386, and many of the early co-developers
    also had IA-32 machines. But the 386 certainly was not designed to
    run Windows. The 386 project was finished before Windows 1.0 was
    released in November 1985, and nobody used Windows 1.0 or 2.0, so
    why would anybody design a processor for those? ...

    OK, "designed to run MS-DOS, and later Windows"?

    The 286 protected mode was certainly not designed for MS-DOS, and the
    386 paging of linear addresses was certainly not designed for DOS,
    either.

    The virtual 8086 mode of the 386 was used by Windows/386 (starting
    already in 1987). Was virtual 8086 mode designed into the 386
    specifically for Windows? I doubt it, and AFAIK it is also used by
    DOSEMU under Linux, and I expect that you can run, e.g., CP/M-86 on
    it. It seems to be a good idea when designing a CPU like the 386
    where backwards compatibility with the 8086 was a requirement.

    Can you point to a specific feature of Intel CPUs that you think is specifically designed in for DOS or Windows? Even the A20-gate is a
    general backwards-compatibility mechanism that may benefit real-mode
    software other than DOS.

    It seems to me that the 286 protected mode was a continuation of the
    iAPX432 ideas, which predated DOS, and that the 386 paging imitated
    the virtual-memory mainstream of bigger computing platforms at the
    time, such as the VAX and S/370.

    And the success of IA-32 and then AMD64 at replacing the RISCs is
    exactly because it was not some DOS-centric architecture, but also
    provided features needed by other OSs like 386/ix (later Interactive
    Unix, which I used myself in 1990 or so), Xenix, Linux, Windows NT,
    Solaris, the various BSDs, and others. And the computers built around
    these CPUs also provided these features.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Oct 5 22:35:06 2024
    On Fri, 4 Oct 2024 23:30:03 +0000, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:

    What happens if the exception handler reschedules the CPU to
    a different task before returning from the exception?

    There are 2 pointers and a count==index.

    When a context switch happens (interrupt or exception)
    the current count is saved in a "free" register in thread
    header.

    When control returns, and MM is executed for a second time
    this saved count is used instead of the original operand
    count.
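
    A minimal sketch of that replay scheme, under stated assumptions: the
    saved count is modelled as a progress index kept in a per-thread header,
    and the three operands are never modified. ThreadHeader, mm_index,
    mem_move and fault_at are hypothetical names for illustration only, not
    My 66000 definitions, and the regions are assumed not to overlap.

```python
class PageFault(Exception):
    """Simulated exception raised part-way through a move."""


class ThreadHeader:
    """Hypothetical per-thread state; mm_index stands in for the 'free' register."""

    def __init__(self):
        self.mm_index = 0


def mem_move(mem, dst, src, count, header, fault_at=None):
    """Copy count bytes from src to dst, resuming at header.mm_index.

    dst, src and count are only read, never written, so replaying the
    instruction with the original operands is safe.
    """
    i = header.mm_index
    while i < count:
        if fault_at is not None and i == fault_at:
            header.mm_index = i          # progress saved at the exception
            raise PageFault(i)
        mem[dst + i] = mem[src + i]
        i += 1
    header.mm_index = 0                  # finished: clear for the next move


mem = bytearray(b"abcdefgh" + bytes(8))
hdr = ThreadHeader()
try:
    mem_move(mem, 8, 0, 8, hdr, fault_at=5)
except PageFault:
    saved = hdr.mm_index                 # another thread could run here
mem_move(mem, 8, 0, 8, hdr)              # replay with unchanged operands
```

    Because the progress lives in the thread header rather than in the
    registers, rescheduling between the fault and the replay does not lose
    it, as long as the header travels with the thread.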

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Oct 5 22:42:19 2024
    On Sat, 5 Oct 2024 15:06:39 +0000, Scott Lurndal wrote:

    "Paul A. Clayton" <[email protected]> writes:
    On 10/4/24 7:30 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

    George Neuner <[email protected]> writes:
    <snipping>


    My 66000 has a MemMove instruction consisting of a 1 word instruction,
    that leaves DECODE and enters into one MEMory unit, where it proceeds
    to AGEN and Read, AGEN and Write, leaving the rest of the function
    units proceeding to whatever is next.

    One thing I did different, here, none of the 3 registers is modified,
    yet I retain the ability to take exception and re-play the instruction
    from where it left off {in state never visible to the instruction
    stream except via DECODE stage.}

    What happens if the exception handler reschedules the CPU to
    a different task before returning from the exception?

    I ass-me that like the PREDicate instruction modifier, there is
    _implicit_ state that is saved on context switches. I.e., there is
    extra storage space in the context store for such data.

    My 66000 uses hardware context saving, so software can be ignorant
    of such (aside from reserving enough storage).

    I got the impression that it wasn't so much context saving,
    as context switching (i.e. storage per 'process/thread');

    Thread headers and thread register files are treated as
    a write back cache. Core knows where it originally
    got the RF and thus remembers where to put it back.

    This state area is in the thread control block. So, there
    is no way it can "not be there". An OS will not start a
    process until there is enough memory to contain all
    thread state {and .text, .data, .bss, ...}

    yet if that storage needs to be saved to DRAM on any
    exception, just in case the OS switches to a different
    thread context, then I don't see how he can get his
    claimed context switch times.

    For interrupts, core starts fetching ISR thread state
    before negotiating for an interrupt has finished. Most
    of the time, the new state is arriving about when it is
    known the interrupt will be "taken" by this core. Old
    state is pushed out as new state arrives, then proceeds
    to where it lives long term.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 00:43:56 2024
    On Fri, 04 Oct 2024 15:07:17 GMT, Anton Ertl wrote:

    Power still survives, maybe only because it has a common basis with
    iSeries (or whatever it is called now).

    As I understand it, iSeries is the emulation of the old AS/400 on POWER processors. And AS/400 was the unification of the older System/38 with the System/34? System/36? lines.

    System/38 (or AS/400, or iSeries) has/had this interestingly unusual architecture which builds database features right into the OS kernel, so
    that they can be used everywhere. And it also uses capabilities as an alternative to the traditional privilege-mode hierarchy. Neither of these
    ideas says much for performance, but they still suggest some interesting possibilities, nonetheless.

    Native POWER is, I think, called pSeries. It continues to sell in its own
    right because it offers high performance--high enough to earn a few
    ongoing spots near the top of the Top500 supercomputer list.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 00:47:04 2024
    On Sat, 05 Oct 2024 17:10:47 GMT, Anton Ertl wrote:

    The virtual 8086 mode of the 386 was used by Windows/386 (starting
    already in 1987). Was virtual 8086 mode designed into the 386
    specifically for Windows? I doubt it ...

    Nevertheless, I think this was the “killer app” that made Windows actually useful to the masses: instead of having to wait for developers to create
    apps written for Windows (which they were reluctant to do, as long as the
    users didn’t want to buy Windows because there weren’t apps available for it ...), here was a feature they could use “out of the box”, to multitask their existing DOS apps, without any need for changes to application code.

    And this led to a nice growth in the number of Windows installations,
    which in turn created a market for actual Windows-specific apps.

  • From Lawrence D'Oliveiro@21:1/5 to Chris M. Thomasson on Sun Oct 6 00:55:31 2024
    On Thu, 3 Oct 2024 23:36:12 -0700, Chris M. Thomasson wrote:

    On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    That’s all past history, anyway. RISC very much rules today, and it is
    x86 that is struggling to keep up.

    You are, of course, aware that the complex "x86" instruction set is an
    illusion and that the hardware essentially has been a load-store RISC
    with a complex decoder on the front end since the Pentium Pro landed in
    1995.

    Of course, and that complexity (and consequent expense) is part of the struggle. Looking at Intel’s current financial woes, it is clearly not
    being as successful at that as it has been in the past.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 01:03:00 2024
    On Fri, 04 Oct 2024 07:05:34 GMT, Anton Ertl wrote:

    CISC and RISC are about the instruction set, not about
    the implementation. And even if you look at the implementation, it's
    not true: The P6 has microinstructions that are ~100 bits long, whereas
    RISCs have 32-bit and 16-bit instructions. The K7 has load-store microinstructions; RISCs don't have that.

    Intel I think tried to spread this idea of a “RISC core” somewhere inside the labyrinthine complexity of its Pentium-and-later chips, in the hope
    that some of the aura attached to the term “RISC” would rub off on its products.

    And quite a few people fell for it.

    ... ARM A64 and RISC-V are clearly RISCs.

    ARM and some other RISC architectures (e.g. POWER) do somewhat stretch the
    term though, don’t they, when they add that combinatorial explosion of operand types in their short-vector instructions.

    RISC-V has consciously avoided this, by going back to the older long-
    vector idea, like Seymour Cray used in his machines.

    The possibility of trapping during
    REP MOVS (or the VAX variant) complicates things, though: the first part
    of the REP MOVS has to be committed, and the registers written to the architectural state, and then execution has to start again with the REP
    MOVS. Does not seem much harder on the VAX to me, however.

    This is why the VAX has the “FPD” (“First Part Done”) processor status bit.

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Oct 6 09:09:53 2024
    Anton Ertl <[email protected]> schrieb:
    In any case, certainly for the stuff I do I see no reason why I would consider, much less recommend buying a Power machine these days.

    If you do not want backdoors in your system, you might want to
    consider a Talos II. (I'm told that various representatives of
    government agencies with vaguely funny-sounding names have been
    seen, and talked to, at OpenPOWER conferences).

    Not Power 10 though, that has an unexplained binary blob.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 07:18:59 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    Native POWER is, I think, called pSeries. It continues to sell in its own
    right because it offers high performance--high enough to earn a few
    ongoing spots near the top of the Top500 supercomputer list.

    Looking at the June 2024 edition, I see Summit as the highest-ranked
    system with Power CPUs, and they are Power 9. So if your claim was
    true that the Top500 supercomputer list reflects CPU performance,
    Power 9 would beat Power 10 in CPU performance, and EPYC, Xeon,
    Fujitsu A64FX and Nvidia Grace are more powerful CPUs. However, in
    most supercomputers (including Summit) the GPGPUs provide the bulk of
    the FLOPS that are measured in the Top 500, so looking at the Top 500
    is misleading for determining CPU performance.

    So let's look at SPEC CPU instead. For CPU2017, I see only four
    entries from IBM, all for the Integer Rate metric, two with Power 9
    and two with Power 10 CPUs. The highest of those results is:

    base peak
    1700 2170 IBM Power E1080

    That's with 8 sockets, 120 cores, and 960 threads. Looking at other
    8-socket machines, I find

    base peak
    3820 3880 BullSequana SH80

    That's with 8 sockets, 480 cores, and 960 threads (similar results
    from Fujitsu PRIMERGY RX8770 M7, HPE Compute Scale-up Server 3200,
    Inspur TS860G7 and Supermicro SuperServer SYS-681E-TR, all done with
    Xeon Platinum 8490H CPUs). And if you go for maximum performance,
    there's a 16-socket Xeon machine from Bull with base=7400, peak=7450.

    Alternatively, you can instead buy a 2-socket system with similar
    performance to the 8-socket IBM Power E1080:

    base peak
    1950 2140 ASUS RS720A-E12-RS12

    and similar results from other systems with the EPYC 9754.

    https://www.spec.org/cpu2017/results/res2021q3/cpu2017-20210814-28679.html
    https://www.spec.org/cpu2017/results/res2024q3/cpu2017-20240701-43944.html
    https://www.spec.org/cpu2017/results/res2023q2/cpu2017-20230522-36617.html

    Admittedly, IBM extracts the most performance from each core, but with
    only 15 cores per CPU (where others have 128), that is no longer that impressive. Nevertheless, neither machines with the Ryzen 7950X nor
    with the Xeon-E2488 reach the performance per core (and no results for
    the Ryzen 9950X have been submitted yet), so it looks like Power 10
    has a really good multi-threading implementation.
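
    The per-core comparison can be checked with a little arithmetic on the
    figures quoted above. This is only a rough sketch: the scores and core
    counts are the ones in this post, except that 256 cores for the ASUS
    system is an assumption (2 sockets x 128 cores per EPYC 9754), and the
    names RESULTS and per_core are made up for illustration.

```python
# (base, peak, cores) as quoted in this post; 256 cores for the EPYC
# system is assumed from 2 sockets x 128 cores.
RESULTS = {
    "IBM Power E1080":      (1700, 2170, 120),
    "BullSequana SH80":     (3820, 3880, 480),
    "ASUS RS720A-E12-RS12": (1950, 2140, 256),
}


def per_core(name):
    """Return (base per core, peak per core) for one system."""
    base, peak, cores = RESULTS[name]
    return base / cores, peak / cores


for name in RESULTS:
    base_pc, peak_pc = per_core(name)
    print(f"{name:22s} base/core {base_pc:5.2f}  peak/core {peak_pc:5.2f}")
```

    On these numbers the Power E1080 comes out at roughly 14 base points
    per core against about 8 for the Xeon and EPYC systems.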

    The fact that IBM has not submitted results for Power for SPEC CPU
    2017 for (Int or FP) Speed or FP Rate results is an admission that
    their numbers there are even less impressive.

    In any case, certainly for the stuff I do I see no reason why I would
    consider, much less recommend buying a Power machine these days. My
    guess is that the major reasons for buying pSeries machines these days
    are legacy software and IBM salesmanship.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Sun Oct 6 12:40:00 2024
    On Sun, 06 Oct 2024 07:18:59 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    Native POWER is, I think, called pSeries. It continues to sell in
    its own right because it offers high performance--high enough to
    earn a few ongoing spots near the top of the Top500 supercomputer
    list.

    Looking at the June 2024 edition, I see Summit as the highest-ranked
    system with Power CPUs, and they are Power 9. So if your claim was
    true that the Top500 supercomputer list reflects CPU performance,
    Power 9 would beat Power 10 in CPU performance, and EPYC, Xeon,
    Fujitsu A64FX and Nvidia Grace are more powerful CPUs. However, in
    most supercomputers (including Summit) the GPGPUs provide the bulk of
    the FLOPS that are measured in the Top 500, so looking at the Top 500
    is misleading for determining CPU performance.


    Yes, in almost all top entries in the Top500 the compute muscle is GPGPU,
    with CPUs playing the role of the glorified Peripheral Processors of
    ancient supercomputers. That applies to POWER, to Xeons, and to EPYC.

    However, there are two exceptions: Fugaku (#4, Fujitsu A64FX) and Sunway
    TaihuLight (#13, Sunway SW26010).

    The majority of GPUs in the list are Nvidia of various generations, but
    #1 (US DOE Frontier) uses AMD GPUs and #2 (US DOE Aurora) uses Intel
    GPUs.

    BTW, IBM Summit (Nvidia GV100 + IBM POWER9), despite still being pretty
    high on the Top500 list, is going to be retired next month.
    I wonder if Sunway TaihuLight is aging better.


    So let's look at SPEC CPU instead. For CPU2017, I see only four
    entries from IBM, all for the Integer Rate metric, two with Power 9
    and two with Power 10 CPUs. The highest of those results is:

    base peak
    1700 2170 IBM Power E1080

    That's with 8 sockets, 120 cores, and 960 threads. Looking at other
    8-socket machines, I find

    base peak
    3820 3880 BullSequana SH80

    That's with 8 sockets, 480 cores, and 960 threads (similar results
    from Fujitsu PRIMERGY RX8770 M7, HPE Compute Scale-up Server 3200,
    Inspur TS860G7 and Supermicro SuperServer SYS-681E-TR, all done with
    Xeon Platinum 8490H CPUs). And if you go for maximum performance,
    there's a 16-socket Xeon machine from Bull with base=7400, peak=7450.

    Alternatively, you can instead buy a 2-socket system with similar
    performance to the 8-socket IBM Power E1080:

    base peak
    1950 2140 ASUS RS720A-E12-RS12

    and similar results from other systems with the EPYC 9754.

    https://www.spec.org/cpu2017/results/res2021q3/cpu2017-20210814-28679.html
    https://www.spec.org/cpu2017/results/res2024q3/cpu2017-20240701-43944.html
    https://www.spec.org/cpu2017/results/res2023q2/cpu2017-20230522-36617.html

    Admittedly, IBM extracts the most performance from each core, but with
    only 15 cores per CPU (where others have 128), that is no longer that impressive.

    "Core" in POWER9 is sort-of cheating. For nearly all practical purposes
    what they call 'core' is a couple of cores with just a little bit of
    resource sharing between halves when running in single-thread mode.
    Just enough to have a legalistic justification for being called a 'core'.
    I don't know if POWER10 is similar or different in that regard.

    Nevertheless, neither machines with the Ryzen 7950X nor
    with the Xeon-E2488 reach the performance per core (and no results for
    the Ryzen 9950X have been submitted yet), so it looks like Power 10
    has a really good multi-threading implementation.

    The fact that IBM has not submitted results for Power for SPEC CPU
    2017 for (Int or FP) Speed or FP Rate results is an admission that
    their numbers there are even less impressive.


    That's the most likely explanation, but another one is that it is a sort
    of internal policy no matter what.
    IIRC, they didn't publish non-rate scores for POWER7 either, even though,
    according to independent measurements at the point of introduction,
    POWER7 single-threaded performance was in the same ballpark as the best
    Intel offerings and easily ahead of the best AMD.

    In any case, certainly for the stuff I do I see no reason why I would consider, much less recommend buying a Power machine these days. My
    guess is that the major reasons for buying pSeries machines these days
    are legacy software and IBM salesmanship.

    - anton

    I think that if you are running Oracle DB Enterprise Edition, where the
    per-core software license is the most expensive part, then there could
    be an economic reason for preferring POWER9 or 10 over Intel or AMD.
    But that's just a guess.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 08:40:55 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    Intel I think tried to spread this idea of a "RISC core" somewhere inside
    the labyrinthine complexity of its Pentium-and-later chips, in the hope
    that some of the aura attached to the term "RISC" would rub off on its
    products.

    I am not sure it was Intel, but certainly a number of people on the
    net did write this, and it may already have started with the 486, and
    my guess is that even an explanation from Intel that just explained the
    implementation of the 486 without making any claim that the 486 is a
    RISC would have led to that result. It's just a thing that people
    rooting for Intel like to believe, so in retelling the implementation explanation, they eventually settle down to "the 486 is a RISC
    internally", and eventually this becomes "the complex 'x86'
    instruction set is an illusion and that the hardware essentially has
    been a load-store RISC".

    AMD are easily provable culprits in this scam: They call their
    micro-ops "ROPs", for RISC ops.

    BTW, this page looks at the microcode of different IA-32
    implementations:
    <https://fanael.github.io/is-x86-risc-internally.html>

    And quite a few people fell for it.

    Yes. Apparently it's something people want to believe in.

    ... ARM A64 and RISC-V are clearly RISCs.

    ARM and some other RISC architectures (e.g. POWER) do somewhat stretch the
    term though, don’t they, when they add that combinatorial explosion of
    operand types in their short-vector instructions.

    Number of operand types never has been a criterion in any of the RISC definitions I have seen, nor the number of instructions (although some
    people like to go by that).

    As for ARM and Power, from
    <[email protected]>:


    CPU  Age     3a  3b  3c   3d  4a  4b  5a  5b  6a  6b  # ODD
         (1991)
    RULE <6      =1  =4  <5   =0  =0  =1  <2  =1  >4  >3
    G1   1       1   4   4    0   0   1   1   1   5   5   -     IBM RS/6000
         6+      1   4   7+   0   0   1   0   1   4+  -   3/8   ARM1
         -12     2+  4   7+   0   0   1   1   2+  4+  5   4/7   ARMv7 T32
         -22     1   4   15+  0   0   1   1   2+  5   5   2/9   ARM A64

    So for John Mashey the RS/6000 (original Power) satisfied all his RISC criteria. I think that since the PowerPC, Power fails his criteria 5b
    and maybe 5a, so these days Power would be classified as 1/10 or 2/9
    (i.e., 10 for RISC, 1 against), so it's clearly RISC, like the others,
    and unlike AMD64 (7/4).
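
    The "# ODD" column can be recomputed mechanically from the RULE row.
    What follows is only a sketch with assumptions of mine: Age is counted
    as one of eleven criteria, a "-" entry is treated as not counting
    against RISC, and "N+" is scored by its lower bound. With those
    assumptions the counts match the table above; the names RULES, ROWS,
    satisfies and score are made up for illustration.

```python
# Rules and rows transcribed from the table in this post.
RULES = {
    "Age": ("<", 6), "3a": ("=", 1), "3b": ("=", 4), "3c": ("<", 5),
    "3d": ("=", 0), "4a": ("=", 0), "4b": ("=", 1), "5a": ("<", 2),
    "5b": ("=", 1), "6a": (">", 4), "6b": (">", 3),
}

ROWS = {
    "IBM RS/6000": "1 1 4 4 0 0 1 1 1 5 5",
    "ARM1":        "6+ 1 4 7+ 0 0 1 0 1 4+ -",
    "ARMv7 T32":   "-12 2+ 4 7+ 0 0 1 1 2+ 4+ 5",
    "ARM A64":     "-22 1 4 15+ 0 0 1 1 2+ 5 5",
}


def satisfies(value, rule):
    """True if one table entry meets its RISC rule."""
    if value == "-":                 # no entry: assumed not to count as odd
        return True
    n = int(value.rstrip("+"))       # "7+" is scored by its lower bound
    op, limit = rule
    if op == "<":
        return n < limit
    if op == ">":
        return n > limit
    return n == limit


def score(row):
    """Return (criteria against RISC, criteria for RISC) for one row."""
    values = row.split()
    odd = sum(not satisfies(v, r) for v, r in zip(values, RULES.values()))
    return odd, len(RULES) - odd


for cpu, row in ROWS.items():
    print(cpu, score(row))
```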

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 13:21:32 2024
    On Sun, 6 Oct 2024 00:43:56 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Fri, 04 Oct 2024 15:07:17 GMT, Anton Ertl wrote:

    Power still survives, maybe only because it has a common basis with
    iSeries (or whatever it is called now).

    As I understand it, iSeries is the emulation of the old AS/400 on
    POWER processors. And AS/400 was the unification of the older
    System/38 with the System/34? System/36? lines.

    System/38 (or AS/400, or iSeries) has/had this interestingly unusual architecture which builds database features right into the OS kernel,
    so that they can be used everywhere. And it also uses capabilities as
    an alternative to the traditional privilege-mode hierarchy. Neither
    of these ideas says much for performance, but they still suggest some interesting possibilities, nonetheless.

    Native POWER is, I think, called pSeries. It continues to sell in its
    own right because it offers high performance--

    https://www.ibm.com/downloads/cas/B425DZZ1
    Try to find the word POWER (or Power systems) in this 128-page document.
    Then maybe you will get an idea of how important it is according to
    IBM management.

    Compare with 5, 10 and 15 years ago (in the oldest report look for
    system p)
    https://www.ibm.com/investor/att/pdf/IBM_Annual_Report_2018.pdf
    https://www.ibm.com/investor/att/pdf/IBM_Annual_Report_2013.pdf
    https://www.ibm.com/investor/att/pdf/IBM_Annual_Report_2008.pdf


    high enough to earn a
    few ongoing spots near the top of the Top500 supercomputer list.

    This misunderstanding was cleared up by Anton.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 13:51:19 2024
    On Sun, 6 Oct 2024 00:55:31 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 3 Oct 2024 23:36:12 -0700, Chris M. Thomasson wrote:

    On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    That’s all past history, anyway. RISC very much rules today, and
    it is x86 that is struggling to keep up.

    You are, of course, aware that the complex "x86" instruction set
    is an illusion and that the hardware essentially has been a
    load-store RISC with a complex decoder on the front end since the
    Pentium Pro landed in 1995.

    Of course, and that complexity (and consequent expense) is part of
    the struggle. Looking at Intel’s current financial woes, it is
    clearly not being as successful at that as it has been in the past.

    Intel's current financial woes do not appear to be [directly] related to
    Intel PC (laptop+desktop) sales, which are right now pretty good and
    profitable.
    Even the server division, which struggled and lost money for the majority
    of 2023, is now recovering and is profitable again, even if the profit
    margin is tiny compared to 2021.
    Actually, it takes special management talent to have such good results
    in the company's main segment and, despite that, to lose money for Q
    after Q after Q.

  • From Michael S@21:1/5 to Anton Ertl on Sun Oct 6 14:04:56 2024
    On Sun, 06 Oct 2024 08:40:55 GMT
    [email protected] (Anton Ertl) wrote:


    AMD are easily provable culprits in this scam: They call their
    micro-ops "ROPs", for RISC ops.


    Wasn't the term invented by NexGen for the Nx586 and later adopted by AMD
    after they scrapped their home-brewed core in favor of NexGen's core,
    which later became known as the AMD K6?

  • From Anton Ertl@21:1/5 to Michael S on Sun Oct 6 11:58:51 2024
    Michael S <[email protected]> writes:
    On Sun, 06 Oct 2024 07:18:59 GMT
    [email protected] (Anton Ertl) wrote:
    The fact that IBM has not submitted results for Power for SPEC CPU
    2017 for (Int or FP) Speed or FP Rate results is an admission that
    their numbers there are even less impressive.


    That's the most likely explanation, but another one is that it is sort of
    internal policy no matter what.
    IIRC, they didn't publish non-rate scores for POWER7 either, despite
    that according to independent measurement at point of introduction
    POWER7 single-threaded performance was in the same ballpark with best
    Intel offerings and easily ahead of best AMD.

    For our LaTeX benchmark (numbers are in seconds, lower is better; the
    years are when the hardware came on the market):

    Machine                                                    Time (s)  Year
    Power7, 3600MHz, CentOS 7 (ppc64) TeX Live 2013            0.81      2010
    Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit)   0.76      2003
    Xeon X3460 (Lynnfield (Nehalem)) 2800MHz, Deb Lenny 64b    0.484     2009
    Xeon 5160, 3000MHz, (2*)4MB L2, Debian Etch (64-bit)       0.48      2006
    Phenom II X2 560, 3300MHz, 6MB L3, Debian Jessie (64-bit)  0.452     2010

    On whatever applications the performance of Power7 is competitive with
    Intel and AMD of its time, it certainly is not on our LaTeX
    benchmark.

    In any case, certainly for the stuff I do I see no reason why I would
    consider, much less recommend buying a Power machine these days. My
    guess is that the major reasons for buying pSeries machines these days
    are legacy software and IBM salesmanship.

    - anton

    I think, if you are running Oracle DB Enterprise Edition, where
    software license per core is the most expensive part then there could
    be an economical reason for preferring POWER9 or 10 over Intel or AMD.
    But that's just a guess.

    Yes, per-core licensing fees irrespective of the actual hardware might
    be a reason, but it would be perverse of Oracle or some other ISV to
    pay for porting to Power in order to reduce the licensing fees that
    their customers have to pay to them. But stranger things have
    happened.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Michael S on Sun Oct 6 14:03:59 2024
    Michael S <[email protected]> writes:
    On Sun, 06 Oct 2024 08:40:55 GMT
    [email protected] (Anton Ertl) wrote:


    AMD are easily provable culprits in this scam: They call their
    micro-ops "ROPs", for RISC ops.


    Wasn't the term invented by NexGen for the Nx586 and later adopted by AMD
    after they scrapped their home-brewed core in favor of NexGen's core,
    which later became known as the AMD K6?

    Easily possible. AMD may not have been the original culprit, but they continued this terminology, and therefore are just as guilty.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Sun Oct 6 17:35:40 2024
    On Sun, 06 Oct 2024 14:03:59 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 06 Oct 2024 08:40:55 GMT
    [email protected] (Anton Ertl) wrote:


    AMD are easily provable culprits in this scam: They call their
    micro-ops "ROPs", for RISC ops.


    Wasn't the term invented by NexGen for the Nx586 and later adopted by AMD
    after they scrapped their home-brewed core in favor of NexGen's core,
    which later became known as the AMD K6?

    Easily possible. AMD may not have been the original culprit, but they continued this terminology, and therefore are just as guilty.

    - anton

    In their defense, AMD's use of the term ROP didn't last long.
    The K8 manuals use the better term micro-ops. I don't have a K7 manual
    to check, but it seems to me that it uses the same terminology as K8.

  • From John Dallman@21:1/5 to Anton Ertl on Sun Oct 6 16:21:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    OK, "designed to run MS-DOS, and later Windows"?

    The 286 protected mode was certainly not designed for MS-DOS, and
    the 386 paging of linear addresses was certainly not designed for
    DOS, either.

    I see where I'm going wrong: I'm trying to talk about the machines
    designed to run MS-DOS and later Windows, not just the CPUs. The vast
    range of hardware that all had substantial degrees of compatibility as
    regards booting, busses and so on. Those things let their manufacturers
    compete for the DOS and Windows market, whereas x86-based machines that
    weren't PC-compatible only succeeded in quite specialised niches.

    Those hardware suppliers did not close off access to the more advanced
    features of i386 onwards, because they had no reason to, and that let
    Linux take advantage of all that hardware when it came along. That's the
    point I was failing to make.

    And the success of IA-32 and then AMD64 at replacing the RISCs is
    exactly because it was not some DOS-centric architecture, but also
    provided features needed by other OSs like 386/ix (later Interactive
    Unix, which I used myself in 1990 or so), Xenix, Linux, Windows NT,
    Solaris, the various BSDs, and others. And the computers built
    around these CPUs also provided these features.

    Just so.

    It seems to me that the 286 protected mode was a continuation of the
    iAPX432 ideas, which predated DOS,

    Not sure about that: it is also a bit like the base-limit memory
    protection of various old mainframe architectures, done in the context of
    x86 64KB segments.

    and that the 386 paging imitated the virtual-memory mainstream
    of bigger computing platforms at the time, such as the VAX and
    S/370.

    Absolutely.

    John

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun Oct 6 23:34:04 2024
    On Sun, 6 Oct 2024 13:51:19 +0300, Michael S wrote:

    Intel's current financial woes do not appear to be [directly] related to
    Intel PC (laptops+desktop) sales that are right now pretty good and
    profitable.

    x86 chip sales have been declining for years. At one time they were up to
    a million per day; nowadays it’s only about 80% of that. And you see the trouble they have keeping up in performance, microcode bugs etc. All adds
    up to competitiveness trouble.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun Oct 6 23:33:38 2024
    On Sat, 5 Oct 2024 21:56:34 +0000, Chris M. Thomasson wrote:

    On 10/4/2024 3:54 PM, MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 19:36:41 +0000, Chris M. Thomasson wrote:

    On 10/3/2024 11:36 PM, Chris M. Thomasson wrote:
    On 10/3/2024 9:23 PM, George Neuner wrote:
    On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

    If the RISC companies failed to keep up, they only have themselves to
    blame.

    That’s all past history, anyway. RISC very much rules today, and it
    is x86 that is struggling to keep up.

    You are, of course, aware that the complex "x86" instruction set is an
    illusion and that the hardware essentially has been a load-store RISC
    with a complex decoder on the front end since the Pentium Pro landed
    in 1995.

    Yeah. Wrt memory barriers, one is allowed to release a spinlock on "x86"
    with a simple store.

    The fact that one can release a spinlock using a simple store means that
    it's basically load-acquire release-store.

    So a load will do a load then have an implied acquire barrier.

    A store will do an implied release barrier then perform the store.

    How does the store know it needs to do this when the locking
    instruction is more than a pipeline depth away from the
    store release ?? So, Locked LD (or something) happens at
    1,000,000 cycles, and the corresponding store happens at
    10,000,000 cycles (9,000,000 locked).

    This release behavior is okay for releasing a spinlock with a simple
    store, MOV.

    It may be OK to SW but it causes all kinds of grief to HW.

    I thought that x86 has an implied #LoadStore | #StoreStore before the
    store, basically to give it release semantics. This means that one can release a spinlock without using any explicit membars. Iirc, there are
    Intel manuals that show this for spinlocks. Cannot exactly remember
    right now.

    I wonder if this actually works with my scenario above.

    On x86, atomic loads have acquire semantics and atomic stores have
    release semantics. Well, I think that is for WB memory only. Humm...
    Cannot remember if it's for WC or WB memory right now. Then there are
    the L/S/MFENCE instructions...

    https://www.felixcloutier.com/x86/sfence
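
    The lock/unlock pattern discussed above can be sketched in C11 atomics.
    This is only a minimal sketch, assuming a simple test-and-set lock on
    ordinary write-back memory; the names sketch_spinlock, sketch_lock,
    sketch_try_lock and sketch_unlock are made up for illustration. The
    point is that unlock is a plain store with release ordering and no
    explicit fence; on x86, compilers typically emit an ordinary MOV for it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set spinlock; names are illustrative only. */
typedef struct {
    atomic_bool locked;
} sketch_spinlock;

/* Try once: atomic exchange with acquire ordering.
 * Returns true if the lock was free and is now held by the caller. */
static bool sketch_try_lock(sketch_spinlock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

/* Spin until the exchange observes "unlocked". */
static void sketch_lock(sketch_spinlock *l)
{
    while (!sketch_try_lock(l)) {
        /* busy-wait; a real lock would add a pause/backoff here */
    }
}

/* Release: a plain store with release ordering, no explicit fence.
 * Earlier loads and stores in the critical section may not be
 * reordered after this store. */
static void sketch_unlock(sketch_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

    Note that nothing in sketch_unlock refers back to the locking
    instruction; the ordering guarantee attaches to the store itself,
    however long ago the lock was taken.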

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 23:36:10 2024
    On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:

    Number of operand types never has been a criterion in any of the RISC definitions I have seen, nor the number of instructions (although some
    people like to go by that).

    It’s in the name: “Reduced Instruction Set Computer”.

    I always thought it should have been “IRSC”: “Increased Register Set Computer”. The most obvious characteristic, the one that tends to hit you first, is having lots of registers.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 23:42:47 2024
    On Sun, 06 Oct 2024 07:18:59 GMT, Anton Ertl wrote:

    However, in most supercomputers (including Summit) the GPGPUs provide
    the bulk of the FLOPS ...

    That tends to go back and forth, between CPU and GPU.

    See this <https://www.nextplatform.com/2020/03/05/software-evolution-on-ornls-summit-supercomputer/>
    interview with Dr Tjerk Straatsma, group lead for scientific computing
    at ORNL. Seems their supers have made heavy use of NVidia GPUs up to
    now, but this was set to change:

    Frontier, the next system for the OLCF, will have AMD CPUs and
    GPUs.

    To prepare for this system, software developers may want to make
    changes to their programming approach, with OpenMP directive-based
    and HIP native offloading as the most comparable to the OpenACC
    and CUDA approaches on Summit today.

    I wonder what happened to OpenCL as the cross-platform architecture
    for GPU computing?

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun Oct 6 23:45:20 2024
    On Sun, 6 Oct 2024 12:40:00 +0300, Michael S wrote:

    However there are two exceptions: Fugaku (#4, Fujitsu A64Fx) and Sunway TaihuLight (#13, Sunway SW26010).

    Some suspect there may be more Chinese machines that would be worthy of a
    high place in the Top500 list, if only people knew about them. And those
    would likely be strong on the CPU side, weak on the GPU side, too.

    Why could China be drawing back from publicizing its supercomputer
    prowess? Partly national security, perhaps; but also because it tends to
    enrage some in the US to have their noses rubbed in another country’s technological superiority.

  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Sun Oct 6 23:38:07 2024
    On Sun, 6 Oct 2024 16:21 +0100 (BST), John Dallman wrote:

    ... whereas x86-based machines that weren't PC-compatible ...

    They could have been PCs, too, since IBM neither pioneered nor owned the
    term.

    The standard for compatibility soon had more to do with Microsoft software
    than particular IBM hardware, anyway; which is why I like to say “Microsoft-compatible”. No possibility of confusion over what you mean.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 07:17:02 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:

    Number of operand types never has been a criterion in any of the RISC
    definitions I have seen, nor the number of instructions (although some
    people like to go by that).

    It’s in the name: "Reduced Instruction Set Computer".

    Not at all. What you think of is a "fewer instructions computer", but
    it's called a "reduced-instruction set computer". It becomes more
    obvious if you look at the opposite: "Complex-instruction set
    computer", not "more-instructions computer".

    I always thought it should have been "IRSC": "Increased Register Set
    Computer". The most obvious characteristic, the one that tends to hit you
    first, is having lots of registers.

    Having 32 GPRs does not make AMD64 with APX a RISC, and VAX (a CISC)
    has the same number of registers as the first 801 and the ARM A32/T32
    (RISCs).

    However, in John Mashey's criteria the number of registers plays a
    role; he requires >4 bits for the GPR specifier, and >3 bits for the
    FPR specifier.
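
    To make the register criterion concrete, here is a minimal C sketch,
    assuming the specifier width is simply ceil(log2(number of registers));
    the function names are invented for this illustration and are not taken
    from Mashey's posting. With 16 GPRs (VAX, ARM A32) the specifier needs
    4 bits and misses the ">4 bits" test; with 32 GPRs it needs 5 and
    passes.

```c
/* Bits needed to name one of nregs registers: ceil(log2(nregs)).
 * Assumes nregs fits comfortably in an unsigned int. */
static unsigned specifier_bits(unsigned nregs)
{
    unsigned bits = 0;
    while (bits < 31 && (1u << bits) < nregs)
        bits++;
    return bits;
}

/* The two register criteria as quoted in the thread:
 * >4 bits for the GPR specifier, >3 bits for the FPR specifier. */
static int passes_gpr_criterion(unsigned ngprs)
{
    return specifier_bits(ngprs) > 4;
}

static int passes_fpr_criterion(unsigned nfprs)
{
    return specifier_bits(nfprs) > 3;
}
```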

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to John Dallman on Mon Oct 7 08:00:03 2024
    [email protected] (John Dallman) writes:
    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:

    OK, "designed to run MS-DOS, and later Windows"?

    The 286 protected mode was certainly not designed for MS-DOS, and
    the 386 paging of linear addresses was certainly not designed for
    DOS, either.

    I see where I'm going wrong: I'm trying to talk about the machines
    designed to run MS-DOS and later Windows, not just the CPUs. The vast
    range of hardware that all had substantial degrees of compatibility as
    regards booting, busses and so on. Those things let their manufacturers
    compete for the DOS and Windows market, whereas x86-based machines that
    weren't PC-compatible only succeeded in quite specialised niches.

    There actually were MS-DOS-compatible machines that were not 100% IBM
    PC compatible, and did not run programs that used direct hardware
    access, but MS-DOS programs that only used BIOS functions (i.e., a
    HAL). The BIOS functions were too slow, so the programs with direct
    hardware access won out over those that used the BIOS, and therefore
    the 100% IBM PC compatibles won out over the MS-DOS compatibles.

    The PC industry then developed a culture of compatibility, and that
    helped all OSs, not just DOS and Windows. E.g., it is much easier to
    install Linux on a PC than on some ARM-based SBC; for the ARM-based
    SBC the typical way is to use a prepared system image on an SD-card,
    because you cannot just put in a USB stick and run an installer.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Oct 7 10:17:26 2024
    Anton Ertl wrote:
    [email protected] (John Dallman) writes:
    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:

    OK, "designed to run MS-DOS, and later Windows"?

    The 286 protected mode was certainly not designed for MS-DOS, and
    the 386 paging of linear addresses was certainly not designed for
    DOS, either.

    I see where I'm going wrong: I'm trying to talk about the machines
    designed to run MS-DOS and later Windows, not just the CPUs. The vast
    range of hardware that all had substantial degrees of compatibility as
    regards booting, busses and so on. Those things let their manufacturers
    compete for the DOS and Windows market, whereas x86-based machines that
    weren't PC-compatible only succeeded in quite specialised niches.

    There actually were MS-DOS-compatible machines that were not 100% IBM
    PC compatible, and did not run programs that used direct hardware
    access, but MS-DOS programs that only used BIOS functions (i.e., a
    HAL). The BIOS functions were too slow, so the programs with direct
    hardware access won out over those that used the BIOS, and therefore
    the 100% IBM PC compatibles won out over the MS-DOS compatibles.

    The single most canonical test for IBM PC compatibility was Microsoft's
    Flight Simulator, taking off from the now demolished Meigs Field in
    Chicago.

    That game used the OS and BIOS for the loading of the game, and then
    went on to direct hardware access for pretty much the rest of the
    playing time.


    The PC industry then developed a culture of compatibility, and that
    helped all OSs, not just DOS and Windows. E.g., it is much easier to
    install Linux on a PC than on some ARM-based SBC; for the ARM-based
    SBC the typical way is to use a prepared system image on an SD-card,
    because you cannot just put in a USB stick and run an installer.

    Terje


    - anton



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 13:05:53 2024
    On Sun, 6 Oct 2024 23:42:47 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 06 Oct 2024 07:18:59 GMT, Anton Ertl wrote:

    However, in most supercomputers (including Summit) the GPGPUs
    provide the bulk of the FLOPS ...

    That tends to go back and forth, between CPU and GPU.

    See this <https://www.nextplatform.com/2020/03/05/software-evolution-on-ornls-summit-supercomputer/>
    interview with Dr Tjerk Straatsma, group lead for scientific computing
    at ORNL. Seems their supers have made heavy use of NVidia GPUs up to
    now, but this was set to change:

    Frontier, the next system for the OLCF, will have AMD CPUs and
    GPUs.


    No back and forth here. Frontier is as GPU-centric as Summit was.
    The same goes for Aurora, which is installed alongside the earlier
    Polaris at Argonne, and for El Capitan, which is going to replace
    Sierra at LLNL.

    In all cases the GPU vendor changed, but the balance between GPU and CPU
    computing power remains heavily skewed toward the GPU.


    To prepare for this system, software developers may want to make
    changes to their programming approach, with OpenMP directive-based
    and HIP native offloading as the most comparable to the OpenACC
    and CUDA approaches on Summit today.

    I wonder what happened to OpenCL as the cross-platform architecture
    for GPU computing?

    I have no idea.
    Maybe it was not designed with the same level of competence as CUDA?
    Or maybe, being cross-platform, OpenCL is at an inherent disadvantage?

  • From Michael S@21:1/5 to Anton Ertl on Mon Oct 7 12:27:09 2024
    On Mon, 07 Oct 2024 07:17:02 GMT
    [email protected] (Anton Ertl) wrote:


    However, in John Mashey's criteria the number of registers plays a
    role; he requires >4 bits for the GPR specifier, and >3 bits for the
    FPR specifier.

    - anton

    Which sounds rather arbitrary. Or even worse, as if he wanted SPARC
    to be called 'typical RISC' and ARM to be called atypical, and
    had chosen the numbers to match the agenda.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 13:26:54 2024
    On Sun, 6 Oct 2024 23:34:04 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 6 Oct 2024 13:51:19 +0300, Michael S wrote:

    Intel's current financial woes do not appear to be [directly]
    related to Intel PC (laptops+desktop) sales that are right now
    pretty good and profitable.

    x86 chip sales have been declining for years. At one time they were
    up to a million per day; nowadays it’s only about 80% of that.

    That can explain a slow shift from being crazily profitable to "just"
    very, very profitable.
    It's not nearly enough to explain several consecutive quarters of big
    losses.

    And
    you see the trouble they have keeping up in performance, microcode
    bugs etc. All adds up to competitiveness trouble.

    No, I don't see it. What I see is that, in absolute performance per
    thread, four companies are head and shoulders above the rest of
    the industry. Two of the four make ARM, the other two make x86.

  • From Thomas Koenig@21:1/5 to Anton Ertl on Mon Oct 7 17:38:54 2024
    Anton Ertl <[email protected]> schrieb:
    Michael S <[email protected]> writes:
    On Mon, 07 Oct 2024 07:17:02 GMT
    [email protected] (Anton Ertl) wrote:


    However, in John Mashey's criteria the number of registers plays a
    role; he requires >4 bits for the GPR specifier, and >3 bits for the
    FPR specifier.

    - anton

    Which sounds rather arbitrary.

    In a way it is, but see below.

    Or even worse, as if he wanted
    SPARC to be called 'typical RISC' and ARM to be called atypical, and
    had chosen the numbers to match the agenda.

    I think that ARM did not exist for John Mashey.

    When was his definition made?

    ARM was rather late to the RISC game, this might have been literally
    true.

  • From Anton Ertl@21:1/5 to Michael S on Mon Oct 7 17:09:10 2024
    Michael S <[email protected]> writes:
    On Mon, 07 Oct 2024 07:17:02 GMT
    [email protected] (Anton Ertl) wrote:


    However, in John Mashey's criteria the number of registers plays a
    role; he requires >4 bits for the GPR specifier, and >3 bits for the
    FPR specifier.

    - anton

    Which sounds rather arbitrary.

    In a way it is, but see below.

    Or even worse, as if he wanted
    SPARC to be called 'typical RISC' and ARM to be called atypical, and
    had chosen the numbers to match the agenda.

    I think that ARM did not exist for John Mashey.

    He probably chose the criterion ">4 bits" because it excluded VAX.
    So, yes, his criteria were based on classifying some architectures as
    RISCs and some as CISCs, and then drawing the lines to fit that
    classification. But his criteria also work for architectures he did
    not look at, including ARM A32, even if not every criterion agrees
    with all the others for every architecture.

    Alternatively, you could do a cluster analysis with these criteria and
    maybe others, and I think that the RISCs would come out pretty tightly clustered; the non-RISCs would be further away from that, and I doubt
    that they would form a single cluster.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 18:56:26 2024
    On Sun, 6 Oct 2024 1:03:00 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 04 Oct 2024 07:05:34 GMT, Anton Ertl wrote:

    CISC and RISC are about the instruction set, not about
    the implementation. And even if you look at the implementation, it's
    not true: The P6 has microinstructions that are ~100 bits long, whereas
    RISCs have 32-bit and 16-bit instructions. The K7 has load-store
    microinstructions; RISCs don't have that.

    Intel I think tried to spread this idea of a “RISC core” somewhere inside
    the labyrinthine complexity of its Pentium-and-later chips, in the hope
    that some of the aura attached to the term “RISC” would rub off on its
    products.

    And quite a few people fell for it.

    ... ARM A64 and RISC-V are clearly RISCs.

    ARM and some other RISC architectures (e.g. POWER) do somewhat stretch
    the term though, don’t they, when they add that combinatorial explosion
    of operand types in their short-vector instructions.

    600-1200 instructions in SIMD

    RISC-V has consciously avoided this, by going back to the older long-
    vector idea, like Seymour Cray used in his machines.

    ONLY 100-300 instructions for long vector.

    Compare this to 2 instructions in My 66000 that provide access
    to both SIMD and long vectors.

    In My Humble Opinion, ISAs with SIMD or Long Vectors do not qualify
    as RISC.

  • From Kent Dickey@21:1/5 to [email protected] on Mon Oct 7 18:55:26 2024
    In article <efXIO.169388$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Brett <[email protected]> writes:
    Speaking of complex things, have you looked at Swift output, as it checks
    all operations for overflow?

    You could add an exception type for that, saving huge numbers of correctly
    predicted branch instructions.

    The future of programming languages is type safe with checks, you need to
    get on that bandwagon early.
    MIPS got on that bandwagon early. It has, e.g., add (which traps on
    signed overflow) in addition to addu (which performs modulo
    arithmetic). It has been abandoned and replaced by RISC-V several
    years ago.

    Alpha got on that bandwagon early. It's a descendent of MIPS, but it
    renamed add into addv, and addu into add. It has been canceled around
    the year 2000.
    [ More details about architectures without trapping overflow instructions ]

    Trapping on overflow is basically useless other than as a debug aid,
    which clearly nobody values. If you take Rust's approach, and only
    detect overflow in debug builds, then you already don't care about
    performance.
    Those automatic software correctness checks, of which signed integer
    overflow detection is one of many, went away because most code was
    being written in C/C++ and those two languages don't require them.

    That just makes it more expensive in code size and performance to effect
    such checks. This overhead leads some to conclude it justifies eliminating
    the error checks.

    Eliminating the error event detectors doesn't make errors go away,
    just your knowledge of them.

    I gather portions of 16-bit Windows 3.1 were written in Pascal.
    When Microsoft developed 32-bit WinNT, if instead of C they had
    switched their official development language from Pascal to Modula-2
    which does require signed and unsigned, checked and modulo arithmetic,
    and array bounds checks, the world would have been a much safer place.

    But they didn't so it isn't.

    The x86 designers might then have had an incentive to make all the
    checks as efficient as possible, and rather than eliminate them,
    they might have enhanced and more tightly integrated them.

    OK, my post was about how having a hardware trap-on-overflow instruction
    (or a mode for existing ALU instructions) is useless for anything OTHER
    than as a debug aid where you crash the program on overflow (you can
    have a general exception handler to shut down gracefully, but "patching things
    up and continuing" doesn't work). I gave details of reasons folks might
    want to try to use trap-on-overflow instructions, and show how the
    other cases don't make sense.

    For me error detection of all kinds is useful. It just happens
    to not be conveniently supported in C so no one tries it in C.

    GCC's -ftrapv option is not useful for a variety of reasons.
    1) it's slow, about a 50% performance hit
    2) it's always on for a compilation unit, which is not what programmers
    need, as it triggers on many false positives, so people turn it off.

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother
    having trap-on-overflow instructions.

    I understand, and I disagree with this conclusion.
    I think all forms of software error detection are useful and
    HW should make them simple and eliminate cost when possible.

    I think I am not explaining the issue well.

    I'm not arguing what you want to do with overflow. I'm trying to show that
    for all uses of detecting overflow other than crashing with no recovery, hardware trapping on overflow is a poor approach.

    If you enable hardware traps on integer overflow, then to do anything other than crash the program would require engineering a very complex set of
    data structures, roughly the complexity of adding debug information to the
    executable, in order to make this work. As far as I know, no one in the
    history of computers has yet undertaken this task.

    This is because each instruction which overflows would need special
    handling, and the "debug" information would be needed. It would be a huge amount of compiler/linker/runtime complexity.

    This is different than most "signal" handlers people have written, where
    simple inspection of the instruction which failed and the address involved allows it to be "handled". But to do anything other than crash, each instruction which overflows needs special handling unique to that instruction and dependent on what the compiler was in the middle of doing when the
    overflow happened. This is why trapping just isn't a good idea.

    I'm just explaining why trap-on-overflow has gone away, because it's
    almost completely useless: hardware trap on overflow is only good for the
    case that you want to crash on integer overflow. Branch-on-overflow is the correct approach--the compiler can branch to either a trapping instruction
    (if you just want to crash), or for all other cases of detecting overflow,
    the compiler branches to "fixup" code.
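
    A minimal sketch of the two branch targets described above, assuming the
    GCC/Clang builtin __builtin_add_overflow. The function names add_or_trap
    and add_saturating are made up for illustration, and saturation is just
    one possible "fixup":

```c
#include <limits.h>

/* Crash-on-overflow: the overflow branch leads to a trapping instruction. */
int add_or_trap(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        __builtin_trap();               /* compiler emits a trapping instruction */
    return sum;
}

/* Fixup-on-overflow: the same branch leads to recovery code instead.
   Saturation is an arbitrary choice of fixup for this sketch. */
int add_saturating(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        return (b > 0) ? INT_MAX : INT_MIN;   /* overflow direction follows sign of b */
    return sum;
}
```

    In both functions the hot path is an add followed by a conditional
    branch; only the branch target differs.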

    And crash-on-overflow just isn't a popular use model, as I use the example
    of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
    and no compiler seems to use it. Especially since branch-on-overflow
    is almost as good in every way.

    Kent

  • From Kent Dickey@21:1/5 to [email protected] on Mon Oct 7 18:45:01 2024
    In article <S9YIO.47284$[email protected]>,
    EricP <[email protected]> wrote:
    Terje Mathisen wrote:
    Kent Dickey wrote:

    Look at:
    https://godbolt.org/z/oMhW55YsK

    Which is this code:

    int add2(int num, int other) {
        return num + other;
    }

    Compiled with these options: -O2 -ftrapv
    (-ftrapv is the GCC argument to detect signed overflows and crash).

    For x86-64 clang 19.1.0:

    add2:
            add     edi, esi
            jo      .LBB0_1
            mov     eax, edi
            ret
    .LBB0_1:
            ud1     eax, dword ptr [eax]

    This looks OK: it does a normal add, then branches-on-overflow to
    an undefined instruction.

    But x86 has an instruction to trap on overflow directly: INTO. It's
    one byte.
    And it doesn't use it.

    GCC x86-64 14.2 is even worse:

    add2:
            sub     rsp, 8
            call    __addvsi3
            add     rsp, 8
            ret

    It calls a routine to do all additions which might overflow, and that
    routine calls abort() if an overflow occurs.

    The CPU has a trap-on-overflow instruction exactly for this case (to
    crash on detecting an overflow), and compilers don't even use it.

    So even on architectures which have a trap-on-overflow instruction,
    compilers don't use it.

    You can only compile in INTO opcodes if you can guarantee that the INT 4
    (INTO) trap vector will always be set to a proper handler, and since
    that isn't part of the ABI, compilers can't depend on it?

    I do agree that it would be nice if it did work, barring that clang is
    doing the best possible alternative, at close to zero cost except for
    the useless branch predictor table entry wastage.

    Terje

    On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
    One must use JO to detect signed overflow.
    Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.

    Right, I forgot this. But even in 32-bit mode compiles, GCC and CLANG
    both do not use INTO when using the -ftrapv flag--the compilers do the same thing they do in 64-bit mode.

    Kent

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 19:03:59 2024
    On Sun, 6 Oct 2024 23:36:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:

    Number of operand types never has been a criterion in any of the RISC
    definitions I have seen, nor the number of instructions (although some
    people like to go by that).

    It’s in the name: “Reduced Instruction Set Computer”.

    I always thought it should have been “IRSC”: “Increased Register Set Computer”. The most obvious characteristic, the one that tends to hit
    you first, is having lots of registers.

    At the time of RISC, Denelcor had the HEP, and each process could have up to
    256 registers (along with up to 256 constants).

  • From MitchAlsup1@21:1/5 to Michael S on Mon Oct 7 19:02:12 2024
    On Sun, 6 Oct 2024 14:35:40 +0000, Michael S wrote:

    On Sun, 06 Oct 2024 14:03:59 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 06 Oct 2024 08:40:55 GMT
    [email protected] (Anton Ertl) wrote:


    AMD are easily provable culprits in this scam: They call their
    micro-ops "ROPs", for RISC ops.


    Wasn't the term invented by NexGen for the Nx586 and later adopted by AMD
    after they scrapped their home-brewed core in favor of NexGen's core,
    which later became known as the AMD K6?

    Easily possible. AMD may not have been the original culprit, but they
    continued this terminology, and therefore are just as guilty.

    - anton

    In their defense, AMD's use of the term ROP didn't last for long.
    K8 manuals use the better term micro-ops. I don't have K7 manual to
    look, but it seems to me that it uses the same terminology as K8.

    K9 used the terms micro-ops and meso-ops to describe before and
    after peephole optimization. HW was happy to run either as micro-
    ops were a strict subset of meso-ops, meso-ops just got more work
    done per cycle.

  • From Niklas Holsti@21:1/5 to All on Mon Oct 7 22:51:31 2024
    On 2024-10-07 22:12, MitchAlsup1 wrote:
    On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:

    In article <efXIO.169388$[email protected]>,
    EricP  <[email protected]> wrote:
    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP  <[email protected]> wrote:
    Kent Dickey wrote:

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else.  Just that CPUs should not bother
    having trap-on-overflow instructions.

    I understand, and I disagree with this conclusion.
    I think all forms of software error detection are useful and
    HW should make them simple and eliminate cost when possible.

    I think I am not explaining the issue well.

    I'm not arguing what you want to do with overflow.  I'm trying to show
    that for all uses of detecting overflow other than crashing with no
    recovery, hardware trapping on overflow is a poor approach.

    If you enable hardware traps on integer overflow, then to do anything
    other than crash the program would require engineering a very complex
    set of data structures, roughly the complexity of adding
    debug information to the executable, in order to make this work.  As
    far as I know, no one in the history of computers has yet undertaken
    this task.

    And yet, this is exactly the kind of data C++ needs in order to
    use its Try-Throw-Catch exception model. The stack walker needs
    to know where on the stack is the list of stuff to free on block
    exit, where are the preserved registers and how many, ...


    Ada too.

    There are at least two ways to do that (at least for Ada, probably also
    for C++):

    - Dynamically maintain a stack-like data structure (a chain, linked
    list) that describes the current nesting of "code blocks" and their
    exception handlers. Whenever the program enters a block with an
    exception handler, there is entry code that pushes the description of
    that exception handler on this chain, including the address of its code;
    and vice versa pop on exiting such a block.

    - Statically construct a mapping table that is stored in the executable
    and maps code ranges to exception handlers.

    Ada implementations started with the dynamic method, which is simpler
    but adds some execution cost to all blocks with exception handlers, even
    if an exception never happens. Current implementations tend to the
    static method, also called "zero-cost exceptions" because there is no
    extra execution cost for blocks with exception handlers /unless/ an
    exception does occur.
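
    The dynamic method can be sketched in C with setjmp/longjmp. This is only
    an illustration of the chain-of-handlers idea, not how any particular Ada
    or C++ runtime does it; handler_frame, handler_chain, raise_exception,
    checked_div and div_or_default are all hypothetical names:

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* One record per active block that has an exception handler. */
struct handler_frame {
    struct handler_frame *prev;   /* enclosing handler, or NULL */
    jmp_buf               where;  /* where to resume when an exception is raised */
};

static struct handler_frame *handler_chain = NULL;   /* innermost handler */

/* Raise: pop the innermost handler record and transfer control to it. */
static void raise_exception(int code)
{
    struct handler_frame *h = handler_chain;
    if (h == NULL)
        abort();                  /* no handler anywhere on the chain */
    handler_chain = h->prev;
    longjmp(h->where, code);      /* code must be nonzero */
}

/* Divide, raising exception 1 on a zero divisor. */
static int checked_div(int a, int b)
{
    if (b == 0)
        raise_exception(1);
    return a / b;
}

/* A block with a handler: push on entry, pop on normal exit. */
int div_or_default(int a, int b, int fallback)
{
    struct handler_frame frame;
    int result;

    frame.prev = handler_chain;   /* entry cost, paid even if nothing is raised */
    handler_chain = &frame;
    if (setjmp(frame.where) == 0) {
        result = checked_div(a, b);
        handler_chain = frame.prev;   /* normal exit: pop the record */
    } else {
        result = fallback;            /* handler body; record already popped */
    }
    return result;
}
```

    The push and pop around the protected block are the per-block execution
    cost that the static table method avoids.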

  • From MitchAlsup1@21:1/5 to Kent Dickey on Mon Oct 7 19:12:32 2024
    On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:

    In article <efXIO.169388$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother
    having trap-on-overflow instructions.

    I understand, and I disagree with this conclusion.
    I think all forms of software error detection are useful and
    HW should make them simple and eliminate cost when possible.

    I think I am not explaining the issue well.

    I'm not arguing what you want to do with overflow. I'm trying to show
    that for all uses of detecting overflow other than crashing with no
    recovery, hardware trapping on overflow is a poor approach.

    If you enable hardware traps on integer overflow, then to do anything
    other than crash the program would require engineering a very complex
    set of data structures, roughly the complexity of adding
    debug information to the executable, in order to make this work. As
    far as I know, no one in the history of computers has yet undertaken
    this task.

    And yet, this is exactly the kind of data C++ needs in order to
    use its Try-Throw-Catch exception model. The stack walker needs
    to know where on the stack is the list of stuff to free on block
    exit, where are the preserved registers and how many, ...

    This is because each instruction which overflows would need special
    handling, and the "debug" information would be needed. It would be a
    huge amount of compiler/linker/runtime complexity.

    Kent

  • From Michael S@21:1/5 to Thomas Koenig on Mon Oct 7 22:26:58 2024
    On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Anton Ertl <[email protected]> schrieb:
    Michael S <[email protected]> writes:
    On Mon, 07 Oct 2024 07:17:02 GMT
    [email protected] (Anton Ertl) wrote:


    However, in John Mashey's criteria the number of registers plays a
    role; he requires >4 bits for the GPR specifier, and >3 bits for
    the FPR specifier.

    - anton

    Which sounds rather arbitrary.

    In a way it is, but see below.

    Or even worse, like if he wanted for
    SPARC to be called 'typical RISC' and for ARM to be called atypical
    and had chosen the numbers to match the agenda.

    I think that ARM did not exist for John Mashey.

    When was his definition made?

    ARM was rather late to the RISC game, this might have been literally
    true.

    ARM was rather early to the RISC game. Shipped for profit since late
    1986. Less than a year after MIPS and ROMP. Several months after
    SPARC. PA-RISC first shipped in 1986H1, but volume production
    started later than ARM.
    Apollo PRISM, Motorola 88K, Intel i960 and AMD 29K all came later than
    ARM.

  • From Scott Lurndal@21:1/5 to [email protected] on Mon Oct 7 19:52:51 2024
    [email protected] (MitchAlsup1) writes:
    On Sun, 6 Oct 2024 14:35:40 +0000, Michael S wrote:

    In their defense, AMD's use of the term ROP didn't last for long.
    K8 manuals use the better term micro-ops. I don't have K7 manual to
    look, but it seems to me that it uses the same terminology as K8.

    K9 used the terms micro-ops and meso-ops to describe before and
    after peephole optimization. HW was happy to run either as micro-
    ops were a strict subset of meso-ops, meso-ops just got more work
    done per cycle.

    ARM Neoverse cores use the terms 'macro ops' and 'micro ops',
    the decoder produces Macro Ops which exist through renaming
    and dispatch stages. Further down the pipeline, a Macro Op
    can be split into two Micro Ops which can be issued OoO.

    See '2.1 Pipeline Overview'

    https://developer.arm.com/documentation/109637/latest/

  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Oct 8 06:14:59 2024
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    I think that ARM did not exist for John Mashey.

    When was his definition made?

    <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>

    He reposted in 1995, the first few postings have no date, but they
    include the IBM RS/6000 (1990) and 68040 (1990), but not the Alpha
    (1992), so I expect that it happened between 1990 and 1992. The ARM
    was first released in a development kit for the BBC Micro in 1986, and
    then to the mass market in the Archimedes in 1987.

    My guess is that it did not exist for John Mashey because it did not
    originate in the USA and was sold mainly in home computers and
    usually did not run Unix.

    Somewhat to my surprise, I just read that there was <https://en.wikipedia.org/wiki/RISC_iX>, which would work on many (but
    not all) Archimedes models with some additional hardware (in
    particular, a hard disk), and that they sold complete workstations
    like the R140 that included this hardware; the R140 (8MHz) cost GBP
    3500 in 1989 (without Ethernet). The R260 (30MHz ARM3, 8MB RAM, 100MB
    HDD, with Ethernet) cost GBP 3995 in 1990 (or as R225 without hard
    disk GBP 1995), which was probably pretty competitive with the likes
    of DG Aviion AV 100 (16MHz 88100, 8MB, diskless with 20" monitor for
    $4000 in 1990 <https://www.techmonitor.ai/technology/data_general_outdoes_sun_with_4000_17_mips_av_100_risc_station>),
    the DECStation 3100, or the HP 9000/425, machines that I had contact
    with at the time. A problem may have been that they did not have an
    FPU before 1993 (except for the R140, but that cost GBP 599).

    ARM was rather late to the RISC game, this might have been literally
    true.

    What makes you think so? Did you read that in krone.at?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Dallman@21:1/5 to Anton Ertl on Tue Oct 8 14:11:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    Somewhat to my surprise, I just read that there was <https://en.wikipedia.org/wiki/RISC_iX>, which would work on many
    (but not all) Archimedes models with some additional hardware (in
    particular, a hard disk), and that they sold complete workstations
    like the R140 that included this hardware ...

    Acorn did not try hard to sell RISC iX to industry, even in the UK. I
    remember knowing that it existed, and I /might/ have seen it running at a computer show. Acorn was, by 1990, seen as a specialist educational
    supplier. They may have sold some into universities, but I've never
    encountered anyone who'd used them.

    John

  • From Scott Lurndal@21:1/5 to EricP on Wed Oct 9 18:42:42 2024
    EricP <[email protected]> writes:
    Kent Dickey wrote:


    And crash-on-overflow just isn't a popular use model, as I use the example
    of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
    and no compiler seems to use it. Especially since branch-on-overflow
    is almost as good in every way.

    Kent

    Because C doesn't require it. That does not make the capability useless.

    Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause), and it's best done with conditional branches, not traps.

  • From EricP@21:1/5 to Kent Dickey on Wed Oct 9 14:16:51 2024
    Kent Dickey wrote:
    In article <efXIO.169388$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    OK, my post was about how having a hardware trap-on-overflow instruction
    (or a mode for existing ALU instructions) is useless for anything OTHER
    than as a debug aid where you crash the program on overflow (you can
    have a general exception handler to shut down gracefully, but "patching things
    up and continuing" doesn't work). I gave details of reasons folks might
    want to try to use trap-on-overflow instructions, and show how the
    other cases don't make sense.
    For me error detection of all kinds is useful. It just happens
    to not be conveniently supported in C so no one tries it in C.

    GCC's -ftrapv option is not useful for a variety of reasons.
    1) it's slow, about a 50% performance hit
    2) it's always on for a compilation unit, which is not what programmers need,
    as it triggers many false positives, so people turn it off.

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother
    having trap-on-overflow instructions.
    I understand, and I disagree with this conclusion.
    I think all forms of software error detection are useful and
    HW should make them simple and eliminate cost when possible.

    I think I am not explaining the issue well.

    I'm not arguing what you want to do with overflow. I'm trying to show that for all uses of detecting overflow other than crashing with no recovery, hardware trapping on overflow is a poor approach.

    If you enable hardware traps on integer overflow, then to do anything other than crash the program would require engineering a very complex set of
    data structures, roughly the complexity of adding debug information to the executable, in order to make this work. As far as I know, no one in the history of computers has yet undertaken this task.

    VAX/VMS 1.0 in 1979 had stack-based Structured Exception Handling (SEH).
    And of course carried it over onto Alpha/VMS.
    WinNT had SEH in its first version in 1992 for MIPS and 386,
    supported both by the C compiler and OS. Win95 had support too.
    In WinNT MS added __try __except keywords to the C language to support
    it for both themselves inside the OS and for users.

    Some languages like C++ and Ada have native support for SEH.
    There can be differences in what behaviors languages expect to be supported, like can one continue from an exception, or pass arguments to a handler.

    This is because each instruction which overflows would need special
    handling, and the "debug" information would be needed. It would be a huge amount of compiler/linker/runtime complexity.

    General structured exception handling is not as complex or expensive
    as you think. It's in the multiple 1000's of instructions range
    (so don't use it gratuitously).

    WinNT implemented it differently on 32-bit x86 and 64-bit x64,
    with the x64 method being more efficient because the compiler
    does most of the work. On x64 the compiler just needs to supply
    bounding low and high RIP's for *just the exception handler code*.

    The cost of delivering a structured exception is the OS basically
    delivers an exception to a thread dispatcher similar to a signal,
    but for structured exceptions that dispatcher code acts differently.
    The thread's frame pointer is the head of a single linked list of
    stack frames. It starts at the bottom of stack pointed to by the
    frame pointer and scans backward, taking the RIP for each context
    and looking in a small table of handler bounds to see if it is in range.
    If there is a handler, it is called. If it handles it, great.
    Otherwise it continues to scan backwards through the stack frames.
    If it gets to the top of stack and there is no handler, it invokes the
    thread's last chance handler, and if that doesn't intercept the exception,
    it terminates the thread.
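
    The scan described above can be modelled roughly as follows. This is a
    simplified sketch with simulated frames and a hand-written bounds table,
    not the actual Windows data structures; every name in it (handler_range,
    frame, dispatch_exception, demo_walk, the EXC_* codes and the addresses)
    is invented for the example:

```c
#include <stddef.h>
#include <stdint.h>

enum { EXC_OVERFLOW = 1, EXC_OTHER = 2 };

/* Table entry: a code range and the handler that covers it. */
struct handler_range {
    uintptr_t lo, hi;               /* covers lo <= pc < hi */
    int     (*handler)(int code);   /* returns nonzero if it handled the exception */
};

/* Simulated stack frame: a pc within the function plus a link to the caller. */
struct frame {
    const struct frame *caller;
    uintptr_t           pc;
};

static const struct handler_range *
find_handler(const struct handler_range *tab, size_t n, uintptr_t pc)
{
    for (size_t i = 0; i < n; i++)
        if (pc >= tab[i].lo && pc < tab[i].hi)
            return &tab[i];
    return NULL;
}

/* Walk from the innermost frame outward. Returns 1 if some handler took the
   exception, 0 if the walk ran out of frames (the "last chance" case). */
int dispatch_exception(const struct frame *fp,
                       const struct handler_range *tab, size_t n, int code)
{
    for (; fp != NULL; fp = fp->caller) {
        const struct handler_range *h = find_handler(tab, n, fp->pc);
        if (h != NULL && h->handler(code))
            return 1;
    }
    return 0;
}

static int handles_overflow(int code) { return code == EXC_OVERFLOW; }
static int handles_nothing(int code)  { (void)code; return 0; }

/* Three nested calls; only the outermost has a handler that accepts overflow. */
int demo_walk(int code)
{
    static const struct handler_range table[] = {
        { 0x1000, 0x1100, handles_nothing  },
        { 0x2000, 0x2400, handles_overflow },
    };
    const struct frame outer  = { NULL,    0x2010 };
    const struct frame middle = { &outer,  0x3000 };   /* no table entry covers this pc */
    const struct frame inner  = { &middle, 0x1040 };
    return dispatch_exception(&inner, table, 2, code);
}
```

    Nothing in this model runs until an exception is actually delivered,
    which is where the multiple-thousand-instruction cost is paid.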

    This is different than most "signal" handlers people have written, where simple inspection of the instruction which failed and the address involved allows it to be "handled". But to do anything other than crash, each instruction which overflows needs special handling unique to that instruction and dependent on what the compiler was in the middle of doing when the overflow happened. This is why trapping just isn't a good idea.

    Except you keep missing the point:
    no one has a handler for integer overflow because it should never happen.
    Just like no one has a handler for memory read parity errors.

    When you wrote C code using signed integers, *YOU* guaranteed to the
    compiler that your code would never overflow. Overflow checking just
    detects when you have made an error, just like array bounds checking,
    or divide by zero checking.

    This is not something being done *to you* against your will,
    this is something that you *ask for* because it helps detect your errors.
    Doing it in hardware just makes it efficient.

    A better exception usage example might be a routine that enables exceptions
    for floating point underflow where the FPU traps to a handler that zeros
    the value and logs where it happened so someone can look at it later,
    then continues with its calculation.

    I'm just explaining why trap-on-overflow has gone away, because it's
    almost completely useless: hardware trap on overflow is only good for the case that you want to crash on integer overflow. Branch-on-overflow is the correct approach--the compiler can branch to either a trapping instruction (if you just want to crash), or for all other cases of detecting overflow, the compiler branches to "fixup" code.

    But crash on overflow *IS* the correct behavior in 99.999% of cases.
    Branch on overflow is ALSO needed in certain rare cases and I showed how
    it is easily detected.

    And crash-on-overflow just isn't a popular use model, as I use the example
    of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
    and no compiler seems to use it. Especially since branch-on-overflow
    is almost as good in every way.

    Kent

    Because C doesn't require it. That does not make the capability useless.

    Removing error detectors does not make the errors go away,
    just your knowledge of them.

  • From EricP@21:1/5 to Niklas Holsti on Wed Oct 9 14:43:49 2024
    Niklas Holsti wrote:
    On 2024-10-07 22:12, MitchAlsup1 wrote:
    On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:

    In article <efXIO.169388$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:
    In article <O2DHO.184073$[email protected]>,
    EricP <[email protected]> wrote:
    Kent Dickey wrote:

    In no way was I ever arguing that checking for overflow was a bad idea,
    or a language issue, or anything else. Just that CPUs should not bother
    having trap-on-overflow instructions.

    I understand, and I disagree with this conclusion.
    I think all forms of software error detection are useful and
    HW should make them simple and eliminate cost when possible.

    I think I am not explaining the issue well.

    I'm not arguing what you want to do with overflow. I'm trying to show
    that for all uses of detecting overflow other than crashing with no
    recovery, hardware trapping on overflow is a poor approach.

    If you enable hardware traps on integer overflow, then to do anything
    other than crash the program would require engineering a very complex
    set of data structures, roughly the complexity of adding
    debug information to the executable, in order to make this work. As
    far as I know, no one in the history of computers has yet undertaken
    this task.

    And yet, this is exactly the kind of data C++ needs in order to
    use its Try-Throw-Catch exception model. The stack walker needs
    to know where on the stack is the list of stuff to free on block
    exit, where are the preserved registers and how many, ...


    Ada too.

    There are at least two ways to do that (at least for Ada, probably also
    for C++):

    - Dynamically maintain a stack-like data structure (a chain, linked
    list) that describes the current nesting of "code blocks" and their
    exception handlers. Whenever the program enters a block with an
    exception handler, there is entry code that pushes the description of
    that exception handler on this chain, including the address of its code;
    and vice versa pop on exiting such a block.

    Usually it uses the frame pointer to create a single linked list of
    call frames to walk backwards when scanning for an exception handler.

    There is also control block information that needs to be dynamically
    set up for each handler, so there is some runtime overhead.

    - Statically construct a mapping table that is stored in the executable
    and maps code ranges to exception handlers.

    The static method moves as much as possible of the control block
    information out of the dynamic context, lowering the set up cost
    for a handler.

    Ada implementations started with the dynamic method, which is simpler
    but adds some execution cost to all blocks with exception handlers, even
    if an exception never happens. Current implementations tend to the
    static method, also called "zero-cost exceptions" because there is no
    extra execution cost for blocks with exception handlers /unless/ an
    exception does occur.


    Windows used the dynamic method in 32-bit x86 OS and switched
    to static method on 64-bit x64 as it has lower runtime overhead.

    Structured Exception Handling (C/C++) https://learn.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-170

    x64 exception handling https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170

    Exception handling in MSVC https://learn.microsoft.com/en-us/cpp/cpp/exception-handling-in-visual-cpp?view=msvc-170

    Modern C++ best practices for exceptions and error handling https://learn.microsoft.com/en-us/cpp/cpp/errors-and-exception-handling-modern-cpp?view=msvc-170

  • From EricP@21:1/5 to Scott Lurndal on Wed Oct 9 15:08:05 2024
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Kent Dickey wrote:

    And crash-on-overflow just isn't a popular use model, as I use the example
    of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
    and no compiler seems to use it. Especially since branch-on-overflow
    is almost as good in every way.

    Kent
    Because C doesn't require it. That does not make the capability useless.

    Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause),
    and it's best done with conditional branches, not traps.

    Then you use the overflow branching form for those situations
    where you have a specific local overflow handler. Nothing stops that.

    But that is not a justification for getting rid of overflow trapping instructions altogether, as Kent was making. And actually it looks to me,
    not knowing Cobol, like it should use overflow trapping instructions
    UNLESS there is an ON OVERFLOW clause. i.e. that the default should be to
    treat overflow as an error unless you explicitly state how to handle it.

  • From MitchAlsup1@21:1/5 to EricP on Wed Oct 9 19:43:39 2024
    On Wed, 9 Oct 2024 18:16:51 +0000, EricP wrote:

    Except you keep missing the point:
    no one has a handler for integer overflow because it should never
    happen. Just like no one has a handler for memory read parity errors.

    Au contraire:
    I understand how to recover from even "late write ECC violations*"--
    but mostly that is because I am primarily a HW guy. (*) When a cache
    line displaced from L1 or L2 arrives at L3/DRAM with a bad ECC.

    When you wrote C code using signed integers, *YOU* guaranteed to the
    compiler that your code would never overflow. Overflow checking just
    detects when you have made an error, just like array bounds checking,
    or divide by zero checking.

    I disagree with this statement. I wrote in C under the knowledge
    that integer data types can overflow--they have to be able to--
    it is the nature of fixed size containers. I am happy for the
    compiler to IGNORE the possibility of overflow, but not the HW.

    This is not something being done *to you* against your will,
    this is something that you *ask for* because it helps detect your
    errors.
    Doing it in hardware just makes it efficient.

    Yes, allow the compiler to IGNORE the problem, but have HW detect the
    problem.

  • From MitchAlsup1@21:1/5 to Robert Finch on Wed Oct 9 21:36:21 2024
    On Wed, 9 Oct 2024 20:12:40 +0000, Robert Finch wrote:

    On 2024-10-09 2:16 p.m., EricP wrote:
    Kent Dickey wrote:

    But crash on overflow *IS* the correct behavior in 99.999% of cases.
    Branch on overflow is ALSO needed in certain rare cases and I showed how
    it is easily detected.

    And crash-on-overflow just isn't a popular use model, as I use the
    example
    of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
    and no compiler seems to use it.  Especially since branch-on-overflow
    is almost as good in every way.

    Kent

    Because C doesn't require it. That does not make the capability useless.

    Removing error detectors does not make the errors go away,
    just your knowledge of them.


    Slightly confused on trap versus branch. Trapping on overflow is not a
    good solution, but a branch on overflow is? A trap is just a slow
    branch. The reason for trapping was to improve code density and non-exceptional performance.
    If it is the overhead of performing a trap operation that is the issue,

    x86 has seriously distorted people's view on how much overhead is
    associated with a trap*. MIPS had trap handlers measuring in the
    17 cycle range both getting to the handler, handling the exception,
    and getting back to the instruction that trapped. Since GBOoO windows
    have mispredicted branches in this kind of latency, too; then a
    properly designed architecture should be able to do similarly to MIPS.

    Whereas x86 may take 1,000 cycles to get to the handler. This is due
    to all the Descriptor table stuff, call-gates, protection rings, and segmentation.

    (*) trap == exception == fault == any unpredicted control flow
    caused by the instruction stream itself (SVC-et-al not included
    because it is requested by the instruction stream).

    then a special register could be dedicated to holding the overflow
    handler address, and instructions defined to automatically jump through
    the overflow handler address register (a branch target address
    register).
    Overflow detecting instructions are just a fusion of the instruction and
    the following branch on overflow operation.

    addjo r1,r2,r3 <- does a jump (instead of a trap) to branch register #7
    for instance, on overflow.

    Having an overflow branch register might be better for code density / performance.

    What if you want to handle multiply overflow differently than
    addition overflow ??

  • From Michael S@21:1/5 to [email protected] on Thu Oct 10 15:36:32 2024
    On Wed, 9 Oct 2024 21:36:21 +0000
    [email protected] (MitchAlsup1) wrote:

    On Wed, 9 Oct 2024 20:12:40 +0000, Robert Finch wrote:


    x86 has seriously distorted people's view on how much overhead is
    associated with a trap*.

    Do you have an opinion about FRED? https://cdrdv2-public.intel.com/819481/346446-flexible-return-and-event-delivery.pdf

  • From Scott Lurndal@21:1/5 to EricP on Thu Oct 10 15:32:21 2024
    EricP <[email protected]> writes:
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Kent Dickey wrote:

    And crash-on-overflow just isn't a popular use model, as I use the example
    of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
    and no compiler seems to use it. Especially since branch-on-overflow
    is almost as good in every way.

    Kent
    Because C doesn't require it. That does not make the capability useless.

    Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause),
    and it's best done with conditional branches, not traps.

    Then you use the overflow branching form for those situations
    where you have a specific local overflow handler. Nothing stops that.

    But that is not a justification for getting rid of overflow trapping
    instructions altogether, as Kent was making. And actually it looks to me,
    not knowing Cobol, like it should use overflow trapping instructions
    UNLESS there is an ON OVERFLOW clause. i.e. that the default should be to
    treat overflow as an error unless you explicitly state how to handle it.

    See https://www.mainframestechhelp.com/tutorials/cobol/size-error-phrase.htm

    The default is to truncate. All other cases can be handled with
    a conditional branch.

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Fri Oct 11 01:40:15 2024
    On Mon, 7 Oct 2024 10:17:26 +0200, Terje Mathisen wrote:

    The single most canonical test for IBM PC compatibility was Microsoft's
    Flight Simulator, taking off from the now demolished Meigs Field in
    Chicago.

    That game used the OS and BIOS for the loading of the game, and then
    went on to direct hardware access for pretty much the rest of the
    playing time.

    I can remember Flight Simulator being used as the benchmark for
    compatibility as far back as 1985. A report on a computer show mentioned
    that clone makers were demoing it running on their products.

    This is why I feel the term “IBM compatible” was misleading, it should
    have been “Microsoft compatible” from at least that point on.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 11 01:41:58 2024
    On Mon, 7 Oct 2024 13:05:53 +0300, Michael S wrote:

    In all cases the vendor of GPU changed ...

    That, too, added to the problem, in that the software folks had to rewrite
    all the performance-intensive bits yet again for the new machine.

    OpenCL never took off because the GPGPU market simply isn’t competitive enough. NVidia is dominant, AMD plays second fiddle, and that’s it.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 11 01:46:51 2024
    On Mon, 7 Oct 2024 22:26:58 +0300, Michael S wrote:

    On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    ARM was rather late to the RISC game, this might have been literally
    true.

    ARM was rather early to the RISC game. Shipped for profit since late
    1986.

    Shipped in an actual PC, the Acorn Archimedes range.

    That was the first time I ever saw a 3D shaded rendition of a flag waving,
    on a computer, generated in real time. No other machine could do it,
    unless you got up to the really expensive Unix workstation class (e.g.
    SGI, custom Evans & Sutherland hardware etc).

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri Oct 11 06:42:15 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    I can remember Flight Simulator being used as the benchmark for
    compatibility as far back as 1985. A report on a computer show mentioned
    that clone makers were demoing it running on their products.

    This is why I feel the term “IBM compatible” was misleading, it should
    have been “Microsoft compatible” from at least that point on.

    It was IBM PC compatible, and that was not misleading, because that's
    what it was about. "Microsoft compatible" would have been misleading
    (if you want it to mean the same as "IBM PC compatible"), because lots
    of hardware was Microsoft DOS compatible that was not an IBM PC clone
    and therefore not 100% IBM PC compatible. And MS-DOS was certainly
    the higher-profile Microsoft product than the Flight Simulator.

    And many buyers did not care about the Flight Simulator, but more
    about Lotus 1-2-3, which also required an IBM PC compatible machine.

    Of course you saw the Flight Simulator a lot at shows: Moving pictures
    attract the eye in a way that a static spreadsheet screen does not.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri Oct 11 14:20:20 2024
    On 11/10/2024 03:46, Lawrence D'Oliveiro wrote:
    On Mon, 7 Oct 2024 22:26:58 +0300, Michael S wrote:

    On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    ARM was rather late to the RISC game, this might have been literally
    true.

    ARM was rather early to the RISC game. Shipped for profit since late
    1986.

    Shipped in an actual PC, the Acorn Archimedes range.

    That was the first time I ever saw a 3D shaded rendition of a flag waving,
    on a computer, generated in real time. No other machine could do it,
    unless you got up to the really expensive Unix workstation class (e.g.
    SGI, custom Evans & Sutherland hardware etc).

    The Acorn Archimedes was /way/ ahead of anything in the PC / x86 world,
    both in hardware and software. It could emulate an 80286 PC almost as
    fast as real PC's that you could buy at the time for a higher price than
    the Archimedes.

    The demo that impressed me most was drawing full-screen Mandelbrot sets
    in a second or two, compared to several minutes for a typical PC at the
    time. It meant you could do real-time zooming and flying around in the set.

    My first encounter with ARM assembly was enhancing that demo program for
    higher screen resolution and deeper zooming.

  • From Anton Ertl@21:1/5 to Terje Mathisen on Sat Oct 12 08:23:39 2024
    Terje Mathisen <[email protected]> writes:
    Maybe all add/sub/etc opcodes that are immediately followed by an INTO
    could be fused into a single ADDO/SUBO/etc version that takes zero extra
    cycles as long as the trap part isn't hit?

    On Intel P-cores add/inc/sub etc. has been fused with a following
    JO/JNO into one uop for quite a while (I guess since Sandy Bridge
    (2011)).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to EricP on Sat Oct 12 08:45:57 2024
    EricP <[email protected]> writes:
    But then, risc processors mostly, started using exceptions for housekeeping
    - SPARC for register window sliding, Alpha for byte, word and misaligned
    memory access

    On Alpha the assembler expands byte, word and unaligned access
    mnemonics into sequences of machine instructions; if you compile for
    BWX extensions, byte and word mnemonics get compiled into BWX
    instructions. If the machine does not have the BWX extensions and it encounters a BWX instruction, the result is an illegal instruction
    signal at least on Linux. This terminates your typical program, so
    it's not at all frequent.
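    For readers unfamiliar with what such an expanded sequence does, here is a
    minimal C sketch of an unaligned 64-bit load built only from aligned 64-bit
    loads, shifts and an OR. It is written in the spirit of the Alpha
    LDQ_U/extract/OR idiom but is not a transcription of the assembler's actual
    output. The function name load_u64_unaligned is hypothetical, and the sketch
    assumes little-endian byte numbering within each word.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads the 64-bit value that starts at byte offset byte_off within an
 * array of aligned 64-bit words, using only aligned word loads.
 * Byte 0 of each word is taken to be its least significant byte
 * (little-endian numbering, an assumption of this sketch). */
uint64_t load_u64_unaligned(const uint64_t *words, size_t byte_off)
{
    size_t   idx   = byte_off / 8;                   /* aligned word holding the first byte */
    unsigned shift = (unsigned)(byte_off % 8) * 8;   /* misalignment in bits */
    uint64_t lo    = words[idx];

    if (shift == 0)
        return lo;                                   /* already aligned: one load suffices */

    uint64_t hi = words[idx + 1];                    /* second aligned load */
    return (lo >> shift) | (hi << (64 - shift));     /* merge the two pieces */
}
```

    The point is that the unaligned access costs a handful of ordinary
    instructions and no trap, which is why the trap path stays rare.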

    Concerning unaligned accesses, if you use a load or store that
    requires alignment, Digital OSF/1 (and the later versions with various
    names) by default produced a signal rather than fixing it up, so again
    programs are typically terminated, and the exception is not at all
    frequent. There is a system call and a tool (uac) that allows telling
    the OS to fix up unaligned accesses, but it played no role in my
    experience while I was still using Digital OSF/1 (including its
    successors).

    On Linux the default behaviour was to fix up the unaligned accesses
    and to log that in the system log. There were a few such messages in
    the log per day, so that obviously was not a frequent occurrence,
    either. I wrote a program that allowed me to change the behaviour <https://www.complang.tuwien.ac.at/anton/uace.c>, mainly because I
    wanted to get a signal when an unaligned access happens.

    As for the unaligned-access mnemonics, these were obviously barely
    used: I found that gas generates wrong code for ustq several years
    after Alpha was introduced, so obviously no software running under
    Linux has used this mnemonic.

    The solution for Alpha was to add back the byte and word instructions,
    and add misaligned access support to all memory ops.

    Alpha added BWX instructions, but not because it had used trapping to
    emulate them earlier; old or portable binaries continued to use
    instruction sequences. Alpha traps when you do, e.g., an unaligned
    ldq in all Alpha implementations I have had contact with (up to a
    800MHz 21264B).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to EricP on Sat Oct 12 09:18:23 2024
    EricP <[email protected]> writes:
    Kent Dickey wrote:
    [...]
    GCC's -trapv option is not useful for a variety of reasons.
    1) it's slow, about 50% performance hit
    2) it's always on for a compilation unit which is not what programmers need
    as it triggers for many false positives so people turn it off.
    ...
    So why should any hardware include an instruction to trap-on-overflow?

    Because ALL the negative speed and code size consequences do not occur.

    Looking at <https://godbolt.org/z/oMhW55YsK> and selecting MIPS clang
    18.1.0, I get a 15-instruction sequence which does not include add
    (the trap-on-overflow version).

    MIPS gcc 14.2.0 generates a sequence that includes

    jal __addvsi3

    i.e., just as for x86-64. Similar for MIPS64 with these compilers.

    Interestingly, with RISC-V rv64gc clang 18.1.0, the sequence is much
    shorter than for MIPS clang 18.1.0, even though RV64GC has no specific
    way of checking overflow at all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Michael S on Sat Oct 12 10:23:18 2024
    Michael S <[email protected]> writes:
    That's correct about intrinsics, but incorrect about ADCX/ADOX.
    The latter can be moderately helpful in special situations, esp.
    128b * 128b => 256b multiplication, but it is never necessary
    and for addition/subtraction is not needed at all.

    They are useful if there are two strings of additions. This happens
    naturally in wide multiplication (also beyond 256b results). But it
    also happens when you add three multi-precision numbers (say, X, Y,
    Z): You need C for the carry of XYi=X[i]+Y[i]+C, and O for the carry
    of XYZ[i]=XYi+Z[i]+O. If you have ADCX/ADOX, you can do both
    additions in one loop, so XYi can be in a register and does not need
    to be stored. If you don't have these instructions, only ADC, you
    need one loop to compute X+Y and store the result in memory, and one
    loop to compute XY+Z, i.e., the lack of ADCX/ADOX results in
    substantial additional cost.
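    The single-loop scheme can be sketched in portable C as follows. The two
    variables c and o play the roles that the C and O flags play with ADCX and
    ADOX; a compiler is not guaranteed to map this onto those instructions, so
    treat it as an illustration of the data flow only. The function name
    add3_limbs and the limb layout (least significant limb first) are
    assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* r = x + y + z over n 64-bit limbs, least significant limb first.
 * Two independent carry chains run through the same loop, so the
 * intermediate x[i] + y[i] never has to be stored to memory.
 * Returns the carry out of the top limb (0, 1 or 2). */
unsigned add3_limbs(uint64_t *r, const uint64_t *x, const uint64_t *y,
                    const uint64_t *z, size_t n)
{
    unsigned c = 0;  /* carry chain of X + Y (the role of CF with ADCX) */
    unsigned o = 0;  /* carry chain of XY + Z (the role of OF with ADOX) */

    for (size_t i = 0; i < n; i++) {
        /* XYi = X[i] + Y[i] + C */
        uint64_t xy = x[i] + c;
        unsigned c1 = xy < c;          /* wrapped while adding the carry in */
        xy += y[i];
        c = c1 | (xy < y[i]);          /* at most one of the two can wrap */

        /* XYZ[i] = XYi + Z[i] + O */
        uint64_t xyz = xy + o;
        unsigned o1 = xyz < o;
        xyz += z[i];
        o = o1 | (xyz < z[i]);

        r[i] = xyz;
    }
    return c + o;
}
```

    With only one carry flag, the same job needs either a second pass over
    memory or explicit flag save/restore inside the loop.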

    If you add 4 multi-precision numbers, AMD64 with ADX runs out of carry
    bits, so you have to spend the overhead of an additional loop (but not
    of two additional loops as without ADCX/ADOX).

    With carry bits in the general purpose registers <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> and 30 GPRs
    (one is zero, one is sp), you can add 14 multi-precision numbers per
    loop: 14 GPRs for source addresses, 1 GPR for the target address, 1
    for the loop counter, 13 registers for loop-carried carry flags.

    Of course, the question is if this kind of computation is needed
    frequently enough to justify this kind of extension. For
    multi-precision multiplication and squaring, Intel considered the
    frequency relevant enough to introduce ADCX/ADOX/MULX.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Sun Oct 13 13:00:14 2024
    On Sat, 12 Oct 2024 10:23:18 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    That's correct about intrinsics, but incorrect about ADCX/ADOX.
    The latter can be moderately helpful in special situations, esp.
    128b * 128b => 256b multiplication, but it is never necessary
    and for addition/subtraction is not needed at all.

    They are useful if there are two strings of additions. This happens naturally in wide multiplication (also beyond 256b results). But it
    also happens when you add three multi-precision numbers (say, X, Y,
    Z): You need C for the carry of XYi=X[i]+Y[i]+C, and O for the carry
    of XYZ[i]=XYi+Z[i]+O. If you have ADCX/ADOX, you can do both
    additions in one loop, so XYi can be in a register and does not need
    to be stored. If you don't have these instructions, only ADC, you
    need one loop to compute X+Y and store the result in memory, and one
    loop to compute XY+Z, i.e., the lack of ADCX/ADOX results in
    substantial additional cost.

    If you add 4 multi-precision numbers, AMD64 with ADX runs out of carry
    bits, so you have to spend the overhead of an additional loop (but not
    of two additional loops as without ADCX/ADOX).

    With carry bits in the general purpose registers <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> and 30 GPRs
    (one is zero, one is sp), you can add 14 multi-precision numbers per
    loop: 14 GPRs for source addresses, 1 GPR for the target address, 1
    for the loop counter, 13 registers for loop-carried carry flags.

    Of course, the question is if this kind of computation is needed
    frequently enough to justify this kind of extension. For
    multi-precision multiplication and squaring, Intel considered the
    frequency relevant enough to introduce ADCX/ADOX/MULX.

    - anton

    That's not bad. I think, you see yourself that spill and context switch
    parts could benefit from more work.
    But I suspect that the main opposition you'll face in RISC-V
    organization will center not on that, but on fear of increase in cycle
    time, no matter if proven or not with hard numbers.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 13 13:10:58 2024
    On Fri, 11 Oct 2024 01:41:58 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Mon, 7 Oct 2024 13:05:53 +0300, Michael S wrote:

    In all cases the vendor of GPU changed ...

    That, too, added to the problem, in that the software folks had to
    rewrite all the performance-intensive bits yet again for the new
    machine.

    OpenCL never took off because the GPGPU market simply isn’t
    competitive enough. NVidia is dominant, AMD plays second fiddle, and
    that’s it.

    I am not sure about dog-tail relationships.
    To me it sounds plausible that NV dominates due to a better software story.

    At least that's what I see in certain sectors of embedded market -
    people prefer old NV Jetson Xavier over newer AMD and Intel SoCs that
    are much better not only on the CPU side, but also provide much more
    FLOPs on GPU side. And the reason is that they are much more certain
    that they will be able to write programs for NV GPUs than they are for
    AMD or Intel GPUs.

  • From Anton Ertl@21:1/5 to Michael S on Sun Oct 13 15:16:08 2024
    Michael S <[email protected]> writes:
    To their defense, AMD's use of the term ROP didn't last for long.
    K8 manuals use the better term micro-ops. I don't have K7 manual to
    look, but it seems to me that it uses the same terminology as K8.

    I have come across ROP (and its expansion RISC op) relatively
    recently, but maybe it was in third-party material. Their evil deeds
    of the past come back to haunt them:-).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Oct 14 23:39:59 2024
    On Sun, 13 Oct 2024 13:10:58 +0300, Michael S wrote:

    On Fri, 11 Oct 2024 01:41:58 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    OpenCL never took off because the GPGPU market simply isn’t competitive
    enough. NVidia is dominant, AMD plays second fiddle, and that’s it.

    I am not sure about dog-tail relationships.

    In a market dominated by one player, the dominant player tends not to like
    open standards. Open standards allow competitors to get a foot in the
    door, and the dominant player doesn’t like that.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Oct 14 23:38:43 2024
    On Fri, 11 Oct 2024 06:42:15 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    I can remember Flight Simulator being used as the benchmark for
    compatibility as far back as 1985. A report on a computer show mentioned
    that clone makers were demoing it running on their products.

    This is why I feel the term “IBM compatible” was misleading, it should
    have been “Microsoft compatible” from at least that point on.

    It was IBM PC compatible, and that was not misleading, because that's
    what it was about.

    But then IBM came along shortly afterwards with their PS/2 range, which no longer defined the standard for compatibility.

    So at that point it was either “Microsoft compatible” or nothing.

    ... lots of hardware was Microsoft DOS compatible ...

    Yes it was, but none of them could run Flight Simulator.

  • From EricP@21:1/5 to Anton Ertl on Mon Oct 14 21:44:06 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Kent Dickey wrote:
    [...]
    GCC's -trapv option is not useful for a variety of reasons.
    1) it's slow, about 50% performance hit
    2) it's always on for a compilation unit which is not what programmers need
    as it triggers for many false positives so people turn it off.
    ....
    So why should any hardware include an instruction to trap-on-overflow?
    Because ALL the negative speed and code size consequences do not occur.

    Looking at <https://godbolt.org/z/oMhW55YsK> and selecting MIPS clang
    18.1.0, I get a 15-instruction sequence which does not include add
    (the trap-on-overflow version).

    MIPS gcc 14.2.0 generates a sequence that includes

    jal __addvsi3

    i.e., just as for x86-64. Similar for MIPS64 with these compilers.

    Interestingly, with RISC-V rv64gc clang 18.1.0, the sequence is much
    shorter than for MIPS clang 18.1.0, even though RV64GC has no specific
    way of checking overflow at all.

    - anton

    Yes. So even when the ADD instruction is available they won't use it.
    At least clang for MIPS64 uses one of the overflow detect idioms inlined.
    Gcc calls that rather expensive subroutine.

    I changed your example to use long instead of int
    to avoid any partial register issues.
    Also I added a third argument just to see what it would do.
    It generates slightly different code for the second check.

    long add3 (long a, long b, long c) {
        return a + b + c;
    }

    I also tried Ada mips64 gnat 14.2.0 -O2 (below).
    It also didn't use the ADD which traps but uses a different idiom inlined.

    Both examples should have taken 3 instructions (plus the branch delay slot):
    add3:
        dadd $2, $4, $5        ; r2 = r4 + r5
        dadd $2, $2, $6        ; r2 = r2 + r6
        jr $ra
        nop                    ; branch delay slot


    but what clang generated was:

    ; The comments on the right are mine
    add3:
        daddiu $sp, $sp, -16   ; set up call frame
        sd $ra, 8($sp)
        sd $fp, 0($sp)
        move $fp, $sp
        daddu $3, $4, $5       ; r3 = r4 + r5
        slt $1, $3, $4         ; r1 = r3 < r4
        slti $2, $5, 0         ; r2 = r5 < 0
        bne $2, $1, .LBB0_3    ; if (r2 != r1) goto Overflow
        nop
        daddu $2, $3, $6       ; r2 = r3 + r6
        slt $1, $2, $3         ; r1 = r2 < r3
        slti $3, $6, 0         ; r3 = r6 < 0
        xor $1, $3, $1         ; r1 = r3 ^ r1
        bnez $1, .LBB0_3       ; if (r1 != 0) goto Overflow
        nop
        move $sp, $fp          ; pop frame
        ld $fp, 0($sp)
        ld $ra, 8($sp)
        jr $ra
        daddiu $sp, $sp, 16
    .LBB0_3:
        break

    ====================================

    -- Ada mips64 gnat 14.2.0 -O2
    function add3 (a, b, c : Long_Integer) return Long_Integer is
    begin
        return a + b + c;
    end add3;

    .LC0:
        .ascii "example.adb"
        .space 1
    _ada_add3:
        daddu $3,$4,$5 # tmp205, a, b
        xor $4,$4,$5 # tmp206, a, b
        nor $4,$0,$4 # tmp208, tmp206
        xor $5,$3,$5 # tmp207, tmp205, b
        and $5,$5,$4 # tmp209, tmp207, tmp208
        bltz $5,.L7 #, tmp209,
        daddu $2,$3,$6 # tmp212, tmp205, c
        xor $3,$3,$6 # tmp213, tmp205, c
        nor $3,$0,$3 # tmp215, tmp213
        xor $6,$2,$6 # tmp214, tmp212, c
        and $6,$6,$3 # tmp216, tmp214, tmp215
        bltz $6,.L7
        nop
        jr $31
        nop
    .L7:
        daddiu $sp,$sp,-16 #,,
        sd $28,0($sp) #,
        lui $28,%hi(%neg(%gp_rel(_ada_add3))) #,
        daddu $28,$28,$25 #,,
        daddiu $28,$28,%lo(%neg(%gp_rel(_ada_add3))) #,,
        ld $4,%got_page(.LC0)($28) # tmp210,,
        ld $25,%call16(__gnat_rcheck_CE_Overflow_Check)($28) # tmp211,,
        sd $31,8($sp) #,
        li $5,3 # 0x3 #,
    1:  jalr $25 # tmp211
        daddiu $4,$4,%got_ofst(.LC0) #, tmp210,
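    For comparison, the two inlined overflow-detect idioms seen in the listings
    above can be written out in C roughly like this. This is my reading of the
    generated code, not compiler source. The function names are hypothetical,
    and the conversion from uint64_t back to int64_t is assumed to wrap, as it
    does with GCC and Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare idiom (the daddu/slt/slti/bne pattern in the clang listing):
 * after a wrapping add, the sum is smaller than a exactly when b is
 * negative, unless the addition overflowed. */
bool add_overflows_cmp(int64_t a, int64_t b, int64_t *sum)
{
    int64_t s = (int64_t)((uint64_t)a + (uint64_t)b);  /* wrapping add, like daddu */
    *sum = s;
    return (s < a) != (b < 0);
}

/* XOR idiom (the daddu/xor/nor/xor/and/bltz pattern in the GNAT listing):
 * overflow happened when a and b have the same sign and the sum's sign
 * differs from b's, i.e. when the sign bit of (s ^ b) & ~(a ^ b) is set. */
bool add_overflows_xor(int64_t a, int64_t b, int64_t *sum)
{
    uint64_t ua = (uint64_t)a;
    uint64_t ub = (uint64_t)b;
    uint64_t us = ua + ub;                              /* wrapping add */
    *sum = (int64_t)us;
    return (int64_t)((us ^ ub) & ~(ua ^ ub)) < 0;
}
```

    Either way it is four or five instructions plus a branch per addition, where
    a trapping add would have been a single instruction.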

  • From EricP@21:1/5 to Anton Ertl on Tue Oct 15 12:59:03 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    But then, risc processors mostly, started using exceptions for housekeeping
    - SPARC for register window sliding, Alpha for byte, word and misaligned
    memory access

    On Alpha the assembler expands byte, word and unaligned access
    mnemonics into sequences of machine instructions; if you compile for
    BWX extensions, byte and word mnemonics get compiled into BWX
    instructions. If the machine does not have the BWX extensions and it encounters a BWX instruction, the result is an illegal instruction
    signal at least on Linux. This terminates your typical program, so
    it's not at all frequent.

    Ah yes, that was it. After they added BWX to 21164 in 1996,
    for older 21064 models VMS had an optional illegal instruction exception handler that caught BWX instructions, emulated them and continued,
    or terminated.

    Concerning unaligned accesses, if you use a load or store that
    requires alignment, Digital OSF/1 (and the later versions with various
    names) by default produced a signal rather than fixing it up, so again programs are typically terminated, and the exception is not at all
    frequent. There is a system call and a tool (uac) that allows telling
    the OS to fix up unaligned accesses, but it played no role in my
    experience while I was still using Digital OSF/1 (including its
    successors).

    On Linux the default behaviour was to fix up the unaligned accesses
    and to log that in the system log. There were a few such messages in
    the log per day, so that obviously was not a frequent occurrence,
    either. I wrote a program that allowed me to change the behaviour <https://www.complang.tuwien.ac.at/anton/uace.c>, mainly because I
    wanted to get a signal when an unaligned access happens.

    IIRC on VMS the unaligned exception was caught and could optionally
    log a diagnostic, execute a fixup handler and continue, or terminate.

    As for the unaligned-access mnemonics, these were obviously barely
    used: I found that gas generates wrong code for ustq several years
    after Alpha was introduced, so obviously no software running under
    Linux has used this mnemonic.

    The solution for Alpha was to add back the byte and word instructions,
    and add misaligned access support to all memory ops.

    Alpha added BWX instructions, but not because it had used trapping to
    emulate them earlier; old or portable binaries continued to use
    instruction sequences. Alpha traps when you do, e.g., an unaligned
    ldq in all Alpha implementations I have had contact with (up to a
    800MHz 21264B).

    - anton

    You are right... they didn't add misaligned access to all LD and ST.
    Except for LDQ_U and STQ_U they still fault on non-natural alignment.

  • From Bernd Linsel@21:1/5 to Anton Ertl on Tue Oct 15 21:24:11 2024
    On 12.10.24 11:18, Anton Ertl wrote:
    EricP <[email protected]> writes:
    Kent Dickey wrote:
    [...]
    GCC's -trapv option is not useful for a variety of reasons.
    1) it's slow, about 50% performance hit
    2) it's always on for a compilation unit which is not what programmers need
    as it triggers for many false positives so people turn it off.
    ...
    So why should any hardware include an instruction to trap-on-overflow?

    Because ALL the negative speed and code size consequences do not occur.

    Looking at <https://godbolt.org/z/oMhW55YsK> and selecting MIPS clang
    18.1.0, I get a 15-instruction sequence which does not include add
    (the trap-on-overflow version).

    MIPS gcc 14.2.0 generates a sequence that includes

    jal __addvsi3

    i.e., just as for x86-64. Similar for MIPS64 with these compilers.

    Interestingly, with RISC-V rv64gc clang 18.1.0, the sequence is much
    shorter than for MIPS clang 18.1.0, even though RV64GC has no specific
    way of checking overflow at all.

    - anton

    Very irritating: https://godbolt.org/z/KsMc3KfKc

    Why do neither gcc nor clang use MIPS's trap-on-overflow addition
    operators, while they indeed use teq <divisor>, 0 for a division-by-zero
    check?

    --
    Bernd Linsel

  • From Waldek Hebisch@21:1/5 to John Dallman on Sat Oct 26 18:37:14 2024
    John Dallman <[email protected]> wrote:

    I see where I'm going wrong: I'm trying to talk about the machines
    designed to run MS-DOS and later Windows, not just the CPUs. The vast
    range of hardware that all had substantial degrees of compatibility as regards booting, busses and so on. Those things let their manufacturers compete for the DOS and Windows market, whereas x86-based machines that weren't PC-compatible only succeeded in quite specialised niches.

    Those hardware suppliers did not close off access to the more advanced features of i386 onwards, because they had no reason to, and that let
    Linux take advantage of all that hardware when it came along. That's the point I was failing to make.

    I think this is still misleading. Not only was the 386 a _much_ more
    ambitious design than just a "processor for running DOS"; hardware
    manufacturers also cared about running more things than just
    DOS. And "running DOS" is misleading too: for many "DOS applications"
    DOS provided just program loader and file system access. Such
    applications could switch to protected mode, use multitasking
    and 32-bit addressing. There were "DOS extenders". Before
    Windows gained market dominance there were competing GUI-s.
    There were PC servers, which at some time meant Novell.

    So things critical to Linux were also important on the general PC
    market. Clearly Linux benefited from the availability of commodity
    PCs. But the things that made a PC a good PC were correlated with
    being a good Linux machine. As a little anecdote, let me add that
    small sellers frequently used Linux as a tester for the PCs they
    were selling, as it stressed machines more than "typical"
    DOS applications.

    --
    Waldek Hebisch

  • From Waldek Hebisch@21:1/5 to Anton Ertl on Mon Oct 28 23:45:53 2024
    Anton Ertl <[email protected]> wrote:
    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:

    [in-memory database]

    but the question is if
    the machine has enough RAM for the database. Our dual-Xeon system
    from IIRC 2007 has 24GB of RAM, not sure how big it could be
    configured; OTOH, we have a single-Xeon system from 2009 or so with
    32GB of RAM (and there were bigger Xeons in the market at the time).

    The minimum requirement of SAP HANA is 64 GB of memory, but typical
    ranges are from 256GB to 1TB.

    What is the relevance of SAP HANA for the topic at hand?

    The question was if the RAM can hold the data. For each account they
    would have to keep the current balance (64 bits should be enough for
    that), the account number (64 bits for the up to 19 digits of a Visa
    card) for verifying that we are at the correct entry in the hash table
    and probably some account status information (64 bits should be
    plenty?).

    There is also the sequence of transactions (a 64-bit transaction
    offset in the log per transaction should be enough for that). The
    sequence of transactions may be useful for fraud detection, but I
    don't know enough about that to know how to scale the system, so I'll
    just say that fraud detection is done by a bigger system before the transaction goes through to the transaction processing computer.

    The sequence of transactions is also needed for generating the reports
    and for dealing with customer complaints, but again, that's not
    processing the transactions themselves (and is basically read-only,
    except that the customer-complaint processing may result in additional transactions).

    So, with 24 bytes needed for each account on the
    transaction-processing server, 32GB with, say 8GB left for
    copy-on-write and other administrative purposes should be good for
    about 900M accounts at a hash table load factor of 84%. I guess that
    Visa has more accounts, so one would need a box with more RAM.
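    The estimate above can be checked with a few lines of C. This is only a
    restatement of the arithmetic in that paragraph, taking GB to mean 2^30
    bytes; the function name max_accounts is hypothetical.

```c
#include <stdint.h>

/* Number of accounts that fit in an open-addressing hash table:
 * usable RAM divided by the entry size gives the slot count, and the
 * load factor (in percent) says how many slots may be occupied. */
uint64_t max_accounts(uint64_t ram_bytes, uint64_t reserved_bytes,
                      uint64_t bytes_per_entry, unsigned load_percent)
{
    uint64_t slots = (ram_bytes - reserved_bytes) / bytes_per_entry;
    return slots * load_percent / 100;
}
```

    With 32GB of RAM, 8GB reserved, 24 bytes per entry and an 84% load factor
    this gives 901,943,132 accounts, which matches the "about 900M" figure.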

    A single core of the Xeon should easily be able to handle all the 56K
    transactions per second, both the logging and the update of the hash
    table, and in that case no locking is needed. But that first needs a
    sequence of transactions coming in.
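
    A minimal sketch of that single-writer arrangement: one thread appends
    each transaction to a log and then updates the in-memory table, so
    nothing else ever touches either structure and no locks appear. The
    class name TransactionProcessor, the 16-byte log entry format, and the
    use of a Python dict in place of a hand-built hash table are all
    assumptions; durability (flushing the log to stable storage) is left
    out.

```python
import io
import struct

# Hypothetical log entry: account number and signed amount, 64 bits each.
LOG_ENTRY = struct.Struct("<Qq")

class TransactionProcessor:
    """Single-threaded: one writer owns both the log and the table."""

    def __init__(self, log):
        self.log = log        # binary file-like object positioned at its end
        self.balances = {}    # account number -> balance in minor units

    def apply(self, account, amount):
        """Log the transaction, update the balance, return the log offset."""
        offset = self.log.tell()          # offset identifying this transaction
        self.log.write(LOG_ENTRY.pack(account, amount))   # log first
        self.balances[account] = self.balances.get(account, 0) + amount
        return offset

log = io.BytesIO()            # stands in for an append-only log file
tp = TransactionProcessor(log)
first = tp.apply(4000_0000_0000_0002, -2_500)
second = tp.apply(4000_0000_0000_0002, 1_000)
print(first, second, tp.balances[4000_0000_0000_0002])   # 0 16 -1500
```

    The returned offsets are the 64-bit per-transaction log offsets
    mentioned earlier; a sequencer in front of this loop would have to
    supply the transactions in a single order.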

    AFAICS the main transaction processor does not need to know about
    individual cards. Cards are issued by banks, and clearly the bank needs
    info about the card and customer. The main transaction processor could
    deal only with banks, namely verify that the bank is solvent (_bank_
    balance stays within agreed limits) and that the message data is legit.

    OTOH, information about a specific card is likely to be bigger. A card
    may have a daily limit on transactions, which is another number to
    keep (actually two: the limit and the daily amount of transactions).
    There is information used to verify the validity of a transaction, like
    the customer name, validity date of the card and validation code.
    Financial institutions are regulated and may be legally obliged to keep
    and check some information which is technically not needed for
    processing of transactions (IIUC in my country banks are not allowed to
    directly transfer money between themselves; all transfers are
    supposed to go through the central national bank).

    I think that the transaction center wants to keep more information
    than a single copy of the data in RAM: with a single copy, any
    memory corruption could mean loss of hours of transaction data,
    which is equivalent to quite a lot of cash. So I suspect that
    there are layers of redundancy built in. And even if what they do
    is suboptimal performance-wise, they are probably very
    reluctant to change the core accounting code.

    --
    Waldek Hebisch

  • From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Tue Oct 29 00:17:34 2024
    On Mon, 28 Oct 2024 23:45:53 -0000 (UTC), Waldek Hebisch wrote:

    I think that the transaction center wants to keep more information
    than a single copy of the data in RAM: with a single copy, any
    memory corruption could mean loss of hours of transaction data,
    which is equivalent to quite a lot of cash. So I suspect that
    there are layers of redundancy built in.

    They could distribute the load, by spreading it across multiple processing
    centres. For example, most transactions on a given card are likely to be
    with businesses in a particular locality, or with certain large online
    retailers. The card has a credit limit, but I suspect that is not a
    “brick wall” limit, so if it takes a few seconds to reconcile multiple
    transactions, with the chance that they could add up to something a bit
    beyond the credit limit once the totals have been made consistent again,
    that’s not the end of the world.
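
    A toy sketch of that idea, for one card and two centres: each centre
    approves against the total it last agreed on plus what it has approved
    locally since, and a periodic reconcile step merges the totals. The
    names Centre and reconcile, and the whole protocol, are assumptions made
    up for illustration; nothing here reflects how a real card network
    works.

```python
class Centre:
    """One processing centre's view of a single card (hypothetical)."""

    def __init__(self, credit_limit):
        self.credit_limit = credit_limit
        self.known_spend = 0    # total spend as of the last reconciliation
        self.local_spend = 0    # spend approved here since then

    def approve(self, amount):
        """Approve using only locally available information."""
        if self.known_spend + self.local_spend + amount > self.credit_limit:
            return False
        self.local_spend += amount
        return True

def reconcile(centres):
    """Merge every centre's local spend into one agreed total."""
    total = centres[0].known_spend + sum(c.local_spend for c in centres)
    for c in centres:
        c.known_spend = total
        c.local_spend = 0
    return total

a, b = Centre(1000), Centre(1000)
print(a.approve(700), b.approve(600))   # True True: neither sees the other yet
print(reconcile([a, b]))                # 1300, a bit beyond the 1000 limit
print(a.approve(1))                     # False once the totals are consistent
```

    The overshoot is bounded by what the centres can approve between two
    reconciliations, which is the "not a brick wall" behaviour described
    above.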
