Forum: >>> Magnum BBS <<<

Is VAX addressing sane today?

From Brett@21:1/5 to All on Thu Sep 5 21:03:37 2024

Is Vax addressing sane today?

I am not talking indirect addressing, that is stupid.

It has been determined from trusted sources that add from memory and add to memory as used in x86 are sane, and not much of a problem.

But Vax allows all three arguments to be in memory with different pointers.

Is this sane, just a natural progression if you allow memory operands?

Packing three offsets in an instruction that can be decoded reasonably is a whole other problem…

Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors. Heads and tails
is not that easy, but it’s not x86 difficult.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Brett on Thu Sep 5 21:15:20 2024

Brett <[email protected]> writes:

Is Vax addressing sane today?

By what definition of sanity? The question doesn't make sense.

I am not talking indirect addressing, that is stupid.

That seems rather judgmental. Although you may wish to further
define 'indirect addressing' to make clear exactly what you're
referring to (PDP-8 style, with auto-increment? B3500 style
where there is potentially infinite indirection? x86 where
indirection through a register is common?)

They may not make sense in the context of a modern RISC processor,
but that doesn't make them "stupid".

It has been determined from trusted sources that add from memory and add to
memory as used in x86 are sane, and not much of a problem.

"Sanity" isn't an attribute associated with hardware architecture.

But Vax allows all three arguments to be in memory with different pointers.

That's not an addressing issue, it is simply a natural form of
three-operand instruction when the processor supports memory to
memory instructions.

Now, you may reframe your question as to the desirability
of memory-to-memory instructions vis-a-vis performance and/or
optimal code, and there you might find various opinions.

I think the folks on this group spend an inordinate amount of
time discussing minutiae such as how many address/offset bits
can be encoded in an instruction, or code density, which in the
real world with modern high-performance processors aren't
particularly significant or interesting to the typical
programmer.

From John Dallman@21:1/5 to Brett on Thu Sep 5 22:55:00 2024

In article <vbd6b9$g147$[email protected]>, [email protected] (Brett) wrote:

It has been determined from trusted sources that add from memory
and add to memory as used in x86 are sane, and not much of a problem.

But Vax allows all three arguments to be in memory with different
pointers.

Is this sane, just a natural progression if you allow memory
operands?

Memory-to-memory instructions, in general, are hard to get to run fast
with today's processors and memory, simply because memory access times
are long enough for many register-to-register instructions to execute. A
lot of that time can be hidden with good caches and prefetchers, but if
your memory access patterns are complicated, those speedups can fail to
work.

One reason for memory-to-memory instructions was to avoid the need to
dedicate registers to operands, but that's not much of a problem these
days, since we have space in the CPU for lots of registers and rename
systems for them.

VAX was designed when heavy use of microcoding seemed like a good idea to
make a CPU at an economical price, and memory wasn't much slower than
registers. It was a backward-looking design in some ways, being a much
better computer for the 1970s, rather than looking ahead to new concepts.
VMS was the last large operating system written in assembly language (and
Bliss, which is somewhat higher-level, but not much).

DEC spent a lot of time and money trying to keep VAX competitive and took
too long to accept that was impractical. That was one of the seeds of
their downfall.

John

From MitchAlsup1@21:1/5 to Brett on Thu Sep 5 23:32:49 2024

On Thu, 5 Sep 2024 21:03:37 +0000, Brett wrote:

Is Vax addressing sane today?

I am not talking indirect addressing, that is stupid.

It has been determined from trusted sources that add from memory and add
to memory as used in x86 are sane, and not much of a problem.

But Vax allows all three arguments to be in memory with different
pointers.

With modern compiler technology 88% of instructions need only 1
constant--thus VAX provides too many, along with providing address
modes that require sequential decoding.

Most ISAs do not provide "enough" constants, VAX provides too many.
Where "enough" covers::

SLA R9,#1,R17 // this is 1 instruction
DIV R9,#24,R17 // ibid
FDIV R8,#3.14159265358979,R17

Is this sane, just a natural progression if you allow memory operands?

Having watched this in real time:: in 1970 we needed more/better
constants, then PDP-11 came around and we liked it, then at the end
of the decade VAX came along and we loved it, only later recognizing
that it had fallen for the second-system syndrome--becoming overly
complicated without benefit--the address space was definitely needed,
the address modes not so much.

Packing three offsets in an instruction that can be decoded reasonably
is a whole other problem…

Realistically, modern compilers have advanced to the point where
anything more than 1 constant per instruction is overkill--
harder to build and delivering no more performance.

Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors. Heads and
tails is not that easy, but it’s not x86 difficult.

Another encoding scheme is segmenting the OpCode into 2 components
1) goes to the function unit to convey the kind of calculation
to be performed,
2) goes to the forwarding logic to convey how to route bits into
calculation.

Some might consider the concatenation of both to be the OpCode
but that obscures what to do with when to do it.

From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Sep 6 00:39:30 2024

On Thu, 5 Sep 2024 22:55 +0100 (BST), John Dallman wrote:

VMS was the last large operating system written in assembly language
(and Bliss, which is somewhat higher-level, but not much).

Bliss could, I think, have been just as portable as C. But it mainly found favour inside DEC.

From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Fri Sep 6 00:38:31 2024

On Thu, 05 Sep 2024 21:15:20 GMT, Scott Lurndal wrote:

"Sanity" isn't an attribute associated with hardware architecture.

Says someone posting to a group full of hardware experts ...

From Anton Ertl@21:1/5 to Brett on Fri Sep 6 05:38:01 2024

Brett <[email protected]> writes:

But Vax allows all three arguments to be in memory with different pointers.

Is this sane, just a natural progression if you allow memory operands?

In combination with supporting unaligned accesses (but excluding
indirect addressing), it means that an instruction can access 6 pages,
and so the TLB (and/or TLB loader) has to be designed to support that.
Likewise, the OS has to be designed to load all 6 pages into physical
RAM without evicting one of these pages again. So this kind of
architecture increases the design complexity. And I don't see a
benefit from this design.
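
The 6-page figure can be checked with a little arithmetic: an unaligned operand can straddle a page boundary, so each of the three memory operands may touch two pages. The following is a minimal Python sketch, assuming 4-byte operands, the 512-byte VAX page size, and made-up operand addresses chosen only for illustration; it counts data pages only, not the pages holding the instruction itself.

```python
PAGE_SIZE = 512  # VAX page size in bytes


def pages_touched(addr, size, page_size=PAGE_SIZE):
    """Return the set of page numbers covered by bytes [addr, addr + size)."""
    first = addr // page_size
    last = (addr + size - 1) // page_size
    return set(range(first, last + 1))


def pages_for_instruction(operands, page_size=PAGE_SIZE):
    """Union of the pages touched by all (addr, size) memory operands."""
    pages = set()
    for addr, size in operands:
        pages |= pages_touched(addr, size, page_size)
    return pages


# Hypothetical worst case: three 4-byte operands, each starting 2 bytes
# before a page boundary, so each one straddles two pages.
worst_case = [(0x01FE, 4), (0x21FE, 4), (0x41FE, 4)]

# Hypothetical aligned case: no operand crosses a page boundary.
aligned = [(0x0100, 4), (0x2100, 4), (0x4100, 4)]

print(len(pages_for_instruction(worst_case)))  # 6
print(len(pages_for_instruction(aligned)))     # 3
```

With aligned operands the same instruction needs at most three data pages resident, which is why unaligned access is part of the complexity argument.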

Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors.

Would it? Please present empirical data. Certainly people claim that
instruction sets with one-memory-address load-and-op and
read-modify-write instructions have better code density, but when you
look at the data, there are load-store instruction sets with better
code density (and by quite a lot). From <[email protected]>:

   bash   grep  gzip
 595204 107636 46744  armhf     16 regs  load/store      32-bit
 599832 101102 46898  riscv64   32 regs  load/store      64-bit
 796501 144926 57729  amd64     16 regs  ld-op ld-op-st  64-bit
 829776 134784 56868  arm64     32 regs  load/store      64-bit
 853892 152068 61124  i386       8 regs  ld-op ld-op-st  32-bit
 891128 158544 68500  armel     16 regs  load/store      32-bit
 892688 168816 64664  s390x     16 regs  ld-op ld-op-st  64-bit
1020720 170736 71088  mips64el  32 regs  load/store      64-bit
1168104 194900 83332  ppc64el   32 regs  load/store      64-bit
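
To make the comparison easier to read, the sizes can be normalized against the smallest entry in each column. This is only an illustrative Python sketch over the numbers in the table above; the helper name `relative_sizes` and the `LOAD_STORE` grouping are introduced here for the example.

```python
# Sizes copied from the table above: (bash, grep, gzip) per Debian architecture.
sizes = {
    "armhf":    (595204, 107636, 46744),
    "riscv64":  (599832, 101102, 46898),
    "amd64":    (796501, 144926, 57729),
    "arm64":    (829776, 134784, 56868),
    "i386":     (853892, 152068, 61124),
    "armel":    (891128, 158544, 68500),
    "s390x":    (892688, 168816, 64664),
    "mips64el": (1020720, 170736, 71088),
    "ppc64el":  (1168104, 194900, 83332),
}

# Architectures the table labels as load/store; the rest are ld-op / ld-op-st.
LOAD_STORE = {"armhf", "riscv64", "arm64", "armel", "mips64el", "ppc64el"}


def relative_sizes(table, column):
    """Size of each architecture's binary relative to the smallest one."""
    smallest = min(row[column] for row in table.values())
    return {arch: row[column] / smallest for arch, row in table.items()}


bash_rel = relative_sizes(sizes, 0)
for arch, ratio in sorted(bash_rel.items(), key=lambda kv: kv[1]):
    kind = "load/store" if arch in LOAD_STORE else "ld-op"
    print(f"{arch:9s} {kind:10s} {ratio:.2f}")
```

On these numbers the amd64 bash binary comes out roughly a third larger than the armhf one.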

What is "heads and tails encoding"?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Sep 6 07:08:51 2024

On Fri, 06 Sep 2024 06:05:35 GMT, Anton Ertl wrote:

... they failed to stick to VAX for the few more years until
they would have developed an OoO implementation, which would have
leveled the playing field again (see Pentium Pro).

It takes a whole lot of extra transistors (and consequent die area) to
keep a CISC architecture comparable in performance to RISC. Back about
when Intel finally caught up with PowerPC, I remember their chip packages
were huge -- about the size of a VHS videocassette.

Intel were probably spending 10× what Apple-IBM-Motorola were putting into each generation of chip development. But then, the x86 world had 10× the revenue coming in, so Intel could afford it. That’s how they regained the lead over RISC.

Nowadays, I don’t think the revenue advantage is quite what it once was. That, and the even greater increases in chip complexity (and hence
development cost), has tilted the playing field more in favour of RISC architectures, notably ARM and RISC-V.

From Anton Ertl@21:1/5 to John Dallman on Fri Sep 6 06:05:35 2024

[email protected] (John Dallman) writes:

Memory-to-memory instructions, in general, are hard to get to run fast
with today's processors and memory, simply because memory access times
are long enough for many register-to-register instructions to execute.

Given modern OoO technology, even VAX can fly. It does not matter
whether, say,

*a++ = *b++ + *c++;

is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.
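
One way to see why the encodings converge is to list the primitive operations the statement needs and then group them the way each instruction set would. The sketch below is illustrative Python, not a model of any real decoder: the primitive-op names, the 4-byte element size, and the groupings (autoincrement or post-indexed addressing folding a pointer update into the memory access) are assumptions for the example.

```python
# Primitive operations implied by:  *a++ = *b++ + *c++;
PRIMITIVE_OPS = [
    "load  t1 <- [b]",
    "add   b  <- b + 4",
    "load  t2 <- [c]",
    "add   c  <- c + 4",
    "add   t3 <- t1 + t2",
    "store [a] <- t3",
    "add   a  <- a + 4",
]

# Number of primitive ops bundled into each instruction (assumed groupings).
GROUPINGS = {
    # one three-operand add with autoincrement on every operand
    "VAX": [7],
    # post-indexed load, post-indexed load, add, post-indexed store
    "ARM A64": [2, 2, 1, 2],
    # no autoincrement addressing: one instruction per primitive op
    "RISC-V": [1, 1, 1, 1, 1, 1, 1],
}


def instruction_count(isa):
    """Instructions the program contains for this ISA."""
    return len(GROUPINGS[isa])


def primitive_count(isa):
    """Primitive operations the back end still has to perform."""
    return sum(GROUPINGS[isa])


for isa in GROUPINGS:
    print(isa, instruction_count(isa), primitive_count(isa))
```

The instruction counts differ (1, 4, 7), but the amount of work handed to the execution engine is the same in all three cases.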

In recent years a number of implementations have 0-cycle store-to-load forwarding, so the misconception that a memory operand is as cheap as
a register operand if only the instruction set has memory operands of
operate instructions is a little bit closer to reality. It is still a misconception, because such an implementation can read and write
several times as many registers per cycle as memory operands.

A lot of that time can be hidden with good caches and prefetchers, but if
your memory access patterns are complicated, those speedups can fail to
work.

Whether operate instructions in an instruction set have 0, 1, or 3
memory operands makes little difference in that case.

One reason for memory-to-memory instructions was to avoid the need to
dedicate registers to operands, but that's not much of a problem these
days, since we have space in the CPU for lots of registers and rename
systems for them.

That may have been a consideration in the NOVA or the 6800, but in
case of the VAX with its 16 registers, that corresponds to a load/store-architecture with 18 registers, so for the VAX this is just
a minor issue.

Some time ago I thought a bit about which kind of architecture to
design with the transistor budget of the 6502, but with the RISC
lessons under the belt. One problem with a big RISC-like register set
is the instruction bandwidth. You really want to stick to 8-bit
instructions if you only have an 8-bit data bus. With a register
architecture that means 2-bits for register operands, and that means
you would need a lot of loads and stores in a load/store architecture.
So the narrow instruction word almost forces you to use implicit
register operands or small special-purpose register sets (e.g., 2
accumulators and 4 index registers, as in the 6809) rather than
general-purpose registers.

However, the VAX 11/780 does not have these restrictions. It has a
wider memory bus and it has a cache.

DEC spent a lot of time and money trying to keep VAX competitive and took
too long to accept that was impractical. That was one of the seeds of
their downfall.

Either that, or they failed to stick to VAX for the few more years
until they would have developed an OoO implementation, which would
have leveled the playing field again (see Pentium Pro). The Alpha
came out in 1992, the Pentium Pro in 1995, so if DEC had stuck to the
VAX and managed a timely OoO implementation, they would have needed to
survive just 3 years. And it seems that they lost a lot of customers
in the transition from VAX to Alpha.

Of course, the question is if the customers would have stayed with DEC
if they had continued with the VAX. The vibe at the time was that
CISCs are doomed. OTOH, Intel stuck with IA-32 and won with the P6,
and IBM stuck with the S390. But VAX customers are not S390
customers, and maybe they would have defected to Intel even if the VAX
had been there.

From what I read, the VAX 9000 was a big nail in the DEC coffin. In
hindsight they should have canceled the project early, but that does
not mean that they could not have continued with VAX (they could even
have competed with the IBM mainframes, which took quite long to gain superscalar and OoO implementations).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Fri Sep 6 14:54:24 2024

Lawrence D'Oliveiro <[email protected]> writes:

On Thu, 05 Sep 2024 21:15:20 GMT, Scott Lurndal wrote:

"Sanity" isn't an attribute associated with hardware architecture.

Says someone posting to a group full of hardware experts ...

That someone has been doing architecture since 1983; from
mainframes to high-end SoCs.

From Brett@21:1/5 to Anton Ertl on Fri Sep 6 18:19:02 2024

Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

But Vax allows all three arguments to be in memory with different pointers.
Is this sane, just a natural progression if you allow memory operands?

In combination with supporting unaligned accesses (but excluding
indirect addressing), it means that an instruction can access 6 pages,
and so the TLB (and/or TLB loader) has to be designed to support that. Likewise, the OS has to be designed to load all 6 pages into physical
RAM without evicting one of these pages again. So this kind of
architecture increases the design complexity. And I don't see a
benefit from this design.

The memory system is pipelined, once you load the first of the three
values, you do not care if that cache line is evicted while you load the second.

Caches are 16 way today, one does not worry about cache line evictions, it
just works.

Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors.

Would it? Please present empirical data. Certainly people claim that instruction sets with one-memory-address load-and-op and
read-modify-write instructions have better code density, but when you
look at the data, there are load-store instruction sets with better
code density (and by quite a lot). From <[email protected]>:

   bash   grep  gzip
 595204 107636 46744  armhf     16 regs  load/store      32-bit
 599832 101102 46898  riscv64   32 regs  load/store      64-bit
 796501 144926 57729  amd64     16 regs  ld-op ld-op-st  64-bit
 829776 134784 56868  arm64     32 regs  load/store      64-bit
 853892 152068 61124  i386       8 regs  ld-op ld-op-st  32-bit
 891128 158544 68500  armel     16 regs  load/store      32-bit
 892688 168816 64664  s390x     16 regs  ld-op ld-op-st  64-bit
1020720 170736 71088  mips64el  32 regs  load/store      64-bit
1168104 194900 83332  ppc64el   32 regs  load/store      64-bit

What is "heads and tails encoding"?

128 bit or larger packets with the fixed size opcodes on the front, and
the variable sized data and offsets packing in from the end. You get
variable length instruction density with easier, faster wide decoding. And
also using memory operands gives you another density bonus on top.

The down side is that it makes your one and two wide implementations
bigger.
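
The packing idea can be sketched in a few lines of Python. This is a toy layout for illustration only, not the format from any paper or real ISA: the 2-byte head size, the function names, and the example byte values are all assumptions, and a real format would also have to encode each tail's length in its head so the decoder can find it.

```python
BUNDLE_BYTES = 16   # one 128-bit packet
HEAD_BYTES = 2      # assumed fixed head size (opcode plus register fields)


def pack_bundle(instructions):
    """Pack (head, tail) pairs: heads grow from the front, tails from the end.

    Each head is exactly HEAD_BYTES long; each tail is a bytes object of any
    length (immediates, offsets). Raises ValueError if the packet overflows.
    Returns the packed bundle and the number of instructions in it.
    """
    bundle = bytearray(BUNDLE_BYTES)
    front = 0              # next free byte for a head
    back = BUNDLE_BYTES    # one past the last free byte for a tail
    for head, tail in instructions:
        if len(head) != HEAD_BYTES:
            raise ValueError("head must be fixed size")
        if front + HEAD_BYTES > back - len(tail):
            raise ValueError("bundle full")
        bundle[front:front + HEAD_BYTES] = head
        front += HEAD_BYTES
        back -= len(tail)
        bundle[back:back + len(tail)] = tail
    return bytes(bundle), front // HEAD_BYTES


def head_at(bundle, index):
    """Heads sit at fixed positions, so head N is found without scanning."""
    start = index * HEAD_BYTES
    return bundle[start:start + HEAD_BYTES]


# Three hypothetical instructions with 0-, 2- and 4-byte tails.
insns = [
    (b"\x01\x10", b""),
    (b"\x02\x21", b"\x34\x12"),
    (b"\x03\x32", b"\x78\x56\x34\x12"),
]
bundle, count = pack_bundle(insns)
print(count, bundle.hex())
```

Because every head is at a known offset, a wide decoder can start on all heads in parallel; only the tails need the lengths to be resolved first.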

- anton

From Brett@21:1/5 to [email protected] on Fri Sep 6 18:21:17 2024

MitchAlsup1 <[email protected]> wrote:

On Thu, 5 Sep 2024 21:03:37 +0000, Brett wrote:

Is Vax addressing sane today?

I am not talking indirect addressing, that is stupid.

It has been determined from trusted sources that add from memory and add
to memory as used in x86 are sane, and not much of a problem.

But Vax allows all three arguments to be in memory with different
pointers.

With modern compiler technology 88% of instructions need only 1 constant--thus VAX provides too many, along with providing address
modes that require sequential decoding.

Most ISAs do not provide "enough" constants, VAX provides too many.
Where "enough" covers::

SLA R9,#1,R17 // this is 1 instruction
DIV R9,#24,R17 // ibid
FDIV R8,#3.14159265358979,R17

In C++ game code there are places where you are loading from two structures
and storing into a third structure. Three offsets are needed and used.

Most commonly you need two offsets as you are building a new structure from
the old one. An example is building the polygon display-list
structures from your source structures, which contain X,Y,Z and R,G,B,A,
weights, and other info.
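
The three-offset pattern can be made concrete with a small layout example. This is a hedged Python sketch using `ctypes` to mimic C struct layout; the structure names, field order, and the `translate_y` helper are hypothetical and not taken from any real engine.

```python
import ctypes

F = ctypes.c_float


class Source(ctypes.Structure):         # hypothetical source vertex
    _fields_ = [("x", F), ("y", F), ("z", F),
                ("r", F), ("g", F), ("b", F), ("a", F), ("weight", F)]


class Transform(ctypes.Structure):      # hypothetical per-object translation
    _fields_ = [("dx", F), ("dy", F), ("dz", F)]


class DisplayVertex(ctypes.Structure):  # hypothetical display-list entry
    _fields_ = [("r", F), ("g", F), ("b", F), ("a", F),
                ("x", F), ("y", F), ("z", F)]


def translate_y(out, src, xf):
    # One source-level statement: two loads and one store, each through a
    # different base pointer with its own field offset.
    out.y = src.y + xf.dy


# Byte offsets of the destination field and the two source fields.
offsets = (DisplayVertex.y.offset, Source.y.offset, Transform.dy.offset)

src = Source(x=1.0, y=1.5, z=2.0)
xf = Transform(dy=2.25)
out = DisplayVertex()
translate_y(out, src, xf)
print(offsets, out.y)  # (20, 4, 4) 3.75
```

A three-memory-operand instruction would carry all three offsets itself; a load/store ISA spreads them over two loads and a store.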

The benchmarks you are using are out of date, using arrays instead of structures. Arrays are rare in game code, it’s all structures.

Not Fortran arrays, C++ structure spaghetti.

Is this sane, just a natural progression if you allow memory operands?

Having watched this in real time:: in 1970 we needed more/better
constants, then PDP-11 came around and we liked it, then at the end
of the decade VAX came along and we loved it, only later recognizing
that it had fallen for the second-system syndrome--becoming overly
complicated without benefit--the address space was definitely needed,
the address modes not so much.

Packing three offsets in an instruction that can be decoded reasonably
is a whole other problem…

Realistically, modern compilers have advanced to the point where
anything more than 1 constant per instruction is overkill--
harder to build and delivering no more performance.

Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors. Heads and
tails is not that easy, but it’s not x86 difficult.

Another encoding scheme is segmenting the OpCode into 2 components
1) goes to the function unit to convey the kind of calculation
to be performed,
2) goes to the forwarding logic to convey how to route bits into
calculation.

Some might consider the concatenation of both to be the OpCode
but that obscures what to do with when to do it.

From MitchAlsup1@21:1/5 to Anton Ertl on Fri Sep 6 18:31:05 2024

On Fri, 6 Sep 2024 6:05:35 +0000, Anton Ertl wrote:

[email protected] (John Dallman) writes:

Memory-to-memory instructions, in general, are hard to get to run fast
with today's processors and memory, simply because memory access times
are long enough for many register-to-register instructions to execute.

Given modern OoO technology, even VAX can fly. It does not matter
whether, say,

*a++ = *b++ + *c++;

is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.

In recent years a number of implementations have 0-cycle store-to-load forwarding, so the misconception that a memory operand is as cheap as
a register operand if only the instruction set has memory operands of
operate instructions is a little bit closer to reality. It is still a misconception, because such an implementation can read and write
several times as many registers per cycle as memory operands.

A lot of that time can be hidden with good caches and prefetchers, but if
your memory access patterns are complicated, those speedups can fail to
work.

Whether operate instructions in an instruction set have 0, 1, or 3
memory operands makes little difference in that case.

One reason for memory-to-memory instructions was to avoid the need to
dedicate registers to operands, but that's not much of a problem these
days, since we have space in the CPU for lots of registers and rename
systems for them.

That may have been a consideration in the NOVA or the 6800, but in
case of the VAX with its 16 registers, that corresponds to a load/store-architecture with 18 registers, so for the VAX this is just
a minor issue.

Some time ago I thought a bit about which kind of architecture to
design with the transistor budget of the 6502, but with the RISC
lessons under the belt. One problem with a big RISC-like register set
is the instruction bandwidth. You really want to stick to 8-bit
instructions if you only have an 8-bit data bus. With a register architecture that means 2-bits for register operands, and that means
you would need a lot of loads and stores in a load/store architecture.
So the narrow instruction word almost forces you to use implicit
register operands or small special-purpose register sets (e.g., 2
accumulators and 4 index registers, as in the 6809) rather than
general-purpose registers.

However, the VAX 11/780 does not have these restrictions. It has a
wider memory bus and it has a cache.

DEC spent a lot of time and money trying to keep VAX competitive and took
too long to accept that was impractical. That was one of the seeds of
their downfall.

Either that, or they failed to stick to VAX for the few more years
until they would have developed an OoO implementation, which would
have leveled the playing field again (see Pentium Pro). The Alpha
came out in 1992, the Pentium Pro in 1995, so if DEC had stuck to the
VAX and managed a timely OoO implementation, they would have needed to
survive just 3 years. And it seems that they lost a lot of customers
in the transition from VAX to Alpha.

Of course, the question is if the customers would have stayed with DEC
if they had continued with the VAX. The vibe at the time was that
CISCs are doomed. OTOH, Intel stuck with IA-32 and won with the P6,
and IBM stuck with the S390. But VAX customers are not S390
customers, and maybe they would have defected to Intel even if the VAX
had been there.

In my opinion, DEC was caught at an ugly time for them. They did not
have the transistor budget for a GBOoO implementation at exactly the
time they also needed a clean transition to 64-bits (even more
transistors). DEC did have the transistors for a medium OoO implementation
but unlikely the 64-bit transition.

From what I read, the VAX 9000 was a big nail in the DEC coffin. In hindsight they should have canceled the project early, but that does
not mean that they could not have continued with VAX (they could even
have competed with the IBM mainframes, which took quite long to gain superscalar and OoO implementations).

- anton

From MitchAlsup1@21:1/5 to Anton Ertl on Fri Sep 6 18:33:12 2024

On Fri, 6 Sep 2024 5:38:01 +0000, Anton Ertl wrote:

From
<[email protected]>:

   bash   grep  gzip
 595204 107636 46744  armhf     16 regs  load/store      32-bit
 599832 101102 46898  riscv64   32 regs  load/store      64-bit
 796501 144926 57729  amd64     16 regs  ld-op ld-op-st  64-bit
 829776 134784 56868  arm64     32 regs  load/store      64-bit
 853892 152068 61124  i386       8 regs  ld-op ld-op-st  32-bit
 891128 158544 68500  armel     16 regs  load/store      32-bit
 892688 168816 64664  s390x     16 regs  ld-op ld-op-st  64-bit
1020720 170736 71088  mips64el  32 regs  load/store      64-bit
1168104 194900 83332  ppc64el   32 regs  load/store      64-bit

Is there source code freely available so these could be compiled
in My 66000 ISA and placed in the list ??

From Brett@21:1/5 to Brett on Fri Sep 6 23:49:27 2024

Brett <[email protected]> wrote:

Anton Ertl <[email protected]> wrote:

Here is a PDF on heads and tails:

http://scale.eecs.berkeley.edu/papers/hat-cases2001.pdf

They went for maximum density, which is stupid. The timing critical part is
the source registers, and in a wide implementation the dest registers are
also critical. Opcodes and data/offsets only matter far later in the
pipeline.

I would do three registers and enough opcode bits to get an idea of opcode
type and size. For one and two register instructions you pack in more
opcode.

From Anton Ertl@21:1/5 to [email protected] on Sat Sep 7 05:04:34 2024

[email protected] (MitchAlsup1) writes:

On Fri, 6 Sep 2024 5:38:01 +0000, Anton Ertl wrote:

From
<[email protected]>:

   bash   grep  gzip
 595204 107636 46744  armhf     16 regs  load/store      32-bit
 599832 101102 46898  riscv64   32 regs  load/store      64-bit
 796501 144926 57729  amd64     16 regs  ld-op ld-op-st  64-bit
 829776 134784 56868  arm64     32 regs  load/store      64-bit
 853892 152068 61124  i386       8 regs  ld-op ld-op-st  32-bit
 891128 158544 68500  armel     16 regs  load/store      32-bit
 892688 168816 64664  s390x     16 regs  ld-op ld-op-st  64-bit
1020720 170736 71088  mips64el  32 regs  load/store      64-bit
1168104 194900 83332  ppc64el   32 regs  load/store      64-bit

Is there source code freely available so these could be compiled
in My 66000 ISA and placed in the list ??

Yes. I measured the binaries of the Debian packages
bash_5.2.21-2_$arch.deb, grep_3.11-4~exp1_$arch.deb, and
gzip_1.12-1_$arch.deb (sometimes with an extra suffix: bash_5.2.21-2+b1_$arch.deb gzip_1.12-1+b2_$arch.deb). So look up
these packages, and then get the corresponding source packages.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to [email protected] on Sat Sep 7 05:37:57 2024

[email protected] (MitchAlsup1) writes:

In my opinion, DEC was caught at an ugly time for them. They did not
have the transistor budget for a GBOoO implementation at exactly the
time they also needed a clean transition to 64-bits (even more
transistors). DEC did have the transistors for a medium OoO implementation
but unlikely the 64-bit transition.

For the K8 the switch from 32-bit to 64-bit was reported to have cost
5%. You were there. Are the reports wrong?

The Pentium Pro has a die size of 306mm^2 in 0.5um and 196mm^2 in
0.35um according to <https://de.wikipedia.org/wiki/Intel_Pentium_Pro>
(it's interesting that Intel produced a 0.6um, 0.5um, and 0.35um
version; apparently their lock into a specific process for a chip came
only later).

The 64-bit OoO R10000 has a die size of 298mm^2 in 0.35um (but it was
a RISC).

DEC could fabricate the 299mm^2 21164 in 0.5um, and then the 21164a
with 209mm^2 in 0.35um (in 1996), and the 21264 with 314mm^2 in 0.35um
(in 1998).

An OoO VAX should be possible in 0.35um, maybe not 4-wide as the
R10000 and the 21264, but supporting three simple or three uops from
one complex instruction per cycle like the Pentium Pro.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to Brett on Sat Sep 7 05:17:18 2024

Brett <[email protected]> writes:

Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

But Vax allows all three arguments to be in memory with different pointers.
Is this sane, just a natural progression if you allow memory operands?

In combination with supporting unaligned accesses (but excluding
indirect addressing), it means that an instruction can access 6 pages,
and so the TLB (and/or TLB loader) has to be designed to support that.
Likewise, the OS has to be designed to load all 6 pages into physical
RAM without evicting one of these pages again. So this kind of
architecture increases the design complexity. And I don't see a
benefit from this design.

The memory system is pipelined, once you load the first of the three
values, you do not care if that cache line is evicted while you load the
second.

Caches are 16 way today, one does not worry about cache line evictions, it
just works.

I did not write about caches, but yes, for TLBs a (the?) solution is
to have the ITLB to be at least 6-way.

It's unclear how pipelining should help. The VAX 11/780 was not much
pipelined and can also do the memory accesses one after the other;
this did not protect it from the complexity coming from x memory
accesses in a single instruction. E.g., all the pages accessed by an
instruction have to be in physical memory, or the architecture has to
support interruptible instructions; in any case, there is complexity.

Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors.

...

What is "heads and tails encoding"?

128 bit or larger packets with the fixed size opcodes on the front, and
the variable sized data and offsets packing in from the end. You get
variable length instruction density with easier, faster wide decoding. And
also using memory operands gives you another density bonus on top.

The only reason for VAX-style instructions is if you want to implement
the VAX instruction set, because you want to run software for the VAX
(and that reason started to vanish three decades ago and is now almost
gone). Also, decoding variable-length instructions is a solved
problem: Intel's P-cores and AMD's Zen-Zen5 cores solve it with
micro-op caches, and Intel's recent E-cores (Tremont, Gracemont,
Crestmont, Skymont) solve it by having 2-3 3-wide decoders.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sat Sep 7 07:42:07 2024

On Sat, 07 Sep 2024 05:04:34 GMT, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

Is there source code freely available so these could be compiled in My
66000 ISA and placed in the list ??

So look up these packages, and then get the corresponding source
packages.

Debian provides the “apt-get source” command for this purpose.

From Anton Ertl@21:1/5 to Anton Ertl on Sat Sep 7 08:44:40 2024

[email protected] (Anton Ertl) writes:

[email protected] (MitchAlsup1) writes:

In my opinion, DEC was caught at an ugly time for them. They did not
have the transistor budget for a GBOoO implementation at exactly the
time they also needed a clean transition to 64-bits (even more
transistors). DEC did have the transistors for a medium OoO implementation
but unlikely the 64-bit transition.

For the K8 the switch from 32-bit to 64-bit was reported to have cost
5%. You were there. Are the reports wrong?

In addition, VAX is a 32-bit architecture. It was not necessary to
extend it to 64 bits and do OoO at the same time. IBM stuck with its
31-bit addresses in s390 until 2000 when it was extended to the 64-bit z/Architecture (and the first implementation, the z900 was scalar, not
even in-order superscalar; they got superscalar in the z990 in 2003
and apparently out-of-order with the z196 in 2011; but then, IBM's
customers are probably less performance-sensitive than DEC's customers
used to be). Intel only delivered Merced (IA-64) in 2001 and
delivered Nocona (AMD64) in 2004.

Sure, there was marketing pressure to deliver 64-bit architectures
early, but I think that a competitive 32-bit OoO VAX in 1996 with an announcement of a future 64-bit extension would have been fine
wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they shrank
the 21264 to 0.25um in 1999) would have certainly made good on that
promise.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From John Dallman@21:1/5 to Anton Ertl on Sat Sep 7 16:31:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Sure, there was marketing pressure to deliver 64-bit architectures
early, but I think that a competitive 32-bit OoO VAX in 1996 with an announcement of a future 64-bit extension would have been fine
wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they
shrank the 21264 to 0.25um in 1999) would have certainly made good
on that promise.

VAX had initially been very successful for the late 1970s and early 1980s
in technical computing, because it was performance-competitive and had a
better operating system than any of the other superminis of the time.

Then multiple RISCs with Unix came along, which were cheaper, had equal
or better performance, and a satisfactory operating system. Those ate
DEC's technical computing market quite fast, but its business IT market
lasted longer.

The technical computing market was /much/ more interested in 64-bit than
the business IT market. When I got involved at a software supplier for technical computing in 1995, VAX was not performance-competitive and was
on the way out, but Alpha was the fastest thing around until Pentium Pro, stayed competitive for a couple more years, and didn't die out completely
until 2002 or so.

DEC seem to have concluded in 1988 that they could not keep VAX
performance competitive with the RISCs of the time at a competitive price. Also, 64-bit ASAP was necessary to retain their part of the technical
computing market and try to win some of it back.

Trying to hold on with VAX, in the hope technology would emerge that
would make it practical, without a clear idea of when or what that would
be is not something that shareholders will tolerate for very long.

John

From Anton Ertl@21:1/5 to John Dallman on Sat Sep 7 15:52:17 2024

[email protected] (John Dallman) writes:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Sure, there was marketing pressure to deliver 64-bit architectures
early, but I think that a competitive 32-bit OoO VAX in 1996 with an
announcement of a future 64-bit extension would have been fine
wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they
shrank the 21264 to 0.25um in 1999) would have certainly made good
on that promise.

VAX had initially been very successful for the late 1970s and early 1980s
in technical computing, because it was performance-competitive and had a better operating system than any of the other superminis of the time.

Then multiple RISCs with Unix came along, which were cheaper, had equal
or better performance, and a satisfactory operating system. Those ate
DEC's technical computing market quite fast, but its business IT market lasted longer.

The technical computing market was /much/ more interested in 64-bit than
the business IT market. When I got involved at a software supplier for technical computing in 1995, VAX was not performance-competitive and was
on the way out, but Alpha was the fastest thing around until Pentium Pro, stayed competitive for a couple more years, and didn't die out completely until 2002 or so.

DEC seem to have concluded in 1988 that they could not keep VAX
performance competitive with the RISCs of the time at a competitive price.

Not really. They were still burning lots of money on the VAX 9000, a
dead end.

They stopped doing new VAX designs after the NVAX/NVAX+ was introduced
in 1991 (Alpha was introduced in 1992). There was the NVAX++ shrink
that improved the clock rate.

Also, 64-bit ASAP was necessary to retain their part of the technical computing market and try to win some of it back.

Trying to hold on with VAX, in the hope technology would emerge that
would make it practical, without a clear idea of when or what that would
be is not something that shareholders will tolerate for very long.

They tolerated it for Intel and for IBM. Ok, IBM introduced Power for
the technical market, maybe that would have been the way to go for
DEC: continue with MIPS for the technical market, and continue with
VAX for the business market.

The Alpha with VMS and with translation to run VAX (and DecStation)
software sounds plausible, but somehow it did not work, neither for
keeping the technical nor the business customers, even though Alpha
was very competitive until the late 1990s.

Maybe the business customers would not have defected without the
VAX->Alpha transition, or maybe they would still have defected (they
were DEC customers instead of IBM customers for a reason).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to Anton Ertl on Sat Sep 7 16:37:00 2024

[email protected] (Anton Ertl) writes:

Brett <[email protected]> writes:

I did not write about caches, but yes, for TLBs a (the?) solution is
to have the ITLB to be at least 6-way.

It's unclear how pipelining should help. The VAX 11/780 was not much pipelined and can also do the memory accesses one after the other;
this did not protect it from the complexity coming from x memory
accesses in a single instruction. E.g., all the pages accessed by an instruction have to be in physical memory, or maybe support
interruptible instructions; in any case, there is complexity.

MOVC3/MOVC5 were interruptible, specifically to handle page faults
(and to reduce interrupt latency).

Given the arguments were in registers, interruptibility in that case
was just "restart with current register values". Similar to x86
REP string instructions.

From John Levine@21:1/5 to All on Sat Sep 7 21:17:42 2024

According to Anton Ertl <[email protected]>:

Given modern OoO technology, even VAX can fly. It does not matter
whether, say,

*a++ = *b++ + *c++;

is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.

It is my impression that unwinding all the side effects if the reference to "c" causes a
page fault was painful. Particularly keeping in mind that b and c could be the same
register, and if the code were this:

*a++ = *b++ - *b++

the order of increments and fetches matters.

If you split it into four ARM instructions, a fault just has to restart one of those
instructions which will have at most one register to fix up.

It is my impression that even when the Vax was designed, it was already becoming evident
that the Vax's super dense super encoded instruction set was not going to be a long term
winner. The IBM 801 project was well along in 1975 when they started designing the Vax.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Sat Sep 7 23:38:45 2024

On Sat, 07 Sep 2024 16:37:00 GMT, Scott Lurndal wrote:

MOVC3/MOVC5 were interruptable ...

Given the arguments were in registers ...

The operands were in the perfectly general descriptor format, same as with nearly every other instruction.

From MitchAlsup1@21:1/5 to Anton Ertl on Sun Sep 8 00:52:25 2024

On Fri, 6 Sep 2024 6:05:35 +0000, Anton Ertl wrote:

[email protected] (John Dallman) writes:

Memory-to-memory instructions, in general, are hard to get to run fast
with today's processors and memory, simply because memory access times
are long enough for many register-to-register instructions to execute.

Given modern OoO technology, even VAX can fly. It does not matter
whether, say,

*a++ = *b++ + *c++;

is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.

When I faced a similar set of desires, I had my movememory MM
instruction do::

for( control_register = 0;
     control_register < Rs3;
     control_register += size )
    rd[control_register] = rs1[control_register];

where size is determined by alignment, memory type (from PTE).
Thus, when a page fault or interrupt cuts the instruction in
the middle I don't have to recover any of the registers.

This also allows the instruction to be thrown over to the
Memory Function Unit and be performed in parallel with other
calculation instructions.
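The restart property described above can be sketched in C. This is only an illustrative model, not the My 66000 definition: the names `MoveState`, `move_memory` and `budget` are made up here, the transfer size is fixed at one byte instead of being chosen from alignment and memory type, and an interrupt or page fault is simulated by running out of `budget`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: `progress` plays the role of the control register.
   It records how many bytes have already been moved, so the operand
   registers (rd, rs1, count) never need to be rewound or patched. */
typedef struct {
    size_t progress;
} MoveState;

/* Moves at most `budget` bytes of the remaining work, then returns as if
   the instruction had been interrupted. Returns 1 once all `count` bytes
   have been moved, 0 if the move must be resumed later. */
int move_memory(unsigned char *rd, const unsigned char *rs1, size_t count,
                MoveState *st, size_t budget)
{
    assert(rd != NULL && rs1 != NULL && st != NULL);
    while (st->progress < count && budget > 0) {
        rd[st->progress] = rs1[st->progress];
        st->progress += 1;   /* transfer size fixed at 1 byte in this sketch */
        budget -= 1;
    }
    return st->progress == count;
}
```

Resuming is just calling `move_memory` again with the same operands and the same `MoveState`; nothing else has to be restored.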

Getting back to the originating:: It is faster these days to
write::
a[i] = b[i] + c[i];i++;
than the pre/post increment/decrement style of PDP-11.
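For comparison, the two source styles can be written side by side in C. This is a sketch with made-up function names; it shows only that both forms compute the same result and makes no measurement of which one a given compiler or core runs faster.

```c
#include <stddef.h>

/* Indexed style: one induction variable, three base pointers that
   never change during the loop. */
void add_indexed(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* PDP-11 style: three pointers, each post-incremented on every
   iteration. */
void add_postinc(int *a, const int *b, const int *c, size_t n)
{
    while (n--)
        *a++ = *b++ + *c++;
}
```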

From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 8 02:03:44 2024

On Sun, 8 Sep 2024 00:52:25 +0000, MitchAlsup1 wrote:

Getting back to the originating:: It is faster these days to write::
a[i] = b[i] + c[i];i++;
than the pre/post increment/decrement style of PDP-11.

That’s how I normally write the code in a high-level language, and have
done so since the 1980s.

I figured any decent compiler would be able to attend to the details I couldn’t be bothered thinking about. ;)

From Anton Ertl@21:1/5 to John Levine on Sun Sep 8 13:55:11 2024

John Levine <[email protected]> writes:

According to Anton Ertl <[email protected]>:

Given modern OoO technology, even VAX can fly. It does not matter
whether, say,

*a++ = *b++ + *c++;

is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7 RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.

It is my impression that unwinding all the side effects if the
reference to "c" causes a page fault was painful.

Yes, that was certainly a problem when using the implementation
techniques of the day. With an OoO implementation, if any of the
operations of the instruction causes an exception, none of the
results of any of the operations are commited. Problem solved.

Or almost: I expect that it's more complex to implement a reorder
buffer that deals with such monster instructions than one that deals
just with RISC-V instructions.

Particularly
keeping in mind that b and c could be the same register, and if the
code were this:

*a++ = *b++ - *b++

the order of increments and fetches matters.

Yes, but the decoder produces operations as defined by the
architecture. I don't know how VAX specifies the order, but a simple translation could be

# at the start, b is in p1, and a is in p6
p0 = *p1 #*b
p2 = p1+4 #b++
p3 = *p2 #*b
p4 = p2+4 #b++
p5 = p0-p3
*p6= p5 #*a = ...
p7 = p6+4 #a++
#at the end, b is in p4 and a is in p7

where p0..p7 are physical registers. If there is an exception in any
of the operations, b stays in p1 and a stays in p6.
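The "nothing is committed on an exception" behaviour can be modelled in a few lines of C. This is a toy sketch under stated assumptions, not VAX semantics or any real core: memory is a small word array, a "fault" is an out-of-range word index, pointers advance by one word instead of 4 bytes, the first `*b++` is assumed to be evaluated first, and the names `Machine` and `sub_autoinc` are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MEM_WORDS = 16 };

/* Architectural state of the toy machine. */
typedef struct {
    uint32_t mem[MEM_WORDS];
    size_t a;   /* pointer a, as a word index */
    size_t b;   /* pointer b, as a word index */
} Machine;

/* Models *a++ = *b++ - *b++. Every intermediate result goes into a local
   variable (standing in for a fresh physical register). Architectural
   state is written only at the commit point, so a fault anywhere leaves
   a, b and memory untouched. Returns false on a fault. */
bool sub_autoinc(Machine *m)
{
    assert(m != NULL);
    size_t p1 = m->b, p6 = m->a;

    if (p1 >= MEM_WORDS) return false;   /* fault on first load */
    uint32_t p0 = m->mem[p1];            /* p0 = *b */
    size_t p2 = p1 + 1;                  /* b++ */

    if (p2 >= MEM_WORDS) return false;   /* fault on second load */
    uint32_t p3 = m->mem[p2];            /* p3 = *b */
    size_t p4 = p2 + 1;                  /* b++ */

    uint32_t p5 = p0 - p3;               /* difference of the loaded values */

    if (p6 >= MEM_WORDS) return false;   /* fault on store */
    size_t p7 = p6 + 1;                  /* a++ */

    /* Commit: only now does architectural state change. */
    m->mem[p6] = p5;
    m->b = p4;
    m->a = p7;
    return true;
}
```

In hardware the store would sit in a store buffer until retirement; here that is approximated by doing the store as part of the commit step.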

It is my impression that even when the Vax was designed, it was
already becoming evident that the Vax's super dense super encoded
instruction set was not going to be a long term winner. The IBM 801
project was well along in 1975 when they started designing the Vax.

The question is how much was known about the IBM 801 at the time.
According to <https://en.wikipedia.org/wiki/OpenVMS>, the VAX project
started in April 1975. Data General's Fountainhead project (FHP)
started in July 1975. Intel started the iAPX 432 in 1975 or 1976,
Zilog started the Z8000 after recruiting Bernard Peuto in March 1976 <https://thechipletter.substack.com/p/captain-zilog-crushed-the-story-of>. Motorola started the 68000 project in late 1976, and National
Semiconductor obviously knew about the VAX when they designed the
32016 (they originally wanted to implement the VAX instruction set,
but in the end did something incompatible for legal reasons). All
these projects used CISCy designs rather than RISCy designs. FHP was
a bit special in making the writable control store an architectural
feature (so it did not have just one instruction set); the thinking
behind it is the "closing the semantic gap" idea that gave us
architectures like the VAX.

The first commercial RISCs were delivered in 1986 (including from IBM
itself). Apparently the industry took that long to absorb the ideas
from the IBM 801 and turn them into a commercial product.

It would be interesting to take a time machine to, say, 1976, to go to
any of these companies and try to convince them to do a RISCy CPU.
How hard would it be to convince them? Would technical arguments be sufficient, or would one have to wave with money (as a customer or
investor)? And how would such a CPU do in the marketplace?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to All on Sun Sep 8 17:56:55 2024

The problem with VAX was NOT that one could not put a lot of
work in a single instruction;

no,

The problem with VAX is that it made putting too much work
in a single instruction easy.

From Brett@21:1/5 to Anton Ertl on Sun Sep 8 19:15:25 2024

Anton Ertl <[email protected]> wrote:

John Levine <[email protected]> writes:

According to Anton Ertl <[email protected]>:

Given modern OoO technology, even VAX can fly. It does not matter
whether, say,

*a++ = *b++ + *c++;

is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.

It is my impression that unwinding all the side effects if the
reference to "c" causes a page fault was painful.

Yes, that was certainly a problem when using the implementation
techniques of the day. With an OoO implementation, if any of the
operations of the instruction causes an exception, none of the
results of any of the operations are commited. Problem solved.

Or almost: I expect that it's more complex to implement a reorder
buffer that deals with such monster instructions than one that deals
just with RISC-V instructions.

Particularly
keeping in mind that b and c could be the same register, and if the
code were this:

*a++ = *b++ - *b++

the order of increments and fetches matters.

Yes, but the decoder produces operations as defined by the
architecture. I don't know how VAX specifies the order, but a simple translation could be

# at the start, b is in p1, and a is in p6
p0 = *p1 #*b
p2 = p1+4 #b++
p3 = *p2 #*b
p4 = p2+4 #b++
p5 = p0-p3
*p6= p5 #*a = ...
p7 = p6+4 #a++
#at the end, b is in p4 and a is in p7

where p0..p7 are physical registers. If there is an exception in any
of the operations, b stays in p1 and a stays in p6.

It is my impression that even when the Vax was designed, it was
already becoming evident that the Vax's super dense super encoded
instruction set was not going to be a long term winner. The IBM 801
project was well along in 1975 when they started designing the Vax.

The question is how much was known about the IBM 801 at the time.
According to <https://en.wikipedia.org/wiki/OpenVMS>, the VAX project
started in April 1975. Data General's Fountainhead project (FHP)
started in July 1975. Intel started the iAPX 432 in 1975 or 1976,
Zilog started the Z8000 after recruiting Bernard Peuto in March 1976 <https://thechipletter.substack.com/p/captain-zilog-crushed-the-story-of>. Motorola started the 68000 project in late 1976, and National
Semiconductor obviously knew about the VAX when they designed the
32016 (they originally wanted to implement the VAX instruction set,
but in the end did something incompatible for legal reasons). All
these projects used CISCy designs rather than RISCy designs. FHP was
a bit special in making the writable control store an architectural
feature (so it did not have just one instruction set); the thinking
behind it is the "closing the semantic gap" idea that gave us
architectures like the VAX.

The first commercial RISCs were delivered in 1986 (including from IBM itself). Apparently the industry took that long to absorb the ideas
from the IBM 801 and turn them into a commercial product.

It would be interesting to take a time machine to, say, 1976, to go to
any of these companies and try to convince them to do a RISCy CPU.
How hard would it be to convince them? Would technical arguments be sufficient, or would one have to wave with money (as a customer or
investor)? And how would such a CPU do in the marketplace?

The IBM 801 was boring and did not have a patent moat protecting it.

We have talent, we can build something more complex that keeps out
competition that does not have our talent.

They had no idea that complexity doubling every two years was going to
crush all those complex ideas. Instead they thought more transistors would
keep letting them add ever more complex features.

They had no idea that they would crash into the clock speed wall, and if
they did that argues for more complexity in the same clock time to get more done.

They had no idea that they would be building eight wide designs.
This is the critical idea that made RISC popular. They figured out that
they had been too smart for their own good.

We are post RISC now and adding complexity that gets more work done per operation, with less tracking. Three sources and two destinations will be
the rule. Load with address update, add with shift, three way add with
logical operations is next. The FPU already has MAC.

From Lawrence D'Oliveiro@21:1/5 to Brett on Sun Sep 8 21:01:50 2024

On Sun, 8 Sep 2024 19:15:25 -0000 (UTC), Brett wrote:

We are post RISC now and adding complexity that gets more work done per operation, with less tracking. Three sources and two destinations will
be the rule. Load with address update, add with shift, three way add
with logical operations is next. The FPU already has MAC.

Does sound rather like another variant on Ivan Sutherland’s Wheel of Reincarnation, doesn’t it?

From MitchAlsup1@21:1/5 to Thomas Koenig on Sun Sep 8 21:03:31 2024

On Sun, 8 Sep 2024 20:20:26 +0000, Thomas Koenig wrote:

Brett <[email protected]> schrieb:

[VAX]

They had no idea that they would be building eight wide designs.
This is the critical idea that made RISC popular.

Nope.

The early RISC designs aimed for one instruction per cycle, achieved
maybe 0.7.

But they were competing against processors with 4+ clocks per
instruction.

S.E.L 32/87 had an IBM 360-like ISA and also achieved 0.7 I/C largely
because it was NOT microcoded for 95% of the instructions executed,
but well pipelined. When it encountered an instruction that required
microcode to complete, the HW did the first cycle and then let
microcode take over, and was ready to switch back to HW-control without
wasting a clock.

From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 8 21:09:39 2024

On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

The problem with VAX was NOT that one could not put a lot of work in a
single instruction;

no,

The problem with VAX is that it made putting too much work in a single instruction easy.

Perhaps there is also the issue of the wildly-variable instruction length.
A single VAX operand descriptor could be up to 6 bytes; I think the
instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.

While the shortest instruction could be just 1 byte.
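The arithmetic behind the 37-byte figure can be written out as a small C helper. The numbers are the ones given in the post (a 1-byte opcode and up to six operand descriptors of up to 6 bytes each), not values checked against the VAX architecture manual, and `instruction_bytes` is a name invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { OPCODE_BYTES = 1 };   /* opcode size assumed in the post */

/* Total encoded length: the opcode plus the size of each operand
   descriptor. `operand_bytes` holds one entry per operand. */
size_t instruction_bytes(const size_t *operand_bytes, size_t n_operands)
{
    assert(operand_bytes != NULL || n_operands == 0);
    size_t total = OPCODE_BYTES;
    for (size_t i = 0; i < n_operands; i++)
        total += operand_bytes[i];
    return total;
}
```

With six descriptors of 6 bytes each this gives 1 + 6 * 6 = 37; with no operands it gives the 1-byte minimum, hence the 37:1 spread.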

Even those who are talking about “post-RISC” are, I think, still in favour of RISC-style fixed instruction lengths.

From Thomas Koenig@21:1/5 to Brett on Sun Sep 8 20:20:26 2024

Brett <[email protected]> schrieb:

[VAX]

They had no idea that they would be building eight wide designs.
This is the critical idea that made RISC popular.

Nope.

The early RISC designs aimed for one instruction per cycle, achieved
maybe 0.7.

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Sep 9 00:27:39 2024

On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:

On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

The problem with VAX was NOT that one could not put a lot of work in a
single instruction;

no,

The problem with VAX is that it made putting too much work in a single
instruction easy.

Perhaps there is also the issue of the wildly-variable instruction
length.
A single VAX operand descriptor could be up to 6 bytes; I think the instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.

I have not heard an argument that the complex things in VAX ISA are
a) desirable
b) performance helpful

I (sort of) think VAX ISA as a grown up PDP-11, ignoring all the
dastardly complicated instructions it inflicted upon itself. AND
it did inflict those things upon itself.

Restricting a new-VAX-like ISA to 1-2-3 Operand and 1-result with
at most 1 exception would result in a MUCH cleaner and easier to
build machine.

While the shortest instruction could be just 1 byte.

Even those who are talking about “post-RISC” are, I think, still in favour of RISC-style fixed instruction lengths.

I, for the record, am in favor of a fixed-length instruction-specifier
followed by constants. The entirety is the instruction, while the
former minimizes your ability to shoot yourself in the foot.

From Brett@21:1/5 to Thomas Koenig on Mon Sep 9 04:38:42 2024

Thomas Koenig <[email protected]> wrote:

Brett <[email protected]> schrieb:

[VAX]

They had no idea that they would be building eight wide designs.
This is the critical idea that made RISC popular.

Nope.

The early RISC designs aimed for one instruction per cycle, achieved
maybe 0.7.

The next step up for a CPU has one ALU and one load/store unit, giving
above one IPC. This is what one of the PlayStation CPUs did.

The yellow brick road to eight way was apparent with the first RISC architecture, even if not fully implemented due to time and die size.

From Lawrence D'Oliveiro@21:1/5 to Brett on Mon Sep 9 06:21:08 2024

On Mon, 9 Sep 2024 04:38:42 -0000 (UTC), Brett wrote:

Thomas Koenig <[email protected]> wrote:

The early RISC designs aimed for one instruction per cycle, achieved
maybe 0.7.

The next step up for a CPU has one ALU and one load/store unit, giving
above one IPC. This is what one of the PlayStation CPUs did.

Those were the ones using PowerPC chips in the 1990s, I think it was.
IBM’s POWER claimed superscalar performance right from its launch in, what was it, 1989.

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Sep 9 06:50:17 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Mon, 9 Sep 2024 04:38:42 -0000 (UTC), Brett wrote:

The next step up for a CPU has one ALU and one load/store unit, giving
above one IPC. This is what one of the PlayStation CPUs did.

Those were the ones using PowerPC chips in the 1990s, I think it was.

The first PlayStation used a 33MHz R3000 (single-issue).

The PS2 released in 2000 used a 299MHz MIPS R5900-based core, two-way in-order superscalar.

The PS3 released in 2006 used the PowerPC-based Cell broadband engine.

IBM’s POWER claimed superscalar performance right from its launch in, what was it, 1989.

1990.

It's interesting that it took so long to go to dual-issue with the
same number of functional units. I guess that the early RISCs were bandwidth-limited, and only once the L1 cache(s) came on-chip, was
there enough bandwidth to make superscalarity actually pay off.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Sep 9 08:03:00 2024

Lawrence D'Oliveiro <[email protected]d> writes:

Perhaps there is also the issue of the wildly-variable instruction length.
A single VAX operand descriptor could be up to 6 bytes; I think the instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.

The regularity of the VAX operand formats may actually help build the
decoder: Decode your byte stream as possible operands, and then let
the instruction decoder pick the real operands from the potential
operands.

Even those who are talking about “post-RISC” are, I think, still in favour of RISC-style fixed instruction lengths.

Even among the RISCs, fixed instruction lengths are not universal: ARM
T32 has two widths, as has RV64GC (and RISC-V has provisions for
additional lengths, but AFAIK nobody uses them yet); there was also
ROMP and MIPS16.

Interestingly, despite their ample experience with T32, ARM went
fixed-length with A64, but then the market for A64 is probably not as
code-size sensitive as that for T32.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 12:32:34 2024

On Mon, 09 Sep 2024 08:03:00 GMT
[email protected] (Anton Ertl) wrote:

Interestingly, despite their ample experience with T32, ARM went
fixed-length with A64, but then the market for A64 is probably not as code-size sensitive as that for T32.

- anton

ARMv9-M is still T32 which probably should tell us something.
Or maybe not.

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Sep 9 09:52:53 2024

On Mon, 09 Sep 2024 08:03:00 GMT, Anton Ertl wrote:

... ARM T32 has two widths, as has RV64GC (and RISC-V has provisions for additional lengths, but AFAIK nobody uses them yet); there was also ROMP
and MIPS16.

That’s all very well. But none of them go to the extremes that VAX did:
37:1, remember.

From Anton Ertl@21:1/5 to Michael S on Mon Sep 9 11:33:37 2024

Michael S <[email protected]> writes:

ARMv9-M is still T32 which probably should tell us something.

It tells me that ARM sees a market (covered by the M profile) where
4GB of address space is sufficient and where code size is relevant.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Michael S@21:1/5 to Anton Ertl on Mon Sep 9 15:15:59 2024

On Mon, 09 Sep 2024 11:33:37 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

ARMv9-M is still T32 which probably should tell us something.

It tells me that ARM sees a market (covered by the M profile) where
4GB of address space is sufficient and where code size is relevant.

- anton

I think that the sad (for Arm) truth is that ARMv8-M was unnecessary
except for tiny niches and ARMv9-M even more so. ARMv7-M works fine for the overwhelming majority of users.
So even if they somehow invent a fixed-width 32-bit architecture that
matches T32 in code density and then implement it in a core that matches Cortex-M4 in performance per clock, but occupies 10% smaller area and
clocks 10% faster on the same process, their major licensees (STMicro,
TI, NXP) wouldn't be willing to pay 1 cent more for that core than what
they are currently paying for M4.

From Brett@21:1/5 to [email protected] on Mon Sep 9 19:38:52 2024

MitchAlsup1 <[email protected]> wrote:

On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:

On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

The problem with VAX was NOT that one could not put a lot of work in a
single instruction;

no,

The problem with VAX is that it made putting too much work in a single
instruction easy.

Perhaps there is also the issue of the wildly-variable instruction
length.
A single VAX operand descriptor could be up to 6 bytes; I think the
instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.

I have not heard an argument that the complex things in VAX ISA are
a) desirable
b) performance helpful

Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly predicted branch instructions.

The future of programming languages is type safe with checks, you need to
get on that bandwagon early.

I (sort of) think VAX ISA as a grown up PDP-11, ignoring all the
dastardly complicated instructions it inflicted upon itself. AND
it did inflict those things upon itself.

Restricting a new-VAX-like ISA to 1-2-3 Operand and 1-result with
at most 1 exception would result in a MUCH cleaner and easier to
build machine.

While the shortest instruction could be just 1 byte.

Even those who are talking about “post-RISC” are, I think, still in
favour of RISC-style fixed instruction lengths.

I, for the record, am in favor of a fixed-length instruction-specifier followed by constants. The entirety is the instruction, while the
former minimizes your ability to shoot yourself in the foot.

From John Levine@21:1/5 to All on Mon Sep 9 20:34:39 2024

According to Anton Ertl <[email protected]>:

Lawrence D'Oliveiro <[email protected]d> writes:

Perhaps there is also the issue of the wildly-variable instruction length. A single VAX operand descriptor could be up to 6 bytes; I think the instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.

The regularity of the VAX operand formats may actually help build the decoder: Decode your byte stream as possible operands, and then let
the instruction decoder pick the real operands from the potential
operands.

Urrgh. Some of those bogus operands are indirect indexed auto-increment, so you are going to be throwing away a whole lot of work.

Compare that to zSeries, where even after 50 years of sticking new instructions into the holes in the S/360 instruction set, it can still tell the length of the
instruction from the first two bits and the operands from the first byte.
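The length rule mentioned here is simple enough to show as a C function. In the S/360 family the two high-order bits of the first instruction byte give the length: 00 means 2 bytes, 01 or 10 means 4 bytes, and 11 means 6 bytes. The function name is made up for this sketch.

```c
#include <stdint.h>

/* Length in bytes of an S/360-family instruction, decoded from the two
   high-order bits of its first byte: 00 -> 2, 01 or 10 -> 4, 11 -> 6. */
unsigned s360_instruction_length(uint8_t first_byte)
{
    switch (first_byte >> 6) {
    case 0:  return 2;   /* e.g. RR format */
    case 3:  return 6;   /* e.g. SS format */
    default: return 4;   /* e.g. RX, RS, SI formats */
    }
}
```

A decoder can therefore find the start of the next instruction without looking at any operand bytes, which is the contrast with VAX being drawn above.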

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From MitchAlsup1@21:1/5 to Brett on Mon Sep 9 20:44:00 2024

On Mon, 9 Sep 2024 19:38:52 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:

On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:

The problem with VAX was NOT that one could not put a lot of work in a single instruction;

no,

The problem with VAX is that it made putting too much work in a single instruction easy.

Perhaps there is also the issue of the wildly-variable instruction
length.
A single VAX operand descriptor could be up to 6 bytes; I think the
instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.

I have not heard an argument that the complex things in VAX ISA are
a) desirable
b) performance helpful

Speaking of complex things, have you looked at Swift output, as it
checks
all operations for overflow?

You could add an exception type for that, saving huge numbers of
correctly predicted branch instructions.

Unlike RISC-V and many others, My 66000 has maskable integer exceptions.
An exception can be routed directly to a signal handler of the current application (without a trip through GuestOS). GuestOS just has to
configure where exceptions are delivered.

The future of programming languages is type safe with checks, you need
to get on that bandwagon early.

This would/will happen faster when type-safe with checks are well
represented in benchmarks used to measure various architectural
things, and the exceptions and checks are actually utilized showing
performance degradation of lesser endowed architectures.

From Anton Ertl@21:1/5 to Brett on Tue Sep 10 07:43:53 2024

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly predicted branch instructions.

The future of programming languages is type safe with checks, you need to
get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
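
A rough model of the add/addu distinction, assuming 32-bit registers. The names `OverflowTrap`, `wrap32`, `add` and `addu` are illustrative; the Python exception merely stands in for the hardware integer-overflow trap.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


class OverflowTrap(Exception):
    """Stands in for the hardware integer-overflow exception."""


def wrap32(x):
    """Reduce x modulo 2**32 and reinterpret it as signed two's complement."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


def addu(a, b):
    """Modulo arithmetic: the sum silently wraps around."""
    return wrap32(a + b)


def add(a, b):
    """Trapping arithmetic: raise instead of delivering a wrapped result."""
    s = a + b
    if not INT32_MIN <= s <= INT32_MAX:
        raise OverflowTrap(f"{a} + {b} does not fit in 32 bits")
    return s
```

For in-range operands both functions return the same value; they differ only when the true sum leaves the 32-bit signed range.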

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.

RISC-V, another descendent of MIPS, has an add instruction that
corresponds to MIPS' addu, and no instruction that corresponds to
MIPS' add. They obviously don't think that there's a bandwagon. Note
that RISC-V was designed after Swift was introduced.

IA-32 got on that bandwagon early. It has a single-byte instruction
trapv that traps if the overflow flag is set. The AMD64 instruction
set is very similar to the IA-32 instruction set, but one of the few differences is that the trapv instruction was eliminated, and the
encoding replaced with a REX prefix. The AMD64 architects obviously
don't think that there is a bandwagon.

Apple has been designing their own silicon for a while, and they have introduced Swift as their language in 2010. Yet they have not
switched to an architecture like MIPS or Alpha, nor have they designed
their own architecture or architecture extension that includes
instructions like Alpha's addv or IA-32's trapv. Instead, they
switched to ARM A64, which does not have such features, after
introducing Swift in 2010. They obviously don't think that there is
such a bandwagon, either.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Niklas Holsti@21:1/5 to All on Tue Sep 10 11:13:11 2024

On 2024-09-09 23:44, MitchAlsup1 wrote:

On Mon, 9 Sep 2024 19:38:52 +0000, Brett wrote:

[snip]

The future of programming languages is type safe with checks, you need
to get on that bandwagon early.

This would/will happen faster when type-safe with checks are well
represented in benchmarks used to measure various architectural
things, and the exceptions and checks are actually utilized showing performance degradation of lesser endowed architectures.

Not all the type-safe, checking languages are equal in that respect. In
some languages, and I am thinking of Ada, the language design and the
favored programming styles work to reduce the number of run-time checks required.

In the Ada case, the ability to declare array types with
programmer-chosen index types with bounded range, such as range-bounded integers or enumerations, means that the compiler can avoid indexing
checks when the (sub)type of the index is known at compile time to fit
within the index range of the array.

It is also helpful that loop counters in Ada are also (sub)typed in the
same way, which provides compile-time information on their range with
respect to the index range of arrays accessed in the loop, even if the
bounds of the range are not known at compile time.

As a result of these language features, the matching programming styles,
and the attention paid to these issues by the compilers, the number of
run-time checks executed by an Ada program and their effect on execution
time are often surprisingly small.

So, to demonstrate the usefulness of HW support for checks, the
benchmarks should use languages that require checks but do not have the features that let programmers reduce the number of checks by suitable programming styles. If Ada were used for the benchmarks, the programmers
would have to use an abnormal, pessimal style that defeats the
compiler's ability to elide checks.

From Anton Ertl@21:1/5 to John Levine on Tue Sep 10 08:05:07 2024

John Levine <[email protected]> writes:

According to Anton Ertl <[email protected]>:

The regularity of the VAX operand formats may actually help build the decoder: Decode your byte stream as possible operands, and then let
the instruction decoder pick the real operands from the potential
operands.

Urrgh. Some of those bogus operands are indirect indexed auto-increment, so you
are going to be throwing away a whole lot of work.

Yes, AFAIK that's how multi-instruction decoding for variable-width
instruction sets works these days: Decode at every potential
instruction start, then select those decoded instructions that are at
actual instruction boundaries, and throw the others away.
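
The decode-everywhere-then-select scheme can be sketched as follows. The encoding is a made-up toy (the low two bits of the first byte give the length minus one), not VAX or AMD64, and `toy_length` and `decode_window` are hypothetical names. In hardware the first step runs in parallel for all offsets; here it is just a list comprehension.

```python
def toy_length(first_byte):
    """Toy encoding (an assumption): low two bits hold length - 1, so 1..4 bytes."""
    return (first_byte & 0x03) + 1


def decode_window(window):
    """Return (actual instruction starts, number of discarded decodes)."""
    # Step 1: speculatively decode a length at every byte offset,
    # as if each offset were the start of an instruction.
    candidates = [toy_length(b) for b in window]

    # Step 2: starting from the known boundary at offset 0, follow the
    # chain of lengths and keep only the decodes on real boundaries.
    starts = []
    pos = 0
    while pos < len(window) and pos + candidates[pos] <= len(window):
        starts.append(pos)
        pos += candidates[pos]

    discarded = len(window) - len(starts)
    return starts, discarded


window = bytes([0x01, 0xFF, 0x02, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00])
starts, discarded = decode_window(window)
```

In this window four of the ten speculative decodes fall on real boundaries and six are thrown away, which is the wasted work being discussed.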

Compare that to zSeries, where even after 50 years of sticking new instructions
into the holes in the S/360 instruction set, it can still tell the length of the
instruction from the first two bits and the operands from the first byte.

Good for sequential decoding, and maybe it makes parallel decoding
cheaper (but OTOH, the first superscalar S/360 descendent came out in
2000, 7 years after the superscalar Pentium, and the first OoO S/360
descendent lagged the Pentium Pro by 14 years or so), but as the IIRC
6-wide decoder of Alder Lake demonstrates, hardware designers are able
to deal with instruction sets that do not have such nice properties:
an AMD64 instruction can have a large number of prefixes, and I think
that the encoding of indexed addressing is not announced in the first non-prefix instruction byte, either.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Michael S@21:1/5 to Anton Ertl on Tue Sep 10 12:08:40 2024

On Tue, 10 Sep 2024 07:43:53 GMT
[email protected] (Anton Ertl) wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it
checks all operations for overflow?

You could add an exception type for that, saving huge numbers of
correctly predicted branch instructions.

The future of programming languages is type safe with checks, you
need to get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic).

Trapping variants were deprecated in Release 6 of MIPS ISA.

It has been abandoned and replaced by RISC-V several years ago.

I don't think that "replaced by RISC-V" is a correct description of proceedings.

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.

RISC-V, another descendent of MIPS, has an add instruction that
corresponds to MIPS' addu, and no instruction that corresponds to
MIPS' add. They obviously don't think that there's a bandwagon. Note
that RISC-V was designed after Swift was introduced.

IA-32 got on that bandwagon early. It has a single-byte instruction
trapv that traps if the overflow flag is set. The AMD64 instruction
set is very similar to the IA-32 instruction set, but one of the few differences is that the trapv instruction was eliminated, and the
encoding replaced with a REX prefix. The AMD64 architects obviously
don't think that there is a bandwagon.

Apple has been designing their own silicon for a while, and they have introduced Swift as their language in 2010. Yet they have not
switched to an architecture like MIPS or Alpha, nor have they designed
their own architecture or architecture extension that includes
instructions like Alpha's addv or IA-32's trapv. Instead, they
switched to ARM A64, which does not have such features, after
introducing Swift in 2010. They obviously don't think that there is
such a bandwagon, either.

- anton

How does Intel MPX fit in your picture?

From Michael S@21:1/5 to Anton Ertl on Tue Sep 10 12:35:51 2024

On Tue, 10 Sep 2024 08:05:07 GMT
[email protected] (Anton Ertl) wrote:

John Levine <[email protected]> writes:

Compare that to zSeries, where even after 50 years of sticking new instructions into the holes in the S/360 instruction set, it can
still tell the length of the instruction from the first two bits and
the operands from the first byte.

Good for sequential decoding, and maybe it makes parallel decoding
cheaper (but OTOH, the first superscalar S/360 descendent came out in
2000, 7 years after the superscalar Pentium, and the first OoO S/360 descendent lagged the Pentium Pro by 14 years or so),

Wikipedia says that ES/9000 Model 900 had superscalar OoO CPU in 1991.
This line was abandoned in favor of simpler 'CMOS' line in mid 90s, but according to the same Wiki article, CMOS line didn't match Model 900
in performance until 9672-RY5 near the end of 1997.

but as the IIRC
6-wide decoder of Alder Lake demonstrates, hardware designers are able
to deal with instruction sets that do not have such nice properties:
an AMD64 instruction can have a large number of prefixes, and I think
that the encoding of indexed addressing is not announced in the first non-prefix instruction byte, either.

- anton

1. The longest AMD64 instruction is much shorter than the longest VAX instruction
2. On AMD64 instruction length information is continuous. Yes, there
could be multiple prefixes and it makes things ugly, but I would think
that in practice you very rarely need to look at more than 5 leading
bytes in order to figure out the length of the tail. And in practice
it's probably o.k. when instructions with more than 3 prefixes are decoded
slowly.

From Anton Ertl@21:1/5 to Michael S on Tue Sep 10 16:32:05 2024

Michael S <[email protected]> writes:

On Tue, 10 Sep 2024 08:05:07 GMT
[email protected] (Anton Ertl) wrote:

Good for sequential decoding, and maybe it makes parallel decoding
cheaper (but OTOH, the first superscalar S/360 descendent came out in
2000, 7 years after the superscalar Pentium

Correction: The first superscalar CMOS S/360 descendent, the z990 came
out in 2003, a decade after the Pentium, but see below about bipolar
CPUs.

and the first OoO S/360
descendent lagged the Pentium Pro by 14 years or so),

Wikipedia says that ES/9000 Model 900 had superscalar OoO CPU in 1991.

Reading up on this, the article says even more:

|models 900 and 820 had full out-of-order execution for both integer
|and floating-point units, with precise exception handling, and a fully |superscalar pipeline.

So these are probably the first proper OoO processors (while the
360-91 was an interesting prototype, it was too limited to count as a
proper OoO CPU). The 900 ran at 111MHz, and the 1994-vintage 9X2 ran
at 141MHz and was rated at 468MIPS (for 10 CPUs), i.e. each 141MHz CPU
at 47MIPS. So that would be an IPC of 1/3, which is somewhat
disappointing even for an early superscalar OoO machine. But then I
don't know how IBM produces its MIPS ratings.

This line was abandoned in favor of simpler 'CMOS' line in mid 90s, but according to the same Wiki article, CMOS line didn't match Model 900
in performance until 9672-RY5 near the end of 1997.

A single-issue in-order CPU running at 370MHz with comparable per-CPU performance (and also 1-10CPUs); apparently 49MIPS for one CPU and
447MIPS for 10. Again, the IPC seems abysmal, but who knows how IBM
measures MIPS. Still, I expect that a contemporaneous Pentium II
outperforms this 9672 by a lot on, say, SPEC95, just because of the
basic technology and clock rate.
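
The IPC estimates above follow from dividing the per-CPU MIPS rating by the clock rate in MHz. A small worked version, taking the quoted ratings at face value (how IBM derives its MIPS numbers is unknown here, so the results are only as meaningful as those ratings); the helper name `ipc` is made up:

```python
def ipc(total_mips, cpus, clock_mhz):
    """Instructions per cycle = (MIPS per CPU) / (MHz)."""
    return (total_mips / cpus) / clock_mhz


ipc_9x2 = ipc(468, 10, 141)   # ES/9000 9X2: roughly 0.33, i.e. about 1/3
ipc_9672 = ipc(49, 1, 370)    # 9672 single CPU: roughly 0.13
```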

It seems that during the late 1990s, IBM was not particularly
interested in mainframe per-CPU performance.

1. The longest AMD64 instruction is much shorter than the longest VAX instruction
2. On AMD64 instruction length information is continuous. Yes, there
could be multiple prefixes and it makes things ugly, but I would think
that in practice you very rarely need to look at more than 5 leading
bytes in order to figure out the length of the tail. And in practice
it's probably o.k. when instructions with more than 3 prefixes are decoded slowly.

Yes, you can always choose to take slow paths on rare cases, but you
can also do that for a VAX decoder. I don't expect that the 37 bytes
(or whatever it is) is the common case.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to Michael S on Tue Sep 10 15:42:25 2024

Michael S <[email protected]> writes:

On Tue, 10 Sep 2024 07:43:53 GMT
[email protected] (Anton Ertl) wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it
checks all operations for overflow?

You could add an exception type for that, saving huge numbers of
correctly predicted branch instructions.

The future of programming languages is type safe with checks, you
need to get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic).

Trapping variants were deprecated in Release 6 of MIPS ISA.

Interesting. So they abandoned the supposed bandwagon in 2014, after
Swift was introduced.

What they did add in the same release are branch instructions that
check whether the sum of two signed integers overflows. That's useful
for languages with arbitrarily large integers (also known as Big
Integers or Bignums), while the trapping adds are too cumbersome for
that purpose.
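
A sketch of why a branch on overflow suits bignum languages better than a trap: the overflow case is an ordinary, expected path that continues in the slow bignum routine rather than an error. The 32-bit word size, the tagged-tuple representation and the names `sum_overflows` and `generic_add` are assumptions for illustration; Python integers are already arbitrary-precision, so this only models the control flow.

```python
WORD_MIN, WORD_MAX = -2**31, 2**31 - 1


def sum_overflows(a, b):
    """Models the branch condition: would a + b leave the signed word range?"""
    return not WORD_MIN <= a + b <= WORD_MAX


def generic_add(a, b):
    """Add two machine-word integers, promoting to a bignum on overflow."""
    if sum_overflows(a, b):
        # Slow path: reached by a plain conditional branch, no trap handler.
        return ("bignum", a + b)
    # Fast path: the result still fits in one machine word.
    return ("fixnum", a + b)
```

With a trapping add, the same promotion would have to be done inside an exception handler that identifies the faulting instruction and its operands, which is the cumbersome part.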

And it seems to me that Swift with its trapping arithmetic is a blast
from the past (with Algol, Pascal etc. usually erroring out on
overflow, and Ada raising an exception (with famously explosive
consequences for the Ariane 5)), and that the trend in safe languages
is to eliminate integer overflow by allowing arbitrarily large
integers.

It has been abandoned and replaced by RISC-V several years ago.

I don't think that "replaced by RISC-V" is a correct description of proceedings.

I don't know anything about the proceedings, just what Wikipedia tells
me:

|In March 2021, MIPS announced that the development of the MIPS
|architecture had ended as the company is making the transition to
|RISC-V.

Sounds like a replacement to me.

How does Intel MPX fit in your picture?

I don't know anything about MPX beyond what Wikipedia says, which
includes:

|In practice, there have been too many flaws discovered in the design
|for it to be useful, and support has been deprecated or removed from
|most compilers and operating systems.

Maybe a less flawed concept would have been more successful, but
apparently MPX has had no such successor.

Overall, languages that perform bounds checking seem on the rise,
unlike languages that trap on signed integer overflow, so the window
of opportunity for architectural support gets bigger.

However, the question is if there is architectural support that is significantly better than what can be done with the current
architectural features. SPARC has architectural tagging support for
LISP, yet a comp.arch poster who worked on a major LISP implementation
(Franz LISP IIRC) reported that their LISP implementation does not use
these instructions.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Niklas Holsti@21:1/5 to Anton Ertl on Tue Sep 10 20:55:37 2024

On 2024-09-10 18:42, Anton Ertl wrote:

And it seems to me that Swift with its trapping arithmetic is a
blast from the past

The dominance of C and its descendants has corrupted the world of
programming on this point. :-(

Fortunately, among up-and-coming new languages Rust is in the
overflow-checking camp, at least in DEBUG-mode compilations.

(with Algol, Pascal etc. usually erroring out on
overflow, and Ada raising an exception (with famously explosive
consequences for the Ariane 5)),

A bit misleading, as so often when the Ariane 501 incident is brought up.

The Ariane 501 failure was a HW trap on an instruction converting a floating-point value into a 16-bit integer, not an Ada exception.

As I understand it, the analogous C code could have used the same
instruction and failed in the same way (an example of Undefined Behavior).

The original designers of that Ada SW had carefully analysed the
possible ranges of the numbers and correctly concluded that an overflow
could not happen if the HW operated correctly. Correctly, that is, for
the Ariane 4, but not for the Ariane 5 where the SW was sloppily reused
through multiple process skimps and failures.

Several other similar conversions were protected with programmed range
checks and suitable alternative code paths, but the analysis showed that
this particular conversion did not need such checks for the Ariane 4.

One of the process failures was that the SW was never tested with the
Ariane 5 launch trajectory, which would have revealed the error.

If the SW had really used Ada exceptions (difficult as the processor was
quite maxed out) a reasonable SW designer would have added an exception
handler and could have made this part of the SW fail gracefully. But
the mission would probably not have been saved because the failure investigation found other potentially fatal flaws in the systems,
pointing to more process failures.

and that the trend in safe languages is to eliminate integer overflow
by allowing arbitrarily large integers.

That is not practical in a real-time, resource-limited context, at least
not without a large over-provision of computing resources. And sending
the resulting over-large integer to a HW register will still fail in
some way if the value is too large for the HW to accept.

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Sep 10 23:46:18 2024

On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.

Mainframes were never about CPU performance. They were about high I/O throughput for efficient batch operations.

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Sep 10 23:51:20 2024

On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

[MIPS] has been abandoned and replaced by RISC-V several years ago.

I’m not so sure the MIPS architecture has been “abandoned”. Last I heard, it was still shipping hundreds of millions of chips per year. Also those Chinese supers run LoongArch, which is some sort of MIPS derivative.

It is true that there is no more money to be made from licensing any “MIPS IP”, which is why Imagination Tech, the inheritors of whatever was left of MIPS the commercial operation, have switched to being a RISC-V-centric
company now.

From Lawrence D'Oliveiro@21:1/5 to Niklas Holsti on Tue Sep 10 23:47:53 2024

On Tue, 10 Sep 2024 11:13:11 +0300, Niklas Holsti wrote:

Not all the type-safe, checking languages are equal in that respect. In
some languages, and I am thinking of Ada, the language design and the
favored programming styles work to reduce the number of run-time checks required.

True, and this was also demonstrated with Pascal before Ada.

It’s a point that those who are accustomed to program in C seem to find it difficult to appreciate.

From John Levine@21:1/5 to Anton Ertl on Wed Sep 11 01:56:16 2024

It appears that Anton Ertl <[email protected]> said:

The future of programming languages is type safe with checks, you need to get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Thomas Koenig@21:1/5 to John Levine on Wed Sep 11 05:36:46 2024

John Levine <[email protected]> schrieb:

It appears that Anton Ertl <[email protected]> said:

The future of programming languages is type safe with checks, you need to get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.

With trapping, I understand. Without trapping - what is the
difference on a two's complement machine?

From Anton Ertl@21:1/5 to Thomas Koenig on Wed Sep 11 06:24:55 2024

Thomas Koenig <[email protected]> writes:

John Levine <[email protected]> schrieb:

S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.

With trapping, I understand. Without trapping - what is the
difference on a two's complement machine?

Possibly in the flags set. The S/360 has a pretty perverse flags
architecture.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Michael S@21:1/5 to Anton Ertl on Wed Sep 11 11:07:53 2024

On Tue, 10 Sep 2024 15:42:25 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

How does Intel MPX fit in your picture?

I don't know anything about MPX beyond what Wikipedia says, which
includes:

|In practice, there have been too many flaws discovered in the design
|for it to be useful, and support has been deprecated or removed from
|most compilers and operating systems.

Maybe a less flawed concept would have been more successful, but
apparently MPX has had no such successor.

Overall, languages that perform bounds checking seem on the rise,
unlike languages that trap on signed integer overflow, so the window
of opportunity for architectural support gets bigger.

Yes, I posted my questions without sufficient thinking.
Intel MPX is about array bound checks which is a separate issue from
catching signed overflow.
The only commonality is that in both cases there is a potential for
significant saving in code size if checks are handled by exception
instead of conditional branch.

However, the question is if there is architectural support that is significantly better than what can be done with the current
architectural features. SPARC has architectural tagging support for
LISP, yet a comp.arch poster who worked on a major LISP implementation
(Franz LISP IIRC) reported that their LISP implementation does not use
these instructions.

- anton

From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed Sep 11 11:54:25 2024

On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

[MIPS] has been abandoned and replaced by RISC-V several years ago.

I’m not so sure the MIPS architecture has been “abandoned”. Last I heard, it was still shipping hundreds of millions of chips per year.

Care to point to the source of this claim? Two main suppliers of MIPS
silicon in this century are Microchip and Cavium (now owned by Marvell).

According to my understanding Microchip's MIPS-based PIC32 line was
never as popular as their other offerings.

In case of Marvell, I no longer see MIPS-based Octeon III chips in the
product section of their Web site. Which, I'd guess, means that in order
to buy one you have to be an existing customer. Since the market that
Octeon III was playing in is rather dynamic, I don't expect that those existing customers buy very old chips in tens of millions. Likely not
even in single-digit millions.

Also those Chinese supers run LoongArch, which is some sort of MIPS derivative.

Sort of.
And the majority of my FPGA designs run Nios2 soft cores that are also
'sort of MIPS'. But they are *not* MIPS.

It is true that there is no more money to be made from licensing any
“MIPS IP”, which is why Imagination Tech, the inheritors of whatever
was left of MIPS the commercial operation, have switched to being a RISC-V-centric company now.

From David Brown@21:1/5 to Michael S on Wed Sep 11 11:40:23 2024

On 11/09/2024 10:54, Michael S wrote:

On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

[MIPS] has been abandoned and replaced by RISC-V several years ago.

I’m not so sure the MIPS architecture has been “abandoned”. Last I
heard, it was still shipping hundreds of millions of chips per year.

Care to point to the source of this claim? Two main suppliers of MIPS
silicon in this century are Microchip and Cavium (now owned by Marvell).

According to my understanding Microchip's MIPS-based PIC32 line was
never as popular as their other offerings.

IMHO a major reason for that is Microchip's insane licensing policy for
their development tools - although their compilers are just a minor modification of standard gcc, you have to pay huge amounts if you want
to use the full features of the compiler. (At least now you can enable
/some/ optimisation without a paid license.) It is not even possible to
see from the release notes or documentation what version of gcc is
provided, though my guess is that it is pretty old (the documentation
describes "-std" options up to C++14).

The other reason, of course, was the name - "PIC" is associated with
brain-dead microcontrollers with terrible C tools and which many people
program in assembly. They are also renowned for being very solid,
coming in relatively amateur-friendly packages, and for never going out
of production.

For some time now, Microchip's PIC32 line has all had ARM Cortex-M cores
in new devices, based on SAM parts they got when they purchased Atmel.
But they still sell the existing MIPS microcontrollers.

Sort of.
And the majority of my FPGA designs run Nios2 soft cores that are also
'sort of MIPS'. But they are *not* MIPS.

I thought the NIOS 2 was more "MIPS inspired" than "sort of MIPS". (And
the original NIOS was "SPARC inspired".) They have now jumped to NIOS V,
which is RISC-V (actual RISC-V).

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Sep 11 09:32:04 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.

Mainframes were never about CPU performance.

The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000
machines if CPU performance was considered unimportant at the time.

It's an interesting question why they did not follow up their
superscalar OoO ECL implementations with a superscalar OoO CMOS
implementation in addition to the scalar in-order 9672. Here are
three speculations of what happened:

1) They had such a project and it did not work out, and the "never
about CPU performance" spin is a sour-grapes type rationalization of
the result.

2) They expected their mainframe market to be eaten up by the Unix
and/or WNT markets, and did not want to invest a lot into the
development of mainframe CPUs. Again, the "never about CPU
performance" spin is a sour-grapes type rationalization of the result.

3) They had decided that they had a captive market in the mainframes,
with software that was written for lower-powered CPUs, that the rapid
CMOS advances in the 1990s would give them enough of a performance
push to satisfy the needs of this software, so no more sophisticated
CPU designs than the 9672 were necessary (and the G5 and G6 of the 9672
indeed gave them more CPU power than ever). The "never about CPU
performance" reflected their position at the time and also served to
placate anyone who pointed out that the per-CPU performance was
inferior to that of other CPUs of the time, including IBM's own
RS/6000 line.

Eventually they seem to have decided that per-CPU performance is
important after all, with the superscalar z990 in 2003 and the OoO
z196 in 2010. But of course Dennard scaling was slowing down around
2003, so they needed to increase IPC to increase per-CPU performance.
And even if they don't need more per-CPU performance than other
architectures, they apparently do need advances over earlier
generations of their own machines and maybe to discourage competition
from emulators or startups.

They were about high I/O
throughput for efficient batch operations.

Batch operations? I wonder how much CPU time on mainframes in the
1990s and today is spent on that compared to interactive applications
such as online transaction processing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Anton Ertl on Wed Sep 11 07:48:36 2024

On 9/11/2024 2:24 AM, Anton Ertl wrote:

Thomas Koenig <[email protected]> writes:

John Levine <[email protected]> schrieb:

S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.

With trapping, I understand. Without trapping - what is the
difference on a two's complement machine?

Possibly in the flags set. The S/360 has a pretty perverse flags architecture.

In the EC PSW from my yellow card
bit 20 fixed-point overflow mask
bit 21 decimal overflow mask
bit 22 exponent overflow mask
bit 23 significance mask

BC mode was bits 36 thru 39.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed Sep 11 15:53:06 2024

According to Thomas Koenig <[email protected]>:

John Levine <[email protected]> schrieb:

It appears that Anton Ertl <[email protected]> said:

The future of programming languages is type safe with checks, you need to
get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.

With trapping, I understand. Without trapping - what is the
difference on a two's complement machine?

Different condition codes. Signed add:

0 Sum is zero
1 Sum is less than zero
2 Sum is greater than zero
3 Overflow

Unsigned add:

0 Sum is zero (no carry)
1 Sum is not zero (no carry)
2 Sum is zero (carry)
3 Sum is not zero (carry)
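
A minimal sketch of the two tables above, assuming 32-bit operands passed as raw bit patterns. The function names are made up for illustration and this is a model of the condition-code settings as listed here, not a transcription of the Principles of Operation. It also shows the point behind the question: the result bits are identical, only the condition code differs.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def signed_add_cc(a, b):
    """Return (result bits, condition code) for a signed 32-bit add."""
    true_sum = to_signed(a) + to_signed(b)
    bits = (a + b) & MASK32
    if not -(1 << 31) <= true_sum < (1 << 31):
        return bits, 3                      # overflow
    if true_sum == 0:
        return bits, 0                      # sum is zero
    return bits, 1 if true_sum < 0 else 2   # less than / greater than zero

def unsigned_add_cc(a, b):
    """Return (result bits, condition code) for an unsigned 32-bit add."""
    total = (a & MASK32) + (b & MASK32)
    bits = total & MASK32
    carry = total >> 32                     # 0 or 1
    # Bit 1 of the code is the carry, bit 0 is "result not zero".
    return bits, (2 if carry else 0) + (1 if bits else 0)

# Same bit pattern out, different condition codes:
print(signed_add_cc(0x7FFFFFFF, 1))    # overflow as a signed add
print(unsigned_add_cc(0x7FFFFFFF, 1))  # nonzero, no carry as an unsigned add
```

Note that in this model unsigned code 0 (zero, no carry) can only come from adding 0 to 0.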

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed Sep 11 16:21:19 2024

According to Anton Ertl <[email protected]>:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.

Mainframes were never about CPU performance.

The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000
machines if CPU performance was considered unimportant at the time.

It's an interesting question why they did not follow up their
superscalar OoO ECL implementations with a superscalar OoO CMOS
implementation in addition to the scalar in-order 9672. ...

IBM definitely cared about maximum performance in the 1950s and early 1960s.

The goal of STRETCH was specifically to make the fastest possible computer. It sort of
succeeded, late and over budget and not as fast as they hoped, but still the fastest
computer in the world for a while. It was a success in that they reused a lot of the
technology like the fast core memory in later computers.

The 360/91 was also intended to be the fastest possible computer, which again it sort of
was, late and over budget. One thing that STRETCH and the /91 shared was that they were
extremely complicated. STRETCH had variable sized bytes and addressing modes that I
never entirely figured out. The /91 had an instruction queue with loop mode and out of
order operations and register renaming and imprecise interrupts. When the CDC 6600 came
out, a much simpler design from a tiny company that was nonetheless faster than the /91,
they knew they had a problem. The /95 and /195 were minor upgrades of the /91 but that was
the end of their supercomputer efforts.

The point of a mainframe is balanced performance. The CPU of a 360/30 was extremely slow
but it was fast enough to drive a disk or two and a printer and card read/punch and get a
lot of useful work done. Mainframes have had channels since the 709 in the late 1950s so
they have a lot of I/O capacity. Modern ones have terabytes of RAM and exabyte of disk.

They also care deeply about reliability. Modern mainframes have multiple kinds of error
checking and standby CPUs that can take over from a failed CPU, restart a failed
instruction, and the program doesn't notice. I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Brett@21:1/5 to Anton Ertl on Wed Sep 11 16:39:23 2024

Anton Ertl <[email protected]> wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.

Mainframes were never about CPU performance.

The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000 machines if CPU performance was considered unimportant at the time.

It's an interesting question why they did not follow up their
superscalar OoO ECL implementations with a superscalar OoO CMOS implementation in addition to the scalar in-order 9672. Here are
three speculations of what happened:

1) They had such a project and it did not work out, and the "never
about CPU performance" spin is a sour-grapes type rationalization of
the result.

2) They expected their mainframe market to be eaten up by the Unix
and/or WNT markets, and did not want to invest a lot into the
development of mainframe CPUs. Again, the "never about CPU
performance" spin is a sour-grapes type rationalization of the result.

3) They had decided that they had a captive market in the mainframes,
with software that was written for lower-powered CPUs, that the rapid
CMOS advances in the 1990s would give them enough of a performance
push to satisfy the needs of this software, so no more sophisticated
CPU designs than the 9672 were necessary (and the G5 and G6 of the 9672
indeed gave them more CPU power than ever). The "never about CPU performance" reflected their position at the time and also served to
placate anyone who pointed out that the per-CPU performance was
inferior to that of other CPUs of the time, including IBM's own
RS/6000 line.

IBM had huge caches that PCs could not match, and smart I/O processors to handle much of the load that PCs, being cheap, had to handle with the CPU.

You could go into this as my knowledge is mostly SWAG based off marketing
bull and what little I know.

Then there is the issue of cheap PCs that fail, whereas mainframes have a
higher level of redundancy and failover. Failed business transactions can
cost millions, more than the machine is worth, so saving pennies on
hardware is stupid.

Eventually they seem to have decided that per-CPU performance is
important after all, with the superscalar z990 in 2003 and the OoO
z196 in 2010. But of course Dennard scaling was slowing down around
2003, so they needed to increase IPC to increase per-CPU performance.
And even if they don't need more per-CPU performance than other architectures, they apparently do need advances over earlier
generations of their own machines and maybe to discourage competition
from emulators or startups.

They were about high I/O
throughput for efficient batch operations.

Batch operations? I wonder how much CPU time on mainframes in the
1990s and today is spent on that compared to interactive applications
such as online transaction processing.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to David Brown on Wed Sep 11 17:02:26 2024

David Brown <[email protected]> schrieb:

On 11/09/2024 10:54, Michael S wrote:

On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

[MIPS] has been abandoned and replaced by RISC-V several years ago.

I’m not so sure the MIPS architecture has been “abandoned”. Last I
heard, it was still shipping hundreds of millions of chips per year.

Care to point to the source of this claim? Two main suppliers of MIPS
silicon in this century are Microchip and Cavium (now owned by Marvell).

According to my understanding Microchip's MIPS-based PIC32 line was
never as popular as their other offerings.

IMHO a major reason for that is Microchip's insane licensing policy for
their development tools - although their compilers are just a minor modification of standard gcc, you have to pay huge amounts if you want
to use the full features of the compiler. (At least now you can enable /some/ optimisation without a paid license.) It is not even possible to
see from the release notes or documentation what version of gcc is
provided, though my guess is that it is pretty old (the documentation describes "-std" options up to C++14).

Sounds like a violation of the GPL. Do they provide the sources?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed Sep 11 17:50:43 2024

According to Stephen Fuld <[email protected]d>:

IBM definitely cared about maximum performance in the 1950s and early
1960s.

Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.

I don't think anyone would have foreseen how quickly scientific computing
moved to mini and micro computers with fast CPUs and weak peripherals.
Perhaps once the RAM is big enough to hold all the data the I/O performance
is not a big deal.

they knew they had a problem. The /95 and /195 were minor upgrades of
the /91 but that was the end of their supercomputer efforts.

Mostly true, except for the 3090 vector facility.

I suppose. A review from the USDOE said:

The IBM 3090 with Vector Facility is an extremely interesting machine
because it combines very good scaler performance with enhanced vector
and multitasking performance. For many IBM installations with a large
scientific workload, the 3090/vector/MTF combination may be an ideal
means of increasing throughput at minimum cost. However, neither the
vector nor multitasking capabilities are sufficiently developed to
make the 3090 competitive with our current worker machines for our
large-scale scientific codes.

https://www.osti.gov/biblio/5039931

instruction, and the program doesn't notice. I think you'll find a
pattern since the CDC shock of making CPUs fast enough to keep the RAM
and I/O devices busy while having the error checking and recovery
features so the systems keep running for years at a time.

Yes, but they also have to keep producing faster and faster CPUs so they
can entice current customers to upgrade and thus meet their revenue goals.

The memories and disks keep getting bigger so it's not totally silly to
think that the CPUs need to get faster, too. They keep increasing the
number of CPUs, with z16 topping out at 200, all sharing up to 40TB of RAM.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to John Levine on Wed Sep 11 10:18:44 2024

On 9/11/2024 9:21 AM, John Levine wrote:

According to Anton Ertl <[email protected]>:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

It seems that during the late 1990s, IBM was not particularly
interested in mainframe per-CPU performance.

Mainframes were never about CPU performance.

The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000
machines if CPU performance was considered unimportant at the time.

It's an interesting question why they did not follow up their
superscalar OoO ECL implementations with a superscalar OoO CMOS
implementation in addition to the scalar in-order 9672. ...

IBM definitely cared about maximum performance in the 1950s and early
1960s.

Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and business (i.e. I/O bound) workloads.

The goal of STRETCH was specifically to make the fastest possible
computer. It sort of succeeded, late and over budget and not as fast
as they hoped, but still the fastest computer in the world for a
while. It was a success in that they reused a lot of the technology
like the fast core memory in later computers.

The 360/91 was also intended to be the fastest possible computer,
which again it sort of was, late and over budget. One thing that
STRETCH and the /91 shared was that they were extremely complicated.
STRETCH had variable sized bytes and addressing modes that I never
entirely figured out. The /91 had an instruction queue with loop mode
and out of order operations and register renaming and imprecise
interrupts. When the CDC 6600 came out, a much simpler design from a
tiny company that was nonetheless faster than the /91, they knew they
had a problem. The /95 and /195 were minor upgrades of the /91 but
that was the end of their supercomputer efforts.

Mostly true, except for the 3090 vector facility.

The point of a mainframe is balanced performance. The CPU of a 360/30
was extremely slow but it was fast enough to drive a disk or two and a
printer and card read/punch and get a lot of useful work done.
Mainframes have had channels since the 709 in the late 1950s so they
have a lot of I/O capacity. Modern ones have terabytes of RAM and
exabyte of disk.

Yes.

They also care deeply about reliability. Modern mainframes have
multiple kinds of error checking and standby CPUs that can take over
from a failed CPU, restart a failed instruction, and the program
doesn't notice. I think you'll find a pattern since the CDC shock of
making CPUs fast enough to keep the RAM and I/O devices busy while
having the error checking and recovery features so the systems keep
running for years at a time.

Yes, but they also have to keep producing faster and faster CPUs so they
can entice current customers to upgrade and thus meet their revenue goals.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Thomas Koenig on Wed Sep 11 19:57:30 2024

On 11/09/2024 19:02, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

On 11/09/2024 10:54, Michael S wrote:

On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:

[MIPS] has been abandoned and replaced by RISC-V several years ago.

I’m not so sure the MIPS architecture has been “abandoned”. Last I
heard, it was still shipping hundreds of millions of chips per year.

Care to point to the source of this claim? Two main suppliers of MIPS
silicon in this century are Microchip and Cavium (now owned by Marvell).
According to my understanding Microchip's MIPS-based PIC32 line was
never as popular as their other offerings.

IMHO a major reason for that is Microchip's insane licensing policy for
their development tools - although their compilers are just a minor
modification of standard gcc, you have to pay huge amounts if you want
to use the full features of the compiler. (At least now you can enable
/some/ optimisation without a paid license.) It is not even possible to
see from the release notes or documentation what version of gcc is
provided, though my guess is that it is pretty old (the documentation
describes "-std" options up to C++14).

Sounds like a violation of the GPL. Do they provide the sources?

Yes.

It's perfectly fine to take the gcc sources, add in some code that
checks for a paid license of some sort, and distribute that as a binary
- as long as you also provide the source for it. So you /could/ take
the source and compile it yourself (or just get the original gcc source,
or another binary build of gcc MIPS).

But the license for their header files, SDKs, libraries, IDE (which,
IIRC, was basically NetBeans) and other tools says you can only use them
with an unmodified binary that they provide. And writing your own
header files for a big microcontroller is not a quick and easy job.

I believe there was no legal violation of the GPL, but there was no
doubt that it was trashing the spirit of it.

And as far as I could see from a look at their website, they are still
at the same game (though you can now enable /some/ optimisations), now
with the ARM core PIC32 devices as well.

Every other ARM or RISC V based microcontroller manufacturer I have seen provides free gcc and/or clang tools, along with a free IDE (Eclipse or
MS Code). They will also provide support for paid tools like ARM's
development tools, or IAR, or maybe Green Hills - these are expensive,
but that's fair enough. What is not fair, even if it is legal, is
taking something that they get for free and charging multiple
kilodollars for people to use it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to John Levine on Wed Sep 11 11:25:48 2024

On 9/11/2024 10:50 AM, John Levine wrote:

According to Stephen Fuld <[email protected]d>:

IBM definitely cared about maximum performance in the 1950s and early
1960s.

Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.

I don't think anyone would have foreseen how quickly scientific computing moved to mini and micro computers with fast CPUs and weak peripherals.

Agreed, plus the development of CDC/Cray supercomputers that took the
high end scientific market away from IBM.

Perhaps once the RAM is big enough to hold all the data the I/O performance is not a big deal.

they knew they had a problem. The /95 and /195 were minor upgrades of
the /91 but that was the end of their supercomputer efforts.

Mostly true, except for the 3090 vector facility.

I suppose. A review from the USDOE said:

The IBM 3090 with Vector Facility is an extremely interesting machine
because it combines very good scaler performance with enhanced vector
and multitasking performance. For many IBM installations with a large
scientific workload, the 3090/vector/MTF combination may be an ideal
means of increasing throughput at minimum cost. However, neither the
vector nor multitasking capabilities are sufficiently developed to
make the 3090 competitive with our current worker machines for our
large-scale scientific codes.

https://www.osti.gov/biblio/5039931

I didn't claim that the 3090VF was successful, just that IBM was
interested enough in the scientific market to spend money developing it
after the 370/195.

instruction, and the program doesn't notice. I think you'll find a
pattern since the CDC shock of making CPUs fast enough to keep the RAM
and I/O devices busy while having the error checking and recovery
features so the systems keep running for years at a time.

Yes, but they also have to keep producing faster and faster CPUs so they
can entice current customers to upgrade and thus meet their revenue goals.

The memories and disks keep getting bigger so it's not totally silly to
think that the CPUs need to get faster, too.

Agreed, of course.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to Anton Ertl on Wed Sep 11 11:36:46 2024

On 9/11/2024 2:32 AM, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

snip

They were about high I/O
throughput for efficient batch operations.

Batch operations? I wonder how much CPU time on mainframes in the
1990s and today is spent on that compared to interactive applications
such as online transaction processing.

Perhaps it would have been better stated as being about balanced
performance (CPU and I/O) for business applications, which at the time
were primarily batch, but have migrated over time to transactions, but
which still are more I/O bound than scientific workloads.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Levine on Wed Sep 11 19:36:59 2024

John Levine <[email protected]> schrieb:

According to Stephen Fuld <[email protected]d>:

IBM definitely cared about maximum performance in the 1950s and early
1960s.

Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.

I don't think anyone would have foreseen how quickly scientific computing
moved to mini and micro computers with fast CPUs and weak peripherals.
Perhaps once the RAM is big enough to hold all the data the I/O performance
is not a big deal.

they knew they had a problem. The /95 and /195 were minor upgrades of
the /91 but that was the end of their supercomputer efforts.

Don't forget the ACS.

Looking at (if that is to be believed) https://people.computing.clemson.edu/~mark/acs_performance.html
this seems to have been quite an amazing machine for its time,
with projected 160 MFlops and around five concurrent instructions
using OoO.

Had this been realized, it would have been as fast as the Cray-I.
But it never reached the market, so...

Mostly true, except for the 3090 vector facility.

I suppose. A review from the USDOE said:

We had that in our IBM 3090 at the computer center. Compared to the
Fujitsu VP machine sitting next to it, it was not impressive at
all (which can also be read in the report you quoted).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Wed Sep 11 21:12:34 2024

On Wed, 11 Sep 2024 9:32:04 +0000, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:

It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.

Mainframes were never about CPU performance.

The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000 machines if CPU performance was considered unimportant at the time.

91 was Current-Mode-Logic CML. Don't know about 195.
CML had all of the speed and all of the electrical and all of the heat
problems ECL had.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Levine on Wed Sep 11 21:15:29 2024

On Wed, 11 Sep 2024 16:21:19 +0000, John Levine wrote:

According to Anton Ertl <[email protected]>:

Lawrence D'Oliveiro <[email protected]d> writes:

The 360/91 was also intended to be the fastest possible computer, which
again it sort of was, late and over budget. One thing that STRETCH and
the /91 shared was that they were extremely complicated. STRETCH had
variable sized bytes and addressing modes that I never entirely figured
out. The /91 had an instruction queue with loop mode and out of order
operations and register renaming and imprecise interrupts. When the CDC
6600 came out, a much simpler design from a tiny company that was
nonetheless faster than the /91, they knew they had a problem. The /95
and /195 were minor upgrades of the /91 but that was the end of their
supercomputer efforts.

You forgot to mention the /91 had a 60 ns clock while the 6600 had a
100 ns clock.
Here was a case where parallelism beat out pipelining.
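
Taking the two cycle times quoted above at face value, a quick back-of-the-envelope check shows how much more work per cycle the slower-clocked machine would need to complete to come out ahead. The helper name is made up for illustration, and this says nothing about measured throughput on either machine.

```python
def clock_mhz(cycle_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / cycle_ns

m91_mhz = clock_mhz(60)     # cycle time quoted for the /91
cdc_mhz = clock_mhz(100)    # cycle time quoted for the 6600

# Throughput is roughly (work per cycle) x (cycles per second), so the
# machine with the slower clock needs at least this much more work per
# cycle to break even.
break_even = m91_mhz / cdc_mhz

print(f"/91:  {m91_mhz:.2f} MHz")
print(f"6600: {cdc_mhz:.2f} MHz")
print(f"break-even work-per-cycle ratio: {break_even:.2f}")
```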

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Thomas Koenig on Wed Sep 11 21:51:54 2024

Thomas Koenig <[email protected]> writes:

John Levine <[email protected]> schrieb:

According to Stephen Fuld <[email protected]d>:

IBM definitely cared about maximum performance in the 1950s and early
1960s.

Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.

I don't think anyone would have foreseen how quickly scientific computing
moved to mini and micro computers with fast CPUs and weak peripherals.
Perhaps once the RAM is big enough to hold all the data the I/O performance
is not a big deal.

they knew they had a problem. The /95 and /195 were minor upgrades of
the /91 but that was the end of their supercomputer efforts.

Don't forget the ACS.

Looking at (if that is to be believed)
https://people.computing.clemson.edu/~mark/acs_performance.html
this seems to have been quite an amazing machine for its time,
with projected 160 MFlops and around five concurrent instructions
using OoO.

The (cancelled) Burroughs Scientific Processor was as fast as
a Cray-I.

The Burroughs Scientific Processor (BSP), a high-performance
computer system, performed the Department of Energy LLL loops
at roughly the speed of the CRAY-1. The BSP combined parallelism
and pipelining, performing memory-to-memory operations. Seventeen
memory units and two crossbar switch data alignment networks
provided conflict-free access to most indexed arrays. Fast linear
recurrence algorithms provided good performance on constructs that
some machines execute serially. A system manager computer ran the
operating system and a vectorizing Fortran compiler. An MOS file
memory system served as a high bandwidth secondary memory.

https://ieeexplore.ieee.org/document/1676014
https://en.wikipedia.org/wiki/Parallel_Element_Processing_Ensemble

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lars Poulsen@21:1/5 to John Levine on Wed Sep 11 17:44:58 2024

On 9/11/2024 9:21 AM, John Levine wrote:

I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.

So do these systems not require security patches?
Or do they apply PTFs to the running system? (reliably?)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu Sep 12 02:05:14 2024

According to Lars Poulsen <[email protected]>:

On 9/11/2024 9:21 AM, John Levine wrote:

I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.

So do these systems not require security patches?
Or do they apply PTFs to the running system? (reliably?)

They don't just update the software, they swap out entire hardware subsystems while the
overall system keeps running.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Thu Sep 12 04:28:00 2024

On Wed, 11 Sep 2024 11:36:46 -0700, Stephen Fuld wrote:

Perhaps it would have been better stated as being about balanced
performance (CPU and I/O) for business applications, which at the time
were primarily batch, but have migrated over time to transactions, but
which still are more I/O bound than scientific workloads.

Depending on the kind of transactions: online interactive stuff requires
low latencies as opposed to high throughput. Mainframes are optimized for
high throughput.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Brett on Thu Sep 12 04:26:52 2024

On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

Then there is the issue of cheap PC’s that fail, and a mainframes have a higher level of redundancy and failover. Failed business transactions
can cost millions, more than the machine is worth, so saving pennies on hardware is stupid.

You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Thu Sep 12 04:30:05 2024

On Wed, 11 Sep 2024 21:12:34 +0000, MitchAlsup1 wrote:

91 was Current-Mode-Logic CML. ...
CML had all of the speed and all of the electrical and all of the heat problems ECL had.

IBM over-promising and under-delivering, again.

The ’90, or was it the ’91, or the ’92, was the machine IBM promised to
deliver to those customers who were looking to buy a CDC machine. It
remained vapourware for close to two years I think it was, and was
underwhelming when it did finally appear. But it still managed to cost CDC
a lot of sales in the meantime.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu Sep 12 04:31:49 2024

On Wed, 11 Sep 2024 16:21:19 -0000 (UTC), John Levine wrote:

They also care deeply about reliability. Modern mainframes have multiple kinds of error checking and standby CPUs that can take over from a
failed CPU, restart a failed instruction, and the program doesn't
notice.

This “mainframe reliability” seems to be a persistent myth. There is a document at Bitsavers, dating from 1986, which says that, if you want to
turn daylight saving on or off on an IBM mainframe, you really should
reboot.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu Sep 12 04:33:09 2024

On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:

They don't just update the software, they swap out entire hardware
subsystems while the overall system keeps running.

Xen Orchestra (open-source) can do that on commodity PC hardware.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to John Levine on Thu Sep 12 10:50:33 2024

John Levine wrote:

According to Stephen Fuld <[email protected]d>:

IBM definitely cared about maximum performance in the 1950s and early
1960s.

Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.

I don't think anyone would have foreseen how quickly scientific computing moved to mini and micro computers with fast CPUs and weak peripherals. Perhaps once the RAM is big enough to hold all the data the I/O performance is not a big deal.

Back around 1986 or so I stated that all programming tasks will migrate
down to the lowest/cheapest architecture which is large enough to handle
the task. This meant that I was sure both minis and mainframes would go
away, so I was in fact only 99.9% correct. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Sep 12 14:30:27 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:

They don't just update the software, they swap out entire hardware
subsystems while the overall system keeps running.

Xen Orchestra (open-source) can do that on commodity PC hardware.

The 3leaf hypervisor supported hot-plug memory, hot-plug CPU, and
hot-plug PCI 15 years ago with commodity Linux guests.

From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Sep 12 14:31:26 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Wed, 11 Sep 2024 16:21:19 -0000 (UTC), John Levine wrote:

They also care deeply about reliability. Modern mainframes have multiple
kinds of error checking and standby CPUs that can take over from a
failed CPU, restart a failed instruction, and the program doesn't
notice.

This “mainframe reliability” seems to be a persistent myth.

Where do you come up with this nonsense?

Have you not heard of Tandem, or Stratus?

From David Schultz@21:1/5 to David Brown on Wed Sep 11 14:06:23 2024

On 9/11/24 4:40 AM, David Brown wrote:

The other reason, of course, was the name - "PIC" is associated with brain-dead microcontrollers with terrible C tools and which many people program in assembly. They are also renowned for being very solid,
coming in relatively amateur-friendly packages, and for never going out
of production.

After having written code for a PIC I agree with "brain-dead". The small
sized memory pages were bad enough but the total lack of an add with
carry instruction drove me mad.

So I swore them off and the introduction of a MIPS based system did
nothing to change that.

--
http://davesrocketworks.com
David Schultz

From John Levine@21:1/5 to All on Thu Sep 12 20:10:43 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

Then there is the issue of cheap PCs that fail, and mainframes have a
higher level of redundancy and failover. Failed business transactions
can cost millions, more than the machine is worth, so saving pennies on
hardware is stupid.

You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.

That's fine for workloads that work that way.

Airline reservation systems historically ran on mainframes because when they were invented
that's all there was (original SABRE ran on two 7090s) and they are business critical so
they need to be very reliable.

About 30 years ago some guys at MIT realized that route and fare search, which are some of
the most demanding things that CRS do, are easy to parallelize and don't have to be
particularly reliable -- if your search system crashes and restarts and reruns the search
and the result is a couple of seconds late, that's OK. So they started ITA software which
used racks of PC servers running parallel applications written in Lisp (they were from
MIT) and blew away the competition.

However, that's just the search part. Actually booking the seats and selling tickets stays
on a mainframe or an Oracle system because double booking or giving away free tickets would
be really bad.

There's also a rule of thumb about databases that says one system of performance 100 is
much better than 100 systems of performance 1 because those 100 systems will spend all
their time contending for database locks.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu Sep 12 22:20:23 2024

On Thu, 12 Sep 2024 20:10:43 -0000 (UTC), John Levine wrote:

Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because double booking or giving away free tickets would be really bad.

Fun fact: double-booking happens all the time.

I don’t think either mainframes or Oracle are still dominant in this sort
of business. After all, Paul Graham’s Orbitz was doing this sort of thing over 20 years ago ... in LISP.

<https://paulgraham.com/carl.html>

From Terje Mathisen@21:1/5 to John Levine on Fri Sep 13 09:28:06 2024

John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

Then there is the issue of cheap PCs that fail, and mainframes have a
higher level of redundancy and failover. Failed business transactions
can cost millions, more than the machine is worth, so saving pennies on
hardware is stupid.

You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more cost-
effective than the mainframe.

That's fine for workloads that work that way.

Airline reservation systems historically ran on mainframes because when they were invented
that's all there was (original SABRE ran on two 7090s) and they are business critical so
they need to be very reliable.

About 30 years ago some guys at MIT realized that route and fare search, which are some of
the most demanding things that CRS do, are easy to parallelize and don't have to be
particularly reliable -- if your search system crashes and restarts and reruns the search
and the result is a couple of seconds late, that's OK. So they started ITA software which
used racks of PC servers running parallel applications written in Lisp (they were from
MIT) and blew away the competition.

However, that's just the search part. Actually booking the seats and selling tickets stays
on a mainframe or an Oracle system because double booking or giving away free tickets would
be really bad.

You could replicate much of that part as well, for most of the time, by setting aside chunks of seats to parallel servers, so that they can
book/sell within that chunk until they start to run out. This way the expensive system is mostly needed only when the front ends run out?
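
A minimal sketch of the chunk idea above, assuming one central inventory per flight and front ends that only call it when their local chunk is empty. The class names, the chunk size and the round-robin load are all made up for illustration; nothing here is taken from a real reservation system.

```python
class CentralInventory:
    """Authoritative seat count for one flight (the 'expensive system')."""

    def __init__(self, total_seats):
        self.unallocated = total_seats

    def grant_chunk(self, wanted):
        """Hand out up to `wanted` seats; returns 0 once nothing is left."""
        granted = min(wanted, self.unallocated)
        self.unallocated -= granted
        return granted


class FrontEnd:
    """Sells from a local chunk and contacts the centre only when it runs dry."""

    def __init__(self, central, chunk_size):
        self.central = central
        self.chunk_size = chunk_size
        self.local = 0          # seats currently set aside for this front end
        self.sold = 0
        self.central_calls = 0  # how often the expensive system was needed

    def book_one(self):
        if self.local == 0:
            self.central_calls += 1
            self.local = self.central.grant_chunk(self.chunk_size)
            if self.local == 0:
                return False    # nothing left centrally
        self.local -= 1
        self.sold += 1
        return True


central = CentralInventory(total_seats=100)
fronts = [FrontEnd(central, chunk_size=10) for _ in range(4)]

# 120 booking attempts spread round-robin over the four front ends.
results = [fronts[i % 4].book_one() for i in range(120)]
total_sold = sum(f.sold for f in fronts)
total_calls = sum(f.central_calls for f in fronts)
print(total_sold, results.count(False), total_calls)
```

The weak spot of this sketch is the end game: a front end can sit on unsold seats while another one refuses customers, so a real design would need some way to hand chunks back or shrink them as the flight fills up.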

There's also a rule of thumb about databases that says one system of performance 100 is
much better than 100 systems of performance 1 because those 100 systems will spend all
their time contending for database locks.

10-15 years ago I talked to another speaker at a conference, he told me
that he was working on high-end open source LDAP software using _very_
large memory DBs: Their system allowed one US cell phone company to keep
every SIM card (~100M) on a single system, while a similar-size
competitor had been forced to fall back on 17-way sharding (presumably
using a hash of the SIM id).
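
For comparison, a sketch of what "17-way sharding using a hash of the SIM id" could look like. This is only a guess at the competitor's scheme; the hash function, the in-memory dicts standing in for servers, and the sample ID are all assumptions for illustration.

```python
import hashlib

NUM_SHARDS = 17


def shard_for(sim_id, num_shards=NUM_SHARDS):
    """Map a SIM id to a shard number in a stable, evenly spread way."""
    digest = hashlib.sha256(sim_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each dict stands in for one database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(sim_id, record):
    shards[shard_for(sim_id)][sim_id] = record


def get(sim_id):
    return shards[shard_for(sim_id)].get(sim_id)


put("8901260123456789012", {"status": "active"})
print(shard_for("8901260123456789012"), get("8901260123456789012"))
```

Every lookup pays for the routing step, and any query that is not keyed by SIM id has to touch all 17 shards, which is the overhead the single large-memory system avoids.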

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to Scott Lurndal on Fri Sep 13 09:15:52 2024

Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:

They don't just update the software, they swap out entire hardware
subsystems while the overall system keeps running.

Xen Orchestra (open-source) can do that on commodity PC hardware.

The 3leaf hypervisor supported hot-plug memory, hot-plug CPU, and
hot-plug PCI 15 years ago with commodity Linux guests.

Novell's System Fault Tolerant NetWare 386 (around 1990) supported two
complete servers acting like one, so that any hardware component could
fail and the system would keep running, with nothing noticed by the
clients, even those that were in the middle of an update/write request.

Worked with a private high-speed link between the two servers, so that
all requests were mirrored from master to slave. This way the slave
would do the requested operation in sync with the master, maintaining
the exact same state so it was ready to take over at any point.

BTW, since the pair naturally had separate network connections, they
could also be connected to separate LAN segments, and this worked
transparently because every server (single or SFT) maintained a LAN
segment inside the server: This way the two server connections just
looked like redundant routing paths.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Anton Ertl@21:1/5 to John Levine on Fri Sep 13 06:23:47 2024

John Levine <[email protected]> writes:

According to Lawrence D'Oliveiro <[email protected]d>:

You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.

That's fine for workloads that work that way.

Airline reservation systems historically ran on mainframes because
when they were invented that's all there was (original SABRE ran on
two 7090s) and they are business critical so they need to be very
reliable.

About 30 years ago some guys at MIT realized that route and fare
search, which are some of the most demanding things that CRS do, are
easy to parallelize and don't have to be particularly reliable -- if
your search system crashes and restarts and reruns the search and the
result is a couple of seconds late, that's OK. So they started ITA
software which used racks of PC servers running parallel applications
written in Lisp (they were from MIT) and blew away the competition.

However, that's just the search part. Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because
double booking or giving away free tickets would be really bad.

Booking flights or seats can easily be distributed: each flight is
assigned to one computer. To avoid double booking or free tickets
even in case of a computer crash, you use the usual transaction
processing approach, and report completion of the booking only when
the booking has reached persistent memory. For persistent memory you
use SSDs with power-loss protection.
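
A rough sketch of the scheme just described, assuming a plain append-only log file per server as the "persistent memory" and a CRC of the flight number as the routing rule. Both choices, and all names and the flight number, are illustrative; torn writes and replication are ignored.

```python
import json
import os
import tempfile
import zlib


class FlightServer:
    """Owns a set of flights; a booking is acknowledged only after fsync."""

    def __init__(self, log_path, capacity):
        self.capacity = capacity  # seats per flight, same for all flights here
        self.booked = {}          # flight -> seats sold
        if os.path.exists(log_path):
            # Recovery after a crash: rebuild the counts from the durable log.
            with open(log_path, encoding="utf-8") as f:
                for line in f:
                    flight = json.loads(line)["flight"]
                    self.booked[flight] = self.booked.get(flight, 0) + 1
        self.log = open(log_path, "a", encoding="utf-8")

    def book(self, flight, passenger):
        if self.booked.get(flight, 0) >= self.capacity:
            return False
        record = json.dumps({"flight": flight, "passenger": passenger})
        self.log.write(record + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # reach stable storage before reporting success
        self.booked[flight] = self.booked.get(flight, 0) + 1
        return True

    def close(self):
        self.log.close()


def server_for(flight, servers):
    """Each flight is handled by exactly one server."""
    return servers[zlib.crc32(flight.encode("utf-8")) % len(servers)]


workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, f"server{i}.log") for i in range(3)]
servers = [FlightServer(p, capacity=2) for p in paths]

outcomes = [server_for("LH400", servers).book("LH400", name)
            for name in ("ann", "bob", "cy")]

# Simulate a crash and restart of every server: state comes back from the logs.
for s in servers:
    s.close()
servers = [FlightServer(p, capacity=2) for p in paths]
after_restart = server_for("LH400", servers).book("LH400", "dee")
for s in servers:
    s.close()
print(outcomes, after_restart)
```

Because all bookings for one flight go through one server, the capacity check and the log append need no cross-machine locking; the price is that the flight is unavailable while its server is down, which is the availability concern discussed next.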

These SSDs, ECC RAM, RAID-1, redundant power supplies and UPSs protect
against most hardware failures, but availability is still a concern
(e.g., motherboard or CPU failure; that normally does not affect data
integrity if the other measures are taken, but it affects
availability). To increase availability, you can use e.g., DRBD
(distributed replicated block device) to get the data on multiple
machines.

Concerning "real bad": Airlines overbook their flights as a matter of
policy to increase their revenue. If they had a booking system that double-booked, say, 1ppm of all bookings, they probably would not even
notice, and would deal with it in the same way they deal with the
cases when the overbooking actually results in too many passengers
arriving for the flight. Likewise, free tickets are not an issue if
they occur rarely enough. Do they want to spend a million on a
mainframe to avoid a revenue loss of $100k? But in any case, that's
not the problem with cheap hardware.

The problems are: When the persistent storage fails, you lose all
transactions since the latest backup. To avoid that, RAID-1 helps, or
a redundant distributed storage like DRBD, or a redundant distributed transaction system. You may also want more availability than a single
system with RAID-1 (with a spare system standing by) provides, then
you have to go for one of the redundant distributed approaches.

However, my impression from booking flights online is that reliability
of the booking platform is not at all a concern for the airlines. And
as a customer, I find little difference between the booking front-end
erroring out or the transaction back-end being unavailable.

There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.

If you handle each flight on one system, the contention for locks is
only within that one system. And I expect that there is not that much contention. How many people book the same flight within the same
millisecond (or however long the lock is held)?

Concerning performance 100 vs. performance 1, about what systems are
you thinking? z17 will have 32*8=256 cores (of unknown performance
that is likely to be disappointing, or IBM would not disallow
publishing benchmark results), compared to similar numbers of cores on
servers with AMD or Intel CPUs, or 16-24 cores on systems based on
desktop chips (with Intel you pay a heavy premium these days if you
want ECC memory, however).

Interestingly, with increasing number of cores per socket in recent
years, the number of sockets is going down. E.g., the successor for
the HPE Superdome Flex with up to 32 sockets (up to 32*28=896 cores)
is the HPE Compute Scale-Up Server 3200 with up to 16 sockets
(16*60=960 cores). Either there is little demand for single systems
with more cores, or there are technical difficulties (probably both).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Fri Sep 13 08:24:19 2024

On Fri, 13 Sep 2024 09:15:52 +0200, Terje Mathisen wrote:

Novell's System Fault Tolerant NetWare 386 (around 1990) supported two complete servers acting like one, so that any hardware component could
fail and the system would keep running, with nothing noticed by the
clients, even those that were in the middle of an update/write request.

Just so long as it wasn’t the network connection between them that failed.

See also, “CAP Theorem”.

From Michael S@21:1/5 to John Levine on Fri Sep 13 12:22:17 2024

On Thu, 12 Sep 2024 20:10:43 -0000 (UTC)
John Levine <[email protected]> wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:

Then there is the issue of cheap PCs that fail, and mainframes
have a higher level of redundancy and failover. Failed business
transactions can cost millions, more than the machine is worth, so
saving pennies on hardware is stupid.

You solve that by having multiple units of the cheap machines to
achieve the same level of redundancy, or even more. That ends up
being more cost-effective than the mainframe.

That's fine for workloads that work that way.

Airline reservation systems historically ran on mainframes because
when they were invented that's all there was (original SABRE ran on
two 7090s) and they are business critical so they need to be very
reliable.

About 30 years ago some guys at MIT realized that route and fare
search, which are some of the most demanding things that CRS do, are
easy to parallelize and don't have to be particularly reliable -- if
your search system crashes and restarts and reruns the search and the
result is a couple of seconds late, that's OK. So they started ITA
software which used racks of PC servers running parallel applications
written in Lisp (they were from MIT) and blew away the competition.

However, that's just the search part. Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because
double booking or giving away free tickets would be really bad.

There's also a rule of thumb about databases that says one system of performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.

How many transactions per minute does the world's biggest company need at
peak hours? Is this number not small relative to the capabilities of
even a 15-year-old dual-Xeon server with a few dozen spinning-rust disks?

From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Fri Sep 13 12:02:27 2024

Lawrence D'Oliveiro wrote:

On Fri, 13 Sep 2024 09:15:52 +0200, Terje Mathisen wrote:

Novell's System Fault Tolerant NetWare 386 (around 1990) supported two
complete servers acting like one, so that any hardware component could
fail and the system would keep running, with nothing noticed by the
clients, even those that were in the middle of an update/write request.

Just so long as it wasn’t the network connection between them that failed.

See also, “CAP Theorem”.

If that failed, normal network routing would apply and the master-slave
traffic would go out and back in over the normal network cards, but of
course giving slightly reduced performance.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Thomas Koenig@21:1/5 to Terje Mathisen on Fri Sep 13 11:20:06 2024

Terje Mathisen <[email protected]> schrieb:

10-15 years ago I talked to another speaker at a conference, he told me
that he was working on high-end open source LDAP software using _very_
large memory DBs: Their system allowed one US cell phone company to keep every SIM card (~100M) on a single system, while a similar-size
competitor had been forced to fall back on 17-way sharding (presumably
using a hash of the SIM id).

Keeping databases in memory is definitely a thing now... see SAP HANA.

Any architectural implications for this?

Browsing through the SAP pages, it seems they used Intel's Optane
persistent memory, but that is no longer manufactured (?). But
having fast, persistent storage is definitely an advantage for
databases.

Large memory: Of course.

On the ISA level... these databases run on x86, so that seems to
be good enough.

Anything else?

From Michael S@21:1/5 to Thomas Koenig on Fri Sep 13 14:55:39 2024

On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Terje Mathisen <[email protected]> schrieb:

10-15 years ago I talked to another speaker at a conference, he
told me that he was working on high-end open source LDAP software
using _very_ large memory DBs: Their system allowed one US cell
phone company to keep every SIM card (~100M) on a single system,
while a similar-size competitor had been forced to fall back on
17-way sharding (presumably using a hash of the SIM id).

Keeping databases in memory is definitely a thing now... see SAP HANA.

Any architectural implications for this?

Browsing through the SAP pages, it seems they used Intel's Optane
persistent memory, but that is no longer manufactured (?). But
having fast, persistent storage is definitely an advantage for
databases.

Large memory: Of course.

On the ISA level... these databases run on x86, so that seems to
be good enough.

Anything else?

Another thing that SAP HANA seems to use more intensely than anybody
else is Intel TSX. TSX (at least the RTM part; I am not sure about the
HLE part) is still present in the latest Xeon generation, but is
strongly de-emphasized.

From John Dallman@21:1/5 to Michael S on Fri Sep 13 13:35:00 2024

In article <[email protected]>, [email protected] (Michael S) wrote:

How many transactions per minute does the world's biggest company need
at peak hours?

One very painful case is credit card spending in the run-up to major
holidays, such as Christmas, where the credit card companies feel the
need for central authorisation of all transactions to reduce fraud. Fraud
is, naturally, at its peak at these times. The price of wrongly refused transactions is also high, because it means customers march out of shops, having wasted retail staff time.

Is this number not small relative to the capabilities of even a
15-year-old dual-Xeon server with a few dozen spinning-rust disks?

This does not seem to be the case.

John

From Michael S@21:1/5 to John Dallman on Fri Sep 13 15:50:12 2024

On Fri, 13 Sep 2024 13:35 +0100 (BST)
[email protected] (John Dallman) wrote:

In article <[email protected]>,
[email protected] (Michael S) wrote:

How many transactions per minute does the world's biggest company need
at peak hours?

One very painful case is credit card spending in the run-up to major holidays, such as Christmas, where the credit card companies feel the
need for central authorisation of all transactions to reduce fraud.
Fraud is, naturally, at its peak at these times. The price of wrongly
refused transactions is also high, because it means customers march
out of shops, having wasted retail staff time.

Is this number not small relative to the capabilities of even a
15-year-old dual-Xeon server with a few dozen spinning-rust disks?

This does not seem to be the case.

John

My post was specifically about flight reservations.

From John Dallman@21:1/5 to Michael S on Fri Sep 13 16:18:00 2024

In article <[email protected]>, [email protected] (Michael S) wrote:

My post was specifically about flight reservations.

Ah, sorry. In that field, the airlines have found it best to collect into
large groups with shared reservation systems. If a travel agent has to
use a different system for each airline, then booking becomes very
inefficient and capacity gets wasted.

<https://en.wikipedia.org/wiki/Computer_reservation_system>

So there's real demand for systems with huge capacity. Not very many of
them, but they have large budgets.

John

From David Brown@21:1/5 to David Schultz on Fri Sep 13 17:36:10 2024

On 11/09/2024 21:06, David Schultz wrote:

On 9/11/24 4:40 AM, David Brown wrote:

The other reason, of course, was the name - "PIC" is associated with
brain-dead microcontrollers with terrible C tools and which many
people program in assembly. They are also renowned for being very
solid, coming in relatively amateur-friendly packages, and for never
going out of production.

After having written code for a PIC I agree with "brain-dead". The small sized memory pages were bad enough but the total lack of an add with
carry instruction drove me mad.

If you pretend it is a sort of microcode rather than assembly, so that
you need several PIC assembly instructions to do the work of a single
assembly instruction on a "normal" 8-bit CISC microcontroller, it feels
less bad. And with enough complicated macros, it is possible to keep
paging a bit more under control and automated.

It also helps to use macros to give instructions better names - such as
"IfBit" and "IfNBit" rather than "btfsc" and "btfss". (The fact that I
can remember these after 25 years or so is a sign of the amount of
cognitive effort it took to work with these things!)

So I swore them off and the introduction of a MIPS based system did
nothing to change that.

From EricP@21:1/5 to Anton Ertl on Fri Sep 13 11:50:44 2024

Anton Ertl wrote:

John Levine <[email protected]> writes:

According to Lawrence D'Oliveiro <[email protected]d>:

You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.

That's fine for workloads that work that way.

Airline reservation systems historically ran on mainframes because
when they were invented that's all there was (original SABRE ran on
two 7090s) and they are business critical so they need to be very
reliable.

About 30 years ago some guys at MIT realized that route and fare
search, which are some of the most demanding things that CRS do, are
easy to parallelize and don't have to be particularly reliable -- if
your search system crashes and restarts and reruns the search and the
result is a couple of seconds late, that's OK. So they started ITA
software which used racks of PC servers running parallel applications
written in Lisp (they were from MIT) and blew away the competition.

However, that's just the search part. Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because
double booking or giving away free tickets would be really bad.

Booking flights or seats can easily be distributed: each flight is
assigned to one computer. To avoid double booking or free tickets
even in case of a computer crash, you use the usual transaction
processing approach, and report completion of the booking only when
the booking has reached persistent memory. For persistent memory you
use SSDs with power-loss protection.

These SSDs, ECC RAM, RAID-1, redundant power supplies and UPSs protect against most hardware failures, but availability is still a concern
(e.g., motherboard or CPU failure; that normally does not affect data integrity if the other measures are taken, but it affects
availability). To increase availability, you can use e.g., DRBD
(distributed replicated block device) to get the data on multiple
machines.

Concerning "real bad": Airlines overbook their flights as a matter of
policy to increase their revenue. If they had a booking system that double-booked, say, 1ppm of all bookings, they probably would not even notice, and would deal with it in the same way they deal with the
cases when the overbooking actually results in too many passengers
arriving for the flight. Likewise, free tickets are not an issue if
they occur rarely enough. Do they want to spend a million on a
mainframe to avoid a revenue loss of $100k? But in any case, that's
not the problem with cheap hardware.

You would not want the computer or agents overbooking randomly.
You would want this controlled by policy and done behind the agents'
and customers' backs (or they would just cancel).

So overbooking is just another kind of normal transaction
and not an accident of fate.

The problems are: When the persistent storage fails, you lose all transactions since the latest backup. To avoid that, RAID-1 helps, or
a redundant distributed storage like DRBD, or a redundant distributed transaction system. You may also want more availability than a single
system with RAID-1 (with a spare system standing by) provides, then
you have to go for one of the redundant distributed approaches.

However, my impression from booking flights online is that reliability
of the booking platform is not at all a concern for the airlines. And
as a customer, I find little difference between the booking front-end erroring out or the transaction back-end being unavailable.

There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.

If you handle each flight on one system, the contention for locks is
only within that one system. And I expect that there is not that much contention. How many people book the same flight within the same
millisecond (or however long the lock is held)?

Unlike debit/credit or stock trading transactions which are self contained,
the problem with airline reservation style transactions is they are
interactive in the middle and a traditional DB record/row locking
mechanism is insufficient.

In the interactive transaction case, to do this properly (I don't know
if airline systems actually do this) one needs to apply a timed reserve
lock to a 3-seat row to give the agent a chance to talk to the customer
and find out if the seat is acceptable. This creates a context that must
be maintained for a period of time.

Also typical SQL DBs do not have a way to read a row with lock
and fail if already locked. They stall the request, which is not
what you want a long duration interactive transaction to ever do.
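
A small sketch of the timed reserve lock described above, assuming an in-memory table and an injectable clock so expiry can be shown without waiting. It is single-threaded and the names and the 120-second hold are illustrative; a real system would need the check-and-set to be atomic. The point is that a second request fails at once instead of stalling.

```python
import time


class SeatHolds:
    """Timed reservation holds: try_hold never blocks, it just says no."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.holds = {}  # seat -> (agent, expiry time)
        self.sold = {}   # seat -> agent

    def try_hold(self, seat, agent):
        now = self.clock()
        if seat in self.sold:
            return False
        holder = self.holds.get(seat)
        if holder is not None and holder[0] != agent and holder[1] > now:
            return False  # someone else holds it: fail fast, do not wait
        self.holds[seat] = (agent, now + self.hold_seconds)
        return True

    def confirm(self, seat, agent):
        """Turn an unexpired hold into a sale."""
        now = self.clock()
        holder = self.holds.get(seat)
        if holder is None or holder[0] != agent or holder[1] <= now:
            return False
        del self.holds[seat]
        self.sold[seat] = agent
        return True


fake_now = [0.0]  # fake clock so the example runs instantly
holds = SeatHolds(hold_seconds=120, clock=lambda: fake_now[0])

first = holds.try_hold("23A", "agent1")   # agent1 talks to the customer
second = holds.try_hold("23A", "agent2")  # refused immediately
fake_now[0] = 121.0                       # agent1's hold has lapsed
third = holds.try_hold("23A", "agent2")   # now agent2 gets the seat
late = holds.confirm("23A", "agent1")     # too late for agent1
sale = holds.confirm("23A", "agent2")
print(first, second, third, late, sale)   # True False True False True
```

The hold table is exactly the "context that must be maintained for a period of time": it lives outside any single database transaction and has to survive for as long as the conversation with the customer takes.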

From MitchAlsup1@21:1/5 to Michael S on Fri Sep 13 17:05:35 2024

On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:

How many transactions per minute does the world's biggest company need at
peak hours? Is this number not small relative to the capabilities of
even a 15-year-old dual-Xeon server with a few dozen spinning-rust disks?

A SWAG::

8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.

So we have only 3B potential transactions, and a single person will
not average more than 1 transaction every 15 minutes over an hour.

So: 3B/15 = 200M T/m
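
The same estimate spelled out, using the rounded figures from the post (3B people able to transact, one transaction per person every 15 minutes). The per-second figure is just the per-minute one divided by 60 and is not in the original.

```python
potential_transactors = 3_000_000_000  # the post's rounded "3B"
minutes_between_txns = 15              # one transaction per person per 15 minutes

per_minute = potential_transactors / minutes_between_txns
per_second = per_minute / 60

print(f"{per_minute:,.0f} transactions per minute")  # 200,000,000
print(f"{per_second:,.0f} transactions per second")  # 3,333,333
```

This is an upper bound for the whole planet, not for any single company.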

From Lynn Wheeler@21:1/5 to John Levine on Fri Sep 13 08:45:46 2024

John Levine <[email protected]> writes:

They also care deeply about reliability. Modern mainframes have multiple kinds of error
checking and standby CPUs that can take over from a failed CPU, restart a failed
instruction, and the program doesn't notice. I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.

shortly after joining IBM, I got pulled into effort to multithread
370/195 ... 195 didn't have branch prediction or speculative execution
so conditional branches drained pipeline and most codes ran at half
rated throughput. Two (simulated, "red/black") instruction streams
running at half speed would achieve rated throughput.

They also claimed that the main difference between 360/195 and 370/195
was introduction of ("370") hardware retry (masking all sorts of
transient hardware errors). Some vague recall mention that 360/195 mean
time between some hardware check was three hrs (combination of number of circuits and how fast they were running).

Then decision was made to add virtual memory to all 370s and it was
decided that the difficulty in adding virtual memory to 370/195 wasn't justified ... and all new work on machine was dropped.

Account of end of ACS/360 ... Amdahl had won the battle to make ACS, 360 compatible ... but folklore is then executives were afraid that it would advance state-of-the-art too fast and IBM would lose control of the
market ... includes some references to multithreading
patents&disclosures.
https://people.computing.clemson.edu/~mark/acs_end.html

also mentions some of ACS/360 features show up more than 20yrs later
with ES/9000.

--
virtualization experience starting Jan1968, online at home since Mar1970

From Lynn Wheeler@21:1/5 to John Levine on Fri Sep 13 09:05:33 2024

John Levine <[email protected]> writes:

I suppose. A review from the USDOE said:

The IBM 3090 with Vector Facility is an extremely interesting machine
because it combines very good scalar performance with enhanced vector
and multitasking performance. For many IBM installations with a large
scientific workload, the 3090/vector/MTF combination may be an ideal
means of increasing throughput at minimum cost. However, neither the
vector nor multitasking capabilities are sufficiently developed to
make the 3090 competitive with our current worker machines for our
large-scale scientific codes.

1st part of 70s, IBM had FS effort, was totally different than 370
and was to completely replace 370 (internal politics was killing
off 370 efforts).
http://www.jfsowa.com/computer/memo125.htm

when FS finally implodes, there is a mad rush to get stuff back into the
370 product pipelines, including kicking off 3033&3081 efforts in
parallel.

I got sucked into work on 16-processor 370 SMP and we con the 3033
processor engineers into working on it in their spare time (lot more
interesting than remapping 168 logic for 20% faster chips). Everybody
thought it was great until somebody tells the head of POK lab that it could
be decades before the POK favorite son operating system (batch MVS) has
(effective) 16-processor support (at the time MVS documentation claimed
that its 2-processor throughput had 1.2-1.5 times the throughput of
single processor). The head of POK then invites some of us to never visit
POK again and the 3033 processor engineers, heads down and no
distractions (although I was invited to sneak back into POK to work with
them). POK doesn't ship a 16-processor machine until after the turn of
the century, more than two decades later.

Once the 3033 was out the door, the processor engineers start on
trout/3090. When vector was announced they complained about it being
purely marketing stunt ... that they had so speeded up 3090 scalar that
it ran at memory bus saturation (and vector would unlikely make
throughput much better).

I had also started pontificating the relative disk throughput had gotten
an order of magnitude slower (disks got 3-5 times faster while systems
got 40-50 times faster) since 360 announce. Disk division executive took exception and directed division performance group to refute the claims,
after a couple weeks they came back and said I had slightly understated
the problem. They respun the analysis on how to configure disks to
improve system throughput for a user group presentation (16Aug1984,
SHARE 63, B874).
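
The "order of magnitude" figure follows directly from the two ranges quoted above. A minimal sketch of that arithmetic, assuming only the 3-5x and 40-50x numbers from the post (the variable names are illustrative):

```python
# Relative disk throughput = system speedup / disk speedup.
# The ranges are the figures quoted in the post above.
disk_speedup = (3, 5)       # disks got 3-5 times faster
system_speedup = (40, 50)   # systems got 40-50 times faster


def relative_slowdown(disk, system):
    """How many times slower disks became relative to the rest of the system."""
    return system / disk


best_case = relative_slowdown(max(disk_speedup), min(system_speedup))   # 40 / 5
worst_case = relative_slowdown(min(disk_speedup), max(system_speedup))  # 50 / 3

print(f"relative slowdown: {best_case:.1f}x to {worst_case:.1f}x")
```

Both ends of the range bracket a factor of ten, which is consistent with the claim as stated.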

I was doing some work with disk engineers and that they had been
directed to use a very slow processor for the 3880 disk controller
follow-on to the 3830 ... while it handled 3mbyte/sec 3380 disks, it
otherwise seriously drove up channel busy. 3090 originally assumed that
3880 would be like previous 3830 but with 3mbyte/sec transfer ... when
they found out how bad things actually were, they realized they would
have to seriously increase the number of (3mbyte/sec) channels (to
achieve target throughput). Marketing then respins the significant
increase in channels as being wonderful I/O machine. Trivia: the
increase in channels required an extra TCM and the 3090 group
semi-facetiously claimed they would bill the 3880 group for increase in
3090 manufacturing cost.

I was also doing some work with Clementi https://en.wikipedia.org/wiki/Enrico_Clementi
E&S lab in IBM Kingston ... had boatload of Floating Point Systems boxes https://en.wikipedia.org/wiki/Floating_Point_Systems
that had 40mbyte/sec disk arrays for keeping the FPS boxes fed.

In 1980, I had been con'ed into doing channel-extender implementation
for IBM STL (since renamed SVL), they were moving 300 people and 3270
terminals to offsite bldg with dataprocessing back to STL datacenter.
They had tried "remote 3270" but found human factors unacceptable.
Channel-extender allowed "channel-attached" 3270 controllers to be placed
at offsite bldg with no human factors difference between offsite and in
STL. Side-effect was that it increased system throughput by 10-15%. They
had previously spread 3270 controllers across all the same channels with
disks, the channel-extender work significantly reduced 3270 terminal I/O
channel busy, increasing disk I/O and system throughput (they were
considering moving all 3270 controllers to channel-extender, even those
physically inside STL). Then there was attempt to get my support released
to customers, but there was group in POK playing with some serial stuff
that got it vetoed, they were afraid if it was in the market, it would
make it harder to release their stuff.

In 1988, the IBM branch office asks if I could help LLNL get some serial
stuff they were playing with, standardized ... which quickly becomes
fibre-channel standard ("FCS", initially 1gbit/sec, full-duplex,
aggregate 200mbyte/sec). The POK serial stuff finally gets released in
the 90s with ES/9000 as ESCON (when it is already obsolete,
17mbytes/sec). Then some POK engineers become involved with FCS and
define a heavy-weight protocol that significantly reduces throughput, eventually released as FICON. The latest, public benchmark I've found is
z196 "Peak I/O", getting 2M IOPS using 104 FICON. About the same time, a
FCS is announced for E5-2600 blade claiming over million IOPS (two such
FCS has higher throughput than 104 FICON). Also IBM docs had SAPs
(system assist processors that do actual I/O) kept to 70% cpu (more like
1.5M IOPS), also no IBM CKD DASD have been made for decades all being
simulated on industry standard fixed-block disks.

--
virtualization experience starting Jan1968, online at home since Mar1970


From Lynn Wheeler@21:1/5 to Terje Mathisen on Fri Sep 13 09:54:45 2024

Terje Mathisen <[email protected]> writes:

Novell's System Fault Tolerant NetWare 386 (around 1990) supported two complete servers acting like one, so that any hardware component could
fail and the system would keep running, with nothing noticed by the
clients, even those that were in the middle of an update/write
request.

late 80s, get HA/6000 project, originally for NYTimes to move their
newspaper system (ATEX) off VAXCluster to RS/6000. I then rename it
HA/CMP when I start doing technical/scientific scale-up with national
labs and commercial scale-up with RDBMS vendors (Oracle, Sybase,
Informix, Ingres) that had VAXCluster support in same source base with
Unix (I do distributed lock manager that supported VAXCluster semantics
to ease ports). https://en.wikipedia.org/wiki/IBM_High_Availability_Cluster_Multiprocessing

IBM had been marketing S/88, rebranded fault tolerant. Then the S/88
product administer starts taking us around to their customers. https://en.wikipedia.org/wiki/Stratus_Technologies
Also has me write a section for the corporate continuous availability
strategy document ... however, it gets pulled when both Rochester
(AS/400, I-systems) and POK (mainframe) complain that they couldn't meet
the requirements.

Early Jan92 in meeting with Oracle CEO, AWD/Hester tells Ellison that we
would have 16processor clusters by mid92 and 128processor clusters by
ye92. Within a couple weeks (end jan92), cluster scale-up is transferred
for announce as IBM Supercomputer (scientific/technical *ONLY*) and we
are told we can't work on anything with more than four processors (we
leave IBM a few months later).

--
virtualization experience starting Jan1968, online at home since Mar1970


From Lynn Wheeler@21:1/5 to John Levine on Fri Sep 13 10:38:40 2024

John Levine <[email protected]> writes:

That's fine for workloads that work that way.

Airline reservation systems historically ran on mainframes because when they were invented
that's all there was (original SABRE ran on two 7090s) and they are business critical so
they need to be very reliable.

About 30 years ago some guys at MIT realized that route and fare search, which are some of
the most demanding things that CRS do, are easy to parallelize and don't have to be
particularly reliable -- if your search system crashes and restarts and reruns the search
and the result is a couple of seconds late, that's OK. So they started ITA software which
used racks of PC servers running parallel applications written in Lisp (they were from
MIT) and blew away the competition.

However, that's just the search part. Actually booking the seats and selling tickets stays
on a mainframe or an Oracle system because double booking or giving away free tickets would
be really bad.

There's also a rule of thumb about databases that says one system of performance 100 is
much better than 100 systems of performance 1 because those 100 systems will spend all
their time contending for database locks.

after leaving IBM was brought into largest airline res system to look at
ten impossible things they can't do. Got started with "ROUTES" (about
25% of the mainframe workload), they gave me a full softcopy of OAG (all
scheduled commercial flt segments in the world) ... couple weeks later
came back with ROUTES that implemented their impossible things.
Mainframe had tech trade-offs from the 60s and starting from scratch
could make totally different tech trade-offs, initially ran 100 times
faster, then implementing the impossible stuff and still ran ten times
faster (than their mainframe systems). Showed that ten rs6000/990 could
handle workload for every flt and every airline in the world.

Part of the issue was that they extensively massaged the data on a
mainframe MVS/IMS system and then in sunday night, rebuilt the mainframe
"TPF" (limited datamanagement services) system from the MVS/IMS
system. That was all eliminated.

Fare search was harder because it started being "tuned" by some real
time factors.

Could move all to RS/6000 - HA/CMP. Then some very non-technical issues
kicked-in (like large staff involved in the data massaging). trivia: I
had done a bunch of sleight of hand for HA/CMP RDBMS distributed lock
manager scaleup for 128-processor clusters.

--
virtualization experience starting Jan1968, online at home since Mar1970


From Thomas Koenig@21:1/5 to Michael S on Fri Sep 13 21:43:06 2024

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Terje Mathisen <[email protected]> schrieb:

10-15 years ago I talked to another speaker at a conference, he
told me that he was working on high-end open source LDAP software
using _very_ large memory DBs: Their system allowed one US cell
phone company to keep every SIM card (~100M) on a single system,
while a similar-size competitor had been forced to fall back on
17-way sharding (presumably using a hash of the SIM id).

Keeping databases in memory is definitely a thing now... see SAP HANA.

Any architectural implications for this?

Browsing through the SAP pages, it seems they used Intel's Optane
persistent memory, but that is no longer manufactured (?). But
having fast, persistent storage is definitely an advantage for
databases.

Large memory: Of course.

On the ISA level... these databases run on x86, so that seems to
be good enough.

Anything else?

Another thing that SAP HANA seems to use more intensely than anybody
else is Intel TSX. TSX (at least RTM part, I am not sure about HLE
part) still present in the latest Xeon generation, but is strongly de-emphasized.

Sounds like a market niche... Mitch, how good is your ESM for
in-memory databases?


From Lawrence D'Oliveiro@21:1/5 to Lynn Wheeler on Fri Sep 13 22:14:11 2024

On Fri, 13 Sep 2024 09:05:33 -1000, Lynn Wheeler wrote:

I had also started pontificating the relative disk throughput had gotten
an order of magnitude slower (disks got 3-5 times faster while systems
got 40-50 times faster) since 360 announce.

Out of curiosity, did you have figures on how closely the filesystem could
get to using all of theoretical disk I/O bandwidth?

I ask because, in the Unix world, this was pretty terrible until
Berkeley’s FFS (“Fast File System”) came along.


From John Levine@21:1/5 to [email protected] on Fri Sep 13 21:50:15 2024

It appears that Michael S <[email protected]> said:

There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.

How many transactions per minute does world's biggest company need at
peak hours?

Ten years ago Visa could process 56,000 messages/second. It must be a
lot more now. I think a transaction is two or four messages depending
on the transaction type.

Is not this number small relatively to capabilities of
even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?

Uh, no, it is not.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Sep 13 23:12:24 2024

On Fri, 13 Sep 2024 21:43:06 +0000, Thomas Koenig wrote:

Michael S <[email protected]> schrieb:

On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Anything else?

Another thing that SAP HANA seems to use more intensely than anybody
else is Intel TSX. TSX (at least RTM part, I am not sure about HLE
part) still present in the latest Xeon generation, but is strongly
de-emphasized.

Sounds like a market niche... Mitch, how good is your ESM for
in-memory databases?

I do not think the in-memory part has anything to do with ESM
ATOMIC behavior.

I have no actual data, all I have is mental analyses.

The real thing about ESM is that it allows one to code in such a way
as to need FEWER ATOMIC events--because each event can do more work;
so, thereby one needs fewer events.

1) You can acquire several cache lines and perform a single event
that would take a more typical ISA multiple ATOMIC instructions.
This attacks the exponent of how rapidly things degrade under
contention.

2) secondly if a higher privilege thread contends with a lower thread
the higher privileged thread wins.

3) amongst equally privileged threads the one(s) that have made more
forward progress succeed while those just getting started fail.

4) There are ways for SW to get a count of the amount of interference
and each thread choose more wisely such that contention is reduced
on subsequent tries. There are some ATOMIC things for which this takes
a BigO( n**3 ) and makes it BigO( 3 ) {yes constant time}. A more
typical use with new contenders coming and going randomly goes from
BigO( n**3 ) to between BigO( n*ln(ln(n)) ) and BigO( n*ln(n) ).

HOWEVER:: if one uses ESM to simply implement locking behavior; only
part 1) above applies. That is if one uses ESM to create your standard
{test&set, test&test&set, LoadLocked-StoreConditional, CAS, DCAS,
DCADS, TCADS,...} then getting a performing kernel depends on how
the SW is written, not necessarily how HW performs ESM.


From Lawrence D'Oliveiro@21:1/5 to Michael S on Sat Sep 14 01:47:03 2024

On Fri, 13 Sep 2024 12:22:17 +0300, Michael S wrote:

How many transactions per minute does world's biggest company need at
peak hours?

A few years ago, I read an article about Facebook’s setup. At the time,
they had about a billion users who were active at least once a month. So
that would have been over 300 postings per second, sustained.
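
The "over 300 per second" figure can be reproduced with a short calculation. This is only a sketch: it assumes exactly one posting per active user per 30-day month, which is the minimum implied by "active at least once a month", and the names are illustrative.

```python
# Sustained posting rate implied by one billion monthly-active users.
SECONDS_PER_MONTH = 30 * 24 * 3600      # 30-day month (assumption)
monthly_active_users = 1_000_000_000    # figure from the post
posts_per_user_per_month = 1            # assumption: the bare minimum activity

posts_per_second = (
    monthly_active_users * posts_per_user_per_month / SECONDS_PER_MONTH
)
print(f"{posts_per_second:.0f} postings per second, sustained")
```

Any higher per-user activity scales the rate linearly, so this is a floor rather than an estimate of the real load.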

They were using MySQL with memcached, and I think they already had HHVM
(their custom PHP implementation) then as well.

Mainframes? Never heard of them.


From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Sat Sep 14 01:44:14 2024

On Fri, 13 Sep 2024 11:20:06 -0000 (UTC), Thomas Koenig wrote:

Keeping databases in memory is definitely a thing now... see SAP HANA.

memcached might have been there before SAP.


From Lawrence D'Oliveiro@21:1/5 to John Dallman on Sat Sep 14 01:48:26 2024

On Fri, 13 Sep 2024 16:18 +0100 (BST), John Dallman wrote:

So there's real demand for systems with huge capacity. Not very many of
them, but they have large budgets.

Did somebody say “cloud” ... ?


From Anton Ertl@21:1/5 to John Levine on Sat Sep 14 09:21:46 2024

John Levine <[email protected]> writes:

It appears that Michael S <[email protected]> said:

There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.

How many transactions per minute does world's biggest company need at
peak hours?

Ten years ago Visa could process 56,000 messages/second. It must be a
lot more now. I think a transaction is two or four messages depending
on the transaction type.

Is not this number small relatively to capabilities of
even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?

Uh, no, it is not.

The way I would design this for a machine with that little IOPS is as
an in-memory database, with transactions written to a log on RAID-1
(on two or three of the HDDs), and a snapshot of the in-memory
database written to disk repeatedly, with copy-on-write to get a
consistent snapshot. The 8 cores of a 2009-vintage dual-Xeon
machine should be easily capable of doing it, but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).
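
A minimal sketch of the design described above: balances held in memory, every transaction appended to a log before it is applied, and a periodic snapshot that records the log offset it is consistent with. Everything here is an illustrative assumption rather than a real system: the class and file names are hypothetical, a plain dict copy stands in for copy-on-write, and a single log file stands in for the RAID-1 log.

```python
import json
import os
import tempfile


class InMemoryLedger:
    """Balances live in a dict; durability comes from a log plus snapshots."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "txn.log")
        self.snap_path = os.path.join(directory, "snapshot.json")
        self.balances = {}
        self.log = open(self.log_path, "ab")  # append-only transaction log

    def apply(self, account, delta):
        # Log first, then update memory, so every in-memory change is on disk.
        self.log.write(f"{account} {delta}\n".encode())
        self.log.flush()
        os.fsync(self.log.fileno())  # a real system would batch these syncs
        self.balances[account] = self.balances.get(account, 0) + delta

    def snapshot(self):
        # A plain dict copy stands in for copy-on-write here (assumption).
        frozen = dict(self.balances)
        offset = self.log.tell()  # log position this snapshot corresponds to
        tmp = self.snap_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": offset, "balances": frozen}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.snap_path)  # swap in the new snapshot atomically

    def close(self):
        self.log.close()

    @classmethod
    def recover(cls, directory):
        # Load the last snapshot, then replay only the log written after it.
        ledger = cls(directory)
        offset = 0
        if os.path.exists(ledger.snap_path):
            with open(ledger.snap_path) as f:
                snap = json.load(f)
            offset = snap["offset"]
            ledger.balances = {int(k): v for k, v in snap["balances"].items()}
        with open(ledger.log_path, "rb") as f:
            f.seek(offset)
            for line in f:
                account, delta = (int(x) for x in line.split())
                ledger.balances[account] = ledger.balances.get(account, 0) + delta
        return ledger


with tempfile.TemporaryDirectory() as d:
    ledger = InMemoryLedger(d)
    ledger.apply(4111, 500)
    ledger.apply(4222, 300)
    ledger.snapshot()
    ledger.apply(4111, -120)   # only this entry is replayed on recovery
    ledger.close()
    recovered = InMemoryLedger.recover(d)
    recovered.close()
    print(recovered.balances)
```

Syncing the log on every transaction would not keep up with tens of thousands of transactions per second on spinning disks; batching several log records per sync is the usual way around that, and is left out of the sketch for brevity.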

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Thomas Koenig@21:1/5 to Anton Ertl on Sat Sep 14 09:59:58 2024

Anton Ertl <[email protected]> schrieb:

[in-memory database]

but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).

The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.

Interestingly enough, it will run on selected systems, which only
have Intel processors, and little-endian POWER 8 to 10. No AMD,
no ARM, no zSystem.


From Lawrence D'Oliveiro@21:1/5 to John Levine on Sat Sep 14 09:42:00 2024

On Fri, 13 Sep 2024 21:50:15 -0000 (UTC), John Levine wrote:

Ten years ago Visa could process 56,000 messages/second.

That maybe sounds better than it is. After all, most of those transactions would tend to be geographically localized.


From Anton Ertl@21:1/5 to Thomas Koenig on Sat Sep 14 10:45:34 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

[in-memory database]

but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).

The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.

What is the relevance of SAP HANA for the topic at hand?

The question was if the RAM can hold the data. For each account they
would have to keep the current balance (64 bits should be enough for
that), the account number (64 bits for the up to 19 digits of a Visa
card) for verifying that we are at the correct entry in the hash table
and probably some account status information (64 bits should be
plenty?).
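
The three 64-bit fields above add up to a 24-byte hash table entry. A sketch of such a table, assuming open addressing with linear probing and assuming account number 0 is never issued so it can mark a free slot (both are illustrative choices, not something stated in the thread):

```python
import struct

# One entry: balance (signed 64-bit), account number (unsigned 64-bit),
# status flags (unsigned 64-bit); "<" means no padding, so 24 bytes total.
ENTRY = struct.Struct("<qQQ")
EMPTY = 0  # assumption: account number 0 is never issued


class AccountTable:
    def __init__(self, slots):
        self.slots = slots
        self.buf = bytearray(slots * ENTRY.size)  # zero-filled: all slots empty

    def _probe(self, account):
        # Linear probing: stop at the matching entry or the first empty slot.
        i = hash(account) % self.slots
        for _ in range(self.slots):
            balance, stored, status = ENTRY.unpack_from(self.buf, i * ENTRY.size)
            if stored == account or stored == EMPTY:
                return i, balance, stored, status
            i = (i + 1) % self.slots
        raise MemoryError("table full")

    def add(self, account, delta):
        # Creates the entry on first use, otherwise adjusts the balance.
        i, balance, _, status = self._probe(account)
        ENTRY.pack_into(self.buf, i * ENTRY.size, balance + delta, account, status)

    def balance(self, account):
        _, balance, stored, _ = self._probe(account)
        if stored != account:
            raise KeyError(account)
        return balance


table = AccountTable(slots=8)
card = 4111111111111111
table.add(card, 500)
table.add(card, -120)
table.add(card + 8, 250)  # same home slot as `card`, resolved by probing
print(ENTRY.size, table.balance(card), table.balance(card + 8))
```

Storing the account number in each entry is what allows a lookup to verify that it has landed on the correct entry after a collision.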

There is also the sequence of transactions (a 64-bit transaction
offset in the log per transaction should be enough for that). The
sequence of transactions may be useful for fraud detection, but I
don't know enough about that to know how to scale the system, so I'll
just say that fraud detection is done by a bigger system before the
transaction goes through to the transaction processing computer.

The sequence of transactions is also needed for generating the reports
and for dealing with customer complaints, but again, that's not
processing the transactions themselves (and is basically read-only,
except that the customer-complaint processing may result in additional transactions).

So, with 24 bytes needed for each account on the
transaction-processing server, 32GB with, say 8GB left for
copy-on-write and other administrative purposes should be good for
about 900M accounts at a hash table load factor of 84%. I guess that
Visa has more accounts, so one would need a box with more RAM.
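
The capacity estimate above can be checked with a few lines. This sketch assumes "GB" means 2**30 bytes and uses only the figures from the paragraph (24 bytes per account, 32GB total, 8GB reserved, 84% load factor); the function name is hypothetical.

```python
GIB = 2**30              # assumption: "GB" in the post means 2**30 bytes
BYTES_PER_ACCOUNT = 24   # three 64-bit fields per account


def max_accounts(total_gib, reserved_gib, load_factor):
    """Accounts that fit in the hash table at the given load factor."""
    usable_bytes = (total_gib - reserved_gib) * GIB
    slots = usable_bytes // BYTES_PER_ACCOUNT
    return int(slots * load_factor)


accounts = max_accounts(total_gib=32, reserved_gib=8, load_factor=0.84)
print(f"about {accounts / 1e6:.0f}M accounts")
```

With 24GB usable the table has 2**30 slots, so the account count is simply 84% of roughly 1.07 billion.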

A single core of the Xeon should easily be able to handle all the 56K transactions per second, both the logging and the update of the hash
table, and in that case no locking is needed. But that first needs a
sequence of transactions coming in.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Thomas Koenig@21:1/5 to Anton Ertl on Sat Sep 14 11:46:24 2024

Anton Ertl <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

[in-memory database]

but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).

The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.

What is the relevance of SAP HANA for the topic at hand?

It is something that is implemented, unlike what you were discussing.


From Anton Ertl@21:1/5 to Thomas Koenig on Sat Sep 14 12:48:33 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.

What is the relevance of SAP HANA for the topic at hand?

It is something that is implemented, unlike what you were discussing.

So what? Linux is also implemented, and it runs on a 32GB machine.

Neither Linux nor SAP HANA satisfy even the most basic requirement
that I outlined (keeping balances) without additional implementation
work. And I doubt that if you give a 15 year old Dual-Xeon even with
64GB of RAM and a bunch of HDDs to a typical SAP developer, he will
implement a system that manages to keep the balance on 1.8M (double
the number for double RAM capacity) credit cards at 56K transactions
per second on that system. What I described is relatively
straightforward to implement on top of Linux.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Anton Ertl on Sat Sep 14 13:41:53 2024

[email protected] (Anton Ertl) writes:

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.

What is the relevance of SAP HANA for the topic at hand?

It is something that is implemented, unlike what you were discussing.

So what? Linux is also implemented, and it runs on a 32GB machine.

Neither Linux nor SAP HANA satisfy even the most basic requirement
that I outlined (keeping balances) without additional implementation
work. And I doubt that if you give a 15 year old Dual-Xeon even with
64GB of RAM and a bunch of HDDs to a typical SAP developer, he will
implement a system that manages to keep the balance on 1.8M (double
the number for double RAM capacity) credit cards at 56K transactions
per second on that system. What I described is relatively
straightforward to implement on top of Linux.

That should be 1.8G credit cards. I guess that a typical SAP HANA
client developer will be able to handle 1.8M credit cards in 64GB.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to [email protected] on Sat Sep 14 22:17:51 2024

On Fri, 13 Sep 2024 17:05:35 +0000
[email protected] (MitchAlsup1) wrote:

On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:

How many transactions per minute does world's biggest company need
at peak hours? Is not this number small relatively to capabilities
of even 15 y.o. dual-Xeon server with few dozens of spinning rust
disks?

A SWAG::

8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.

So we have only 3B potential transactions, and a single person will
not average more than 1 transaction every 15 minutes over an hour.

So: 3B/15 = 200M T/m

I don't know about you, but I personally don't book flights 8 hours per
day. Even less so in the biggest company in the world, which, I
suppose, does not account for more than 5-7% of world's flights.


From Michael S@21:1/5 to John Levine on Sat Sep 14 23:32:36 2024

On Fri, 13 Sep 2024 21:50:15 -0000 (UTC)
John Levine <[email protected]> wrote:

It appears that Michael S <[email protected]> said:

There's also a rule of thumb about databases that says one system
of performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.

How many transactions per minute does world's biggest company need at
peak hours?

Ten years ago Visa could process 56,000 messages/second. It must be a
lot more now. I think a transaction is two or four messages depending
on the transaction type.

Is not this number small relatively to capabilities of
even 15 y.o. dual-Xeon server with few dozens of spinning rust
disks?

Uh, no, it is not.

I probably was not clear enough. I have no doubts that there exist jobs
for which the machine that I mentioned above is insufficient.
I just don't believe that flight reservation is one of such jobs.

BTW, tpc.org site is down, so I can not find OLTP benchmarks for the
sort of machines that I mentioned in my previous post. Quite possibly
they (scores) do not exist, because in 2009 people already stopped
benchmarking OLTP with rotating media.
For reference, the best TPC-C score for 2-way Xeon in 2010 is 803,068
tpmC, but that score uses SSDs. IIRC, that's ~80% of the world's
absolutely fastest non-clustered score of 2003.

I am posting a link with a hope that tpc.org will be up tomorrow.
http://www.tpc.org/results/individual_results/HP/hp_DL380_TPCC_051110_ES.pdf


From MitchAlsup1@21:1/5 to Michael S on Sat Sep 14 20:42:31 2024

On Sat, 14 Sep 2024 19:17:51 +0000, Michael S wrote:

On Fri, 13 Sep 2024 17:05:35 +0000
[email protected] (MitchAlsup1) wrote:

On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:

How many transactions per minute does world's biggest company need
at peak hours? Is not this number small relatively to capabilities
of even 15 y.o. dual-Xeon server with few dozens of spinning rust
disks?

A SWAG::

8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.

So we have only 3B potential transactions, and a single person will
not average more than 1 transaction every 15 minutes over an hour.

So: 3B/15 = 200M T/m

I don't know about you, but I personally don't book flights 8 hours per
day. Even less so in the biggest company in the world, which, I
suppose, does not account for more than 5-7% of world's flights.

A number that the total number of transactions of all kinds worldwide
will not exceed, given a total population of 8B people.
{Not just airline, but every transaction over the whole world.}
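
The upper bound quoted above works out as follows. This is a sketch using only the figures from the SWAG (8B people, one third of them transacting, one transaction per 15 minutes, rounded to 3B as in the quoted text); the per-second conversion is added for comparison and the names are illustrative.

```python
# Upper bound on worldwide transactions per minute, per the SWAG above.
world_population = 8_000_000_000
active_fraction = 1 / 3             # one of the three thirds in the SWAG
minutes_between_transactions = 15

active_people = world_population * active_fraction   # about 2.67B
rounded_active = 3_000_000_000                       # rounded up, as in the SWAG

per_minute = rounded_active / minutes_between_transactions
per_second = per_minute / 60

print(f"{active_people / 1e9:.2f}B active, "
      f"{per_minute / 1e6:.0f}M T/m, {per_second / 1e6:.2f}M T/s")
```

Because the 3B figure is rounded up from about 2.67B, the result errs on the high side, which suits its use as a ceiling.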


From Lynn Wheeler@21:1/5 to John Levine on Sat Sep 14 10:57:57 2024

John Levine <[email protected]> writes:

Ten years ago Visa could process 56,000 messages/second. It must be a
lot more now. I think a transaction is two or four messages depending
on the transaction type.

card associations were originally to promote brand acceptance/uptake/advertising and network interconnecting the acquiring/merchant card transaction processors with the issuing/consumer
card transaction processors (issuer processor doing the real-time authorization/"auth" transaction).

late 90s, internet/micropayments was looking at card transaction
processors being able to handle micropayments ... but required
significantly higher transaction rate than card processors were capable
of. They turn to cellphone operations that were using "in-memory" DBMS
capable of ten times the transaction rate (that card processors were
doing).

Some of the cellphone companies were enticed to get into micropayments
but got out after a few years, turns out they lacked the significant
fraud handling capability (they were absorbing cellphone calling fraud
because it was their own resources, but in case of micropayments fraud,
it involved actually transferring real money to other entities).

As an aside, card association interconnect network was flavor of VAN
(value added networks) that was prevalent at the time, but were in the
process of being obsoleted by the internet. As an aside, at the turn
of the century, 90% of all acquiring&issuing card transactions were
being handled by six datacenters having their own private, dedicated,
non-association interconnect ... big litigation between card
associations and those processors (the card association network had been
charging fee for each transaction that flowed through their network, and
association still wanted that fee paid them whether or not the
transaction actually flowed through their network).

--
virtualization experience starting Jan1968, online at home since Mar1970


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 15 00:57:12 2024

On Sun, 15 Sep 2024 0:40:31 +0000, Lawrence D'Oliveiro wrote:

On Sat, 14 Sep 2024 10:57:57 -1000, Lynn Wheeler wrote:

Some of the cellphone companies were enticed to get into micropayments
but got out after a few years, turns out they lacked the significant
fraud handling capability ...

Meanwhile, the Kenyans have figured out how to run a successful online micropayments system mediated via text messages (M-Pesa).

Another reason not to have ANY apps on your cell phone.


From Lawrence D'Oliveiro@21:1/5 to Lynn Wheeler on Sun Sep 15 00:40:31 2024

On Sat, 14 Sep 2024 10:57:57 -1000, Lynn Wheeler wrote:

Some of the cellphone companies were enticed to get into micropayments
but got out after a few years, turns out they lacked the significant
fraud handling capability ...

Meanwhile, the Kenyans have figured out how to run a successful online micropayments system mediated via text messages (M-Pesa).


From John Levine@21:1/5 to All on Sun Sep 15 02:02:07 2024

According to Lawrence D'Oliveiro <[email protected]d>:

Some of the cellphone companies were enticed to get into micropayments
but got out after a few years, turns out they lacked the significant
fraud handling capability ...

Meanwhile, the Kenyans have figured out how to run a successful online
micropayments system mediated via text messages (M-Pesa).

M-Pesa isn't micropayments, typical transfers are on the order of
a dollar. It was only possible because there was a dominant
government owned mobile carrier in Kenya and M-Pesa is
basically sending around prepaid mobile phone credits.

It is certainly a success, providing banking services to vast numbers
of poor people who'd never be able to use a normal bank.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Lynn Wheeler@21:1/5 to All on Sat Sep 14 17:00:59 2024

before and after turn of century we would periodically have threads on
the "bank fraud blame game" (in financial industry mailing lists);
interchange fees that financial institutions charge merchants is base
plus fraud surcharge ... adjusted for the fraud rate for kind of
transactions .... internet transactions can have highest surcharge (with
many banks' profit from fraud surcharge reaching major percentage of
their bottom line)

right after turn of century, several "safe" transaction products were
presented to major online merchants (representing 80% of total internet
payment transactions) which saw high acceptance ... expecting that the
fraud surcharge would be eliminated. Then the cognitive dissonance set
in, they were told that instead of eliminating the fraud surcharge, a
new large "safe" surcharge would be added on top of the existing fraud
surcharge ... and all the interest evaporated.

I had co-authored financial industry transaction protocols as well as
done "safe" transaction chip design (that was one of the "safe"
products) ... was one of panel giving talk at standing room only large
ballroom, semi-facetiously saying I was taking $500 milspec chip and
aggressively cost reducing by more than two orders of magnitude while
increasing its security: https://csrc.nist.gov/pubs/conference/1998/10/08/proceedings-of-the-21st-nissc-1998/final
got prototype chips after turn of the century and gave talk in assurance
panel in the trusted computing track at 2001 IDF https://web.archive.org/web/20011109072807/http://www.intel94.com/idf/spr2001/sessiondescription.asp?id=stp%2bs13
the guy running trusted-computing TPM chip was in front row and I chided
him that it was nice to see his chip was starting to look more like
mine; his response was that I didn't have a committee of 200 people
helping me with design.

--
virtualization experience starting Jan1968, online at home since Mar1970

From Kent Dickey@21:1/5 to Anton Ertl on Fri Sep 20 18:35:26 2024

In article <[email protected]>,
Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.

The future of programming languages is type safe with checks, you need to
get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.

[ More details about architectures without trapping overflow instructions ]

Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.

If you want to do almost anything at all other than core dump on
overflow, you need to branch to recovery code. And although it's
theoretically possible to recover from the trap, it's worse than any
other approach. So it's added hardware that's HARDER for software to
use. No surprise it's gone away.

IA64 went down this road--trapping on speculation failures. It was a
huge disaster--trying to recover through an exception handler mechanism
is slow and painful, for the reasons I'll lay out for overflow
exceptions.

Let's look at how you might want to handle overflows when they happen:

1) Your language supports seamlessly transitioning to BigInts on
overflow. Then each operation that could overflow needs to call
a special bit of code to change to BigInt and then continue the
calculation. This code must exist, even if a trapping
instruction doesn't need an explicit branch to it. Some
mechanism is needed to call this code.

2) You need to call an exception handler, and the routine with the overflow
is ended. We need to know which exception handler to call.

3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.

4) You just want to crash the program. If a debugger is attached, it can
say where the overflow occurred.

Trapping on overflow instructions really are only useful for #4. Let's
look at how the other cases could be handled, with a) meaning using
branches, and b) meaning using a trapping instruction.

1a) (BigInt): After doing an operation which could overflow, use a
conditional branch to jump to code to convert to BigInt, which
then jumps back. Overhead is basically the branch-on-overflow
instruction.

1b) (BigInt with traps). Hardware traps to the OS, which needs to prepare
the required structures describing the exception (all regs and
the address), and then call the signal handler. The signal
handler needs to look up the address of the trap with a table
describing what to do for this particular operation which
overflowed. Each table entry needs to describe, in detail, what
registers are involved (the sources and the dest), and where to
return once the BigInt has been created. This requires massive
changes to the compiler (and possibly linker) to prepare these
tables. The compiler must guarantee that changing the dest
register to a pointer to BigInt works properly (otherwise,
special code needs to be emitted for each potentially trapping
instruction to try to recover).

2a) (Try/Catch): After doing an operation which could overflow, use a
conditional branch to jump to the catch block.

2b) (Try/Catch with traps). Repeat all the OS work and call the signal
handler. Now, it just needs a table entry describing where to
jump to to enter the catch block. Almost all the complexity of
1b), but without needing the register details.

3a) (Clamp): After doing an operation which could overflow, use a
conditional branch to do the MIN/MAX operations to bring it back
within range and then jump back.

3b) (Clamp with trap): Basically the same as 1b), but there's an alternative
if the clamps are global (MAX_INT/MIN_INT). The exception handler
can read the instruction which trapped, figure out the source and
dest registers, re-do the calculation, and clamp the destination
to MIN or MAX, and return to just after the instruction which
trapped.

4a) (Crash): Every operation that could overflow needs a conditional branch
after it to branch to a crashing instruction (or a branch over
an undefined instruction if there's no overflow).

4b) (Crash with trap): Use operations which trap on overflow. This takes
no new instructions and costs no performance.
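
To make the cost of the b) variants concrete, here is a minimal C sketch of the kind of per-instruction lookup table that 1b) and 2b) describe. Everything in it is hypothetical: `struct trap_entry`, its field names, and `find_trap_entry` are invented for illustration and do not come from any real compiler, linker, or OS ABI. It assumes the table is sorted by trapping address and uses the standard `bsearch` for the lookup.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record a compiler would emit for each trapping instruction. */
struct trap_entry {
    uintptr_t trap_pc;      /* address of the trapping arithmetic instruction */
    uintptr_t recovery_pc;  /* where to resume: fix-up code or a catch block  */
    uint8_t   src1, src2;   /* source register numbers (needed by 1b, not 2b) */
    uint8_t   dest;         /* destination register number                    */
};

/* Comparison callback for bsearch: key is a faulting address. */
static int cmp_entry(const void *key, const void *elem)
{
    uintptr_t pc = *(const uintptr_t *)key;
    const struct trap_entry *e = elem;
    if (pc < e->trap_pc) return -1;
    if (pc > e->trap_pc) return 1;
    return 0;
}

/* Look up the faulting address in a table sorted by trap_pc.
   NULL means nothing was registered, so crashing (case 4) is all that is left. */
const struct trap_entry *find_trap_entry(const struct trap_entry *table,
                                         size_t count, uintptr_t fault_pc)
{
    return bsearch(&fault_pc, table, count, sizeof table[0], cmp_entry);
}
```

Even in this stripped-down form each entry holds two addresses plus register numbers, which is the size comparison against a single branch instruction that the list above is making.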

Basically, all a) cases are:

    op_which_might_overflow();
    if (overflow_happened) {
        handle the overflow
    }

Trapping-on-overflow instructions are clearly useless for languages
which care about overflow. To save one branch instruction, an entry is
needed to describe how to handle the overflow, which is certainly larger
than a branch instruction. And the code to "handle the overflow" is
needed in any case. And this assumes some sort of instant lookup--of the
1000 overflow instructions, we need a hash table to look up the address,
which is more overhead.
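
The a) pattern above can be sketched in C as follows. This is only an illustration, assuming a GCC or Clang compiler that provides the `__builtin_add_overflow` builtin; the function names `add_clamped` and `add_checked` are made up for this example.

```c
#include <limits.h>
#include <stdbool.h>

/* Case 3a: do the add, branch on overflow, clamp, continue. */
long add_clamped(long a, long b)
{
    long sum;
    if (__builtin_add_overflow(a, b, &sum)) {
        /* A signed add only overflows when both operands share a sign,
           so the sign of 'a' tells us which way to saturate. */
        return (a < 0) ? LONG_MIN : LONG_MAX;
    }
    return sum;
}

/* Case 2a: do the add and tell the caller whether it succeeded,
   so the caller can branch to its own recovery ("catch") code. */
bool add_checked(long a, long b, long *out)
{
    return !__builtin_add_overflow(a, b, out);
}
```

On targets with an overflow flag, compilers typically turn the builtin into an add followed by one conditional branch, which is the overhead the a) cases are counting.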

Trapping on overflow instructions are useful as a debug aid for
languages which don't care about overflow--but then you're optimizing
something nearly useless. It also might be helpful if global clamping
to MIN/MAX was useful (and I don't think it is).

Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding trap-on-overflow instructions is a waste of effort.
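
For readers wondering what "difficult" means on an ISA without condition flags, here is a hedged C sketch of detecting signed-add overflow using only an add and two comparisons. It mirrors, as far as I recall, the add/slti/slt/bne idiom suggested in the RISC-V specification; treat that mapping as an assumption. The function name `add_overflows` is invented for this example.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true if a + b overflows; *sum receives the wrapped result. */
bool add_overflows(long a, long b, long *sum)
{
    /* Wrapping add done in unsigned arithmetic, since signed overflow is
       undefined in C. Converting back to long is implementation-defined
       but wraps on all common two's complement targets. */
    *sum = (long)((unsigned long)a + (unsigned long)b);

    /* Without overflow, adding a negative b must make the sum smaller
       than a, and adding a non-negative b must not. A mismatch between
       the two comparisons means the add overflowed. */
    return (b < 0) != (*sum < a);
}
```

That is three extra operations plus the branch per checked add, compared with a single branch-on-overflow on a flag-based ISA.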

Note that using traps on data access violations which are "fixed" by
signal handlers CAN work out. They are slow, but as long as the
exception handler can fix the access violation and return right to the instruction which failed (without needing to know ANYTHING about that instruction in particular), this can work fine. But integer overflow
doesn't work like that--it's generally not possible to figure out
in the trap handler what to do without more information.

Kent

From MitchAlsup1@21:1/5 to Kent Dickey on Fri Sep 20 22:00:28 2024

On Fri, 20 Sep 2024 18:35:26 +0000, Kent Dickey wrote:

In article <[email protected]>,

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.

[ More details about architectures without trapping overflow
instructions ]

Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about performance.

If you want to do almost anything at all other than core dump on
overflow, you need to branch to recovery code. And although it's theoretically possible to recover from the trap, it's worse than any
other approach. So it's added hardware that's HARDER for software to
use. No surprise it's gone away.

Note: Linux does not even have an "Integer Overflow" signal, while
it does have a "FP exception" signal.

But then IEEE 754 exception semantics make even less sense than
Linux signals. ...

From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 01:09:43 2024

On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:

But then IEEE 754 exception semantics make even less sense than Linux signals. ...

Note that what IEEE 754 calls an “exception” is just a bunch of status
bits reporting on the current state of the computation: there is no
implication of some transfer of control elsewhere.

From Lawrence D'Oliveiro@21:1/5 to Kent Dickey on Sat Sep 21 01:12:11 2024

On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.

This won’t work. The values outside the range are by definition non-representable, so comparisons against them are useless.

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Sep 21 01:52:32 2024

On Sat, 21 Sep 2024 1:09:43 +0000, Lawrence D'Oliveiro wrote:

On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:

But then IEEE 754 exception semantics make even less sense than Linux
signals. ...

Note that what IEEE 754 calls an “exception” is just a bunch of status bits reporting on the current state of the computation: there is no implication of some transfer of control elsewhere.

Then how do you implement the alternate exception model ???
which IS part of 754-2008 and 754-2019

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Sep 21 01:51:21 2024

On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:

On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.

This won’t work. The values outside the range are by definition non-representable, so comparisons against them are useless.

When a range is 0..10 both -1 and 11 are representable in
the arithmetic of ALL computers, just not in the language
specifying the range.

So you are talking a language issue not a computer arithmetic
issue.
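
A small C sketch of that distinction, assuming a language-level subrange such as 0..10 carried on an ordinary machine integer. The function name `add_in_subrange` and its parameters are hypothetical.

```c
#include <stdbool.h>

/* Add two values and check the result against a language-level subrange.
   The hardware can represent the out-of-range sum; the check against
   lo and hi is the part that the language adds. */
bool add_in_subrange(int a, int b, int lo, int hi, int *out)
{
    long long wide = (long long)a + (long long)b;  /* cannot overflow for int inputs */
    if (wide < lo || wide > hi)
        return false;          /* representable in the machine, not in the subrange */
    *out = (int)wide;
    return true;
}
```

With the range 0..10, the sums 7 + 4 = 11 and 0 + (-1) = -1 are both computed without trouble and then rejected by the comparison.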

From Niklas Holsti@21:1/5 to All on Sat Sep 21 10:56:24 2024

On 2024-09-21 4:51, MitchAlsup1 wrote:

On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:

On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.

This won’t work. The values outside the range are by definition non-
representable, so comparisons against them are useless.

When a range is 0..10 both -1 and 11 are representable in
the arithmetic of ALL computers, just not in the language
specifying the range.

For "11" I agree, for "-1" I disagree.

If the program was written (in whatever language) with the assumption
that the data type in question is unsigned, then it cannot represent -1
in the program's view of the bits. The bits that represent -1 in a
signed two's complement view represent a large positive value in the
unsigned view that the code uses.

Now if the error condition that was trapped or detected was an attempt
to produce a negative value like -1 for an unsigned data type, that
error condition is of course representable separately; it does not have
to be encoded by an out-of-range value in the data type itself.
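
Both points can be shown in a few lines of C. This is a sketch with invented names (`as_unsigned`, `sub_unsigned_checked`): the first function shows how the value -1 reads in the unsigned view, and the second reports the "result would be negative" condition separately from the data value.

```c
#include <limits.h>
#include <stdbool.h>

/* The value -1 converted to unsigned is the largest unsigned value. */
unsigned int as_unsigned(int v)
{
    return (unsigned int)v;
}

/* Unsigned subtraction with the error condition reported out of band. */
bool sub_unsigned_checked(unsigned int a, unsigned int b, unsigned int *out)
{
    *out = a - b;     /* wraps modulo 2^N; well defined for unsigned types */
    return a >= b;    /* false: the mathematical result would be negative  */
}
```

Here `as_unsigned(-1)` equals `UINT_MAX`, and `sub_unsigned_checked(0, 1, &r)` returns false while leaving the wrapped value in `r`.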

From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 08:17:17 2024

On Sat, 21 Sep 2024 01:52:32 +0000, MitchAlsup1 wrote:

On Sat, 21 Sep 2024 1:09:43 +0000, Lawrence D'Oliveiro wrote:

On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:

But then IEEE 754 exception semantics make even less sense than Linux
signals. ...

Note that what IEEE 754 calls an “exception” is just a bunch of status
bits reporting on the current state of the computation: there is no
implication of some transfer of control elsewhere.

Then how do you implement the alternate exception model ??? which IS
part of 754-2008 and 754-2019

Section 8.3 of the 2008 spec says:

NOTE 2 — Immediate alternate exception handling for an exception
can be implemented by traps or, for exceptions listed in Clause 7
other than underflow, by testing status flags after each operation
or at the end of the associated block. Thus for exceptions listed
in Clause 7 other than underflow, immediate exception handling can
be implemented with the same mechanism as delayed exception
handling, if no better implementation mechanism is available.

So explicit testing of flag bits is permitted. Note that the special case
for underflow mentioned is that the exception signalled is “inexact”, not “underflow”.

From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 08:18:06 2024

On Sat, 21 Sep 2024 01:51:21 +0000, MitchAlsup1 wrote:

On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:

On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:

3) You want to clamp the value to a reasonable range and continue.
The reasonable values need to be looked up somewhere.

This won’t work. The values outside the range are by definition non-
representable, so comparisons against them are useless.

When a range is 0..10 both -1 and 11 are representable in the arithmetic
of ALL computers, just not in the language specifying the range.

That’s an “out of subrange” error, not an “overflow” error.

From EricP@21:1/5 to Kent Dickey on Sat Sep 21 13:05:02 2024

Kent Dickey wrote:

In article <[email protected]>,
Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.

The future of programming languages is type safe with checks, you need to
get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.

[ More details about architectures without trapping overflow instructions ]

Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about performance.

Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.

That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating
the error checks.

Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.

I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.

But they didn't so it isn't.

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

If you want to do almost anything at all other than core dump on
overflow, you need to branch to recovery code. And although it's theoretically possible to recover from the trap, it's worse than any
other approach. So it's added hardware that's HARDER for software to
use. No surprise it's gone away.

The reason it traps is because YOU ASKED IT TO DETECT CERTAIN EVENTS!
An exception is just a method to deliver notification of an event.
What makes such event detections efficient, in code size and performance,
is that they ARE automatic and in the background.

What makes those events errors is that you DIDN'T handle them.
If you did handle them, then they wouldn't be errors,
just automatically detected events.

The reason most code does not have exception handlers, and most exceptions
are fatal, is that such code has no way to recover from fundamental
programming errors that are never supposed to occur.

Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.

The only exception handler I have is a last-chance handler at the top
of the thread stack, which dumps a stack traceback to a log file,
and attempts a clean shutdown.

IA64 went down this road--trapping on speculation failures. It was a
huge disaster--trying to recover through an exception handler mechanism
is slow and painful, for the reasons I'll lay out for overflow
exceptions.

That has nothing to do with the occurrence of errors, software or hardware.
That notification of such events was painful on IA64 was simply its nature.

Let's look at how you might want to handle overflows when they happen:

1) Your language supports seamlessly transitioning to BigInts on
overflow. Then each operation that could overflow needs to call
a special bit of code to change to BigInt and then continue the
calculation. This code must exist, even if a trapping
instruction doesn't need an explicit branch to it. Some
mechanism is needed to call this code.

BigInt is a variable sized signed integer type that, by definition,
does not overflow. A BigInt code library will make its own decisions on
how to efficiently implement that behavior.

I do not want integers that I declared as fixed sized types to be
changed to variable sized BigInts, thank you.

2) You need to call an exception handler, and the routine with the overflow
is ended. We need to know which exception handler to call.

3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.

I do not want integers that I declared as (normal) linear types to be
changed to saturating types, thank you.

4) You just want to crash the program. If a debugger is attached, it can
say where the overflow occurred.

If I _asked_ for signed overflow event detection on some expressions
and one occurs, and the detection and delivery mechanism uses exceptions,
then yes that is what happens.

And yes, I crash by virtue of not having an exception handler for such
an event because it is never supposed to occur and I have no way to
correct the situation. If I could correct it I would and it wouldn't
be an error, just an event. But I didn't so it was.

Trapping on overflow instructions really are only useful for #4. Let's
look at how the other cases could be handled, with a) meaning using
branches, and b) meaning using a trapping instruction.

Yep, pretty much. But that is why I, the programmer, chose fixed size
signed integer using checked arithmetic at that point in my program.
Because I want that behavior.

1a) (BigInt): After doing an operation which could overflow, use a
conditional branch to jump to code to convert to BigInt, which
then jumps back. Overhead is basically the branch-on-overflow
instruction.

That looks like a good way to implement BigInt.

1b) (BigInt with traps). Hardware traps to the OS, which needs to prepare
the required structures describing the exception (all regs and
the address), and then call the signal handler. The signal
handler needs to look up the address of the trap with a table
describing what to do for this particular operation which
overflowed. Each table entry needs to describe, in detail, what
registers are involved (the sources and the dest), and where to
return once the BigInt has been created. This requires massive
changes to the compiler (and possibly linker) to prepare these
tables. The compiler must guarantee that changing the dest
register to a pointer to BigInt works properly (otherwise,
special code needs to be emitted for each potentially trapping
instruction to try to recover).

That looks like an expensive way to implement BigInt.

2a) (Try/Catch): After doing an operation which could overflow, use a
conditional branch to jump to the catch block.

That looks like a more expensive way, in code size and performance,
than automatic by hardware to detect an event that should never occur.

However if this overflow event might regularly occur at that point
in your code, and you do have a way of handling it, then yes by all
means do it programmatically as that is less expensive than a trip
through the OS. But both detection methods get the job done.

2b) (Try/Catch with traps). Repeat all the OS work and call the signal
handler. Now, it just needs a table entry describing where to
jump to to enter the catch block. Almost all the complexity of
1b), but without needing the register details.

That looks like a less expensive way, in code size and performance,
than manual by software, to detect an event that should never occur.

3a) (Clamp): After doing an operation which could overflow, use a
conditional branch to do the MIN/MAX operations to bring it back
within range and then jump back.

I do not want integers that I declared as (normal) linear types to be
changed to saturating types, thank you.

3b) (Clamp with trap): Basically the same as 1b), but there's an alternative
if the clamps are global (MAX_INT/MIN_INT). The exception handler
can read the instruction which trapped, figure out the source and
dest registers, re-do the calculation, and clamp the destination
to MIN or MAX, and return to just after the instruction which
trapped.

I do not want integers that I declared as (normal) linear types to be
changed to saturating types, thank you.

If I declare a saturating type then overflow exceptions are an expensive
way to implement it considering that the C code is:

#include <limits.h>

long LongSatAdd (long left, long right)
{
    long result;
    // Add as unsigned so wraparound is well defined (signed overflow is UB in C)
    result = (long)((unsigned long)left + (unsigned long)right);
    if (((result ^ left) & (result ^ right)) < 0)    // Signed overflow?
        result = (result < 0) ? LONG_MAX : LONG_MIN; // Saturate high or low
    return result;
}

Of course, a hardware instruction that does exactly this is preferred.

4a) (Crash): Every operation that could overflow needs a conditional branch
after it to branch to a crashing instruction (or a branch over
an undefined instruction if there's no overflow).

Good choice for uncorrectable errors, poor choice for handleable events.

4b) (Crash with trap): Use operations which trap on overflow. This takes
no new instructions and costs no performance.

Good choice for uncorrectable errors, poor choice for handleable events.

Basically, all a) cases are:

    op_which_might_overflow();
    if (overflow_happened) {
        handle the overflow
    }

Trapping-on-overflow instructions are clearly useless for languages
which care about overflow.

This conclusion is completely wrong.
Exceptions are an event detection and notification delivery mechanism.
It is very efficient if those events are rarely or never supposed to occur.

It may not be a good tool choice for something that happens frequently,
as might happen when implementing your BigInt library,
as it can have a large overhead. But as the programmer, that choice is your
responsibility to make, and it is why you get the big bucks.

To save one branch instruction, an entry is
needed to describe how to handle the overflow, which is certainly larger
than a branch instruction. And the code to "handle the overflow" is
needed in any case. And this assumes some sort of instant lookup--of the
1000 overflow instructions, we need a hash table to look up the address,
which is more overhead.

Trapping on overflow instructions are useful as a debug aid for
languages which don't care about overflow--but then you're optimizing something nearly useless. It also might be helpful if global clamping
to MIN/MAX was useful (and I don't think it is).

Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding trap-on-overflow instructions is a waste of effort.

No, they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.

"I have one example where overflow exceptions would be a poor implementation
choice" does not imply "therefore no one should have them as an option".

And remember, you ASKED to be informed of overflow when YOU selected
a checked data type. Now if your language or compiler of choice doesn't
allow you to choose checked vs unchecked then your gripe is with the
language or compiler.

It has nothing to do with whether an ISA should include trapping arithmetic
as it is the most efficient way to deliver this functionality to
THOSE WHO ASK FOR IT.

Note that using traps on data access violations which are "fixed" by
signal handlers CAN work out. They are slow, but as long as the
exception handler can fix the access violation and return right to the instruction which failed (without needing to know ANYTHING about that instruction in particular), this can work fine. But integer overflow
doesn't work like that--it's generally not possible to figure out
in the trap handler what to do without more information.

Kent

No, actually signed overflow, like many other exceptions, works like
that because, like access violations, or divide by zero, or array bounds violations, or illegal instructions, or invalid float operands,
they are never supposed to occur. And if they could occur you should
have checked for the potential error first.

From MitchAlsup1@21:1/5 to EricP on Sat Sep 21 20:39:38 2024

On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:

Kent Dickey wrote:

Basically, all a) cases are:

    op_which_might_overflow();
    if (overflow_happened) {
        handle the overflow
    }

Trapping-on-overflow instructions are clearly useless for languages
which care about overflow.

This conclusion is completely wrong.

In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).

Now with good branch prediction, having a branch after each instruction
which could suffer an execution problem is simply a bad way to blow
up the size of the executable.

Exceptions are an event detection and notification delivery mechanism.

Exceptions are a free (easy to predict) branch.

It is very efficient if those events are rarely or never supposed to
occur.

Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.

But the mechanism does not have to be that bad !!

Some of the things that minimize the "badness" of taking an exception::

a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.

I know of only 1 ISA with these properties....
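
To give a feel for items (c), (d) and (e), here is a purely software model in C. It is not the ISA being alluded to and says nothing about how real hardware would deliver the exception; the type `overflow_handler` and the functions `trapping_add` and `saturate_handler` are hypothetical names, and the overflow test relies on the GCC/Clang builtin `__builtin_add_overflow`.

```c
#include <limits.h>

/* Hypothetical handler type: it receives the operation and the operand
   values (item c) and returns the value to place in the destination
   (item d). It never sees the program counter or instruction length. */
typedef long (*overflow_handler)(char op, long a, long b);

/* Software stand-in for an add that "traps" to a user-level handler. */
long trapping_add(long a, long b, overflow_handler on_overflow)
{
    long sum;
    if (__builtin_add_overflow(a, b, &sum))
        return on_overflow('+', a, b);  /* handler's result becomes the result */
    return sum;                          /* normal case: handler never runs     */
}

/* Example handler: fix the problem by saturating. For an overflowing add
   both operands share a sign, so looking at 'a' is enough. */
long saturate_handler(char op, long a, long b)
{
    (void)op;
    (void)b;
    return (a < 0) ? LONG_MIN : LONG_MAX;
}
```

The point of the model is item (e): the handler is written as an ordinary subroutine over values, with no table lookup and no decoding of the faulting instruction.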

From John Levine@21:1/5 to All on Sat Sep 21 22:14:12 2024

According to MitchAlsup1 <[email protected]>:

In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).

Oh, it was worse than that. There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.

Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.

Some of us remember imprecise interrupts and the OS/360 S0C0
completion code.

But you are in general right, it makes more sense to keep the computer
running in the normal case and provide slow ways to recover from
failures and do something else.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Michael S@21:1/5 to John Levine on Sun Sep 22 01:41:47 2024

On Sat, 21 Sep 2024 22:14:12 -0000 (UTC)
John Levine <[email protected]> wrote:

According to MitchAlsup1 <[email protected]>:

In the days before <good> branch prediction having a conditional
branch after each instruction that could have an execution problem
was an extremely poor choice. Thus, exceptions were invented (circa
1958).

Oh, it was worse than that. There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.

Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.

Some of us remember imprecise interrupts and the OS/360 S0C0
completion code.

But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
failures and do something else.

Where is Nick to tell you that any attempt at recovery is a Bad Idea?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Sat Sep 21 23:29:01 2024

On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).

So all that does is push the conditional branch into the microcode. And
make the instruction more complicated. Why should that be faster?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Sat Sep 21 23:29:40 2024

On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
failures and do something else.

Aren’t branches that are not taken supposed to be fast?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun Sep 22 01:24:05 2024

On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:

On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

But you are in general right, it makes more sense to keep the computer
running in the normal case and provide slow ways to recover from
failures and do something else.

Aren’t branches that are not taken supposed to be fast?

Well, they are not taken, so they should be faster... ;^)

It is NOT the speed, it is the code bloat.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 01:23:35 2024

On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:

On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).

So all that does is push the conditional branch into the microcode. And
make the instruction more complicated. Why should that be faster?

It pushes the branch into the mispredict-recovery path and does not
occupy any code space.

There is no microcode outside of Z-system these days.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 22 02:09:29 2024

On Sun, 22 Sep 2024 01:24:05 +0000, MitchAlsup1 wrote:

On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:

On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

But you are in general right, it makes more sense to keep the
computer running in the normal case and provide slow ways to recover
from failures and do something else.

Aren’t branches that are not taken supposed to be fast?

Well, they are not taken, so they should be faster... ;^)

It is NOT the speed, it is the code bloat.

That’s an argument against RISC though, isn’t it?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 22 02:10:40 2024

On Sun, 22 Sep 2024 01:23:35 +0000, MitchAlsup1 wrote:

On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:

On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

In the days before <good> branch prediction having a conditional
branch after each instruction that could have an execution problem was
an extremely poor choice. Thus, exceptions were invented (circa 1958).

So all that does is push the conditional branch into the microcode. And
make the instruction more complicated. Why should that be faster?

It pushes the branch into the mispredict-recovery path and does not
occupy any code space.

There is no microcode outside of Z-system these days.

It occupies some space, either microcode or circuit logic, or both.

And why should that be faster?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 02:26:33 2024

On Sun, 22 Sep 2024 2:10:40 +0000, Lawrence D'Oliveiro wrote:

On Sun, 22 Sep 2024 01:23:35 +0000, MitchAlsup1 wrote:

On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:

On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:

In the days before <good> branch prediction having a conditional
branch after each instruction that could have an execution problem was
an extremely poor choice. Thus, exceptions were invented (circa 1958).

So all that does is push the conditional branch into the microcode. And
make the instruction more complicated. Why should that be faster?

It pushes the branch into the mispredict-recovery path and does not
occupy any code space.

There is no microcode outside of Z-system these days.

It occupies some space, either microcode or circuit logic, or both.

It has sequencers, but none of them are in ROM or PLA form.

And why should that be faster?

It is faster if for no other reason than that it did not fetch the branch
that is always predicted non-taken. ICache and Fetch argument. It is
of lower power because it did not fetch, decode, or execute the branch.

If every calculation instruction had to be followed by a conditional
branch, then the code would be 150% its original size (or worse).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 02:23:12 2024

On Sun, 22 Sep 2024 2:09:29 +0000, Lawrence D'Oliveiro wrote:

On Sun, 22 Sep 2024 01:24:05 +0000, MitchAlsup1 wrote:

On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:

On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:

But you are in general right, it makes more sense to keep the
computer running in the normal case and provide slow ways to recover
from failures and do something else.

Aren’t branches that are not taken supposed to be fast?

Well, they are not taken, so they should be faster... ;^)

It is NOT the speed, it is the code bloat.

That’s an argument against RISC though, isn’t it?

Yes, and that is why my RISC ISA has VAX instruction count
while having the pipelineability of MIPS.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Sun Sep 22 07:11:53 2024

On Sun, 22 Sep 2024 02:26:33 +0000, MitchAlsup1 wrote:

It is faster if for no other reason than that it did not fetch the branch
that is always predicted non-taken.

But architectures like POWER were able to do that sort of thing in zero effective cycles, decades ago.

If every calculation instruction had to be followed by a conditional
branch, then the code would be 150% its original size (or worse).

Not every one. There are ways to do the checks only at crucial points,
after sequences of instructions. This is how IEEE754 “exceptions” work,
for example.
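
A minimal sketch of the "check only at crucial points" idea, in portable C and for integer addition rather than IEEE 754 status flags. The names add_sticky and sum_checked are made up for this illustration; the sticky flag plays the role of a status bit that any operation may set and that is tested once, after the whole sequence.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: add b to a with wrap-around and record in
   *sticky whether the signed addition would have overflowed.  The
   flag is only ever set here, never cleared. */
int add_sticky(int a, int b, bool *sticky)
{
    *sticky |= (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
    /* Wrap in unsigned arithmetic to avoid undefined behaviour; the
       conversion back to int is implementation-defined (it wraps on
       GCC and Clang). */
    return (int)((unsigned)a + (unsigned)b);
}

/* Sum n values.  Overflow is tested once, after the loop.
   Returns true if *out holds a valid sum. */
bool sum_checked(const int *v, size_t n, int *out)
{
    bool sticky = false;
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = add_sticky(acc, v[i], &sticky);
    *out = acc;
    return !sticky;
}
```

The caller pays for one test per sequence instead of one branch per add, at the price of not knowing which add overflowed.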

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to [email protected] on Sun Sep 22 09:25:30 2024

[email protected] (MitchAlsup1) writes:

There is no microcode outside of Z-system these days.

Every AMD64 processor has microcode. E.g., on an Alder Lake system
"perf list" lists the following events that have "microcode" in their description:

machine_clears.fp_assist
[Counts the number of floating point operations retired that required
microcode assist. Unit: cpu_atom]
assists.fp
[Counts all microcode FP assists. Unit: cpu_core]
machine_clears.slow
[Counts the number of machine clears that flush the pipeline and
restart the machine with the use of microcode due to SMC,
MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and
FPC_VIRTUAL_TRAP. Unit: cpu_atom]
topdown_be_bound.serialization
[Counts the number of issue slots every cycle that were not consumed by
the backend due to scoreboards from the instruction queue (IQ), jump
execution unit (JEU), or microcode sequencer (MS). Unit: cpu_atom]
topdown_fe_bound.cisc
[Counts the number of issue slots every cycle that were not delivered
by the frontend due to the microcode sequencer (MS). Unit: cpu_atom]
assists.any
[Number of occurrences where a microcode assist is invoked by hardware.
Unit: cpu_core]
tma_microcode_sequencer
[This metric represents fraction of slots the CPU was retiring uops
fetched by the Microcode Sequencer (MS) unit. Unit: cpu_core]
IpAssist
[Instructions per a microcode Assist invocation. See Assists tree node
for details (lower number means higher occurrence rate). Unit:
cpu_core]
tma_heavy_operations
[This metric represents fraction of slots where the CPU was retiring
heavy-weight operations -- instructions that require two or more uops
or microcoded sequences. Unit: cpu_core]
tma_cisc
[Counts the number of issue slots that were not delivered by the
frontend due to the microcode sequencer (MS). Unit: cpu_atom]
tma_serialization
[Counts the number of issue slots that were not consumed by the backend
due to scoreboards from the instruction queue (IQ), jump execution
unit (JEU), or microcode sequencer (MS). Unit: cpu_atom]
tma_microcode_sequencer_group:
tma_assists
[This metric estimates fraction of slots the CPU retired uops delivered
by the Microcode_Sequencer as a result of Assists. Unit: cpu_core]

tma_microcode_sequencer, IpAssist, tma_heavy_operations, and tma_cisc
occur several times in different sections.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to [email protected] on Sun Sep 22 12:30:00 2024

In article <[email protected]>, [email protected] (MitchAlsup1) wrote:

On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:

Aren't branches that are not taken supposed to be fast?

Well, they are not taken, so they should be faster... ;^)

It is NOT the speed, it is the code bloat.

Yup. Bigger code is always a potential problem, not so much because it
takes up RAM nowadays, but because it takes up memory bandwidth and cache space. Using up cache space is always bad, because bigger caches are
slower, and instructions seem naturally smaller than cache blocks.

Wanting smaller code isn't an argument against RISC, but an argument
against poorly optimised ISA design. Variable-length CISC makes it easier
to get smaller average instruction sizes but has other drawbacks.

For the stuff I work on, ARM64 code is consistently smaller than x86-64, although the factor varies by platform.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to John Levine on Sun Sep 22 13:45:28 2024

John Levine wrote:

According to MitchAlsup1 <[email protected]>:

In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).

Before (good) branch predictors and exceptions, we did have the ability
to fall through any forward branches, and assume backwards branches were
taken, right?

With that approach, compilers were free to place all recovery code after
the function itself, in which case something like

add ax,bx
jo overflow_detected
;; ax is now good

would work quite well, i.e. costing "only" 8 cycles for the load of the
two JO instruction bytes.

Oh, it was worse than that. There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.

Sort of like "Halt_And_Catch_Fire"?

Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.

Some of us remember imprecise interrupts and the OS/360 S0C0
completion code.

But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
failures and do something else.

This was the idea behind the single-byte INTO (Interrupt on Overflow) opcode:

add ax,bx
into

would cost just 4 clock cycles vs the 8 needed for the forward exception handler.

OTOH, you did need to reload the INTO vector (same location as
INT 4) every time you needed a possibly new/different handler.

I never saw any compiler using it and I never used it in my own asm code.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lars Poulsen@21:1/5 to John Levine on Sun Sep 22 09:14:04 2024

On 9/21/2024 3:14 PM, John Levine wrote:

According to MitchAlsup1 <[email protected]>:

In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).

Oh, it was worse than that. There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.

Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.

Some of us remember imprecise interrupts and the OS/360 S0C0
completion code.

But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
failures and do something else.

From a programmer's perspective, VAX exception handling was very nice.
It may have been high overhead, though.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Kent Dickey on Sun Sep 22 16:59:24 2024

Kent Dickey <[email protected]> schrieb:

Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values.

I disagree.

Look at the sanitizer libraries, which insert runtime checks for
integer overflow - having less overhead for these would definitely
be a plus.

See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Dallman on Sun Sep 22 18:40:31 2024

On Sun, 22 Sep 2024 11:30:00 +0000, John Dallman wrote:

In article <[email protected]>, [email protected] (MitchAlsup1) wrote:

On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:

On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:

Aren't branches that are not taken supposed to be fast?

Well, they are not taken, so they should be faster... ;^)

It is NOT the speed, it is the code bloat.

Yup. Bigger code is always a potential problem, not so much because it
takes up RAM nowadays, but because it takes up memory bandwidth and
cache space. Using up cache space is always bad, because bigger caches
are slower, and instructions seem naturally smaller than cache blocks.

Wanting smaller code isn't an argument against RISC, but an argument
against poorly optimised ISA design. Variable-length CISC makes it
easier to get smaller average instruction sizes but has other drawbacks.

Variable length RISC makes it easier, too.

For the stuff I work on, ARM64 code is consistently smaller than x86-64, although the factor varies by platform.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Mon Sep 23 17:51:26 2024

Let's look at how you might want to handle overflows when they happen:

1) Your language supports seamlessly transitioning to BigInts on
overflow. Then each operation that could overflow needs to call
a special bit of code to change to BigInt and then continue the
calculation. This code must exist, even if a trapping
instruction doesn't need an explicit branch to it. Some
mechanism is needed to call this code.

IOW, this is the case where the program thinks it's manipulating some "mathematical number ∈ Z" or something like it.

2) You need to call an exception handler, and the routine with the overflow
is ended. We need to know which exception handler to call.

Not sure when that would be useful, other than for low-level coding.

3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.

Here overflow detection can be useful only to the extent that the
wrap-around may hide the range-error (or confuse its nature between over<->under).

I'm not familiar with code needing this, but I heard such needs are
common in some fields.

4) You just want to crash the program. If a debugger is attached, it can
say where the overflow occurred.

Besides debugging, this corresponds to the case where the code thinks
it's manipulating some "mathematical number ∈ Z" or something like it
yet we have a good reason to think that this number will "never"
overflow, so we never want to switch to bigint because an overflow means
our "good reason" was crap and we're in trouble anyway.

IME, this shows up typically for integers related to the size of some data-structure, where the limited address space "guarantees" that we'll
never go beyond some limit. They're *very* common (think array indices
and things like that) and using overflow-trapping operations for them
would make sense. At the same time, they *should* never overflow, and
if some error somewhere means one of them can overflow (e.g. mistakenly
using (u)int32 for array indices in a 64bit system), there's a good
chance that the error will cause other trouble which may prevent us from
even reaching the overflowing operation.
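
For case 3 above, a minimal sketch of a clamping add in portable C. The function name add_saturating is made up for this illustration, and the "reasonable range" is simply assumed to be INT_MIN..INT_MAX; a real application would substitute its own limits. Because the check happens before the add, the direction of the overflow is known and cannot be confused between over and under.

```c
#include <limits.h>

/* Hypothetical helper: add two ints, clamping to the int range
   instead of wrapping around. */
int add_saturating(int a, int b)
{
    if (b > 0 && a > INT_MAX - b)
        return INT_MAX;   /* would overflow upward */
    if (b < 0 && a < INT_MIN - b)
        return INT_MIN;   /* would overflow downward */
    return a + b;         /* cannot overflow at this point */
}
```

No trap is involved here: the range test is an ordinary compare and branch, which is why trapping hardware does not obviously help this case.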

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kent Dickey@21:1/5 to [email protected] on Mon Sep 23 21:57:08 2024

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <[email protected]>,
Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.

The future of programming languages is type safe with checks, you need to
get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.

[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.

Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.

That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating
the error checks.

Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.

I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.

But they didn't so it isn't.

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.

You then went on to discuss how you want trap-on-overflow instructions
for stuff like C code, so you can detect code bugs and shut down gracefully.

And my response to that is we already know compilers don't use it. x86
has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
    return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and crash).

For x86-64 clang 19.1.0:

add2:
        add     edi, esi
        jo      .LBB0_1
        mov     eax, edi
        ret
.LBB0_1:
        ud1     eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

But x86 has an instruction to trap on overflow directly: INTO. It's one byte. And it doesn't use it.

GCC x86-64 14.2 is even worse:

add2:
        sub     rsp, 8
        call    __addvsi3
        add     rsp, 8
        ret

It calls a routine to do all additions which might overflow, and that
routine calls abort() if an overflow occurs.

The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.

So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.

So why should any hardware include an instruction to trap-on-overflow?

Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing against--
it is just a bad idea for the ISA to have ALU instructions take exceptions.

Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding
trap-on-overflow instructions is a waste of effort.
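
A minimal sketch of explicit branch-on-overflow from C, assuming GCC or Clang. __builtin_add_overflow is a compiler extension, not ISO C; the wrapper names add_checked and add_or_default are made up for this illustration. On x86-64 these compilers typically turn the builtin into an ordinary add followed by a test of the overflow flag, with no trapping instruction involved.

```c
#include <stdbool.h>

/* Returns true if a + b fits in int.  *sum always receives the
   result; on overflow it holds the wrapped value. */
bool add_checked(int a, int b, int *sum)
{
    return !__builtin_add_overflow(a, b, sum);
}

/* The caller decides what overflow means; here it selects a
   fallback value instead of crashing. */
int add_or_default(int a, int b, int fallback)
{
    int sum;
    return add_checked(a, b, &sum) ? sum : fallback;
}
```

The policy (crash, clamp, switch to a bigint) stays in ordinary code at the call site, which is the flexibility a fixed trap vector does not give.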

No, they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.

"I have one example where overflow exceptions would be a poor implementation
choice" does not imply "therefore no one should have them as an option".

Can you share what language, compiler, and hardware you are using which implements overflow checks using a trap-on-overflow instruction?

Kent

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Mon Sep 23 18:00:36 2024

Some of the things that minimize the "badness" of taking an exception::

a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.

Note that in the case where you want the overflow exception to jump to
some alternate code path (a language-level exception handler, or a code
path that continues with a bigint instead of a register-sized integer),
(d) is useless because you don't want to return to the overflowing
instruction (nor to the immediately following instruction). Instead you usually want to lookup a side table indexed with the address of the
overflowing instruction to find the "exception handler" to "return" to.
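
The side-table lookup described above can be sketched as follows. The struct layout and the names trap_entry and find_handler are assumptions made for this illustration, not any particular runtime's format; the table is assumed to be sorted by the address of the instruction that may trap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical side-table entry: one per instruction that may trap. */
struct trap_entry {
    uintptr_t fault_pc;    /* address of the overflowing instruction */
    uintptr_t handler_pc;  /* where to resume instead of returning */
};

/* Binary search over a table sorted by fault_pc.  Returns the
   handler address, or 0 if the faulting address has no entry. */
uintptr_t find_handler(const struct trap_entry *table, size_t n,
                       uintptr_t fault_pc)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].fault_pc == fault_pc)
            return table[mid].handler_pc;
        if (table[mid].fault_pc < fault_pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

The table costs space but sits off the hot path, so the non-overflowing case executes no extra instructions.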

(a) (b) and (c) are still very welcome, of course.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Mon Sep 23 22:23:51 2024

On Sun, 22 Sep 2024 9:25:30 +0000, Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

There is no microcode outside of Z-system these days.

Every AMD64 processor has microcode.

Yes, yes, my brain was not working...........sigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Kent Dickey on Mon Sep 23 22:17:02 2024

On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.

<snip>

So why should any hardware include an instruction to trap-on-overflow?

Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing against--
it is just a bad idea for the ISA to have ALU instructions take
exceptions.

You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions

I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??

Kent

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Stefan Monnier on Mon Sep 23 22:22:17 2024

On Mon, 23 Sep 2024 22:00:36 +0000, Stefan Monnier wrote:

Some of the things that minimize the "badness" of taking an exception::

a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.

Note that in the case where you want the overflow exception to jump to
some alternate code path (a language-level exception handler, or a code
path that continues with a bigint instead of a register-sized integer),
(d) is useless because you don't want to return to the overflowing
instruction (nor to the immediately following instruction). Instead you
usually want to lookup a side table indexed with the address of the
overflowing instruction to find the "exception handler" to "return" to.

longjmp() returns in such a way that integer ADD code path is never
executed again.
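
A minimal sketch of that longjmp() pattern in standard C, with an explicit software check standing in for the hardware trap. The names overflow_env, on_overflow, add_or_jump and sum_or_fallback are made up for this illustration. Leaving a real signal handler this way has extra caveats (POSIX offers sigsetjmp and siglongjmp for that), which this sketch sidesteps by calling the "handler" as an ordinary function.

```c
#include <limits.h>
#include <setjmp.h>

static jmp_buf overflow_env;  /* hypothetical recovery point */

/* Stand-in for a trap handler: abandons the computation. */
static void on_overflow(void)
{
    longjmp(overflow_env, 1);
}

/* Adds with an explicit check; on overflow, control leaves through
   on_overflow() and never comes back to the add. */
static int add_or_jump(int a, int b)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        on_overflow();
    return a + b;
}

/* Returns the sum of v[0..n-1], or fallback if any add overflowed. */
int sum_or_fallback(const int *v, int n, int fallback)
{
    if (setjmp(overflow_env) != 0)
        return fallback;      /* reached only via longjmp */
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc = add_or_jump(acc, v[i]);
    return acc;
}
```

Nothing is delivered back to a destination register here; the whole computation is abandoned and resumed at the setjmp point.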

(a) (b) and (c) are still very welcome, of course.

(d) is for the case where the exception handler fixes the problem
and calculates the desired result and skips the instruction on
return (completion) whereas your typical page fault is retried
upon return.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Lars Poulsen on Tue Sep 24 00:41:39 2024

On Sun, 22 Sep 2024 09:14:04 -0700, Lars Poulsen wrote:

From a programmer's perspective, VAX exception handling was very nice.
It may have been high overhead, though.

Very high overhead. But it was also language-independent, and integrated
into the procedure-calling convention, which also managed to be language-independent.

There is an internal memo on Bitsavers somewhere, critiquing a proposal to adopt the MIPS architecture (which DEC did, for just one machine, the DECstation 3000 if I recall rightly), and one of the points against MIPS
was that it didn’t have language-independent exception handling. But then
no other architecture, before the VAX or since, has been able to do that.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Tue Sep 24 00:37:25 2024

On Mon, 23 Sep 2024 22:17:02 +0000, MitchAlsup1 wrote:

You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions

I am at a loss for how to take all 3 arguments together at the same time
!?! Can you explain ??

The answer is pretty obvious: explicit instruction to branch on overflow detection.

From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue Sep 24 00:43:40 2024

On Sun, 22 Sep 2024 13:45:28 +0200, Terje Mathisen wrote:

There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.

Sort of like "Halt_And_Catch_Fire"?

Imagine if it was a design feature that, if the error light came on too
much, it would overheat and set the machine on fire? ;)

Would that encourage programmers to have fewer bugs in their
programs ... ?

“The explosions will continue until morale improves.”

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Sep 24 03:02:20 2024

On Tue, 24 Sep 2024 2:52:09 +0000, Chris M. Thomasson wrote:

On 9/23/2024 5:43 PM, Lawrence D'Oliveiro wrote:

Imagine if it was a design feature that, if the error light came on too
much, it would overheat and set the machine on fire? ;)

Would that encourage programmers to have fewer bugs in their
programs ... ?

In the first computer controlled aircraft, the programmers were
told that 10% of them would be on the first flight.

Engine start, takeoff, and flying went perfectly, then as the plane
neared the ground it rolled 90° just 100 feet from the ground. The
pilot flicked off the computer, saved the plane, and landed.

Later investigation showed that the ailerons had their positions
initialized while the pilot was doing his flight control maneuvers.

Many of the programmers came out of the plane sickened to their
stomachs.

From Terje Mathisen@21:1/5 to Kent Dickey on Tue Sep 24 07:58:24 2024

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <[email protected]>,
Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it checks all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.

The future of programming languages is type safe with checks, you need to get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around the year 2000.

[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.

Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.

That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating the error checks.

Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.

I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.

But they didn't so it isn't.

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the problem on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.

You then went on to discuss how you want trap-on-overflow instructions
for stuff like C code, so you can detect code bugs and shut down gracefully.

And my response to that is we already know compilers don't use it. x86
has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
    return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC option that detects signed overflows and crashes).

For x86-64 clang 19.1.0:

add2:
        add     edi, esi
        jo      .LBB0_1
        mov     eax, edi
        ret
.LBB0_1:
        ud1     eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

But x86 has an instruction to trap on overflow directly: INTO. It's one byte.
And it doesn't use it.

GCC x86-64 14.2 is even worse:

add2:
        sub     rsp, 8
        call    __addvsi3
        add     rsp, 8
        ret

It calls a routine to do all additions which might overflow, and that
routine calls abort() if an overflow occurs.
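
For readers who want to see the shape of such an out-of-line helper, here is a rough sketch in portable C. It is a hypothetical stand-in, not libgcc's __addvsi3 source; the names add_would_overflow and add_or_abort are invented for illustration. It checks before adding, so it never relies on signed wraparound, and terminates via abort() on overflow.

```c
#include <limits.h>
#include <stdlib.h>

/* Returns 1 if a + b would overflow int, without performing the add. */
static int add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* INT_MAX - b cannot overflow for b > 0 */
    if (b < 0)
        return a < INT_MIN - b;   /* INT_MIN - b cannot overflow for b < 0 */
    return 0;
}

/* Hypothetical checked add: terminates the program on overflow. */
static int add_or_abort(int a, int b)
{
    if (add_would_overflow(a, b))
        abort();
    return a + b;
}
```

Compared with the inline add/jo sequence, every checked addition here costs a call and a return, which is the overhead being complained about.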

The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.

So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.

You can only compile in INTO opcodes if you can guarantee that the INT 4
(INTO) trap vector will always be set to a proper handler, and since
that isn't part of the ABI, compilers can't depend on it?

I do agree that it would be nice if it did work; barring that, clang is
doing the best possible alternative, at close to zero cost except for
the wasted branch predictor table entry.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to All on Tue Sep 24 08:02:23 2024

MitchAlsup1 wrote:

On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the problem on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.

<snip>

So why should any hardware include an instruction to trap-on-overflow?

Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against--it is just a bad idea for the ISA to have ALU instructions
take exceptions.

You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions

I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??

Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra
cycles as long as the trap part isn't hit?

Personally I'm happy with the clang approach.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Michael S@21:1/5 to Lawrence D'Oliveiro on Tue Sep 24 11:00:20 2024

On Tue, 24 Sep 2024 00:41:39 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sun, 22 Sep 2024 09:14:04 -0700, Lars Poulsen wrote:

From a programmer's perspective, VAX exception handling was very
nice. It may have been high overhead, though.

Very high overhead. But it was also language-independent, and
integrated into the procedure-calling convention, which also managed
to be language-independent.

There is an internal memo on Bitsavers somewhere, critiquing a
proposal to adopt the MIPS architecture (which DEC did, for just one
machine, the DECstation 3000 if I recall rightly),

Much more than one machine.
https://en.wikipedia.org/wiki/DECstation
4 ranges, 13 models.

DEC was quite successful both with MIPS and with x86.
I'd guess, their CPU designers didn't like it.

and one of the
points against MIPS was that it didn’t have language-independent
exception handling. But then no other architecture, before the VAX or
since, has been able to do that.

From Michael S@21:1/5 to Terje Mathisen on Tue Sep 24 11:37:03 2024

On Tue, 24 Sep 2024 08:02:23 +0200
Terje Mathisen <[email protected]> wrote:

MitchAlsup1 wrote:

On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow
instruction (or a mode for existing ALU instructions) is useless
for anything OTHER than as a debug aid where you crash the problem
on overflow (you can have a general exception handler to shut down
gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and show
how the other cases don't make sense.

In no way was I ever arguing that checking for overflow was a bad
idea, or a language issue, or anything else. Just that CPUs
should not bother having trap-on-overflow instructions.

<snip>

So why should any hardware include an instruction to
trap-on-overflow?

Trap-on-overflow instructions have a hardware cost, of varying
severity. If the ISA isn't already trapping on ALU instructions
(such as divide-by-0), it adds a new class of operations which can
take exceptions. An ALU functional unit that cannot take
exceptions doesn't have to save "unwinding" info (at minimum, info
to recover the PC, and possibly rollback state), and not needing
this can be a nice simplification. Branches and LD/ST always
need this info, but not needing it on ALU ops can be a nice
simplification of logic, and makes it easier to have multiple ALU
functional units. Note that x86 INTO can be treated as a branch,
so it doesn't have the cost of an instruction like "ADDTO r1,r2,r3"
which is a normal ADD but where the ADD itself traps if it
overflows. ADDTO is particularly what I am arguing against--it is
just a bad idea for the ISA to have ALU instructions take
exceptions.

You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions

I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??

Maybe all add/sub/etc opcodes that are immediately followed by an
INTO could be fused into a single ADDO/SUBO/etc version that takes
zero extra cycles as long as the trap part isn't hit?

Personally I'm happy with the clang approach.

Couple of questions:
1. Which code would you put at the destination of the jo branch?
2. In your code generator, would every jo in the code (or in the module,
or in the function) jump to the same destination, or would each have a
destination of its own?

It would be interesting if you answer before looking at what clang does,
then take a look and comment again.

Terje

From Terje Mathisen@21:1/5 to Michael S on Tue Sep 24 10:44:22 2024

Michael S wrote:

On Tue, 24 Sep 2024 08:02:23 +0200
Terje Mathisen <[email protected]> wrote:

MitchAlsup1 wrote:

On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow
instruction (or a mode for existing ALU instructions) is useless
for anything OTHER than as a debug aid where you crash the problem
on overflow (you can have a general exception handler to shut down
gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and show
how the other cases don't make sense.

In no way was I ever arguing that checking for overflow was a bad
idea, or a language issue, or anything else. Just that CPUs
should not bother having trap-on-overflow instructions.

<snip>

So why should any hardware include an instruction to
trap-on-overflow?

Trap-on-overflow instructions have a hardware cost, of varying
severity. If the ISA isn't already trapping on ALU instructions
(such as divide-by-0), it adds a new class of operations which can
take exceptions. An ALU functional unit that cannot take
exceptions doesn't have to save "unwinding" info (at minimum, info
to recover the PC, and possibly rollback state), and not needing
this can be a nice simplification. Branches and LD/ST always
need this info, but not needing it on ALU ops can be a nice
simplification of logic, and makes it easier to have multiple ALU
functional units. Note that x86 INTO can be treated as a branch,
so it doesn't have the cost of an instruction like "ADDTO r1,r2,r3"
which is a normal ADD but where the ADD itself traps if it
overflows. ADDTO is particularly what I am arguing against--it is
just a bad idea for the ISA to have ALU instructions take
exceptions.

You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions

I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??

Maybe all add/sub/etc opcodes that are immediately followed by an
INTO could be fused into a single ADDO/SUBO/etc version that takes
zero extra cycles as long as the trap part isn't hit?

Personally I'm happy with the clang approach.

Couple of questions:
1. Which code would you put at the destination of the jo branch?
2. In your code generator, would every jo in the code (or in the module,
or in the function) jump to the same destination, or would each have a
destination of its own?

It would be interesting if you answer before looking at what clang does,
then take a look and comment again.

If the handler consists of terminating the program, then every function,
or small group of functions depending upon total code size, can have a
common target, just so that all the JO opcodes can use the short-form
two-byte encoding of a forward branch. I.e. leaving just 127 bytes
available for mainline code.

If you want separate handling for each overflow, i.e. switch to bigint
and resume, then you do need one target per JO, in order to pick up the originating instruction address (and place it on the stack for a
subsequent RET?) before jumping to a common handler.
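
As a rough illustration of the "switch to a wider type and resume" case, here is a sketch in C. It makes several assumptions: int64_t stands in for a real bigint, the type and function names (struct number, add_promote) are invented, and __builtin_add_overflow is a GCC/Clang extension. The if-body plays the role of the per-site JO target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tagged value: either a 32-bit result or a widened one.
 * int64_t is only a placeholder for a real bigint type. */
struct number {
    bool    is_wide;
    int32_t narrow;   /* meaningful when !is_wide */
    int64_t wide;     /* meaningful when is_wide  */
};

static struct number add_promote(int32_t a, int32_t b)
{
    struct number r = { false, 0, 0 };
    if (__builtin_add_overflow(a, b, &r.narrow)) {
        /* Slow path, reached only on overflow: redo the add in the
         * wider type and continue instead of terminating. */
        r.is_wide = true;
        r.wide = (int64_t)a + (int64_t)b;
    }
    return r;
}
```

Because the slow path needs the original operands, it cannot be shared between call sites the way a plain "terminate" target can.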

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Michael S@21:1/5 to Terje Mathisen on Tue Sep 24 13:18:55 2024

On Tue, 24 Sep 2024 10:44:22 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 24 Sep 2024 08:02:23 +0200
Terje Mathisen <[email protected]> wrote:

MitchAlsup1 wrote:

On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

The x86 designers might then have had an incentive to make all
the checks as efficient as possible, and rather than eliminate
them, they might have enhanced and more tightly integrated
them.

OK, my post was about how having a hardware trap-on-overflow
instruction (or a mode for existing ALU instructions) is useless
for anything OTHER than as a debug aid where you crash the
problem on overflow (you can have a general exception handler to
shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons
folks might want to try to use trap-on-overflow instructions,
and show how the other cases don't make sense.

In no way was I ever arguing that checking for overflow was a bad
idea, or a language issue, or anything else. Just that CPUs
should not bother having trap-on-overflow instructions.

<snip>

So why should any hardware include an instruction to
trap-on-overflow?

Trap-on-overflow instructions have a hardware cost, of varying
severity. If the ISA isn't already trapping on ALU instructions
(such as divide-by-0), it adds a new class of operations which
can take exceptions. An ALU functional unit that cannot take
exceptions doesn't have to save "unwinding" info (at minimum,
info to recover the PC, and possibly rollback state), and not
needing this can be a nice simplification. Branches and LD/ST
always need this info, but not needing it on ALU ops can be a
nice simplification of logic, and makes it easier to have
multiple ALU functional units. Note that x86 INTO can be treated
as a branch, so it doesn't have the cost of an instruction like
"ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against--it is just a bad idea for the ISA to have ALU
instructions take exceptions.

You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions

I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??

Maybe all add/sub/etc opcodes that are immediately followed by an
INTO could be fused into a single ADDO/SUBO/etc version that takes
zero extra cycles as long as the trap part isn't hit?

Personally I'm happy with the clang approach.

Couple of questions:
1. Which code would you put at the destination of the jo branch?
2. In your code generator, would every jo in the code (or in the
module, or in the function) jump to the same destination, or would
each have a destination of its own?

It would be interesting if you answer before looking at what clang
does, then take a look and comment again.

If the handler consists of terminating the program, then every
function, or small group of functions depending upon total code size,
can have a common target, just so that all the JO opcodes can use the short-form two-byte encoding of a forward branch. I.e. leaving just
127 bytes available for mainline code.

In the case of clang (on Win7+msys2, I didn't test on other targets) the
handler terminates the program with no useful info printed.
Still, clang has a different target for each jo. At the target location it
places an invalid instruction.
So, it lays the infrastructure for a better handler, pays the price in
terms of code size, but does not take advantage of it.

If you want separate handling for each overflow, i.e. switch to
bigint and resume, then you do need one target per JO, in order to
pick up the originating instruction address (and place it on the
stack for a subsequent RET?) before jumping to a common handler.

Terje

I want the address of the originating instruction in the handler.
I want it not for a switch to bigint, which would not be in the spirit of
non-dynamic compiled languages, but in order to get a useful termination
printout.
With JO, in order to get what I want I'd have to pay with a significant
increase in code size.
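
At the source level, the same trade-off can be sketched in C: a common handler receives per-site information, and each call site pays for passing it. This is an illustrative sketch only. The names (CHECKED_ADD, add_at, report_overflow, struct overflow_report) are invented, file and line stand in for the instruction address, and __builtin_add_overflow is a GCC/Clang extension.

```c
#include <stdio.h>

/* Hypothetical record of the most recent overflow site. */
struct overflow_report {
    const char *file;
    int line;
    int lhs, rhs;
    int count;
};
static struct overflow_report last_report;

/* Common handler. It only prints and records so the sketch can be
 * exercised; a real one would probably call abort() after printing. */
static void report_overflow(const char *file, int line, int lhs, int rhs)
{
    last_report.file = file;
    last_report.line = line;
    last_report.lhs = lhs;
    last_report.rhs = rhs;
    last_report.count++;
    fprintf(stderr, "%s:%d: signed overflow in %d + %d\n",
            file, line, lhs, rhs);
}

static int add_at(int a, int b, const char *file, int line)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        report_overflow(file, line, a, b);   /* cold path */
    return sum;                              /* wrapped value on overflow */
}

/* Each use site supplies its own location, which is where the extra
 * code size comes from. */
#define CHECKED_ADD(a, b) add_at((a), (b), __FILE__, __LINE__)
```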

From Kent Dickey@21:1/5 to [email protected] on Tue Sep 24 15:47:21 2024

In article <vcpidc$29e51$[email protected]>,
Thomas Koenig <[email protected]> wrote:

Kent Dickey <[email protected]> schrieb:

Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values.

I disagree.

Look at the sanitizer libraries, which insert runtime checks for
integer overflow - having less overhead for these would definitely
be a plus.

See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

Not valuing something just means no one is spending a lot of time/effort
on it. Decimal math is not valued--but you can still do it, it just
has no special instructions on most architectures to make it fast/easy.
And as I've pointed out, trapping on integer overflow is clearly not
valued--on x86, where INTO exists, GCC and Clang do not use it.

Kent

From Stephen Fuld@21:1/5 to Terje Mathisen on Tue Sep 24 10:06:27 2024

On 9/23/2024 11:02 PM, Terje Mathisen wrote:

snip

Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra cycles as long as the trap part isn't hit?

If you are going to do that, why not make it an optional prefix byte?
That way, no fusion needed, no extra cycles, yet the same amount of code
space.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From MitchAlsup1@21:1/5 to Michael S on Tue Sep 24 17:05:28 2024

On Tue, 24 Sep 2024 10:18:55 +0000, Michael S wrote:

I want the address of the originating instruction in the handler.
I want it not for a switch to bigint, which would not be in the spirit of
non-dynamic compiled languages, but in order to get a useful termination
printout.
With JO, in order to get what I want I'd have to pay with a significant
increase in code size.

0xADDRESS OPCODE Rd,Rs1,Rs2 // Had OVERFLOW using Rs1=0x12345678 and Rs2=0xFEDCBA09

From Bill Findlay@21:1/5 to Kent Dickey on Tue Sep 24 18:38:49 2024

On 24 Sep 2024, Kent Dickey wrote
(in article <vcumu9$38iv2$[email protected]>):

In article <vcpidc$29e51$[email protected]>,
Thomas Koenig <[email protected]> wrote:

Kent Dickey <[email protected]> schrieb:

Trapping on overflow is basically useless other than as a debug aid, which clearly nobody values.

I disagree.

Look at the sanitizer libraries, which insert runtime checks for
integer overflow - having less overhead for these would definitely
be a plus.

See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

Not valuing something just means no one is spending a lot of time/effort
on it. Decimal math is not valued--but you can still do it, it just
has no special instructions on most architectures to make it fast/easy.
And as I've pointed out, trapping on integer overflow is clearly not valued--on x86, where INTO exists, GCC and Clang do not use it.

To quote Nick: sigh.
--
Bill Findlay

From Thomas Koenig@21:1/5 to Kent Dickey on Tue Sep 24 17:40:48 2024

Kent Dickey <[email protected]> schrieb:

In article <vcpidc$29e51$[email protected]>,
Thomas Koenig <[email protected]> wrote:

Kent Dickey <[email protected]> schrieb:

Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values.

I disagree.

Look at the sanitizer libraries, which insert runtime checks for
integer overflow - having less overhead for these would definitely
be a plus.

See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .

Not valuing something just means no one is spending a lot of time/effort
on it.

So writing and maintaining these libraries is not a lot of effort?

From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Sep 24 17:57:51 2024

On Tue, 24 Sep 2024 17:06:27 +0000, Stephen Fuld wrote:

On 9/23/2024 11:02 PM, Terje Mathisen wrote:

snip

Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra
cycles as long as the trap part isn't hit?

If you are going to do that, why not make it an optional prefix byte?
That way, no fusion needed, no extra cycles, yet the same amount of code space.

Realistically, what is the difference if INTO is a prefix
byte or a postfix byte ?

From Stephen Fuld@21:1/5 to Niklas Holsti on Wed Sep 25 09:54:17 2024

On 9/10/2024 1:13 AM, Niklas Holsti wrote:

In the Ada case, the ability to declare array types with
programmer-chosen index types with bounded range, such as range-bounded
integers or enumerations, means that the compiler can avoid indexing
checks when the (sub)type of the index is known at compile time to fit
within the index range of the array.

I have always liked the idea of variable ranges able to be specified in
the language. Besides the advantages you mentioned, it provides more
human "comprehensibility" (if the ranges are reasonably named) i.e.
better internal documentation, and it makes responding to specification
changes required later in the program life cycle easier and less error
prone, i.e. if the range has to change, you change it in one place and
don't risk missing making the change in some obscure part of the program
you forgot about.
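
C has no range types, but the "change it in one place" benefit can be roughly approximated by deriving both the array size and the bounds check from a single declaration. The sketch below is only an approximation of the Ada feature, with invented names (SENSOR_FIRST, SENSOR_LAST, set_reading); unlike Ada, the compiler does not prove anything here, and the check runs at run time.

```c
#include <stdbool.h>

/* Hypothetical index range, declared once. Changing SENSOR_LAST resizes
 * the array and adjusts the check below together. */
enum {
    SENSOR_FIRST = 1,
    SENSOR_LAST  = 8,
    SENSOR_COUNT = SENSOR_LAST - SENSOR_FIRST + 1
};

static int readings[SENSOR_COUNT];

static bool sensor_in_range(int id)
{
    return id >= SENSOR_FIRST && id <= SENSOR_LAST;
}

/* Stores value and returns true, or returns false for an out-of-range id. */
static bool set_reading(int id, int value)
{
    if (!sensor_in_range(id))
        return false;
    readings[id - SENSOR_FIRST] = value;
    return true;
}
```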

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Michael S@21:1/5 to Stephen Fuld on Wed Sep 25 20:07:45 2024

On Wed, 25 Sep 2024 09:54:17 -0700
Stephen Fuld <[email protected]d> wrote:

On 9/10/2024 1:13 AM, Niklas Holsti wrote:

In the Ada case, the ability to declare array types with
programmer-chosen index types with bounded range, such as
range-bounded integers or enumerations, means that the compiler can
avoid indexing checks when the (sub)type of the index is known at
compile time to fit within the index range of the array.

I have always liked the idea of variable ranges able to be specified
in the language. Besides the advantages you mentioned, it provides
more human "comprehensibility" (if the ranges are reasonably named)
i.e. better internal documentation, and it makes responding to
specification changes required later in the program life cycle easier
and less error prone, i.e. if the range has to change, you change it
in one place and don't risk missing making the change in some obscure
part of the program you forgot about.

The problem here is that arrays with fixed bounds were common when
Ada was conceived back in the mid 1970s. On general-purpose (as opposed
to embedded) computers they were already much rarer when Ada was shipped
in 1983. By the late 1990s arrays with fixed bounds were the rare
exception rather than the rule.
Except, of course, for many types of embedded computers. But even that
is gradually changing. Very gradually.

From EricP@21:1/5 to All on Wed Sep 25 12:16:59 2024

MitchAlsup1 wrote:

On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:

It is very efficient if those events are rarely or never supposed to
occur.

Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.

But it is not necessary for the mechanism to be that bad !!

Some of the things that minimize the "badness" of taking an exception::

a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.
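
Points (c) through (f) describe the handler as an ordinary subroutine. Purely as an illustration, and not the interface of any real ISA or OS, that shape might look like this in C. Every name here (struct fault_info, struct fault_result, enum disposition, saturate_on_add_overflow) is invented, and the choice to saturate is just one example of "fixing the problem".

```c
#include <stdint.h>

/* What the handler asks the (hypothetical) delivery mechanism to do. */
enum disposition { RESUME_SKIP, RESUME_RETRY, ABORT_THREAD };

/* (c): instruction bits and operand values arrive as arguments. */
struct fault_info {
    uint32_t instruction_bits;
    int64_t  operand1;
    int64_t  operand2;
};

/* (d) and (f): a replacement result plus a skip/retry decision go back. */
struct fault_result {
    enum disposition action;
    int64_t          value;
};

/* Example handler for a signed-add overflow: deliver a saturated result
 * and skip the faulting instruction. Add overflow only happens when both
 * operands have the same sign, so one operand's sign picks the limit. */
static struct fault_result saturate_on_add_overflow(const struct fault_info *f)
{
    struct fault_result r;
    r.action = RESUME_SKIP;
    r.value  = (f->operand2 > 0) ? INT64_MAX : INT64_MIN;
    return r;
}
```

Note that the handler never sees the excepting IP or the instruction length; it only names the action it wants, in line with (f).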

I know of only 1 ISA with these properties....

It all depends on the frequency that exceptions occur.
It used to be that Page Fault was the only one that occurred with any frequency, and the code path for the page fault handler was long enough
that any HW overhead was lost in the noise. In all other cases they
indicated a fatal error so the HW cost was the least of your problems.

But then processors, mostly RISC ones, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned
memory access, MIPS and Alpha for software TLB-miss handling.
And suddenly the exceptional becomes the normal.

The solution for Alpha was to add back the byte and word instructions,
and add misaligned access support to all memory ops.
SPARC stuck with traps for register windows.
No one else used software managed TLBs.

Then virtual machines came along using exceptions to trigger
trap-and-emulate code, and now the normal becomes frequent.
Not 1 or 10 exceptions per second, but 100,000 or 200,000.

The solution for VMs is to add the ISA features necessary so that
most exceptions are rare, and when they do happen they are cheap.
Worst case it should cost the same as a branch mispredict pipeline drain.

From EricP@21:1/5 to Kent Dickey on Wed Sep 25 12:54:18 2024

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <[email protected]>,
Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it checks all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.

The future of programming languages is type safe with checks, you need to get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.

Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around the year 2000.

[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.

Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.

That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating the error checks.

Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.

I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.

But they didn't so it isn't.

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.

For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit, which is not what programmers need,
as it triggers many false positives, so people turn it off.

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.

I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.

You then went on to discuss how you want trap-on-overflow instructions
for stuff like C code, so you can detect code bugs and shut down gracefully.

And my response to that is we already know compilers don't use it. x86
has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.

Well, there is a bunch of things to unpack here.

First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
that opcode to be for other instructions. On x64 the JO (jump overflow) instruction does overflow detection.

The reason AMD could reassign INTO was because it wasn't being used by C/C++. But this is a side effect of C's widespread use, not the cause.
Programmers write in C because it is widely used and supported,
and as a consequence of that choice they get unchecked arithmetic.
But they are not choosing C to get unchecked arithmetic.
Had these same usage tests been done on other languages the results
would likely be quite different.

Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes respectively. In JO case it has to branch to a ThrowOverflow () call
so that's 5 more bytes per ADD or SUB if you want error traceability.
With overflow trapping instructions there is NO runtime or code size cost.

Third, on many risc ISA like RISC-V there are no flags so no JO instruction even possible. Either they must use the branchless overflow idiom or the branching version, adding more to the cost of error detection.

*OR*its has an Add Fault Overflow instruction which has NO RUNTIME COST
ADDFO rd = rs1 + rs2
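
For reference, the two software idioms named in the third point could look roughly like this in C. This is only a sketch: it assumes 32-bit two's-complement integers, and the function names are made up for illustration rather than taken from any compiler runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Branching test: compare against the limits before adding, so the
   signed add itself never overflows (signed overflow is undefined in C). */
int add_overflows_branching(int32_t a, int32_t b)
{
    if (b >= 0)
        return a > INT32_MAX - b;   /* would exceed the maximum */
    return a < INT32_MIN - b;       /* would fall below the minimum */
}

/* Branchless idiom: add as unsigned (wraps, well defined). Overflow
   happened iff the sum's sign differs from the sign of BOTH operands. */
int add_overflows_branchless(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;
    return (int)(((ua ^ sum) & (ub ^ sum)) >> 31);
}

/* Small usage check: both idioms must agree. */
void overflow_idiom_demo(void)
{
    assert(add_overflows_branching(INT32_MAX, 1) == 1);
    assert(add_overflows_branchless(INT32_MAX, 1) == 1);
    assert(add_overflows_branching(-5, 7) == 0);
    assert(add_overflows_branchless(-5, 7) == 0);
}
```

Either way the check costs several extra instructions per add on a flagless ISA, which is the overhead being argued about here.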

Fourth, it sounds like what you want is a risc (no flags) ADD instruction
that returns both a result and an overflow flag so you can do the
equivalent of the x64 JO branch test.
ADDO (ro,rd) = rs1 + rs2
where rd is dest and ro is a register to receive a 0/1 overflow flag.
Once one allows multiple dest registers ADDO is trivial to support.
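
A C model of what such a two-result add would hand back to software might look like the sketch below. It assumes GCC or Clang, since it leans on their `__builtin_add_overflow` builtin; the `addo` function and `addo_result` struct are hypothetical names for this illustration, not part of any ISA or library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "ADDO (ro,rd) = rs1 + rs2". */
typedef struct {
    int64_t rd;   /* wrapped two's-complement sum */
    int     ro;   /* 1 if the signed add overflowed, else 0 */
} addo_result;

addo_result addo(int64_t rs1, int64_t rs2)
{
    addo_result r;
    /* GCC/Clang builtin: stores the wrapped sum through the pointer and
       returns true when the mathematical result did not fit. */
    r.ro = __builtin_add_overflow(rs1, rs2, &r.rd) ? 1 : 0;
    return r;
}

/* Usage: the caller branches on ro, much like JO after an x64 ADD. */
void addo_demo(void)
{
    addo_result r = addo(40, 2);
    assert(r.ro == 0 && r.rd == 42);
}
```

On a flags machine a compiler would typically lower the builtin to ADD plus a flag test, so this models the software-visible behaviour only, not the proposed encoding.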

But that does not invalidate the usefulness of ADDFO.
I would also have ADDFC Add Fault Carry for unsigned overflow,
plus other instructions for checking signed overflow and unsigned carry.

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument for detect signed overflows and crash).

For x86-64 clang 19.1.0:

add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

Yes, this is all for x64 which has no INTO instruction.
GCC's -ftrapv redirects signed arithmetic to the overflow trapping library
which used to (a) have a reputation for bugs and (b) cause a 50% slow down.
I believe they fixed the bugs eventually but the performance hit remains.

But x86 has an instruction to trap on overflow directly: INTO. It's one byte.
And it doesn't use it.

GCC x86-64 14.2 is even worse:

add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret

It calls a routine to do all additions which might overflow, and that
routine calls assert() if an overflow occurs.

The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.

So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.

This is for x64 which has no INTO instruction.
And __addvsi3 didn't use JO either. It uses the branching test,
not the branchless idiom.

https://blog.weghos.com/llvm/llvm/compiler-rt/lib/builtins/addvsi3.c.html

So why should any hardware include an instruction to trap-on-overflow?

Because ALL the negative speed and code size consequences do not occur.

Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always needs this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing against--
it is just a bad idea for the ISA to have ALU instructions take exceptions.

Not really. It's a flag in the uOp indicating HasException and a union
of fields to hold exception status and RIP, all of which needs to be
there for other instructions like load/store.

Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding trap-on-overflow instructions is a waste of effort.

No they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.

"I have one example where overflow exceptions would be a poor implementation choice" does not imply "therefore no one should have them as an option".

Can you share what language, compiler, and hardware you are using which implements overflow checks using a trap-on-overflow instruction?

Kent

On DEC VAX the Overflow Enable flag was in the Program Status Word.
IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
Ada, Cobol, and disabled by default for C. But it could be toggled with
a runtime library call.

For a variety of reasons having Overflow Enable in the status register is
A Bad Idea. On Alpha it was a compile switch which selects different instructions ADD vs ADDV, and also controlled by pragmas.
If you wanted to manually test for overflow then you used
one of the idioms, whatever language you worked in.


From MitchAlsup1@21:1/5 to Michael S on Wed Sep 25 17:30:46 2024

On Wed, 25 Sep 2024 17:07:45 +0000, Michael S wrote:

On Wed, 25 Sep 2024 09:54:17 -0700
Stephen Fuld <[email protected]d> wrote:

On 9/10/2024 1:13 AM, Niklas Holsti wrote:

In the Ada case, the ability to declare array types with
programmer- chosen index types with bounded range, such as
range-bounded integers or enumerations, means that the compiler can
avoid indexing checks when the (sub)type of the index is known at
compile time to fit within the index range of the array.

I have always liked the idea of variable ranges able to be specified
in the language. Besides the advantages you mentioned, it provides
more human "comprehensibility" (if the ranges are reasonably named)
i.e. better internal documentation, and it makes responding to
specification changes required later in the program life cycle easier
and less error prone, i.e. if the range has to change, you change it
in one place and don't risk missing making the change in some obscure
part of the program you forgot about.

The problem here is that arrays with fixed bounds were common when
Ada was conceived back in the mid 1970s. On general-purpose (as opposed
to embedded) computers they were already much rarer when Ada was shipped
in 1983. By late 1990s arrays with fixed bounds were rare exception
rather than rule.

It sounds like variable ranges (array indexes) would be becoming more
common, also.

Where "variable range" is a variable that is defined to have a
specified range, but from run to run the upper and lower bounds
can be modified without re-compilation.

Except, of course, for many types of embedded computers. But even that
is gradually changing. Very gradually.


From Scott Lurndal@21:1/5 to EricP on Wed Sep 25 17:46:46 2024

EricP <[email protected]> writes:

MitchAlsup1 wrote:

But then, risc processors mostly, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned memory access, MIPS and Alpha for software TLB-miss handling.
And suddenly the exceptional becomes the normal.

Yet all four of those have been relegated to the mists of history,
for good reason.

Then virtual machines come along using exceptions to trigger
trap-and-emulate code, and now the normal becomes frequent.

And then SVM (AMD) and VT-X (Intel) subsequently added
hardware support that significantly reduced the need for
trap-and-emulate.


From EricP@21:1/5 to Terje Mathisen on Wed Sep 25 13:56:40 2024

Terje Mathisen wrote:

Kent Dickey wrote:

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument for detect signed overflows and crash).

For x86-64 clang 19.1.0:

add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

But x86 has an instruction to trap on overflow directly: INTO. It's
one byte.
And it doesn't use it.

GCC x86-64 14.2 is even worse:

add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret

It calls a routine to do all additions which might overflow, and that
routine calls assert() if an overflow occurs.

The CPU has a trap-on-overflow instruction exactly for this case (to
crash
on detecting an overflow), and compilers don't even use it.

So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.

You can only compile in INTO opcodes if you can guarantee that the INT 4 (INTO) trap vector will always be set to a proper handler, and since
that isn't part of the ABI, compilers can't depend on it?

I do agree that it would be nice if it did work, barring that clang is
doing the best possible alternative, at close to zero cost except for
the useless branch predictor table entry wastage.

Terje

On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.


From MitchAlsup1@21:1/5 to EricP on Wed Sep 25 17:49:13 2024

On Wed, 25 Sep 2024 16:16:59 +0000, EricP wrote:

MitchAlsup1 wrote:

On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:

It is very efficient if those events are rarely or never supposed to
occur.

Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.

But it is not necessary for that bad mechanism to be necessary !!

Some of the things that minimize the "badness" of taking an exception::

a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.

I know of only 1 ISA with these properties....

It all depends on the frequency that exceptions occur.
It used to be that Page Fault was the only one that occurred with any frequency, and the code path for the page fault handler was long enough
that any HW overhead was lost in the noise. In all other cases they
indicated a fatal error so the HW cost was the least of your problems.

But then, risc processors mostly, started using exceptions for
housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned memory access, MIPS and Alpha for software TLB-miss handling.
And suddenly the exceptional becomes the normal.

The solution for Alpha was to add back the byte and word instructions,
and add misaligned access support to all memory ops.
Sparc stuck with traps for register windows.
No one else used software managed TLB's.

Then virtual machines come along using exceptions to trigger
trap-and-emulate code, and now the normal becomes frequent.
Not 1 or 10 exceptions per second, but 100,000 or 200,000.

High amounts of trap-and-emulate code originate from a "I am
a CPU and I decide everything" mindset of ISAs developed
before VMs came around. Too many control registers being
touched too often, and don't get me started on interrupt
"routing". What a VM wants is "I am a virtual CPU and I
make choices for virtual me; I do not own interrupts or
devices, or control the system--nor am I controlled by
a sea of control registers".

My 66000 architecture has only 1 instruction with any notion
of privilege and when you touch a control register with it
you end up changing a cache line size of control register
state--dropping the amount of touches nearly an order of
magnitude:: 100,000/sec -> 10,000/second.

Privilege has been specified in such a way that a p-thread
thread can context switch to another p-thread thread within
a single application with a single non-privilege-violating
instruction. New stack, new translation tables, save/restore
the register files, change the ASID, change which exceptions
are recognized/ignored, ...

In addition, there is no SW overhead wrt interrupt routing
when performing a "world switch" making a world switch
have the same 10-20 cycle overhead as a context switch.

The solution for VM's is to add the ISA features necessary so that
most exceptions are rare, and when they do happen they are cheap.

Exceptions should be rare AND cheap.

But I argue against adding instructions to mask the deficiencies
of the ISAs that got it wrong oh so long ago. But no-one will
listen. I argue to fix the problem at its source:: the notion
of how the machine is controlled, and interrupted; rather than
adding zillions of helper instructions to mask the real problem.

Worst case it should cost the same as a branch mispredict pipeline
drain.


From Scott Lurndal@21:1/5 to EricP on Wed Sep 25 18:00:28 2024

EricP <[email protected]> writes:

Terje Mathisen wrote:

You can only compile in INTO opcodes if you can guarantee that the INT 4
(INTO) trap vector will always be set to a proper handler, and since
that isn't part of the ABI, compilers can't depend on it?

I do agree that it would be nice if it did work, barring that clang is
doing the best possible alternative, at close to zero cost except for
the useless branch predictor table entry wastage.

Terje

On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.

It seems more flexible to have language facilities to handle overflow directly rather than using trap & emulate to fake it.

COMPUTE A = B + C ON OVERFLOW DO SOMETHING.

A sticky processor state "overflow" bit is sufficient to support
such languages.
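
A software model of that idea, as a sketch only: the flag is set by any overflowing add and stays set until the program tests and clears it, which is all an ON OVERFLOW clause needs. The `arith_state` type and both function names are hypothetical, and `__builtin_add_overflow` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a sticky overflow bit in processor state. */
typedef struct {
    bool overflow;
} arith_state;

/* Add that never traps: it records overflow in the sticky bit and
   returns the wrapped sum. The bit is only ever set here, not cleared. */
int32_t add_sticky(arith_state *st, int32_t a, int32_t b)
{
    int32_t sum;
    if (__builtin_add_overflow(a, b, &sum))
        st->overflow = true;
    return sum;
}

/* Test-and-clear, playing the role of the ON OVERFLOW test. */
bool take_overflow(arith_state *st)
{
    bool was_set = st->overflow;
    st->overflow = false;
    return was_set;
}

/* Usage: several operations, one check at the end of the statement. */
void sticky_demo(void)
{
    arith_state st = { false };
    int32_t a = add_sticky(&st, 1, 2);
    a = add_sticky(&st, a, 3);
    assert(a == 6);
    assert(!take_overflow(&st));
}
```

The point of the sticky bit is that a whole expression can run without per-operation branches, with a single test where the language says the overflow is to be handled.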


From Niklas Holsti@21:1/5 to All on Wed Sep 25 21:12:50 2024

On 2024-09-25 20:30, MitchAlsup1 wrote:

On Wed, 25 Sep 2024 17:07:45 +0000, Michael S wrote:

On Wed, 25 Sep 2024 09:54:17 -0700
Stephen Fuld <[email protected]d> wrote:

On 9/10/2024 1:13 AM, Niklas Holsti wrote:

In the Ada case, the ability to declare array types with
programmer- chosen index types with bounded range, such as
range-bounded integers or enumerations, means that the compiler can
avoid indexing checks when the (sub)type of the index is known at
compile time to fit within the index range of the array.

I have always liked the idea of variable ranges able to be specified
in the language. Besides the advantages you mentioned, it provides
more human "comprehensibility" (if the ranges are reasonably named)
i.e. better internal documentation, and it makes responding to
specification changes required later in the program life cycle easier
and less error prone, i.e. if the range has to change, you change it
in one place and don't risk missing making the change in some obscure
part of the program you forgot about.

The problem here is that arrays with fixed bounds were common when
Ada was conceived back in the mid 1970s. On general-purpose (as opposed
to embedded) computers they were already much rarer when Ada was shipped
in 1983. By late 1990s arrays with fixed bounds were rare exception
rather than rule.

It sounds like variable ranges (array indexes) would be becoming more
common, also.

Where "variable range" is a variable that is defined to have a
specified range, but from run to run the upper and lower bounds
can be modified without re-compilation.

Ada subtypes can do that, but the underlying type, set at compile time,
for example the standard Integer type, will put an upper bound on the
range of the subtype. The number of bits in the numbers cannot change
without recompilation.

Michael S says that arrays with variable bounds are becoming more
common. I assume he means indexable containers, often called vectors.
Ada has several such containers in the standard library, all defined as
generic in the index subtype, which means that the compiler can check
that the type of a vector index is correct.

If a vector is allocated (sized) with a certain (dynamically defined)
length, instead of growing element by element, and if that length
matches the (dynamically defined) range of the index subtype, the same
static checking methods/proofs can be applied as for traditional arrays
of fixed size. I think that the current tools don't do that for vector containers by default, but I believe they can be persuaded to do it by
writing the corresponding preconditions for the indexing operations.
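
To make the idea concrete for readers who don't use Ada, here is a loose C analogy. It is a sketch under heavy assumptions: C has no subtypes, so nothing stops a caller from pairing an index with the wrong range, which is exactly what the Ada type system and proof tools would rule out. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical index "subtype" whose bounds are chosen at run time. */
typedef struct {
    int lo, hi;
} index_range;

/* An index value that has already been checked against its range. */
typedef struct {
    int value;
} checked_index;

/* The one place where the range check happens. */
bool make_index(index_range r, int v, checked_index *out)
{
    if (v < r.lo || v > r.hi)
        return false;
    out->value = v;
    return true;
}

/* Vector sized to match the index range exactly: hi - lo + 1 elements. */
typedef struct {
    index_range range;
    double     *data;
} ranged_vector;

/* No bounds check here. This relies on idx having been made from
   v->range, a rule C cannot enforce but a subtype system can. */
double *element(ranged_vector *v, checked_index idx)
{
    return &v->data[idx.value - v->range.lo];
}
```

The check moves from every indexing operation to the single point where the index value is created, which is the saving being described.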


From MitchAlsup1@21:1/5 to EricP on Wed Sep 25 18:11:32 2024

On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Well, there is a bunch of things to unpack here.

First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
that opcode to be for other instructions. On x64 the JO (jump overflow) instruction does overflow detection.

The reason AMD could reassign INTO was because it wasn't being used by
C/C++.
But this is a side effect of C's widespread use, not the cause.
Programmers write in C because it is widely used and supported,

Free compilers

and as a consequence of that choice they get unchecked arithmetic.
But they are not choosing C to get unchecked arithmetic.
Had this same usage tests been done on other languages the results
would likely be quite different.

Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes respectively. In JO case it has to branch to a ThrowOverflow () call
so that's 5 more bytes per ADD or SUB if you want error traceability.
With overflow trapping instructions there is NO runtime or code size
cost.

Third, on many risc ISA like RISC-V there are no flags so no JO
instruction
even possible. Either they must use the branchless overflow idiom or the branching version, adding more to the cost of error detection.

*OR*its has an Add Fault Overflow instruction which has NO RUNTIME COST
ADDFO rd = rs1 + rs2

*OR"its ??? can you translate that into comp.arch language.

Fourth, it sounds like what you want is a risc (no flags) ADD
instruction
that returns both a result and an overflow flag so you can do the
equivalent of the x64 JO branch test.
ADDO (ro,rd) = rs1 + rs2
where rd is dest and ro is a register to receive a 0/1 overflow flag.
Once one allows multiple dest registers ADDO is trivial to support.

But that does not invalidate the usefulness of ADDFO.
I would also have ADDFC Add Fault Carry for unsigned overflow,

Which will be used two orders of magnitude less than ADDFO. First
because unsigned is used less often than signed, secondly much/most
unsigned arithmetic is specified to wrap rather than check.

plus other instructions for checking signed overflow and unsigned carry.

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument for detect signed overflows and crash).

For x86-64 clang 19.1.0:

add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

Yes, this is all for x64 which has no INTO instruction.

s/x64/x86-64/g

It is still an x86 with all the benefits and detriments.
<snip>

So why should any hardware include an instruction to trap-on-overflow?

Because ALL the negative speed and code size consequences do not occur.

No because an EFFICIENT trap-on-overflow has no performance consequences
when no overflow is created. Efficient means 10-20 cycles to arrive at exception handler--already in a reentrant state with exceptions and
interrupts still enabled. Just because x86 is so horrible in this regard
does not mean every architecture has to be at least that bad.

Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always needs this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it
easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against--
it is just a bad idea for the ISA to have ALU instructions take
exceptions.

Not really. It's a flag in the uOp indicating HasException and a union
of fields to hold exception status and RIP, all of which needs to be
there for other instructions like load/store.

Agreed, the overhead of recording "Overflow" and whether to do something
about it is so small that other considerations sway the arguments.

Instruction sets which make detecting overflow difficult (say, RISC-V), would do well to make branch-on-overflow efficient and easy. But adding trap-on-overflow instructions is a waste of effort.

No they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.

"I have one example where overflow exceptions would be a poor
implementation
choice" does not imply "therefore no one should have them as an option".

Can you share what language, compiler, and hardware you are using which
implements overflow checks using a trap-on-overflow instruction?

Kent

On DEC VAX the Overflow Enable flag was in the Program Status Word.

On My 66000 Overflow enable bit is part of the thread-status-line.

IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
Ada, Cobol, and disabled by default for C. But it could be toggled with
a runtime library call.

Similar--but library call does not have to "gain privilege" to flip the
bit's state.

For a variety of reasons having Overflow Enable in the status register
is A Bad Idea.

Can you expand. It seems to me if the unprivileged application using
the instructions at hand (Header Register instruction) can access
and write those exception control bits without needing privilege--
that most of the "A Bad Idea™" disappear. At the same time there
are significant amounts of state that do require privilege to
access in thread-status-line, and HR obeys such a distinction.

On Alpha it was a compile switch which selects different instructions ADD vs ADDV, and also controlled by pragmas.
If you wanted to manually test for overflow then you used
one of the idioms, whatever language you worked in.


From Stephen Fuld@21:1/5 to All on Wed Sep 25 12:30:02 2024

On 9/24/2024 10:57 AM, MitchAlsup1 wrote:

On Tue, 24 Sep 2024 17:06:27 +0000, Stephen Fuld wrote:

On 9/23/2024 11:02 PM, Terje Mathisen wrote:

snip

Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra cycles as long as the trap part isn't hit?

If you are going to do that, why not make it an optional prefix byte?
That way, no fusion needed, no extra cycles, yet the same amount of code
space.

Realistically, what is the difference if INTO is a prefix
byte or a postfix byte ?

Of course, IANAHG, but my guess was that not having to do instruction
fusion was worth something, and my suggestion has zero cost. If this is
wrong, then there is no difference.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From EricP@21:1/5 to All on Thu Sep 26 13:13:02 2024

MitchAlsup1 wrote:

On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Well, there is a bunch of things to unpack here.

First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
that opcode to be for other instructions. On x64 the JO (jump overflow)
instruction does overflow detection.

The reason AMD could reassign INTO was because it wasn't being used by
C/C++.
But this is a side effect of C's widespread use, not the cause.
Programmers write in C because it is widely used and supported,

Free compilers

I've always paid for mine. My first C compiler came with the
WinNT 3.5 beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

and as a consequence of that choice they get unchecked arithmetic.
But they are not choosing C to get unchecked arithmetic.
Had this same usage tests been done on other languages the results
would likely be quite different.

Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes
respectively. In JO case it has to branch to a ThrowOverflow () call
so that's 5 more bytes per ADD or SUB if you want error traceability.
With overflow trapping instructions there is NO runtime or code size
cost.

Third, on many risc ISA like RISC-V there are no flags so no JO
instruction
even possible. Either they must use the branchless overflow idiom or the
branching version, adding more to the cost of error detection.

*OR*its has an Add Fault Overflow instruction which has NO RUNTIME COST
ADDFO rd = rs1 + rs2

*OR"its ??? can you translate that into comp.arch language.

*OR* an ISA has an Add Fault Overflow instruction which has NO RUNTIME COST
ADDFO rd = rs1 + rs2

Fourth, it sounds like what you want is a risc (no flags) ADD
instruction
that returns both a result and an overflow flag so you can do the
equivalent of the x64 JO branch test.
ADDO (ro,rd) = rs1 + rs2
where rd is dest and ro is a register to receive a 0/1 overflow flag.
Once one allows multiple dest registers ADDO is trivial to support.

But that does not invalidate the usefulness of ADDFO.
I would also have ADDFC Add Fault Carry for unsigned overflow,

Which will be used two orders of magnitude less than ADDFO. First
because unsigned is used less often than signed, secondly much/most
unsigned arithmetic is specified to wrap rather than check.

Unsigned checked arithmetic is just another data type but with the
unsigned range of 0..2^n-1. A programmer should select it for situations
where unsigned values are never supposed to wrap at zero.
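
In C terms, and only as a sketch with made-up function names, a checked unsigned type amounts to testing for carry on add and borrow on subtract instead of letting the value wrap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Checked unsigned add: succeeds only if the sum fits in 0..2^32-1. */
bool checked_uadd(uint32_t a, uint32_t b, uint32_t *out)
{
    uint32_t sum = a + b;   /* unsigned add wraps, which is well defined */
    if (sum < a)            /* carry out: wrapped past 2^32-1 */
        return false;
    *out = sum;
    return true;
}

/* Checked unsigned subtract: succeeds only if the result is >= 0. */
bool checked_usub(uint32_t a, uint32_t b, uint32_t *out)
{
    if (b > a)              /* borrow: would wrap below zero */
        return false;
    *out = a - b;
    return true;
}

/* Usage: a count that must never go below zero. */
void checked_unsigned_demo(void)
{
    uint32_t count = 1;
    assert(checked_usub(count, 1, &count));    /* 1 - 1 = 0 is fine */
    assert(!checked_usub(count, 1, &count));   /* 0 - 1 is reported */
    assert(count == 0);                        /* failure leaves it unchanged */
}
```

An ADDFC-style instruction would fold the compare into the add itself and fault rather than return a status.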

Note that "checked arithmetic" is more than just checks for overflow on arithmetic ops. It also checks on assignments with down casts that no
overflow occurs, that you are not trying to put 10 pounds in a 5 pound bag,
and conversions between signed and unsigned checked types.
It can also include range checks on subtypes.
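
A small sketch of those non-arithmetic checks, written out by hand in C. The helper names are invented for illustration; a language with checked types would generate the equivalent tests itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Checked down cast: store v in a 16-bit destination only if it fits. */
bool checked_narrow_i16(int32_t v, int16_t *out)
{
    if (v < INT16_MIN || v > INT16_MAX)
        return false;            /* 10 pounds into a 5 pound bag */
    *out = (int16_t)v;
    return true;
}

/* Checked signed-to-unsigned conversion: negative values do not fit. */
bool checked_to_u32(int32_t v, uint32_t *out)
{
    if (v < 0)
        return false;
    *out = (uint32_t)v;
    return true;
}

/* Range check on a subtype, e.g. a month number declared as 1..12. */
bool in_subrange(int32_t v, int32_t lo, int32_t hi)
{
    return v >= lo && v <= hi;
}
```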

plus other instructions for checking signed overflow and unsigned carry.

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and crash).

For x86-64 clang 19.1.0:

add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
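
For comparison, the same add-then-branch-to-trap shape can be written by hand with builtins that GCC and Clang both provide. This is only a sketch of the pattern in the listing, not a claim about how -ftrapv is implemented internally; `add2_trapping` is a made-up name.

```c
/* Hand-written equivalent of the listing: add, branch on overflow, trap. */
int add2_trapping(int num, int other)
{
    int sum;
    if (__builtin_add_overflow(num, other, &sum))
        __builtin_trap();        /* plays the role of the ud1 instruction */
    return sum;
}
```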

Yes, this is all for x64 which has no INTO instruction.

s/x64/x86-64/g

It is still an x86 with all the benefits and detriments.
<snip>

So why should any hardware include an instruction to trap-on-overflow?

Because ALL the negative speed and code size consequences do not occur.

No, because an EFFICIENT trap-on-overflow has no performance consequences
when no overflow is created. Efficient means 10-20 cycles to arrive at
the exception handler--already in a reentrant state with exceptions and
interrupts still enabled. Just because x86 is so horrible in this regard
does not mean every architecture has to be at least that bad.

It has zero cost when no overflow occurs.
And if one does occur it leaves the RIP pointing at the problem instruction.

Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against--it is just a bad idea for the ISA to have ALU instructions take
exceptions.

Not really. It's a flag in the uOp indicating HasException and a union
of fields to hold exception status and RIP, all of which needs to be
there for other instructions like load/store.

Agreed, the overhead of recording "Overflow" and whether to do something
about it is so small that other considerations sway the arguments.

Instruction sets which make detecting overflow difficult (say, RISC-V)
would do well to make branch-on-overflow efficient and easy. But adding
trap-on-overflow instructions is a waste of effort.

No they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.

"I have one example where overflow exceptions would be a poor
implementation
choice" does not imply "therefore no one should have them as an
option".

Can you share what language, compiler, and hardware you are using which
implements overflow checks using a trap-on-overflow instruction?

Kent

On DEC VAX the Overflow Enable flag was in the Program Status Word.

On My 66000 Overflow enable bit is part of the thread-status-line.

IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
Ada, Cobol, and disabled by default for C. But it could be toggled with
a runtime library call.

Similar--but library call does not have to "gain privilege" to flip the
bit's state.

The library routine didn't need a privilege change.
This was just to give high level languages access to the PSW.
Same as floating point unit control routines.

For a variety of reasons having Overflow Enable in the status register
is A Bad Idea.

Can you expand? It seems to me if the unprivileged application using
the instructions at hand (Header Register instruction) can access
and write those exception control bits without needing privilege--
that most of the "A Bad Idea™" disappears. At the same time there
are significant amounts of state that do require privilege to
access in thread-status-line, and HR obeys such a distinction.

On Alpha it was a compile switch which selects different
instructions ADD vs ADDV, and also controlled by pragmas.
If you wanted to manually test for overflow then you used
one of the idioms, whatever language you worked in.

The problem is mostly due to the fact that expressions are
*mixtures of signed and unsigned, checked and modulo arithmetic*.
If overflow checks are enabled by status flag then the program has to
keep switching between modes for individual arithmetic operations.
This leads to a slew of enable and disable instructions which could
be serializing.

This is because array index value expressions are calculated using signed,
checked arithmetic, then the result is range checked against the bounds.
However addresses are calculated using modulo arithmetic.
Since most OS define address 0 to be the start of user space,
and locate the OS at FF...FF, addresses are unsigned modulo numbers.

If checks are not disabled for the address calculation
and the address not calculated using modulo arithmetic,
it is easy to trigger false overflow exceptions with arrays
that do not have base-0 or base-1 array bounds as many compilers
use bias-base array buffer pointers.
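
To make the biased-base point concrete, here is a small sketch using plain 64-bit integers in place of real addresses. The function names and the numbers in the note below are invented for illustration. The biased base is the address element 0 would have if it existed, so for an array whose lower bound is large it can wrap past zero even though every real element address comes out right.

```c
#include <stdint.h>

/* Biased base = address of first element minus lo * element size,
   computed modulo 2^64 exactly as an address calculation would be. */
uint64_t biased_base(uint64_t first_elem_addr, int64_t lo, uint64_t esz)
{
    return first_elem_addr - (uint64_t)lo * esz;
}

/* Address of element i = biased base + i * element size, modulo 2^64. */
uint64_t elem_addr(uint64_t biased, int64_t i, uint64_t esz)
{
    return biased + (uint64_t)i * esz;
}
```

With a first element at 0x1000, lower bound 1000 and 8-byte elements, the biased base wraps to a huge value, yet `elem_addr(biased, 1000, 8)` is 0x1000 again. A checked subtract or add on that path would fault on a perfectly valid access.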

This is why one wants separate instructions for ADD and ADDV - there is
no overhead to switching between modulo and checked linear arithmetic.

Second, if there is a control register, it becomes part of the ABI.
It can either be
- undefined on calls, in which case each routine must save the current
flags state on *each* entry and set a value, and restore the original
state on return,
- or defined to have a particular enable/disable value on call and
callees are required to toggle it if needed but restore it to default
for all calls and returns.
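
A toy model of the first convention, with an ordinary global variable standing in for the status register. Everything here is hypothetical (`status_word`, `OVF_ENABLE`, the routine name); real hardware would use status-register read/write instructions, but the save/set/restore bracket around the body has the same shape.

```c
/* Hypothetical stand-in for a status register with an overflow-enable bit. */
unsigned status_word;
enum { OVF_ENABLE = 1u };

/* "Undefined on calls" convention: the routine cannot assume anything about
   the incoming mode, so it saves it, sets what it needs, and restores it. */
int routine_needing_checks(int a, int b)
{
    unsigned saved = status_word;     /* save the caller's mode on entry */
    status_word |= OVF_ENABLE;        /* select checked arithmetic       */
    int r = a + b;                    /* hardware would trap here on overflow */
    status_word = saved;              /* restore the caller's mode on return  */
    return r;
}
```

That is two or three extra register accesses in every routine, before counting any serialization they cause.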

Third, there is no reason to have overflow as a dynamic enable/disable
because the kind of arithmetic, modulo or linear, is fixed by what the programmer writes and does not change dynamically.

Dynamic overflow enable results in a continuous overhead managing it
which does not occur with explicit fault testing instructions.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to EricP on Thu Sep 26 18:11:56 2024

On Thu, 26 Sep 2024 17:13:02 +0000, EricP wrote:

MitchAlsup1 wrote:

On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:

IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
Ada, Cobol, and disabled by default for C. But it could be toggled with
a runtime library call.

Similar--but library call does not have to "gain privilege" to flip the
bit's state.

The library routine didn't need a privilege change.

Most architectures do not allow the unprivileged to access or modify
PSW (except the IP via branch instructions and CC via arithmetic).

My 66000 ISA allows unprivileged to access and modify PSW state
that only affects how the application acts. So, unprivileged can
modify:
a) IEEE flags
b) exception enablement
c) rounding mode
but not
d) Translation Tables
e) ASID
f) Dispatcher
g) call stack pointer
..
Even though they are stored in the same cache line.
<snip>

The problem is mostly due to the fact that expressions are
*mixtures of signed and unsigned, checked and modulo arithmetic*.
If overflow checks are enabled by status flag then the program has to
keep switching between modes for individual arithmetic operations.
This leads to a slew of enable and disable instructions which could
be serializing.

And often are.

This is because array index value expressions are calculated using signed,
checked arithmetic, then the result is range checked against the bounds.
However addresses are calculated using modulo arithmetic.
Since most OS define address 0 to be the start of user space,
and locate the OS at FF...FF, addresses are unsigned modulo numbers.

If checks are not disabled for the address calculation
and the address not calculated using modulo arithmetic,
it is easy to trigger false overflow exceptions with arrays
that do not have base-0 or base-1 array bounds as many compilers
use bias-base array buffer pointers.

This is why one wants separate instructions for ADD and ADDV - there is
no overhead to switching between modulo and checked linear arithmetic.

Second, if there is a control register, it becomes part of the ABI.
It can either be
- undefined on calls, in which case each routine must save the current
flags state on *each* entry and set a value, and restore the original
state on return,
- or defined to have a particular enable/disable value on call and
callees are required to toggle it if needed but restore it to default
for all calls and returns.

Third, there is no reason to have overflow as a dynamic enable/disable because the kind of arithmetic, modulo or linear, is fixed by what the programmer writes and does not change dynamically.

Dynamic overflow enable results in a continuous overhead managing it
which does not occur with explicit fault testing instructions.

Where it consumes valuable OpCode space which is sometimes not available {{3-operand 1-result instructions are notoriously "tight" in encoding:
± on 2 operands, int/float/double, FMAC<->INSert, attached
constant,...}}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Lawrence D'Oliveiro on Sat Sep 28 11:04:34 2024

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than MSVC,
the money may not matter to your business, but the quality of the product will.

GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and DLL's, supporting applications, documentation.
You know... what a product looks like.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to George Neuner on Sat Sep 28 17:20:56 2024

On 28/09/2024 01:52, George Neuner wrote:

On Wed, 25 Sep 2024 12:54:18 -0400, EricP
<[email protected]> wrote:

For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit

Trapping or other overflow detection makes it extremely difficult to
optimise arithmetic. All kinds of re-arrangements, strength reduction
and constant propagation become problematic if you need to flag
overflow. That comes in addition to any actual overflow detection.

I've just tried a quick test on godbolt - when you use -ftrapv, it seems
that in at least some cases, the trapping arithmetic is done using a
library function (like "__mulvsi3") rather than direct code. This will
have significant performance implications.

2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.

GCC lets you turn "-ftrapv" on and off with:

#pragma GCC optimize("-ftrapv")

and

#pragma GCC optimize("-fno-trapv")

It also lets you specify it for a single function at a time with

__attribute__((optimize("-ftrapv")))

Changing these options does have some limitations, such as disabling
inlining into functions with different options. But you can happily
apply it to only some functions in a translation unit.

Things like that are why some companies have a code policy that allows
just one function per file.

I have never heard of such a policy, and I think it would be an
extremely silly one - code would be completely unmanageable, and the
results would be significantly poorer when using modern compilers (i.e., anything this century).

Still a problem if you need <whatever the relevant flag does> only in
one or a few places.

There are gcc flags that are only controllable for compiler invocations,
rather than with pragmas or attributes, and of course not every compiler
has the flexibility of gcc or clang. But this is not nearly the level
you seem to think it is.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to [email protected] on Sat Sep 28 13:20:57 2024

On Sat, 28 Sep 2024 02:25:21 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than MSVC,
the money may not matter to your business, but the quality of the product
will.

The main reason to use MSVC is Windows GUI programming - and the code
quality is fine for most applications.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to [email protected] on Sat Sep 28 13:58:18 2024

On Sat, 28 Sep 2024 17:20:56 +0200, David Brown
<[email protected]> wrote:

On 28/09/2024 01:52, George Neuner wrote:

On Wed, 25 Sep 2024 12:54:18 -0400, EricP
<[email protected]> wrote:

For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit

:

2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.

:

Changing these options does have some limitations, such as disabling
inlining into functions with different options. But you can happily
apply it to only some functions in a translation unit.

Things like that are why some companies have a code policy that allows
just one function per file.

I have never heard of such a policy, and I think it would be an
extremely silly one - code would be completely unmanageable, and the
results would be significantly poorer when using modern compilers (i.e.,
anything this century).

It is often the case when software modeling tools are in use because
the tools tend to produce one source file per model 'object' or
relation. And it DOES tend to result in poor(er) code.

Most modeling tools do have the option to place code in specified
files - so generally it is possible to have better control over the
compilation and linking of the executables ... but typically it isn't
done: often because the company has a policy to not interfere with the
tool.

Some people - managers mostly - feel that if it runs, and the code
quality is 'acceptable' (for some definition), then "better is the
enemy of 'good enough'".

I have seen it firsthand.

But for the record: I refuse to use modeling tools for code or for
project management. However, I do a fair amount of DBMS work these
days, and I have found some DBMS modeling tools to be useful for
creating schema /documentation/.

Still a problem if you need <whatever the relevant flag does> only in
one or a few places.

There are gcc flags that are only controllable for compiler invocations,
rather than with pragmas or attributes, and of course not every compiler
has the flexibility of gcc or clang. But this is not nearly the level
you seem to think it is.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to D'Oliveiro on Sat Sep 28 19:37:00 2024

In article <vd7peh$12kpl$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

Using GCC (or Clang) for Windows programming is OK, and can be great,
provided other organisations don't need to program against APIs you
provide. If they do, the amount of FUD generated and explaining required overwhelms the gains.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to EricP on Sat Sep 28 22:02:31 2024

On Sat, 28 Sep 2024 11:04:34 -0400, EricP wrote:

GCC is a compiler collection not an integrated development kit for
Windows.

GCC is cross-platform and IDE-independent. If you can’t use it with your
IDE, the limitation is with your IDE, not with GCC.

I have no knowledge of what state GCC was in in 1992 but it likely did
not support the MS enhancements for Win32 programming:

That’s a long time to go without checking on what’s been happening in the computing scene lately. Are you still doing your programming to 32-bit
APIs? Isn’t there a “Win64” yet?

Also, Microsoft’s development tools still assume a Windows-centric world,
and are not suited for cross-development. If you haven’t noticed, a lot of corporate work is deployed in the cloud now, which is Linux-based. And
then there is embedded work, which is also heavily Linux-based.

Plus come with necessary API headers, various link libraries and DLL's, supporting applications, documentation.
You know... what a product looks like.

Those are all parts of the relevant SDKs, separate from GCC. E.g.
something as basic as

ldo@theon:~> dpkg-query -S /usr/include/stdlib.h
libc6-dev:amd64: /usr/include/stdlib.h

is part of the POSIX/C runtime library SDK, not part of GCC itself.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to EricP on Sat Sep 28 23:59:23 2024

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and DLL's, supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
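
For anyone who wants to confirm what a given compiler provides before porting, a minimal probe along these lines may help. It assumes only the standard `<float.h>` macros plus `__SIZEOF_INT128__`, which GCC and Clang predefine when `__int128` is available; the function names are made up.

```c
#include <float.h>

/* Number of mantissa bits in long double: 53 means it is just a double,
   64 means x87 extended precision, 113 means IEEE binary128. */
int long_double_mantissa_bits(void)
{
    return LDBL_MANT_DIG;
}

/* 1 if the compiler advertises a native 128-bit integer type, else 0. */
int has_int128(void)
{
#if defined(__SIZEOF_INT128__)
    return 1;
#else
    return 0;
#endif
}
```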

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Tim Rentsch on Sun Sep 29 07:15:51 2024

Tim Rentsch <[email protected]> schrieb:

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.

Even gdb works.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to EricP on Sun Sep 29 13:13:26 2024

On Wed, 25 Sep 2024 13:56:40 -0400
EricP <[email protected]> wrote:

Terje Mathisen wrote:

Kent Dickey wrote:

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument for detect signed overflows and
crash).

For x86-64 clang 19.1.0:

add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

But x86 has an instruction to trap on overflow directly: INTO.
It's one byte.
And it doesn't use it.

GCC x86-64 14.2 is even worse:

add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret

It calls a routine to do all additions which might overflow, and
that routine calls assert() if an overflow occurs.

The CPU has a trap-on-overflow instruction exactly for this case
(to crash
on detecting an overflow), and compilers don't even use it.

So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.

You can only compile in INTO opcodes if you can guarantee that the
INT 4 (INTO) trap vector will always be set to a proper handler,
and since that isn't part of the ABI, compilers can't depend on it?

I do agree that it would be nice if it did work, barring that clang
is doing the best possible alternative, at close to zero cost
except for the useless branch predictor table entry wastage.

Terje

On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX
prefix.

Single-byte form of INTO reassigned. Dual-byte form (CD 04) is here.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Sun Sep 29 13:55:11 2024

On Sat, 28 Sep 2024 23:59:23 -0700
Tim Rentsch <[email protected]> wrote:

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the
WinNT 3.5 beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

GCC is a compiler collection not an integrated development kit for
Windows. I have no knowledge of what state GCC was in in 1992 but
it likely did not support the MS enhancements for Win32 programming: structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned
integers, and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and
DLL's, supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

Both of your problems have no [MSVC] solution right now.

In case of 128-bit integer, there is a chance that MSVC will support it
in the future.

In case of 80-bit long double, there is no chance. If MSVC ever
supports binary floating point wider than 64-bit on x86-64 platform
then it would be IEEE binary128 implemented in software. But even then
they would not use name 'long double' for a new type, because it would
break existing programs.

But if all you want is the program running on Windows, then the
solution is easy - use a different compiler.
MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
After you have msys2 installed do
pacman -Sy
pacman -S mingw-w64-ucrt-x86_64-gcc

Several hundred MB more and you have gcc14.
Possibly you would have to install make separately, i.e.
pacman -S make

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Thomas Koenig on Sun Sep 29 14:18:54 2024

On 29/09/2024 09:15, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.

Personally, I prefer msys2 because it gives you access to the usual *nix
tools like make - and does so far better than Cygwin. (Here "better"
means more native-like file access, and more efficient usage.) And you
don't get the DLL hell of Cygwin.

I think Cygwin is useful if you need more advanced or accurate POSIX
semantics - such as "fork" calls. But for most uses, msys2 is much
simpler to work with. Msys2 also has a more friendly license for many
people.

However, I haven't had to do much compilation of any kind targetting
Windows.

Even gdb works.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to David Brown on Sun Sep 29 12:34:53 2024

David Brown <[email protected]> schrieb:

On 29/09/2024 09:15, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.

Personally, I prefer msys2 because it gives you access to the usual *nix tools like make - and does so far better than Cygwin. (Here "better"
means more native-like file access, and more efficient usage.) And you
don't get the DLL hell of Cygwin.

Just one remark - I was referring to running the mingw compiler
under Cygwin, for which you don't get the DLL issues.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Thomas Koenig on Sun Sep 29 15:49:32 2024

On 29/09/2024 14:34, Thomas Koenig wrote:

David Brown <[email protected]> schrieb:

On 29/09/2024 09:15, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.

Personally, I prefer msys2 because it gives you access to the usual *nix
tools like make - and does so far better than Cygwin. (Here "better"
means more native-like file access, and more efficient usage.) And you
don't get the DLL hell of Cygwin.

Just one remark - I was referring to running the mingw compiler
under Cygwin, for which you don't get the DLL issues.

Ah, okay.

But you don't need Cygwin here. You don't even need a msys2
environment, unless you want to have things in the same place as on a
*nix system or use programs that expect other files in those places.
Almost all of the compilations I do under Windows (and most of those
that I do under Linux) are cross-compilations.
For Windows, those are mingw hosted gcc's. And as long as things like
make, cp, rm, sed, and a few other common utilities are on the path,
they can be used fine without a full msys2 environment. The same goes
for other command-line utilities I use all the time like ssh, grep,
less, etc.

The only call I've had for Cygwin is for building software with more complicated or old-fashioned styles, like big .config arrangements, or
for code that needs more complete POSIX emulation. I'm not sure I have
used it since the days of building my own gcc 3 cross-compilers.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Tim Rentsch on Sun Sep 29 16:19:38 2024

Tim Rentsch wrote:

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.

GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.

Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

I seem to remember finding something like __int128_t and __uint128_t
inside MSVC?

And that by casting uint64_t parameters to the u128 variant, the
compiler would generate the obvious MUL RDI and save RDX:RAX as the
128-bit result:

uint128_t mulw(uint64_t a, uint64_t b)
{
    return (uint128_t) a * (uint128_t) b;
}

I.e. no subroutine call/zero overhead.

OTOH, getting optimal wide integer accumulators is a bit harder, needing compiler intrinsics to access the widening add with carry opcodes. (ADDX)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Terje Mathisen on Sun Sep 29 18:00:26 2024

On Sun, 29 Sep 2024 16:19:38 +0200
Terje Mathisen <[email protected]> wrote:

Tim Rentsch wrote:

EricP <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the
WinNT 3.5 beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality
of the product will.

GCC is a compiler collection not a integrated development kit for
Windows. I have no knowledge of what state GCC was in in 1992 but
it likely did not support the MS enhancements for Win32
programming: structured exception handling, various ABI's, inline
assembler, defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned
integers, and most important: integration with the GUI source
code debugger.

Plus come with necessary API headers, various link libraries and
DLL's, supporting applications, documentation.
You know... what a product looks like.

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

I seem to remember finding something like __int128_t and __uint128_t
inside MSVC?

Very unlikely.
Most likely you are thinking about C++ rather than C.
Newer versions of Microsoft's STL appear to feature intentionally
undocumented 128-bit integer classes std::_Signed128 and
std::_Unsigned128.

And that by casting uint64_t parameters to the u128 variant, the
compiler would generate the obvious MUL RDI and save RDX:RAX as the
128-bit result:

uint128_t mulw(uint64_t a, uint64_t b)
{
    return (uint128_t) a * (uint128_t) b;
}

I.e. no subroutine call/zero overhead.

That is not MSVC.
For this specific case, MSVC has a [better] solution in form of
intrinsic _umul128. But it wouldn't help Tim, because he is not
allowed to modify the sources.
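
For readers comparing the two approaches, here is a minimal sketch of a widening 64x64 => 128 multiply that picks whichever mechanism the compiler offers. It assumes _umul128 from <intrin.h> on MSVC x64 and unsigned __int128 on GCC/Clang; the struct u128_parts and the function layout are hypothetical names chosen for illustration, not anything from the code being ported.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>             /* _umul128 (MSVC x64 intrinsic) */
#endif

/* Hypothetical helper type: the two 64-bit halves of a 128-bit product. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Widening multiply: returns the full 128-bit product of a and b. */
static u128_parts mulw(uint64_t a, uint64_t b)
{
    u128_parts r;
#if defined(_MSC_VER) && defined(_M_X64)
    /* MSVC: the intrinsic returns the low half and stores the high half. */
    r.lo = _umul128(a, b, &r.hi);
#elif defined(__SIZEOF_INT128__)
    /* GCC/Clang: use the built-in 128-bit integer type. */
    unsigned __int128 p = (unsigned __int128)a * b;
    r.lo = (uint64_t)p;
    r.hi = (uint64_t)(p >> 64);
#else
    /* Portable fallback: schoolbook multiplication on 32-bit halves. */
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
#endif
    return r;
}
```

Note that this only helps code that can be edited to call such a wrapper, which is exactly the restriction mentioned above.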

OTOH, getting optimal wide integer accumulators is a bit harder,
needing compiler intrinsics to access the widening add with carry
opcodes. (ADDX)

Terje

That's correct about intrinsics, but incorrect about ADCX/ADOX.
The latter can be moderately helpful in special situations, esp.
128b * 128b => 256b multiplication, but it is never necessary
and for addition/subtraction is not needed at all.
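
As an illustration of why plain addition needs nothing beyond an ordinary add-with-carry chain, here is a sketch of a two-limb (128-bit) add in portable C. The names u128_parts and add128 are hypothetical. Whether a given compiler turns this into ADD/ADC is an assumption to check in the generated assembly; the intrinsic route on x86-64 would presumably be _addcarry_u64, which is not used here so that the sketch stays portable.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value held as two 64-bit limbs. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Adds a and b limb by limb; *carry_out receives the carry out of bit 127. */
static u128_parts add128(u128_parts a, u128_parts b, unsigned *carry_out)
{
    u128_parts r;

    r.lo = a.lo + b.lo;
    unsigned carry = r.lo < a.lo;      /* low limb wrapped around */

    r.hi = a.hi + b.hi;
    unsigned c1 = r.hi < a.hi;         /* high limb wrapped around */

    r.hi += carry;
    unsigned c2 = r.hi < carry;        /* wrapped only when adding the carry */

    *carry_out = c1 | c2;              /* at most one of c1, c2 can be set */
    return r;
}
```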

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Sep 29 09:28:48 2024

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know? Also do you know if mingw
is compatible with MSVC, as long as long double is not used?
(The code being ported would never call MSVC with a long double
or 128-bit integer argument.)
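
A small probe can show what a given toolchain actually provides before porting starts. The sketch below is an assumption-laden illustration: it relies on LDBL_MANT_DIG from <float.h> (standard C) and on the __SIZEOF_INT128__ macro, which GCC and Clang predefine when they offer __int128. The function names are hypothetical.

```c
#include <float.h>
#include <stddef.h>

/* Significand bits of long double as this compiler defines the type:
 * 53 means it is the same format as double, 64 means x87 extended. */
static int long_double_mant_bits(void)
{
    return LDBL_MANT_DIG;
}

/* Storage size of long double in bytes (padding included). */
static size_t long_double_size(void)
{
    return sizeof(long double);
}

/* 1 if the compiler advertises a built-in 128-bit integer type, else 0. */
static int has_int128(void)
{
#if defined(__SIZEOF_INT128__)
    return 1;
#else
    return 0;
#endif
}
```

Printing these three values under each candidate compiler gives a quick yes/no answer for the two limitations described above.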

Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.

Running make and emacs in MS Windows... I like it!

Even gdb works.

That says a lot about the effort that went into Cygwin.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Sun Sep 29 09:39:27 2024

Michael S <[email protected]> writes:

On Sat, 28 Sep 2024 23:59:23 -0700
Tim Rentsch <[email protected]> wrote:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

Both of your problems have no [MSVC] solution right now.

In case of 128-bit integer, there is a chance that MSVC will support
it in the future.

In case of 80-bit long double, there is no chance. If MSVC ever
supports binary floating point wider than 64-bit on x86-64 platform
then it would be IEEE binary128 implemented in software. But even
then they would not use name 'long double' for a new type, because it
would break existing programs.

Thank you, this is helpful.

But if all you want is the program running on Windows, then the
solution is easy - use different compiler.
MSYS2 is just couple of clicks (and ~0.8 GB :( ) away.
After you have msys2 installed do
    pacman -Sy
    pacman -S mingw-w64-ucrt-x86_64-gcc

Several hundreds of MB more and you have gcc14
Possibly, I'd have to install make separately, i.e.
    pacman -S make

Do I understand this right, that msys2 is to be installed
on Windows? And that the pacman commands are to be run
within msys2 on the MS Windows system?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Terje Mathisen on Sun Sep 29 09:56:52 2024

Terje Mathisen <[email protected]> writes:

Tim Rentsch wrote:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

I seem to remember finding something like __int128_t and __uint128_t
inside MSVC?

Apparently VScode has added or is going to add support for
the pre-defined type names __int128_t and __uint128_t. That
is good to know even though it doesn't bear directly on my
question.

And that by casting uint64_t parameters to the u128 variant, the
compiler would generate the obvious MUL RDI and save RDX:RAX as the
128-bit result:

uint128_t mulw(uint64_t a, uint64_t b)
{
    return (uint128_t) a * (uint128_t) b;
}

I.e. no subroutine call/zero overhead.

OTOH, getting optimal wide integer accumulators is a bit harder,
needing compiler intrinsics to access the widening add with carry
opcodes. (ADDX)

This information doesn't affect what I am hoping to accomplish,
because any arithmetic on 128-bit types would already be held
in 128-bit values. Thank you though for the information.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Sun Sep 29 19:45:16 2024

On Sun, 29 Sep 2024 09:39:27 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sat, 28 Sep 2024 23:59:23 -0700
Tim Rentsch <[email protected]> wrote:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

Both of your problems have no [MSVC] solution right now.

In case of 128-bit integer, there is a chance that MSVC will support
it in the future.

In case of 80-bit long double, there is no chance. If MSVC ever
supports binary floating point wider than 64-bit on x86-64 platform
then it would be IEEE binary128 implemented in software. But even
then they would not use name 'long double' for a new type, because
it would break existing programs.

Thank you, this is helpful.

But if all you want is the program running on Windows, then the
solution is easy - use different compiler.
MSYS2 is just couple of clicks (and ~0.8 GB :( ) away.
After you have msys2 installed do
    pacman -Sy
    pacman -S mingw-w64-ucrt-x86_64-gcc

Several hundreds of MB more and you have gcc14
Possibly, I'd have to install make separately, i.e.
    pacman -S make

Do I understand this right, that msys2 is to be installed
on Windows? And that the pacman commands are to be run
within msys2 on the MS Windows system?

Yes and yes.
pacman has to be run from msys2 terminal window (bash).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Sun Sep 29 10:32:06 2024

Michael S <[email protected]> writes:

On Sun, 29 Sep 2024 09:39:27 -0700
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sat, 28 Sep 2024 23:59:23 -0700
Tim Rentsch <[email protected]> wrote:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.

I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.

Both of your problems have no [MSVC] solution right now.

In case of 128-bit integer, there is a chance that MSVC will support
it in the future.

In case of 80-bit long double, there is no chance. If MSVC ever
supports binary floating point wider than 64-bit on x86-64 platform
then it would be IEEE binary128 implemented in software. But even
then they would not use name 'long double' for a new type, because
it would break existing programs.

Thank you, this is helpful.

But if all you want is the program running on Windows, then the
solution is easy - use different compiler.
MSYS2 is just couple of clicks (and ~0.8 GB :( ) away.
After you have msys2 installed do
    pacman -Sy
    pacman -S mingw-w64-ucrt-x86_64-gcc

Several hundreds of MB more and you have gcc14
Possibly, I'd have to install make separately, i.e.
    pacman -S make

Do I understand this right, that msys2 is to be installed
on Windows? And that the pacman commands are to be run
within msys2 on the MS Windows system?

Yes and yes.
pacman has to be run from msys2 terminal window (bash).

Okay, thank you again.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Tim Rentsch on Sun Sep 29 19:30:08 2024

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?

One is a fork of the other, I believe.

Also do you know if mingw
is compatible with MSVC, as long as long double is not used?

I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to D'Oliveiro on Sun Sep 29 20:51:00 2024

In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

Are you still doing your programming
to 32-bit APIs? Isn't there a _Win64_ yet?

"Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Dallman on Sun Sep 29 19:57:22 2024

John Dallman <[email protected]> schrieb:

In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

Are you still doing your programming
to 32-bit APIs? Isn't there a _Win64_ yet?

"Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.

"This cannot be explained logically, only chronologically."

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Sun Sep 29 23:37:57 2024

On Sun, 29 Sep 2024 19:30:08 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?

One is a fork of the other, I believe.

Also do you know if mingw
is compatible with MSVC, as long as long double is not used?

I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.

mingw64 tools are mostly compatible with Windows x64 ABI, but long
double is an exception.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Dallman on Mon Sep 30 01:28:42 2024

On Sun, 29 Sep 2024 20:51 +0100 (BST), John Dallman wrote:

In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

Are you still doing your programming to 32-bit APIs? Isn't there a
_Win64_ yet?

"Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.

Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Sep 30 11:15:05 2024

On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sun, 29 Sep 2024 20:51 +0100 (BST), John Dallman wrote:

In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

Are you still doing your programming to 32-bit APIs? Isn't there a
_Win64_ yet?

"Win32" covers both 32-bit and 64-bit APIs. The reasons for this
silly nomenclature are complicated and lie deep in the past.

Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

They are entirely 64-bit. Every user-supplied buffer can be anywhere in
user's address space.
Possibly you are confusing Windows with VMS.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Mon Sep 30 03:36:26 2024

Michael S <[email protected]> writes:

On Sun, 29 Sep 2024 19:30:08 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?

One is a fork of the other, I believe.

Also do you know if mingw
is compatible with MSVC, as long as long double is not used?

I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.

mingw64 tools are mostly compatible with Windows x64 ABI, but long
double is an exception.

That was my impression but it's nice to have it confirmed.

My thanks again to both you and Thomas.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Thomas Koenig on Mon Sep 30 14:07:47 2024

On 29/09/2024 21:30, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?

One is a fork of the other, I believe.

mingw-w64 was started as a fork of mingw, initially created to support generating 64-bit binaries and because of disagreements with the pace of development in mingw.

Also do you know if mingw
is compatible with MSVC, as long as long double is not used?

I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.

There is a reasonably defined ABI for 64-bit Windows, so I think there
will be compatibility for most things in C. C++ is more complicated and
much more likely to have incompatibilities.

There are approximately a hundred and one different C ABI's and calling conventions for 32-bit Windows, since MS never actually defined one, so
things are a bit of a mess there. (DLL calling conventions are clearer.)

I believe the two most popular ways of running "Linux-like" software and
gcc on Windows are using WSL (which is more of a virtualisation layer),
and mingw-64 for the compiler target (with either gcc or clang) and
msys2 as an environment and source of *nix utilities and libraries.
mingw/msys is considered old and limited (32-bit only), while Cygwin is considered slow and clunky by many.

At least, that is my understanding.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Michael S on Mon Sep 30 14:49:00 2024

In article <[email protected]>, [email protected] (Michael S) wrote:

On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

Also the fact that those _64-bit_ APIs are not entirely _64-bit_

They are entirely 64-bit. Every user-supplied buffer can be
anywhere in user's address space. Possibly you are confusing
Windows with VMS.

Windows NT's VMS heritage does not extend to that particular VMS
misfeature. The lack of 64-bit versions of some APIs on VMS is simply due
to shortcuts taken by DEC to get 64-bit VMS out faster, which have never
been caught up. The rather elaborate VMS API definitions, in terms of
memory block sizes rather than calls in a programming language, made it
harder to create 64-bit APIs.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Michael S on Mon Sep 30 09:24:03 2024

Michael S wrote:

On Wed, 25 Sep 2024 13:56:40 -0400
EricP <[email protected]> wrote:

Terje Mathisen wrote:

Kent Dickey wrote:

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
    return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and
crash).

For x86-64 clang 19.1.0:

add2:
        add     edi, esi
        jo      .LBB0_1
        mov     eax, edi
        ret
.LBB0_1:
        ud1     eax, dword ptr [eax]

This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

But x86 has an instruction to trap on overflow directly: INTO.
It's one byte.
And it doesn't use it.

GCC x86-64 14.2 is even worse:

add2:
        sub     rsp, 8
        call    __addvsi3
        add     rsp, 8
        ret

It calls a routine to do all additions which might overflow, and
that routine calls assert() if an overflow occurs.

The CPU has a trap-on-overflow instruction exactly for this case
(to crash on detecting an overflow), and compilers don't even use it.

So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.

You can only compile in INTO opcodes if you can guarantee that the
INT 4 (INTO) trap vector will always be set to a proper handler,
and since that isn't part of the ABI, compilers can't depend on it?

I do agree that it would be nice if it did work, barring that clang
is doing the best possible alternative, at close to zero cost
except for the useless branch predictor table entry wastage.

Terje

On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX
prefix.

Single-byte form of INTO reassigned. Dual-byte form (CD 04) is here.

The INTO (CE) instruction conditionally generates an exception when
overflow flag is set to exception vector 4.

INT 4 (CD 04) unconditionally generates an exception to vector 4.

To get the same behavior in x64 you would have to

        ADD blah
        JNO Ok1
        INT 4
Ok1:

which is less code bytes than jumping to a call to an error routine
but has the issue that now every ADD/SUB is followed by a short rel8
forward conditional branch that will be mispredicted (predicted not taken)
the first time through every code section.

Intel documents x64 Jcc branch hint prefixes 2E = not taken, 3E = taken. However AMD does not. On x86 2E and 3E are segment override prefixes
and AMD documents them as ignored on x64.

If there is an overflow this leaves the RIP pointing right after the problem.

Or generate a JO long rel32 branch forward to a INT 4 which will be
correctly predicted as not taken unless there is an overflow.
If there is an overflow this leaves the RIP pointing far away from the
problem and you would have to trace the code paths backwards to find
where it occurred.

And all this crap goes away if an ISA has ADDV Add Trap Signed Overflow.
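
At the C level, the "add, then branch on overflow to a trap" pattern discussed above can be written explicitly. This is a minimal sketch assuming GCC or Clang, whose builtins __builtin_add_overflow and __builtin_trap are used here; the function names checked_add and add_or_trap are hypothetical. What the compiler emits for it (for example an ADD followed by JO to an undefined instruction) is an assumption to verify in the generated assembly.

```c
#include <stdbool.h>
#include <limits.h>

/* Stores a + b in *sum and returns true if the signed addition overflowed. */
static bool checked_add(int a, int b, int *sum)
{
    return __builtin_add_overflow(a, b, sum);
}

/* Returns a + b, or stops the program abnormally on signed overflow. */
static int add_or_trap(int a, int b)
{
    int sum;
    if (checked_add(a, b, &sum))
        __builtin_trap();       /* compiler-chosen trapping instruction */
    return sum;
}
```

Unlike -ftrapv, this makes the check visible at each call site, so it can be applied only where an overflow would actually be a bug.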

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to David Brown on Mon Sep 30 17:32:47 2024

On Mon, 30 Sep 2024 14:07:47 +0200
David Brown <[email protected]> wrote:

On 29/09/2024 21:30, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?

One is a fork of the other, I believe.

mingw-w64 was started as a fork of mingw, initially created to
support generating 64-bit binaries and because of disagreements with
the pace of development in mingw.

Also do you know if mingw
is compatible with MSVC, as long as long double is not used?

I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.

There is a reasonably defined ABI for 64-bit Windows, so I think
there will be compatibility for most things in C.

For "most things" - yes. For 'long double' - no.
In case of 'long double' mingw64 tools use their own conventions that
differ both from SysV and from Win64. But at C level behavior is
identical to x86-64 Linux.

C++ is more
complicated and much more likely to have incompatibilities.

There are approximately a hundred and one different C ABI's and
calling conventions for 32-bit Windows, since MS never actually
defined one, so things are a bit of a mess there. (DLL calling
conventions are clearer.)

I believe the two most popular ways of running "Linux-like" software
and gcc on Windows are using WSL (which is more of a virtualisation
layer),

WSL (now often referred to as WSL1) is not a virtualization layer.
WSL2 is indeed Linux running in a virtual machine, plus
integration features for convenience.

WSL1 is the worst possible place to run Linux programs that depend on
long double having higher precision. That's because when the WSL1 kernel
starts a new process it sets the precision of the x87 co-processor to
53 bits (double precision), which is different from the default setting
on just about any other x86-64 Linux. Of course, the process can change
the settings, but for that the programmer would have to be aware that
the problem exists. Which is rare.
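
One way to become aware of the problem is to measure, at run time, how many significand bits long double arithmetic really delivers. The following is a minimal sketch, not a definitive test; effective_ldbl_bits is a hypothetical name. On x86-64 Linux with default x87 settings it should report 64, and with the precision control reduced to double it should report 53.

```c
#include <float.h>

/* Hypothetical helper: counts how many significand bits long double
   arithmetic actually delivers at run time. The volatile qualifiers
   keep the compiler from folding the loop at compile time. */
static int effective_ldbl_bits(void)
{
    volatile long double one = 1.0L;
    volatile long double eps = 1.0L;
    volatile long double sum;
    int bits = 0;

    /* Halve eps until adding it to 1.0 no longer changes the result.
       The upper bound only guards against exotic formats. */
    do {
        bits++;
        eps = eps / 2.0L;
        sum = one + eps;
    } while (sum != one && bits < 4096);

    return bits;
}
```

A program that finds fewer bits than LDBL_MANT_DIG promises would then have to widen the x87 precision control itself (glibc exposes the control word through <fpu_control.h>) or avoid relying on extended precision.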

WSL2 doesn't have this problem, but it is supported only on relatively
new versions of Windows.

So, for older versions of Windows, if one wants to run Linux binaries
'as is' and get the same behavior of long double as in the original,
one is advised to run Linux in a less-integrated VM, like VirtualBox
or MS's own Hyper-V.

and mingw-64 for the compiler target (with either gcc or
clang) and msys2 as an environment and source of *nix utilities and
libraries. mingw/msys is considered old and limited (32-bit only),
while Cygwin is considered slow and clunky by many.

And cygwin console is quite inconvenient.

At least, that is my understanding.


From David Brown@21:1/5 to Michael S on Mon Sep 30 17:01:01 2024

On 30/09/2024 16:32, Michael S wrote:

On Mon, 30 Sep 2024 14:07:47 +0200
David Brown <[email protected]> wrote:

On 29/09/2024 21:30, Thomas Koenig wrote:

Tim Rentsch <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Tim Rentsch <[email protected]> schrieb:

[...]

I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.

My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.

Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.

Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?

One is a fork of the other, I believe.

mingw-w64 was started as a fork of mingw, initially created to
support generating 64-bit binaries and because of disagreements with
the pace of development in mingw.

Also do you know if mingw
is compatible with MSVC, as long as long double is not used?

I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.

There is a reasonably defined ABI for 64-bit Windows, so I think
there will be compatibility for most things in C.

For "most things" - yes. For 'long double' - no.
In case of 'long double' mingw64 tools use their own conventions that
differ both from SysV and from Win64.

I did not know that - thanks for that information. (Though hopefully
I'll not have to do enough C programming on Windows to find the
information useful!)

But at C level behavior is
identical to x86-64 Linux.

C++ is more
complicated and much more likely to have incompatibilities.

There are approximately a hundred and one different C ABI's and
calling conventions for 32-bit Windows, since MS never actually
defined one, so things are a bit of a mess there. (DLL calling
conventions are clearer.)

I believe the two most popular ways of running "Linux-like" software
and gcc on Windows are using WSL (which is more of a virtualisation
layer),

WSL (now often referred to as WSL1) is not a virtualization layer.
WSL2 is indeed Linux running in a virtual machine, plus
integration features for convenience.

I was thinking of current WSL, which would presumably be WSL2. I have
not used it myself, but I have a customer who does, and who simply calls
it WSL. But perhaps I should use "WSL2" to be clear.

WSL1 is the worst possible place to run Linux programs that depend on
long double having higher precision. That's because when the WSL1 kernel
starts a new process it sets the precision of the x87 co-processor to
53 bits (double precision), which is different from the default setting
on just about any other x86-64 Linux. Of course, the process can change
the settings, but for that the programmer would have to be aware that
the problem exists. Which is rare.

WSL2 doesn't have this problem, but it is supported only on relatively
new versions of Windows.

So, for older versions of Windows, if one wants to run Linux binaries
'as is' and get the same behavior of long double as in the original,
one is advised to run Linux in a less-integrated VM, like VirtualBox
or MS's own Hyper-V.

I just run them on Linux :-)

But while none of this affects me, it might affect customers or others
that I deal with, so it is good to know.

and mingw-64 for the compiler target (with either gcc or
clang) and msys2 as an environment and source of *nix utilities and
libraries. mingw/msys is considered old and limited (32-bit only),
while Cygwin is considered slow and clunky by many.

And cygwin console is quite inconvenient.

At least, that is my understanding.


From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Oct 1 00:33:01 2024

On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

They are entirely 64-bit.

<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so they
need a 64-bit integer to be expressed easily. But the API call to
get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined in a
particular way. For 32-bit Windows, that's sort of understandable;
32-bit Windows is, well, 32-bit, so you might not expect to be
able to use 64-bit integers. But if you use the same API in 64-bit
Windows, it still gives you the pair of numbers, rather than just
a nice simple 64-bit number. While this made some kind of sense on
32-bit Windows, it makes no sense at all on 64-bit Windows, since
64-bit Windows can, by definition, use 64-bit numbers.


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Oct 1 03:53:01 2024

On Tue, 1 Oct 2024 0:33:01 +0000, Lawrence D'Oliveiro wrote:

On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

They are entirely 64-bit.

<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so they
need a 64-bit integer to be expressed easily. But the API call to
get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined in a
particular way.

As long as you can embed the API function in a macro that performs
said combining, it's all OK.

uint64_t filesize = GetFileSize64( whatever );
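
The combining step itself is small. A minimal sketch in plain C, where combine_file_size is a hypothetical helper and not part of any Windows header:

```c
#include <stdint.h>

/* Hypothetical helper: joins the two 32-bit halves that the classic
   GetFileSize() call reports into one 64-bit size. */
static uint64_t combine_file_size(uint32_t low, uint32_t high)
{
    /* Widen before shifting so the high half is not shifted out. */
    return ((uint64_t)high << 32) | low;
}
```

On Windows, low would be the return value of GetFileSize(handle, &high) and high the DWORD it fills in; error handling (INVALID_FILE_SIZE plus GetLastError) is left out of this sketch.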

For 32-bit Windows, that's sort of understandable;
32-bit Windows is, well, 32-bit, so you might not expect to be
able to use 64-bit integers. But if you use the same API in 64-bit
Windows, it still gives you the pair of numbers, rather than just
a nice simple 64-bit number. While this made some kind of sense on
32-bit Windows, it makes no sense at all on 64-bit Windows, since
64-bit Windows can, by definition, use 64-bit numbers.

Why would you want a 32-bit application to be able to use files
of 2^64-bits in size ???


From Terje Mathisen@21:1/5 to All on Tue Oct 1 07:35:17 2024

MitchAlsup1 wrote:

On Tue, 1 Oct 2024 0:33:01 +0000, Lawrence D'Oliveiro wrote:

On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

Also the fact that those “64-bit” APIs are not entirely
“64-bit” ...

They are entirely 64-bit.

<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

    Another example; Win32 has a function for getting the size of a
    file. File sizes on Windows are limited to 2^64 bytes, and so they
    need a 64-bit integer to be expressed easily. But the API call to
    get the size of a file doesn't give you a 64-bit value. Instead,
    it gives you a pair of 32-bit values that have to be combined in a
    particular way.

As long as you can embed the API function in a macro that performs
said combining, it's all OK.

    uint64_t filesize = GetFileSize64( whatever );

As I wrote in my other post, the API is in fact directly usable as-is on
any compiler with 64-bit support.

                    For 32-bit Windows, that's sort of understandable;
    32-bit Windows is, well, 32-bit, so you might not expect to be
    able to use 64-bit integers. But if you use the same API in 64-bit
    Windows, it still gives you the pair of numbers, rather than just
    a nice simple 64-bit number. While this made some kind of sense on
    32-bit Windows, it makes no sense at all on 64-bit Windows, since
    64-bit Windows can, by definition, use 64-bit numbers.

Why would you want a 32-bit application to be able to use files
of 2^64-bits in size ???

Why not? I had both partition and DVD image files larger than 4 GB well
before I had Win64, but I still wanted to be able to get my du (disk
use) utility to work.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue Oct 1 06:04:17 2024

On Tue, 1 Oct 2024 07:30:45 +0200, Terje Mathisen wrote:

The first issue here is that the original API defined the return value
as 32-bit ...

The POSIX effort started around 1988, and one of its defining
characteristics is the use of symbolic type names like “size_t”, “off_t”
and “time_t”, and not assuming particular sizes for them. Windows NT had
plenty of time to learn from that example; why didn’t it?
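
As an illustration of that style, here is a minimal sketch of a file-size query written against the POSIX names only. file_size_bytes is a hypothetical helper; the code never assumes how wide off_t is.

```c
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: returns the size of the file at path in bytes,
   or -1 if stat() fails. st_size has type off_t, whatever width the
   platform chose; intmax_t carries it without assuming that width. */
static intmax_t file_size_bytes(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return -1;
    }
    return (intmax_t)st.st_size;
}
```

On 32-bit glibc systems off_t is only 64 bits wide when the program is built with _FILE_OFFSET_BITS=64, so the symbolic name alone does not guarantee large-file support.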

Turns out every single Win32-system in existence/in regular use is
little endian ...

Of which there are not many. Wasn’t Windows NT supposed to be some kind of
“portable” OS? Wasn’t it supposed to run on big-endian architectures too,
like POWER, MIPS and SPARC?

Only all those ports failed. In fact, every single non-x86 port of Windows
has failed.

Yeah, not too pretty, but also not a real/important problem.

“Death of a thousand cuts”. And now Microsoft is left scrambling,
desperately trying to turn Windows into Linux.


From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Oct 1 07:30:45 2024

Lawrence D'Oliveiro wrote:

On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

Also the fact that those “64-bit” APIs are not entirely “64-bit” ...

They are entirely 64-bit.

<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so they
need a 64-bit integer to be expressed easily. But the API call to
get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined in a
particular way. For 32-bit Windows, that's sort of understandable;
32-bit Windows is, well, 32-bit, so you might not expect to be
able to use 64-bit integers. But if you use the same API in 64-bit
Windows, it still gives you the pair of numbers, rather than just
a nice simple 64-bit number. While this made some kind of sense on
32-bit Windows, it makes no sense at all on 64-bit Windows, since
64-bit Windows can, by definition, use 64-bit numbers.

The first issue here is that the original API defined the return value
as 32-bit, with an optional pointer to another variable to receive the
high part, but they came up with the GetFileSizeEx() function decades
ago, and that one gets the file size as a LARGE_INTEGER. Nobody uses
anything else afair.

The second potential issue is with the definition of LARGE_INTEGER:

It is, as you say, defined as a pair of 32-bit values, overlaid with a
LONGLONG, which can only work on a little-endian CPU since the low part
is followed by the high, right?

Turns out every single Win32-system in existence/in regular use is
little endian, so that is much less of a problem, and the docs tell you to

"The LARGE_INTEGER structure is actually a union. If your compiler has
built-in support for 64-bit integers, use the QuadPart member to store
the 64-bit integer. Otherwise, use the LowPart and HighPart members to
store the 64-bit integer."
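
The overlay can be sketched in plain C as follows. large_int and make_large_int are hypothetical stand-ins whose member names mirror the ones in the quoted docs; this is not the SDK definition, and the big-endian branch is only an assumption about how a port could keep the overlay valid.

```c
#include <stdint.h>

/* Hypothetical stand-in for LARGE_INTEGER: two 32-bit halves overlaid
   with one 64-bit integer. */
typedef union {
    struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        int32_t  HighPart;  /* big-endian: high half comes first */
        uint32_t LowPart;
#else
        uint32_t LowPart;   /* little-endian: low half comes first */
        int32_t  HighPart;
#endif
    } u;
    int64_t QuadPart;
} large_int;

/* Stores a 64-bit value through QuadPart so that the halves can be
   read back through u.LowPart and u.HighPart. */
static large_int make_large_int(int64_t value)
{
    large_int li;
    li.QuadPart = value;
    return li;
}
```

Without the conditional, the halves would only line up with QuadPart on a little-endian machine, which is the point being made above.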

Yeah, not too pretty, but also not a real/important problem.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue Oct 1 06:09:53 2024

On Tue, 1 Oct 2024 07:35:17 +0200, Terje Mathisen wrote:

I had both partition and DVD image files larger than 4 GB well
before I had Win64 ...

DVD is an interesting case. If you look at the file structure,
individual .VOB files typically don’t exceed about 1GiB in size, but
successive segments of a title are required to be physically laid out next
to each other on the disc, so a player can actually forget about the
filesystem once it gets started, and just read successive physical
sectors.


From Lawrence D'Oliveiro@21:1/5 to All on Tue Oct 1 06:05:36 2024

On Tue, 1 Oct 2024 03:53:01 +0000, MitchAlsup1 wrote:

Why would you want a 32-bit application to be able to use files of
2^64-bits in size ???

Because multi-gigabyte files were becoming commonplace 20-25 years ago,
before 64-bit CPUs did the same.


From Michael S@21:1/5 to Terje Mathisen on Tue Oct 1 10:57:25 2024

On Tue, 1 Oct 2024 07:30:45 +0200
Terje Mathisen <[email protected]> wrote:

Lawrence D'Oliveiro wrote:

On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:

On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

Also the fact that those “64-bit” APIs are not entirely
“64-bit” ...

They are entirely 64-bit.

<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:

Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so
they need a 64-bit integer to be expressed easily. But the API call
to get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined
in a particular way. For 32-bit Windows, that's sort of
understandable; 32-bit Windows is, well, 32-bit, so you might not
expect to be able to use 64-bit integers. But if you use the same
API in 64-bit Windows, it still gives you the pair of numbers,
rather than just a nice simple 64-bit number. While this made some
kind of sense on 32-bit Windows, it makes no sense at all on 64-bit
Windows, since 64-bit Windows can, by definition, use 64-bit
numbers.

The first issue here is that the original API defined the return
value as 32-bit, with an optional pointer to another variable to
receive the high part, but they came up with the GetFileSizeEx()
function decades ago, and that one gets the file size as a
LARGE_INTEGER. Nobody uses anything else afair.

The second potential issue is with the definition of LARGE_INTEGER:

It is, as you say, defined as a pair of 32-bit values, overlaid
with a LONGLONG, which can only work on a little-endian CPU since the
low part is followed by the high, right?

It seems to me that back when Windows still supported Big Endian
targets (PPC) there was an #ifdef in the relevant header, so on PPC the
order of the low and high parts was the opposite of that on the other
targets. The remnants of that are more likely to be found in DDK headers
than in Windows SDK headers.

Turns out every single Win32-system in existence/in regular use is
little endian, so that is much less of a problem, and the docs tell
you to

"The LARGE_INTEGER structure is actually a union. If your compiler
has built-in support for 64-bit integers, use the QuadPart member to
store the 64-bit integer. Otherwise, use the LowPart and HighPart
members to store the 64-bit integer."

Yeah, not too pretty, but also not a real/important problem.

Terje


From John Dallman@21:1/5 to D'Oliveiro on Tue Oct 1 10:12:00 2024

In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
it supposed to run on big-endian architectures too, like POWER, MIPS
and SPARC?

It did. I have no experience with Windows NT on SPARC or PowerPC, but the
OS ran fine on MIPS. It was a commercial failure, because MIPS didn't
keep up with the performance growth of x86.

PowerPC did for a while, but the company interested in NT on PowerPC was
IBM, and their hardware prices were a /lot/ higher than x86 prices. They
didn't see that as a problem, but all the potential customers did.

John


From Michael S@21:1/5 to John Dallman on Tue Oct 1 12:34:26 2024

On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:

In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
it supposed to run on big-endian architectures too, like POWER, MIPS
and SPARC?

It did. I have no experience with Windows NT on SPARC or PowerPC,

Did WinNT on SPARC ever ship? I don't think so.

but
the OS ran fine on MIPS. It was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.

Wasn't MIPS edition of WinNT Little Endian?

PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.

John

Now I wonder what endianness was used by the PowerPC variant of WinNT.
In theory, PPC/POWER could run in Little Endian mode, but before v3 of
POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
mistaken, the difference was that in LE mode there was no support for
unaligned memory accesses.


From John Dallman@21:1/5 to Michael S on Tue Oct 1 12:28:00 2024

In article <[email protected]>, [email protected] (Michael S) wrote:

On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:

[email protected]d (Lawrence D'Oliveiro) wrote:

Wasn't Windows NT supposed to be some kind of _portable_ OS?
Wasn't it supposed to run on big-endian architectures too, like
POWER, MIPS and SPARC?

It did. I have no experience with Windows NT on SPARC or PowerPC,

Did WinNT on SPARC ever ship? I don't think so.

No. Intergraph worked on a port, but it never shipped, and neither did Intergraph's SPARC-based hardware.

Wasn't MIPS edition of WinNT Little Endian?

Yes.

Now I wonder what endianness was used by the PowerPC variant of WinNT.
In theory, PPC/POWER could run in Little Endian mode, but before v3
of POWER ISA it wasn't as full-featured as Big Endian mode. If I am
not mistaken, the difference was that in LE mode there was no support
for unaligned memory accesses.

WinNT ran it little-endian according to <https://en.wikipedia.org/wiki/PowerPC#Endian_modes>

I neglected the bi-endianness of PowerPC and MIPS. SPARC was purely
big-endian until SPARC V9.

John


From Michael S@21:1/5 to Thomas Koenig on Tue Oct 1 19:08:05 2024

On Tue, 1 Oct 2024 15:31:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

John Dallman <[email protected]> schrieb:

In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

Wasn't Windows NT supposed to be some kind of _portable_ OS?
Wasn't it supposed to run on big-endian architectures too, like
POWER, MIPS and SPARC?

It did. I have no experience with Windows NT on SPARC or PowerPC,
but the OS ran fine on MIPS.

There was also a Windows for Alpha. A German computer chain, Vobis,
tried to sell two models with that, but it flopped.

Alpha is Little Endian.
It seems that SPARC stands out as the only strictly Big Endian
architecture for which there was a serious attempt to port WinNT.
But not serious enough, it seems.

Now, thinking about it, I have a question.
Did Intergraph really try to port WinNT to SPARC v8, which was strictly
BE, or were they porting to the emerging SPARC v9? The latter supports LE
data access. It seems that at that moment (~1993) there were no production
SPARC V9 chips, but the V9 ISA spec had already been published.


From Thomas Koenig@21:1/5 to John Dallman on Tue Oct 1 15:31:55 2024

John Dallman <[email protected]> schrieb:

In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
it supposed to run on big-endian architectures too, like POWER, MIPS
and SPARC?

It did. I have no experience with Windows NT on SPARC or PowerPC, but the
OS ran fine on MIPS.

There was also a Windows for Alpha. A German computer chain, Vobis,
tried to sell two models with that, but it flopped.


From Anton Ertl@21:1/5 to Michael S on Tue Oct 1 16:26:25 2024

Michael S <[email protected]> writes:

On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:

PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.

The idea of ARC (MIPS) and PowerPC (which was not just IBM) was that
they would succeed the IA-32-based PC. Given the assumed (and, around
1990, actual) performance superiority of RISCs over IA-32, this looked
plausible. However, even with Alpha, which was often superior in
performance throughout the 1990s, and for which there were cheap
offerings (but without performance edge; e.g., I once was playing with
the idea of buying a 21164PC-based PC164SX system, where the CPU+board
(with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998; but I went with
a K6-2, because I played some DOS games:-). The cheap 164SX offer may
have been a clearance sale, however.

In any case, the performance advantage of the RISCs vanished during
the 1990s, the RISCs never had wide ISV support, and so WNT on RISCs
flopped.

Now I wonder what endianness was used by the PowerPC variant of WinNT.
In theory, PPC/POWER could run in Little Endian mode, but before v3 of
POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
mistaken, the difference was that in LE mode there was no support for
unaligned memory accesses.

Given that MIPS and Alpha require natural alignment, little-endian
PowerPC at the time was as full-featured as the other RISCs.

Alignment issues may have been a problem with the RISC ports, though.
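
The usual portable workaround on targets that require natural alignment is to avoid dereferencing a misaligned pointer at all. A minimal sketch, with load_u32_unaligned as a hypothetical helper name:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a 32-bit value from an address that may
   not be 4-byte aligned. memcpy lets the compiler pick a safe
   instruction sequence for the target instead of issuing a load that
   could trap on a strict-alignment machine. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}
```

Code ported from x86 that casts a byte pointer to uint32_t * and dereferences it is the kind of thing that would have needed this treatment.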

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From MitchAlsup1@21:1/5 to Anton Ertl on Tue Oct 1 18:15:31 2024

On Tue, 1 Oct 2024 16:26:25 +0000, Anton Ertl wrote:

Michael S <[email protected]> writes:

On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:

PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.

The idea of ARC (MIPS) and PowerPC (which was not just IBM) was that
they would succeed the IA-32-based PC. Given the assumed (and, around
1990, actual) performance superiority of RISCs over IA-32, this looked
plausible. However, even with Alpha, which was often superior in
performance throughout the 1990s, and for which there were cheap
offerings (but without performance edge; e.g., I once was playing with
the idea of buying a 21164PC-based PC164SX system, where the CPU+board
(with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998; but I went with
a K6-2, because I played some DOS games:-). The cheap 164SX offer may
have been a clearance sale, however.

In any case, the performance advantage of the RISCs vanished during
the 1990s, the RISCs never had wide ISV support, and so WNT on RISCs
flopped.

Pentium Pro sounded the death knell for 1st gen RISC architectures.
Only ARM found a new and expanding market.

Now I wonder what endianness was used by the PowerPC variant of WinNT.
In theory, PPC/POWER could run in Little Endian mode, but before v3 of
POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
mistaken, the difference was that in LE mode there was no support for
unaligned memory accesses.

Given that MIPS and Alpha require natural alignment, little-endian
PowerPC at the time was as full-featured as the other RISCs.

Alignment issues may have been a problem with the RISC ports, though.

One of the reasons to do misaligned accesses in HW.

- anton


From jseigh@21:1/5 to Anton Ertl on Tue Oct 1 19:19:14 2024

On 10/1/24 12:26, Anton Ertl wrote:

Michael S <[email protected]> writes:

On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:

PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.

The idea of ARC (MIPS) and PowerPC (which was not just IBM) was that
they would succeed the IA-32-based PC. Given the assumed (and, around
1990, actual) performance superiority of RISCs over IA-32, this looked
plausible. However, even with Alpha, which was often superior in
performance throughout the 1990s, and for which there were cheap
offerings (but without performance edge; e.g., I once was playing with
the idea of buying a 21164PC-based PC164SX system, where the CPU+board
(with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998; but I went with
a K6-2, because I played some DOS games:-). The cheap 164SX offer may
have been a clearance sale, however.

My impression at the time was that, given the ppc was 1/3 the cost of
the intel processors, they would have destroyed Intel.
OS/2 for the ppc was a no show and IBM didn't believe AIX for ppc
would be popular enough so they canned it. Which kind of left
Apple holding the bag because I believe the sell to Apple was that
the ppc was going to be produced in such volume that unit
prices would have dropped considerably. One of the many stories
of IBM snatching defeat from the jaws of victory.

Joe Seigh


From John Dallman@21:1/5 to Michael S on Wed Oct 2 14:58:00 2024

In article <[email protected]>, [email protected] (Michael S) wrote:

Did Intergraph really try to port WinNT to SPARC v8 that was
strictly BE or they were porting to emerging SPARC v9?

I found the press releases:

<http://ftp.lanet.lv/ftp/sun-info/sunflash/1993/Jul/55.11-Sun-Intergraph:-SPARC-and-Windows-NT>

PALO ALTO, Calif., July 7, 1993 -- Sun Microsystems Computer
Corporation (SMCC) and Intergraph Corporation announced today that they
have signed a development agreement that will accelerate delivery of
future generations of SPARC microprocessors. In addition, Intergraph
will port Microsoft Corporation's Windows NT operating system to future
SPARC microprocessors.

Under terms of the agreement, Intergraph's Advanced Processor Division
(APD), located here, will develop high-end 64-bit SPARC microprocessors
jointly with SMCC's SPARC Technology Business (STB). Intergraph and
SMCC both have the right to use these processors in their system level
products, while STB will make these components available to the open
market.

As part of the agreement, APD will assume responsibility for porting
Microsoft's Windows NT to Intergraph systems using future versions of
SPARC processors. The APD port will support the "little-endian" byte
ordering feature to be included in future SPARC implementations. This
means that Windows NT itself and Windows NT applications will
transition easily to the SPARC architecture. Solaris will continue to
support "big-endian" byte ordering, as defined in current and future
versions of the SPARC architecture, which to date runs more than 7500
hardware and software solutions.

There's more, but it is clear that they were going to use little-endian
for Windows NT. At the time, Intergraph were selling their (ex-Fairchild)
Clipper CPU in Unix CAD workstations with some degree of success, so this
agreement wasn't a crazy idea. They had also ported Windows NT to Clipper,
so they had some idea of what they were about.

I discovered today that a contact who works for a product that started
life at Intergraph was working there during the relevant period. They
didn't work on this, but they're asking around for anyone who did.

It seems as if NT has only ever been little-endian.

John


From Michael S@21:1/5 to John Dallman on Wed Oct 2 17:05:53 2024

On Wed, 2 Oct 2024 14:58 +0100 (BST)
[email protected] (John Dallman) wrote:

<snip>

It seems as if NT has only ever been little-endian.

John

Thank you.


From Lawrence D'Oliveiro@21:1/5 to John Dallman on Thu Oct 3 00:07:17 2024

On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

[Windows NT on MIPS] was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.

It was NT that was the commercial failure, not MIPS. MIPS found a niche in
the embedded world, and went on to outsell x86 by a factor of 3:1 or so.

We know this because a lot of those embedded devices ran Linux.

PowerPC did for a while, but the company interested in NT on PowerPC was
IBM, and their hardware prices were a /lot/ higher than x86 prices. They
didn't see that as a problem, but all the potential customers did.

PowerPC got rolled back into POWER, near as I can tell. And that continues
to sell today -- you see some POWER machines not far from the top of the
current Top500 list of the world’s most powerful supercomputers. That
shows there is a viable market for the products.

And of course they, too, run Linux.


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu Oct 3 00:11:02 2024

On Tue, 01 Oct 2024 16:26:25 GMT, Anton Ertl wrote:

In any case, the performance advantage of the RISCs vanished during the
1990s ...

Only for as long as Intel could afford to spend 10× as much on developing
each chip generation as the RISC vendors could. It could because it could
reap 10× the profits in return, but it can’t any more. Which is why you
see ARM coming to the fore, and RISC-V appearing as the upstart
challenger.

It’s a whole new ballgame now, and x86 is starting to look a little long
in the tooth. Which is why even Microsoft recognizes it needs to spread
its eggs outside that one basket, with its ongoing attempts to promote
Windows on ARM (without much success, so far).

... the RISCs never had wide ISV support, and so WNT on RISCs
flopped.

As I said above, RISC is still around and dominating the computing world. They’re not running Windows, because it was Windows that could not adapt
well to them. Instead, they are running Linux.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 09:13:21 2024

On 03/10/2024 02:07, Lawrence D'Oliveiro wrote:

On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

[Windows NT on MIPS] was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.

It was NT that was the commercial failure, not MIPS. MIPS found a niche in the embedded world, and went on to outsell x86 by a factor of 3:1 or so.

The key markets for MIPS were network devices (managed switches,
routers, small Wifi/NAT routers, etc.) and multimedia devices (smart
TVs, Bluray players, set-top boxes, etc.).

These have mostly been overtaken by ARM these days.

We know this because a lot of those embedded devices ran Linux.

Most of these ran Linux, a few had RTOS's.

PowerPC did for a while, but the company interested in NT on PowerPC was
IBM, and their hardware prices were a /lot/ higher than x86 prices. They
didn't see that as a problem, but all the potential customers did.

PowerPC got rolled back into POWER, near as I can tell. And that continues
to sell today -- you see some POWER machines not far from the top of the current Top500 list of the world’s most powerful supercomputers. That
shows there is a viable market for the products.

And of course they, too, run Linux.

PowerPC also moved into the embedded world, especially in the automotive industry and networking, as a replacement for m68k and Coldfire for
Motorola (then Freescale, now part of NXP). PowerPC-based
microcontrollers are still a big part of NXP's high reliability and
safety oriented lineups for things like engine control. Those things,
of course, do /not/ run Linux. (But they most certainly don't run
Windows :-) )

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 06:57:54 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 01 Oct 2024 16:26:25 GMT, Anton Ertl wrote:

In any case, the performance advantage of the RISCs vanished during the
1990s ...

Only for as long as Intel could afford to spend 10× as much on developing each chip generation as the RISC vendors could. It could because it could reap 10× the profits in return, but it can’t any more.

Nexgen and AMD were smaller companies than Intel, DEC, Sun, HP, or the
AIM companies, and yet managed to produce CPUs that were competitive
with Intel's CPUs despite suffering from the CISC baggage. If the
RISC companies failed to keep up, they only have themselves to blame.
It seems to me that a number of RISC companies had difficulties with
managing the larger projects that the growing die areas allowed.

Another issue was the marketing. The RISC companies did not want to
damage their existing high-priced workstation and server business by
providing cheap CPUs for the masses, and yet had to do that in order
to displace Intel, AMD, and Cyrix. AMD and Cyrix did not have that
problem.

Which is why you
see ARM coming to the fore, and RISC-V appearing as the upstart
challenger.

ARM did not have the marketing problem, either, because they were not
competing in the workstation/server market. They developed their
business model of selling cores (and more) for SoCs for portable
computing, and expanded from that.

... the RISCs never had wide ISV support, and so WNT on RISCs
flopped.

As I said above, RISC is still around and dominating the computing world. They’re not running Windows, because it was Windows that could not adapt well to them. Instead, they are running Linux.

Dominating? In the smartphone and tablet world, yes. In the embedded
world, too. In laptops, desktops and servers, no.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 15:55:57 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

[Windows NT on MIPS] was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.

It was NT that was the commercial failure, not MIPS. MIPS found a niche in the embedded world, and went on to outsell x86 by a factor of 3:1 or so.

Note that MIPS CPUs were used in SGI supercomputers and high-end
graphics workstations.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Scott Lurndal on Thu Oct 3 17:18:22 2024

Scott Lurndal <[email protected]> schrieb:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:

[Windows NT on MIPS] was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.

It was NT that was the commercial failure, not MIPS. MIPS found a niche in the embedded world, and went on to outsell x86 by a factor of 3:1 or so.

Note that MIPS CPUs were used in SGI supercomputers and high-end
graphics workstations.

I worked on one of the SGI machines for a time; it was the
successor of the Cray, which had been decommissioned before I started
work at the company.

It wasn't economical to keep around, so it got decommissioned
when the next big reorganization (and big crisis) hit, and
the staff then retired.

For a time, the only machine for CFD applications at that company
was an HP Itanium box on my desk...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to [email protected] on Fri Sep 27 19:52:58 2024

On Wed, 25 Sep 2024 12:54:18 -0400, EricP
<[email protected]> wrote:

For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit, which is not what programmers need,
as it triggers many false positives, so people turn it off.

Things like that are why some companies have a code policy that allows
just one function per file.

Still a problem if you need <whatever the relevant flag does> only in
one or a few places.
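
One way to get overflow detection in only a few places, rather than per compilation unit, is to check individual operations. A minimal sketch, assuming a GCC or Clang compiler that provides the non-ISO builtin `__builtin_add_overflow`; the helper names `checked_add_int` and `wrapping_add_int` are hypothetical and chosen only for illustration:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: add two ints and report overflow to the caller
   instead of trapping. Relies on the GCC/Clang builtin
   __builtin_add_overflow, which is not part of ISO C. */
bool checked_add_int(int a, int b, int *sum)
{
    /* The builtin returns true when the exact result does not fit in *sum. */
    return !__builtin_add_overflow(a, b, sum);
}

/* Hypothetical helper for the places where wraparound is intended:
   the builtin stores the wrapped result, and the overflow flag is ignored. */
int wrapping_add_int(int a, int b)
{
    int result;
    (void)__builtin_add_overflow(a, b, &result);
    return result;
}
```

With helpers like these, checked and unchecked additions can sit side by side in the same function, so no flag has to apply to the whole file.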

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to EricP on Sat Sep 28 02:25:21 2024

On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:

I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.

Given that GCC offers more features and generates better code than MSVC,
the money may not matter to your business, but the quality of the product
will.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to David Brown on Thu Oct 3 23:49:00 2024

In article <vdlg6h$3kq50$[email protected]>, [email protected]
(David Brown) wrote:

On 03/10/2024 02:07, Lawrence D'Oliveiro wrote:

It was NT that was the commercial failure, not MIPS. MIPS found a
niche in the embedded world, and went on to outsell x86 by a
factor of 3:1 or so.

The key markets for MIPS were network devices (managed switches,
routers, small Wifi/NAT routers, etc.) and multimedia devices
(smart TVs, Bluray players, set-top boxes, etc.).

These have mostly been overtaken by ARM these days.

And MIPS, the company, has abandoned its own architecture in favour of
RISC-V. <https://mips.com/>

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Thu Oct 3 23:49:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

If the RISC companies failed to keep up, they only have themselves to
blame. It seems to me that a number of RISC companies had difficulties
with managing the larger projects that the growing die areas allowed.

Another contributing factor was Itanium, which was quite successful at disrupting the development cycles of the RISC architectures. Of the five
that I worked with:

Alpha suffered from DEC's mis-management, which led to DEC being taken
over by Compaq. They killed Alpha when Itanium first began to work, and
before it was clear that it was a turkey.

PA-RISC was intended by HP to be replaced by Itanium. They managed that,
but their success was limited because Linux on x86-64 was so much more cost-effective.

IBM kept POWER development going through the Itanium period, which is a significant reason why it's still going.

SGI went into Itanium hard and neglected MIPS development, which never recovered. It had been losing in the performance race anyway.

Sun kept SPARC development going, but made a different mistake, by
spreading their development resources over too many projects. The ones
that succeeded did so too slowly, and they fell behind. Also, Linux ate
their web-infrastructure market rather quickly.

Linux could not have had the success it did without the large range of
powerful and cheap hardware designed to run Windows.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Oct 4 00:48:43 2024

On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

If the RISC companies failed to keep up, they only have themselves to
blame.

That’s all past history, anyway. RISC very much rules today, and it is x86 that is struggling to keep up.

Another issue was the marketing. The RISC companies did not want to
damage their existing high-priced workstation and server business by providing cheap CPUs for the masses ...

There was one RISC family that did indeed provide cheap CPUs for the
masses, even more so than x86, and that was ARM.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Oct 4 00:56:01 2024

On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

Another contributing factor was Itanium, which was quite successful at disrupting the development cycles of the RISC architectures. Of the five
that I worked with [that all failed, except one] ...

That’s a pretty depressing list. ;) Except

IBM kept POWER development going through the Itanium period, which is a significant reason why it's still going.

Given all of IBM’s missteps, it’s mildly surprising they got that one right. Even a stopped clock is right once a day ...

SGI went into Itanium hard and neglected MIPS development, which never recovered. It had been losing in the performance race anyway.

SGI decided to embrace the platform that was eating their market, and try
to sell Windows NT boxes. Trouble is, those NT boxes, while only a
fraction of the cost of an IRIX-based product, still cost about 3× what
other NT machines were going for.

Sun kept SPARC development going, but made a different mistake, by
spreading their development resources over too many projects. The ones
that succeeded did so too slowly, and they fell behind. Also, Linux ate
their web-infrastructure market rather quickly.

They could still have sold SPARC hardware running Linux. I can remember comments saying Linux ran better on that hardware than Sun’s own SunOS/Solaris did.

Linux could not have had the success it did without the large range of powerful and cheap hardware designed to run Windows.

Linux succeeded by not having all its eggs in one basket. It ran on
everything.

Which is why it is now running rings around Windows, as Microsoft
struggles to dig itself out of the x86 dead-end niche.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Oct 4 00:46:55 2024

On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

And MIPS, the company, has abandoned its own architecture in favour of RISC-V. <https://mips.com/>

And that will never run Windows, either.

But it does run Linux.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to [email protected] on Fri Oct 4 00:23:12 2024

On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

If the RISC companies failed to keep up, they only have themselves to
blame.

That’s all past history, anyway. RISC very much rules today, and it is x86 that is struggling to keep up.

You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.

Another issue was the marketing. The RISC companies did not want to
damage their existing high-priced workstation and server business by
providing cheap CPUs for the masses ...

There was one RISC family that did indeed provide cheap CPUs for the
masses, even more so than x86, and that was ARM.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to George Neuner on Fri Oct 4 07:05:34 2024

George Neuner <[email protected]> writes:

You are, of course, aware that the complex "x86" instruction set is an illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.

Repeating nonsense does not make it any truer, and this nonsense has
been repeated since at least the Pentium Pro (1995), maybe already
since the 486 (1989). CISC and RISC are about the instruction set,
not about the implementation. And even if you look at the
implementation, it's not true: The P6 has microinstructions that are
~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
The K7 has load-store microinstructions; RISCs don't have that.

In more recent CPUs, AMD tends to work with macro-instructions between
the decoder and the reorder buffer (i.e., in the part that in the
Pentium Pro may have been used as the justification for the RISC
claim); these macro instructions are load-and-op and read-modify-write instructions.

John Mashey has written about the difference between CISC and RISC
repeatedly <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>, and
he gives good criteria for classifying instruction sets as RISC or
CISC, and by his criteria the 80286 and IA-32 instruction sets of the
Pentium Pro clearly both are CISCs. I have recently <[email protected]> used his criteria on
instruction sets that Mashey did not classify (mostly because they
were done after his table), and by these criteria AMD64 is clearly a
CISC, while ARM A64 and RISC-V are clearly RISCs.

In searching for whether he has written something specific about
IA-32, I found <https://yarchive.net/comp/vax.html>, which is an
earlier instance of the recent discussion of whether it would have
been better for DEC to stick with VAX, do an OoO implementation and
extend the architecture to 64 bits, like Intel has done. He also discusses the problems
of IA-32 there, but mainly in pointing out how much smaller they were
than the VAX ones.

I don't agree with all of that, however. E.g., when discussing a VAX instruction similar to IA-32's REP MOVS, he considers it to be a big
advantage that the operands of REP MOVS are in registers. That
appears wrong to me; you either have to keep REP MOVS in decoding (and
thus stop decoding any later instructions) until you know the value of
that register coming out of the OoO engine, making REP MOVS a mostly serializing instruction. Or you have a separate OoO logic for REP
MOVS that keeps generating loads and stores inside the OoO engine. If
you have the latter in the VAX, it does not make much difference if
the operand is in a register or in memory. The possibility of trapping
during REP MOVS (or the VAX variant) complicates things, though: the
first part of the REP MOVS has to be committed, and the registers
written to the architectural state, and then execution has to start
again with the REP MOVS. Does not seem much harder on the VAX to me,
however.
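
The restart behaviour described above can be sketched in software. This is only an illustrative model, assuming that progress is committed to the architectural registers after every byte; the struct `move_regs`, the function `rep_movs_step`, and the `budget` parameter (which stands in for a trap arriving mid-copy) are hypothetical names, not taken from any architecture manual:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the architectural registers used by a
   REP MOVS-like instruction. */
struct move_regs {
    const unsigned char *src;   /* source pointer */
    unsigned char *dst;         /* destination pointer */
    size_t count;               /* bytes still to copy */
};

/* Run the move until it completes or until `budget` bytes have been
   copied; running out of budget models a page fault or interrupt.
   Progress is written back to *r after every byte, so executing the
   same "instruction" again resumes where it stopped.
   Returns true when the whole move has completed. */
bool rep_movs_step(struct move_regs *r, size_t budget)
{
    while (r->count > 0) {
        if (budget == 0)
            return false;       /* "trap": registers already hold the progress */
        *r->dst++ = *r->src++;  /* copy one byte and advance both pointers */
        r->count--;
        budget--;
    }
    return true;
}
```

Calling `rep_movs_step` a second time with the same `move_regs` continues the copy, which is the software analogue of returning from the trap handler to the same instruction.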

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Dallman on Fri Oct 4 15:07:17 2024

[email protected] (John Dallman) writes:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

If the RISC companies failed to keep up, they only have themselves to
blame. It seems to me that a number of RISC companies had difficulties
with managing the larger projects that the growing die areas allowed.

Another contributing factor was Itanium, which was quite successful at disrupting the development cycles of the RISC architectures.

That's the question. It seems to me that many struggled even before,
and jumped ship to IA-64 ASAP.

Alpha suffered from DEC's mis-management, which led to DEC being taken
over by Compaq. They killed Alpha when Itanium first began to work, and before it was clear that it was a turkey.

Alpha suffered before. The 21264 was late, and did not keep up in the
clock race. While they had higher clock rates than the competition up
to the EV56 (1996), the OoO EV6 appeared with a lower clock rate than
the in-order EV56 (while the OoO Pentium Pro had a higher clock rate
than the in-order Pentium available at the same time), and did not
scale as well with smaller processes as the Intel and AMD CPUs, which
were making huge strides in those years. Intel then had the 2000MHz
Pentium 4, and AMD the 1200MHz Athlon in 2000 (and 1400MHz by the time
Alpha was canceled); unfortunately, release dates for EV6 variants at
different clock rates are not documented on Wikipedia, so I cannot
make a table of Alpha vs. Intel and AMD clock rates by year.

PA-RISC was intended by HP to be replaced by Itanium. They managed that,
but their success was limited because Linux on x86-64 was so much more cost-effective.

Reportedly they thought early on that they could not afford to keep
their own line competitive, so they started the IA-64 project with
Intel. Interestingly, they also designed the OoO PA-8000, which was
introduced at the same time as the Pentium Pro, and they used the same microarchitecture until they introduced the PA-8900 almost 10 years
later, which showed a more evolutionary approach than most others used
in those years.

IBM kept POWER development going through the Itanium period, which is a significant reason why it's still going.

With the Power 4+ (2003) it also got competitive clock rates
(although, judging by the PowerPC 970, I wonder what the IPC was).

SGI went into Itanium hard and neglected MIPS development, which never recovered. It had been losing in the performance race anyway.

The follow-on project "Beast" for the R10000 failed (was canceled), and
then SGI management was happy to jump ship to Itanium, and in the
meantime they only respun the R10000 into R12000, R14000, R16000.

Sun kept SPARC development going, but made a different mistake, by
spreading their development resources over too many projects. The ones
that succeeded did so too slowly, and they fell behind.

Intel, HP, SGI and AMD went to OoO in 1995/1996, Alpha in 1998, Power
at the latest with Power3 in 1998, only Sun kept doing in-order stuff,
and took until 2011 to finally get an OoO CPU out the door in the form
of the SPARC T4 (their Rock project was also OoO, but was canceled).
They also had relatively low clock rates before that (which changed
with the SPARC T5). Fujitsu managed better, introducing the OoO
SPARC64 V in 2002, and also with competitive clock rates.

Also, Linux ate
their web-infrastructure market rather quickly.

Well, SPARC survived much longer than most others, despite being
technically a lot behind.

Power still survives, maybe only because it has a common basis with
iSeries (or whatever it is called now). Similarly, s390x survives
because of its software legacy.

Linux could not have had the success it did without the large range of powerful and cheap hardware designed to run Windows.

It was first developed on a 386, and many of the early co-developers
also had IA-32 machines. But the 386 certainly was not designed to
run Windows. The 386 project was finished before Windows 1.0 was
released in November 1985, and nobody used Windows 1.0 or 2.0, so why
would anybody design a processor for those? Windows only became
popular with 3.0 in 1990 (after the release of the 486, which was
therefore not designed for Windows, either). When I bought my first
PC (with a 486) in 1993, it ran DOS (for games) and Linux (for
everything else).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Anton Ertl on Fri Oct 4 19:44:40 2024

Anton Ertl <[email protected]> schrieb:

Alpha suffered before. The 21264 was late, and did not keep up in the
clock race.

https://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf
gives the clock rate as varying between 466 and 600 MHz, and
Wikipedia gives the clock frequency of the Pentium Pro as between
150 and 200 MHz. The Pentium II Overdrive, according to Wikipedia,
had up to 333 MHz.

Is this information wrong?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to D'Oliveiro on Fri Oct 4 21:53:00 2024

In article <vdnef0$3uaeh$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

Given all of IBM's missteps, it's mildly surprising they got that
one right. Even a stopped clock is right once a day ...

IBM doesn't often repeat a mistake. They've made all the ordinary ones,
so nowadays they usually invent new ones.

SGI decided to embrace the platform that was eating their market,
and try to sell Windows NT boxes. Trouble is, those NT boxes, while
only a fraction of the cost of an IRIX-based product, still cost
about 3× what other NT machines were going for.

SGI had a lengthy internal conflict about Windows NT. One group of pro-NT people left and founded NetPower, whose idea was to build really fast workstations running NT on MIPS. We had one for a while, and were
persuading Microsoft to fix a bug from the MIPSPro code generator for the
third time (we'd also had it on DEC MIPS/Ultrix, and SGI Irix) when the
Pentium Pro was released, and NetPower suddenly went very quiet.

Then there were the SGI Visual Workstations, which ran NT on x86. The
first generation of them were quite nice, but needed a very custom HAL,
and hence couldn't be upgraded to later versions of Windows once SGI
abandoned them.

The later generations were ordinary PCs - the one I had as a deskside for
a while was made by Mitsubishi - with an Nvidia graphics card. The only
SGI added value was their OpenGL driver, and that didn't seem to justify
the price if you were buying them.

By this time, SGI had a department of downsizing, whose job was to get
rid of departments and sites. Being an American company, this department
fought for power and budget share, and nobody inside the company seemed
to think that this would spell doom for SGI.

They could still have sold SPARC hardware running Linux. I can
remember comments saying Linux ran better on that hardware than
Sun's own SunOS/Solaris did.

They would not have faced up to that. There was an interesting incident
with Solaris on x86. Since the Linux and Solaris kernel interfaces are
somewhat similar, somebody at Sun decided to try making the Solaris
kernel capable of acting as a Linux kernel, so that they could run a
Linux userland and applications on the same machine as the Solaris
userland and applications.

So they hired some Linux people, but they didn't get good ones. A year
later, their Linux people came back to Sun with a huge set of patches
that amounted to patching a lot of the Linux kernel into Solaris, and
didn't do it at all well. The Solaris kernel people looked at it a bit
and said "Hell, no! This will destabilise Solaris!" They weren't
exaggerating.

So that year was wasted, and the project was restarted with some of the
Solaris people involved, to explain how their kernel worked. Quite a
while later you could install the 32-bit Red Hat Enterprise Linux 3.0
userland and most applications would run, but not all. This was not a
success, and was dropped.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to John Dallman on Fri Oct 4 21:37:25 2024

[email protected] (John Dallman) writes:

In article <vdnef0$3uaeh$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:

Then there were the SGI Visual Workstations, which ran NT on x86. The
first generation of them were quite nice, but needed a very custom HAL,
and hence couldn't be upgraded to later versions of Windows once SGI abandoned them.

I left in early 2000, just after it was introduced. I was using
a 2P Octane at the time (with the 24" Sony monitor).

By this time, SGI had a department of downsizing, whose job was to get
rid of departments and sites. Being an American company, this department fought for power and budget share, and nobody inside the company seemed
to think that this would spell doom for SGI.

They [SGI ed.] could still have sold SPARC hardware running Linux. I can
remember comments saying Linux ran better on that hardware than
Sun's own SunOS/Solaris did.

They would not have faced up to that.

Some of the SGI engineers were fond of loud noises, and one day took
a Sun pizza box into the parking lot with some M-80s. Got
a visit a bit later from the Secret Service, as AF-1 was next
door at Moffett that day.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Thomas Koenig on Fri Oct 4 21:48:12 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

Alpha suffered before. The 21264 was late, and did not keep up in the
clock race.

https://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf
gives the clock rate as varying between 466 and 600 MHz, and
Wikipedia gives the clock frequency of the Pentium Pro as between
150 and 200 MHz. The Pentium II Overdrive, according to Wikipedia,
had up to 333 MHz.

Is this information wrong?

No, but it misses context: The Pentium Pro was available in late 1995.
The 21264 was officially available in 1998, but when we ordered a
machine with a 500MHz 21264 (and needed it delivered before the end of
the year for budget reasons), they delivered a machine with a 21164a,
and then in the next year upgraded it to the 21264 (which probably
meant replacing the motherboard, not just the CPU package).

Intel released a 450MHz Pentium II in 1998, and the 500MHz Pentium III
on February 28, 1999. AMD released the 600MHz Athlon on June 23,
1999, and won the GHz race with the 1000MHz Athlon on March 6, 2000,
with Intel's Pentium III following on March 8. Meanwhile, the Alphas
could not keep up in MHz numbers, but I have no firm dates, only
memories from that time.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Fri Oct 4 22:49:26 2024

On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

George Neuner <[email protected]> writes:

<snipping>

I don't agree with all of that, however. E.g., when discussing a VAX instruction similar to IA-32's REP MOVS, he considers it to be a big advantage that the operands of REP MOVS are in registers. That
appears wrong to me; you either have to keep REP MOVS in decoding (and
thus stop decoding any later instructions) until you know the value of
that register coming out of the OoO engine, making REP MOVS a mostly serializing instruction. Or you have a separate OoO logic for REP
MOVS that keeps generating loads and stores inside the OoO engine. If
you have the latter in the VAX, it does not make much difference if
the operand is on a register or memory. The possibility of trapping
during REP MOVS (or the VAX variant) complicates things, though: the
first part of the REP MOVS has to be committed, and the registers
written to the architectural state, and then execution has to start
again with the REP MOVS. Does not seem much harder on the VAX to me, however.

My 66000 has a MemMove instruction consisting of a 1 word instruction,
that leaves DECODE and enters into one MEMory unit, where it proceeds
to AGEN and Read, AGEN and Write, leaving the rest of the function
units proceeding to whatever is next.

One thing I did differently here: none of the 3 registers is modified,
yet I retain the ability to take exception and re-play the instruction
from where it left off {in state never visible to the instruction
stream except via DECODE stage.}


--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Fri Oct 4 22:54:55 2024

On Fri, 4 Oct 2024 19:36:41 +0000, Chris M. Thomasson wrote:

On 10/3/2024 11:36 PM, Chris M. Thomasson wrote:

On 10/3/2024 9:23 PM, George Neuner wrote:

On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

If the RISC companies failed to keep up, they only have themselves to blame.

That’s all past history, anyway. RISC very much rules today, and it is x86
that is struggling to keep up.

You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.

Yeah. Wrt memory barriers, one is allowed to release a spinlock on "x86"
with a simple store.

The fact that one can release a spinlock using a simple store means that
it's basically load-acquire release-store.

So a load will do a load then have an implied acquire barrier.

A store will do an implied release barrier then perform the store.

How does the store know it needs to do this when the locking
instruction is more than a pipeline depth away from the
store release ?? So, Locked LD (or something) happens at
1,000,000 cycles, and the corresponding store happens at
10,000,000 cycles (9,000,000 locked).

This release behavior is okay for releasing a spinlock with a simple
store, MOV.

It may be OK to SW but it causes all kinds of grief to HW.
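
For readers who want to see the software side of this exchange, here is a minimal sketch of a spinlock whose release is a single store, written with C11 `<stdatomic.h>`. The type `demo_spinlock` and the `demo_*` function names are hypothetical; the assumption that the release store becomes an ordinary MOV on x86 reflects what compilers typically emit, not something the language guarantees:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock; false means unlocked. */
typedef struct {
    atomic_bool locked;
} demo_spinlock;

/* Try once to take the lock. The atomic exchange is a read-modify-write
   with acquire ordering; it returns the previous value, so the lock was
   obtained only if that value was false. */
bool demo_trylock(demo_spinlock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

/* Spin until the lock is obtained. */
void demo_lock(demo_spinlock *l)
{
    while (!demo_trylock(l)) {
        /* busy-wait */
    }
}

/* Release with a single store that has release ordering: earlier
   accesses in the critical section may not be reordered after it. */
void demo_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

The acquire side needs an atomic read-modify-write, but the release side needs only the ordered store, which is the asymmetry the posts above are arguing about.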

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to [email protected] on Fri Oct 4 23:30:03 2024

[email protected] (MitchAlsup1) writes:

On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

George Neuner <[email protected]> writes:

<snipping>

My 66000 has a MemMove instruction consisting of a 1 word instruction,
that leaves DECODE and enters into one MEMory unit, where it proceeds
to AGEN and Read, AGEN and Write, leaving the rest of the function
units proceeding to whatever is next.

One thing I did different, here, none of the 3 registers is modified,
yet I retain the ability to take exception and re-play the instruction
from where it left off {in state never visible to the instruction
stream except via DECODE stage.}

What happens if the exception handler reschedules the CPU to
a different task before returning from the exception?

From George Neuner@21:1/5 to Anton Ertl on Sat Oct 5 00:13:24 2024

On Fri, 04 Oct 2024 07:05:34 GMT, [email protected]
(Anton Ertl) wrote:

George Neuner <[email protected]> writes:

You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.

Repeating nonsense does not make it any truer, and this nonsense has
been repeated since at least the Pentium Pro (1995), maybe already
since the 486 (1989). CISC and RISC are about the instruction set,
not about the implementation. And even if you look at the
implementation, it's not true: The P6 has microinstructions that are
~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
The K7 has load-store microinstructions; RISCs don't have that.

Anton, you know very well that the hardware does not execute the "x86" instruction set but only /emulates/ it. The decoder translates x86 instructions into sequences of microinstructions that perform the
equivalent operations. The fact that some simple instructions
translate one to one does not change this.

In more recent CPUs, AMD tends to work with macro-instructions between
the decoder and the reorder buffer (i.e., in the part that in the
Pentium Pro may have been used as the justification for the RISC
claim); these macro instructions are load-and-op and read-modify-write
instructions.

John Mashey has written about the difference between CISC and RISC
repeatedly <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>, and
he gives good criteria for classifying instruction sets as RISC or
CISC, and by his criteria the 80286 and IA-32 instruction sets of the
Pentium Pro clearly both are CISCs. I have recently
<[email protected]> used his criteria on
instruction sets that Mashey did not classify (mostly because they
were done after his table), and by these criteria AMD64 is clearly a
CISC, while ARM A64 and RISC-V are clearly RISCs.

In searching for whether he has written something specific about
IA-32, I found <https://yarchive.net/comp/vax.html>, which is an
earlier instance of the recent discussion of whether it would have
been better for DEC to stick with VAX, do an OoO implementation and
extend the architecture to 64 bits, like Intel has done:
<https://yarchive.net/comp/vax.html>. He also discusses the problems
of IA-32 there, but mainly in pointing out how much smaller they were
than the VAX ones.

I don't agree with all of that, however. E.g., when discussing a VAX
instruction similar to IA-32's REP MOVS, he considers it to be a big
advantage that the operands of REP MOVS are in registers. That
appears wrong to me; you either have to keep REP MOVS in decoding (and
thus stop decoding any later instructions) until you know the value of
that register coming out of the OoO engine, making REP MOVS a mostly
serializing instruction. Or you have a separate OoO logic for REP
MOVS that keeps generating loads and stores inside the OoO engine. If
you have the latter in the VAX, it does not make much difference if
the operand is on a register or memory. The possibility of trapping
during REP MOVS (or the VAX variant) complicates things, though: the
first part of the REP MOVS has to be committed, and the registers
written to the architectural state, and then execution has to start
again with the REP MOVS. Does not seem much harder on the VAX to me,
however.

- anton

From Anton Ertl@21:1/5 to George Neuner on Sat Oct 5 08:01:23 2024

George Neuner <[email protected]> writes:

On Fri, 04 Oct 2024 07:05:34 GMT, [email protected]
(Anton Ertl) wrote:

George Neuner <[email protected]> writes:

You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.

Repeating nonsense does not make it any truer, and this nonsense has
been repeated since at least the Pentium Pro (1995), maybe already
since the 486 (1989). CISC and RISC are about the instruction set,
not about the implementation. And even if you look at the
implementation, it's not true: The P6 has microinstructions that are
~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
The K7 has load-store microinstructions; RISCs don't have that.

Anton, you know very well that the hardware does not execute the "x86"
instruction set but only /emulates/ it. The decoder translates x86
instructions into sequences of microinstructions that perform the
equivalent operations. The fact that some simple instructions
translate one to one does not change this.

I know that the hardware does not execute the "x86" instruction set,
because there is no "x86" instruction set. There is the 80286
instruction set, the IA-32 instruction set, and the AMD64 instruction
set (and the boundary between 286 and IA-32 is squishy, but that
between those and AMD64 is hard).

As for the point you are trying to make, I know quite a bit about how
the instruction execution is implemented on various IA-32 and AMD64 implementations. Whether you call it execution or emulation, IA-32
and AMD64 are still the instruction sets of all of them, and there is
no way to execute (or emulate) other instruction sets, and no way to
run programs written in macro-ops, micro-ops, ROPs, or whatever they
may be called. That's even true for the Transmeta implementations
(although doing other instruction sets would have been possible there
and IIRC was demonstrated once). Moreover, these
implementation-specific things change from one implementation to the
next, and that includes the implementations by Transmeta.

For the 6502 or the MIPS R2000 we don't consider the instruction set
to be emulated, either, and they have a decoder that translates the instructions into sequences of signals to various units (i.e., microinstructions), too.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to Paul A. Clayton on Sat Oct 5 15:06:39 2024

"Paul A. Clayton" <[email protected]> writes:

On 10/4/24 7:30 PM, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

George Neuner <[email protected]> writes:

<snipping>

My 66000 has a MemMove instruction consisting of a 1 word instruction,
that leaves DECODE and enters into one MEMory unit, where it proceeds
to AGEN and Read, AGEN and Write, leaving the rest of the function
units proceeding to whatever is next.

One thing I did different, here, none of the 3 registers is modified,
yet I retain the ability to take exception and re-play the instruction
from where it left off {in state never visible to the instruction
stream except via DECODE stage.}

What happens if the exception handler reschedules the CPU to
a different task before returning from the exception?

I ass-me that like the PREDicate instruction modifier, there is
_implicit_ state that is saved on context switches. I.e., there is
extra storage space in the context store for such data.

My 66000 uses hardware context saving, so software can be ignorant
of such (aside from reserving enough storage).

I got the impression that it wasn't so much context saving,
as context switching (i.e. storage per 'process/thread');
yet if that storage needs to be saved to DRAM on any
exception, just in case the OS switches to a different
thread context, then I don't see how he can get his
claimed context switch times.

From John Dallman@21:1/5 to Anton Ertl on Sat Oct 5 17:20:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

[email protected] (John Dallman) writes:

Linux could not have had the success it did without the large
range of powerful and cheap hardware designed to run Windows.

It was first developed on a 386, and many of the early co-developers
also had IA-32 machines. But the 386 certainly was not designed to
run Windows. The 386 project was finished before Windows 1.0 was
released in November 1985, and nobody used Windows 1.0 or 2.0, so
why would anybody design a processor for those? ...

OK, "designed to run MS-DOS, and later Windows"?

John

From Anton Ertl@21:1/5 to John Dallman on Sat Oct 5 17:10:47 2024

[email protected] (John Dallman) writes:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

[email protected] (John Dallman) writes:

Linux could not have had the success it did without the large
range of powerful and cheap hardware designed to run Windows.

It was first developed on a 386, and many of the early co-developers
also had IA-32 machines. But the 386 certainly was not designed to
run Windows. The 386 project was finished before Windows 1.0 was
released in November 1985, and nobody used Windows 1.0 or 2.0, so
why would anybody design a processor for those? ...

OK, "designed to run MS-DOS, and later Windows"?

The 286 protected mode was certainly not designed for MS-DOS, and the
386 paging of linear addresses was certainly not designed for DOS,
either.

The virtual 8086 mode of the 386 was used by Windows/386 (starting
already in 1987). Was virtual 8086 mode designed into the 386
specifically for Windows? I doubt it, and AFAIK it is also used by
DOSEMU under Linux, and I expect that you can run, e.g., CP/M-86 on
it. It seems to be a good idea when designing a CPU like the 386
where backwards compatibility with the 8086 was a requirement.

Can you point to a specific feature of Intel CPUs that you think is specifically designed in for DOS or Windows? Even the A20-gate is a
general backwards-compatibility mechanism that may benefit real-mode
software other than DOS.

It seems to me that the 286 protected mode was a continuation of the
iAPX432 ideas, which predated DOS, and that the 386 paging imitated
the virtual-memory mainstream of bigger computing platforms at the
time, such as the VAX and S/370.

And the success of IA-32 and then AMD64 at replacing the RISCs is
exactly because it was not some DOS-centric architecture, but also
provided features needed by other OSs like 386/ix (later Interactive
Unix, which I used myself in 1990 or so), Xenix, Linux, Windows NT,
Solaris, the various BSDs, and others. And the computers built around
these CPUs also provided these features.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Oct 5 22:35:06 2024

On Fri, 4 Oct 2024 23:30:03 +0000, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

What happens if the exception handler reschedules the CPU to
a different task before returning from the exception?

There are 2 pointers and a count==index.

When a context switch happens (interrupt or exception)
the current count is saved in a "free" register in thread
header.

When control returns, and MM is executed for a second time
this saved count is used instead of the original operand
count.
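
The scheme described above can be modeled in a few lines of C++. This is only one reading of the description, not the My 66000 definition: the struct ThreadHeader, its field names, and the budget parameter that imitates an interrupt are all invented for the sketch, and the buffers are assumed not to overlap. The point it tries to show is that the three operands stay untouched while progress lives in per-thread state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-thread state area; the field names
// are invented for this sketch.
struct ThreadHeader {
    std::size_t saved_index = 0;  // progress of an interrupted move
    bool move_pending = false;    // true while a move is partially done
};

// Copies `count` bytes from `src` to `dst`, but gives up after `budget`
// bytes to imitate an interrupt or exception. Returns true once the
// whole move has completed. The operands are never modified; progress
// is kept only in the thread header.
bool mem_move(std::uint8_t* dst, const std::uint8_t* src, std::size_t count,
              ThreadHeader& th, std::size_t budget) {
    // On re-execution, resume from the saved index instead of zero.
    std::size_t i = th.move_pending ? th.saved_index : 0;
    std::size_t copied = 0;
    while (i < count) {
        if (copied == budget) {  // simulated interrupt
            th.saved_index = i;
            th.move_pending = true;
            return false;
        }
        dst[i] = src[i];
        ++i;
        ++copied;
    }
    th.move_pending = false;
    th.saved_index = 0;
    return true;
}
```

Since the header belongs to the thread rather than to the core, a reschedule between the interrupt and the replay does not lose the saved index in this model.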

From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Oct 5 22:42:19 2024

On Sat, 5 Oct 2024 15:06:39 +0000, Scott Lurndal wrote:

"Paul A. Clayton" <[email protected]> writes:

On 10/4/24 7:30 PM, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:

George Neuner <[email protected]> writes:

<snipping>

My 66000 has a MemMove instruction consisting of a 1 word instruction,
that leaves DECODE and enters into one MEMory unit, where it proceeds
to AGEN and Read, AGEN and Write, leaving the rest of the function
units proceeding to whatever is next.

One thing I did different, here, none of the 3 registers is modified,
yet I retain the ability to take exception and re-play the instruction
from where it left off {in state never visible to the instruction
stream except via DECODE stage.}

What happens if the exception handler reschedules the CPU to
a different task before returning from the exception?

I ass-me that like the PREDicate instruction modifier, there is
_implicit_ state that is saved on context switches. I.e., there is
extra storage space in the context store for such data.

My 66000 uses hardware context saving, so software can be ignorant
of such (aside from reserving enough storage).

I got the impression that it wasn't so much context saving,
as context switching (i.e. storage per 'process/thread');

Thread headers and thread register files are treated as
a write back cache. Core knows where it originally
got the RF and thus remembers where to put it back.

This state area is in the thread control block. So, there
is no way it can "not be there". An OS will not start a
process until there is enough memory to contain all
thread state {and .text, .data, .bss, ...}

yet if that storage needs to be saved to DRAM on any
exception, just in case the OS switches to a different
thread context, then I don't see how he can get his
claimed context switch times.

For interrupts, core starts fetching ISR thread state
before negotiating for an interrupt has finished. Most
of the time, the new state is arriving about when it is
known the interrupt will be "taken" by this core. Old
state is pushed out as new state arrives, then proceeds
to where it lives long term.

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 00:43:56 2024

On Fri, 04 Oct 2024 15:07:17 GMT, Anton Ertl wrote:

Power still survives, maybe only because it has a common basis with
iSeries (or whatever it is called now).

As I understand it, iSeries is the emulation of the old AS/400 on POWER processors. And AS/400 was the unification of the older System/38 with the System/34? System/36? lines.

System/38 (or AS/400, or iSeries) has/had this interestingly unusual architecture which builds database features right into the OS kernel, so
that they can be used everywhere. And it also uses capabilities as an alternative to the traditional privilege-mode hierarchy. Neither of these
ideas says much for performance, but they still suggest some interesting possibilities, nonetheless.

Native POWER is, I think, called pSeries. It continues to sell in its own
right because it offers high performance--high enough to earn a few
ongoing spots near the top of the Top500 supercomputer list.

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 00:47:04 2024

On Sat, 05 Oct 2024 17:10:47 GMT, Anton Ertl wrote:

The virtual 8086 mode of the 386 was used by Windows/386 (starting
already in 1987). Was virtual 8086 mode designed into the 386
specifically for Windows? I doubt it ...

Nevertheless, I think this was the “killer app” that made Windows actually useful to the masses: instead of having to wait for developers to create
apps written for Windows (which they were reluctant to do, as long as the
users didn’t want to buy Windows because there weren’t apps available for it ...), here was a feature they could use “out of the box”, to multitask their existing DOS apps, without any need for changes to application code.

And this led to a nice growth in the number of Windows installations,
which in turn created a market for actual Windows-specific apps.

From Lawrence D'Oliveiro@21:1/5 to Chris M. Thomasson on Sun Oct 6 00:55:31 2024

On Thu, 3 Oct 2024 23:36:12 -0700, Chris M. Thomasson wrote:

On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

That’s all past history, anyway. RISC very much rules today, and it is
x86 that is struggling to keep up.

You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed in
1995.

Of course, and that complexity (and consequent expense) is part of the struggle. Looking at Intel’s current financial woes, it is clearly not
being as successful at that as it has been in the past.

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 01:03:00 2024

On Fri, 04 Oct 2024 07:05:34 GMT, Anton Ertl wrote:

CISC and RISC are about the instruction set, not about
the implementation. And even if you look at the implementation, it's
not true: The P6 has microinstructions that are ~100 bits long, whereas
RISCs have 32-bit and 16-bit instructions. The K7 has load-store microinstructions; RISCs don't have that.

Intel I think tried to spread this idea of a “RISC core” somewhere inside the labyrinthine complexity of its Pentium-and-later chips, in the hope
that some of the aura attached to the term “RISC” would rub off on its products.

And quite a few people fell for it.

... ARM A64 and RISC-V are clearly RISCs.

ARM and some other RISC architectures (e.g. POWER) do somewhat stretch the
term though, don’t they, when they add that combinatorial explosion of operand types in their short-vector instructions.

RISC-V has consciously avoided this, by going back to the older long-
vector idea, like Seymour Cray used in his machines.

The possibility of trapping during
REP MOVS (or the VAX variant) complicates things, though: the first part
of the REP MOVS has to be committed, and the registers written to the architectural state, and then execution has to start again with the REP
MOVS. Does not seem much harder on the VAX to me, however.

This is why the VAX has the “FPD” (“First Part Done”) processor status bit.
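
As a rough illustration of what a "first part done" style bit buys, here is a toy C++ model of an interruptible string move. It is a sketch under stated assumptions, not VAX or x86 semantics: the Cpu struct, its field names, and the budget parameter that imitates an interrupt are all hypothetical. Progress is committed to the architectural registers, and the flag tells the re-executed instruction to continue from those registers instead of reading its operands again.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy architectural state; the names are illustrative only.
struct Cpu {
    std::size_t remaining = 0;         // bytes still to copy
    const std::uint8_t* src = nullptr; // next source byte
    std::uint8_t* dst = nullptr;       // next destination byte
    bool first_part_done = false;      // plays the role of an FPD-style bit
};

// Executes (or re-executes) one string-move instruction. Operands are
// read only when first_part_done is clear; otherwise execution continues
// from the register state left by the interrupted attempt. `budget`
// bounds the bytes copied before a simulated interrupt. Returns true
// when the move has completed.
bool move_string(Cpu& cpu, std::uint8_t* dst_operand,
                 const std::uint8_t* src_operand, std::size_t len_operand,
                 std::size_t budget) {
    if (!cpu.first_part_done) {
        cpu.remaining = len_operand;
        cpu.src = src_operand;
        cpu.dst = dst_operand;
    }
    while (cpu.remaining > 0) {
        if (budget == 0) {
            cpu.first_part_done = true;  // partial progress is committed
            return false;
        }
        *cpu.dst++ = *cpu.src++;
        --cpu.remaining;
        --budget;
    }
    cpu.first_part_done = false;
    return true;
}
```

In this model the operands could just as well come from memory, since they are only consulted on the first attempt.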

From Thomas Koenig@21:1/5 to Anton Ertl on Sun Oct 6 09:09:53 2024

Anton Ertl <[email protected]> schrieb:

In any case, certainly for the stuff I do I see no reason why I would consider, much less recommend buying a Power machine these days.

If you do not want backdoors in your system, you might want to
consider a Talos II. (I'm told that various representatives of
government agencies with vaguely funny-sounding names have been
seen, and talked to, at OpenPOWER conferences).

Not Power 10 though, that has an unexplained binary blob.

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 07:18:59 2024

Lawrence D'Oliveiro <[email protected]d> writes:

Native POWER is, I think, called pSeries. It continues to sell in its own
right because it offers high performance--high enough to earn a few
ongoing spots near the top of the Top500 supercomputer list.

Looking at the June 2024 edition, I see Summit as the highest-ranked
system with Power CPUs, and they are Power 9. So if your claim was
true that the Top500 supercomputer list reflects CPU performance,
Power 9 would beat Power 10 in CPU performance, and EPYC, Xeon,
Fujitsu A64FX and Nvidia Grace are more powerful CPUs. However, in
most supercomputers (including Summit) the GPGPUs provide the bulk of
the FLOPS that are measured in the Top 500, so looking at the Top 500
is misleading for determining CPU performance.

So let's look at SPEC CPU instead. For CPU2017, I see only four
entries from IBM, all for the Integer Rate metric, two with Power 9
and two with Power 10 CPUs. The highest of those results is:

base peak
1700 2170 IBM Power E1080

That's with 8 sockets, 120 cores, and 960 threads. Looking at other
8-socket machines, I find

base peak
3820 3880 BullSequana SH80

That's with 8 sockets, 480 cores, and 960 threads (similar results
from Fujitsu PRIMERGY RX8770 M7, HPE Compute Scale-up Server 3200,
Inspur TS860G7 and Supermicro SuperServer SYS-681E-TR, all done with
Xeon Platinum 8490H CPUs). And if you go for maximum performance,
there's a 16-socket Xeon machine from Bull with base=7400, peak=7450.

Alternatively, you can instead buy a 2-socket system with similar
performance to the 8-socket IBM Power E1080:

base peak
1950 2140 ASUS RS720A-E12-RS12

and similar results from other systems with the EPYC 9754.

https://www.spec.org/cpu2017/results/res2021q3/cpu2017-20210814-28679.html
https://www.spec.org/cpu2017/results/res2024q3/cpu2017-20240701-43944.html
https://www.spec.org/cpu2017/results/res2023q2/cpu2017-20230522-36617.html

Admittedly, IBM extracts the most performance from each core, but with
only 15 cores per CPU (where others have 128), that is no longer that impressive. Nevertheless, neither machines with the Ryzen 7950X nor
with the Xeon-E2488 reach the performance per core (and no results for
the Ryzen 9950X have been submitted yet), so it looks like Power 10
has a really good multi-threading implementation.
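
The per-core comparison can be checked with a little arithmetic on the base scores quoted above. The following C++ sketch is only a calculator for those numbers; the struct and function names are invented, and the 256-core figure for the ASUS system is inferred from "2-socket" and "128" cores per CPU rather than stated as a total in the post. It gives roughly 14.2 per core for the E1080, 8.0 for the SH80 and 7.6 for the ASUS machine.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One quoted SPEC CPU2017 Integer Rate result.
struct Result {
    std::string system;
    double base;  // base score as quoted in the post
    int cores;    // total cores in the system
};

// Base score divided by core count.
double base_per_core(const Result& r) {
    assert(r.cores > 0);
    return r.base / r.cores;
}

// The three systems compared in the post.
std::vector<Result> quoted_results() {
    return {
        {"IBM Power E1080", 1700.0, 120},
        {"BullSequana SH80", 3820.0, 480},
        // Assumption: 2 sockets x 128 cores, inferred from the post.
        {"ASUS RS720A-E12-RS12", 1950.0, 256},
    };
}
```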

The fact that IBM has not submitted results for Power for SPEC CPU
2017 for (Int or FP) Speed or FP Rate results is an admission that
their numbers there are even less impressive.

In any case, certainly for the stuff I do I see no reason why I would
consider, much less recommend buying a Power machine these days. My
guess is that the major reasons for buying pSeries machines these days
are legacy software and IBM salesmanship.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Michael S@21:1/5 to Anton Ertl on Sun Oct 6 12:40:00 2024

On Sun, 06 Oct 2024 07:18:59 GMT
[email protected] (Anton Ertl) wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Native POWER is, I think, called pSeries. It continues to sell in
its own right because it offers high performance--high enough to
earn a few ongoing spots near the top of the Top500 supercomputer
list.

Looking at the June 2024 edition, I see Summit as the highest-ranked
system with Power CPUs, and they are Power 9. So if your claim was
true that the Top500 supercomputer list reflects CPU performance,
Power 9 would beat Power 10 in CPU performance, and EPYC, Xeon,
Fujitsu A64FX and Nvidia Grace are more powerful CPUs. However, in
most supercomputers (including Summit) the GPGPUs provide the bulk of
the FLOPS that are measured in the Top 500, so looking at the Top 500
is misleading for determining CPU performance.

Yes, in almost all top entries in top500 the compute muscle is GPGPU,
with CPUs playing role of glorified Peripheral Processor of ancient supercomputers. That applies both to POWER and to Xeons and to EPYC.

However there are two exceptions: Fugaku (#4, Fujitsu A64Fx) and Sunway TaihuLight (#13, Sunway SW26010).

Majority of GPUs in the list are NVidia of various generations, but the
#1 (US DOE Frontier) uses AMD GPUs and #2 (US DOE Aurora) uses Intel
GPUs.

BTW, IBM Summit (NVidia GV100+IBM POWER9), despite still being pretty
high on the Top500 list, is going to be retired next month.
I wonder if Sunway TaihuLight is aging better.

So let's look at SPEC CPU instead. For CPU2017, I see only four
entries from IBM, all for the Integer Rate metric, two with Power 9
and two with Power 10 CPUs. The highest of those results is:

base peak
1700 2170 IBM Power E1080

That's with 8 sockets, 120 cores, and 960 threads. Looking at other
8-socket machines, I find

base peak
3820 3880 BullSequana SH80

That's with 8 sockets, 480 cores, and 960 threads (similar results
from Fujitsu PRIMERGY RX8770 M7, HPE Compute Scale-up Server 3200,
Inspur TS860G7 and Supermicro SuperServer SYS-681E-TR, all done with
Xeon Platinum 8490H CPUs). And if you go for maximum performance,
there's a 16-socket Xeon machine from Bull with base=7400, peak=7450.

Alternatively, you can instead buy a 2-socket system with similar
performance to the 8-socket IBM Power E1080:

base peak
1950 2140 ASUS RS720A-E12-RS12

and similar results from other systems with the EPYC 9754.

https://www.spec.org/cpu2017/results/res2021q3/cpu2017-20210814-28679.html
https://www.spec.org/cpu2017/results/res2024q3/cpu2017-20240701-43944.html
https://www.spec.org/cpu2017/results/res2023q2/cpu2017-20230522-36617.html

Admittedly, IBM extracts the most performance from each core, but with
only 15 cores per CPU (where others have 128), that is no longer that impressive.

"Core" in POWER9 is sort-of cheating. For nearly all practical purposes
what they call 'core' is a couple of cores with just a little bit of
resource sharing between halves when running in single-thread mode.
Just enough to have a judicial justification to being called 'core'.
I don't know if POWER10 is similar or different in that regard.

Nevertheless, neither machines with the Ryzen 7950X nor
with the Xeon-E2488 reach the performance per core (and no results for
the Ryzen 9950X have been submitted yet), so it looks like Power 10
has a really good multi-threading implementation.

The fact that IBM has not submitted results for Power for SPEC CPU
2017 for (Int or FP) Speed or FP Rate results is an admission that
their numbers there are even less impressive.

That's most likely explanation, but another one is that it is sort of
internal policy no matter what.
IIRC, they didn't publish non-rate scores for POWER7 either, despite
that according to independent measurement at point of introduction
POWER7 single-threaded performance was in the same ballpark with best
Intel offerings and easily ahead of best AMD.

In any case, certainly for the stuff I do I see no reason why I would consider, much less recommend buying a Power machine these days. My
guess is that the major reasons for buying pSeries machines these days
are legacy software and IBM salesmanship.

- anton

I think, if you are running Oracle DB Enterprise Edition, where
software license per core is the most expensive part then there could
be an economical reason for preferring POWER9 or 10 over Intel or AMD.
But that's just a guess.

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 08:40:55 2024

Lawrence D'Oliveiro <[email protected]d> writes:

Intel I think tried to spread this idea of a "RISC core" somewhere inside
the labyrinthine complexity of its Pentium-and-later chips, in the hope
that some of the aura attached to the term "RISC" would rub off on its
products.

I am not sure it was Intel, but certainly a number of people on the
net did write this, and it may already have started with the 486, and
my guess is that even an explanation of Intel that just explained the implementation of the 486 without making any claim that the 486 is a
RISC would have led to that result. It's just a thing that people
rooting for Intel like to believe, so in retelling the implementation explanation, they eventually settle down to "the 486 is a RISC
internally", and eventually this becomes "the complex 'x86'
instruction set is an illusion and that the hardware essentially has
been a load-store RISC".

AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.

BTW, this page looks at the microcode of different IA-32
implementations:
<https://fanael.github.io/is-x86-risc-internally.html>

And quite a few people fell for it.

Yes. Apparently it's something people want to believe in.

... ARM A64 and RISC-V are clearly RISCs.

ARM and some other RISC architectures (e.g. POWER) do somewhat stretch the
term though, don’t they, when they add that combinatorial explosion of
operand types in their short-vector instructions.

Number of operand types never has been a criterion in any of the RISC definitions I have seen, nor the number of instructions (although some
people like to go by that).

As for ARM and Power, from
<[email protected]>:

CPU  Age     3a  3b  3c   3d  4a  4b  5a  5b  6a  6b  # ODD
     (1991)
RULE <6      =1  =4  <5   =0  =0  =1  <2  =1  >4  >3
G1   1       1   4   4    0   0   1   1   1   5   5   -      IBM RS/6000
     6+      1   4   7+   0   0   1   0   1   4+  -   3/8    ARM1
     -12     2+  4   7+   0   0   1   1   2+  4+  5   4/7    ARMv7 T32
     -22     1   4   15+  0   0   1   1   2+  5   5   2/9    ARM A64

So for John Mashey the RS/6000 (original Power) satisfied all his RISC criteria. I think that since the PowerPC, Power fails his criteria 5b
and maybe 5a, so these days Power would be classified as 1/10 or 2/9
(i.e., 10 for RISC, 1 against), so it's clearly RISC, like the others,
and unlike AMD64 (7/4).
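
The "# ODD" counts can be reproduced mechanically from the RULE row. The C++ sketch below is a hedged illustration, not Mashey's method as published: the function and constant names are invented, a trailing "+" is simply dropped from each entry (an assumption about how to read the notation), and the ARM1 row is left out because its 6b entry is "-". Under those assumptions it yields 0 for the RS/6000, 4 for ARMv7 T32 and 2 for ARM A64, matching the "-", "4/7" and "2/9" entries.

```cpp
#include <array>
#include <cassert>

// Eleven columns in table order:
// Age, 3a, 3b, 3c, 3d, 4a, 4b, 5a, 5b, 6a, 6b.
using Row = std::array<int, 11>;

// Counts how many columns break the RULE row
// (<6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3).
int count_odd(const Row& r) {
    int odd = 0;
    odd += !(r[0] < 6);
    odd += !(r[1] == 1);
    odd += !(r[2] == 4);
    odd += !(r[3] < 5);
    odd += !(r[4] == 0);
    odd += !(r[5] == 0);
    odd += !(r[6] == 1);
    odd += !(r[7] < 2);
    odd += !(r[8] == 1);
    odd += !(r[9] > 4);
    odd += !(r[10] > 3);
    return odd;
}

// Rows transcribed from the table, with any trailing '+' dropped.
const Row kRS6000 = {1, 1, 4, 4, 0, 0, 1, 1, 1, 5, 5};
const Row kArmT32 = {-12, 2, 4, 7, 0, 0, 1, 1, 2, 4, 5};
const Row kArmA64 = {-22, 1, 4, 15, 0, 0, 1, 1, 2, 5, 5};
```

With eleven criteria in total, an odd count of 2 is the same thing as the "2/9" notation: two against RISC, nine for.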

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 13:21:32 2024

On Sun, 6 Oct 2024 00:43:56 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Fri, 04 Oct 2024 15:07:17 GMT, Anton Ertl wrote:

Power still survives, maybe only because it has a common basis with
iSeries (or whatever it is called now).

As I understand it, iSeries is the emulation of the old AS/400 on
POWER processors. And AS/400 was the unification of the older
System/38 with the System/34? System/36? lines.

System/38 (or AS/400, or iSeries) has/had this interestingly unusual architecture which builds database features right into the OS kernel,
so that they can be used everywhere. And it also uses capabilities as
an alternative to the traditional privilege-mode hierarchy. Neither
of these ideas says much for performance, but they still suggest some interesting possibilities, nonetheless.

Native POWER is, I think, called pSeries. It continues to sell in its
own right because it offers high performance--

https://www.ibm.com/downloads/cas/B425DZZ1
Try to find word POWER (or Power systems) in this 128-page document.
Then maybe you will get the idea of how important it is according to
IBM management.

Compare with 5, 10 and 15 years ago (in the oldest report look for
system p)
https://www.ibm.com/investor/att/pdf/IBM_Annual_Report_2018.pdf
https://www.ibm.com/investor/att/pdf/IBM_Annual_Report_2013.pdf
https://www.ibm.com/investor/att/pdf/IBM_Annual_Report_2008.pdf

high enough to earn a
few ongoing spots near the top of the Top500 supercomputer list.

This misunderstanding cleared by Anton.

From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 13:51:19 2024

On Sun, 6 Oct 2024 00:55:31 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 3 Oct 2024 23:36:12 -0700, Chris M. Thomasson wrote:

On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

That’s all past history, anyway. RISC very much rules today, and
it is x86 that is struggling to keep up.

You are, of course, aware that the complex "x86" instruction set
is an illusion and that the hardware essentially has been a
load-store RISC with a complex decoder on the front end since the
Pentium Pro landed in 1995.

Of course, and that complexity (and consequent expense) is part of
the struggle. Looking at Intel’s current financial woes, it is
clearly not being as successful at that as it has been in the past.

Intel's current financial woes do not appear to be [directly] related to
Intel PC (laptop+desktop) sales, which are right now pretty good and
profitable.
Even the server division, which struggled and lost money for the majority of
2023, is now recovering and is profitable again, even if the profit margin is
tiny compared to 2021.
Actually, it takes special management talent to have such good results
in the company's main segment and despite that to lose money for Q
after Q after Q.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Sun Oct 6 14:04:56 2024

On Sun, 06 Oct 2024 08:40:55 GMT
[email protected] (Anton Ertl) wrote:

AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.

Wasn't the term invented by Nexgen for Nx586 and later adopted by AMD
after they scrapped their home-brewed core in favor of Nexgen's core
that later became known as AMD K6?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Sun Oct 6 11:58:51 2024

Michael S <[email protected]> writes:

On Sun, 06 Oct 2024 07:18:59 GMT
[email protected] (Anton Ertl) wrote:

The fact that IBM has not submitted results for Power for SPEC CPU
2017 for (Int or FP) Speed or FP Rate results is an admission that
their numbers there are even less impressive.

That's the most likely explanation, but another one is that it is a sort of
internal policy no matter what.
IIRC, they didn't publish non-rate scores for POWER7 either, even though,
according to independent measurements at the point of introduction,
POWER7 single-threaded performance was in the same ballpark as the best
Intel offerings and easily ahead of the best AMD.

For our LaTeX benchmark (numbers are in seconds, lower is better; the
years are when the hardware came on the market):

Machine                                                    Time (s)  Year
Power7, 3600MHz, CentOS 7 (ppc64) TeX Live 2013            0.81      2010
Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit)   0.76      2003
Xeon X3460 (Lynnfield (Nehalem)) 2800MHz, Deb Lenny 64b    0.484     2009
Xeon 5160, 3000MHz, (2*)4MB L2, Debian Etch (64-bit)       0.48      2006
Phenom II X2 560, 3300MHz, 6MB L3, Debian Jessie (64-bit)  0.452     2010

On whatever applications the performance of Power7 is competitive with
Intel and AMD of its time, it certainly is not on our LaTeX
benchmark.

In any case, certainly for the stuff I do I see no reason why I would
consider, much less recommend buying a Power machine these days. My
guess is that the major reasons for buying pSeries machines these days
are legacy software and IBM salesmanship.

- anton

I think, if you are running Oracle DB Enterprise Edition, where
software license per core is the most expensive part then there could
be an economical reason for preferring POWER9 or 10 over Intel or AMD.
But that's just a guess.

Yes, per-core licensing fees irrespective of the actual hardware might
be a reason, but it would be perverse of Oracle or some other ISV to
pay for porting to Power in order to reduce the licensing fees that
their customers have to pay to them. But stranger things have
happened.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Sun Oct 6 14:03:59 2024

Michael S <[email protected]> writes:

On Sun, 06 Oct 2024 08:40:55 GMT
[email protected] (Anton Ertl) wrote:

AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.

Wasn't the term invented by Nexgen for Nx586 and later adopted by AMD
after they scrapped their home-brewed core in favor of Nexgen's core
that later became known as AMD K6?

Easily possible. AMD may not have been the original culprit, but they continued this terminology, and therefore are just as guilty.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Sun Oct 6 17:35:40 2024

On Sun, 06 Oct 2024 14:03:59 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Sun, 06 Oct 2024 08:40:55 GMT
[email protected] (Anton Ertl) wrote:

AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.

Wasn't the term invented by Nexgen for Nx586 and later adopted by AMD
after they scrapped their home-brewed core in favor of Nexgen's core
that later became known as AMD K6?

Easily possible. AMD may not have been the original culprit, but they continued this terminology, and therefore are just as guilty.

- anton

In their defense, AMD's use of the term ROP didn't last for long.
K8 manuals use the better term micro-ops. I don't have a K7 manual to
check, but it seems to me that it uses the same terminology as K8.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Sun Oct 6 16:21:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

OK, "designed to run MS-DOS, and later Windows"?

The 286 protected mode was certainly not designed for MS-DOS, and
the 386 paging of linear addresses was certainly not designed for
DOS, either.

I see where I'm going wrong: I'm trying to talk about the machines
designed to run MS-DOS and later Windows, not just the CPUs. The vast
range of hardware that all had substantial degrees of compatibility as
regards booting, busses and so on. Those things let their manufacturers
compete for the DOS and Windows market, whereas x86-based machines that
weren't PC-compatible only succeeded in quite specialised niches.

Those hardware suppliers did not close off access to the more advanced
features of i386 onwards, because they had no reason to, and that let
Linux take advantage of all that hardware when it came along. That's the
point I was failing to make.

And the success of IA-32 and then AMD64 at replacing the RISCs is
exactly because it was not some DOS-centric architecture, but also
provided features needed by other OSs like 386/ix (later Interactive
Unix, which I used myself in 1990 or so), Xenix, Linux, Windows NT,
Solaris, the various BSDs, and others. And the computers built
around these CPUs also provided these features.

Just so.

It seems to me that the 286 protected mode was a continuation of the
iAPX432 ideas, which predated DOS,

Not sure about that: it is also a bit like the base-limit memory
protection of various old mainframe architectures, done in the context of
x86 64KB segments.

and that the 386 paging imitated the virtual-memory mainstream
of bigger computing platforms at the time, such as the VAX and
S/370.

Absolutely.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun Oct 6 23:34:04 2024

On Sun, 6 Oct 2024 13:51:19 +0300, Michael S wrote:

Intel's current financial woes do not appear to be [directly] related to Intel PC (laptops+desktop) sales that are right now pretty good and profitable.

x86 chip sales have been declining for years. At one time they were up to
a million per day; nowadays it’s only about 80% of that. And you see the trouble they have keeping up in performance, microcode bugs etc. All adds
up to competitiveness trouble.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun Oct 6 23:33:38 2024

On Sat, 5 Oct 2024 21:56:34 +0000, Chris M. Thomasson wrote:

On 10/4/2024 3:54 PM, MitchAlsup1 wrote:

On Fri, 4 Oct 2024 19:36:41 +0000, Chris M. Thomasson wrote:

On 10/3/2024 11:36 PM, Chris M. Thomasson wrote:

On 10/3/2024 9:23 PM, George Neuner wrote:

On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:

If the RISC companies failed to keep up, they only have themselves to
blame.

That’s all past history, anyway. RISC very much rules today, and it is x86
that is struggling to keep up.

You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.

Yeah. Wrt memory barriers, one is allowed to release a spinlock on "x86"
with a simple store.

The fact that one can release a spinlock using a simple store means that
it's basically load-acquire release-store.

So a load will do a load then have an implied acquire barrier.

A store will do an implied release barrier then perform the store.

How does the store know it needs to do this when the locking
instruction is more than a pipeline depth away from the
store release ?? So, Locked LD (or something) happens at
1,000,000 cycles, and the corresponding store happens at
10,000,000 cycles (9,000,000 locked).

This release behavior is okay for releasing a spinlock with a simple
store, MOV.

It may be OK to SW but it causes all kinds of grief to HW.

I thought that x86 has an implied #LoadStore | #StoreStore before the
store, basically to give it release semantics. This means that one can release a spinlock without using any explicit membars. Iirc, there are
Intel manuals that show this for spinlocks. Cannot exactly remember
right now.

I wonder if this actually works with my scenario above.

On x86 an atomic load has acquire semantics and an atomic store has release
semantics. Well, I think that is for WB memory only. Humm... Cannot
remember if it's for WC or WB memory right now. Then there are the
L/S/MFENCE instructions...

https://www.felixcloutier.com/x86/sfence
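
For what it's worth, the "release a spinlock with a simple store" idea can be sketched in portable C11 atomics. This is only a minimal illustration, not taken from any Intel manual: the names demo_spinlock, spin_init, spin_trylock, spin_lock and spin_unlock are made up here, and the remarks about which x86 instructions a compiler emits are an assumption about typical code generation for ordinary write-back memory.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock; all names are illustrative only. */
typedef struct {
    atomic_bool locked;
} demo_spinlock;

static void spin_init(demo_spinlock *l)
{
    atomic_init(&l->locked, false);
}

/* Try to take the lock once; returns true if we now own it. */
static bool spin_trylock(demo_spinlock *l)
{
    /* Atomic read-modify-write with acquire ordering
       (assumption: typically an XCHG on x86). */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static void spin_lock(demo_spinlock *l)
{
    while (!spin_trylock(l)) {
        /* Spin on plain loads until the lock looks free, then retry. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

static void spin_unlock(demo_spinlock *l)
{
    /* Release store: earlier loads and stores may not be reordered after it.
       Assumption: on x86 this typically compiles to a plain MOV, no fence. */
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

The acquire side needs an atomic read-modify-write, but the release side is just a store with release ordering, which matches the "implied #LoadStore | #StoreStore before the store" description above.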

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 23:36:10 2024

On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:

Number of operand types never has been a criterion in any of the RISC definitions I have seen, nor the number of instructions (although some
people like to go by that).

It’s in the name: “Reduced Instruction Set Computer”.

I always thought it should have been “IRSC”: “Increased Register Set Computer”. The most obvious characteristic, the one that tends to hit you first, is having lots of registers.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Oct 6 23:42:47 2024

On Sun, 06 Oct 2024 07:18:59 GMT, Anton Ertl wrote:

However, in most supercomputers (including Summit) the GPGPUs provide
the bulk of the FLOPS ...

That tends to go back and forth, between CPU and GPU.

See this <https://www.nextplatform.com/2020/03/05/software-evolution-on-ornls-summit-supercomputer/>
interview with Dr Tjerk Straatsma, group lead for scientific computing
at ORNL. Seems their supers have made heavy use of NVidia GPUs up to
now, but this was set to change:

Frontier, the next system for the OLCF, will have AMD CPUs and
GPUs.

To prepare for this system, software developers may want to make
changes to their programming approach, with OpenMP directive-based
and HIP native offloading as the most comparable to the OpenACC
and CUDA approaches on Summit today.

I wonder what happened to OpenCL as the cross-platform architecture
for GPU computing?
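
For readers who have not seen the directive-based style the interview refers to, here is a minimal sketch of an OpenMP target-offload loop in C. The function name saxpy and its parameters are made up for illustration. It assumes a compiler with OpenMP offloading support; without it the pragma is simply ignored and the loop runs on the host.

```c
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i], offloaded if a device exists. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    /* map(to:) copies x to the device; map(tofrom:) copies y there and back. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The appeal of this style is that the same source still compiles and runs as ordinary serial C when no offload target is available.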

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun Oct 6 23:45:20 2024

On Sun, 6 Oct 2024 12:40:00 +0300, Michael S wrote:

However there are two exceptions: Fugaku (#4, Fujitsu A64Fx) and Sunway TaihuLight (#13, Sunway SW26010).

Some suspect there may be more Chinese machines that would be worthy of a
high place in the Top500 list, if only people knew about them. And those
would likely be strong on the CPU side, weak on the GPU side, too.

Why could China be drawing back from publicizing its supercomputer
prowess? Partly national security, perhaps; but also because it tends to
enrage some in the US to have their noses rubbed in another country’s technological superiority.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Dallman on Sun Oct 6 23:38:07 2024

On Sun, 6 Oct 2024 16:21 +0100 (BST), John Dallman wrote:

... whereas x86-based machines that weren't PC-compatible ...

They could have been PCs, too, since IBM neither pioneered nor owned the
term.

The standard for compatibility soon had more to do with Microsoft software
than particular IBM hardware, anyway; which is why I like to say “Microsoft-compatible”. No possibility of confusion over what you mean.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 07:17:02 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:

Number of operand types never has been a criterion in any of the RISC
definitions I have seen, nor the number of instructions (although some
people like to go by that).

It’s in the name: "Reduced Instruction Set Computer".

Not at all. What you think of is a "fewer instructions computer", but
it's called a "reduced-instruction set computer". It becomes more
obvious if you look at the opposite: "Complex-instruction set
computer", not "more-instructions computer".

I always thought it should have been "IRSC": "Increased Register Set
Computer". The most obvious characteristic, the one that tends to hit you
first, is having lots of registers.

Having 32 GPRs does not make AMD64 with APX a RISC, and VAX (a CISC)
has the same number of registers as the first 801 and the ARM A32/T32
(RISCs).

However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for the
FPR specifier.
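
To make the arithmetic behind that criterion concrete, here is a small sketch in C. The function names are hypothetical; the only inputs taken from this thread are the ">4 bits" GPR threshold, the 32 GPRs of AMD64 with APX, and the fact that VAX and ARM A32 have the same register count (16 is the figure I am assuming for both).

```c
#include <stdbool.h>

/* Bits needed to name one of `count` registers (count >= 1). */
unsigned specifier_bits(unsigned count)
{
    unsigned bits = 0;
    for (unsigned c = count - 1; c != 0; c >>= 1)
        bits++;
    return bits;
}

/* The ">4 bits for the GPR specifier" test as quoted above. */
bool passes_gpr_criterion(unsigned gpr_count)
{
    return specifier_bits(gpr_count) > 4;
}
```

With 16 registers the specifier is 4 bits, so the test fails for VAX and ARM A32 alike; with 32 registers it is 5 bits and the test passes, which is why this criterion alone cannot separate RISC from CISC.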

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Dallman on Mon Oct 7 08:00:03 2024

[email protected] (John Dallman) writes:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

OK, "designed to run MS-DOS, and later Windows"?

The 286 protected mode was certainly not designed for MS-DOS, and
the 386 paging of linear addresses was certainly not designed for
DOS, either.

I see where I'm going wrong: I'm trying to talk about the machines
designed to run MS-DOS and later Windows, not just the CPUs. The vast
range of hardware that all had substantial degrees of compatibility as
regards booting, busses and so on. Those things let their manufacturers
compete for the DOS and Windows market, whereas x86-based machines that
weren't PC-compatible only succeeded in quite specialised niches.

There actually were MS-DOS-compatible machines that were not 100% IBM
PC compatible, and did not run programs that used direct hardware
access, but MS-DOS programs that only used BIOS functions (i.e., a
HAL). The BIOS functions were too slow, so the programs with direct
hardware access won out over those that used the BIOS, and therefore
the 100% IBM PC compatibles won out over the MS-DOS compatibles.

The PC industry then developed a culture of compatibility, and that
helped all OSs, not just DOS and Windows. E.g., it is much easier to
install Linux on a PC than on some ARM-based SBC; for the ARM-based
SBC the typical way is to use a prepared system image on an SD-card,
because you cannot just put in a USB stick and run an installer.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Anton Ertl on Mon Oct 7 10:17:26 2024

Anton Ertl wrote:

[email protected] (John Dallman) writes:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

OK, "designed to run MS-DOS, and later Windows"?

The 286 protected mode was certainly not designed for MS-DOS, and
the 386 paging of linear addresses was certainly not designed for
DOS, either.

I see where I'm going wrong: I'm trying to talk about the machines
designed to run MS-DOS and later Windows, not just the CPUs. The vast
range of hardware that all had substantial degrees of compatibility as
regards booting, busses and so on. Those things let their manufacturers
compete for the DOS and Windows market, whereas x86-based machines that
weren't PC-compatible only succeeded in quite specialised niches.

There actually were MS-DOS-compatible machines that were not 100% IBM
PC compatible, and did not run programs that used direct hardware
access, but MS-DOS programs that only used BIOS functions (i.e., a
HAL). The BIOS functions were too slow, so the programs with direct
hardware access won out over those that used the BIOS, and therefore
the 100% IBM PC compatibles won out over the MS-DOS compatibles.

The single most canonical test for IBM PC compatibility was Microsoft's
Flight Simulator, taking off from the now-demolished Meigs Field in
Chicago.

That game used the OS and BIOS for the loading of the game, and then
went on to direct hardware access for pretty much the rest of the
playing time.

The PC industry then developed a culture of compatibility, and that
helped all OSs, not just DOS and Windows. E.g., it is much easier to
install Linux on a PC than on some ARM-based SBC; for the ARM-based
SBC the typical way is to use a prepared system image on an SD-card,
because you cannot just put in a USB stick and run an installer.

Terje

- anton

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 13:05:53 2024

On Sun, 6 Oct 2024 23:42:47 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sun, 06 Oct 2024 07:18:59 GMT, Anton Ertl wrote:

However, in most supercomputers (including Summit) the GPGPUs
provide the bulk of the FLOPS ...

That tends to go back and forth, between CPU and GPU.

See this <https://www.nextplatform.com/2020/03/05/software-evolution-on-ornls-summit-supercomputer/>
interview with Dr Tjerk Straatsma, group lead for scientific computing
at ORNL. Seems their supers have made heavy use of NVidia GPUs up to
now, but this was set to change:

Frontier, the next system for the OLCF, will have AMD CPUs and
GPUs.

No back and forth here. Frontier is as GPU-centric as Summit was.
The same goes for Aurora, which is installed alongside the earlier Polaris at
Argonne,
and for El Capitan, which is going to replace Sierra at LLNL.

In all cases the GPU vendor changed, but the balance between GPU and CPU computing power remains heavily skewed toward GPU.

To prepare for this system, software developers may want to make
changes to their programming approach, with OpenMP directive-based
and HIP native offloading as the most comparable to the OpenACC
and CUDA approaches on Summit today.

I wonder what happened to OpenCL as the cross-platform architecture
for GPU computing?

I have no idea.
Maybe it was not designed with the same level of competence as CUDA?
Or maybe, being cross-platform, OpenCL is at an inherent disadvantage?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Mon Oct 7 12:27:09 2024

On Mon, 07 Oct 2024 07:17:02 GMT
[email protected] (Anton Ertl) wrote:

However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for the
FPR specifier.

- anton

Which sounds rather arbitrary. Or even worse, as if he wanted
SPARC to be called 'typical RISC' and ARM to be called atypical, and
had chosen the numbers to match the agenda.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 13:26:54 2024

On Sun, 6 Oct 2024 23:34:04 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sun, 6 Oct 2024 13:51:19 +0300, Michael S wrote:

Intel's current financial woes do not appear to be [directly]
related to Intel PC (laptops+desktop) sales that are right now
pretty good and profitable.

x86 chip sales have been declining for years. At one time they were
up to a million per day; nowadays it’s only about 80% of that.

That can explain a slow shift from being crazily profitable to "just" very
very profitable.
It's not nearly enough to explain several consecutive quarters of big
losses.

And
you see the trouble they have keeping up in performance, microcode
bugs etc. All adds up to competitiveness trouble.

No, I don't see it. What I see is that in absolute performance per
thread four companies are head and shoulders above the rest of
the industry. Two of the four make ARM, the other two make x86.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Anton Ertl on Mon Oct 7 17:38:54 2024

Anton Ertl <[email protected]> schrieb:

Michael S <[email protected]> writes:

On Mon, 07 Oct 2024 07:17:02 GMT
[email protected] (Anton Ertl) wrote:

However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for the
FPR specifier.

- anton

Which sounds rather arbitrary.

In a way it is, but see below.

Or even worse, as if he wanted
SPARC to be called 'typical RISC' and ARM to be called atypical, and
had chosen the numbers to match the agenda.

I think that ARM did not exist for John Mashey.

When was his definition made?

ARM was rather late to the RISC game, this might have been literally
true.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Mon Oct 7 17:09:10 2024

Michael S <[email protected]> writes:

On Mon, 07 Oct 2024 07:17:02 GMT
[email protected] (Anton Ertl) wrote:

However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for the
FPR specifier.

- anton

Which sounds rather arbitrary.

In a way it is, but see below.

Or even worse, as if he wanted
SPARC to be called 'typical RISC' and ARM to be called atypical, and
had chosen the numbers to match the agenda.

I think that ARM did not exist for John Mashey.

He probably chose the criterion ">4 bits" because it excluded VAX.
So, yes, his criteria were based on classifying some architectures as
RISCs and some as CISCs, and then drawing the lines to fit that
classification. But his criteria also work for architectures he did
not look at, including ARM A32, even if not every criterion agrees
with all the others for every architecture.

Alternatively, you could do a cluster analysis with these criteria and
maybe others, and I think that the RISCs would come out pretty tightly clustered; the non-RISCs would be further away from that, and I doubt
that they would form a single cluster.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 18:56:26 2024

On Sun, 6 Oct 2024 1:03:00 +0000, Lawrence D'Oliveiro wrote:

On Fri, 04 Oct 2024 07:05:34 GMT, Anton Ertl wrote:

CISC and RISC are about the instruction set, not about
the implementation. And even if you look at the implementation, it's
not true: The P6 has microinstructions that are ~100 bits long, whereas
RISCs have 32-bit and 16-bit instructions. The K7 has load-store
microinstructions; RISCs don't have that.

Intel I think tried to spread this idea of a “RISC core” somewhere inside
the labyrinthine complexity of its Pentium-and-later chips, in the hope
that some of the aura attached to the term “RISC” would rub off on its
products.

And quite a few people fell for it.

... ARM A64 and RISC-V are clearly RISCs.

ARM and some other RISC architectures (e.g. POWER) do somewhat stretch the
term though, don’t they, when they add that combinatorial explosion of
operand types in their short-vector instructions.

600-1200 instructions in SIMD

RISC-V has consciously avoided this, by going back to the older
long-vector idea, like Seymour Cray used in his machines.

ONLY 100-300 instructions for long vector.

Compare this to 2 instructions in My 66000 that provide access
to both SIMD and long vectors.

In My Humble Opinion, ISAs with SIMD or Long Vectors do not qualify
as RISC.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kent Dickey@21:1/5 to [email protected] on Mon Oct 7 18:55:26 2024

In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <[email protected]>,
Anton Ertl <[email protected]> wrote:

Brett <[email protected]> writes:

Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?

You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.

The future of programming languages is type safe with checks, you need to
get on that bandwagon early.

MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It was abandoned and replaced by RISC-V several
years ago.

Alpha got on that bandwagon early. It's a descendant of MIPS, but it
renamed add into addv, and addu into add. It was canceled around
the year 2000.

[ More details about architectures without trapping overflow instructions ]

Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.

Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.

That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating
the error checks.

Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.

I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2,
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.

But they didn't so it isn't.

The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.

OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and showed how the
other cases don't make sense.

For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about a 50% performance hit
2) it's always on for a compilation unit, which is not what programmers need,
as it triggers on many false positives, so people turn it off.

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.

I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.

I think I am not explaining the issue well.

I'm not arguing what you want to do with overflow. I'm trying to show that
for all uses of detecting overflow other than crashing with no recovery, hardware trapping on overflow is a poor approach.

If you enable hardware traps on integer overflow, then to do anything other than crash the program would require engineering a very complex set of
data structures, roughly on par with the complexity of adding debug information to the executable, in order to make this work. As far as I know, no one in the history of computers has yet undertaken this task.

This is because each instruction which overflows would need special
handling, and the "debug" information would be needed. It would be a huge amount of compiler/linker/runtime complexity.

This is different than most "signal" handlers people have written, where
simple inspection of the instruction which failed and the address involved allows it to be "handled". But to do anything other than crash, each instruction which overflows needs special handling unique to that instruction and dependent on what the compiler was in the middle of doing when the
overflow happened. This is why trapping just isn't a good idea.

I'm just explaining why trap-on-overflow has gone away, because it's
almost completely useless: hardware trap on overflow is only good for the
case that you want to crash on integer overflow. Branch-on-overflow is the correct approach--the compiler can branch to either a trapping instruction
(if you just want to crash), or for all other cases of detecting overflow,
the compiler branches to "fixup" code.

And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.
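
To make the branch-on-overflow idea concrete, here is a rough sketch in C. It relies on __builtin_add_overflow, which GCC and Clang provide as an extension; the function names add_saturating and add_or_trap are made up for this example, and saturation is just one arbitrary choice of "fixup".

```c
#include <limits.h>
#include <stdlib.h>

/* Overflow handled by inline "fixup" code: here, saturate. */
int add_saturating(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum)) {
        /* Signed add only overflows when both operands share a sign. */
        return (a < 0) ? INT_MIN : INT_MAX;
    }
    return sum;
}

/* Crash-on-overflow variant: the overflow branch just terminates. */
int add_or_trap(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        abort();
    return sum;
}
```

Both variants share the same add followed by a conditional branch; only the branch target differs, which is the point being made above.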

Kent

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kent Dickey@21:1/5 to [email protected] on Mon Oct 7 18:45:01 2024

In article <S9YIO.47284$[email protected]>,
EricP <[email protected]> wrote:

Terje Mathisen wrote:

Kent Dickey wrote:

Look at:
https://godbolt.org/z/oMhW55YsK

Which is this code:

int add2(int num, int other) {
    return num + other;
}

Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and crash).

For x86-64 clang 19.1.0:

add2:
        add     edi, esi
        jo      .LBB0_1
        mov     eax, edi
        ret
.LBB0_1:
        ud1     eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.

But x86 has an instruction to trap on overflow directly: INTO. It's
one byte.
And it doesn't use it.

GCC x86-64 14.2 is even worse:

add2:
        sub     rsp, 8
        call    __addvsi3
        add     rsp, 8
        ret

It calls a routine to do all additions which might overflow, and that
routine calls assert() if an overflow occurs.
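
As a rough idea of what such a helper routine has to do, here is a portable sketch in C that checks for overflow before performing the add. It is not libgcc's actual source; the names add_would_overflow and checked_add are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* True if a + b would overflow int; tested without doing the add itself. */
bool add_would_overflow(int a, int b)
{
    return (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
}

/* Hypothetical stand-in for a checked-add helper: abort on overflow. */
int checked_add(int a, int b)
{
    if (add_would_overflow(a, b))
        abort();
    return a + b;
}
```

Compared with the add-then-JO sequence clang emits, this costs a function call plus extra comparisons for every addition, which is why the out-of-line approach looks worse.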

The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.

So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.

You can only compile in INTO opcodes if you can guarantee that the INT 4
(INTO) trap vector will always be set to a proper handler, and since
that isn't part of the ABI, compilers can't depend on it?

I do agree that it would be nice if it did work, barring that clang is
doing the best possible alternative, at close to zero cost except for
the useless branch predictor table entry wastage.

Terje

On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.
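
For reference, the REX prefix that took over bytes 40..4F has the bit layout 0100WRXB. A small sketch in C of recognizing and decoding it (the type and function names are made up for this example):

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of a REX prefix byte, layout 0100WRXB. */
typedef struct {
    bool w; /* 64-bit operand size */
    bool r; /* extends the ModRM.reg field */
    bool x; /* extends the SIB.index field */
    bool b; /* extends ModRM.rm, SIB.base or the opcode register field */
} rex_bits;

/* In 64-bit mode, bytes 0x40..0x4F are REX prefixes (formerly INC/DEC). */
bool is_rex_prefix(uint8_t byte)
{
    return (byte & 0xF0) == 0x40;
}

/* Split the low four bits of a REX byte into its flags. */
rex_bits decode_rex(uint8_t byte)
{
    rex_bits f = {
        .w = (byte >> 3) & 1,
        .r = (byte >> 2) & 1,
        .x = (byte >> 1) & 1,
        .b = byte & 1,
    };
    return f;
}
```

For example, 0x48 is REX.W alone, the prefix seen in front of most 64-bit integer instructions.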

Right, I forgot this. But even when compiling for 32-bit mode, GCC and Clang
both do not use INTO when using the -ftrapv flag--the compilers do the same thing they do in 64-bit mode.

Kent

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 19:03:59 2024

On Sun, 6 Oct 2024 23:36:10 +0000, Lawrence D'Oliveiro wrote:

On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:

Number of operand types never has been a criterion in any of the RISC
definitions I have seen, nor the number of instructions (although some
people like to go by that).

It’s in the name: “Reduced Instruction Set Computer”.

I always thought it should have been “IRSC”: “Increased Register Set Computer”. The most obvious characteristic, the one that tends to hit
you first, is having lots of registers.

At the time of RISC, Denelcor had the HEP, and each process could have up to
256 registers (along with up to 256 constants).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Mon Oct 7 19:02:12 2024

On Sun, 6 Oct 2024 14:35:40 +0000, Michael S wrote:

On Sun, 06 Oct 2024 14:03:59 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Sun, 06 Oct 2024 08:40:55 GMT
[email protected] (Anton Ertl) wrote:

AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.

Wasn't the term invented by NexGen for the Nx586 and later adopted by AMD
after they scrapped their home-brewed core in favor of NexGen's core,
which later became known as the AMD K6?

Easily possible. AMD may not have been the original culprit, but they
continued this terminology, and therefore are just as guilty.

- anton

In their defense, AMD's use of the term ROP didn't last for long.
K8 manuals use the better term micro-ops. I don't have a K7 manual to
check, but it seems to me that it uses the same terminology as K8.

K9 used the terms micro-ops and meso-ops to describe before and
after peephole optimization. HW was happy to run either as micro-ops
were a strict subset of meso-ops, meso-ops just got more work
done per cycle.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Niklas Holsti@21:1/5 to All on Mon Oct 7 22:51:31 2024

On 2024-10-07 22:12, MitchAlsup1 wrote:

On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:

In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.

I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.

I think I am not explaining the issue well.

I'm not arguing what you want to do with overflow. I'm trying to show
that for all uses of detecting overflow other than crashing with no
recovery, hardware trapping on overflow is a poor approach.

If you enable hardware traps on integer overflow, then to do anything
other than crash the program would require engineering a very complex
set of data structures, roughly approximately the complexity of adding
debug information to the executable, in order to make this work. As
far as I know, no one in the history of computers has yet undertaken
this task.

And yet, this is exactly the kind of data C++ needs in order to
use its Try-Throw-Catch exception model. The stack walker needs
to know where on the stack is the list of stuff to free on block
exit, where are the preserved registers and how many, ...

Ada too.

There are at least two ways to do that (at least for Ada, probably also
for C++):

- Dynamically maintain a stack-like data structure (a chain, linked
list) that describes the current nesting of "code blocks" and their
exception handlers. Whenever the program enters a block with an
exception handler, there is entry code that pushes the description of
that exception handler on this chain, including the address of its code;
and vice versa pop on exiting such a block.

- Statically construct a mapping table that is stored in the executable
and maps code ranges to exception handlers.

Ada implementations started with the dynamic method, which is simpler
but adds some execution cost to all blocks with exception handlers, even
if an exception never happens. Current implementations tend to the
static method, also called "zero-cost exceptions" because there is no
extra execution cost for blocks with exception handlers /unless/ an
exception does occur.
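
A minimal sketch of the dynamic method, written in C with setjmp/longjmp
rather than Ada: each guarded block pushes a node on a handler chain on
entry and pops it on exit, so the cost is paid even when nothing is
raised. The names (handler_frame, raise_exception, div_or) and the error
code are invented for illustration; a real runtime would also run cleanup
code and have a last-chance handler.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* One node per active block that has an exception handler. */
struct handler_frame {
    struct handler_frame *prev;  /* enclosing handler, or NULL */
    jmp_buf resume;              /* where to continue when it fires */
};

static struct handler_frame *handler_chain = NULL;  /* innermost first */

/* Raise: unlink the innermost handler and jump to it. */
static void raise_exception(int code)
{
    struct handler_frame *h = handler_chain;
    assert(h != NULL && code != 0);   /* sketch: no last-chance handler */
    handler_chain = h->prev;
    longjmp(h->resume, code);
}

static int checked_div(int a, int b)
{
    if (b == 0)
        raise_exception(1);   /* hypothetical code for "divide by zero" */
    return a / b;
}

/* Returns a / b, or fallback if the division raised. */
int div_or(int a, int b, int fallback)
{
    struct handler_frame frame;
    frame.prev = handler_chain;        /* entry cost, paid on every call */
    handler_chain = &frame;
    if (setjmp(frame.resume) != 0)
        return fallback;               /* handler body; already unlinked */
    int result = checked_div(a, b);
    handler_chain = frame.prev;        /* exit cost, paid on every call */
    return result;
}
```

The static method removes the push and pop from div_or and replaces them
with a table lookup that only happens after something has been raised.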

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Kent Dickey on Mon Oct 7 19:12:32 2024

On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:

In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.

I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.

I think I am not explaining the issue well.

I'm not arguing what you want to do with overflow. I'm trying to show
that for all uses of detecting overflow other than crashing with no
recovery, hardware trapping on overflow is a poor approach.

If you enable hardware traps on integer overflow, then to do anything
other than crash the program would require engineering a very complex
set of data structures, roughly approximately the complexity of adding
debug information to the executable, in order to make this work. As
far as I know, no one in the history of computers has yet undertaken
this task.

And yet, this is exactly the kind of data C++ needs in order to
use its Try-Throw-Catch exception model. The stack walker needs
to know where on the stack is the list of stuff to free on block
exit, where are the preserved registers and how many, ...

This is because each instruction which overflows would need special
handling, and the "debug" information would be needed. It would be a
huge amount of compiler/linker/runtime complexity.

Kent

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Mon Oct 7 22:26:58 2024

On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Anton Ertl <[email protected]> schrieb:

Michael S <[email protected]> writes:

On Mon, 07 Oct 2024 07:17:02 GMT
[email protected] (Anton Ertl) wrote:

However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for
the FPR specifier.

- anton

Which sounds rather arbitrary.

In a way it is, but see below.

Or even worse, like if he wanted for
SPARC to be called 'typical RISC' and for ARM to be called atypical
and had chosen the numbers to match the agenda.

I think that ARM did not exist for John Mashey.

When was his definition made?

ARM was rather late to the RISC game, this might have been literally
true.

ARM was rather early to the RISC game. Shipped for profit since late
1986. Less than a year after MIPS and ROMP. Several months after
SPARC. PA-RISC first shipped in 1986H1, but volume production
started later than ARM.
Apollo PRISM, Motorola 88K, Intel i960 and AMD 29K all came later than
ARM.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to [email protected] on Mon Oct 7 19:52:51 2024

[email protected] (MitchAlsup1) writes:

On Sun, 6 Oct 2024 14:35:40 +0000, Michael S wrote:

In their defense, AMD's use of the term ROP didn't last for long.
K8 manuals use the better term micro-ops. I don't have a K7 manual to
check, but it seems to me that it uses the same terminology as K8.

K9 used the terms micro-ops and meso-ops to describe before and
after peephole optimization. HW was happy to run either as micro-ops
were a strict subset of meso-ops, meso-ops just got more work
done per cycle.

ARM Neoverse cores use the terms 'macro ops' and 'micro ops',
the decoder produces Macro Ops which exist through renaming
and dispatch stages. Further down the pipeline, a Macro Op
can be split into two Micro Ops which can be issued OoO.

See '2.1 Pipeline Overview'

https://developer.arm.com/documentation/109637/latest/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Thomas Koenig on Tue Oct 8 06:14:59 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

I think that ARM did not exist for John Mashey.

When was his definition made?

<https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>

He reposted in 1995, the first few postings have no date, but they
include the IBM RS/6000 (1990) and 68040 (1990), but not the Alpha
(1992), so I expect that it happened between 1990 and 1992. The ARM
was first released in a development kit for the BBC Micro in 1986, and
then to the mass market in the Archimedes in 1987.

My guess is that it did not exist for John Mashey because it did not
originate in the USA and was sold mainly in home computers and
usually did not run Unix.

Somewhat to my surprise, I just read that there was <https://en.wikipedia.org/wiki/RISC_iX>, which would work on many (but
not all) Archimedes models with some additional hardware (in
particular, a hard disk), and that they sold complete workstations
like the R140 that included this hardware; the R140 (8MHz) cost GBP
3500 in 1989 (without Ethernet). The R260 (30MHz ARM3, 8MB RAM, 100MB
HDD, with Ethernet) cost GBP 3995 in 1990 (or as R225 without hard
disk GBP 1995), which was probably pretty competitive with the likes
of DG Aviion AV 100 (16MHz 88100, 8MB, diskless with 20" monitor for
$4000 in 1990 <https://www.techmonitor.ai/technology/data_general_outdoes_sun_with_4000_17_mips_av_100_risc_station>),
the DECStation 3100, or the HP 9000/425, machines that I had contact
with at the time. A problem may have been that they did not have an
FPU before 1993 (except for the R140, but that cost GBP 599).

ARM was rather late to the RISC game, this might have been literally
true.

What makes you think so? Did you read that in krone.at?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Anton Ertl on Tue Oct 8 14:11:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

Somewhat to my surprise, I just read that there was <https://en.wikipedia.org/wiki/RISC_iX>, which would work on many
(but not all) Archimedes models with some additional hardware (in
particular, a hard disk), and that they sold complete workstations
like the R140 that included this hardware ...

Acorn did not try hard to sell RISC iX to industry, even in the UK. I
remember knowing that it existed, and I /might/ have seen it running at a computer show. Acorn was, by 1990, seen as a specialist educational
supplier. They may have sold some into universities, but I've never
encountered anyone who'd used them.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to EricP on Wed Oct 9 18:42:42 2024

EricP <[email protected]> writes:

Kent Dickey wrote:

And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.

Kent

Because C doesn't require it. That does not make the capability useless.

Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause), and it's best done with conditional branches, not traps.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Kent Dickey on Wed Oct 9 14:16:51 2024

Kent Dickey wrote:

In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and showed how the
other cases don't make sense.

For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about a 50% performance hit
2) it's always on for a whole compilation unit, which is not what programmers
need, as it triggers on many false positives, so people turn it off.

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.

I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.

I think I am not explaining the issue well.

I'm not arguing what you want to do with overflow. I'm trying to show that for all uses of detecting overflow other than crashing with no recovery, hardware trapping on overflow is a poor approach.

If you enable hardware traps on integer overflow, then to do anything other than crash the program would require engineering a very complex set of
data structures, roughly approximately the complexity of adding debug information to the executable, in order to make this work. As far as I know, no one in the history of computers has yet undertaken this task.

VAX/VMS 1.0 in 1979 had stack-based Structured Exception Handling (SEH).
And of course carried it over onto Alpha/VMS.
WinNT had SEH in its first version in 1992 for MIPS and 386,
supported both by the C compiler and OS. Win95 had support too.
In WinNT MS added __try __except keywords to the C language to support
it for both themselves inside the OS and for users.

Some languages like C++ and Ada have native support for SEH.
There can be differences in what behaviors languages expect to be supported, like can one continue from an exception, or pass arguments to a handler.

This is because each instruction which overflows would need special
handling, and the "debug" information would be needed. It would be a huge amount of compiler/linker/runtime complexity.

General structured exception handling is not as complex or expensive
as you think. It's in the multiple 1000's of instructions range
(so don't use it gratuitously).

WinNT implemented it differently on 32-bit x86 and 64-bit x64,
with the x64 method being more efficient because the compiler
does most of the work. On x64 the compiler just needs to supply
bounding low and high RIP's for *just the exception handler code*.

The cost of delivering a structured exception is the OS basically
delivers an exception to a thread dispatcher similar to a signal,
but for structured exceptions that dispatcher code acts differently.
The thread's frame pointer is the head of a single linked list of
stack frames. It starts at the bottom of stack pointed to by the
frame pointer and scans backward, taking the RIP for each context
and looking in a small table of handler bounds to see if it is in range.
If there is a handler, it is called. If it handles it, great.
Otherwise it continues to scan backwards through the stack frames.
If it gets to the top of stack and there is no handler, it invokes the
thread's last chance handler, and if that doesn't intercept the exception,
it terminates the thread.
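
The scan described above can be sketched in portable C over a simulated
stack. This is not the Windows implementation: the frame layout, the
handler_range table and the function names are all assumptions made for
the example, and real unwinding also restores registers and runs cleanup
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*handler_fn)(int code);   /* true = exception handled */

/* Static table entry: handler covers code addresses [low, high). */
struct handler_range {
    uintptr_t low, high;
    handler_fn handler;
};

/* Simulated call frame: saved instruction pointer plus caller link. */
struct frame {
    uintptr_t rip;
    const struct frame *caller;   /* NULL at the outermost frame */
};

static handler_fn find_handler(const struct handler_range *tab, size_t n,
                               uintptr_t rip)
{
    for (size_t i = 0; i < n; i++)
        if (rip >= tab[i].low && rip < tab[i].high)
            return tab[i].handler;
    return NULL;
}

/* Walk from the faulting frame outward. Returns true if some handler
 * accepted the exception, false if the walk ran off the stack (where a
 * real dispatcher would invoke the last-chance handler). */
bool dispatch_exception(const struct frame *innermost,
                        const struct handler_range *tab, size_t n, int code)
{
    assert(tab != NULL || n == 0);
    for (const struct frame *f = innermost; f != NULL; f = f->caller) {
        handler_fn h = find_handler(tab, n, f->rip);
        if (h != NULL && h(code))
            return true;
    }
    return false;
}

/* Two hypothetical handlers for trying the walk out. */
bool decline(int code) { (void)code; return false; }
bool accept_code_42(int code) { return code == 42; }
```

Nothing in dispatch_exception runs unless an exception is delivered, which
is where the lower normal-path cost of the table-driven scheme comes from.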

This is different than most "signal" handlers people have written, where simple inspection of the instruction which failed and the address involved allows it to be "handled". But to do anything other than crash, each instruction which overflows needs special handling unique to that instruction and dependent on what the compiler was in the middle of doing when the overflow happened. This is why trapping just isn't a good idea.

Except you keep missing the point:
no one has a handler for integer overflow because it should never happen.
Just like no one has a handler for memory read parity errors.

When you wrote C code using signed integers, *YOU* guaranteed to the
compiler that your code would never overflow. Overflow checking just
detects when you have made an error, just like array bounds checking,
or divide by zero checking.

This is not something being done *to you* against your will,
this is something that you *ask for* because it helps detect your errors.
Doing it in hardware just makes it efficient.

A better exception usage example might be a routine that enables exceptions
for floating point underflow where the FPU traps to a handler that zeros
the value and logs where it happened so someone can look at it later,
then continues with its calculation.

I'm just explaining why trap-on-overflow has gone away, because it's
almost completely useless: hardware trap on overflow is only good for the case that you want to crash on integer overflow. Branch-on-overflow is the correct approach--the compiler can branch to either a trapping instruction (if you just want to crash), or for all other cases of detecting overflow, the compiler branches to "fixup" code.
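
To make the "fixup" case concrete, here is a sketch of a saturating add,
where the overflow branch goes to recovery code instead of a trap. It
assumes GCC or Clang (for __builtin_add_overflow); add_saturating and
sum_saturating are names invented for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Saturating add: on overflow, take the fixup path and clamp.
 * Signed addition can only overflow when both operands have the same
 * sign, so the sign of a tells which limit was exceeded. */
int add_saturating(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        return (a < 0) ? INT_MIN : INT_MAX;   /* fixup code */
    return sum;
}

/* Sums n values, clamping at every step instead of crashing. */
int sum_saturating(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total = add_saturating(total, v[i]);
    return total;
}
```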

But crash on overflow *IS* the correct behavior in 99.999% of cases.
Branch on overflow is ALSO needed in certain rare cases and I showed how
it is easily detected.

And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.

Kent

Because C doesn't require it. That does not make the capability useless.

Removing error detectors does not make the errors go away,
just your knowledge of them.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Niklas Holsti on Wed Oct 9 14:43:49 2024

Niklas Holsti wrote:

On 2024-10-07 22:12, MitchAlsup1 wrote:

On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:

In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:

Kent Dickey wrote:

In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.

I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.

I think I am not explaining the issue well.

I'm not arguing what you want to do with overflow. I'm trying to show
that for all uses of detecting overflow other than crashing with no
recovery, hardware trapping on overflow is a poor approach.

If you enable hardware traps on integer overflow, then to do anything
other than crash the program would require engineering a very complex
set of data structures, roughly approximately the complexity of adding
debug information to the executable, in order to make this work. As
far as I know, no one in the history of computers has yet undertaken
this task.

And yet, this is exactly the kind of data C++ needs in order to
use its Try-Throw-Catch exception model. The stack walker needs
to know where on the stack is the list of stuff to free on block
exit, where are the preserved registers and how many, ...

Ada too.

There are at least two ways to do that (at least for Ada, probably also
for C++):

- Dynamically maintain a stack-like data structure (a chain, linked
list) that describes the current nesting of "code blocks" and their
exception handlers. Whenever the program enters a block with an
exception handler, there is entry code that pushes the description of
that exception handler on this chain, including the address of its code;
and vice versa pop on exiting such a block.

Usually it uses the frame pointer to create a single linked list of
call frames to walk backwards when scanning for an exception handler.

There is also control block information that needs to be dynamically
set up for each handler, so there is some runtime overhead.

- Statically construct a mapping table that is stored in the executable
and maps code ranges to exception handlers.

The static method moves as much as possible of the control block
information out of the dynamic context, lowering the set up cost
for a handler.

Ada implementations started with the dynamic method, which is simpler
but adds some execution cost to all blocks with exception handlers, even
if an exception never happens. Current implementations tend to the
static method, also called "zero-cost exceptions" because there is no
extra execution cost for blocks with exception handlers /unless/ an
exception does occur.

Windows used the dynamic method in 32-bit x86 OS and switched
to static method on 64-bit x64 as it has lower runtime overhead.

Structured Exception Handling (C/C++) https://learn.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-170

x64 exception handling https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170

Exception handling in MSVC https://learn.microsoft.com/en-us/cpp/cpp/exception-handling-in-visual-cpp?view=msvc-170

Modern C++ best practices for exceptions and error handling https://learn.microsoft.com/en-us/cpp/cpp/errors-and-exception-handling-modern-cpp?view=msvc-170

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Scott Lurndal on Wed Oct 9 15:08:05 2024

Scott Lurndal wrote:

EricP <[email protected]> writes:

Kent Dickey wrote:

And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.

Kent

Because C doesn't require it. That does not make the capability useless.

Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause),
and it's best done with conditional branches, not traps.

Then you use the overflow branching form for those situations
where you have a specific local overflow handler. Nothing stops that.

But that is not a justification for getting rid of overflow trapping instructions altogether, as Kent was making. And actually it looks to me,
not knowing Cobol, like it should use overflow trapping instructions
UNLESS there is an ON OVERFLOW clause. i.e. that the default should be to
treat overflow as an error unless you explicitly state how to handle it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to EricP on Wed Oct 9 19:43:39 2024

On Wed, 9 Oct 2024 18:16:51 +0000, EricP wrote:

Except you keep missing the point:
no one has a handler for integer overflow because it should never
happen. Just like no one has a handler for memory read parity errors.

Au contraire:
I understand how to recover from even "late write ECC violations*"--
but mostly that is because I am primarily a HW guy. (*) When a cache
line displaced from L1 or L2 arrives at L3/DRAM with a bad ECC.

When you wrote C code using signed integers, *YOU* guaranteed to the
compiler that your code would never overflow. Overflow checking just
detects when you have made an error, just like array bounds checking,
or divide by zero checking.

I disagree with this statement. I wrote in C under the knowledge
that integer data types can overflow--they have to be able to--
it is the nature of fixed size containers. I am happy for the
compiler to IGNORE the possibility of overflow, but not the HW.

This is not something being done *to you* against your will,
this is something that you *ask for* because it helps detect your errors.
Doing it in hardware just makes it efficient.

Yes, allow the compiler to IGNORE the problem, but have HW detect the
problem.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Robert Finch on Wed Oct 9 21:36:21 2024

On Wed, 9 Oct 2024 20:12:40 +0000, Robert Finch wrote:

On 2024-10-09 2:16 p.m., EricP wrote:

Kent Dickey wrote:

But crash on overflow *IS* the correct behavior in 99.999% of cases.
Branch on overflow is ALSO needed in certain rare cases and I showed how
it is easily detected.

And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.

Kent

Because C doesn't require it. That does not make the capability useless.

Removing error detectors does not make the errors go away,
just your knowledge of them.

Slightly confused on trap versus branch. Trapping on overflow is not a
good solution, but a branch on overflow is? A trap is just a slow
branch. The reason for trapping was to improve code density and non-exceptional performance.
If it is the overhead of performing a trap operation that is the issue,

x86 has seriously distorted people's view of how much overhead is
associated with a trap*. MIPS had trap handlers measuring in the
17-cycle range for getting to the handler, handling the exception,
and getting back to the instruction that trapped. Since GBOoO windows
recover from mispredicted branches with this kind of latency, too, a
properly designed architecture should be able to do similarly to MIPS.

Whereas x86 may take 1,000 cycles to get to the handler. This is due
to all the Descriptor table stuff, call-gates, protection rings, and segmentation.

(*) trap == exception == fault == any unpredicted control flow
caused by the instruction stream itself (SVC-et-al not included
because it is requested by the instruction stream).

then a special register could be dedicated to holding the overflow
handler address, and instructions defined to automatically jump through
the overflow handler address register (a branch target address
register).
Overflow detecting instructions are just a fusion of the instruction and
the following branch on overflow operation.

addjo r1,r2,r3 <- does a jump (instead of a trap) to branch register #7
for instance, on overflow.

Having an overflow branch register might be better for code density / performance.

What if you want to handle multiply overflow differently than
addition overflow ??
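
A software model of the idea, for discussion only: the "overflow branch
register" is a function pointer, and the jump is modelled as a call whose
return value replaces the result. Passing the operation kind to the target
is just one hypothetical way to tell multiply overflow from add overflow;
it is not part of the proposal above. Assumes GCC or Clang for the
overflow builtins.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef enum { OP_ADD, OP_MUL } alu_op;

/* Stand-in for the proposed overflow branch register: holds the address
 * control transfers to when a checked operation overflows. */
typedef int (*overflow_target)(alu_op op, int a, int b);
static overflow_target overflow_reg = NULL;

void set_overflow_target(overflow_target t) { overflow_reg = t; }

/* Model of "addjo rd, ra, rb": add, and jump through the register on
 * overflow. */
int addjo(int a, int b)
{
    int r;
    if (__builtin_add_overflow(a, b, &r)) {
        assert(overflow_reg != NULL);
        return overflow_reg(OP_ADD, a, b);
    }
    return r;
}

/* Same idea for multiply, sharing the single target register. */
int muljo(int a, int b)
{
    int r;
    if (__builtin_mul_overflow(a, b, &r)) {
        assert(overflow_reg != NULL);
        return overflow_reg(OP_MUL, a, b);
    }
    return r;
}

/* Hypothetical target: saturate adds, yield 0 for multiplies. */
int example_target(alu_op op, int a, int b)
{
    (void)b;
    if (op == OP_ADD)
        return (a < 0) ? INT_MIN : INT_MAX;
    return 0;
}
```

With one shared register, the target has to be told (or has to work out)
which operation overflowed, which is the cost the question above points at.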

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to [email protected] on Thu Oct 10 15:36:32 2024

On Wed, 9 Oct 2024 21:36:21 +0000
[email protected] (MitchAlsup1) wrote:

On Wed, 9 Oct 2024 20:12:40 +0000, Robert Finch wrote:

x86 has seriously distorted people's view of how much overhead is
associated with a trap*.

Do you have an opinion about FRED? https://cdrdv2-public.intel.com/819481/346446-flexible-return-and-event-delivery.pdf

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to EricP on Thu Oct 10 15:32:21 2024

EricP <[email protected]> writes:

Scott Lurndal wrote:

EricP <[email protected]> writes:

Kent Dickey wrote:

And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.

Kent

Because C doesn't require it. That does not make the capability useless.

Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause),
and it's best done with conditional branches, not traps.

Then you use the overflow branching form for those situations
where you have a specific local overflow handler. Nothing stops that.

But that is not a justification for getting rid of overflow trapping
instructions altogether, as Kent was making. And actually it looks to me,
not knowing Cobol, like it should use overflow trapping instructions
UNLESS there is an ON OVERFLOW clause. i.e. that the default should be to
treat overflow as an error unless you explicitly state how to handle it.

See https://www.mainframestechhelp.com/tutorials/cobol/size-error-phrase.htm

The default is to truncate. All other cases can be handled with
a conditional branch.
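
A rough C model of that behaviour, storing a value into a fixed-width
unsigned decimal field (think PIC 9(n)): the store truncates high-order
digits, and the returned flag is what a SIZE ERROR style check would
branch on. The function name and interface are invented, and this does
not try to reproduce COBOL's exact rules (for example, whether the field
is left unchanged when a size error is reported).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stores value into an unsigned decimal field of `digits` digits.
 * The field always receives the low-order digits (truncation); the
 * return value is true when high-order digits were lost. */
bool store_decimal(unsigned long value, unsigned digits, unsigned long *field)
{
    assert(field != NULL && digits >= 1 && digits <= 9);
    unsigned long limit = 1;
    for (unsigned i = 0; i < digits; i++)
        limit *= 10;              /* limit = 10^digits */
    *field = value % limit;       /* default: drop high-order digits */
    return value >= limit;        /* caller may branch on this */
}
```

No trap is involved in either path: the caller either ignores the flag or
tests it with an ordinary conditional branch.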

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Fri Oct 11 01:40:15 2024

On Mon, 7 Oct 2024 10:17:26 +0200, Terje Mathisen wrote:

The single most canonical test for IBM PC compatibility was Microsoft's
Flight Simulator, taking off from the now demolished Meigs Field in
Chicago.

That game used the OS and BIOS for the loading of the game, and then
went on to direct hardware access for pretty much the rest of the
playing time.

I can remember Flight Simulator being used as the benchmark for
compatibility as far back as 1985. A report on a computer show mentioned
that clone makers were demoing it running on their products.

This is why I feel the term “IBM compatible” was misleading, it should
have been “Microsoft compatible” from at least that point on.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 11 01:41:58 2024

On Mon, 7 Oct 2024 13:05:53 +0300, Michael S wrote:

In all cases the vendor of GPU changed ...

That, too, added to the problem, in that the software folks had to rewrite
all the performance-intensive bits yet again for the new machine.

OpenCL never took off because the GPGPU market simply isn’t competitive enough. NVidia is dominant, AMD plays second fiddle, and that’s it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 11 01:46:51 2024

On Mon, 7 Oct 2024 22:26:58 +0300, Michael S wrote:

On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

ARM was rather late to the RISC game, this might have been literally
true.

ARM was rather early to the RISC game. Shipped for profit since late
1986.

Shipped in an actual PC, the Acorn Archimedes range.

That was the first time I ever saw a 3D shaded rendition of a flag waving,
on a computer, generated in real time. No other machine could do it,
unless you got up to the really expensive Unix workstation class (e.g.
SGI, custom Evans & Sutherland hardware etc).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri Oct 11 06:42:15 2024

Lawrence D'Oliveiro <[email protected]d> writes:

I can remember Flight Simulator being used as the benchmark for
compatibility as far back as 1985. A report on a computer show mentioned
that clone makers were demoing it running on their products.

This is why I feel the term “IBM compatible” was misleading, it should
have been “Microsoft compatible” from at least that point on.

It was IBM PC compatible, and that was not misleading, because that's
what it was about. "Microsoft compatible" would have been misleading
(if you want it to mean the same as "IBM PC compatible"), because lots
of hardware was Microsoft DOS compatible that was not an IBM PC clone
and therefore not 100% IBM PC compatible. And MS-DOS was certainly
the higher-profile Microsoft product than the Flight Simulator.

And many buyers did not care about the Flight Simulator, but more
about Lotus 1-2-3, which also required an IBM PC compatible machine.

Of course you saw the Flight Simulator a lot at shows: Moving pictures
attract the eye in a way that a static spreadsheet screen does not.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri Oct 11 14:20:20 2024

On 11/10/2024 03:46, Lawrence D'Oliveiro wrote:

On Mon, 7 Oct 2024 22:26:58 +0300, Michael S wrote:

On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

ARM was rather late to the RISC game, this might have been literally
true.

ARM was rather early to the RISC game. Shipped for profit since late
1986.

Shipped in an actual PC, the Acorn Archimedes range.

That was the first time I ever saw a 3D shaded rendition of a flag waving,
on a computer, generated in real time. No other machine could do it,
unless you got up to the really expensive Unix workstation class (e.g.
SGI, custom Evans & Sutherland hardware etc).

The Acorn Archimedes was /way/ ahead of anything in the PC / x86 world,
both in hardware and software. It could emulate an 80286 PC almost as
fast as real PC's that you could buy at the time for a higher price than
the Archimedes.

The demo that impressed me most was drawing full-screen Mandelbrot sets
in a second or two, compared to several minutes for a typical PC at the
time. It meant you could do real-time zooming and flying around in the set.

My first encounter with ARM assembly was enhancing that demo program for
higher screen resolution and deeper zooming.


From Anton Ertl@21:1/5 to Terje Mathisen on Sat Oct 12 08:23:39 2024

Terje Mathisen <[email protected]> writes:

Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra
cycles as long as the trap part isn't hit?

On Intel P-cores add/inc/sub etc. has been fused with a following
JO/JNO into one uop for quite a while (I guess since Sandy Bridge
(2011)).
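
For illustration only, a minimal C sketch of source code that tends to produce such add/jo pairs on x86-64. It assumes GCC or Clang, since __builtin_add_overflow is a compiler extension rather than ISO C, and checked_add is a hypothetical helper name. Whether the compiler really emits add followed by jo, and whether a given core fuses the pair, depends on compiler version, flags and microarchitecture.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): stores the wrapped sum of
   a and b in *sum and returns true if the signed addition overflowed.
   __builtin_add_overflow is a GCC/Clang builtin, not ISO C. */
bool checked_add(long a, long b, long *sum)
{
    assert(sum != NULL);
    return __builtin_add_overflow(a, b, sum);
}
```

A caller would typically branch on the result, e.g. `if (checked_add(a, b, &s)) abort();`, which is the pattern that gives the compiler an add immediately followed by a conditional jump on the overflow flag.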

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to EricP on Sat Oct 12 08:45:57 2024

EricP <[email protected]> writes:

But then, risc processors mostly, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned
memory access

On Alpha the assembler expands byte, word and unaligned access
mnemonics into sequences of machine instructions; if you compile for
BWX extensions, byte and word mnemonics get compiled into BWX
instructions. If the machine does not have the BWX extensions and it encounters a BWX instruction, the result is an illegal instruction
signal at least on Linux. This terminates your typical program, so
it's not at all frequent.

Concerning unaligned accesses, if you use a load or store that
requires alignment, Digital OSF/1 (and the later versions with various
names) by default produced a signal rather than fixing it up, so again
programs are typically terminated, and the exception is not at all
frequent. There is a system call and a tool (uac) that allows telling
the OS to fix up unaligned accesses, but it played no role in my
experience while I was still using Digital OSF/1 (including its
successors).

On Linux the default behaviour was to fix up the unaligned accesses
and to log that in the system log. There were a few such messages in
the log per day, so that obviously was not a frequent occurrence,
either. I wrote a program that allowed me to change the behaviour <https://www.complang.tuwien.ac.at/anton/uace.c>, mainly because I
wanted to get a signal when an unaligned access happens.
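
For reference, a hedged sketch of how C code can read a possibly unaligned 64-bit value without dereferencing a misaligned pointer, and hence without relying on either a signal or the OS fix-up. load_u64_unaligned is a hypothetical name; which instruction sequence a given compiler emits for the memcpy on Alpha is not something this sketch claims.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy 8 bytes from an arbitrarily aligned address
   into a properly aligned local and return it. The memcpy tells the
   compiler that no alignment may be assumed for p. */
uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

By contrast, casting the address to `uint64_t *` and dereferencing it is the kind of access that an alignment-requiring load such as ldq would trap on.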

As for the unaligned-access mnemonics, these were obviously barely
used: I found that gas generates wrong code for ustq several years
after Alpha was introduced, so obviously no software running under
Linux has used this mnemonic.

The solution for Alpha was to add back the byte and word instructions,
and add misaligned access support to all memory ops.

Alpha added BWX instructions, but not because it had used trapping to
emulate them earlier; old or portable binaries continued to use
instruction sequences. Alpha traps when you do, e.g., an unaligned
ldq in all Alpha implementations I have had contact with (up to a
800MHz 21264B).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to EricP on Sat Oct 12 09:18:23 2024

EricP <[email protected]> writes:

Kent Dickey wrote:

[...]

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.

...

So why should any hardware include an instruction to trap-on-overflow?

Because ALL the negative speed and code size consequences do not occur.

Looking at <https://godbolt.org/z/oMhW55YsK> and selecting MIPS clang
18.1.0, I get a 15-instruction sequence which does not include add
(the trap-on-overflow version).

MIPS gcc 14.2.0 generates a sequence that includes

jal __addvsi3

i.e., just as for x86-64. Similar for MIPS64 with these compilers.

Interestingly, with RISC-V rv64gc clang 18.1.0, the sequence is much
shorter than for MIPS clang 18.1.0, even though RV64GC has no specific
way of checking overflow at all.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Michael S on Sat Oct 12 10:23:18 2024

Michael S <[email protected]> writes:

That's correct about intrinsics, but incorrect about ADCX/ADOX.
The latter can be moderately helpful in special situations, esp.
128b * 128b => 256b multiplication, but it is never necessary
and for addition/subtraction is not needed at all.

They are useful if there are two strings of additions. This happens
naturally in wide multiplication (also beyond 256b results). But it
also happens when you add three multi-precision numbers (say, X, Y,
Z): You need C for the carry of XYi=X[i]+Y[i]+C, and O for the carry
of XYZ[i]=XYi+Z[i]+O. If you have ADCX/ADOX, you can do both
additions in one loop, so XYi can be in a register and does not need
to be stored. If you don't have these instructions, only ADC, you
need one loop to compute X+Y and store the result in memory, and one
loop to compute XY+Z, i.e., the lack of ADCX/ADOX results in
substantial additional cost.

If you add 4 multi-precision numbers, AMD64 with ADX runs out of carry
bits, so you have to spend the overhead of an additional loop (but not
of two additional loops as without ADCX/ADOX).
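
To make the two carry chains concrete, here is a portable C sketch under stated assumptions; it is not code from the linked carry.pdf or from Intel. add3_mp is a hypothetical name, limbs are 64 bits wide and stored least significant first, and the variables c and o play the roles of the C and O flags above. Whether a compiler maps the two chains onto ADCX/ADOX is not guaranteed; the point is only that XYi stays in a local variable and one loop suffices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: r = x + y + z over n 64-bit limbs.
   Returns the carry out of the top limb (0, 1 or 2). */
unsigned add3_mp(uint64_t *r, const uint64_t *x, const uint64_t *y,
                 const uint64_t *z, size_t n)
{
    unsigned c = 0; /* carry chain of X + Y  (the "C" chain) */
    unsigned o = 0; /* carry chain of XY + Z (the "O" chain) */

    assert(n == 0 || (r && x && y && z));
    for (size_t i = 0; i < n; i++) {
        /* First chain: xy = x[i] + y[i] + c. */
        uint64_t t = x[i] + c;
        unsigned c1 = t < c;       /* wrapped while adding the carry in */
        uint64_t xy = t + y[i];
        c = c1 | (xy < t);         /* at most one of the two can wrap */

        /* Second chain: r[i] = xy + z[i] + o; xy is never stored. */
        uint64_t u = xy + o;
        unsigned o1 = u < o;
        uint64_t s = u + z[i];
        o = o1 | (s < u);
        r[i] = s;
    }
    return c + o;
}
```

With only a single carry flag, the same computation needs the X+Y loop to finish and store its result before the second loop can start, which is the extra cost described above.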

With carry bits in the general purpose registers <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> and 30 GPRs
(one is zero, one is sp), you can add 14 multi-precision numbers per
loop: 14 GPRs for source addresses, 1 GPR for the target address, 1
for the loop counter, 13 registers for loop-carried carry flags.

Of course, the question is if this kind of computation is needed
frequently enough to justify this kind of extension. For
multi-precision multiplication and squaring, Intel considered the
frequency relevant enough to introduce ADCX/ADOX/MULX.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to Anton Ertl on Sun Oct 13 13:00:14 2024

On Sat, 12 Oct 2024 10:23:18 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

That's correct about intrinsics, but incorrect about ADCX/ADOX.
The latter can be moderately helpful in special situations, esp.
128b * 128b => 256b multiplication, but it is never necessary
and for addition/subtraction is not needed at all.

They are useful if there are two strings of additions. This happens naturally in wide multiplication (also beyond 256b results). But it
also happens when you add three multi-precision numbers (say, X, Y,
Z): You need C for the carry of XYi=X[i]+Y[i]+C, and O for the carry
of XYZ[i]=XYi+Z[i]+O. If you have ADCX/ADOX, you can do both
additions in one loop, so XYi can be in a register and does not need
to be stored. If you don't have these instructions, only ADC, you
need one loop to compute X+Y and store the result in memory, and one
loop to compute XY+Z, i.e., the lack of ADCX/ADOX results in
substantial additional cost.

If you add 4 multi-precision numbers, AMD64 with ADX runs out of carry
bits, so you have to spend the overhead of an additional loop (but not
of two additional loops as without ADCX/ADOX).

With carry bits in the general purpose registers <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> and 30 GPRs
(one is zero, one is sp), you can add 14 multi-precision numbers per
loop: 14 GPRs for source addresses, 1 GPR for the target address, 1
for the loop counter, 13 registers for loop-carried carry flags.

Of course, the question is if this kind of computation is needed
frequently enough to justify this kind of extension. For
multi-precision multiplication and squaring, Intel considered the
frequency relevant enough to introduce ADCX/ADOX/MULX.

- anton

That's not bad. I think you can see for yourself that the spill and
context-switch parts could benefit from more work.
But I suspect that the main opposition you'll face in the RISC-V
organization will center not on that, but on fear of an increase in cycle
time, whether or not that fear is backed by hard numbers.


From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 13 13:10:58 2024

On Fri, 11 Oct 2024 01:41:58 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Mon, 7 Oct 2024 13:05:53 +0300, Michael S wrote:

In all cases the vendor of GPU changed ...

That, too, added to the problem, in that the software folks had to
rewrite all the performance-intensive bits yet again for the new
machine.

OpenCL never took off because the GPGPU market simply isn’t
competitive enough. NVidia is dominant, AMD plays second fiddle, and
that’s it.

I am not sure about dog-tail relationships.
To me it sounds plausible that NV dominates due to a better software story.

At least that's what I see in certain sectors of the embedded market -
people prefer the old NV Jetson Xavier over newer AMD and Intel SoCs that
are much better not only on the CPU side, but also provide many more
FLOPs on the GPU side. And the reason is that they are much more certain
that they will be able to write programs for NV GPUs than they are for
AMD or Intel GPUs.


From Anton Ertl@21:1/5 to Michael S on Sun Oct 13 15:16:08 2024

Michael S <[email protected]> writes:

In their defense, AMD's use of the term ROP didn't last for long.
K8 manuals use the better term micro-ops. I don't have a K7 manual to
look at, but it seems to me that it uses the same terminology as K8.

I have come across ROP (and its expansion RISC op) relatively
recently, but maybe it was in third-party material. Their evil deeds
of the past come back to haunt them:-).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Oct 14 23:39:59 2024

On Sun, 13 Oct 2024 13:10:58 +0300, Michael S wrote:

On Fri, 11 Oct 2024 01:41:58 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

OpenCL never took off because the GPGPU market simply isn’t competitive
enough. NVidia is dominant, AMD plays second fiddle, and that’s it.

I am not sure about dog-tail relationships.

In a market dominated by one player, the dominant player tends not to like
open standards. Open standards allow competitors to get a foot in the
door, and the dominant player doesn’t like that.


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Oct 14 23:38:43 2024

On Fri, 11 Oct 2024 06:42:15 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

I can remember Flight Simulator being used as the benchmark for
compatibility as far back as 1985. A report on a computer show mentioned
that clone makers were demoing it running on their products.

This is why I feel the term “IBM compatible” was misleading, it should
have been “Microsoft compatible” from at least that point on.

It was IBM PC compatible, and that was not misleading, because that's
what it was about.

But then IBM came along shortly afterwards with their PS/2 range, which no longer defined the standard for compatibility.

So at that point it was either “Microsoft compatible” or nothing.

... lots of hardware was Microsoft DOS compatible ...

Yes it was, but none of them could run Flight Simulator.


From EricP@21:1/5 to Anton Ertl on Mon Oct 14 21:44:06 2024

Anton Ertl wrote:

EricP <[email protected]> writes:

Kent Dickey wrote:

[...]

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.

....

So why should any hardware include an instruction to trap-on-overflow?

Because ALL the negative speed and code size consequences do not occur.

Looking at <https://godbolt.org/z/oMhW55YsK> and selecting MIPS clang
18.1.0, I get a 15-instruction sequence which does not include add
(the trap-on-overflow version).

MIPS gcc 14.2.0 generates a sequence that includes

jal __addvsi3

i.e., just as for x86-64. Similar for MIPS64 with these compilers.

Interestingly, with RISC-V rv64gc clang 18.1.0, the sequence is much
shorter than for MIPS clang 18.1.0, even though RV64GC has no specific
way of checking overflow at all.

- anton

Yes. So even when the ADD instruction is available they won't use it.
At least clang for MIPS64 uses one of the overflow detect idioms inlined.
Gcc calls that rather expensive subroutine.

I changed your example to use long instead of int
to avoid any partial register issues.
Also I added a third argument just to see what it would do.
It generates slightly different code for the second check.

long add3 (long a, long b, long c) {
    return a + b + c;
}

I also tried Ada mips64 gnat 14.2.0 -O2 (below).
It also didn't use the ADD which traps but uses a different idiom inlined.

Both examples should have taken 3 instructions
add3:
        dadd    $2, $4, $5          ; r2 = r4 + r5
        dadd    $2, $2, $6          ; r2 = r2 + r6
        jr      $ra
        nop

but what clang generated was:

; The comments on the right are mine
add3:
        daddiu  $sp, $sp, -16       ; set up call frame
        sd      $ra, 8($sp)
        sd      $fp, 0($sp)
        move    $fp, $sp
        daddu   $3, $4, $5          ; r3 = r4 + r5
        slt     $1, $3, $4          ; r1 = r3 < r4
        slti    $2, $5, 0           ; r2 = r5 < 0
        bne     $2, $1, .LBB0_3     ; if (r2 != r1) goto Overflow
        nop
        daddu   $2, $3, $6          ; r2 = r3 + r6
        slt     $1, $2, $3          ; r1 = r2 < r3
        slti    $3, $6, 0           ; r3 = r6 < 0
        xor     $1, $3, $1          ; r1 = r3 ^ r1 (nonzero if they differ)
        bnez    $1, .LBB0_3         ; if (r1 != 0) goto Overflow
        nop
        move    $sp, $fp            ; pop frame
        ld      $fp, 0($sp)
        ld      $ra, 8($sp)
        jr      $ra
        daddiu  $sp, $sp, 16
.LBB0_3:
        break

====================================

-- Ada mips64 gnat 14.2.0 -O2
function add3 (a, b, c : Long_Integer) return Long_Integer is
begin
   return a + b + c;
end add3;

.LC0:
        .ascii "example.adb"
        .space 1
_ada_add3:
        daddu $3,$4,$5 # tmp205, a, b
        xor $4,$4,$5 # tmp206, a, b
        nor $4,$0,$4 # tmp208, tmp206
        xor $5,$3,$5 # tmp207, tmp205, b
        and $5,$5,$4 # tmp209, tmp207, tmp208
        bltz $5,.L7 #, tmp209,
        daddu $2,$3,$6 # tmp212, tmp205, c
        xor $3,$3,$6 # tmp213, tmp205, c
        nor $3,$0,$3 # tmp215, tmp213
        xor $6,$2,$6 # tmp214, tmp212, c
        and $6,$6,$3 # tmp216, tmp214, tmp215
        bltz $6,.L7
        nop
        jr $31
        nop
.L7:
        daddiu $sp,$sp,-16 #,,
        sd $28,0($sp) #,
        lui $28,%hi(%neg(%gp_rel(_ada_add3))) #,
        daddu $28,$28,$25 #,,
        daddiu $28,$28,%lo(%neg(%gp_rel(_ada_add3))) #,,
        ld $4,%got_page(.LC0)($28) # tmp210,,
        ld $25,%call16(__gnat_rcheck_CE_Overflow_Check)($28) # tmp211,,
        sd $31,8($sp) #,
        li $5,3 # 0x3 #,
1:      jalr $25 # tmp211
        daddiu $4,$4,%got_ofst(.LC0) #, tmp210,
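
The two inlined idioms in the listings above can be written in C roughly as follows. This is a sketch that assumes a two's-complement long and the GCC/Clang behaviour of wrapping when an out-of-range unsigned value is converted to a signed type. All function names are hypothetical, and the code is a reading of the assembly, not something taken from either compiler.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wrapping add, done in unsigned arithmetic to avoid signed-overflow UB.
   The conversion back to long wraps on GCC and Clang (assumption). */
long wrap_add(long a, long b)
{
    return (long)((unsigned long)a + (unsigned long)b);
}

/* Idiom of the clang listing (slt/slti/bne): adding a negative b must
   make the sum smaller than a, adding a non-negative b must not. */
bool overflows_cmp(long a, long b)
{
    long sum = wrap_add(a, b);
    return (sum < a) != (b < 0);
}

/* Idiom of the GNAT listing (xor/nor/xor/and/bltz): overflow if a and b
   have the same sign and the sum's sign differs from it. */
bool overflows_xor(long a, long b)
{
    long sum = wrap_add(a, b);
    return ((sum ^ b) & ~(a ^ b)) < 0;
}

/* Returns the common verdict and checks that both idioms agree. */
bool add_overflows(long a, long b)
{
    bool v = overflows_cmp(a, b);
    assert(v == overflows_xor(a, b));
    return v;
}
```

Both forms need a handful of ALU instructions plus a branch per addition, whereas a trapping add would need none of them on the non-overflow path.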


From EricP@21:1/5 to Anton Ertl on Tue Oct 15 12:59:03 2024

Anton Ertl wrote:

EricP <[email protected]> writes:

But then, risc processors mostly, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned
memory access

On Alpha the assembler expands byte, word and unaligned access
mnemonics into sequences of machine instructions; if you compile for
BWX extensions, byte and word mnemonics get compiled into BWX
instructions. If the machine does not have the BWX extensions and it encounters a BWX instruction, the result is an illegal instruction
signal at least on Linux. This terminates your typical program, so
it's not at all frequent.

Ah yes, that was it. After they added BWX to the 21164A in 1996,
for older 21064 models VMS had an optional illegal instruction exception
handler that caught BWX instructions, emulated them and continued,
or terminated.

Concerning unaligned accesses, if you use a load or store that
requires alignment, Digital OSF/1 (and the later versions with various
names) by default produced a signal rather than fixing it up, so again programs are typically terminated, and the exception is not at all
frequent. There is a system call and a tool (uac) that allows telling
the OS to fix up unaligned accesses, but it played no role in my
experience while I was still using Digital OSF/1 (including its
successors).

On Linux the default behaviour was to fix up the unaligned accesses
and to log that in the system log. There were a few such messages in
the log per day, so that obviously was not a frequent occurrence,
either. I wrote a program that allowed me to change the behaviour <https://www.complang.tuwien.ac.at/anton/uace.c>, mainly because I
wanted to get a signal when an unaligned access happens.

IIRC on VMS the unaligned exception was caught and could optionally
log a diagnostic, execute a fixup handler and continue, or terminate.

As for the unaligned-access mnemonics, these were obviously barely
used: I found that gas generates wrong code for ustq several years
after Alpha was introduced, so obviously no software running under
Linux has used this mnemonic.

The solution for Alpha was to add back the byte and word instructions,
and add misaligned access support to all memory ops.

Alpha added BWX instructions, but not because it had used trapping to
emulate them earlier; old or portable binaries continued to use
instruction sequences. Alpha traps when you do, e.g., an unaligned
ldq in all Alpha implementations I have had contact with (up to a
800MHz 21264B).

- anton

You are right... they didn't add misaligned access to all LD and ST.
Except for LDQ_U and STQ_U they still fault on non-natural alignment.


From Bernd Linsel@21:1/5 to Anton Ertl on Tue Oct 15 21:24:11 2024

On 12.10.24 11:18, Anton Ertl wrote:

EricP <[email protected]> writes:

Kent Dickey wrote:

[...]

GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.

...

So why should any hardware include an instruction to trap-on-overflow?

Because ALL the negative speed and code size consequences do not occur.

Looking at <https://godbolt.org/z/oMhW55YsK> and selecting MIPS clang
18.1.0, I get a 15-instruction sequence which does not include add
(the trap-on-overflow version).

MIPS gcc 14.2.0 generates a sequence that includes

jal __addvsi3

i.e., just as for x86-64. Similar for MIPS64 with these compilers.

Interestingly, with RISC-V rv64gc clang 18.1.0, the sequence is much
shorter than for MIPS clang 18.1.0, even though RV64GC has no specific
way of checking overflow at all.

- anton

Very irritating: https://godbolt.org/z/KsMc3KfKc

Why do neither gcc nor clang use MIPS's trap-on-overflow addition
operators, while they indeed use teq <divisor>, 0 for a division-by-zero
check?

--
Bernd Linsel


From Waldek Hebisch@21:1/5 to John Dallman on Sat Oct 26 18:37:14 2024

John Dallman <[email protected]> wrote:

I see where I'm going wrong: I'm trying to talk about the machines
designed to run MS-DOS and later Windows, not just the CPUs. The vast
range of hardware that all had substantial degrees of compatibility as regards booting, busses and so on. Those things let their manufacturers compete for the DOS and Windows market, whereas x86-based machines that weren't PC-compatible only succeeded in quite specialised niches.

Those hardware suppliers did not close off access to the more advanced features of i386 onwards, because they had no reason to, and that let
Linux take advantage of all that hardware when it came along. That's the point I was failing to make.

I think this is still misleading. Not only was the 386 a _much_ more
ambitious design than just a "processor for running DOS"; hardware
manufacturers also cared about running more things than just
DOS. And "running DOS" is misleading too: for many "DOS applications"
DOS provided just a program loader and file system access. Such
applications could switch to protected mode, use multitasking
and 32-bit addressing. There were "DOS extenders". Before
Windows gained market dominance there were competing GUIs.
There were PC servers, which at some time meant Novell.

So things critical to Linux were also important on the general PC
market. Clearly Linux benefited from the availability of commodity
PCs. But the things that made a PC a good PC were correlated with
being a good Linux machine. As a little anecdote, let me add that
small sellers frequently used Linux as a tester for the PCs they
were selling, as it stressed machines more than "typical"
DOS applications.

--
Waldek Hebisch


From Waldek Hebisch@21:1/5 to Anton Ertl on Mon Oct 28 23:45:53 2024

Anton Ertl <[email protected]> wrote:

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

[in-memory database]

but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).

The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.

What is the relevance of SAP HANA for the topic at hand?

The question was if the RAM can hold the data. For each account they
would have to keep the current balance (64 bits should be enough for
that), the account number (64 bits for the up to 19 digits of a Visa
card) for verifying that we are at the correct entry in the hash table
and probably some account status information (64 bits should be
plenty?).

There is also the sequence of transactions (a 64-bit transaction
offset in the log per transaction should be enough for that). The
sequence of transactions may be useful for fraud detection, but I
don't know enough about that to know how to scale the system, so I'll
just say that fraud detection is done by a bigger system before the transaction goes through to the transaction processing computer.

The sequence of transactions is also needed for generating the reports
and for dealing with customer complaints, but again, that's not
processing the transactions themselves (and is basically read-only,
except that the customer-complaint processing may result in additional transactions).

So, with 24 bytes needed for each account on the
transaction-processing server, 32GB with, say 8GB left for
copy-on-write and other administrative purposes should be good for
about 900M accounts at a hash table load factor of 84%. I guess that
Visa has more accounts, so one would need a box with more RAM.
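
The arithmetic behind the 900M figure can be checked with a few lines of C. This is a sketch under the assumptions stated in the paragraph above (three 64-bit fields per account, 24GB of the 32GB usable for the table, a load factor of 84%). struct account_slot and accounts_that_fit are hypothetical names, and a GB is taken as 2^30 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* One hash table slot: the three 64-bit fields listed above, 24 bytes
   on the usual ABIs. The field names are illustrative only. */
struct account_slot {
    uint64_t account_number; /* up to 19 decimal digits fit in 64 bits */
    int64_t  balance;
    uint64_t status;
};

/* Number of accounts that fit into usable_bytes of table space when the
   table is filled to load_percent percent of its slots. */
uint64_t accounts_that_fit(uint64_t usable_bytes, unsigned load_percent)
{
    assert(load_percent <= 100);
    uint64_t slots = usable_bytes / sizeof(struct account_slot);
    return slots * load_percent / 100;
}
```

Calling accounts_that_fit(24ULL << 30, 84) gives 2^30 slots times 0.84, i.e. 901,943,132 accounts, which matches the "about 900M" estimate.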

A single core of the Xeon should easily be able to handle all the 56K transactions per second, both the logging and the update of the hash
table, and in that case no locking is needed. But that first needs a sequence of transactions coming in.

AFAICS the main transaction processor does not need to know about individual
cards. Cards are issued by banks, and clearly the bank needs info about
the card and customer. The main transaction processor could deal only with
banks, namely verify that the bank is solvent (the _bank_ balance stays within
agreed limits) and that the message data is legit.

OTOH, the information about a specific card is likely to be bigger. A card
may have a daily limit on transactions; that is another number to
keep (actually two: the limit and the daily amount of transactions). There is
information used to verify the validity of a transaction, like customer
name, validity date of the card and validation code. Financial
institutions are regulated and may be legally obliged to keep and
check some information which is technically not needed for processing
of transactions (IIUC in my country banks are not allowed to
directly transfer money between themselves; all transfers are
supposed to go through the central national bank).

I think that the transaction center wants to keep more information
than a single copy of the data in RAM: with a single copy any
memory corruption could mean the loss of hours of transaction data,
which is equivalent to quite a lot of cash. So I suspect that
there are layers of redundancy built in. And even if
what they do is suboptimal performance-wise, they are probably very
reluctant to change the core accounting code.

--
Waldek Hebisch


From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Tue Oct 29 00:17:34 2024

On Mon, 28 Oct 2024 23:45:53 -0000 (UTC), Waldek Hebisch wrote:

I think that the transaction center wants to keep more information
than a single copy of the data in RAM: with a single copy any
memory corruption could mean the loss of hours of transaction data,
which is equivalent to quite a lot of cash. So I suspect that
there are layers of redundancy built in.

They could distribute the load, by spreading it across multiple processing centres. For example, most transactions on a given card are likely to be
with businesses in a particular locality, or with certain large online retailers. The card has a credit limit, but I suspect that is not a “brick wall” limit, so if it takes a few seconds to reconcile multiple
transactions, with the chance that they could add up to something a bit
beyond the credit limit once the totals have been made consistent again, that’s not the end of the world.

