Is Vax addressing sane today?
I am not talking indirect addressing, that is stupid.
It has been determined from trusted sources that add from memory and add
to memory as used in x86 are sane, and not much of a problem.
But Vax allows all three arguments to be in memory with different
pointers.
Is this sane, just a natural progression if you allow memory operands?
Packing three offsets in an instruction that can be decoded reasonably
is a whole other problem…
Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors. Heads and
tails is not that easy, but it’s not x86 difficult.
VMS was the last large operating system written in assembly language
(and Bliss, which is somewhat higher-level, but not much).
"Sanity" isn't an attribute associated with hardware architecture.
Memory-to-memory instructions, in general, are hard to get to run fast
with today's processors and memory, simply because memory access times
are long enough for many register-to-register instructions to execute.
A lot of that time can be hidden with good caches and prefetchers, but if
your memory access patterns are complicated, those speedups can fail to
work.
One reason for memory-to-memory instructions was to avoid the need to
dedicate registers to operands, but that's not much of a problem these
days, since we have space in the CPU for lots of registers and rename
systems for them.
DEC spent a lot of time and money trying to keep VAX competitive and took
too long to accept that was impractical. That was one of the seeds of
their downfall.
On Thu, 05 Sep 2024 21:15:20 GMT, Scott Lurndal wrote:
"Sanity" isn't an attribute associated with hardware architecture.
Says someone posting to a group full of hardware experts ...
Brett <[email protected]> writes:
But Vax allows all three arguments to be in memory with different pointers.
Is this sane, just a natural progression if you allow memory operands?
In combination with supporting unaligned accesses (but excluding
indirect addressing), it means that an instruction can access 6 pages,
and so the TLB (and/or TLB loader) has to be designed to support that. Likewise, the OS has to be designed to load all 6 pages into physical
RAM without evicting one of these pages again. So this kind of
architecture increases the design complexity. And I don't see a
benefit from this design.
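To make the 6-page figure above concrete, here is a minimal sketch in C. It assumes the VAX page size of 512 bytes and 4-byte operands; the three addresses are made up for illustration, and the helper names are hypothetical. It counts data pages only and ignores the pages holding the instruction bytes themselves.

```c
#include <stdint.h>

/* VAX pages are 512 bytes. Operand size and addresses used below are
   illustrative assumptions, not taken from the thread. */
#define PAGE_SIZE 512u

/* Number of distinct pages covered by one access of `size` bytes at `addr`. */
static unsigned pages_touched(uint32_t addr, uint32_t size)
{
    uint32_t first = addr / PAGE_SIZE;
    uint32_t last  = (addr + size - 1) / PAGE_SIZE;
    return (unsigned)(last - first + 1);
}

/* Worst case for a three-operand instruction: every 4-byte operand is
   unaligned and straddles a page boundary, in three unrelated regions. */
static unsigned worst_case_three_operands(void)
{
    return pages_touched(0x000001FEu, 4)   /* source 1: pages 0 and 1     */
         + pages_touched(0x000101FFu, 4)   /* source 2: pages 128 and 129 */
         + pages_touched(0x000201FDu, 4);  /* destination: pages 256, 257 */
}
```

An aligned operand touches one page, so the same instruction with aligned operands needs at most three data pages resident at once.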
Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors.
Would it? Please present empirical data. Certainly people claim that instruction sets with one-memory-address load-and-op and
read-modify-write instructions have better code density, but when you
look at the data, there are load-store instruction sets with better
code density (and by quite a lot). From <[email protected]>:
   bash    grep   gzip  arch      registers  instruction style  width
 595204  107636  46744  armhf     16 regs    load/store         32-bit
 599832  101102  46898  riscv64   32 regs    load/store         64-bit
 796501  144926  57729  amd64     16 regs    ld-op ld-op-st     64-bit
 829776  134784  56868  arm64     32 regs    load/store         64-bit
 853892  152068  61124  i386       8 regs    ld-op ld-op-st     32-bit
 891128  158544  68500  armel     16 regs    load/store         32-bit
 892688  168816  64664  s390x     16 regs    ld-op ld-op-st     64-bit
1020720  170736  71088  mips64el  32 regs    load/store         64-bit
1168104  194900  83332  ppc64el   32 regs    load/store         64-bit
What is "heads and tails encoding"?
- anton
On Thu, 5 Sep 2024 21:03:37 +0000, Brett wrote:
Is Vax addressing sane today?
I am not talking indirect addressing, that is stupid.
It has been determined from trusted sources that add from memory and add
to memory as used in x86 are sane, and not much of a problem.
But Vax allows all three arguments to be in memory with different
pointers.
With modern compiler technology 88% of instructions need only 1 constant--thus VAX provides too many, along with providing address
modes that require sequential decoding.
Most ISAs do not provide "enough" constants, VAX provides too many.
Where "enough" covers::
SLA R9,#1,R17 // this is 1 instruction
DIV R9,#24,R17 // ibid
FDIV R8,#3.14159265358928,R17
Is this sane, just a natural progression if you allow memory operands?
Having watched this in real time:: in 1970 we needed more/better
constants, then PDP-11 came around and we liked it, then at the end
of the decade VAX came along and we loved it, only later recognizing
that it had fallen for the second system syndrome--becoming overly
complicated without benefit--the address space was definitely needed,
the address modes not so much.
Packing three offsets in an instruction that can be decoded reasonably
is a whole other problem…
Realistically, modern compilers have advanced to the point where
anything more than 1 constant per instruction is overkill--
harder to build and delivering no more performance.
Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors. Heads and
tails is not that easy, but it’s not x86 difficult.
Another encoding scheme is segmenting the OpCode into 2 components
1) goes to the function unit to convey the kind of calculation
to be performed,
2) goes to the forwarding logic to convey how to route bits into
calculation.
Some might consider the concatenation of both to be the OpCode
but that obscures what to do with when to do it.
[email protected] (John Dallman) writes:
Memory-to-memory instructions, in general, are hard to get to run fast
with today's processors and memory, simply because memory access times
are long enough for many register-to-register instructions to execute.
Given modern OoO technology, even VAX can fly. It does not matter
whether, say,
*a++ = *b++ + *c++;
is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.
In recent years a number of implementations have 0-cycle store-to-load forwarding, so the misconception that a memory operand is as cheap as
a register operand if only the instruction set has memory operands of
operate instructions is a little bit closer to reality. It is still a misconception, because such an implementation can read and write
several times as many registers per cycle as memory operands.
A lot of that time can be hidden with good caches and prefetchers, but if
your memory access patterns are complicated, those speedups can fail to
work.
Whether operate instructions in an instruction set have 0, 1, or 3
memory operands makes little difference in that case.
One reason for memory-to-memory instructions was to avoid the need to
dedicate registers to operands, but that's not much of a problem these
days, since we have space in the CPU for lots of registers and rename
systems for them.
That may have been a consideration in the NOVA or the 6800, but in
case of the VAX with its 16 registers, that corresponds to a load/store-architecture with 18 registers, so for the VAX this is just
a minor issue.
Some time ago I thought a bit about which kind of architecture to
design with the transistor budget of the 6502, but with the RISC
lessons under the belt. One problem with a big RISC-like register set
is the instruction bandwidth. You really want to stick to 8-bit
instructions if you only have an 8-bit data bus. With a register architecture that means 2-bits for register operands, and that means
you would need a lot of loads and stores in a load/store architecture.
So the narrow instruction word almost forces you to use implicit
register operands or small special-purpose register sets (e.g., 2
accumulators and 4 index registers, as in the 6809) rather than
general-purpose registers.
However, the VAX 11/780 does not have these restrictions. It has a
wider memory bus and it has a cache.
DEC spent a lot of time and money trying to keep VAX competitive and took
too long to accept that was impractical. That was one of the seeds of
their downfall.
Either that, or they failed to stick to VAX for the few more years
until they would have developed an OoO implementation, which would
have leveled the playing field again (see Pentium Pro). The Alpha
came out in 1992, the Pentium Pro in 1995, so if DEC had stuck to the
VAX and managed a timely OoO implementation, they would have needed to survive just 3 years. And it seems that they lost a lot of customers
in the transition from VAX to Alpha.
Of course, the question is if the customers would have stayed with DEC
if they had continued with the VAX. The vibe at the time was that
CISCs are doomed. OTOH, Intel stuck with IA-32 and won with the P6,
and IBM stuck with the S390. But VAX customers are not S390
customers, and maybe they would have defected to Intel even if the VAX
had been there.
From what I read, the VAX 9000 was a big nail in the DEC coffin. In hindsight they should have canceled the project early, but that does
not mean that they could not have continued with VAX (they could even
have competed with the IBM mainframes, which took quite long to gain superscalar and OoO implementations).
- anton
Anton Ertl <[email protected]> wrote:
On Fri, 6 Sep 2024 5:38:01 +0000, Anton Ertl wrote:
Is there source code freely available so these could be compiled
in My 66000 ISA and placed in the list ??
In my opinion, DEC was caught at an ugly time for them. They did not
have the transistor budget for a GBOoO implementation at exactly the
time they also needed a clean transition to 64-bits (even more
transistors). DEC did have the transistors for a medium OoO implementation
but unlikely the 64-bit transition.
Anton Ertl <[email protected]> wrote:
Brett <[email protected]> writes:
But Vax allows all three arguments to be in memory with different pointers.
Is this sane, just a natural progression if you allow memory operands?
In combination with supporting unaligned accesses (but excluding
indirect addressing), it means that an instruction can access 6 pages,
and so the TLB (and/or TLB loader) has to be designed to support that.
Likewise, the OS has to be designed to load all 6 pages into physical
RAM without evicting one of these pages again. So this kind of
architecture increases the design complexity. And I don't see a
benefit from this design.
The memory system is pipelined, once you load the first of the three
values, you do not care if that cache line is evicted while you load the
second.
Caches are 16 way today, one does not worry about cache line evictions, it
just works.
...Heads and tails encoding could actually do this reasonably, and the code
density would actually be better than most competitors.
What is "heads and tails encoding"?
128 bit or larger packets with the fixed size opcodes on the front, and the
variable sized data and offsets packing in from the end. You get variable
length instruction density with easier faster wide decoding. And also using
memory operands gives you another density bonus on top.
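The packet idea described above can be sketched in C. Everything about the layout here is a hypothetical assumption for illustration, not a real ISA and not anything specified in the thread: a 16-byte packet, byte 0 holding the instruction count, one-byte heads (5-bit opcode, 3-bit tail length in bytes) packed from the front, and tails packed from the last byte downward.

```c
#include <stddef.h>
#include <stdint.h>

#define PACKET_BYTES 16  /* 128-bit packet, as in the description above */
#define MAX_INSNS    8   /* hypothetical limit on instructions per packet */

typedef struct {
    uint8_t opcode;       /* upper 5 bits of the head */
    uint8_t tail_len;     /* lower 3 bits of the head: tail bytes, 0..7 */
    const uint8_t *tail;  /* points into the packet; NULL if tail_len == 0 */
} insn_t;

/* Decodes one packet. Returns the number of instructions, or -1 if the
   packet is malformed (too many heads, or tails running into the heads). */
static int decode_packet(const uint8_t pkt[PACKET_BYTES], insn_t out[MAX_INSNS])
{
    int n = pkt[0];
    if (n > MAX_INSNS)
        return -1;

    size_t heads_end = 1 + (size_t)n;  /* first byte after the heads */
    size_t end = PACKET_BYTES;         /* tails grow downward from here */

    for (int i = 0; i < n; i++) {
        uint8_t head = pkt[1 + i];     /* heads sit at fixed positions */
        out[i].opcode = head >> 3;
        out[i].tail_len = head & 7u;
        if (end - heads_end < out[i].tail_len)
            return -1;                 /* tail would overlap the heads */
        end -= out[i].tail_len;
        out[i].tail = out[i].tail_len ? &pkt[end] : NULL;
    }
    return n;
}
```

Because the heads sit at fixed offsets, all opcodes can be extracted without knowing any operand lengths; locating the tails only needs a running sum over the small length fields. That is the property the "easier faster wide decoding" claim relies on, under these assumed field sizes.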
[email protected] (MitchAlsup1) writes:
Is there source code freely available so these could be compiled in My
66000 ISA and placed in the list ??
So look up these packages, and then get the corresponding source
packages.
[email protected] (MitchAlsup1) writes:
In my opinion, DEC was caught at an ugly time for them. They did not
have the transistor budget for a GBOoO implementation at exactly the
time they also needed a clean transition to 64-bits (even more
transistors). DEC did have the transistors for a medium OoO implementation
but unlikely the 64-bit transition.
For the K8 the switch from 32-bit to 64-bit was reported to have cost
5%. You were there. Are the reports wrong?
Sure, there was marketing pressure to deliver 64-bit architectures
early, but I think that a competitive 32-bit OoO VAX in 1996 with an
announcement of a future 64-bit extension would have been fine
wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they
shrank the 21264 to 0.25um in 1999) would have certainly made good
on that promise.
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
Sure, there was marketing pressure to deliver 64-bit architectures
early, but I think that a competitive 32-bit OoO VAX in 1996 with an
announcement of a future 64-bit extension would have been fine
wrt. marketing. And a 0.25um 64-bit VAX in 1999 or 2000 (they
shrank the 21264 to 0.25um in 1999) would have certainly made good
on that promise.
VAX had initially been very successful for the late 1970s and early 1980s
in technical computing, because it was performance-competitive and had a
better operating system than any of the other superminis of the time.
Then multiple RISCs with Unix came along, which were cheaper, had equal
or better performance, and a satisfactory operating system. Those ate
DEC's technical computing market quite fast, but its business IT market
lasted longer.
The technical computing market was /much/ more interested in 64-bit than
the business IT market. When I got involved at a software supplier for
technical computing in 1995, VAX was not performance-competitive and was
on the way out, but Alpha was the fastest thing around until Pentium Pro,
stayed competitive for a couple more years, and didn't die out completely
until 2002 or so.
DEC seem to have concluded in 1988 that they could not keep VAX
performance competitive with the RISCs of the time at a competitive price.
Also, 64-bit ASAP was necessary to retain their part of the technical
computing market and try to win some of it back.
Trying to hold on with VAX, in the hope technology would emerge that
would make it practical, without a clear idea of when or what that would
be is not something that shareholders will tolerate for very long.
Brett <[email protected]> writes:
I did not write about caches, but yes, for TLBs a (the?) solution is
to have the ITLB to be at least 6-way.
It's unclear how pipelining should help. The VAX 11/780 was not much
pipelined and can also do the memory accesses one after the other;
this did not protect it from the complexity coming from x memory
accesses in a single instruction. E.g., all the pages accessed by an
instruction have to be in physical memory, or maybe support
interruptable instructions; in any case, there is complexity.
Given modern OoO technology, even VAX can fly. It does not matter
whether, say,
*a++ = *b++ + *c++;
is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.
MOVC3/MOVC5 were interruptable ...
Given the arguments were in registers ...
[email protected] (John Dallman) writes:
Memory-to-memory instructions, in general, are hard to get to run fast
with today's processors and memory, simply because memory access times
are long enough for many register-to-register instructions to execute.
Given modern OoO technology, even VAX can fly. It does not matter
whether, say,
*a++ = *b++ + *c++;
is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.
Getting back to the originating:: It is faster these days to write::
a[i] = b[i] + c[i];i++;
than the pre/post increment/decrement style of PDP-11.
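For reference, the two styles being compared can be written out in C as below. This is only a sketch with hypothetical function names. Both loops compute the same result; the indexed form updates one induction variable per iteration where the pointer form updates three, which is one plausible reading of the claim. Which one is actually faster depends on the compiler and the target.

```c
#include <stddef.h>

/* PDP-11/VAX style: three pointers, each bumped separately per element. */
static void add_ptr(int *a, const int *b, const int *c, size_t n)
{
    while (n--)
        *a++ = *b++ + *c++;
}

/* Indexed style: a single induction variable shared by all three accesses. */
static void add_idx(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```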
According to Anton Ertl <[email protected]>:
Given modern OoO technology, even VAX can fly. It does not matter
whether, say,
*a++ = *b++ + *c++;
is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.
It is my impression that unwinding all the side effects if the
reference to "c" causes a page fault was painful. Particularly
keeping in mind that b and c could be the same register, and if the
code were this:
*a++ = *b++ - *b++
the order of increments and fetches matters.
It is my impression that even when the Vax was designed, it was
already becoming evident that the Vax's super dense super encoded
instruction set was not going to be a long term winner. The IBM 801
project was well along in 1975 when they started designing the Vax.
John Levine <[email protected]> writes:
According to Anton Ertl <[email protected]>:
Given modern OoO technology, even VAX can fly. It does not matter
whether, say,
*a++ = *b++ + *c++;
is encoded as 1 VAX instruction, or as 4 ARM A64 instructions, or as 7
RISC-V instructions, what goes on inside the OoO engine is pretty
similar in all cases, and so is the performance.
It is my impression that unwinding all the side effects if the
reference to "c" causes a page fault was painful.
Yes, that was certainly a problem when using the implementation
techniques of the day. With an OoO implementation, if any of the
operations of the instruction causes an exception, none of the
results of any of the operations are committed. Problem solved.
Or almost: I expect that it's more complex to implement a reorder
buffer that deals with such monster instructions than one that deals
just with RISC-V instructions.
Particularly keeping in mind that b and c could be the same register, and if the
code were this:
*a++ = *b++ - *b++
the order of increments and fetches matters.
Yes, but the decoder produces operations as defined by the
architecture. I don't know how VAX specifies the order, but a simple translation could be
# at the start, b is in p1, and a is in p6
p0 = *p1 #*b
p2 = p1+4 #b++
p3 = *p2 #*b
p4 = p2+4 #b++
p5 = p0-p3 #*b - *b
*p6= p5 #*a = ...
p7 = p6+4 #a++
#at the end, b is in p4 and a is in p7
where p0..p7 are physical registers. If there is an exception in any
of the operations, b stays in p1 and a stays in p6.
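The all-or-nothing behaviour described above can be modelled in C. This is a toy sketch, not how any real reorder buffer is built: the `machine_t` type, the `mapped` array standing in for page faults, and the use of word indexes (so "+1" replaces "+4") are all assumptions made for illustration. The operand order (first fetch minus second fetch) is also an assumption, since the thread does not say how VAX specifies it.

```c
#include <stdbool.h>
#include <stdint.h>

#define MEM_WORDS 16

typedef struct {
    uint32_t mem[MEM_WORDS];
    bool     mapped[MEM_WORDS];  /* false models a page fault on access */
    uint32_t a, b;               /* architectural pointers, as word indexes */
} machine_t;

static bool accessible(const machine_t *m, uint32_t idx)
{
    return idx < MEM_WORDS && m->mapped[idx];
}

/* Executes  *a++ = *b++ - *b++  using temporaries named like the physical
   registers p0..p7 above. Returns false and leaves the machine untouched
   if any access faults; architectural state changes only at the commit. */
static bool exec_sub_autoinc(machine_t *m)
{
    uint32_t p1 = m->b, p6 = m->a;

    if (!accessible(m, p1)) return false;
    uint32_t p0 = m->mem[p1];            /* *b  */
    uint32_t p2 = p1 + 1;                /* b++ */
    if (!accessible(m, p2)) return false;
    uint32_t p3 = m->mem[p2];            /* *b  */
    uint32_t p4 = p2 + 1;                /* b++ */
    uint32_t p5 = p0 - p3;               /* unsigned, so wraparound is defined */
    if (!accessible(m, p6)) return false;
    uint32_t p7 = p6 + 1;                /* a++ */

    /* Commit point: nothing above modified the machine. */
    m->mem[p6] = p5;
    m->b = p4;
    m->a = p7;
    return true;
}
```

After a failed call the instruction can simply be retried once the fault is handled, because a and b still hold their original values.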
It is my impression that even when the Vax was designed, it was
already becoming evident that the Vax's super dense super encoded
instruction set was not going to be a long term winner. The IBM 801
project was well along in 1975 when they started designing the Vax.
The question is how much was known about the IBM 801 at the time.
According to <https://en.wikipedia.org/wiki/OpenVMS>, the VAX project
started in April 1975. Data General's Fountainhead project (FHP)
started in July 1975. Intel started the iAPX 432 in 1975 or 1976,
Zilog started the Z8000 after recruiting Bernard Peuto in March 1976 <https://thechipletter.substack.com/p/captain-zilog-crushed-the-story-of>. Motorola started the 68000 project in late 1976, and National
Semiconductor obviously knew about the VAX when they designed the
32016 (they originally wanted to implement the VAX instruction set,
but in the end did something incompatible for legal reasons). All
these projects used CISCy designs rather than RISCy designs. FHP was
a bit special in making the writable control store an architectural
feature (so it did not have just one instruction set); the thinking
behind it is the "closing the semantic gap" idea that gave us
architectures like the VAX.
The first commercial RISCs were delivered in 1986 (including from IBM itself). Apparently the industry took that long to absorb the ideas
from the IBM 801 and turn them into a commercial product.
It would be interesting to take a time machine to, say, 1976, to go to
any of these companies and try to convince them to do a RISCy CPU.
How hard would it be to convince them? Would technical arguments be sufficient, or would one have to wave with money (as a customer or
investor)? And how would such a CPU do in the marketplace?
We are post RISC now and adding complexity that gets more work done per operation, with less tracking. Three sources and two destinations will
be the rule. Load with address update, add with shift, three way add
with logical operations is next. The FPU already has MAC.
Brett <[email protected]> wrote:
[VAX]
They had no idea that they would be building eight wide designs.
This is the critical idea that made RISC popular.
Nope.
The early RISC designs aimed for one instruction per cycle, achieved
maybe 0.7.
The problem with VAX was NOT that one could not put a lot of work in a
single instruction;
no,
The problem with VAX is that it made putting too much work in a single instruction easy.
On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:
The problem with VAX was NOT that one could not put a lot of work in a
single instruction;
no,
The problem with VAX is that it made putting too much work in a single
instruction easy.
Perhaps there is also the issue of the wildly-variable instruction
length.
A single VAX operand descriptor could be up to 6 bytes; I think the instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.
While the shortest instruction could be just 1 byte.
Even those who are talking about “post-RISC” are, I think, still in favour of RISC-style fixed instruction lengths.
Thomas Koenig <[email protected]> wrote:
The early RISC designs aimed for one instruction per cycle, achieved
maybe 0.7.
The next step up for a CPU has one ALU and one load/store unit, giving
above one IPC. This is what one of the PlayStation CPUs did.
On Mon, 9 Sep 2024 04:38:42 -0000 (UTC), Brett wrote:
The next step up for a CPU has one ALU and one load/store unit, giving
above one IPC. This is what one of the PlayStation CPUs did.
Those were the ones using PowerPC chips in the 1990s, I think it was.
IBM’s POWER claimed superscalar performance right from its launch in, what
was it, 1989.
Perhaps there is also the issue of the wildly-variable instruction length.
A single VAX operand descriptor could be up to 6 bytes; I think the
instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.
Even those who are talking about “post-RISC” are, I think, still in favour
of RISC-style fixed instruction lengths.
Interestingly, despite their ample experience with T32, ARM went
fixed-length with A64, but then the market for A64 is probably not as code-size sensitive as that for T32.
- anton
... ARM T32 has two widths, as has RV64GC (and RISC-V has provisions for additional lengths, but AFAIK nobody uses them yet); there was also ROMP
and MIPS16.
ARMv9-M is still T32 which probably should tell us something.
Michael S <[email protected]> writes:
ARMv9-M is still T32 which probably should tell us something.
It tells me that ARM sees a market (covered by the M profile) where
4GB of address space is sufficient and where code size is relevant.
- anton
On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:
On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:
The problem with VAX was NOT that one could not put a lot of work in a
single instruction;
no,
The problem with VAX is that it made putting too much work in a single
instruction easy.
Perhaps there is also the issue of the wildly-variable instruction
length.
A single VAX operand descriptor could be up to 6 bytes; I think the
instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.
I have not heard an argument that the complex things in VAX ISA are
a) desirable
b) performance helpful
I (sort of) think VAX ISA as a grown up PDP-11, ignoring all the
dastardly complicated instructions it inflicted upon itself. AND
it did inflict those things upon itself.
Restricting a new-VAX-like ISA to 1-2-3 Operand and 1-result with
at most 1 exception would result in a MUCH cleaner and easier to
build machine.
While the shortest instruction could be just 1 byte.
Even those who are talking about “post-RISC” are, I think, still in
favour of RISC-style fixed instruction lengths.
I, for the record, am in favor of a fixed-length instruction-specifier
followed by constants; the entirety is the instruction, while the
former minimizes your ability to shoot yourself in the foot.
Lawrence D'Oliveiro <[email protected]d> writes:
Perhaps there is also the issue of the wildly-variable instruction length.
A single VAX operand descriptor could be up to 6 bytes; I think the
instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.
The regularity of the VAX operand formats may actually help build the
decoder: Decode your byte stream as possible operands, and then let
the instruction decoder pick the real operands from the potential
operands.
MitchAlsup1 <[email protected]> wrote:
On Sun, 8 Sep 2024 21:09:39 +0000, Lawrence D'Oliveiro wrote:
On Sun, 8 Sep 2024 17:56:55 +0000, MitchAlsup1 wrote:
The problem with VAX was NOT that one could not put a lot of work in a
single instruction;
no,
The problem with VAX is that it made putting too much work in a single
instruction easy.
Perhaps there is also the issue of the wildly-variable instruction
length.
A single VAX operand descriptor could be up to 6 bytes; I think the
instruction with the most general-format operands could have 6 of them:
so, plus opcode, such an instruction could be 37 bytes long.
I have not heard an argument that the complex things in VAX ISA are
a) desirable
b) performance helpful
Speaking of complex things, have you looked at Swift output, as it
checks all operations for overflow?
You could add an exception type for that, saving huge numbers of
correctly predicted branch instructions.
The future of programming languages is type safe with checks, you need
to get on that bandwagon early.
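As a source-level illustration of what "checks all operations for overflow" costs without hardware help, here is a minimal sketch in C. It is not Swift's actual output, which the thread does not show; the function name is hypothetical. The point is only that each checked add becomes a compare-and-branch that is almost never taken, which is the work an overflow-trapping instruction would absorb.

```c
#include <limits.h>
#include <stdbool.h>

/* Checked signed add written portably. Returns false instead of adding
   when x + y would overflow; the test is an ordinary compare-and-branch
   performed before the addition, so no undefined behaviour occurs. */
static bool checked_add(int x, int y, int *sum)
{
    if ((y > 0 && x > INT_MAX - y) || (y < 0 && x < INT_MIN - y))
        return false;   /* would overflow */
    *sum = x + y;
    return true;
}
```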
On Mon, 9 Sep 2024 19:38:52 +0000, Brett wrote:
The future of programming languages is type safe with checks, you need
to get on that bandwagon early.
This would/will happen faster when type-safe with checks are well
represented in benchmarks used to measure various architectural
things, and the exceptions and checks are actually utilized showing performance degradation of lesser endowed architectures.
According to Anton Ertl <[email protected]>:
The regularity of the VAX operand formats may actually help build the
decoder: Decode your byte stream as possible operands, and then let
the instruction decoder pick the real operands from the potential
operands.
Urrgh. Some of those bogus operands are indirect indexed auto-increment, so you
are going to be throwing away a whole lot of work.
Compare that to zSeries, where even after 50 years of sticking new instructions
into the holes in the S/360 instruction set, it can still tell the length of the
instruction from the first two bits and the operands from the first byte.
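The length rule mentioned above is small enough to write out. In the S/360 family the top two bits of the first opcode byte give the instruction length: 00 means 2 bytes, 01 and 10 mean 4 bytes, 11 means 6 bytes. The function name below is hypothetical; the sketch covers only the length, not operand decoding.

```c
#include <stdint.h>

/* Instruction length in bytes from the two most significant bits of the
   first byte: 00 -> 2, 01 or 10 -> 4, 11 -> 6. */
static unsigned s360_insn_length(uint8_t first_byte)
{
    switch (first_byte >> 6) {
    case 0:  return 2;
    case 3:  return 6;
    default: return 4;
    }
}
```

For example, 0x1A (AR, register-to-register) decodes as 2 bytes, 0x5A (A, register-and-storage) as 4 bytes, and 0xD2 (MVC, storage-to-storage) as 6 bytes.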
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it
checks all operations for overflow?
You could add an exception type for that, saving huge numbers of
correctly predicted branch instructions.
The future of programming languages is type safe with checks, you
need to get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic).
It has been abandoned and replaced by RISC-V several years ago.
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
RISC-V, another descendent of MIPS, has an add instruction that
corresponds to MIPS' addu, and no instruction that corresponds to
MIPS' add. They obviously don't think that there's a bandwagon. Note
that RISC-V was designed after Swift was introduced.
IA-32 got on that bandwagon early. It has a single-byte instruction
trapv that traps if the overflow flag is set. The AMD64 instruction
set is very similar to the IA-32 instruction set, but one of the few differences is that the trapv instruction was eliminated, and the
encoding replaced with a REX prefix. The AMD64 architects obviously
don't think that there is a bandwagon.
Apple has been designing their own silicon for a while, and they have introduced Swift as their language in 2010. Yet they have not
switched to an architecture like MIPS or Alpha, nor have they designed
their own architecture or architecture extension that includes
instructions like Alpha's addv or IA-32's trapv. Instead, they
switched to ARM A64, which does not have such features, after
introducing Swift in 2010. They obviously don't think that there is
such a bandwagon, either.
- anton
John Levine <[email protected]> writes:
Compare that to zSeries, where even after 50 years of sticking new
instructions into the holes in the S/360 instruction set, it can
still tell the length of the instruction from the first two bits and
the operands from the first byte.
Good for sequential decoding, and maybe it makes parallel decoding
cheaper (but OTOH, the first superscalar S/360 descendent came out in
2000, 7 years after the superscalar Pentium, and the first OoO S/360
descendent lagged the Pentium Pro by 14 years or so), but as the IIRC
6-wide decoder of Alder Lake demonstrates, hardware designers are able
to deal with instruction sets that do not have such nice properties:
an AMD64 instruction can have a large number of prefixes, and I think
that the encoding of indexed addressing is not announced in the first
non-prefix instruction byte, either.
- anton
On Tue, 10 Sep 2024 08:05:07 GMT
[email protected] (Anton Ertl) wrote:
Good for sequential decoding, and maybe it makes parallel decoding
cheaper (but OTOH, the first superscalar S/360 descendent came out in
2000, 7 years after the superscalar Pentium
and the first OoO S/360
descendent lagged the Pentium Pro by 14 years or so),
Wikipedia says that the ES/9000 Model 900 had a superscalar OoO CPU in 1991.
This line was abandoned in favor of the simpler 'CMOS' line in the mid-90s, but
according to the same Wiki article, the CMOS line didn't match the Model 900
in performance until the 9672-RY5 near the end of 1997.
1. The longest AMD64 instruction is much shorter than the longest VAX
instruction.
2. On AMD64 instruction length information is continuous. Yes, there
could be multiple prefixes and it makes things ugly, but I would think
that in practice you very rarely need to look at more than 5 leading
bytes in order to figure out the length of the tail. And in practice
it's probably o.k. when instructions with more than 3 prefixes are decoded
slowly.
On Tue, 10 Sep 2024 07:43:53 GMT
[email protected] (Anton Ertl) wrote:
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it
checks all operations for overflow?
You could add an exception type for that, saving huge numbers of
correctly predicted branch instructions.
The future of programming languages is type safe with checks, you
need to get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic).
Trapping variants were deprecated in Release 6 of MIPS ISA.
It has been abandoned and replaced by RISC-V several years ago.
I don't think that "replaced by RISC-V" is a correct description of
proceedings.
How does Intel MPX fit in your picture?
And it seems to me that Swift with its trapping arithmetic is a
blast from the past
(with Algol, Pascal etc. usually erroring out on
overflow, and Ada raising an exception (with famously explosive
consequences for the Ariane 5)),
and that the trend in safe languages is to eliminate integer overflow
by allowing arbitrarily large integers.
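To make the three behaviours discussed here concrete, a small Python sketch (an illustration, not anything from the posters): a trapping add, a modulo add, and an add that simply grows the integer. The 64-bit width and the function names are assumptions chosen for the example.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def trapping_add(a, b):
    """Add and raise on signed 64-bit overflow (the Swift/Ada-style behaviour)."""
    result = a + b
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("signed 64-bit overflow")
    return result


def wrapping_add(a, b):
    """Add modulo 2**64 and reinterpret as signed (the addu-style behaviour)."""
    result = (a + b) & 0xFFFFFFFFFFFFFFFF
    return result - 2 ** 64 if result > INT64_MAX else result


def bignum_add(a, b):
    """Add with arbitrarily large integers; Python ints never overflow."""
    return a + b
```

With `a = INT64_MAX` and `b = 1`, the first raises, the second wraps to `INT64_MIN`, and the third returns `2 ** 63`.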
It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.
[MIPS] has been abandoned and replaced by RISC-V several years ago.
Not all the type-safe, checking languages are equal in that respect. In
some languages, and I am thinking of Ada, the language design and the
favored programming styles work to reduce the number of run-time checks required.
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
It appears that Anton Ertl <[email protected]> said:
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.
John Levine <[email protected]> schrieb:
S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.
With trapping, I understand. Without trapping - what is the
difference on a two's complement machine?
Michael S <[email protected]> writes:
How does Intel MPX fit in your picture?
I don't know anything about MPX beyond what Wikipedia says, which
includes:
|In practice, there have been too many flaws discovered in the design
|for it to be useful, and support has been deprecated or removed from
|most compilers and operating systems.
Maybe a less flawed concept would have been more successful, but
apparently MPX has had no such successor.
Overall, languages that perform bounds checking seem on the rise,
unlike languages that trap on signed integer overflow, so the window
of opportunity for architectural support gets bigger.
However, the question is if there is architectural support that is significantly better than what can be done with the current
architectural features. SPARC has architectural tagging support for
LISP, yet a comp.arch poster who worked on a major LISP implementation
(Franz LISP IIRC) reported that their LISP implementation does not use
these instructions.
- anton
On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:
[MIPS] has been abandoned and replaced by RISC-V several years ago.
I’m not so sure the MIPS architecture has been “abandoned”. Last I heard, it was still shipping hundreds of millions of chips per year.
Also those Chinese supers run LoongArch, which is some sort of MIPS derivative.
It is true that there is no more money to be made from licensing any
“MIPS IP”, which is why Imagination Tech, the inheritors of whatever
was left of MIPS the commercial operation, have switched to being a RISC-V-centric company now.
On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:
[MIPS] has been abandoned and replaced by RISC-V several years ago.
I’m not so sure the MIPS architecture has been “abandoned”. Last I
heard, it was still shipping hundreds of millions of chips per year.
Care to point to the source of this claim? Two main suppliers of MIPS
silicon in this century are Microchip and Cavium (now owned by Marvell).
According to my understanding Microchip's MIPS-based PIC32 line was
never as popular as their other offerings.
Sort of.
And majority of my FPGA designs run Nios2 soft cores that are also
'sort of MIPS'. But they are *not* MIPS.
On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:
It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.
Mainframes were never about CPU performance.
They were about high I/O
throughput for efficient batch operations.
Thomas Koenig <[email protected]> writes:
John Levine <[email protected]> schrieb:
S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.
With trapping, I understand. Without trapping - what is the
difference on a two's complement machine?
Possibly in the flags set. The S/360 has a pretty perverse flags architecture.
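A generic illustration of the question above, in Python: on a two's complement machine a signed and an unsigned add produce the same result bits, and only the derived conditions (carry out versus signed overflow) differ. This is a sketch under that assumption; it does not model the S/360 condition code, and `add32` is a hypothetical helper.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000


def add32(a, b):
    """Add two 32-bit patterns; return (result_bits, carry_out, signed_overflow)."""
    a &= MASK32
    b &= MASK32
    total = a + b
    result = total & MASK32
    carry_out = total > MASK32
    # Signed overflow: both operands have the same sign and the result's sign differs.
    signed_overflow = ((a ^ result) & (b ^ result) & SIGN32) != 0
    return result, carry_out, signed_overflow
```

For example, `add32(0x7FFFFFFF, 1)` overflows as a signed add but produces no carry, while `add32(0xFFFFFFFF, 1)` carries out but does not overflow as a signed add; the result bits in each case are the same whichever interpretation is used.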
John Levine <[email protected]> schrieb:
It appears that Anton Ertl <[email protected]> said:
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
S/360 had signed and unsigned adds in the 1960s, with optional
trapping for signed overflow. OS/360 let you catch the traps and z
still does but it is not my impression that many programs did or do.
With trapping, I understand. Without trapping - what is the
difference on a two's complement machine?
Lawrence D'Oliveiro <[email protected]d> writes:
On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:
It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.
Mainframes were never about CPU performance.
The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000 machines if CPU performance was considered unimportant at the time.
It's an interesting question why they did not follow up their
superscalar OoO ECL implementations with a superscalar OoO CMOS implementation in addition to the scalar in-order 9672. Here are
three speculations of what happened:
1) They had such a project and it did not work out, and the "never
about CPU performance" spin is a sour-grapes type rationalization of
the result.
2) They expected their mainframe market to be eaten up by the Unix
and/or WNT markets, and did not want to invest a lot into the
development of mainframe CPUs. Again, the "never about CPU
performance" spin is a sour-grapes type rationalization of the result.
3) They had decided that they had a captive market in the mainframes,
with software that was written for lower-powered CPUs, that the rapid
CMOS advances in the 1990s would give them enough of a performance
push to satisfy the needs of this software, so no more sophisticated
CPU designs than the 9672 were necessary (and the G5 and G6 of the 9672
indeed gave them more CPU power than ever). The "never about CPU performance" reflected their position at the time and also served to
placate anyone who pointed out that the per-CPU performance was
inferior to that of other CPUs of the time, including IBM's own
RS/6000 line.
Eventually they seem to have decided that per-CPU performance is
important after all, with the superscalar z990 in 2003 and the OoO
z196 in 2010. But of course Dennard scaling was slowing down around
2003, so they needed to increase IPC to increase per-CPU performance.
And even if they don't need more per-CPU performance than other architectures, they apparently do need advances over earlier
generations of their own machines and maybe to discourage competition
from emulators or startups.
They were about high I/O
throughput for efficient batch operations.
Batch operations? I wonder how much CPU time on mainframes in the
1990s and today is spent on that compared to interactive applications
such as online transaction processing.
- anton
On 11/09/2024 10:54, Michael S wrote:
On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:
[MIPS] has been abandoned and replaced by RISC-V several years ago.
I’m not so sure the MIPS architecture has been “abandoned”. Last I
heard, it was still shipping hundreds of millions of chips per year.
Care to point to the source of this claim? Two main suppliers of MIPS
silicon in this century are Microchip and Cavium (now owned by Marvell).
According to my understanding Microchip's MIPS-based PIC32 line was
never as popular as their other offerings.
IMHO a major reason for that is Microchip's insane licensing policy for
their development tools - although their compilers are just a minor modification of standard gcc, you have to pay huge amounts if you want
to use the full features of the compiler. (At least now you can enable /some/ optimisation without a paid license.) It is not even possible to
see from the release notes or documentation what version of gcc is
provided, though my guess is that it is pretty old (the documentation describes "-std" options up to C++14).
IBM definitely cared about maximum performance in the 1950s and early 1960s.
Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.
they knew they had a problem. The /95 and /195 were minor upgrades of the /91 but that was the end of their supercomputer efforts.
Mostly true, except for the 3090 vector facility.
instruction, and the program doesn't notice. I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.
Yes, but they also have to keep producing faster and faster CPUs so they
can entice current customers to upgrade and thus meet their revenue goals.
According to Anton Ertl <[email protected]>:
Lawrence D'Oliveiro <[email protected]d> writes:
On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:
It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.
Mainframes were never about CPU performance.
The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000
machines if CPU performance was considered unimportant at the time.
It's an interesting question why they did not follow up their
superscalar OoO ECL implementations with a superscalar OoO CMOS
implementation in addition to the scalar in-order 9672. ...
IBM definitely cared about maximum performance in the 1950s and early
1960s.
The goal of STRETCH was specifically to make the fastest possible computer. It sort of
succeeded, late and over budget and not as fast as they hoped, but still the fastest
computer in the world for a while. It was a success in that they reused a lot of the
technology like the fast core memory in later computers.
The 360/91 was also intended to be the fastest possible computer, which again it sort of
was, late and over budget. One thing that STRETCH and the /91 shared was that they were
extremely complicated. STRETCH had variable sized bytes and addressing modes that I
never entirely figured out. The /91 had an instruction queue with loop mode and out of
order operations and register renaming and imprecise interrupts. When the CDC 6600 came
out, a much simpler design from a tiny company that was nonetheless faster than the /91,
they knew they had a problem. The /95 and /195 were minor upgrades of the /91 but that was
the end of their supercomputer efforts.
The point of a mainframe is balanced performance. The CPU of a 360/30 was extremely slow
but it was fast enough to drive a disk or two and a printer and card read/punch and get a
lot of useful work done. Mainframes have had channels since the 709 in the late 1950s so
they have a lot of I/O capacity. Modern ones have terabytes of RAM and exabytes of disk.
They also care deeply about reliability. Modern mainframes have multiple kinds of error
checking and standby CPUs that can take over from a failed CPU, restart a failed
instruction, and the program doesn't notice. I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.
David Brown <[email protected]> schrieb:
On 11/09/2024 10:54, Michael S wrote:
On Tue, 10 Sep 2024 23:51:20 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Tue, 10 Sep 2024 07:43:53 GMT, Anton Ertl wrote:
[MIPS] has been abandoned and replaced by RISC-V several years ago.
I’m not so sure the MIPS architecture has been “abandoned”. Last I
heard, it was still shipping hundreds of millions of chips per year.
Care to point to the source of this claim? Two main suppliers of MIPS
silicon in this century are Microchip and Cavium (now owned by Marvell).
According to my understanding Microchip's MIPS-based PIC32 line was
never as popular as their other offerings.
IMHO a major reason for that is Microchip's insane licensing policy for
their development tools - although their compilers are just a minor
modification of standard gcc, you have to pay huge amounts if you want
to use the full features of the compiler. (At least now you can enable
/some/ optimisation without a paid license.) It is not even possible to
see from the release notes or documentation what version of gcc is
provided, though my guess is that it is pretty old (the documentation
describes "-std" options up to C++14).
Sounds like a violation of the GPL. Do they provide the sources?
According to Stephen Fuld <[email protected]d>:
IBM definitely cared about maximum performance in the 1950s and early 1960s.
Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.
I don't think anyone would have foreseen how quickly scientific computing moved to mini and micro computers with fast CPUs and weak peripherals.
Perhaps once the RAM is big enough to hold all the data the I/O performance is not a big deal.
they knew they had a problem. The /95 and /195 were minor upgrades of the /91 but that was the end of their supercomputer efforts.
Mostly true, except for the 3090 vector facility.
I suppose. A review from the USDOE said:
The IBM 3090 with Vector Facility is an extremely interesting machine
because it combines very good scaler performance with enhanced vector
and multitasking performance. For many IBM installations with a large
scientific workload, the 3090/vector/MTF combination may be an ideal
means of increasing throughput at minimum cost. However, neither the
vector nor multitasking capabilities are sufficiently developed to
make the 3090 competitive with our current worker machines for our
large-scale scientific codes.
https://www.osti.gov/biblio/5039931
instruction, and the program doesn't notice. I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.
Yes, but they also have to keep producing faster and faster CPUs so they
can entice current customers to upgrade and thus meet their revenue goals.
The memories and disks keep getting bigger so it's not totally silly to
think that the CPUs need to get faster, too.
Lawrence D'Oliveiro <[email protected]d> writes:
They were about high I/O
throughput for efficient batch operations.
Batch operations? I wonder how much CPU time on mainframes in the
1990s and today is spent on that compared to interactive applications
such as online transaction processing.
Lawrence D'Oliveiro <[email protected]d> writes:
On Tue, 10 Sep 2024 16:32:05 GMT, Anton Ertl wrote:
It seems that during the late 1990s, IBM was not particularly interested
in mainframe per-CPU performance.
Mainframes were never about CPU performance.
The S/360 Model 91 and the Model 195 certainly were about the maximum
CPU performance. And I doubt that IBM would have spent all the effort
with ECL and a superscalar OoO implementation for some of the ES/9000 machines if CPU performance was considered unimportant at the time.
According to Anton Ertl <[email protected]>:
Lawrence D'Oliveiro <[email protected]d> writes:
The 360/91 was also intended to be the fastest possible computer, which
again it sort of
was, late and over budget. One thing that STRETCH and the /91 shared was
that they were
extremely complicated. STRETCH had variable sized bytes and
addressing modes that I
never entirely figured out. The /91 had an instruction queue with loop
mode and out of
order operations and register renaming and imprecise interrupts. When
the CDC 6600 came
out, a much simpler design from a tiny company that was nonetheless
faster than the /91,
they knew they had a problem. The /95 and /195 were minor upgrades of
the /91 but that was
the end of their supercomputer efforts.
John Levine <[email protected]> schrieb:
According to Stephen Fuld <[email protected]d>:
IBM definitely cared about maximum performance in the 1950s and early
1960s.
Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.
I don't think anyone would have foreseen how quickly scientific computing
moved to mini and micro computers with fast CPUs and weak peripherals.
Perhaps once the RAM is big enough to hold all the data the I/O performance
is not a big deal.
they knew they had a problem. The /95 and /195 were minor upgrades of
the /91 but that was the end of their supercomputer efforts.
Don't forget the ACS.
Looking at (if that is to be believed)
https://people.computing.clemson.edu/~mark/acs_performance.html
this seems to have been quite an amazing machine for its time,
with projected 160 MFlops and around five concurrent instructions
using OoO.
I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.
On 9/11/2024 9:21 AM, John Levine wrote:
I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.
So do these systems not require security patches?
Or do they apply PTFs to the running system? (reliably?)
Perhaps it would have been better stated as being about balanced
performance (CPU and I/O) for business applications, which at the time
were primarily batch, but have migrated over time to transactions, but
which still are more I/O bound than scientific workloads.
Then there is the issue of cheap PCs that fail, while mainframes have a higher level of redundancy and failover. Failed business transactions
can cost millions, more than the machine is worth, so saving pennies on hardware is stupid.
91 was Current-Mode-Logic CML. ...
CML had all of the speed and all of the electrical and all of the heat problems ECL had.
They also care deeply about reliability. Modern mainframes have multiple kinds of error checking and standby CPUs that can take over from a
failed CPU, restart a failed instruction, and the program doesn't
notice.
They don't just update the software, they swap out entire hardware
subsystems while the overall system keeps running.
According to Stephen Fuld <[email protected]d>:
IBM definitely cared about maximum performance in the 1950s and early 1960s.
Yes. And remember, one of the goals of S/360 was to provide an
architecture that could handle both scientific (i.e. compute bound) and
business (i.e. I/O bound) workloads.
I don't think anyone would have foreseen how quickly scientific computing moved to mini and micro computers with fast CPUs and weak peripherals. Perhaps once the RAM is big enough to hold all the data the I/O performance is not a big deal.
On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:
They don't just update the software, they swap out entire hardware
subsystems while the overall system keeps running.
Xen Orchestra (open-source) can do that on commodity PC hardware.
On Wed, 11 Sep 2024 16:21:19 -0000 (UTC), John Levine wrote:
They also care deeply about reliability. Modern mainframes have multiple
kinds of error checking and standby CPUs that can take over from a
failed CPU, restart a failed instruction, and the program doesn't
notice.
This “mainframe reliability” seems to be a persistent myth.
The other reason, of course, was the name - "PIC" is associated with brain-dead microcontrollers with terrible C tools and which many people program in assembly. They are also renowned for being very solid,
coming in relatively amateur-friendly packages, and for never going out
of production.
On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:
Then there is the issue of cheap PCs that fail, while mainframes have a
higher level of redundancy and failover. Failed business transactions
can cost millions, more than the machine is worth, so saving pennies on
hardware is stupid.
You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.
Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because double booking or giving away free tickets would be really bad.
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:
Then there is the issue of cheap PC’s that fail, and a mainframes have a
higher level of redundancy and failover. Failed business transactions
can cost millions, more than the machine is worth, so saving pennies on
hardware is stupid.
You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.
That's fine for workloads that work that way.
Airline reservation systems historically ran on mainframes because when they were invented
that's all there was (original SABRE ran on two 7090s) and they are business critical so
they need to be very reliable.
About 30 years ago some guys at MIT realized that route and fare search, which are some of
the most demanding things that CRS do, are easy to parallelize and don't have to be
particularly reliable -- if your search system crashes and restarts and reruns the search
and the result is a couple of seconds late, that's OK. So they started ITA software which
used racks of PC servers running parallel applications written in Lisp (they were from
MIT) and blew away the competition.
However, that's just the search part. Actually booking the seats and selling tickets stays
on a mainframe or an Oracle system because double booking or giving away free tickets would
be really bad.
There's also a rule of thumb about databases that says one system of performance 100 is
much better than 100 systems of performance 1 because those 100 systems will spend all
their time contending for database locks.
Lawrence D'Oliveiro <[email protected]d> writes:
On Thu, 12 Sep 2024 02:05:14 -0000 (UTC), John Levine wrote:
They don't just update the software, they swap out entire hardware
subsystems while the overall system keeps running.
Xen Orchestra (open-source) can do that on commodity PC hardware.
The 3leaf hypervisor supported hot-plug memory, hot-plug CPU, and
hot-plug PCI 15 years ago with commodity Linux guests.
According to Lawrence D'Oliveiro <[email protected]d>:
You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.
That's fine for workloads that work that way.
Airline reservation systems historically ran on mainframes because
when they were invented that's all there was (original SABRE ran on
two 7090s) and they are business critical so they need to be very
reliable.
About 30 years ago some guys at MIT realized that route and fare
search, which are some of the most demanding things that CRS do, are
easy to parallelize and don't have to be particularly reliable -- if
your search system crashes and restarts and reruns the search and the
result is a couple of seconds late, that's OK. So they started ITA
software which used racks of PC servers running parallel applications
written in Lisp (they were from MIT) and blew away the competition.
However, that's just the search part. Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because
double booking or giving away free tickets would be really bad.
There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.
Novell's System Fault Tolerant NetWare 386 (around 1990) supported two complete servers acting like one, so that any hardware component could
fail and the system would keep running, with nothing noticed by the
clients, even those that were in the middle of an update/write request.
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 11 Sep 2024 16:39:23 -0000 (UTC), Brett wrote:
Then there is the issue of cheap PC’s that fail, and a mainframes
have a higher level of redundancy and failover. Failed business
transactions can cost millions, more than the machine is worth, so
saving pennies on hardware is stupid.
You solve that by having multiple units of the cheap machines to
achieve the same level of redundancy, or even more. That ends up
being more cost-effective than the mainframe.
That's fine for workloads that work that way.
Airline reservation systems historically ran on mainframes because
when they were invented that's all there was (original SABRE ran on
two 7090s) and they are business critical so they need to be very
reliable.
About 30 years ago some guys at MIT realized that route and fare
search, which are some of the most demanding things that CRS do, are
easy to parallelize and don't have to be particularly reliable -- if
your search system crashes and restarts and reruns the search and the
result is a couple of seconds late, that's OK. So they started ITA
software which used racks of PC servers running parallel applications
written in Lisp (they were from MIT) and blew away the competition.
However, that's just the search part. Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because
double booking or giving away free tickets would be really bad.
There's also a rule of thumb about databases that says one system of performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.
On Fri, 13 Sep 2024 09:15:52 +0200, Terje Mathisen wrote:
Novell's System Fault Tolerant NetWare 386 (around 1990) supported two
complete servers acting like one, so that any hardware component could
fail and the system would keep running, with nothing noticed by the
clients, even those that were in the middle of an update/write request.
Just so long as it wasn’t the network connection between them that failed.
See also, “CAP Theorem”.
10-15 years ago I talked to another speaker at a conference, he told me
that he was working on high-end open source LDAP software using _very_
large memory DBs: Their system allowed one US cell phone company to keep every SIM card (~100M) on a single system, while a similar-size
competitor had been forced to fall back on 17-way sharding (presumably
using a hash of the SIM id).
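For readers unfamiliar with the fallback mentioned above, a rough Python sketch of hash-based sharding. Only the 17-way split and the idea of hashing the SIM id come from the post; the choice of SHA-256, the string form of the id, and the name `shard_for` are assumptions made for illustration, and nothing is known about how the actual system did it.

```python
import hashlib


def shard_for(sim_id, n_shards=17):
    """Map a SIM id (as a string) to a shard number in range(n_shards).

    A stable hash is used rather than Python's built-in hash(), which is
    randomized per process for strings.
    """
    digest = hashlib.sha256(sim_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


# Hypothetical ids, just to show that each one lands on exactly one shard.
sample_ids = ["sim-%08d" % n for n in range(1000)]
placement = {sim_id: shard_for(sim_id) for sim_id in sample_ids}
```

Every lookup then has to go to the shard that owns the id, and anything that spans ids spans machines, which is the cost the single large-memory system avoids.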
Terje Mathisen <[email protected]> schrieb:
10-15 years ago I talked to another speaker at a conference, he
told me that he was working on high-end open source LDAP software
using _very_ large memory DBs: Their system allowed one US cell
phone company to keep every SIM card (~100M) on a single system,
while a similar-size competitor had been forced to fall back on
17-way sharding (presumably using a hash of the SIM id).
Keeping databases in memory is definitely a thing now... see SAP HANA.
Any architectural implications for this?
Browsing through the SAP pages, it seems they used Intel's Optane
persistent memory, but that is no longer manufactured (?). But
having fast, persistent storage is definitely an advantage for
databases.
Large memory: Of course.
On the ISA level... these databases run on x86, so that seems to
be good enough.
Anything else?
How many transactions per minute does world's biggest company need
at peak hours?
Is not this number small relatively to capabilities of even 15 y.o.
dual-Xeon server with few dozens of spinning rust disks?
In article <[email protected]>,
[email protected] (Michael S) wrote:
How many transactions per minute does world's biggest company need
at peak hours?
One very painful case is credit card spending in the run-up to major holidays, such as Christmas, where the credit card companies feel the
need for central authorisation of all transactions to reduce fraud.
Fraud is, naturally, at its peak at these times. The price of wrongly
refused transactions is also high, because it means customers march
out of shops, having wasted retail staff time.
Is not this number small relatively to capabilities of even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?
This does not seem to be the case.
John
My post was specifically about flight reservations.
On 9/11/24 4:40 AM, David Brown wrote:
The other reason, of course, was the name - "PIC" is associated with
brain-dead microcontrollers with terrible C tools and which many
people program in assembly. They are also renowned for being very
solid, coming in relatively amateur-friendly packages, and for never
going out of production.
After having written code for a PIC I agree with "brain-dead". The small sized memory pages were bad enough but the total lack of an add with
carry instruction drove me mad.
So I swore them off and the introduction of a MIPS based system did
nothing to change that.
John Levine <[email protected]> writes:
According to Lawrence D'Oliveiro <[email protected]d>:
You solve that by having multiple units of the cheap machines to achieve
the same level of redundancy, or even more. That ends up being more
cost-effective than the mainframe.
That's fine for workloads that work that way.
Airline reservation systems historically ran on mainframes because
when they were invented that's all there was (original SABRE ran on
two 7090s) and they are business critical so they need to be very
reliable.
About 30 years ago some guys at MIT realized that route and fare
search, which are some of the most demanding things that CRS do, are
easy to parallelize and don't have to be particularly reliable -- if
your search system crashes and restarts and reruns the search and the
result is a couple of seconds late, that's OK. So they started ITA
software which used racks of PC servers running parallel applications
written in Lisp (they were from MIT) and blew away the competition.
However, that's just the search part. Actually booking the seats and
selling tickets stays on a mainframe or an Oracle system because
double booking or giving away free tickets would be really bad.
Booking flights or seats can easily be distributed: each flight is
assigned to one computer. To avoid double booking or free tickets
even in case of a computer crash, you use the usual transaction
processing approach, and report completion of the booking only when
the booking has reached persistent memory. For persistent memory you
use SSDs with power-loss protection.
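
A minimal sketch of the "report completion only when the booking has reached persistent memory" rule described above, assuming one append-only log file per flight. The class name `FlightLedger` and the JSON record layout are hypothetical, chosen only for illustration; a real system would also have to cope with a torn final record after a crash.

```python
import json
import os


class FlightLedger:
    """Hypothetical per-flight booking log: one machine, one file, one flight."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.seats = {}
        # After a restart, rebuild the seat map from whatever reached the log.
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.seats[record["seat"]] = record["passenger"]

    def book(self, seat, passenger):
        """Return True only after the booking record is on stable storage."""
        if seat in self.seats:
            return False  # refuse a double booking
        record = json.dumps({"seat": seat, "passenger": passenger})
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(record + "\n")
            log.flush()             # push Python's buffer to the OS
            os.fsync(log.fileno())  # ask the OS to push it to the device
        self.seats[seat] = passenger
        return True                 # completion is reported only now
```

Whether `os.fsync` really means "persistent" depends on the device honouring the flush, which is where the power-loss-protected SSDs come in.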
These SSDs, ECC RAM, RAID-1, redundant power supplies and UPSs protect against most hardware failures, but availability is still a concern
(e.g., motherboard or CPU failure; that normally does not affect data integrity if the other measures are taken, but it affects
availability). To increase availability, you can use e.g., DRBD
(distributed replicated block device) to get the data on multiple
machines.
Concerning "real bad": Airlines overbook their flights as a matter of
policy to increase their revenue. If they had a booking system that double-booked, say, 1ppm of all bookings, they probably would not even notice, and would deal with it in the same way they deal with the
cases when the overbooking actually results in too many passengers
arriving for the flight. Likewise, free tickets are not an issue if
they occur rarely enough. Do they want to spend a million on a
mainframe to avoid a revenue loss of $100k? But in any case, that's
not the problem with cheap hardware.
The problems are: When the persistent storage fails, you lose all transactions since the latest backup. To avoid that, RAID-1 helps, or
a redundant distributed storage like DRBD, or a redundant distributed transaction system. You may also want more availability than a single
system with RAID-1 (with a spare system standing by) provides, then
you have to go for one of the redundant distributed approaches.
However, my impression from booking flights online is that reliability
of the booking platform is not at all a concern for the airlines. And
as a customer, I find little difference between the booking front-end erroring out or the transaction back-end being unavailable.
There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.
If you handle each flight on one system, the contention for locks is
only within that one system. And I expect that there is not that much contention. How many people book the same flight within the same
millisecond (or however long the lock is held)?
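
As a rough illustration of "each flight on one system", the sketch below routes a flight identifier to a machine with a stable hash and keeps one lock per flight, so two bookings contend only when they target the same flight. `FlightRouter` and its methods are hypothetical names, and hashing with CRC-32 is an assumption made here, not something proposed in the thread.

```python
import threading
import zlib


class FlightRouter:
    """Hypothetical router: maps each flight to exactly one machine."""

    def __init__(self, machine_count):
        self.machine_count = machine_count
        self._locks = {}
        self._locks_guard = threading.Lock()

    def machine_for(self, flight_id):
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        return zlib.crc32(flight_id.encode("utf-8")) % self.machine_count

    def lock_for(self, flight_id):
        # One lock per flight: bookings for different flights never contend.
        with self._locks_guard:
            return self._locks.setdefault(flight_id, threading.Lock())
```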
How many transactions per minute does world's biggest company need at
peak hours? Is not this number small relatively to capabilities of
even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?
They also care deeply about reliability. Modern mainframes have multiple kinds of error
checking and standby CPUs that can take over from a failed CPU, restart a failed
instruction, and the program doesn't notice. I think you'll find a pattern since the
CDC shock of making CPUs fast enough to keep the RAM and I/O devices busy while having
the error checking and recovery features so the systems keep running for years at a time.
I suppose. A review from the USDOE said:
The IBM 3090 with Vector Facility is an extremely interesting machine
because it combines very good scaler performance with enhanced vector
and multitasking performance. For many IBM installations with a large
scientific workload, the 3090/vector/MTF combination may be an ideal
means of increasing throughput at minimum cost. However, neither the
vector nor multitasking capabilities are sufficiently developed to
make the 3090 competitive with our current worker machines for our
large-scale scientific codes.
Novell's System Fault Tolerant NetWare 386 (around 1990) supported two complete servers acting like one, so that any hardware component could
fail and the system would keep running, with nothing noticed by the
clients, even those that were in the middle of an update/write
request.
That's fine for workloads that work that way.
Airline reservation systems historically ran on mainframes because when they were invented
that's all there was (original SABRE ran on two 7090s) and they are business critical so
they need to be very reliable.
About 30 years ago some guys at MIT realized that route and fare search, which are some of
the most demanding things that CRS do, are easy to parallelize and don't have to be
particularly reliable -- if your search system crashes and restarts and reruns the search
and the result is a couple of seconds late, that's OK. So they started ITA software which
used racks of PC servers running parallel applications written in Lisp (they were from
MIT) and blew away the competition.
However, that's just the search part. Actually booking the seats and selling tickets stays
on a mainframe or an Oracle system because double booking or giving away free tickets would
be really bad.
There's also a rule of thumb about databases that says one system of performance 100 is
much better than 100 systems of performance 1 because those 100 systems will spend all
their time contending for database locks.
On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Terje Mathisen <[email protected]> schrieb:
10-15 years ago I talked to another speaker at a conference, he
told me that he was working on high-end open source LDAP software
using _very_ large memory DBs: Their system allowed one US cell
phone company to keep every SIM card (~100M) on a single system,
while a similar-size competitor had been forced to fall back on
17-way sharding (presumably using a hash of the SIM id).
Keeping databases in memory is definitely a thing now... see SAP HANA.
Any architectural implications for this?
Browsing through the SAP pages, it seems they used Intel's Optane
persistent memory, but that is no longer manufactured (?). But
having fast, persistent storage is definitely an advantage for
databases.
Large memory: Of course.
On the ISA level... these databases run on x86, so that seems to
be good enough.
Anything else?
Another thing that SAP HANA seems to use more intensely than anybody
else is Intel TSX. TSX (at least RTM part, I am not sure about HLE
part) still present in the latest Xeon generation, but is strongly de-emphasized.
I had also started pontificating the relative disk throughput had gotten
an order of magnitude slower (disks got 3-5 times faster while systems
got 40-50 times faster) since 360 announce.
There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.
How many transactions per minute does world's biggest company need at
peak hours?
Is not this number small relatively to capabilities of
even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?
Michael S <[email protected]> schrieb:
On Fri, 13 Sep 2024 11:20:06 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Anything else?
Another thing that SAP HANA seems to use more intensely than anybody
else is Intel TSX. TSX (at least RTM part, I am not sure about HLE
part) still present in the latest Xeon generation, but is strongly
de-emphasized.
Sounds like a market niche... Mitch, how good is your ESM for
in-memory databases?
How many transactions per minute does world's biggest company need at
peak hours?
Keeping databases in memory is definitely a thing now... see SAP HANA.
So there's real demand for systems with huge capacity. Not very many of
them, but they have large budgets.
It appears that Michael S <[email protected]> said:
There's also a rule of thumb about databases that says one system of
performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.
How many transactions per minute does world's biggest company need at
peak hours?
Ten years ago Visa could process 56,000 messages/second. It must be a
lot more now. I think a transaction is two or four messages depending
on the transaction type.
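
To put the Visa figure in the units of the question (transactions per minute), here is the conversion worked out, taking the 56,000 messages/second and the "two or four messages" per transaction from the post above as given. The function name is made up for the example.

```python
MESSAGES_PER_SECOND = 56_000  # figure quoted in the post above


def transactions_per_minute(messages_per_second, messages_per_transaction):
    """Convert a message rate into a transaction rate per minute."""
    return messages_per_second * 60 // messages_per_transaction


low_estimate = transactions_per_minute(MESSAGES_PER_SECOND, 4)   # 840_000
high_estimate = transactions_per_minute(MESSAGES_PER_SECOND, 2)  # 1_680_000
```

So the quoted capacity corresponds to roughly 0.84 to 1.68 million transactions per minute.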
Is not this number small relatively to capabilities of
even 15 y.o. dual-Xeon server with few dozens of spinning rust disks?
Uh, no, it is not.
but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).
Ten years ago Visa could process 56,000 messages/second.
Anton Ertl <[email protected]> schrieb:
[in-memory database]
but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).
The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.
Thomas Koenig <[email protected]> writes:
Anton Ertl <[email protected]> schrieb:
[in-memory database]
but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).
The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.
What is the relevance of SAP HANA for the topic at hand?
Anton Ertl <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.
What is the relevance of SAP HANA for the topic at hand?
It is something that is implemented, unlike what you were discussing.
Thomas Koenig <[email protected]> writes:
Anton Ertl <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.
What is the relevance of SAP HANA for the topic at hand?
It is something that is implemented, unlike what you were discussing.
So what? Linux is also implemented, and it runs on a 32GB machine.
Neither Linux nor SAP HANA satisfy even the most basic requirement
that I outlined (keeping balances) without additional implementation
work. And I doubt that if you give a 15 year old Dual-Xeon even with
64GB of RAM and a bunch of HDDs to a typical SAP developer, he will
implement a system that manages to keep the balance on 1.8M (double
the number for double RAM capacity) credit cards at 56K transactions
per second on that system. What I described is relatively
straightforward to implement on top of Linux.
On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:
How many transactions per minute does world's biggest company need
at peak hours? Is not this number small relatively to capabilities
of even 15 y.o. dual-Xeon server with few dozens of spinning rust
disks?
A SWAG::
8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.
So we have only 3B potential transactions, and a single person will
not average more than 1 transaction every 15 minutes over an hour.
So: 3B/15 = 200M T/m
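
The SWAG above, spelled out. The inputs are the post's own guesses; note that a third of 8B is closer to 2.7B, which the post rounds up to 3B. The function name is invented for the example.

```python
WORLD_POPULATION = 8_000_000_000
ACTIVE_PEOPLE = 3_000_000_000      # the post's rounding of one third of 8B
MINUTES_BETWEEN_TRANSACTIONS = 15  # at most one transaction per 15 minutes


def swag_transactions_per_minute(active_people, minutes_between):
    """Upper-bound guess: everyone active transacts once per interval."""
    return active_people // minutes_between


one_third = WORLD_POPULATION // 3  # 2_666_666_666, i.e. about 2.7B
rate = swag_transactions_per_minute(ACTIVE_PEOPLE, MINUTES_BETWEEN_TRANSACTIONS)
# rate is 200_000_000 transactions per minute, about 3.3 million per second
```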
It appears that Michael S <[email protected]> said:
There's also a rule of thumb about databases that says one system
of performance 100 is much better than 100 systems of performance 1
because those 100 systems will spend all their time contending for
database locks.
How many transactions per minute does world's biggest company need at
peak hours?
Ten years ago Visa could process 56,000 messages/second. It must be a
lot more now. I think a transaction is two or four messages depending
on the transaction type.
Is not this number small relatively to capabilities of
even 15 y.o. dual-Xeon server with few dozens of spinning rust
disks?
Uh, no, it is not.
On Fri, 13 Sep 2024 17:05:35 +0000
[email protected] (MitchAlsup1) wrote:
On Fri, 13 Sep 2024 9:22:17 +0000, Michael S wrote:
How many transactions per minute does world's biggest company need
at peak hours? Is not this number small relatively to capabilities
of even 15 y.o. dual-Xeon server with few dozens of spinning rust
disks?
A SWAG::
8B people in the world: 1/3rd sleeping, 1/3rd working, 1/3rd relaxing.
So we have only 3B potential transactions, and a single person will
not average more than 1 transaction every 15 minutes over an hour.
So: 3B/15 = 200M T/m
I don't know about you, but I personally don't book flights 8 hours per
day. Even less so in the biggest company in the world, which, I
suppose, does not account for more than 5-7% of world's flights.
Ten years ago Visa could process 56,000 messages/second. It must be a
lot more now. I think a transaction is two or four messages depending
on the transaction type.
On Sat, 14 Sep 2024 10:57:57 -1000, Lynn Wheeler wrote:
Some of the cellphone companies were enticed to get into micropayments
but got out after a few years, turns out they lacked the significant
fraud handling capability ...
Meanwhile, the Kenyans have figured out how to run a successful online micropayments system mediated via text messages (M-Pesa).
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?
You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
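
For readers who have not met the two flavours of addition, here is a small simulation of the difference between a wrapping add (MIPS addu) and a trapping add (MIPS add) on 32-bit values. It is only a model in Python: the hardware trap is represented by raising `OverflowError`, which is an assumption of this sketch.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def addu(a, b):
    """Modulo 2**32 addition, result reinterpreted as a signed 32-bit value."""
    total = (a + b) & 0xFFFFFFFF
    return total - 2**32 if total > INT32_MAX else total


def add(a, b):
    """Addition that 'traps' (here: raises) on signed 32-bit overflow."""
    total = a + b
    if total < INT32_MIN or total > INT32_MAX:
        raise OverflowError("signed 32-bit overflow")
    return total
```

With these definitions `addu(INT32_MAX, 1)` wraps around to `INT32_MIN`, while `add(INT32_MAX, 1)` raises.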
In article <[email protected]>,
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
[ More details about architectures without trapping overflow
instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about performance.
If you want to do almost anything at all other than core dump on
overflow, you need to branch to recovery code. And although it's theoretically possible to recover from the trap, it's worse than any
other approach. So it's added hardware that's HARDER for software to
use. No surprise it's gone away.
But then IEEE 754 exception semantics make even less sense than Linux signals. ...
3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.
On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:
But then IEEE 754 exception semantics make even less sense than Linux
signals. ...
Note that what IEEE 754 calls an “exception” is just a bunch of status bits reporting on the current state of the computation: there is no implication of some transfer of control elsewhere.
On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:
3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.
This won’t work. The values outside the range are by definition
non-representable, so comparisons against them are useless.
On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:
On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:
3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.
This won’t work. The values outside the range are by definition non-
representable, so comparisons against them are useless.
When a range is 0..10 both -1 and 11 are representable in
the arithmetic of ALL computers, just not in the language
specifying the range.
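
The point about a 0..10 range can be made concrete: the out-of-range values -1 and 11 are perfectly ordinary machine integers, so comparing against the bounds and clamping both work. The function names below are hypothetical, and the default bounds are simply the ones from the example above.

```python
def in_subrange(value, low=0, high=10):
    """Range check: the comparison works even when value is out of range."""
    return low <= value <= high


def clamp_to_subrange(value, low=0, high=10):
    """Bring an out-of-range value back to the nearest bound."""
    return max(low, min(high, value))
```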
On Sat, 21 Sep 2024 1:09:43 +0000, Lawrence D'Oliveiro wrote:
On Fri, 20 Sep 2024 22:00:28 +0000, MitchAlsup1 wrote:
But then IEEE 754 exception semantics make even less sense than Linux
signals. ...
Note that what IEEE 754 calls an “exception” is just a bunch of status
bits reporting on the current state of the computation: there is no
implication of some transfer of control elsewhere.
Then how do you implement the alternate exception model ??? which IS
part of 754-2008 and 754-2019
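
Both views in this exchange can be seen side by side in Python's `decimal` module. Note the assumption: `decimal` follows the General Decimal Arithmetic Specification, not binary IEEE 754 hardware, so this is an analogy for the two models (status flags versus a transfer of control), not a demonstration of any CPU's behaviour. The two helper functions are invented for the example.

```python
import decimal
from decimal import Decimal


def divide_with_flags(a, b):
    """Default-style handling: continue with a result and set a status flag."""
    ctx = decimal.Context(traps=[])  # no signal transfers control
    result = ctx.divide(Decimal(a), Decimal(b))
    return result, bool(ctx.flags[decimal.DivisionByZero])


def divide_with_trap(a, b):
    """Alternate-style handling: the same signal transfers control."""
    ctx = decimal.Context(traps=[decimal.DivisionByZero])
    try:
        return ctx.divide(Decimal(a), Decimal(b))
    except decimal.DivisionByZero:
        return None  # control arrived in the handler instead
```

In the first function a division by zero yields Infinity and leaves a flag behind to be inspected later; in the second the same event raises.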
On Sat, 21 Sep 2024 1:12:11 +0000, Lawrence D'Oliveiro wrote:
On Fri, 20 Sep 2024 18:35:26 -0000 (UTC), Kent Dickey wrote:
3) You want to clamp the value to a reasonable range and continue.
The reasonable values need to be looked up somewhere.
This won’t work. The values outside the range are by definition non-
representable, so comparisons against them are useless.
When a range is 0..10 both -1 and 11 are representable in the arithmetic
of ALL computers, just not in the language specifying the range.
In article <[email protected]>,
Anton Ertl <[email protected]> wrote:
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?
You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about performance.
If you want to do almost anything at all other than core dump on
overflow, you need to branch to recovery code. And although it's theoretically possible to recover from the trap, it's worse than any
other approach. So it's added hardware that's HARDER for software to
use. No surprise it's gone away.
IA64 went down this road--trapping on speculation failures. It was a
huge disaster--trying to recover through an exception handler mechanism
is slow and painful, for the reasons I'll lay out for overflow
exceptions.
Let's look at how you might want to handle overflows when they happen:
1) Your language supports seamlessly transitioning to BigInts on
overflow. Then each operation that could overflow needs to call
a special bit of code to change to BigInt and then continue the
calculation. This code must exist, even if a trapping
instruction doesn't need an explicit branch to it. Some
mechanism is needed to call this code.
2) You need to call an exception handler, and the routine with the overflow
is ended. We need to know which exception handler to call.
3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.
4) You just want to crash the program. If a debugger is attached, it can
say where the overflow occurred.
Trapping on overflow instructions really are only useful for #4. Let's
look at how the other cases could be handled, with a) meaning using
branches, and b) meaning using a trapping instruction.
1a) (BigInt): After doing an operation which could overflow, use a
conditional branch to jump to code to convert to BigInt, which
then jumps back. Overhead is basically the branch-on-overflow
instruction.
1b) (BigInt with traps). Hardware traps to the OS, which needs to prepare
the required structures describing the exception (all regs and
the address), and then call the signal handler. The signal
handler needs to look up the address of the trap with a table
describing what to do for this particular operation which
overflowed. Each table entry needs to describe, in detail, what
registers are involved (the sources and the dest), and where to
return once the BigInt has been created. This requires massive
changes to the compiler (and possibly linker) to prepare these
tables. The compiler must guarantee that changing the dest
register to a pointer to BigInt works properly (otherwise,
special code needs to be emitted for each potentially trapping
instruction to try to recover).
2a) (Try/Catch): After doing an operation which could overflow, use a
conditional branch to jump to the catch block.
2b) (Try/Catch with traps). Repeat all the OS work and call the signal
handler. Now, it just needs a table entry describing where to
jump to to enter the catch block. Almost all the complexity of
1b), but without needing the register details.
3a) (Clamp): After doing an operation which could overflow, use a
conditional branch to do the MIN/MAX operations to bring it back
within range and then jump back.
3b) (Clamp with trap): Basically the same as 1b), but there's an alternative
if the clamps are global (MAX_INT/MIN_INT). The exception handler
can read the instruction which trapped, figure out the source and
dest registers, re-do the calculation, and clamp the destination
to MIN or MAX, and return to just after the instruction which
trapped.
4a) (Crash): Every operation that could overflow needs a conditional branch
after it to branch to a crashing instruction (or a branch over
an undefined instruction if there's no overflow).
4b) (Crash with trap): Use operations which trap on overflow. This takes
no new instructions and costs no performance.
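
To make the bookkeeping in case 1b) concrete, here is a toy model of the table the signal handler would need: one entry per potentially trapping instruction, keyed by its address, naming the source and destination registers and the resume address. Everything here (`TrapEntry`, `TRAP_TABLE`, the register names, the addresses) is hypothetical, and Python's unbounded `int` stands in for the BigInt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrapEntry:
    """What the compiler would have to record for ONE trapping add."""
    src_a: str      # first source register
    src_b: str      # second source register
    dest: str       # destination register
    resume_pc: int  # where to continue once the BigInt exists


# Hypothetical table emitted by the compiler, keyed by trap address.
TRAP_TABLE = {
    0x1000: TrapEntry(src_a="r1", src_b="r2", dest="r3", resume_pc=0x1004),
}


def overflow_trap_handler(trap_pc, regs):
    """Model of the signal handler: look up the entry, redo the add exactly."""
    entry = TRAP_TABLE[trap_pc]  # the per-trap lookup the post mentions
    regs[entry.dest] = regs[entry.src_a] + regs[entry.src_b]
    return entry.resume_pc
```

Even in this toy form, each table entry carries more information than the single branch instruction it replaces.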
Basically, all a) cases are:
    op_which_might_overflow();
    if (overflow_happened) {
        /* handle the overflow */
    }
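
As an illustration of that shape, the sketch below models a 64-bit add that reports an overflow flag, followed by the three "branch" handlers from cases 1a), 2a) and 3a). It is a simulation in Python under stated assumptions: the helper names are invented, Python's `int` plays the BigInt, and the clamp is the global MIN/MAX variant.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def add_with_flag(a, b):
    """Return (wrapped 64-bit sum, overflow flag) for in-range inputs."""
    exact = a + b
    overflow = exact < INT64_MIN or exact > INT64_MAX
    wrapped = (exact - INT64_MIN) % 2**64 + INT64_MIN
    return wrapped, overflow


def add_bigint(a, b):
    """Case 1a: on overflow, branch to code that switches to a BigInt."""
    result, overflow = add_with_flag(a, b)
    if overflow:
        result = a + b  # Python int stands in for the BigInt
    return result


def add_or_raise(a, b):
    """Case 2a: on overflow, branch to the catch block."""
    result, overflow = add_with_flag(a, b)
    if overflow:
        raise OverflowError("64-bit signed overflow")
    return result


def add_clamped(a, b):
    """Case 3a: on overflow, branch to MIN/MAX clamping."""
    result, overflow = add_with_flag(a, b)
    if overflow:
        # An add only overflows when both operands have the same sign.
        result = INT64_MAX if a > 0 else INT64_MIN
    return result
```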
Trapping-on-overflow instructions are clearly useless for languages
which care about overflow.
To save one branch instruction, an entry is
needed to describe how to handle the overflow, which is certainly larger
than a branch instruction. And the code to "handle the overflow" is
needed in any case. And this assumes some sort of instant lookup--of the
1000 overflow instructions, we need a hash table to look up the address,
which is more overhead.
Trapping on overflow instructions are useful as a debug aid for
languages which don't care about overflow--but then you're optimizing something nearly useless. It also might be helpful if global clamping
to MIN/MAX was useful (and I don't think it is).
Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding trap-on-overflow instructions is a waste of effort.
Note that using traps on data access violations which are "fixed" by
signal handlers CAN work out. They are slow, but as long as the
exception handler can fix the access violation and return right to the instruction which failed (without needing to know ANYTHING about that instruction in particular), this can work fine. But integer overflow
doesn't work like that--it's generally not possible to figure out
in the trap handler what to do without more information.
Kent
Kent Dickey wrote:
Basically, all a) cases are:
    op_which_might_overflow();
    if (overflow_happened) {
        /* handle the overflow */
    }
Trapping-on-overflow instructions are clearly useless for languages
which care about overflow.
This conclusion is completely wrong.
Exceptions are an event detection and notification delivery mechanism.
It is very efficient if those events are rarely or never supposed to
occur.
In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).
Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.
According to MitchAlsup1 <[email protected]>:
In the days before <good> branch prediction having a conditional
branch after each instruction that could have an execution problem
was an extremely poor choice. Thus, exceptions were invented (circa
1958).
Oh, it was worse than that. There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.
Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.
Some of us remember imprecise interrupts and the OS/360 S0C0
completion code.
But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
failures and do something else.
On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:
But you are in general right, it makes more sense to keep the computer
running in the normal case and provide slow ways to recover from
failures and do something else.
Aren’t branches that are not taken supposed to be fast?
Well, they are not taken, so they should be faster... ;^)
On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:
In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).
So all that does is push the conditional branch into the microcode. And
make the instruction more complicated. Why should that be faster?
On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:
On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:
But you are in general right, it makes more sense to keep the
computer running in the normal case and provide slow ways to recover
from failures and do something else.
Aren’t branches that are not taken supposed to be fast?
Well, they are not taken, so they should be faster... ;^)
It is NOT the speed, it is the code bloat.
On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:
On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:
In the days before <good> branch prediction having a conditional
branch after each instruction that could have an execution problem was
an extremely poor choice. Thus, exceptions were invented (circa 1958).
So all that does is push the conditional branch into the microcode. And
make the instruction more complicated. Why should that be faster?
It pushes the branch into the mispredict-recovery path and does not
occupy any code space.
There is no microcode outside of Z-system these days.
On Sun, 22 Sep 2024 01:23:35 +0000, MitchAlsup1 wrote:
On Sat, 21 Sep 2024 23:29:01 +0000, Lawrence D'Oliveiro wrote:
On Sat, 21 Sep 2024 20:39:38 +0000, MitchAlsup1 wrote:
In the days before <good> branch prediction having a conditional
branch after each instruction that could have an execution problem was
an extremely poor choice. Thus, exceptions were invented (circa 1958).
So all that does is push the conditional branch into the microcode. And
make the instruction more complicated. Why should that be faster?
It pushes the branch into the mispredict-recovery path and does not
occupy any code space.
There is no microcode outside of Z-system these days.
It occupies some space, either microcode or circuit logic, or both.
And why should that be faster?
On Sun, 22 Sep 2024 01:24:05 +0000, MitchAlsup1 wrote:
On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:
On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
On Sat, 21 Sep 2024 22:14:12 -0000 (UTC), John Levine wrote:
But you are in general right, it makes more sense to keep the
computer running in the normal case and provide slow ways to recover
from failures and do something else.
Aren’t branches that are not taken supposed to be fast?
Well, they are not taken, so they should be faster... ;^)
It is NOT the speed, it is the code bloat.
That’s an argument against RISC though, isn’t it?
It is faster if for no other reason than that it did not fetch the branch
that is always predicted non-taken.
If every calculation instruction had to be followed by a conditional
branch, then the code would be 150% its original size (or worse).
There is no microcode outside of Z-system these days.
On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:
On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
Aren't branches that are not taken supposed to be fast?
Well, they are not taken, so they should be faster... ;^)
It is NOT the speed, it is the code bloat.
According to MitchAlsup1 <[email protected]>:
In the days before <good> branch prediction having a conditional branch
after each instruction that could have an execution problem was an
extremely poor choice. Thus, exceptions were invented (circa 1958).
Oh, it was worse than that. There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.
Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.
Some of us remember imprecise interrupts and the OS/360 S0C0
completion code.
But you are in general right, it makes more sense to keep the computer running in the normal case and provide slow ways to recover from
failures and do something else.
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values.
In article <[email protected]>, [email protected] (MitchAlsup1) wrote:
On Sun, 22 Sep 2024 0:14:49 +0000, Chris M. Thomasson wrote:
On 9/21/2024 4:29 PM, Lawrence D'Oliveiro wrote:
Aren't branches that are not taken supposed to be fast?
Well, they are not taken, so they should be faster... ;^)
It is NOT the speed, it is the code bloat.
Yup. Bigger code is always a potential problem, not so much because it
takes up RAM nowadays, but because it takes up memory bandwidth and
cache space. Using up cache space is always bad, because bigger caches
are slower, and instructions seem naturally smaller than cache blocks.
Wanting smaller code isn't an argument against RISC, but an argument
against poorly optimised ISA design. Variable-length CISC makes it
easier to get smaller average instruction sizes but has other drawbacks.
For the stuff I work on, ARM64 code is consistently smaller than x86-64,
although the factor varies by platform.
John
Let's look at how you might want to handle overflows when they happen:
1) Your language supports seamlessly transitioning to BigInts on
overflow. Then each operation that could overflow needs to call
a special bit of code to change to BigInt and then continue the
calculation. This code must exist, even if a trapping
instruction doesn't need an explicit branch to it. Some
mechanism is needed to call this code.
2) You need to call an exception handler, and the routine with the overflow
is ended. We need to know which exception handler to call.
3) You want to clamp the value to a reasonable range and continue. The
reasonable values need to be looked up somewhere.
4) You just want to crash the program. If a debugger is attached, it can
say where the overflow occurred.
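
[Editor's sketch] The first three cases above can be written out in C with explicit checks, which shows what extra code each one needs regardless of whether the check is a branch or a trap. This is only an illustration: it assumes the GCC/Clang builtin `__builtin_add_overflow`, the function names are made up for this note, and `int64_t` merely stands in for a real BigInt type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Case 1 stand-in: on overflow, redo the add in a wider type.
   A real BigInt runtime would allocate here; int64_t is only a placeholder. */
static int64_t add_or_widen(int32_t a, int32_t b)
{
    int32_t r;
    if (__builtin_add_overflow(a, b, &r))
        return (int64_t)a + (int64_t)b;   /* slow path */
    return r;                             /* fast path */
}

/* Case 2: report the overflow so the caller can pick its handler. */
static bool add_checked(int32_t a, int32_t b, int32_t *out)
{
    return !__builtin_add_overflow(a, b, out);
}

/* Case 3: clamp to [lo, hi]; the bounds have to be supplied from somewhere. */
static int32_t add_clamped(int32_t a, int32_t b, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    int64_t wide = (int64_t)a + (int64_t)b;  /* cannot overflow in 64 bits */
    if (wide < lo) return lo;
    if (wide > hi) return hi;
    return (int32_t)wide;
}
```

In each case the recovery code and its inputs (wider type, handler choice, clamp bounds) exist separately from the detection itself, which is the point the list is making.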
Kent Dickey wrote:
In article <[email protected]>,
Anton Ertl <[email protected]> wrote:
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?
You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.
Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.
That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating
the error checks.
Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.
I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.
But they didn't so it isn't.
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding
trap-on-overflow instructions is a waste of effort.
No they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.
"I have one example where overflow exceptions would be a poor implementation
choice" does not imply "therefore no one should have them as an option".
Some of the things that minimize the "badness" of taking an exception::
a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.
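
[Editor's sketch] Items (c), (d) and (e) describe the handler as an ordinary subroutine: it receives the instruction bits and operand values, and its return value lands in the destination register. A software model of that contract might look like the following. All type and function names are hypothetical, the "hardware" is simulated with the GCC/Clang builtin `__builtin_add_overflow`, and the re-execute/skip choice of item (f) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record for item (c): what would be handed to the handler. */
typedef struct {
    uint32_t insn_bits;   /* raw encoding of the excepting instruction */
    int64_t  src1, src2;  /* operand values at the time of the exception */
} ExcArgs;

/* Item (e): the handler returns a non-excepting value, or asks to abort. */
typedef enum { EXC_DELIVER, EXC_ABORT } ExcAction;
typedef struct { ExcAction action; int64_t result; } ExcReply;

typedef ExcReply (*ExcHandler)(const ExcArgs *);

/* Example handler: saturate instead of overflowing. When an add overflows,
   both operands have the same sign, so src1's sign picks the bound. */
static ExcReply saturate_handler(const ExcArgs *a)
{
    ExcReply r = { EXC_DELIVER, a->src1 < 0 ? INT64_MIN : INT64_MAX };
    return r;
}

/* Software stand-in for an ADD that raises on overflow. */
static bool add_with_handler(int64_t x, int64_t y, ExcHandler h, int64_t *dst)
{
    assert(h != 0);
    int64_t sum;
    if (!__builtin_add_overflow(x, y, &sum)) { *dst = sum; return true; }
    ExcArgs args = { 0u, x, y };        /* insn_bits is a placeholder */
    ExcReply rep = h(&args);
    if (rep.action == EXC_ABORT) return false;
    *dst = rep.result;                  /* item (d): result goes to destination */
    return true;
}
```

The handler never sees a program counter or an instruction length here, which is the property items (e) and (f) ask for.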
[email protected] (MitchAlsup1) writes:
There is no microcode outside of Z-system these days.
Every AMD64 processor has microcode.
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
<snip>
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching
things up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and showed how
the other cases don't make sense.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.
So why should any hardware include an instruction to trap-on-overflow?
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against -- it is just a bad idea for the ISA to have ALU instructions
take exceptions.
Kent
Some of the things that minimize the "badness" of taking an exception::
a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.
Note that in the case where you want the overflow exception to jump to
some alternate code path (a language-level exception handler, or a code
path that continues with a bigint instead of a register-sized integer),
(d) is useless because you don't want to return to the overflowing
instruction (nor to the immediately following instruction). Instead you
usually want to look up a side table indexed with the address of the
overflowing instruction to find the "exception handler" to "return" to.
(a) (b) and (c) are still very welcome, of course.
Stefan
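
[Editor's sketch] The side-table lookup described above can be pictured as a sorted array keyed by the address of the overflowing instruction. The layout and names below are assumptions made for illustration, not any particular ABI's unwind format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per instruction that may overflow; all names are hypothetical. */
typedef struct {
    uintptr_t fault_pc;    /* address of the overflowing instruction */
    uintptr_t landing_pc;  /* where to "return" to instead */
} SideEntry;

/* Binary search; the table must be sorted by fault_pc. Returns 0 if absent. */
static uintptr_t find_landing_pad(const SideEntry *tab, size_t n, uintptr_t pc)
{
    assert(tab != NULL || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].fault_pc == pc) return tab[mid].landing_pc;
        if (tab[mid].fault_pc < pc) lo = mid + 1;
        else hi = mid;
    }
    return 0;
}
```

The table costs no space or cycles on the non-overflowing path; the lookup is only paid once the exception has already been taken.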
From a programmer's perspective, VAX exception handling was very nice.
It may have been high overhead, though.
You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions
I am at a loss for how to take all 3 arguments together at the same time
!?! Can you explain ??
There were instructions like "Divide or
Halt" which stopped the computer with an error light on a zero divide.
Sort of like "Halt_And_Catch_Fire"?
On 9/23/2024 5:43 PM, Lawrence D'Oliveiro wrote:
Imagine if it was a design feature that, if the error light came on too
much, it would overheat and set the machine on fire? ;)
Would that encourage programmers to have fewer bugs in their
programs ... ?
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In article <[email protected]>,
Anton Ertl <[email protected]> wrote:
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?
You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.
Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.
That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating
the error checks.
Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.
I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.
But they didn't so it isn't.
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching
things up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and showed how
the other cases don't make sense.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.
You then went on to discuss how you want trap-on-overflow instructions
for stuff like C code, so you can detect code bugs and shut down gracefully.
And my response to that is we already know compilers don't use it. x86
has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
    return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument for detecting signed overflows and crashing).
For x86-64 clang 19.1.0:
add2:
        add     edi, esi
        jo      .LBB0_1
        mov     eax, edi
        ret
.LBB0_1:
        ud1     eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
But x86 has an instruction to trap on overflow directly: INTO. It's one byte.
And it doesn't use it.
GCC x86-64 14.2 is even worse:
add2:
        sub     rsp, 8
        call    __addvsi3
        add     rsp, 8
        ret
It calls a routine to do all additions which might overflow, and that
routine calls abort() if an overflow occurs.
The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.
So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.
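
[Editor's sketch] For readers who want the clang pattern without -ftrapv, roughly the same add-then-branch-to-trap shape can be written by hand. This assumes the GCC/Clang builtins `__builtin_add_overflow` and `__builtin_trap`; the function names are invented for this note, and the exact instructions emitted will depend on compiler and version.

```c
#include <assert.h>
#include <limits.h>

/* Plain add, then a branch to a trap on overflow. */
static int add2_trapping(int num, int other)
{
    int sum;
    if (__builtin_add_overflow(num, other, &sum))
        __builtin_trap();        /* does not return */
    return sum;
}

/* Non-trapping probe: would add2_trapping() trap for these inputs? */
static int add2_would_trap(int num, int other)
{
    int sum;
    return __builtin_add_overflow(num, other, &sum);
}

/* Small self-check that exercises only the non-trapping paths. */
void add2_selfcheck(void)
{
    assert(add2_trapping(2, 3) == 5);
    assert(add2_would_trap(INT_MAX, 1));
    assert(!add2_would_trap(INT_MAX, 0));
}
```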
On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
<snip>
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching
things up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and showed how
the other cases don't make sense.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
So why should any hardware include an instruction to trap-on-overflow?
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against -- it is just a bad idea for the ISA to have ALU instructions
take exceptions.
You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions
I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??
On Sun, 22 Sep 2024 09:14:04 -0700, Lars Poulsen wrote:
From a programmer's perspective, VAX exception handling was very
nice. It may have been high overhead, though.
Very high overhead. But it was also language-independent, and
integrated into the procedure-calling convention, which also managed
to be language-independent.
There is an internal memo on Bitsavers somewhere, critiquing a
proposal to adopt the MIPS architecture (which DEC did, for just one
machine, the DECstation 3000 if I recall rightly), and one of the
points against MIPS was that it didn’t have language-independent
exception handling. But then no other architecture, before the VAX or
since, has been able to do that.
MitchAlsup1 wrote:
On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
<snip>
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching
things up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and showed how
the other cases don't make sense.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
So why should any hardware include an instruction to trap-on-overflow?
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against -- it is just a bad idea for the ISA to have ALU instructions
take exceptions.
You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions
I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??
Maybe all add/sub/etc opcodes that are immediately followed by an
INTO could be fused into a single ADDO/SUBO/etc version that takes
zero extra cycles as long as the trap part isn't hit?
Personally I'm happy with the clang approach.
Terje
On Tue, 24 Sep 2024 08:02:23 +0200
Terje Mathisen <[email protected]> wrote:
MitchAlsup1 wrote:
On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
<snip>
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching
things up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and showed how
the other cases don't make sense.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
So why should any hardware include an instruction to trap-on-overflow?
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against -- it is just a bad idea for the ISA to have ALU instructions
take exceptions.
You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions
I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??
Maybe all add/sub/etc opcodes that are immediately followed by an
INTO could be fused into a single ADDO/SUBO/etc version that takes
zero extra cycles as long as the trap part isn't hit?
Personally I'm happy with the clang approach.
Couple of questions:
1. Which code would you put at the destination of the jo branch?
2. In your code generator, would every jo in the code (or in the module,
or in the function) jump to the same destination, or would each have a
destination of its own?
It would be interesting if you answer before looking at what clang does,
then take a look and comment again.
Michael S wrote:
On Tue, 24 Sep 2024 08:02:23 +0200
Terje Mathisen <[email protected]> wrote:
MitchAlsup1 wrote:
On Mon, 23 Sep 2024 21:57:08 +0000, Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
<snip>
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching
things up and continuing" doesn't work). I gave details of reasons folks
might want to try to use trap-on-overflow instructions, and showed how
the other cases don't make sense.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
So why should any hardware include an instruction to trap-on-overflow?
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing
against -- it is just a bad idea for the ISA to have ALU instructions
take exceptions.
You argue that trap-on-overflow as an instruction is unnecessary
AND
You argue that overflow detection is worthwhile
AND
You argue that ALU should not raise overflow exceptions
I am at a loss for how to take all 3 arguments together at the
same time !?! Can you explain ??
Maybe all add/sub/etc opcodes that are immediately followed by an
INTO could be fused into a single ADDO/SUBO/etc version that takes
zero extra cycles as long as the trap part isn't hit?
Personally I'm happy with the clang approach.
Couple of questions:
1. Which code would you put at the destination of the jo branch?
2. In your code generator, would every jo in the code (or in the
module, or in the function) jump to the same destination, or would
each have a destination of its own?
It would be interesting if you answer before looking at what clang
does, then take a look and comment again.
If the handler consists of terminating the program, then every
function, or small group of functions depending upon total code size,
can have a common target, just so that all the JO opcodes can use the
short-form two-byte encoding of a forward branch. I.e. leaving just
127 bytes available for mainline code.
If you want separate handling for each overflow, i.e. switch to
bigint and resume, then you do need one target per JO, in order to
pick up the originating instruction address (and place it on the
stack for a subsequent RET?) before jumping to a common handler.
Terje
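
[Editor's sketch] The 127-byte figure comes from the short JO encoding (opcode 70 plus a signed 8-bit displacement, measured from the end of the two-byte instruction); anything further away needs the six-byte near form (0F 80 plus a 32-bit displacement). A small helper, with a made-up name, shows the size decision an assembler has to make:

```c
#include <assert.h>
#include <stdint.h>

enum { JO_SHORT_LEN = 2, JO_NEAR_LEN = 6 };  /* 70 cb  vs  0F 80 cd */

/* Bytes needed for a JO located at insn_addr that branches to target. */
static int jo_length(uint64_t insn_addr, uint64_t target)
{
    /* rel8 is measured from the end of the 2-byte short form. */
    int64_t disp = (int64_t)(target - (insn_addr + JO_SHORT_LEN));
    assert(disp >= INT32_MIN && disp <= INT32_MAX);  /* sketch covers rel32 only */
    return (disp >= INT8_MIN && disp <= INT8_MAX) ? JO_SHORT_LEN : JO_NEAR_LEN;
}
```

With one shared trap target per function, every JO more than 127 bytes before it costs four extra bytes, which is why grouping small functions around a common target keeps the checks cheap.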
Kent Dickey <[email protected]> schrieb:
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values.
I disagree.
Look at the sanitizer libraries, which insert runtime checks for
integer overflow - having less overhead for these would definitely
be a plus.
See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .
Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra cycles as long as the trap part isn't hit?
I want address of originating instruction in the handler.
I want it not for switch to bigint that would not be in spirit of
non-dynamic compiled languages, but in order to get useful termination printout.
With JO in order to get what I want I'd have to pay by significant
increase in code size.
In article <vcpidc$29e51$[email protected]>,
Thomas Koenig <[email protected]> wrote:
Kent Dickey <[email protected]> schrieb:
Trapping on overflow is basically useless other than as a debug aid, which clearly nobody values.
I disagree.
Look at the sanitizer libraries, which insert runtime checks for
integer overflow - having less overhead for these would definitely
be a plus.
See https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
or https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags .
Not valuing something just means no one is spending a lot of time/effort
on it. Decimal math is not valued--but you can still do it, it just
has no special instructions on most architectures to make it fast/easy.
And as I've pointed out, trapping on integer overflow is clearly not valued--on x86, where INTO exists, GCC and Clang do not use it.
On 9/23/2024 11:02 PM, Terje Mathisen wrote:
snip
Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra
cycles as long as the trap part isn't hit?
If you are going to do that, why not make it an optional prefix byte?
That way, no fusion needed, no extra cycles, yet the same amount of code space.
In the Ada case, the ability to declare array types with
programmer-chosen index types with bounded range, such as range-bounded
integers or enumerations, means that the compiler can avoid indexing
checks when the (sub)type of the index is known at compile time to fit
within the index range of the array.
On 9/10/2024 1:13 AM, Niklas Holsti wrote:
In the Ada case, the ability to declare array types with
programmer-chosen index types with bounded range, such as
range-bounded integers or enumerations, means that the compiler can
avoid indexing checks when the (sub)type of the index is known at
compile time to fit within the index range of the array.
I have always liked the idea of variable ranges able to be specified
in the language. Besides the advantages you mentioned, it provides
more human "comprehensibility" (if the ranges are reasonably named)
i.e. better internal documentation, and it makes responding to
specification changes required later in the program life cycle easier
and less error prone, i.e. if the range has to change, you change it
in one place and don't risk missing making the change in some obscure
part of the program you forgot about.
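C has no range types, but the idea can be approximated by hand. The following is a rough sketch, not Ada semantics: the names `DayIndex`, `day_index_make` and `rainfall_get` are invented for illustration, and unlike Ada nothing stops a C programmer from filling in the struct directly and bypassing the check.

```c
#include <stdbool.h>

/* The range is named and defined in exactly one place. */
enum { DAY_FIRST = 1, DAY_LAST = 31, DAY_COUNT = DAY_LAST - DAY_FIRST + 1 };

/* Hypothetical range-bounded type: by convention a DayIndex is only
   produced by day_index_make(), so its value lies in DAY_FIRST..DAY_LAST. */
typedef struct { int value; } DayIndex;

/* The single place where the range check happens. */
static bool day_index_make(int raw, DayIndex *out)
{
    if (raw < DAY_FIRST || raw > DAY_LAST)
        return false;
    out->value = raw;
    return true;
}

/* No bounds check here: the index type already carries the guarantee. */
static int rainfall_get(const int table[DAY_COUNT], DayIndex idx)
{
    return table[idx.value - DAY_FIRST];
}
```

If the range has to change, only the enum changes; the check and the array size follow from it.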
On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:
It is very efficient if those events are rarely or never supposed to
occur.
Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.
But it is not necessary for that bad mechanism to be necessary !!
Some of the things that minimize the "badness" of taking an exception::
a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.
I know of only 1 ISA with these properties....
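Points (c) through (e) can be mimicked in plain software to show the shape of such a handler. This is purely a simulation under made-up names (`ExcInfo`, `OverflowHandler`, `add_with_handler`); it does not describe the interface of any real ISA, and it assumes the GCC/Clang `__builtin_add_overflow` builtin.

```c
#include <stdint.h>

/* What the handler is handed: the excepting instruction and its operands. */
typedef struct {
    uint32_t inst_bits;   /* instruction encoding (placeholder value here) */
    int64_t  op1, op2;    /* operand values of the excepting instruction */
} ExcInfo;

/* The handler is just a subroutine: arguments in, replacement result out. */
typedef int64_t (*OverflowHandler)(const ExcInfo *info);

/* Example policy: saturate. On signed add overflow both operands have
   the same sign, so op1's sign tells which way the sum overflowed. */
static int64_t saturate_handler(const ExcInfo *info)
{
    return info->op1 < 0 ? INT64_MIN : INT64_MAX;
}

/* Simulated "add that may except": on overflow the handler's return
   value is delivered as the result, as in point (d). */
static int64_t add_with_handler(int64_t a, int64_t b, OverflowHandler h)
{
    int64_t sum;
    if (__builtin_add_overflow(a, b, &sum)) {
        ExcInfo info = { 0u, a, b };
        return h(&info);
    }
    return sum;
}
```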
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In article <[email protected]>,
Anton Ertl <[email protected]> wrote:
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?
You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.
Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.
That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating
the error checks.
Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.
I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.
But they didn't so it isn't.
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother having trap-on-overflow instructions.
You then went on to discuss how you want trap-on-overflow instructions
for stuff like C code, so you can detect code bugs and shut down gracefully.
And my response to that is we already know compilers don't use it. x86
has INTO, which is "trap if the overflow bit is set". So "ADD r8,r9; INTO" would trap if the add overflowed.
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and crash).
For x86-64 clang 19.1.0:
add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
But x86 has an instruction to trap on overflow directly: INTO. It's one byte.
And it doesn't use it.
GCC x86-64 14.2 is even worse:
add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret
It calls a routine to do all additions which might overflow, and that
routine calls assert() if an overflow occurs.
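For readers who have not looked inside such a helper, here is a sketch of what an out-of-line checked add in that style might look like. It is not libgcc's source; the names `add_would_overflow` and `add_or_abort` are invented, and the check is written with comparisons only so that it never executes an overflowing signed add.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* True when a + b would not fit in int. Uses only subtractions that
   cannot themselves overflow. */
static bool add_would_overflow(int a, int b)
{
    return (b > 0 && a > INT_MAX - b) ||
           (b < 0 && a < INT_MIN - b);
}

/* Hypothetical out-of-line checked add: crash on overflow, otherwise
   return the ordinary sum. */
static int add_or_abort(int a, int b)
{
    if (add_would_overflow(a, b))
        abort();
    return a + b;
}
```

Compared with the inline add/jo sequence, every addition pays for a call and return, which is the cost being complained about here.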
The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.
So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.
So why should any hardware include an instruction to trap-on-overflow?
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing against--
it is just a bad idea for the ISA to have ALU instructions take exceptions.
Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding
trap-on-overflow instructions is a waste of effort.
No they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.
"I have one example where overflow exceptions would be a poor implementation
choice" does not imply "therefore no one should have them as an option".
Can you share what language, compiler, and hardware you are using which implements overflow checks using a trap-on-overflow instruction?
Kent
On Wed, 25 Sep 2024 09:54:17 -0700
Stephen Fuld <[email protected]d> wrote:
On 9/10/2024 1:13 AM, Niklas Holsti wrote:
In the Ada case, the ability to declare array types with
programmer-chosen index types with bounded range, such as
range-bounded integers or enumerations, means that the compiler can
avoid indexing checks when the (sub)type of the index is known at
compile time to fit within the index range of the array.
I have always liked the idea of variable ranges able to be specified
in the language. Besides the advantages you mentioned, it provides
more human "comprehensibility" (if the ranges are reasonably named)
i.e. better internal documentation, and it makes responding to
specification changes required later in the program life cycle easier
and less error prone, i.e. if the range has to change, you change it
in one place and don't risk missing making the change in some obscure
part of the program you forgot about.
The problem here is that arrays with fixed bounds were common when
Ada was conceived back in the mid 1970s. On general-purpose (as opposed
to embedded) computers they were already much rarer when Ada was shipped
in 1983. By the late 1990s arrays with fixed bounds were the rare exception
rather than the rule.
Except, of course, for many types of embedded computers. But even that
is gradually changing. Very gradually.
MitchAlsup1 wrote:
But then, risc processors mostly, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned
memory access, MIPS and Alpha for software TLB-miss handling.
And suddenly the exceptional becomes the normal.
Then virtual machines come along using exceptions to trigger
trap-and-emulate code, and now the normal becomes frequent.
Kent Dickey wrote:
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and crash).
For x86-64 clang 19.1.0:
add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
But x86 has an instruction to trap on overflow directly: INTO. It's
one byte.
And it doesn't use it.
GCC x86-64 14.2 is even worse:
add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret
It calls a routine to do all additions which might overflow, and that
routine calls assert() if an overflow occurs.
The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.
So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.
You can only compile in INTO opcodes if you can guarantee that the INT 4 (INTO) trap vector will always be set to a proper handler, and since
that isn't part of the ABI, compilers can't depend on it?
I do agree that it would be nice if it did work, barring that clang is
doing the best possible alternative, at close to zero cost except for
the useless branch predictor table entry wastage.
Terje
MitchAlsup1 wrote:
On Sat, 21 Sep 2024 17:05:02 +0000, EricP wrote:
It is very efficient if those events are rarely or never supposed to
occur.
Many (most, nearly all) processor architectures have notoriously
bad exception delivery to a point of control that can deal with
the problem at hand.
But it is not necessary for that bad mechanism to be necessary !!
Some of the things that minimize the "badness" of taking an exception::
a) deliver control to user signal handler without taking an
excursion through GuestOS. (think 10 cycles)
b) when control arrives, receiving thread is already reentrant.
c) when control arrives, the instruction (bits) and its operand
values are delivered to the exception handler. So, the exception
handler has what it needs to deal with the problem at hand.
d) when control returns, the result (R0) is delivered back to the
destination register.
e) (b, c, d) are performed without handler needing to understand
how. Handler is just a subroutine that receives arguments (c)
fixes the problem, and returns a non-excepting value, or abort.
f) return has a way to re-execute the instruction or to skip the
instruction under control of handler without having access
to excepting-IP and without knowing the length of the
instruction.
g) during (a..f) nobody ever has to disable interrupts or
exceptions or re-enable them later. Priority and privilege
are inherited automatically from excepting thread.
I know of only 1 ISA with these properties....
It all depends on the frequency that exceptions occur.
It used to be that Page Fault was the only one that occurred with any frequency, and the code path for the page fault handler was long enough
that any HW overhead was lost in the noise. In all other cases they
indicated a fatal error so the HW cost was the least of your problems.
But then, risc processors mostly, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned
memory access, MIPS and Alpha for software TLB-miss handling.
And suddenly the exceptional becomes the normal.
The solution for Alpha was to add back the byte and word instructions,
and add misaligned access support to all memory ops.
Sparc stuck with traps for register windows.
No one else used software managed TLB's.
Then virtual machines come along using exceptions to trigger
trap-and-emulate code, and now the normal becomes frequent.
Not 1 or 10 exceptions per second, but 100,000 or 200,000.
The solution for VM's is to add the ISA features necessary so that
most exceptions are rare, and when they do happen they are cheap.
Worst case it should cost the same as a branch mispredict pipeline
drain.
Terje Mathisen wrote:
You can only compile in INTO opcodes if you can guarantee that the INT 4
(INTO) trap vector will always be set to a proper handler, and since
that isn't part of the ABI, compilers can't depend on it?
I do agree that it would be nice if it did work, barring that clang is
doing the best possible alternative, at close to zero cost except for
the useless branch predictor table entry wastage.
Terje
On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.
On Wed, 25 Sep 2024 17:07:45 +0000, Michael S wrote:
On Wed, 25 Sep 2024 09:54:17 -0700
Stephen Fuld <[email protected]d> wrote:
On 9/10/2024 1:13 AM, Niklas Holsti wrote:
In the Ada case, the ability to declare array types with
programmer-chosen index types with bounded range, such as
range-bounded integers or enumerations, means that the compiler can
avoid indexing checks when the (sub)type of the index is known at
compile time to fit within the index range of the array.
I have always liked the idea of variable ranges able to be specified
in the language. Besides the advantages you mentioned, it provides
more human "comprehensibility" (if the ranges are reasonably named)
i.e. better internal documentation, and it makes responding to
specification changes required later in the program life cycle easier
and less error prone, i.e. if the range has to change, you change it
in one place and don't risk missing making the change in some obscure
part of the program you forgot about.
The problem here is that arrays with fixed bounds were common when
Ada was conceived back in the mid 1970s. On general-purpose (as opposed
to embedded) computers they were already much rarer when Ada was shipped
in 1983. By the late 1990s arrays with fixed bounds were the rare exception
rather than the rule.
It sounds like variable ranges (array indexes) would be becoming more
common, also.
Where "variable range" is a variable that is defined to have a
specified range, but from run to run the upper and lower bounds
can be modified without re-compilation.
Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Well, there is a bunch of things to unpack here.
First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
that opcode to be for other instructions. On x64 the JO (jump overflow) instruction does overflow detection.
The reason AMD could reassign INTO was because it wasn't being used by
C/C++.
But this is a side effect of C's widespread use, not the cause.
Programmers write in C because it is widely used and supported,
and as a consequence of that choice they get unchecked arithmetic.
But they are not choosing C to get unchecked arithmetic.
Had these same usage tests been done on other languages the results
would likely be quite different.
Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes respectively. In JO case it has to branch to a ThrowOverflow () call
so that's 5 more bytes per ADD or SUB if you want error traceability.
With overflow trapping instructions there is NO runtime or code size
cost.
Third, on many risc ISA like RISC-V there are no flags so no JO instruction
is even possible. Either they must use the branchless overflow idiom or the
branching version, adding more to the cost of error detection.
*OR*its has an Add Fault Overflow instruction which has NO RUNTIME COST
ADDFO rd = rs1 + rs2
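For anyone unfamiliar with it, one common form of the branchless signed-overflow idiom looks like the following in C. This is a sketch with an invented function name; a compiler for a flagless ISA would emit some equivalent sequence of xor/and/shift instructions after the add.

```c
#include <stdint.h>

/* Returns 1 if a + b overflows int64_t, else 0. *sum receives the
   wrapped (modulo 2^64) result either way. */
static int add_overflows_branchless(int64_t a, int64_t b, int64_t *sum)
{
    uint64_t ua = (uint64_t)a;
    uint64_t ub = (uint64_t)b;
    uint64_t us = ua + ub;   /* unsigned add: wraps, never undefined */

    /* Conversion back to signed is implementation-defined in C, but
       wraps on the usual two's complement compilers. */
    *sum = (int64_t)us;

    /* Overflow iff the sum's sign differs from the sign of BOTH
       operands, i.e. bit 63 of (a ^ s) & (b ^ s) is set. */
    return (int)(((ua ^ us) & (ub ^ us)) >> 63);
}
```

That is three or four extra ALU operations per add, plus a branch or trap on the result, which is the cost being compared against a faulting add.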
Fourth, it sounds like what you want is a risc (no flags) ADD instruction
that returns both a result and an overflow flag so you can do the
equivalent of the x64 JO branch test.
ADDO (ro,rd) = rs1 + rs2
where rd is dest and ro is a register to receive a 0/1 overflow flag.
Once one allows multiple dest registers ADDO is trivial to support.
But that does not invalidate the usefulness of ADDFO.
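As a software model of what such a two-result ADDO would compute (32-bit here so the exact sum fits in a wider type), consider the sketch below. The struct and function names are invented; only the rd/ro/rs1/rs2 naming follows the notation above.

```c
#include <stdint.h>

/* Model of the two destination registers of the proposed ADDO. */
typedef struct {
    int32_t rd;   /* wrapped 32-bit sum */
    int32_t ro;   /* 1 on signed overflow, else 0 */
} AddoResult;

static AddoResult addo32(int32_t rs1, int32_t rs2)
{
    /* Exact sum in 64 bits: cannot overflow for 32-bit inputs. */
    int64_t wide = (int64_t)rs1 + (int64_t)rs2;
    AddoResult r;
    /* Truncate via unsigned; the final conversion to int32_t is
       implementation-defined but wraps on common compilers. */
    r.rd = (int32_t)(uint32_t)wide;
    r.ro = (wide < INT32_MIN || wide > INT32_MAX);
    return r;
}
```

Software then tests ro with an ordinary branch, which is the flagless equivalent of the JO test.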
I would also have ADDFC Add Fault Carry for unsigned overflow,
plus other instructions for checking signed overflow and unsigned carry.
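The unsigned case is cheaper to check by hand than the signed one, since a carry out shows up as a sum smaller than an operand. A minimal sketch, with an invented function name:

```c
#include <stdint.h>

/* Returns 1 if a + b carries out of 64 bits, else 0. *sum receives the
   wrapped result. */
static int add_carries(uint64_t a, uint64_t b, uint64_t *sum)
{
    *sum = a + b;      /* unsigned arithmetic wraps modulo 2^64 */
    return *sum < a;   /* wrapped iff the result is below an operand */
}
```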
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and crash).
For x86-64 clang 19.1.0:
add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
Yes, this is all for x64 which has no INTO instruction.
So why should any hardware include an instruction to trap-on-overflow?
Because ALL the negative speed and code size consequences do not occur.
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing against--
it is just a bad idea for the ISA to have ALU instructions take exceptions.
Not really. It's a flag in the uOp indicating HasException and a union
of fields to hold exception status and RIP, all of which needs to be
there for other instructions like load/store.
Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding
trap-on-overflow instructions is a waste of effort.
No they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.
"I have one example where overflow exceptions would be a poor implementation
choice" does not imply "therefore no one should have them as an option".
Can you share what language, compiler, and hardware you are using which
implements overflow checks using a trap-on-overflow instruction?
Kent
On DEC VAX the Overflow Enable flag was in the Program Status Word.
IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
Ada, Cobol, and disabled by default for C. But it could be toggled with
a runtime library call.
For a variety of reasons having Overflow Enable in the status register
is A Bad Idea.
On Alpha it was a compile switch which selects different instructions ADD vs ADDV, and also controlled by pragmas.
If you wanted to manually test for overflow then you used
one of the idioms, whatever language you worked in.
On Tue, 24 Sep 2024 17:06:27 +0000, Stephen Fuld wrote:
On 9/23/2024 11:02 PM, Terje Mathisen wrote:
snip
Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra
cycles as long as the trap part isn't hit?
If you are going to do that, why not make it an optional prefix byte?
That way, no fusion needed, no extra cycles, yet the same amount of code
space.
Realistically, what is the difference if INTO is a prefix
byte or a postfix byte ?
On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:
Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Well, there is a bunch of things to unpack here.
First, INTO is a 32-bit x86 instruction. On 64-bit x64 AMD reassigned
that opcode to be for other instructions. On x64 the JO (jump overflow)
instruction does overflow detection.
The reason AMD could reassign INTO was because it wasn't being used by
C/C++.
But this is a side effect of C's widespread use, not the cause.
Programmers write in C because it is widely used and supported,
Free compilers
and as a consequence of that choice they get unchecked arithmetic.
But they are not choosing C to get unchecked arithmetic.
Had these same usage tests been done on other languages the results
would likely be quite different.
Second, on x86 the INTO and on x64 a JO offset32 take up 1 and 5 bytes
respectively. In JO case it has to branch to a ThrowOverflow () call
so that's 5 more bytes per ADD or SUB if you want error traceability.
With overflow trapping instructions there is NO runtime or code size
cost.
Third, on many risc ISA like RISC-V there are no flags so no JO instruction
is even possible. Either they must use the branchless overflow idiom or the
branching version, adding more to the cost of error detection.
*OR*its has an Add Fault Overflow instruction which has NO RUNTIME COST
ADDFO rd = rs1 + rs2
*OR"its ??? can you translate that into comp.arch language?
Fourth, it sounds like what you want is a risc (no flags) ADD instruction
that returns both a result and an overflow flag so you can do the
equivalent of the x64 JO branch test.
ADDO (ro,rd) = rs1 + rs2
where rd is dest and ro is a register to receive a 0/1 overflow flag.
Once one allows multiple dest registers ADDO is trivial to support.
But that does not invalidate the usefulness of ADDFO.
I would also have ADDFC Add Fault Carry for unsigned overflow,
Which will be used two orders of magnitude less than ADDFO. First
because unsigned is used less often than signed, secondly much/most
unsigned arithmetic is specified to wrap rather than check.
plus other instructions for checking signed overflow and unsigned carry.
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and crash).
For x86-64 clang 19.1.0:
add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
Yes, this is all for x64 which has no INTO instruction.
s/x64/x86-64/g
It is still an x86 with all the benefits and detriments.
<snip>
So why should any hardware include an instruction to trap-on-overflow?
Because ALL the negative speed and code size consequences do not occur.
No because an EFFICIENT trap-on-overflow has no performance consequences
when no overflow is created. Efficient means 10-20 cycles to arrive at exception handler--already in a reentrant state with exceptions and interrupts still enabled. Just because x86 is so horrible in this regard
does not mean every architecture has to be at least that bad.
Trap-on-overflow instructions have a hardware cost, of varying severity.
If the ISA isn't already trapping on ALU instructions (such as
divide-by-0), it adds a new class of operations which can take
exceptions. An ALU functional unit that cannot take exceptions doesn't
have to save "unwinding" info (at minimum, info to recover the PC, and
possibly rollback state), and not needing this can be a nice
simplification. Branches and LD/ST always need this info, but not
needing it on ALU ops can be a nice simplification of logic, and makes
it easier to have multiple ALU functional units. Note that x86 INTO can
be treated as a branch, so it doesn't have the cost of an instruction
like "ADDTO r1,r2,r3" which is a normal ADD but where the ADD itself
traps if it overflows. ADDTO is particularly what I am arguing against--
it is just a bad idea for the ISA to have ALU instructions take exceptions.
Not really. It's a flag in the uOp indicating HasException and a union
of fields to hold exception status and RIP, all of which needs to be
there for other instructions like load/store.
Agreed, the overhead of recording "Overflow" and whether to do something
about it is so small that other considerations sway the arguments.
Instruction sets which make detecting overflow difficult (say, RISC-V),
would do well to make branch-on-overflow efficient and easy. But adding
trap-on-overflow instructions is a waste of effort.
No they are a very useful tool for those who need such a tool
because the manual alternative is significantly more expensive
for both size and performance.
"I have one example where overflow exceptions would be a poor implementation
choice" does not imply "therefore no one should have them as an option".
Can you share what language, compiler, and hardware you are using which
implements overflow checks using a trap-on-overflow instruction?
Kent
On DEC VAX the Overflow Enable flag was in the Program Status Word.
On My 66000 Overflow enable bit is part of the thread-status-line.
IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
Ada, Cobol, and disabled by default for C. But it could be toggled with
a runtime library call.
Similar--but library call does not have to "gain privilege" to flip the
bit's state.
For a variety of reasons having Overflow Enable in the status register
is A Bad Idea.
Can you expand? It seems to me that if the unprivileged application, using
the instructions at hand (Header Register instruction), can access
and write those exception control bits without needing privilege,
then most of the "A Bad Idea™" objections disappear. At the same time there
are significant amounts of state that do require privilege to
access in thread-status-line, and HR obeys such a distinction.
On Alpha it was a compile switch which selects different
instructions ADD vs ADDV, and also controlled by pragmas.
If you wanted to manually test for overflow then you used
one of the idioms, whatever language you worked in.
MitchAlsup1 wrote:
On Wed, 25 Sep 2024 16:54:18 +0000, EricP wrote:
IIRC it was enabled by default in all DEC languages, Fortran77, Pascal,
Ada, Cobol, and disabled by default for C. But it could be toggled with
a runtime library call.
Similar--but library call does not have to "gain privilege" to flip the
bit's state.
The library routine didn't need a privilege change.
The problem is mostly due to the fact that expressions are
*mixtures of signed and unsigned, checked and modulo arithmetic*.
If overflow checks are enabled by status flag then the program has to
keep switching between modes for individual arithmetic operations.
This leads to a slew of enable and disable instructions which could
be serializing.
This is because array index value expressions are calculated using signed,
checked arithmetic, then the result is range checked against the bounds.
However addresses are calculated using modulo arithmetic.
Since most OS define address 0 to be the start of user space,
and locate the OS at FF...FF, addresses are unsigned modulo numbers.
If checks are not disabled for the address calculation
and the address not calculated using modulo arithmetic,
it is easy to trigger false overflow exceptions with arrays
that do not have base-0 or base-1 array bounds as many compilers
use bias-base array buffer pointers.
This is why one wants separate instructions for ADD and ADDV - there is
no overhead to switching between modulo and checked linear arithmetic.
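To make the bias-base case concrete, here is a small simulation with addresses modeled as 32-bit unsigned numbers. All addresses and bounds are made up, the function names are invented, and the GCC/Clang `__builtin_add_overflow` builtin stands in for a checked add. It shows an address calculation that is correct modulo 2^32 yet would fault if done with checked unsigned arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ELEM_SIZE = 4 };   /* bytes per array element (assumed) */

/* Biased base: the address element 0 would have if it existed.
   For a buffer near address 0 and a large lower bound this wraps. */
static uint32_t biased_base(uint32_t buf_addr, uint32_t lower_bound)
{
    return buf_addr - lower_bound * ELEM_SIZE;
}

/* Element address computed with modulo arithmetic. *carried reports
   whether a checked unsigned add would have faulted on this step. */
static uint32_t elem_addr(uint32_t biased, uint32_t index, bool *carried)
{
    uint32_t addr;
    *carried = __builtin_add_overflow(biased, index * ELEM_SIZE, &addr);
    return addr;
}
```

With a buffer at 0x100 and bounds 1000..1009, the biased base is 0xFFFFF160, and indexing element 1000 gives back 0x100 with a carry out. The index itself was already range checked, so the carry is a false alarm.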
Second, if there is a control register, it becomes part of the ABI.
It can either be
- undefined on calls, in which case each routine must save the current
flags state on *each* entry and set a value, and restore the original
state on return,
- or defined to have a particular enable/disable value on call and
callees are required to toggle it if needed but restore it to default
for all calls and returns.
Third, there is no reason to have overflow as a dynamic enable/disable because the kind of arithmetic, modulo or linear, is fixed by what the programmer writes and does not change dynamically.
Dynamic overflow enable results in a continuous overhead managing it
which does not occur with explicit fault testing instructions.
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than MSVC,
the money may not matter to your business, but the quality of the product will.
On Wed, 25 Sep 2024 12:54:18 -0400, EricP
<[email protected]> wrote:
For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.
GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.
Things like that are why some companies have a code policy that allows
just one function per file.
Still a problem if you need <whatever the relevant flag does> only in
one or a few places.
On 28/09/2024 01:52, George Neuner wrote:
On Wed, 25 Sep 2024 12:54:18 -0400, EricP
<[email protected]> wrote:
For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.
GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
:
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.
:
Changing these options does have some limitations, such as disabling
inlining into functions with different options. But you can happily
apply it to only some functions in a translation unit.
Things like that are why some companies have a code policy that allows
just one function per file.
I have never heard of such a policy, and I think it would be an
extremely silly one - code would be completely unmanageable, and the
results would be significantly poorer when using modern compilers (i.e.,
anything this century).
Still a problem if you need <whatever the relevant flag does> only in
one or a few places.
There are gcc flags that are only controllable for compiler invocations,
rather than with pragmas or attributes, and of course not every compiler
has the flexibility of gcc or clang. But this is not nearly the level
you seem to think it is.
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.
GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.
Plus come with necessary API headers, various link libraries and DLL's, supporting applications, documentation.
You know... what a product looks like.
EricP <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.
GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.
Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Terje Mathisen wrote:
Kent Dickey wrote:
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and
crash).
For x86-64 clang 19.1.0:
add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
But x86 has an instruction to trap on overflow directly: INTO.
It's one byte.
And it doesn't use it.
GCC x86-64 14.2 is even worse:
add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret
It calls a routine to do all additions which might overflow, and
that routine calls abort() if an overflow occurs.
The CPU has a trap-on-overflow instruction exactly for this case
(to crash on detecting an overflow), and compilers don't even use it.
So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.
You can only compile in INTO opcodes if you can guarantee that the
INT 4 (INTO) trap vector will always be set to a proper handler,
and since that isn't part of the ABI, compilers can't depend on it?
I do agree that it would be nice if it did work, barring that clang
is doing the best possible alternative, at close to zero cost
except for the useless branch predictor table entry wastage.
Terje
On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX
prefix.
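For comparison, the add-then-JO pattern can also be requested explicitly from C. The sketch below assumes GCC or clang, since __builtin_add_overflow is a compiler builtin rather than standard C and is not available in MSVC; on x86-64 it typically compiles to an add followed by a jump-on-overflow, though that depends on compiler and flags. The function names are illustrative only.

```c
#include <stdlib.h>

/* Assumes GCC or clang: __builtin_add_overflow is a compiler builtin.
   Returns nonzero if a + b overflowed; *sum receives the wrapped result. */
int add_overflows(int a, int b, int *sum)
{
    return __builtin_add_overflow(a, b, sum);
}

/* Illustrative wrapper: stop the program on overflow, in the spirit of
   what -ftrapv does, but only at the call sites that use it. */
int add_or_abort(int a, int b)
{
    int sum;
    if (add_overflows(a, b, &sum)) {
        abort();
    }
    return sum;
}
```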
EricP <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the
WinNT 3.5 beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.
GCC is a compiler collection not an integrated development kit for
Windows. I have no knowledge of what state GCC was in in 1992 but
it likely did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned
integers, and most important: integration with the GUI source code
debugger.
Plus come with necessary API headers, various link libraries and
DLL's, supporting applications, documentation.
You know... what a product looks like.
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.
I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.
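A quick way to see what a given compiler actually provides is to ask it. The sketch below is only a probe, not a fix: it assumes FLT_RADIX is 2 (so LDBL_MANT_DIG counts bits) and relies on the __SIZEOF_INT128__ macro, which is a GCC/clang convention. The function names are invented for illustration.

```c
#include <float.h>
#include <stdio.h>

/* Returns 1 if the compiler advertises a 128-bit integer type through the
   __SIZEOF_INT128__ macro (a GCC/clang convention), otherwise 0. */
int has_int128(void)
{
#ifdef __SIZEOF_INT128__
    return 1;
#else
    return 0;
#endif
}

/* Significand width of long double, assuming FLT_RADIX == 2:
   53 matches double, 64 is x87 extended, 113 is IEEE binary128. */
int long_double_mant_bits(void)
{
    return LDBL_MANT_DIG;
}

/* Print a one-line summary of what this compiler provides. */
void report_types(void)
{
    printf("sizeof(long double) = %u, significand bits = %d, __int128: %s\n",
           (unsigned)sizeof(long double), long_double_mant_bits(),
           has_int128() ? "yes" : "no");
}
```

Calling report_types() from main under each candidate compiler shows at a glance whether long double is wider than double there.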
Tim Rentsch <[email protected]> schrieb:
EricP <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.
GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.
Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.
Even gdb works.
On 29/09/2024 09:15, Thomas Koenig wrote:
Tim Rentsch <[email protected]> schrieb:
EricP <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.
GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.
Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.
Personally, I prefer msys2 because it gives you access to the usual *nix
tools like make - and does so far better than Cygwin. (Here "better"
means more native-like file access, and more efficient usage.) And you
don't get the DLL hell of Cygwin.
David Brown <[email protected]> schrieb:
On 29/09/2024 09:15, Thomas Koenig wrote:
Tim Rentsch <[email protected]> schrieb:
EricP <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.
GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.
Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.
Personally, I prefer msys2 because it gives you access to the usual *nix
tools like make - and does so far better than Cygwin. (Here "better"
means more native-like file access, and more efficient usage.) And you
don't get the DLL hell of Cygwin.
Just one remark - I was referring to running the mingw compiler
under Cygwin, for which you don't get the DLL issues.
EricP <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality of
the product will.
GCC is a compiler collection not an integrated development kit for Windows.
I have no knowledge of what state GCC was in in 1992 but it likely
did not support the MS enhancements for Win32 programming:
structured exception handling, various ABI's, inline assembler,
defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned integers,
and most important: integration with the GUI source code debugger.
Plus come with necessary API headers, various link libraries and DLL's,
supporting applications, documentation.
You know... what a product looks like.
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.
I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.
Tim Rentsch wrote:
EricP <[email protected]> writes:
Lawrence D'Oliveiro wrote:
On Thu, 26 Sep 2024 13:13:02 -0400, EricP wrote:
I've always paid for mine. My first C compiler came with the
WinNT 3.5 beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
Given that GCC offers more features and generates better code than
MSVC, the money may not matter to your business, but the quality
of the product will.
GCC is a compiler collection not an integrated development kit for
Windows. I have no knowledge of what state GCC was in in 1992 but
it likely did not support the MS enhancements for Win32
programming: structured exception handling, various ABI's, inline
assembler, defined behavior for some of C's undefined behavior,
later first-class-type support for 64-bit signed and unsigned
integers, and most important: integration with the GUI source
code debugger.
Plus come with necessary API headers, various link libraries and
DLL's, supporting applications, documentation.
You know... what a product looks like.
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed something.
I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.
I seem to remember finding something like __int128_t and __uint128_t
inside MSVC?
And that by casting uint64_t parameters to the u128 variant, the
compiler would generate the obvious MUL RDI and save RDX:RAX as the
128-bit result:
uint128_t mulw(uint64_t a, uint64_t b)
{
return (uint128_t) a * (uint128_t) b;
}
I.e. no subroutine call/zero overhead.
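Since the type names above are from memory, here is a hedged sketch of the same widening multiply that does not rely on MSVC having a 128-bit integer type. It assumes three cases: compilers that define __SIZEOF_INT128__ (GCC/clang) use unsigned __int128; MSVC targeting x64 uses the _umul128 intrinsic from <intrin.h>; anything else falls back to 32-bit pieces. The names u128_parts and mulw_parts are invented for illustration.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   /* _umul128 */
#endif

/* Hypothetical helper type: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_parts;

/* 64 x 64 -> 128-bit unsigned multiply. */
u128_parts mulw_parts(uint64_t a, uint64_t b)
{
    u128_parts r;
#if defined(__SIZEOF_INT128__)
    /* GCC/clang: use the built-in 128-bit type. */
    unsigned __int128 p = (unsigned __int128)a * b;
    r.lo = (uint64_t)p;
    r.hi = (uint64_t)(p >> 64);
#elif defined(_MSC_VER) && defined(_M_X64)
    /* MSVC x64: the intrinsic returns the low half, stores the high half. */
    r.lo = _umul128(a, b, &r.hi);
#else
    /* Portable fallback: schoolbook multiply on 32-bit halves. */
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
#endif
    return r;
}
```

Returning the two halves in a struct keeps the interface free of any 128-bit type, so the same prototype works under every compiler.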
OTOH, getting optimal wide integer accumulators is a bit harder,
needing compiler intrinsics to access the widening add with carry
opcodes (ADCX/ADOX).
Terje
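To make the accumulator point concrete: in portable C the carry has to be reconstructed with a comparison, and whether the compiler turns that into an ADD/ADC pair depends on the compiler and flags. The sketch below shows only the portable form; the acc128 type and function name are invented here. On x86-64, intrinsics such as _addcarry_u64 exist to express the carry chain directly.

```c
#include <stdint.h>

/* Hypothetical 128-bit accumulator kept as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } acc128;

/* Add a 64-bit value into the accumulator, propagating the carry by hand. */
void acc128_add(acc128 *acc, uint64_t x)
{
    acc->lo += x;
    acc->hi += (acc->lo < x);   /* unsigned wraparound means a carry occurred */
}
```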
Tim Rentsch <[email protected]> schrieb:[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Personally, I like Cygwin best because it gives you access to the
usual UNIX tools like make or emacs, and you can immediately run
the executable. I just add -static-libgfortran for Fortran code
to avoid the hassle of distributing a DLL with it.
Even gdb works.
On Sat, 28 Sep 2024 23:59:23 -0700
Tim Rentsch <[email protected]> wrote:
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.
I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.
Both of your problems have no [MSVC] solution right now.
In case of 128-bit integer, there is a chance that MSVC will support
it in the future.
In case of 80-bit long double, there is no chance. If MSVC ever
supports binary floating point wider than 64-bit on x86-64 platform
then it would be IEEE binary128 implemented in software. But even
then they would not use name 'long double' for a new type, because it
would break existing programs.
But if all you want is the program running on Windows, then the
solution is easy - use a different compiler.
MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
After you have msys2 installed do
pacman -Sy
pacman -S mingw-w64-ucrt-x86_64-gcc
Several hundreds of MB more and you have gcc14
Possibly make has to be installed separately, i.e.
pacman -S make.
Tim Rentsch wrote:[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.
I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.
I seem to remember finding something like __int128_t and __uint128_t
inside MSVC?
And that by casting uint64_t parameters to the u128 variant, the
compiler would generate the obvious MUL RDI and save RDX:RAX as the
128-bit result:
uint128_t mulw(uint64_t a, uint64_t b)
{
return (uint128_t) a * (uint128_t) b;
}
I.e. no subroutine call/zero overhead.
OTOH, getting optimal wide integer accumulators is a bit harder,
needing compiler intrinsics to access the widening add with carry
opcodes (ADCX/ADOX).
Michael S <[email protected]> writes:
On Sat, 28 Sep 2024 23:59:23 -0700
Tim Rentsch <[email protected]> wrote:
[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.
I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.
Both of your problems have no [MSVC] solution right now.
In case of 128-bit integer, there is a chance that MSVC will support
it in the future.
In case of 80-bit long double, there is no chance. If MSVC ever
supports binary floating point wider than 64-bit on x86-64 platform
then it would be IEEE binary128 implemented in software. But even
then they would not use name 'long double' for a new type, because
it would break existing programs.
Thank you, this is helpful.
But if all you want is the program running on Windows, then the
solution is easy - use a different compiler.
MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
After you have msys2 installed do
pacman -Sy
pacman -S mingw-w64-ucrt-x86_64-gcc
Several hundreds of MB more and you have gcc14
Possibly make has to be installed separately, i.e.
pacman -S make.
Do I understand this right, that msys2 is to be installed
on Windows? And that the pacman commands are to be run
within msys2 on the MS Windows system?
On Sun, 29 Sep 2024 09:39:27 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Sat, 28 Sep 2024 23:59:23 -0700
Tim Rentsch <[email protected]> wrote:
[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Are there any MSVC folks here who can help with these problems?
I am not an MSVC expert by any means and easily could have missed
something.
I should mention that the code is written in C, not C++, and that
is not something I am at liberty to change.
Both of your problems have no [MSVC] solution right now.
In case of 128-bit integer, there is a chance that MSVC will support
it in the future.
In case of 80-bit long double, there is no chance. If MSVC ever
supports binary floating point wider than 64-bit on x86-64 platform
then it would be IEEE binary128 implemented in software. But even
then they would not use name 'long double' for a new type, because
it would break existing programs.
Thank you, this is helpful.
But if all you want is the program running on Windows, then the
solution is easy - use a different compiler.
MSYS2 is just a couple of clicks (and ~0.8 GB :( ) away.
After you have msys2 installed do
pacman -Sy
pacman -S mingw-w64-ucrt-x86_64-gcc
Several hundreds of MB more and you have gcc14
Possibly make has to be installed separately, i.e.
pacman -S make.
Do I understand this right, that msys2 is to be installed
on Windows? And that the pacman commands are to be run
within msys2 on the MS Windows system?
Yes and yes.
pacman has to be run from msys2 terminal window (bash).
Thomas Koenig <[email protected]> writes:
Tim Rentsch <[email protected]> schrieb:[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?
Also do you know if mingw
is compatible with MSVC, as long as long double is not used?
Are you still doing your programming
to 32-bit APIs? Isn't there a _Win64_ yet?
In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:
Are you still doing your programming
to 32-bit APIs? Isn't there a _Win64_ yet?
"Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.
Tim Rentsch <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Tim Rentsch <[email protected]> schrieb:[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?
One is a fork of the other, I believe.
Also do you know if mingw
is compatible with MSVC, as long as long double is not used?
I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.
In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:
Are you still doing your programming to 32-bit APIs? Isn't there a
_Win64_ yet?
"Win32" covers both 32-bit and 64-bit APIs. The reasons for this silly nomenclature are complicated and lie deep in the past.
On Sun, 29 Sep 2024 20:51 +0100 (BST), John Dallman wrote:
In article <vd9udm$1dgsp$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:
Are you still doing your programming to 32-bit APIs? Isn't there a
_Win64_ yet?
"Win32" covers both 32-bit and 64-bit APIs. The reasons for this
silly nomenclature are complicated and lie deep in the past.
Also the fact that those “64-bit” APIs are not entirely “64-bit” ...
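On the naming point, the same quirk shows up in the predefined macros: compilers targeting Windows define _WIN32 for both 32-bit and 64-bit builds, and _WIN64 only for 64-bit ones. A small sketch (the function name is made up) that reports which case applies:

```c
/* Returns 0 when not targeting Windows, otherwise 32 or 64. */
int windows_target_bits(void)
{
#if defined(_WIN64)
    return 64;   /* 64-bit Windows: _WIN32 is defined here as well */
#elif defined(_WIN32)
    return 32;   /* 32-bit Windows only */
#else
    return 0;    /* not a Windows target */
#endif
}
```

The order of the tests matters: checking _WIN32 first would report 32 on a 64-bit build.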
On Sun, 29 Sep 2024 19:30:08 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Tim Rentsch <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Tim Rentsch <[email protected]> schrieb:
[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?
One is a fork of the other, I believe.
Also do you know if mingw
is compatible with MSVC, as long as long double is not used?
I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.
mingw64 tools are mostly compatible with Windows x64 ABI, but long
double is an exception.
Tim Rentsch <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Tim Rentsch <[email protected]> schrieb:[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?
One is a fork of the other, I believe.
Also do you know if mingw
is compatible with MSVC, as long as long double is not used?
I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.
On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
Also the fact that those _64-bit_ APIs are not entirely _64-bit_
They are entirely 64-bit. Every user-supplied buffer can be
anywhere in user's address space. Possibly you are confusing
Windows with VMS.
On Wed, 25 Sep 2024 13:56:40 -0400
EricP <[email protected]> wrote:
Terje Mathisen wrote:
Kent Dickey wrote:
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument to detect signed overflows and
crash).
For x86-64 clang 19.1.0:
add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
But x86 has an instruction to trap on overflow directly: INTO.
It's one byte.
And it doesn't use it.
GCC x86-64 14.2 is even worse:
add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret
It calls a routine to do all additions which might overflow, and
that routine calls abort() if an overflow occurs.
The CPU has a trap-on-overflow instruction exactly for this case
(to crash on detecting an overflow), and compilers don't even use it.
So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.
You can only compile in INTO opcodes if you can guarantee that the
INT 4 (INTO) trap vector will always be set to a proper handler,
and since that isn't part of the ABI, compilers can't depend on it?
I do agree that it would be nice if it did work, barring that clang
is doing the best possible alternative, at close to zero cost
except for the useless branch predictor table entry wastage.
Terje
On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX
prefix.
The single-byte form of INTO was reassigned. The two-byte form (CD 04)
is still there.
On 29/09/2024 21:30, Thomas Koenig wrote:
Tim Rentsch <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Tim Rentsch <[email protected]> schrieb:[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?
One is a fork of the other, I believe.
mingw-w64 was started as a fork of mingw, initially created to
support generating 64-bit binaries and because of disagreements with
the pace of development in mingw.
Also do you know if mingw
is compatible with MSVC, as long as long double is not used?
I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.
There is a reasonably defined ABI for 64-bit Windows, so I think
there will be compatibility for most things in C.
C++ is more
complicated and much more likely to have incompatibilities.
There are approximately a hundred and one different C ABI's and
calling conventions for 32-bit Windows, since MS never actually
defined one, so things are a bit of a mess there. (DLL calling
conventions are clearer.)
I believe the two most popular ways of running "Linux-like" software
and gcc on Windows are using WSL (which is more of a virtualisation
layer),
and mingw-64 for the compiler target (with either gcc or
clang) and msys2 as an environment and source of *nix utilities and
libraries. mingw/msys is considered old and limited (32-bit only),
while Cygwin is considered slow and clunky by many.
At least, that is my understanding.
On Mon, 30 Sep 2024 14:07:47 +0200
David Brown <[email protected]> wrote:
On 29/09/2024 21:30, Thomas Koenig wrote:
Tim Rentsch <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Tim Rentsch <[email protected]> schrieb:[...]
I am currently in the position of needing to take some code
written for Linux/Unix and get it running in MS Windows.
My attempts to use MSVC have been frustrating, because of some
limitations of that environment. The two most prominent are
these: long double is only 64 bits, and there are no integer
types of 128 bits that I could find.
Depending on what you need to do, you can give MinGW-w64 a try.
It works either as a cross-compiler from Linux or on Windows using
msys2 or Cygwin.
Thank you for these suggestions. I have started to explore
mingw but not yet the others. Is there a difference between
mingw and mingw-w64, do you know?
One is a fork of the other, I believe.
mingw-w64 was started as a fork of mingw, initially created to
support generating 64-bit binaries and because of disagreements with
the pace of development in mingw.
Also do you know if mingw
is compatible with MSVC, as long as long double is not used?
I believe that Mingw-w64 uses the Windows ABI, but that is a
belief, not something I know first-hand; I haven't looked
at the assembly.
There is a reasonably defined ABI for 64-bit Windows, so I think
there will be compatibility for most things in C.
For "most things" - yes. For 'long double' - no.
In case of 'long double' mingw64 tools use their own conventions that
differ both from SysV and from Win64.
But at C level behavior is
identical to x86-64 Linux.
C++ is more
complicated and much more likely to have incompatibilities.
There are approximately a hundred and one different C ABIs and
calling conventions for 32-bit Windows, since MS never actually
defined one, so things are a bit of a mess there. (DLL calling
conventions are clearer.)
I believe the two most popular ways of running "Linux-like" software
and gcc on Windows are using WSL (which is more of a virtualisation
layer),
WSL (now often referred to as WSL1) is not a virtualization layer.
WSL2 is indeed a Linux running in virtual machine +
integration features for convenience.
WSL1 is the worst possible place to run Linux programs that depend on
long double having higher precision. That's because when the WSL1 kernel
starts a new process it sets the precision of the x87 co-processor to
52 bits, which is different from the default settings on just about any
other x86-64 Linux. Of course, the process can change the settings, but
for that the programmer would have to be aware that the problem exists,
which is rare.
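A program that depends on extended precision can at least detect the situation described above at startup. The following is a minimal sketch, not anything WSL or glibc provides: the function names are made up for illustration, and it assumes only that LDBL_EPSILON from <float.h> describes the precision the compiler expects long double to have.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* True if long double arithmetic delivers the full LDBL_MANT_DIG bits at
   run time. The volatile qualifiers keep the compiler from folding the
   sum at compile time, so the addition really runs on the FPU. */
bool long_double_precision_ok(void)
{
    volatile long double one = 1.0L;
    volatile long double eps = LDBL_EPSILON;
    volatile long double sum = one + eps;
    /* With reduced x87 precision control, 1 + eps rounds back to 1. */
    return sum != one;
}

/* Hypothetical startup guard: stop early instead of computing with
   silently reduced precision. */
void require_full_long_double(void)
{
    assert(long_double_precision_ok());
}
```

Calling require_full_long_double() first thing in main() would turn the silent precision loss into an immediate, visible failure; restoring the precision setting itself is a separate, platform-specific step not shown here.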
WSL2 doesn't have this problem, but it is supported only on relatively
new versions of Windows.
So, for older versions of Windows, if one wants to run Linux binaries
'as is' and get the same behavior of long double as in the original, then
one is advised to run Linux in a less-integrated VM, like VirtualBox
or MS's own Hyper-V.
and mingw-64 for the compiler target (with either gcc or
clang) and msys2 as an environment and source of *nix utilities and
libraries. mingw/msys is considered old and limited (32-bit only),
while Cygwin is considered slow and clunky by many.
And cygwin console is quite inconvenient.
At least, that is my understanding.
On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
Also the fact that those “64-bit” APIs are not entirely “64-bit” ...
They are entirely 64-bit.
On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:
On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
Also the fact that those “64-bit” APIs are not entirely “64-bit” ...
They are entirely 64-bit.
<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:
Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so they
need a 64-bit integer to be expressed easily. But the API call to
get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined in a
particular way.
For 32-bit Windows, that's sort of understandable;
32-bit Windows is, well, 32-bit, so you might not expect to be
able to use 64-bit integers. But if you use the same API in 64-bit
Windows, it still gives you the pair of numbers, rather than just
a nice simple 64-bit number. While this made some kind of sense on
32-bit Windows, it makes no sense at all on 64-bit Windows, since
64-bit Windows can, by definition, use 64-bit numbers.
On Tue, 1 Oct 2024 0:33:01 +0000, Lawrence D'Oliveiro wrote:
On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:
On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
Also the fact that those “64-bit” APIs are not entirely
“64-bit” ...
They are entirely 64-bit.
<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:
Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so they
need a 64-bit integer to be expressed easily. But the API call to
get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined in a
particular way.
As long as you can embed the API function in a macro that performs
said combining, it's all OK.
uint64_t filesize = GetFileSize64( whatever );
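The combining step such a wrapper would perform is small. Here is a sketch in portable C; combine_halves and split_halves are hypothetical helper names, not Win32 functions, and the actual call that fetches the two halves is left out so the snippet compiles anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the low and high 32-bit halves into one 64-bit size.
   The cast must come before the shift, or the high half is lost. */
uint64_t combine_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Inverse operation, for APIs that want the halves passed separately. */
void split_halves(uint64_t value, uint32_t *low, uint32_t *high)
{
    *low  = (uint32_t)(value & 0xFFFFFFFFu);
    *high = (uint32_t)(value >> 32);
}

/* Small self-check using a size just over 4 GB. */
void combine_halves_self_check(void)
{
    uint32_t lo, hi;
    split_halves(0x0000000123456789ull, &lo, &hi);
    assert(hi == 1 && lo == 0x23456789u);
    assert(combine_halves(lo, hi) == 0x0000000123456789ull);
}
```

Because the combination is done with shifts rather than by overlaying memory, it does not depend on the byte order of the host.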
For 32-bit Windows, that's sort of understandable;
32-bit Windows is, well, 32-bit, so you might not expect to be
able to use 64-bit integers. But if you use the same API in 64-bit
Windows, it still gives you the pair of numbers, rather than just
a nice simple 64-bit number. While this made some kind of sense on
32-bit Windows, it makes no sense at all on 64-bit Windows, since
64-bit Windows can, by definition, use 64-bit numbers.
Why would you want a 32-bit application to be able to use files
of 2^64-bits in size ???
The first issue here is that the original API defined the return value
as 32-bit ...
Turns out every single Win32-system in existence/in regular use is
little endian ...
Yeah, not too pretty, but also not a real/important problem.
On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:
On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
Also the fact that those “64-bit” APIs are not entirely “64-bit” ...
They are entirely 64-bit.
<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:
Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so they
need a 64-bit integer to be expressed easily. But the API call to
get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined in a
particular way. For 32-bit Windows, that's sort of understandable;
32-bit Windows is, well, 32-bit, so you might not expect to be
able to use 64-bit integers. But if you use the same API in 64-bit
Windows, it still gives you the pair of numbers, rather than just
a nice simple 64-bit number. While this made some kind of sense on
32-bit Windows, it makes no sense at all on 64-bit Windows, since
64-bit Windows can, by definition, use 64-bit numbers.
I had both partition and DVD image files larger than 4 GB well
before I had Win64 ...
Why would you want a 32-bit application to be able to use files of
2^64-bits in size ???
Lawrence D'Oliveiro wrote:
On Mon, 30 Sep 2024 11:15:05 +0300, Michael S wrote:
On Mon, 30 Sep 2024 01:28:42 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
Also the fact that those “64-bit” APIs are not entirely
“64-bit” ...
They are entirely 64-bit.
<https://arstechnica.com/gadgets/2018/05/microsoft-learn-from-apple-ii/2/>:
Another example; Win32 has a function for getting the size of a
file. File sizes on Windows are limited to 2^64 bytes, and so
they need a 64-bit integer to be expressed easily. But the API call
to get the size of a file doesn't give you a 64-bit value. Instead,
it gives you a pair of 32-bit values that have to be combined
in a particular way. For 32-bit Windows, that's sort of
understandable; 32-bit Windows is, well, 32-bit, so you might not
expect to be able to use 64-bit integers. But if you use the same
API in 64-bit Windows, it still gives you the pair of numbers,
rather than just a nice simple 64-bit number. While this made some
kind of sense on 32-bit Windows, it makes no sense at all on 64-bit
Windows, since 64-bit Windows can, by definition, use 64-bit
numbers.
The first issue here is that the original API defined the return
value as 32-bit, with an optional pointer to another variable to
receive the high part, but they came up with the GetFileSizeEx()
function decades ago, and that one gets the file size as a
LARGE_INTEGER. Nobody uses anything else afair.
The second potential issue is with the definition of LARGE_INTEGER:
It is, as you say, defined as a pair of 32-bit values, overlaid
with a LONGLONG, which can only work on a little-endian CPU since the
low part is followed by the high, right?
Turns out every single Win32 system in existence/in regular use is
little endian, so that is much less of a problem, and the docs tell
you:
"The LARGE_INTEGER structure is actually a union. If your compiler
has built-in support for 64-bit integers, use the QuadPart member to
store the 64-bit integer. Otherwise, use the LowPart and HighPart
members to store the 64-bit integer."
Yeah, not too pretty, but also not a real/important problem.
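The endianness point can be checked with a stand-in for the layout described above. This is a sketch only: large_int is a hypothetical type that mimics the described layout (low part first, then high part, overlaid with one 64-bit integer); it is not the definition from the Windows headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in: two 32-bit halves overlaid with a 64-bit value. */
typedef union {
    struct {
        uint32_t low_part;
        int32_t  high_part;
    } u;
    int64_t quad_part;
} large_int;

static_assert(sizeof(large_int) == 8, "halves must overlay the quad exactly");

/* Returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Endian-independent combination of the halves, for comparison. */
int64_t quad_from_parts(uint32_t low, int32_t high)
{
    return (int64_t)(((uint64_t)(uint32_t)high << 32) | low);
}

/* Does reading quad_part give the same value as combining the halves? */
int overlay_matches(uint32_t low, int32_t high)
{
    large_int v;
    v.u.low_part = low;
    v.u.high_part = high;
    return v.quad_part == quad_from_parts(low, high);
}
```

On a little-endian host overlay_matches() returns 1 for any input; on a big-endian host the two halves would land in swapped positions and it returns 0 unless both halves happen to hold the same bits.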
Terje
Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
it supposed to run on big-endian architectures too, like POWER, MIPS
and SPARC?
In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:
Wasn't Windows NT supposed to be some kind of _portable_ OS? Wasn't
it supposed to run on big-endian architectures too, like POWER, MIPS
and SPARC?
It did. I have no experience with Windows NT on SPARC or PowerPC,
but the OS ran fine on MIPS. It was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.
PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.
John
On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:
[email protected]d (Lawrence D'Oliveiro) wrote:
Wasn't Windows NT supposed to be some kind of _portable_ OS?
Wasn't it supposed to run on big-endian architectures too, like
POWER, MIPS and SPARC?
It did. I have no experience with Windows NT on SPARC or PowerPC,
Did WinNT on SPARC ever ship? I don't think so.
Wasn't MIPS edition of WinNT Little Endian?
Now I wonder what endianness was used by the PowerPC variant of WinNT.
In theory, PPC/POWER could run in Little Endian mode, but before v3
of POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
mistaken, the difference was that in LE mode there was no support
for unaligned memory accesses.
John Dallman <[email protected]> schrieb:
In article <vdg3d1$2kdqr$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:
Wasn't Windows NT supposed to be some kind of _portable_ OS?
Wasn't it supposed to run on big-endian architectures too, like
POWER, MIPS and SPARC?
It did. I have no experience with Windows NT on SPARC or PowerPC,
but the OS ran fine on MIPS.
There was also a Windows for Alpha. A German computer chain, Vobis,
tried to sell two models with that, but it flopped.
On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:
PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.
Now I wonder what endianness was used by the PowerPC variant of WinNT.
In theory, PPC/POWER could run in Little Endian mode, but before v3 of
POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
mistaken, the difference was that in LE mode there was no support for
unaligned memory accesses.
Michael S <[email protected]> writes:
On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:
PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.
The idea behind ARC (MIPS) and PowerPC (which was not just IBM) was that
they would succeed the IA-32-based PC. Given the assumed (and, around
1990, actual) performance superiority of RISCs over IA-32, this looked
plausible. However, even with Alpha, which was often superior in
performance throughout the 1990s, and for which there were cheap
offerings (but without performance edge; e.g., I once was playing with
the idea of buying a 21164PC-based PC164SX system, where the CPU+board
(with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998; but I went with
a K6-2, because I played some DOS games:-). The cheap 164SX offer may
have been a clearance sale, however.
In any case, the performance advantage of the RISCs vanished during
the 1990s, the RISCs never had wide ISV support, and so WNT on RISCs
flopped.
Now I wonder what endianness was used by the PowerPC variant of WinNT.
In theory, PPC/POWER could run in Little Endian mode, but before v3 of
POWER ISA it wasn't as full-featured as Big Endian mode. If I am not
mistaken, the difference was that in LE mode there was no support for
unaligned memory accesses.
Given that MIPS and Alpha require natural alignment, little-endian
PowerPC at the time was as full-featured as the other RISCs.
Alignment issues may have been a problem with the RISC ports, though.
- anton
Michael S <[email protected]> writes:
On Tue, 1 Oct 2024 10:12 +0100 (BST)
[email protected] (John Dallman) wrote:
PowerPC did for a while, but the company interested in NT on PowerPC
was IBM, and their hardware prices were a /lot/ higher than x86
prices. They didn't see that as a problem, but all the potential
customers did.
The idea behind ARC (MIPS) and PowerPC (which was not just IBM) was that
they would succeed the IA-32-based PC. Given the assumed (and, around
1990, actual) performance superiority of RISCs over IA-32, this looked
plausible. However, even with Alpha, which was often superior in
performance throughout the 1990s, and for which there were cheap
offerings (but without performance edge; e.g., I once was playing with
the idea of buying a 21164PC-based PC164SX system, where the CPU+board
(with 1MB L2 cache) cost ATS 6000 (~EUR 440) in 1998; but I went with
a K6-2, because I played some DOS games:-). The cheap 164SX offer may
have been a clearance sale, however.
Did Intergraph really try to port WinNT to SPARC v8 that was
strictly BE or they were porting to emerging SPARC v9?
It seems as if NT has only ever been little-endian.
John
[Windows NT on MIPS] was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.
PowerPC did for a while, but the company interested in NT on PowerPC was
IBM, and their hardware prices were a /lot/ higher than x86 prices. They
didn't see that as a problem, but all the potential customers did.
In any case, the performance advantage of the RISCs vanished during the
1990s ...
... the RISCs never had wide ISV support, and so WNT on RISCs
flopped.
On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:
[Windows NT on MIPS] was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.
It was NT that was the commercial failure, not MIPS. MIPS found a niche in
the embedded world, and went on to outsell x86 by a factor of 3:1 or so.
We know this because a lot of those embedded devices ran Linux.
PowerPC did for a while, but the company interested in NT on PowerPC was
IBM, and their hardware prices were a /lot/ higher than x86 prices. They
didn't see that as a problem, but all the potential customers did.
PowerPC got rolled back into POWER, near as I can tell. And that continues
to sell today -- you see some POWER machines not far from the top of the
current Top500 list of the world’s most powerful supercomputers. That
shows there is a viable market for the products.
And of course they, too, run Linux.
On Tue, 01 Oct 2024 16:26:25 GMT, Anton Ertl wrote:
In any case, the performance advantage of the RISCs vanished during the
1990s ...
Only for as long as Intel could afford to spend 10× as much on developing
each chip generation as the RISC vendors could. It could because it could
reap 10× the profits in return, but it can’t any more.
Which is why you
see ARM coming to the fore, and RISC-V appearing as the upstart
challenger.
... the RISCs never had wide ISV support, and so WNT on RISCs
flopped.
As I said above, RISC is still around and dominating the computing world.
They’re not running Windows, because it was Windows that could not adapt
well to them. Instead, they are running Linux.
Lawrence D'Oliveiro <[email protected]d> writes:
On Tue, 1 Oct 2024 10:12 +0100 (BST), John Dallman wrote:
[Windows NT on MIPS] was a commercial failure, because MIPS
didn't keep up with the performance growth of x86.
It was NT that was the commercial failure, not MIPS. MIPS found a niche in
the embedded world, and went on to outsell x86 by a factor of 3:1 or so.
Note that MIPS CPUs were used in SGI supercomputers and high-end
graphics workstations.
For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.
GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit, which is not what programmers need
as it triggers for many false positives so people turn it off.
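A per-operation alternative to a unit-wide trap option is sketched below. It relies on __builtin_add_overflow, a GCC and clang extension rather than standard C; the wrapper name checked_add_int is made up for this example. On x86 compilers typically turn this into an add followed by a branch on the overflow flag, though that depends on the compiler and options.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Checked signed addition: returns false on overflow and leaves *out
   untouched, so the caller decides which additions are checked. */
bool checked_add_int(int a, int b, int *out)
{
    int result;
    if (__builtin_add_overflow(a, b, &result))
        return false;
    *out = result;
    return true;
}

/* Small self-check covering the normal and the overflowing case. */
void checked_add_self_check(void)
{
    int r = 0;
    assert(checked_add_int(1, 2, &r) && r == 3);
    assert(!checked_add_int(INT_MAX, 1, &r));
    assert(r == 3);   /* unchanged after the failed add */
}
```

Only the call sites that use the wrapper pay for the check, which avoids both the blanket slowdown and the false positives of checking every signed operation in a compilation unit.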
I've always paid for mine. My first C compiler came with the WinNT 3.5
beta in 1992 for $99 and came with the development kit,
editor, source code debugger, tools, documentation.
A few hundred bucks is not going to hurt my business.
On 03/10/2024 02:07, Lawrence D'Oliveiro wrote:
It was NT that was the commercial failure, not MIPS. MIPS found a
niche in the embedded world, and went on to outsell x86 by a
factor of 3:1 or so.
The key markets for MIPS were network devices (managed switches,
routers, small Wifi/NAT routers, etc.) and multimedia devices
(smart TVs, Bluray players, set-top boxes, etc.).
These have mostly been overtaken by ARM these days.
If the RISC companies failed to keep up, they only have themselves to
blame. It seems to me that a number of RISC companies had difficulties
with managing the larger projects that the growing die areas allowed.
If the RISC companies failed to keep up, they only have themselves to
blame.
Another issue was the marketing. The RISC companies did not want to
damage their existing high-priced workstation and server business by providing cheap CPUs for the masses ...
Another contributing factor was Itanium, which was quite successful at disrupting the development cycles of the RISC architectures. Of the five
that I worked with [that all failed, except one] ...
IBM kept POWER development going through the Itanium period, which is a significant reason why it's still going.
SGI went into Itanium hard and neglected MIPS development, which never recovered. It had been losing in the performance race anyway.
Sun kept SPARC development going, but made a different mistake, by
spreading their development resources over too many projects. The ones
that succeeded did so too slowly, and they fell behind. Also, Linux ate
their web-infrastructure market rather quickly.
Linux could not have had the success it did without the large range of powerful and cheap hardware designed to run Windows.
And MIPS, the company, has abandoned its own architecture in favour of RISC-V. <https://mips.com/>
On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:
If the RISC companies failed to keep up, they only have themselves to
blame.
That’s all past history, anyway. RISC very much rules today, and it is x86
that is struggling to keep up.
Another issue was the marketing. The RISC companies did not want to
damage their existing high-priced workstation and server business by
providing cheap CPUs for the masses ...
There was one RISC family that did indeed provide cheap CPUs for the
masses, even more so than x86, and that was ARM.
You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
If the RISC companies failed to keep up, they only have themselves to
blame. It seems to me that a number of RISC companies had difficulties
with managing the larger projects that the growing die areas allowed.
Another contributing factor was Itanium, which was quite successful at
disrupting the development cycles of the RISC architectures.
Alpha suffered from DEC's mis-management, which led to DEC being taken
over by Compaq. They killed Alpha when Itanium first began to work, and
before it was clear that it was a turkey.
PA-RISC was intended by HP to be replaced by Itanium. They managed that,
but their success was limited because Linux on x86-64 was so much more
cost-effective.
IBM kept POWER development going through the Itanium period, which is a
significant reason why it's still going.
SGI went into Itanium hard and neglected MIPS development, which never
recovered. It had been losing in the performance race anyway.
Sun kept SPARC development going, but made a different mistake, by
spreading their development resources over too many projects. The ones
that succeeded did so too slowly, and they fell behind.
Also, Linux ate
their web-infrastructure market rather quickly.
Linux could not have had the success it did without the large range of
powerful and cheap hardware designed to run Windows.
Alpha suffered before. The 21264 was late, and did not keep up in the
clock race.
On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:
Given all of IBM's missteps, it's mildly surprising they got that
one right. Even a stopped clock is right once a day ...
SGI decided to embrace the platform that was eating their market,
and try to sell Windows NT boxes. Trouble is, those NT boxes, while
only a fraction of the cost of an IRIX-based product, still cost
about 3× what other NT machines were going for.
They could still have sold SPARC hardware running Linux. I can
remember comments saying Linux ran better on that hardware than
Sun's own SunOS/Solaris did.
In article <vdnef0$3uaeh$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:
On Thu, 3 Oct 2024 23:49 +0100 (BST), John Dallman wrote:
Then there were the SGI Visual Workstations, which ran NT on x86. The
first generation of them were quite nice, but needed a very custom HAL,
and hence couldn't be upgraded to later versions of Windows once SGI
abandoned them.
By this time, SGI had a department of downsizing, whose job was to get
rid of departments and sites. Being an American company, this department
fought for power and budget share, and nobody inside the company seemed
to think that this would spell doom for SGI.
They [SGI ed.] could still have sold SPARC hardware running Linux. I can
remember comments saying Linux ran better on that hardware than
Sun's own SunOS/Solaris did.
They would not have faced up to that.
Anton Ertl <[email protected]> schrieb:
Alpha suffered before. The 21264 was late, and did not keep up in the
clock race.
https://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf
gives the clock rate as varying between 466 and 600 MHz, and
Wikipedia gives the clock frequency of the Pentium Pro as between
150 and 200 MHz. The Pentium II Overdrive, according to Wikipedia,
had up to 333 MHz.
Is this information wrong?
George Neuner <[email protected]> writes:<snipping>
I don't agree with all of that, however. E.g., when discussing a VAX
instruction similar to IA-32's REP MOVS, he considers it to be a big
advantage that the operands of REP MOVS are in registers. That
appears wrong to me; you either have to keep REP MOVS in decoding (and
thus stop decoding any later instructions) until you know the value of
that register coming out of the OoO engine, making REP MOVS a mostly
serializing instruction. Or you have a separate OoO logic for REP
MOVS that keeps generating loads and stores inside the OoO engine. If
you have the latter in the VAX, it does not make much difference if
the operand is on a register or memory. The possibility of trapping
during REP MOVS (or the VAX variant) complicates things, though: the
first part of the REP MOVS has to be committed, and the registers
written to the architectural state, and then execution has to start
again with the REP MOVS. Does not seem much harder on the VAX to me,
however.
- anton
On 10/3/2024 11:36 PM, Chris M. Thomasson wrote:
On 10/3/2024 9:23 PM, George Neuner wrote:
On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:
If the RISC companies failed to keep up, they only have themselves to
blame.
That’s all past history, anyway. RISC very much rules today, and it
is x86 that is struggling to keep up.
You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.
Yeah. Wrt memory barriers, one is allowed to release a spinlock on "x86"
with a simple store.
The fact that one can release a spinlock using a simple store means that
it's basically load-acquire release-store.
So a load will do a load then have an implied acquire barrier.
A store will do an implied release barrier then perform the store.
This release behavior is okay for releasing a spinlock with a simple
store, MOV.
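The acquire/release pairing described above can be written down with C11 atomics. This is a minimal sketch under stated assumptions: the spinlock type and function names are invented for illustration, and the claim that the release store becomes a plain MOV on x86 reflects what compilers typically emit, not something the C standard guarantees.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock: 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} spinlock;

void spin_init(spinlock *l)
{
    atomic_init(&l->locked, 0);
}

/* Taking the lock needs an atomic read-modify-write with acquire
   ordering, so later accesses cannot move above it. */
bool spin_trylock(spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

void spin_lock(spinlock *l)
{
    while (!spin_trylock(l)) {
        /* busy-wait until the holder releases the lock */
    }
}

/* Releasing is just a store with release ordering, so earlier accesses
   cannot move below it. No read-modify-write is needed. */
void spin_unlock(spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Single-threaded self-check of the state transitions. */
void spinlock_self_check(void)
{
    spinlock l;
    spin_init(&l);
    assert(spin_trylock(&l));    /* free -> held */
    assert(!spin_trylock(&l));   /* already held */
    spin_unlock(&l);
    assert(spin_trylock(&l));    /* free again after the release store */
}
```

The asymmetry is the point: acquisition uses an exchange, while the unlock path is an ordinary store that only carries an ordering constraint.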
On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:
George Neuner <[email protected]> writes:<snipping>
My 66000 has a MemMove instruction consisting of a 1 word instruction,
that leaves DECODE and enters into one MEMory unit, where it proceeds
to AGEN and Read, AGEN and Write, leaving the rest of the function
units proceeding to whatever is next.
One thing I did differently here: none of the 3 registers is modified,
yet I retain the ability to take an exception and re-play the instruction
from where it left off {in state never visible to the instruction
stream except via the DECODE stage.}
George Neuner <[email protected]> writes:
You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.
Repeating nonsense does not make it any truer, and this nonsense has
been repeated since at least the Pentium Pro (1995), maybe already
since the 486 (1989). CISC and RISC are about the instruction set,
not about the implementation. And even if you look at the
implementation, it's not true: The P6 has microinstructions that are
~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
The K7 has load-store microinstructions; RISCs don't have that.
In more recent CPUs, AMD tends to work with macro-instructions between
the decoder and the reorder buffer (i.e., in the part that in the
Pentium Pro may have been used as the justification for the RISC
claim); these macro instructions are load-and-op and read-modify-write
instructions.
John Mashey has written about the difference between CISC and RISC
repeatedly <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>, and
he gives good criteria for classifying instruction sets as RISC or
CISC, and by his criteria the 80286 and IA-32 instruction sets of the
Pentium Pro clearly both are CISCs. I have recently
<[email protected]> used his criteria on
instruction sets that Mashey did not classify (mostly because they
were done after his table), and by these criteria AMD64 is clearly a
CISC, while ARM A64 and RISC-V are clearly RISCs.
In searching for whether he has written something specific about
IA-32, I found <https://yarchive.net/comp/vax.html>, which is an
earlier instance of the recent discussion of whether it would have
been better for DEC to stick with VAX, do an OoO implementation and
extend the architecture to 64 bits, like Intel has done:
<https://yarchive.net/comp/vax.html>. He also discusses the problems
of IA-32 there, but mainly in pointing out how much smaller they were
than the VAX ones.
I don't agree with all of that, however. E.g., when discussing a VAX
instruction similar to IA-32's REP MOVS, he considers it to be a big
advantage that the operands of REP MOVS are in registers. That
appears wrong to me; you either have to keep REP MOVS in decoding (and
thus stop decoding any later instructions) until you know the value of
that register coming out of the OoO engine, making REP MOVS a mostly
serializing instruction. Or you have a separate OoO logic for REP
MOVS that keeps generating loads and stores inside the OoO engine. If
you have the latter in the VAX, it does not make much difference if
the operand is on a register or memory. The possibility of trapping
during REP MOVS (or the VAX variant) complicates things, though: the
first part of the REP MOVS has to be committed, and the registers
written to the architectural state, and then execution has to start
again with the REP MOVS. Does not seem much harder on the VAX to me,
however.
- anton
On Fri, 04 Oct 2024 07:05:34 GMT, [email protected]
(Anton Ertl) wrote:
George Neuner <[email protected]> writes:
You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.
Repeating nonsense does not make it any truer, and this nonsense has
been repeated since at least the Pentium Pro (1995), maybe already
since the 486 (1989). CISC and RISC are about the instruction set,
not about the implementation. And even if you look at the
implementation, it's not true: The P6 has microinstructions that are
~100 bits long, whereas RISCs have 32-bit and 16-bit instructions.
The K7 has load-store microinstructions; RISCs don't have that.
Anton, you know very well that the hardware does not execute the "x86"
instruction set but only /emulates/ it. The decoder translates x86
instructions into sequences of microinstructions that perform the
equivalent operations. The fact that some simple instructions
translate one to one does not change this.
On 10/4/24 7:30 PM, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:
George Neuner <[email protected]> writes:<snipping>
My 66000 has a MemMove instruction consisting of a 1 word instruction,
that leaves DECODE and enters into one MEMory unit, where it proceeds
to AGEN and Read, AGEN and Write, leaving the rest of the function
units proceeding to whatever is next.
One thing I did different, here, none of the 3 registers is modified,
yet I retain the ability to take exception and re-play the instruction
from where it left off {in state never visible to the instruction
stream except via DECODE stage.}
What happens if the exception handler reschedules the CPU to
a different task before returning from the exception?
I ass-me that like the PREDicate instruction modifier, there is
_implicit_ state that is saved on context switches. I.e., there is
extra storage space in the context store for such data.
My 66000 uses hardware context saving, so software can be ignorant
of such (aside from reserving enough storage).
[email protected] (John Dallman) writes:
Linux could not have had the success it did without the large
range of powerful and cheap hardware designed to run Windows.
It was first developed on a 386, and many of the early co-developers
also had IA-32 machines. But the 386 certainly was not designed to
run Windows. The 386 project was finished before Windows 1.0 was
released in November 1985, and nobody used Windows 1.0 or 2.0, so
why would anybody design a processor for those? ...
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
[email protected] (John Dallman) writes:
Linux could not have had the success it did without the large
range of powerful and cheap hardware designed to run Windows.
It was first developed on a 386, and many of the early co-developers
also had IA-32 machines. But the 386 certainly was not designed to
run Windows. The 386 project was finished before Windows 1.0 was
released in November 1985, and nobody used Windows 1.0 or 2.0, so
why would anybody design a processor for those? ...
OK, "designed to run MS-DOS, and later Windows"?
[email protected] (MitchAlsup1) writes:
What happens if the exception handler reschedules the CPU to
a different task before returning from the exception?
"Paul A. Clayton" <[email protected]> writes:
On 10/4/24 7:30 PM, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
On Fri, 4 Oct 2024 7:05:34 +0000, Anton Ertl wrote:
George Neuner <[email protected]> writes:<snipping>
My 66000 has a MemMove instruction consisting of a 1 word instruction,
that leaves DECODE and enters into one MEMory unit, where it proceeds
to AGEN and Read, AGEN and Write, leaving the rest of the function
units proceeding to whatever is next.
One thing I did different, here, none of the 3 registers is modified,
yet I retain the ability to take exception and re-play the instruction
from where it left off {in state never visible to the instruction
stream except via DECODE stage.}
What happens if the exception handler reschedules the CPU to
a different task before returning from the exception?
I ass-me that like the PREDicate instruction modifier, there is
_implicit_ state that is saved on context switches. I.e., there is
extra storage space in the context store for such data.
My 66000 uses hardware context saving, so software can be ignorant
of such (aside from reserving enough storage).
I got the impression that it wasn't so much context saving,
as context switching (i.e. storage per 'process/thread');
yet if that storage needs to be saved to DRAM on any
exception, just in case the OS switches to a different
thread context, then I don't see how he can get his
claimed context switch times.
Power still survives, maybe only because it has a common basis with
iSeries (or whatever it is called now).
The virtual 8086 mode of the 386 was used by Windows/386 (starting
already in 1987). Was virtual 8086 mode designed into the 386
specifically for Windows? I doubt it ...
On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
That’s all past history, anyway. RISC very much rules today, and it is
x86 that is struggling to keep up.
You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed in
1995.
CISC and RISC are about the instruction set, not about
the implementation. And even if you look at the implementation, it's
not true: The P6 has microinstructions that are ~100 bits long, whereas
RISCs have 32-bit and 16-bit instructions. The K7 has load-store microinstructions; RISCs don't have that.
... ARM A64 and RISC-V are clearly RISCs.
The possibility of trapping during
REP MOVS (or the VAX variant) complicates things, though: the first part
of the REP MOVS has to be committed, and the registers written to the architectural state, and then execution has to start again with the REP
MOVS. Does not seem much harder on the VAX to me, however.
In any case, certainly for the stuff I do I see no reason why I would consider, much less recommend buying a Power machine these days.
Native POWER is, I think, called pSeries. It continues to sell in its own
right because it offers high performance--high enough to earn a few
ongoing spots near the top of the Top500 supercomputer list.
Lawrence D'Oliveiro <[email protected]d> writes:
Native POWER is, I think, called pSeries. It continues to sell in
its own right because it offers high performance--high enough to
earn a few ongoing spots near the top of the Top500 supercomputer
list.
Looking at the June 2024 edition, I see Summit as the highest-ranked
system with Power CPUs, and they are Power 9. So if your claim was
true that the Top500 supercomputer list reflects CPU performance,
Power 9 would beat Power 10 in CPU performance, and EPYC, Xeon,
Fujitsu A64FX and Nvidia Grace are more powerful CPUs. However, in
most supercomputers (including Summit) the GPGPUs provide the bulk of
the FLOPS that are measured in the Top 500, so looking at the Top 500
is misleading for determining CPU performance.
So let's look at SPEC CPU instead. For CPU2017, I see only four
entries from IBM, all for the Integer Rate metric, two with Power 9
and two with Power 10 CPUs. The highest of those results is:
base peak
1700 2170 IBM Power E1080
That's with 8 sockets, 120 cores, and 960 threads. Looking at other
8-socket machines, I find
base peak
3820 3880 BullSequana SH80
That's with 8 sockets, 480 cores, and 960 threads (similar results
from Fujitsu PRIMERGY RX8770 M7, HPE Compute Scale-up Server 3200,
Inspur TS860G7 and Supermicro SuperServer SYS-681E-TR, all done with
Xeon Platinum 8490H CPUs). And if you go for maximum performance,
there's a 16-socket Xeon machine from Bull with base=7400, peak=7450.
Alternatively, you can instead buy a 2-socket system with similar
performance to the 8-socket IBM Power E1080:
base peak
1950 2140 ASUS RS720A-E12-RS12
and similar results from other systems with the EPYC 9754.
https://www.spec.org/cpu2017/results/res2021q3/cpu2017-20210814-28679.html
https://www.spec.org/cpu2017/results/res2024q3/cpu2017-20240701-43944.html
https://www.spec.org/cpu2017/results/res2023q2/cpu2017-20230522-36617.html
Admittedly, IBM extracts the most performance from each core, but with
only 15 cores per CPU (where others have 128), that is no longer that impressive.
Nevertheless, neither machines with the Ryzen 7950X nor
with the Xeon-E2488 reach the performance per core (and no results for
the Ryzen 9950X have been submitted yet), so it looks like Power 10
has a really good multi-threading implementation.
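The per-core comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration using the figures quoted above; the helper name `base_per_core` is made up, the 60 cores per Xeon socket follow from 480 cores in 8 sockets, and the 128 cores per EPYC 9754 socket are taken from the "where others have 128" remark, assuming both sockets of the 2-socket system are fully populated.

```c
/* SPEC CPU2017 Integer Rate base result divided by the total core count.
 * The socket and core counts passed in are the ones quoted in the post
 * (or derived from it, see the lead-in). */
static double base_per_core(double base, int sockets, int cores_per_socket)
{
    return base / (double)(sockets * cores_per_socket);
}

/* Figures from the post:
 *   IBM Power E1080:       base 1700, 8 sockets x  15 cores
 *   BullSequana SH80:      base 3820, 8 sockets x  60 cores
 *   ASUS RS720A-E12-RS12:  base 1950, 2 sockets x 128 cores (assumed)
 */
```

With these inputs the Power 10 system comes out at roughly 14.2 per core, against roughly 8.0 for the Xeon Platinum 8490H system and roughly 7.6 for the EPYC 9754 system, which is what "extracts the most performance from each core" refers to.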
The fact that IBM has not submitted results for Power for SPEC CPU
2017 for (Int or FP) Speed or FP Rate results is an admission that
their numbers there are even less impressive.
In any case, certainly for the stuff I do I see no reason why I would consider, much less recommend buying a Power machine these days. My
guess is that the major reasons for buying pSeries machines these days
are legacy software and IBM salesmanship.
- anton
Intel I think tried to spread this idea of a "RISC core" somewhere inside
the labyrinthine complexity of its Pentium-and-later chips, in the hope
that some of the aura attached to the term "RISC" would rub off on its
products.
And quite a few people fell for it.
... ARM A64 and RISC-V are clearly RISCs.
ARM and some other RISC architectures (e.g. POWER) do somewhat stretch the
term though, don’t they, when they add that combinatorial explosion of
operand types in their short-vector instructions.
On Fri, 04 Oct 2024 15:07:17 GMT, Anton Ertl wrote:
Power still survives, maybe only because it has a common basis with
iSeries (or whatever it is called now).
As I understand it, iSeries is the emulation of the old AS/400 on
POWER processors. And AS/400 was the unification of the older
System/38 with the System/34? System/36? lines.
System/38 (or AS/400, or iSeries) has/had this interestingly unusual architecture which builds database features right into the OS kernel,
so that they can be used everywhere. And it also uses capabilities as
an alternative to the traditional privilege-mode hierarchy. Neither
of these ideas says much for performance, but they still suggest some interesting possibilities, nonetheless.
Native POWER is, I think, called pSeries. It continues to sell in its
own right because it offers high performance--
high enough to earn a
few ongoing spots near the top of the Top500 supercomputer list.
On Thu, 3 Oct 2024 23:36:12 -0700, Chris M. Thomasson wrote:
On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
That’s all past history, anyway. RISC very much rules today, and
it is x86 that is struggling to keep up.
You are, of course, aware that the complex "x86" instruction set
is an illusion and that the hardware essentially has been a
load-store RISC with a complex decoder on the front end since the
Pentium Pro landed in 1995.
Of course, and that complexity (and consequent expense) is part of
the struggle. Looking at Intel’s current financial woes, it is
clearly not being as successful at that as it has been in the past.
AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.
On Sun, 06 Oct 2024 07:18:59 GMT
[email protected] (Anton Ertl) wrote:
The fact that IBM has not submitted results for Power for SPEC CPU
2017 for (Int or FP) Speed or FP Rate results is an admission that
their numbers there are even less impressive.
That's the most likely explanation, but another one is that it is some
sort of internal policy, no matter what.
IIRC, they didn't publish non-rate scores for POWER7 either, even though,
according to independent measurements at the time of introduction,
POWER7 single-threaded performance was in the same ballpark as the best
Intel offerings and easily ahead of the best AMD.
In any case, certainly for the stuff I do I see no reason why I would
consider, much less recommend buying a Power machine these days. My
guess is that the major reasons for buying pSeries machines these days
are legacy software and IBM salesmanship.
- anton
I think, if you are running Oracle DB Enterprise Edition, where
software license per core is the most expensive part then there could
be an economical reason for preferring POWER9 or 10 over Intel or AMD.
But that's just a guess.
On Sun, 06 Oct 2024 08:40:55 GMT
[email protected] (Anton Ertl) wrote:
AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.
Wasn't the term invented by Nexgen for Nx586 and later adopted by AMD
after they scrapped their home brewed core in favor of Nexgen's core
that later became known as AMD K6?
Michael S <[email protected]> writes:
On Sun, 06 Oct 2024 08:40:55 GMT
[email protected] (Anton Ertl) wrote:
AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.
Wasn't the term invented by Nexgen for Nx586 and later adopted by AMD
after they scrapped their home brewed core in favor of Nexgen's core
that later became known as AMD K6?
Easily possible. AMD may not have been the original culprit, but they continued this terminology, and therefore are just as guilty.
- anton
OK, "designed to run MS-DOS, and later Windows"?
The 286 protected mode was certainly not designed for MS-DOS, and
the 386 paging of linear addresses was certainly not designed for
DOS, either.
And the success of IA-32 and then AMD64 at replacing the RISCs is
exactly because it was not some DOS-centric architecture, but also
provided features needed by other OSs like 386/ix (later Interactive
Unix, which I used myself in 1990 or so), Xenix, Linux, Windows NT,
Solaris, the various BSDs, and others. And the computers built
around these CPUs also provided these features.
It seems to me that the 286 protected mode was a continuation of the
iAPX432 ideas, which predated DOS,
and that the 386 paging imitated the virtual-memory mainstream
of bigger computing platforms at the time, such as the VAX and
S/370.
Intel's current financial woes do not appear to be [directly] related to
Intel PC (laptops+desktop) sales that are right now pretty good and
profitable.
On 10/4/2024 3:54 PM, MitchAlsup1 wrote:
On Fri, 4 Oct 2024 19:36:41 +0000, Chris M. Thomasson wrote:
On 10/3/2024 11:36 PM, Chris M. Thomasson wrote:
On 10/3/2024 9:23 PM, George Neuner wrote:
On Fri, 4 Oct 2024 00:48:43 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
On Thu, 03 Oct 2024 06:57:54 GMT, Anton Ertl wrote:
If the RISC companies failed to keep up, they only have themselves to
blame.
That’s all past history, anyway. RISC very much rules today, and it
is x86 that is struggling to keep up.
You are, of course, aware that the complex "x86" instruction set is an
illusion and that the hardware essentially has been a load-store RISC
with a complex decoder on the front end since the Pentium Pro landed
in 1995.
Yeah. Wrt memory barriers, one is allowed to release a spinlock on "x86"
with a simple store.
The fact that one can release a spinlock using a simple store means that
it's basically load-acquire release-store.
So a load will do a load then have an implied acquire barrier.
A store will do an implied release barrier then perform the store.
How does the store know it needs to do this when the locking
instruction is more than a pipeline depth away from the
store release ?? So, Locked LD (or something) happens at
1,000,000 cycles, and the corresponding store happens at
10,000,000 cycles (9,000,000 locked).
This release behavior is okay for releasing a spinlock with a simple
store, MOV.
It may be OK to SW but it causes all kinds of grief to HW.
I thought that x86 has an implied #LoadStore | #StoreStore before the
store, basically to give it release semantics. This means that one can release a spinlock without using any explicit membars. Iirc, there are
Intel manuals that show this for spinlocks. Cannot exactly remember
right now.
On x86 an atomic load has acquire and atomic stores have release
semantics. Well, I think that is for WB memory only. Humm... Cannot
remember if it's for WC or WB memory right now. Then there are the
L/S/MFENCE instructions...
https://www.felixcloutier.com/x86/sfence
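The acquire/release pairing described above can be written portably with C11 atomics. This is a minimal sketch, not code from any poster or from the Intel manuals: `demo_spinlock`, `demo_lock`, `demo_trylock` and `demo_unlock` are hypothetical names. On x86 with ordinary write-back memory, compilers typically turn the release store in `demo_unlock` into a plain MOV with no fence, while the exchange in `demo_lock` becomes a locked instruction; that mapping is the usual outcome, not something the C standard promises.

```c
#include <stdatomic.h>

/* Hypothetical minimal spinlock: 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} demo_spinlock;

/* Try once to take the lock. Returns 1 on success, 0 if it was held.
 * The exchange has acquire ordering, so accesses in the critical
 * section cannot be reordered before it. */
static int demo_trylock(demo_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Spin until the lock is obtained. */
static void demo_lock(demo_spinlock *l)
{
    while (!demo_trylock(l))
        ; /* busy-wait */
}

/* Release the lock with a simple store that has release ordering:
 * all earlier loads and stores are ordered before it
 * (the #LoadStore | #StoreStore behaviour mentioned above). */
static void demo_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

On a weakly ordered machine the same source would need a barrier or a dedicated store-release instruction in `demo_unlock`; the source code stays the same and only the generated instructions differ.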
Number of operand types never has been a criterion in any of the RISC definitions I have seen, nor the number of instructions (although some
people like to go by that).
However, in most supercomputers (including Summit) the GPGPUs provide
the bulk of the FLOPS ...
However there are two exceptions: Fugaku (#4, Fujitsu A64Fx) and Sunway TaihuLight (#13, Sunway SW26010).
... whereas x86-based machines that weren't PC-compatible ...
On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:
Number of operand types never has been a criterion in any of the RISC
definitions I have seen, nor the number of instructions (although some
people like to go by that).
It’s in the name: "Reduced Instruction Set Computer".
I always thought it should have been "IRSC": "Increased Register Set
Computer". The most obvious characteristic, the one that tends to hit you
first, is having lots of registers.
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
OK, "designed to run MS-DOS, and later Windows"?
The 286 protected mode was certainly not designed for MS-DOS, and
the 386 paging of linear addresses was certainly not designed for
DOS, either.
I see where I'm going wrong: I'm trying to talk about the machines
designed to run MS-DOS and later Windows, not just the CPUs. The vast
range of hardware that all had substantial degrees of compatibility as
regards booting, busses and so on. Those things let their manufacturers
compete for the DOS and Windows market, whereas x86-based machines that
weren't PC-compatible only succeeded in quite specialised niches.
[email protected] (John Dallman) writes:
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
OK, "designed to run MS-DOS, and later Windows"?
The 286 protected mode was certainly not designed for MS-DOS, and
the 386 paging of linear addresses was certainly not designed for
DOS, either.
I see where I'm going wrong: I'm trying to talk about the machines
designed to run MS-DOS and later Windows, not just the CPUs. The vast
range of hardware that all had substantial degrees of compatibility as
regards booting, busses and so on. Those things let their manufacturers
compete for the DOS and Windows market, whereas x86-based machines that
weren't PC-compatible only succeeded in quite specialised niches.
There actually were MS-DOS-compatible machines that were not 100% IBM
PC compatible, and did not run programs that used direct hardware
access, but MS-DOS programs that only used BIOS functions (i.e., a
HAL). The BIOS functions were too slow, so the programs with direct
hardware access won out over those that used the BIOS, and therefore
the 100% IBM PC compatibles won out over the MS-DOS compatibles.
The PC industry then developed a culture of compatibility, and that
helped all OSs, not just DOS and Windows. E.g., it is much easier to
install Linux on a PC than on some ARM-based SBC; for the ARM-based
SBC the typical way is to use a prepared system image on an SD-card,
because you cannot just put in a USB stick and run an installer.
- anton
On Sun, 06 Oct 2024 07:18:59 GMT, Anton Ertl wrote:
However, in most supercomputers (including Summit) the GPGPUs
provide the bulk of the FLOPS ...
That tends to go back and forth, between CPU and GPU.
See this <https://www.nextplatform.com/2020/03/05/software-evolution-on-ornls-summit-supercomputer/>
interview with Dr Tjerk Straatsma, group lead for scientific computing
at ORNL. Seems their supers have made heavy use of NVidia GPUs up to
now, but this was set to change:
Frontier, the next system for the OLCF, will have AMD CPUs and
GPUs.
To prepare for this system, software developers may want to make
changes to their programming approach, with OpenMP directive-based
and HIP native offloading as the most comparable to the OpenACC
and CUDA approaches on Summit today.
I wonder what happened to OpenCL as the cross-platform architecture
for GPU computing?
However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for the
FPR specifier.
- anton
On Sun, 6 Oct 2024 13:51:19 +0300, Michael S wrote:
Intel's current financial woes do not appear to be [directly]
related to Intel PC (laptops+desktop) sales that are right now
pretty good and profitable.
x86 chip sales have been declining for years. At one time they were
up to a million per day; nowadays it’s only about 80% of that.
And
you see the trouble they have keeping up in performance, microcode
bugs etc. All adds up to competitiveness trouble.
Michael S <[email protected]> writes:
On Mon, 07 Oct 2024 07:17:02 GMT
[email protected] (Anton Ertl) wrote:
However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for the
FPR specifier.
- anton
Which sounds rather arbitrary.
In a way it is, but see below.
Or even worse, like if he wanted for
SPARC to be called 'typical RISC' and for ARM to be called atypical and
had chosen the numbers to match the agenda.
I think that ARM did not exist for John Mashey.
On Mon, 07 Oct 2024 07:17:02 GMT
[email protected] (Anton Ertl) wrote:
However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for the
FPR specifier.
- anton
Which sounds rather arbitrary.
Or even worse, like if he wanted for
SPARC to be called 'typical RISC' and for ARM to be called atypical and
had chosen the numbers to match the agenda.
On Fri, 04 Oct 2024 07:05:34 GMT, Anton Ertl wrote:
CISC and RISC are about the instruction set, not about
the implementation. And even if you look at the implementation, it's
not true: The P6 has microinstructions that are ~100 bits long, whereas
RISCs have 32-bit and 16-bit instructions. The K7 has load-store
microinstructions; RISCs don't have that.
Intel I think tried to spread this idea of a “RISC core” somewhere inside
the labyrinthine complexity of its Pentium-and-later chips, in the hope
that some of the aura attached to the term “RISC” would rub off on its
products.
And quite a few people fell for it.
... ARM A64 and RISC-V are clearly RISCs.
ARM and some other RISC architectures (e.g. POWER) do somewhat stretch the
term though, don’t they, when they add that combinatorial explosion of
operand types in their short-vector instructions.
RISC-V has consciously avoided this, by going back to the older long-
vector idea, like Seymour Cray used in his machines.
Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In article <[email protected]>,
Anton Ertl <[email protected]> wrote:
Brett <[email protected]> writes:
Speaking of complex things, have you looked at Swift output, as it checks
all operations for overflow?
You could add an exception type for that, saving huge numbers of correctly
predicted branch instructions.
The future of programming languages is type safe with checks, you need to
get on that bandwagon early.
MIPS got on that bandwagon early. It has, e.g., add (which traps on
signed overflow) in addition to addu (which performs modulo
arithmetic). It has been abandoned and replaced by RISC-V several
years ago.
Alpha got on that bandwagon early. It's a descendent of MIPS, but it
renamed add into addv, and addu into add. It has been canceled around
the year 2000.
[ More details about architectures without trapping overflow instructions ]
Trapping on overflow is basically useless other than as a debug aid,
which clearly nobody values. If you take Rust's approach, and only
detect overflow in debug builds, then you already don't care about
performance.
Those automatic software correctness checks, of which signed integer
overflow detection is one of many, went away because most code was
being written in C/C++ and those two languages don't require them.
That just makes it more expensive in code size and performance to effect
such checks. This overhead leads some to conclude it justifies eliminating
the error checks.
Eliminating the error event detectors doesn't make errors go away,
just your knowledge of them.
I gather portions of 16-bit Windows 3.1 were written in Pascal.
When Microsoft developed 32-bit WinNT, if instead of C it they had
switched their official development language from Pascal to Modula-2
which does require signed and unsigned, checked and modulo arithmetic,
and array bounds checks, the world would have been a much safer place.
But they didn't so it isn't.
The x86 designers might then have had an incentive to make all the
checks as efficient as possible, and rather than eliminate them,
they might have enhanced and more tightly integrated them.
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.
For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.
GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.
Terje Mathisen wrote:
Kent Dickey wrote:
Look at:
https://godbolt.org/z/oMhW55YsK
Which is this code:
int add2(int num, int other) {
return num + other;
}
Compiled with these options: -O2 -ftrapv
(-ftrapv is the GCC argument for detect signed overflows and crash).
For x86-64 clang 19.1.0:
add2:
add edi, esi
jo .LBB0_1
mov eax, edi
ret
.LBB0_1:
ud1 eax, dword ptr [eax]
This looks OK: it does a normal add, then branches-on-overflow to
an undefined instruction.
But x86 has an instruction to trap on overflow directly: INTO. It's
one byte.
And it doesn't use it.
GCC x86-64 14.2 is even worse:
add2:
sub rsp, 8
call __addvsi3
add rsp, 8
ret
It calls a routine to do all additions which might overflow, and that
routine calls assert() if an overflow occurs.
The CPU has a trap-on-overflow instruction exactly for this case (to crash
on detecting an overflow), and compilers don't even use it.
So even on architectures which have a trap-on-overflow instruction,
compilers don't use it.
You can only compile in INTO opcodes if you can guarantee that the INT 4
(INTO) trap vector will always be set to a proper handler, and since
that isn't part of the ABI, compilers can't depend on it?
I do agree that it would be nice if it did work, barring that clang is
doing the best possible alternative, at close to zero cost except for
the useless branch predictor table entry wastage.
Terje
On x64 in 64-bit mode INTO is among 21 opcodes reassigned as invalid.
One must use JO to detect signed overflow.
Others were repurposed, 1-byte INC and DEC 40..4F became the REX prefix.
On Sun, 06 Oct 2024 08:40:55 GMT, Anton Ertl wrote:
Number of operand types never has been a criterion in any of the RISC
definitions I have seen, nor the number of instructions (although some
people like to go by that).
It’s in the name: “Reduced Instruction Set Computer”.
I always thought it should have been “IRSC”: “Increased Register Set Computer”. The most obvious characteristic, the one that tends to hit
you first, is having lots of registers.
On Sun, 06 Oct 2024 14:03:59 GMT
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
On Sun, 06 Oct 2024 08:40:55 GMT
[email protected] (Anton Ertl) wrote:
AMD are easily provable culprits in this scam: They call their
micro-ops "ROPs", for RISC ops.
Wasn't the term invented by Nexgen for Nx586 and later adopted by AMD
after they scrapped their home brewed core in favor of Nexgen's core
that later became known as AMD K6?
Easily possible. AMD may not have been the original culprit, but they
continued this terminology, and therefore are just as guilty.
- anton
In their defense, AMD's use of the term ROP didn't last for long.
K8 manuals use the better term micro-ops. I don't have K7 manual to
look, but it seems to me that it uses the same terminology as K8.
On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:
In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.
I think I am not explaining the issue well.
I'm not arguing what you want to do with overflow. I'm trying to show
that for all uses of detecting overflow other than crashing with no
recovery, hardware trapping on overflow is a poor approach.
If you enable hardware traps on integer overflow, then to do anything
other than crash the program would require engineering a very complex
set of data structures, roughly the complexity of adding
debug information to the executable, in order to make this work. As
far as I know, no one in the history of computers has yet undertaken
this task.
And yet, this is exactly the kind of data C++ needs in order to
use its Try-Throw-Catch exception model. The stack walker needs
to know where on the stack is the list of stuff to free on block
exit, where are the preserved registers and how many, ...
In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.
I think I am not explaining the issue well.
I'm not arguing what you want to do with overflow. I'm trying to show
that for all uses of detecting overflow other than crashing with no
recovery, hardware trapping on overflow is a poor approach.
If you enable hardware traps on integer overflow, then to do anything
other than crash the program would require engineering a very complex
set of data structures, roughly the complexity of adding
debug information to the executable, in order to make this work. As
far as I know, no one in the history of computers has yet undertaken
this task.
This is because each instruction which overflows would need special
handling, and the "debug" information would be needed. It would be a
huge amount of compiler/linker/runtime complexity.
Kent
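As a contrast to trap-based recovery, the following C sketch shows what "do something other than crash" looks like when the overflow is detected with a branch at the point of the operation. It is an illustration under stated assumptions, not anyone's proposal from this thread: `add_saturating` is a hypothetical name, saturation is just one possible recovery policy, and `__builtin_add_overflow` is a GCC/Clang extension.

```c
#include <limits.h>

/* Add two ints; on signed overflow, clamp to INT_MAX or INT_MIN
 * instead of trapping. The recovery code sits right next to the
 * operation, so no per-instruction side tables are needed to find
 * out which addition overflowed or how to continue. */
static int add_saturating(int a, int b)
{
    int sum;

    if (__builtin_add_overflow(a, b, &sum)) {
        /* Signed overflow only happens when a and b have the same
         * sign, so the sign of b tells us the direction. */
        return b > 0 ? INT_MAX : INT_MIN;
    }
    return sum;
}
```

A hardware trap would have to convey the same information (which instruction, which destination, which policy) to a handler far away from the code, which is the extra compiler, linker and runtime machinery described above.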
Anton Ertl <[email protected]> schrieb:
Michael S <[email protected]> writes:
On Mon, 07 Oct 2024 07:17:02 GMT
[email protected] (Anton Ertl) wrote:
However, in John Mashey's criteria the number of registers plays a
role; he requires >4 bits for the GPR specifier, and >3 bits for
the FPR specifier.
- anton
Which sounds rather arbitrary.
In a way it is, but see below.
Or even worse, like if he wanted for
SPARC to be called 'typical RISC' and for ARM to be called atypical
and had chosen the numbers to match the agenda.
I think that ARM did not exist for John Mashey.
When was his definition made?
ARM was rather late to the RISC game, this might have been literally
true.
On Sun, 6 Oct 2024 14:35:40 +0000, Michael S wrote:
In their defense, AMD's use of the term ROP didn't last for long.
K8 manuals use the better term micro-ops. I don't have K7 manual to
look, but it seems to me that it uses the same terminology as K8.
K9 used the terms micro-ops and meso-ops to describe before and
after peephole optimization. HW was happy to run either as micro-
ops were a strict subset of meso-ops, meso-ops just got more work
done per cycle.
Anton Ertl <[email protected]> schrieb:
I think that ARM did not exist for John Mashey.
When was his definition made?
ARM was rather late to the RISC game, this might have been literally
true.
Somewhat to my surprise, I just read that there was <https://en.wikipedia.org/wiki/RISC_iX>, which would work on many
(but not all) Archimedes models with some additional hardware (in
particular, a hard disk), and that they sold complete workstations
like the R140 that included this hardware ...
Kent Dickey wrote:
And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.
Kent
Because C doesn't require it. That does not make the capability useless.
In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
OK, my post was about how having a hardware trap-on-overflow instruction
(or a mode for existing ALU instructions) is useless for anything OTHER
than as a debug aid where you crash the program on overflow (you can
have a general exception handler to shut down gracefully, but "patching things
up and continuing" doesn't work). I gave details of reasons folks might
want to try to use trap-on-overflow instructions, and show how the
other cases don't make sense.
For me error detection of all kinds is useful. It just happens
to not be conveniently supported in C so no one tries it in C.
GCC's -ftrapv option is not useful for a variety of reasons.
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.
I think I am not explaining the issue well.
I'm not arguing what you want to do with overflow. I'm trying to show that for all uses of detecting overflow other than crashing with no recovery, hardware trapping on overflow is a poor approach.
If you enable hardware traps on integer overflow, then to do anything other than crash the program would require engineering a very complex set of
data structures, roughly approximately the complexity of adding debug information to the executable, in order to make this work. As far as I know, no one in the history of computers has yet undertaken this task.
This is because each instruction which overflows would need special
handling, and the "debug" information would be needed. It would be a huge amount of compiler/linker/runtime complexity.
This is different than most "signal" handlers people have written, where simple inspection of the instruction which failed and the address involved allows it to be "handled". But to do anything other than crash, each instruction which overflows needs special handling unique to that instruction and dependent on what the compiler was in the middle of doing when the overflow happened. This is why trapping just isn't a good idea.
I'm just explaining why trap-on-overflow has gone away, because it's
almost completely useless: hardware trap on overflow is only good for the case that you want to crash on integer overflow. Branch-on-overflow is the correct approach--the compiler can branch to either a trapping instruction (if you just want to crash), or for all other cases of detecting overflow, the compiler branches to "fixup" code.
And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.
Kent
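The branch-on-overflow pattern described above can be sketched in a few lines. This is only an illustration in Python, assuming 32-bit two's-complement integers; the function names (`add32`, `add_crash`, `add_saturate`) are hypothetical, and saturation is just one possible example of "fixup" code, not something proposed in the thread.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def add32(a, b):
    """Model a flag-setting 32-bit ADD: return (wrapped result, overflow flag)."""
    total = a + b
    # Wrap into the signed 32-bit range, as two's-complement hardware would.
    wrapped = (total - INT32_MIN) % 2**32 + INT32_MIN
    return wrapped, wrapped != total

def add_crash(a, b):
    """Branch-on-overflow where the branch target simply traps."""
    result, overflow = add32(a, b)
    if overflow:  # stands in for "branch to a trapping instruction"
        raise OverflowError("signed 32-bit overflow")
    return result

def add_saturate(a, b):
    """Branch-on-overflow where the branch target is fixup code."""
    result, overflow = add32(a, b)
    if overflow:  # both operands share a sign when a signed add overflows
        return INT32_MAX if a >= 0 else INT32_MIN
    return result
```

The point of the sketch is that the fixup lives next to the operation that needs it, so no per-instruction side table is required to recover.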
On 2024-10-07 22:12, MitchAlsup1 wrote:
On Mon, 7 Oct 2024 18:55:26 +0000, Kent Dickey wrote:
In article <efXIO.169388$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In article <O2DHO.184073$[email protected]>,
EricP <[email protected]> wrote:
Kent Dickey wrote:
In no way was I ever arguing that checking for overflow was a bad idea,
or a language issue, or anything else. Just that CPUs should not bother
having trap-on-overflow instructions.
I understand, and I disagree with this conclusion.
I think all forms of software error detection are useful and
HW should make them simple and eliminate cost when possible.
I think I am not explaining the issue well.
I'm not arguing what you want to do with overflow. I'm trying to show
that for all uses of detecting overflow other than crashing with no
recovery, hardware trapping on overflow is a poor approach.
If you enable hardware traps on integer overflow, then to do anything
other than crash the program would require engineering a very complex
set of data structures, roughly approximately the complexity of adding
debug information to the executable, in order to make this work. As
far as I know, no one in the history of computers has yet undertaken
this task.
And yet, this is exactly the kind of data C++ needs in order to
use its Try-Throw-Catch exception model. The stack walker needs
to know where on the stack is the list of stuff to free on block
exit, where are the preserved registers and how many, ...
Ada too.
There are at least two ways to do that (at least for Ada, probably also
for C++):
- Dynamically maintain a stack-like data structure (a chain, linked
list) that describes the current nesting of "code blocks" and their
exception handlers. Whenever the program enters a block with an
exception handler, there is entry code that pushes the description of
that exception handler on this chain, including the address of its code;
and vice versa pop on exiting such a block.
- Statically construct a mapping table that is stored in the executable
and maps code ranges to exception handlers.
Ada implementations started with the dynamic method, which is simpler
but adds some execution cost to all blocks with exception handlers, even
if an exception never happens. Current implementations tend to the
static method, also called "zero-cost exceptions" because there is no
extra execution cost for blocks with exception handlers /unless/ an
exception does occur.
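The two approaches can be contrasted with a small sketch. It is a simplified model in Python, not how any particular Ada or C++ runtime lays out its data; the addresses, handler names and function names are all hypothetical.

```python
from bisect import bisect_right

# Dynamic method: a chain maintained at run time on block entry and exit.
handler_chain = []

def enter_block(handler):
    handler_chain.append(handler)   # cost paid on every entry

def leave_block():
    handler_chain.pop()             # and on every exit

def innermost_handler():
    return handler_chain[-1] if handler_chain else None

# Static method: a table built ahead of time, sorted by start address.
# Each entry is (start_pc, end_pc_exclusive, handler_name).
HANDLER_TABLE = [
    (0x1000, 0x1040, "handler_A"),
    (0x1040, 0x10A0, "handler_B"),
    (0x2000, 0x2010, "handler_C"),
]
STARTS = [start for start, _, _ in HANDLER_TABLE]

def find_handler(pc):
    """Look up the handler covering pc; consulted only when an exception occurs."""
    i = bisect_right(STARTS, pc) - 1
    if i >= 0:
        start, end, handler = HANDLER_TABLE[i]
        if start <= pc < end:
            return handler
    return None
```

In the static variant the normal execution path never touches the table, which is where the "zero-cost" name comes from; the price is the table itself and the lookup when an exception is raised.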
EricP <[email protected]> writes:
Kent Dickey wrote:
And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.
Kent
Because C doesn't require it. That does not make the capability useless.
Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause),
and it's best done with conditional branches, not traps.
Except you keep missing the point:
no one has a handler for integer overflow because it should never
happen. Just like no one has a handler for memory read parity errors.
When you wrote C code using signed integers, *YOU* guaranteed to the
compiler that your code would never overflow. Overflow checking just
detects when you have made an error, just like array bounds checking,
or divide by zero checking.
This is not something being done *to you* against your will,
this is something that you *ask for* because it helps detect your
errors.
Doing it in hardware just makes it efficient.
On 2024-10-09 2:16 p.m., EricP wrote:
Kent Dickey wrote:
But crash on overflow *IS* the correct behavior in 99.999% of cases.
Branch on overflow is ALSO needed in certain rare cases and I showed how
it is easily detected.
And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.
Kent
Because C doesn't require it. That does not make the capability useless.
Removing error detectors does not make the errors go away,
just your knowledge of them.
Slightly confused on trap versus branch. Trapping on overflow is not a
good solution, but a branch on overflow is? A trap is just a slow
branch. The reason for trapping was to improve code density and
non-exceptional performance.
If it is the overhead of performing a trap operation that is the issue,
then a special register could be dedicated to holding the overflow
handler address, and instructions defined to automatically jump through
the overflow handler address register (a branch target address
register).
Overflow detecting instructions are just a fusion of the instruction and
the following branch on overflow operation.
addjo r1,r2,r3 <- does a jump (instead of a trap) to branch register #7
for instance, on overflow.
Having an overflow branch register might be better for code density / performance.
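A toy model may make the addjo idea concrete. This is a rough sketch in Python under several assumptions that the post does not state: 32-bit registers, 4-byte instructions, and the wrapped result still being written to the destination on overflow. The class and field names are hypothetical.

```python
INT32_MIN = -2**31

class ToyCPU:
    def __init__(self):
        self.regs = [0] * 8          # general-purpose registers
        self.branch_regs = [0] * 8   # branch target address registers
        self.pc = 0

    def addjo(self, rd, ra, rb, ovf_breg=7):
        """Add ra and rb into rd; on signed overflow, jump through a branch register."""
        total = self.regs[ra] + self.regs[rb]
        wrapped = (total - INT32_MIN) % 2**32 + INT32_MIN
        self.regs[rd] = wrapped      # assumption: result is written either way
        if wrapped != total:
            self.pc = self.branch_regs[ovf_breg]   # fused "branch on overflow"
        else:
            self.pc += 4             # assumption: fixed 4-byte instructions
```

The handler address is set up once in branch register #7, so each addjo carries no branch displacement of its own, which is where the code-density gain would come from.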
On Wed, 9 Oct 2024 20:12:40 +0000, Robert Finch wrote:
x86 has seriously distorted people's view on how much overhead is
associated with a trap*.
Scott Lurndal wrote:
EricP <[email protected]> writes:
Kent Dickey wrote:
And crash-on-overflow just isn't a popular use model, as I use the example
of x86 in 32-bit mode having a 1-byte INTO instruction which crashes,
and no compiler seems to use it. Especially since branch-on-overflow
is almost as good in every way.
Kent
Because C doesn't require it. That does not make the capability useless.
Other languages do require overflow detection (e.g. COBOL ON OVERFLOW clause),
and it's best done with conditional branches, not traps.
Then you use the overflow branching form for those situations
where you have a specific local overflow handler. Nothing stops that.
But that is not a justification for getting rid of overflow trapping
instructions altogether, as Kent was arguing. And actually it looks to me,
not knowing Cobol, like it should use overflow trapping instructions
UNLESS there is an ON OVERFLOW clause, i.e. that the default should be to
treat overflow as an error unless you explicitly state how to handle it.
The single most canonical test for IBM PC compatibility was Microsoft's
Flight Simulator, taking off from the now demolished Meigs Field in
Chicago.
That game used the OS and BIOS for the loading of the game, and then
went on to direct hardware access for pretty much the rest of the
playing time.
In all cases the vendor of GPU changed ...
On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
ARM was rather late to the RISC game, this might have been literally
true.
ARM was rather early to the RISC game. Shipped for profit since late
1986.
I can remember Flight Simulator being used as the benchmark for
compatibility as far back as 1985. A report on a computer show mentioned
that clone makers were demoing it running on their products.
This is why I feel the term “IBM compatible” was misleading, it should
have been “Microsoft compatible” from at least that point on.
On Mon, 7 Oct 2024 22:26:58 +0300, Michael S wrote:
On Mon, 7 Oct 2024 17:38:54 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
ARM was rather late to the RISC game, this might have been literally
true.
ARM was rather early to the RISC game. Shipped for profit since late
1986.
Shipped in an actual PC, the Acorn Archimedes range.
That was the first time I ever saw a 3D shaded rendition of a flag waving,
on a computer, generated in real time. No other machine could do it,
unless you got up to the really expensive Unix workstation class (e.g.
SGI, custom Evans & Sutherland hardware etc).
Maybe all add/sub/etc opcodes that are immediately followed by an INTO
could be fused into a single ADDO/SUBO/etc version that takes zero extra
cycles as long as the trap part isn't hit?
But then, RISC processors mostly, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned
memory access
The solution for Alpha was to add back the byte and word instructions,
and add misaligned access support to all memory ops.
Kent Dickey wrote:[...]
GCC's -ftrapv option is not useful for a variety of reasons....
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.
So why should any hardware include an instruction to trap-on-overflow?
Because ALL the negative speed and code size consequences do not occur.
That's correct about intrinsics, but incorrect about ADCX/ADOX.
The latter can be moderately helpful in special situations, esp.
128b * 128b => 256b multiplication, but it is never necessary
and for addition/subtraction is not needed at all.
Michael S <[email protected]> writes:
That's correct about intrinsics, but incorrect about ADCX/ADOX.
The latter can be moderately helpful in special situations, esp.
128b * 128b => 256b multiplication, but it is never necessary
and for addition/subtraction is not needed at all.
They are useful if there are two strings of additions. This happens naturally in wide multiplication (also beyond 256b results). But it
also happens when you add three multi-precision numbers (say, X, Y,
Z): You need C for the carry of XYi=X[i]+Y[i]+C, and O for the carry
of XYZ[i]=XYi+Z[i]+O. If you have ADCX/ADOX, you can do both
additions in one loop, so XYi can be in a register and does not need
to be stored. If you don't have these instructions, only ADC, you
need one loop to compute X+Y and store the result in memory, and one
loop to compute XY+Z, i.e., the lack of ADCX/ADOX results in
substantial additional cost.
If you add 4 multi-precision numbers, AMD64 with ADX runs out of carry
bits, so you have to spend the overhead of an additional loop (but not
of two additional loops as without ADCX/ADOX).
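The single-loop, two-carry-chain addition of X, Y and Z can be modelled as follows. This is a Python sketch, not AMD64 code: the variables `c` and `o` stand in for the carry and overflow flags that ADCX and ADOX would use, limbs are assumed to be 64 bits wide and stored least significant first, and the helper names are hypothetical.

```python
MASK = (1 << 64) - 1

def add3_two_chains(x, y, z):
    """Add three equal-length multi-precision numbers in one pass over the limbs."""
    c = 0  # carry chain for X+Y      (the role of CF with ADCX)
    o = 0  # carry chain for XY+Z     (the role of OF with ADOX)
    out = []
    for xi, yi, zi in zip(x, y, z):
        t = xi + yi + c
        xy_i, c = t & MASK, t >> 64   # XYi stays in a "register", never stored
        t = xy_i + zi + o
        r, o = t & MASK, t >> 64
        out.append(r)
    out.append(c + o)                 # the two final carries form the top limb
    return out

def to_limbs(n, count):
    return [(n >> (64 * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

With only one carry flag the intermediate X+Y would have to be written out by a first loop and read back by a second, which is the extra cost the post describes.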
With carry bits in the general purpose registers <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> and 30 GPRs
(one is zero, one is sp), you can add 14 multi-precision numbers per
loop: 14 GPRs for source addresses, 1 GPR for the target address, 1
for the loop counter, 13 registers for loop-carried carry flags.
Of course, the question is if this kind of computation is needed
frequently enough to justify this kind of extension. For
multi-precision multiplication and squaring, Intel considered the
frequency relevant enough to introduce ADCX/ADOX/MULX.
- anton
On Mon, 7 Oct 2024 13:05:53 +0300, Michael S wrote:
In all cases the vendor of GPU changed ...
That, too, added to the problem, in that the software folks had to
rewrite all the performance-intensive bits yet again for the new
machine.
OpenCL never took off because the GPGPU market simply isn’t
competitive enough. NVidia is dominant, AMD plays second fiddle, and
that’s it.
To their defense, AMD's use of the term ROP didn't last for long.
K8 manuals use the better term micro-ops. I don't have K7 manual to
look, but it seems to me that it uses the same terminology as K8.
On Fri, 11 Oct 2024 01:41:58 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
OpenCL never took off because the GPGPU market simply isn’t competitive
enough. NVidia is dominant, AMD plays second fiddle, and that’s it.
I am not sure about dog-tail relationships.
Lawrence D'Oliveiro <[email protected]d> writes:
I can remember Flight Simulator being used as the benchmark for
compatibility as far back as 1985. A report on a computer show mentioned
that clone makers were demoing it running on their products.
This is why I feel the term “IBM compatible” was misleading, it should
have been “Microsoft compatible” from at least that point on.
It was IBM PC compatible, and that was not misleading, because that's
what it was about.
... lots of hardware was Microsoft DOS compatible ...
EricP <[email protected]> writes:
Kent Dickey wrote:[...]
GCC's -ftrapv option is not useful for a variety of reasons.....
1) it's slow, about 50% performance hit
2) it's always on for a compilation unit which is not what programmers need
as it triggers for many false positives so people turn it off.
So why should any hardware include an instruction to trap-on-overflow?
Because ALL the negative speed and code size consequences do not occur.
Looking at <https://godbolt.org/z/oMhW55YsK> and selecting MIPS clang
18.1.0, I get a 15-instruction sequence which does not include add
(the trap-on-overflow version).
MIPS gcc 14.2.0 generates a sequence that includes
jal __addvsi3
i.e., just as for x86-64. Similar for MIPS64 with these compilers.
Interestingly, with RISC-V rv64gc clang 18.1.0, the sequence is much
shorter than for MIPS clang 18.1.0, even though RV64GC has no specific
way of checking overflow at all.
- anton
EricP <[email protected]> writes:
But then, RISC processors mostly, started using exceptions for housekeeping
- SPARC for register window sliding, Alpha for byte, word and misaligned
memory access
On Alpha the assembler expands byte, word and unaligned access
mnemonics into sequences of machine instructions; if you compile for
BWX extensions, byte and word mnemonics get compiled into BWX
instructions. If the machine does not have the BWX extensions and it encounters a BWX instruction, the result is an illegal instruction
signal at least on Linux. This terminates your typical program, so
it's not at all frequent.
Concerning unaligned accesses, if you use a load or store that
requires alignment, Digital OSF/1 (and the later versions with various
names) by default produced a signal rather than fixing it up, so again programs are typically terminated, and the exception is not at all
frequent. There is a system call and a tool (uac) that allows telling
the OS to fix up unaligned accesses, but it played no role in my
experience while I was still using Digital OSF/1 (including its
successors).
On Linux the default behaviour was to fix up the unaligned accesses
and to log that in the system log. There were a few such messages in
the log per day, so that obviously was not a frequent occurrence,
either. I wrote a program that allowed me to change the behaviour <https://www.complang.tuwien.ac.at/anton/uace.c>, mainly because I
wanted to get a signal when an unaligned access happens.
As for the unaligned-access mnemonics, these were obviously barely
used: I found that gas generates wrong code for ustq several years
after Alpha was introduced, so obviously no software running under
Linux has used this mnemonic.
The solution for Alpha was to add back the byte and word instructions,
and add misaligned access support to all memory ops.
Alpha added BWX instructions, but not because it had used trapping to
emulate them earlier; old or portable binaries continued to use
instruction sequences. Alpha traps when you do, e.g., an unaligned
ldq in all Alpha implementations I have had contact with (up to a
800MHz 21264B).
- anton
I see where I'm going wrong: I'm trying to talk about the machines
designed to run MS-DOS and later Windows, not just the CPUs. The vast
range of hardware that all had substantial degrees of compatibility as regards booting, busses and so on. Those things let their manufacturers compete for the DOS and Windows market, whereas x86-based machines that weren't PC-compatible only succeeded in quite specialised niches.
Those hardware suppliers did not close off access to the more advanced features of i386 onwards, because they had no reason to, and that let
Linux take advantage of all that hardware when it came along. That's the point I was failing to make.
Thomas Koenig <[email protected]> writes:
Anton Ertl <[email protected]> schrieb:
[in-memory database]
but the question is if
the machine has enough RAM for the database. Our dual-Xeon system
from IIRC 2007 has 24GB of RAM, not sure how big it could be
configured; OTOH, we have a single-Xeon system from 2009 or so with
32GB of RAM (and there were bigger Xeons in the market at the time).
The minimum requirement of SAP HANA is 64 GB of memory, but typical
ranges are from 256GB to 1TB.
What is the relevance of SAP HANA for the topic at hand?
The question was if the RAM can hold the data. For each account they
would have to keep the current balance (64 bits should be enough for
that), the account number (64 bits for the up to 19 digits of a Visa
card) for verifying that we are at the correct entry in the hash table
and probably some account status information (64 bits should be
plenty?).
There is also the sequence of transactions (a 64-bit transaction
offset in the log per transaction should be enough for that). The
sequence of transactions may be useful for fraud detection, but I
don't know enough about that to know how to scale the system, so I'll
just say that fraud detection is done by a bigger system before the transaction goes through to the transaction processing computer.
The sequence of transactions is also needed for generating the reports
and for dealing with customer complaints, but again, that's not
processing the transactions themselves (and is basically read-only,
except that the customer-complaint processing may result in additional transactions).
So, with 24 bytes needed for each account on the
transaction-processing server, 32GB with, say 8GB left for
copy-on-write and other administrative purposes should be good for
about 900M accounts at a hash table load factor of 84%. I guess that
Visa has more accounts, so one would need a box with more RAM.
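The estimate above can be reproduced with a few lines of arithmetic. This is a sketch in Python that assumes "GB" means binary gigabytes (2^30 bytes), which is the reading that matches the "about 900M" figure; the variable names are only illustrative.

```python
GIB = 2**30

total_ram = 32 * GIB          # RAM in the box
reserved = 8 * GIB            # copy-on-write and other administrative use
bytes_per_account = 3 * 8     # balance, account number, status: 64 bits each
load_factor = 0.84            # fraction of hash table slots actually occupied

slots = (total_ram - reserved) // bytes_per_account
accounts = int(slots * load_factor)

print(f"{slots:,} slots, about {accounts / 1e6:.0f}M accounts")
```

With decimal gigabytes instead, the same calculation gives 840M accounts, so the conclusion that a bigger box would be needed does not depend on the unit.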
A single core of the Xeon should easily be able to handle all the 56K transactions per second, both the logging and the update of the hash
table, and in that case no locking is needed. But that first needs a sequence of transactions coming in.
I think that transaction center wants to keep more information
than a single copy of data in the RAM: with single copy any
memory corruption could mean loss of hours of transaction data
which is equivalent to quite a lot of cash. So I suspect
that there are layers of redundancy built in.