Forum: >>> Magnum BBS <<<

OoO execution (was: The Seymour Cray Era of Supercomputers)

From Anton Ertl@21:1/5 to quadibloc on Mon May 19 06:22:42 2025

quadibloc <[email protected]> writes:

Eventually, IBM caught up with the Control
Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
it with cache in the 360/195. From the Pentium II onwards, that's the
way computers are made nowadays.

Pipelining and caches are already used on the MIPS R2000 in 1986, and
the 486 in 1989.

You are probably thinking of OoO execution, where people usually write
as if the Tomasulo algorithm of the 360/91 implemented the modern
concept of OoO execution. But the 360/91 only did OoO for FP, did not
support branch prediction, had imprecise exceptions, and the Tomasulo
algorithm was used primarily as a workaround for the dearth of FP
registers in the S/360.

The innovation that made OoO execution generally usable rather than a
publicity stunt like the 360/91 is the reorder buffer (ROB), which
allows instructions to be retired in order, and speculatively
"executed" instructions to be canceled after an exception or branch
misprediction.

The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
microprocessors which have full-blown OoO execution.
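
To make the role of the ROB described above concrete, here is a minimal
Python sketch. It is not a model of any of the machines mentioned; the
names (`ReorderBuffer`, `RobEntry`, `fault`) are invented for illustration.
Instructions may finish in any order, but they commit only from the head
of the buffer, and when the head has raised an exception, it and everything
younger is discarded. (For a mispredicted branch, a real design would
presumably retire the branch itself and squash only the younger entries.)

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    name: str
    done: bool = False    # result has been computed, possibly out of order
    fault: bool = False   # this instruction raised an exception


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest entry on the left
        self.retired = []        # names of committed instructions

    def dispatch(self, name):
        entry = RobEntry(name)
        self.entries.append(entry)
        return entry

    def retire(self):
        """Commit finished instructions from the head; return squashed names."""
        while self.entries and self.entries[0].done:
            if self.entries[0].fault:
                # Discard the faulting instruction and everything younger.
                squashed = [e.name for e in self.entries]
                self.entries.clear()
                return squashed
            self.retired.append(self.entries.popleft().name)
        return []


rob = ReorderBuffer()
load, div, add = (rob.dispatch(n) for n in ("load", "div", "add"))

add.done = True                    # the youngest instruction finishes first
first_try = rob.retire()           # nothing commits: the head is not done yet

load.done = True
div.done = True
div.fault = True                   # e.g. a divide exception
squashed = rob.retire()            # "load" commits, "div" and "add" are dropped
```

The point of the sketch is that `add` computed its result early, yet its
result never became architecturally visible, which is what gives precise
exceptions.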

But as someone pointed out to me, IBM implemented OoO execution
between the 370/195 and the Pentium Pro: The ES/9000 models 900 and
820 (shipping from September 1991) "were the first models with
out-of-order execution since the System/370-195 of 1973. However
unlike the old S/360-91-derived systems, the models 900 and 820 had
full out-of-order execution for both integer and floating-point units,
with precise exception handling, and a fully superscalar pipeline." <https://en.wikipedia.org/wiki/IBM_System/390#ES/9000>. So apparently
they had a ROB, and AFAIK were the first machines to have one. These
models also had a branch target buffer; the article does not mention
branch prediction proper, but given a ROB and a branch target buffer,
it would be surprising if they did not predict branches.

So who came up with the concept of the ROB? I recently looked at one
of the HPS papers (Hwu, Patt, Shebanow on a High Performance Substrate
for the VAX from the mid-late 80s) again, and there was no ROB in that
paper. I did not check whether their later papers had it.
So apparently ROBs were not known in the mid-1980s in
academia, and in 1991 there was hardware with a ROB commercially
available, and a few years later it appeared in microprocessors.

I wonder how early and how much IBM talked about their ES/9000 OoO
implementation and features, but that may have inspired the teams at
Intel, HP and SGI; or maybe there was an earlier source that inspired
them all, but only in 1995/1996 the number of transistors on a chip
was enough to do OoO on a microprocessor.

Ironically, in the transition to CMOS (i.e., microprocessors) IBM
mainframe processors regressed back to in-order (and, I think,
single-issue) again (but with higher clock rates), and in the early
2000s they looked pretty outdated to me. In the meantime they have re-progressed to OoO again AFAIK.

Back to OoO: it's interesting that Tomasulo and the 360/91 are
mentioned often, but the ROB and its inventor(s?), which are at least
as important for the success of OoO execution, aren't.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to Anton Ertl on Mon May 19 17:10:51 2025

It appears that Anton Ertl <[email protected]> said:

quadibloc <[email protected]> writes:

Eventually, IBM caught up with the Control
Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
it with cache in the 360/195. From the Pentium II onwards, that's the
way computers are made nowadays.

Pipelining and caches are already used on the MIPS R2000 in 1986, and
the 486 in 1989.

You are probably thinking of OoO execution, where people usually write
as if the Tomasulo algorithm of the 360/91 implemented the modern
concept of OoO execution. But the 360/91 only did OoO for FP, did not
support branch prediction, had imprecise exceptions, and the Tomasulo
algorithm was used primarily as a workaround for the dearth of FP
registers in the S/360.

The 360/91 had primitive branch prediction in "loop mode". It had an
eight-doubleword instruction queue (which it confusingly called a stack).
If a program did a backward branch of less than eight doublewords, it'd
stop prefetching and execute out of the queue until the program fell or
branched out. It was occasionally worth tweaking assembly code to get
a loop to start on a doubleword boundary (the CNOP assembler op) so it'd
fit and run in loop mode.
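
A small Python sketch of why the alignment mattered. The exact rule the
hardware applied is not spelled out here, so the sketch assumes that the
whole loop has to lie within eight consecutive aligned doublewords; the
function names and the 60-byte loop are hypothetical.

```python
DOUBLEWORD = 8          # bytes per S/360 doubleword
QUEUE_DOUBLEWORDS = 8   # queue size as described in the post


def doublewords_spanned(start, end):
    """Number of aligned doublewords touched by the byte range [start, end)."""
    return (end - 1) // DOUBLEWORD - start // DOUBLEWORD + 1


def fits_loop_mode(loop_start, loop_end):
    # Assumed rule: the whole loop must sit in the queue at once.
    return doublewords_spanned(loop_start, loop_end) <= QUEUE_DOUBLEWORDS


LOOP_BYTES = 60   # hypothetical loop length, including the closing branch
aligned = fits_loop_mode(0x1000, 0x1000 + LOOP_BYTES)      # starts on a boundary
unaligned = fits_loop_mode(0x1006, 0x1006 + LOOP_BYTES)    # starts 6 bytes later
print(aligned, unaligned)
```

Under that assumption the same 60-byte loop touches eight doublewords when
it starts on a boundary and nine when it starts six bytes later, which is
the kind of case where padding with CNOP would pay off.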

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Anton Ertl@21:1/5 to John Levine on Mon May 19 17:46:45 2025

John Levine <[email protected]> writes:

The 360/91 had primitive branch prediction in "loop mode". It had an
eight doubleword instruction queue (which it confusingly called a stack.)
If a program did a backward branch of less than eight doublewords, it'd
stop prefetching and execute out of the queue until the program fell or
branched out.

The 68010 had a similar feature (with a smaller buffer), but I don't
think one would call it branch prediction. In any case, I meant
speculative execution based on branch prediction (but did not write it
that way), and the 360/91 did not do speculative execution AFAIK.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From EricP@21:1/5 to Anton Ertl on Mon May 19 14:33:47 2025

Anton Ertl wrote:

quadibloc <[email protected]> writes:

Eventually, IBM caught up with the Control
Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
it with cache in the 360/195. From the Pentium II onwards, that's the
way computers are made nowadays.

Pipelining and caches are already used on the MIPS R2000 in 1986, and
the 486 in 1989.

You are probably thinking of OoO execution, where people usually write
as if the Tomasulo algorithm of the 360/91 implemented the modern
concept of OoO execution. But the 360/91 only did OoO for FP, did not
support branch prediction, had imprecise exceptions, and the Tomasulo
algorithm was used primarily as a workaround for the dearth of FP
registers in the S/360.

The innovation that made OoO execution generally usable rather than a
publicity stunt like the 360/91 is the reorder buffer (ROB), which
allows instructions to be retired in order, and speculatively
"executed" instructions to be canceled after an exception or branch
misprediction.

The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
1995-11-02), and MIPS R10000 (introduced 1996-01) are the first microprocessors which have full-blown OoO execution.

But as someone pointed out to me, IBM has implemented OoO execution
between the 370/195 and the Pentium Pro: The ES/9000 models 900 and
820 (shipping from September 1991) "were the first models with
out-of-order execution since the System/370-195 of 1973. However
unlike the old S/360-91-derived systems, the models 900 and 820 had
full out-of-order execution for both integer and floating-point units,
with precise exception handling, and a fully superscalar pipeline." <https://en.wikipedia.org/wiki/IBM_System/390#ES/9000>. So apparently
they had a ROB, and AFAIK were the first machines to have one. These
models also had a branch target buffer; the article does not mention
branch prediction proper, but given a ROB and a branch target buffer,
it would be surprising if they did not predict branches.

So who came up with the concept of the ROB? I recently looked at one
of the HPS papers (Hwu, Patt, Shebanow on a High Performance Substrate
for the VAX from the mid-late 80s) again, and there was no ROB in that
paper. I did not check whether their later papers had it.
So apparently ROBs were not known in the mid-1980s in
academia, and in 1991 there was hardware with a ROB commercially
available, and a few years later it appeared in microprocessors.

There were a number of papers that circled around the various ideas.
"Decoupled Access Execute Computer Architectures" uses queues to link
the hardware modules together.
"Implementing Precise Interrupts in Pipelined Processors" first mentions
the ROB but has no renamer and only limited OoO ability.
HPS has rename, reservation stations, and multiple FUs but no ROB.

I don't know in what machine all the pieces came together at once
but it looks like about 1986 they figured out how to use multiple
pipelines AND rename AND future file AND a ROB AND reservation stations
AND multiple function units AND forwarding buses.

Decoupled Access Execute Computer Architectures,
James E. Smith, 1982

Instruction Issue Logic in Pipelined Supercomputers
Shlomo Weiss, James E. Smith, 1984

Implementing Precise Interrupts in Pipelined Processors,
James E. Smith, A. R. Pleszkun, 1985

HPS - A New Microarchitecture Rationale And Introduction,
Yale N. Patt, Wen-mei Hwu, and Michael Shebanow, 1985

I wonder how early and how much IBM talked about their ES/9000 OoO
implementation and features, but that may have inspired the teams at
Intel, HP and SGI; or maybe there was an earlier source that inspired
them all, but only in 1995/1996 the number of transistors on a chip
was enough to do OoO on a microprocessor.

Ironically, in the transition to CMOS (i.e., microprocessors) IBM
mainframe processors regressed back to in-order (and, I think,
single-issue) again (but with higher clock rates), and in the early
2000s they looked pretty outdated to me. In the meantime they have re-progressed to OoO again AFAIK.

Back to OoO: it's interesting that Tomasulo and the 360/91 are
mentioned often, but the ROB and its inventor(s?), which are at least
as important for the success of OoO execution, aren't.

- anton

From Ze@21:1/5 to All on Mon May 19 19:09:12 2025

Wasn't one of the earliest forms of branch prediction the simple
heuristic of always taking it in one direction and not taking it in the
other direction? I seem to remember that being the case for some of the
early pipelined microprocessors. I believe it was called static branch
prediction, as opposed to the more modern dynamic branch prediction.

Nicholas (Nick) King

From Terje Mathisen@21:1/5 to quadibloc on Mon May 19 22:04:22 2025

quadibloc wrote:

On Mon, 19 May 2025 6:22:42 +0000, Anton Ertl wrote:

You are probably thinking of OoO Execution, where people usually write
as if the Tomasulo algorithm of the 360/91 as implemented the modern
concept of OoO execution. But the 360/91 only did OoO for FP, did not
support branch prediction, had imprecise exceptions, and the Tomasulo
algorithm was used primarily as a workaround for the dearth of FP
registers in the S/360.

Yes, I was thinking of OoO execution, as opposed to other forms of
pipelining - basic pipelining was used in the 7094 II and even the 6502.

The Pentium II (and Pentium Pro) also only used OoO for floating-point,
while the 68050 only used OoO for integers!

Huh???

The Pentium (all versions) had two pipes (u & v), both in-order, and
with severe limitations on which opcodes could run in v in parallel with
the primary opcode in the u pipe.

The P6/PentiumPro OTOH does true OoO for all instruction types.

John, you are usually much better informed!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Michael S@21:1/5 to Terje Mathisen on Mon May 19 23:27:05 2025

On Mon, 19 May 2025 22:04:22 +0200
Terje Mathisen <[email protected]> wrote:

John, you are usually much better informed!

I don't think so. John is as uninformed as usual.
I think he is repeating this particular bit of nonsense about the PPro
for at least the 3rd time, and every single time he was corrected.

Terje

From Michael S@21:1/5 to Anton Ertl on Mon May 19 23:41:15 2025

On Mon, 19 May 2025 06:22:42 GMT
[email protected] (Anton Ertl) wrote:

The innovation that made OoO execution generally usable rather than a
publicity stunt like the 360/91 is the reorder buffer (ROB), which
allows instructions to be retired in order, and speculatively
"executed" instructions to be canceled after an exception or branch
misprediction.

The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
1995-11-02), and MIPS R10000 (introduced 1996-01) are the first microprocessors which have full-blown OoO execution.

What about PPC604? It had more limited OoO resources than the 3
processors you mentioned above, esp. a smaller number of reservation
stations, but it most certainly had reorder buffers, 16 of them.
So, by your own definitions, it should be called the first single-chip
full-blown CPU.

From Lawrence D'Oliveiro@21:1/5 to All on Tue May 20 00:04:03 2025

On Mon, 19 May 2025 19:09:12 +0000, Ze wrote:

Wasn't one of the earliest forms of branch prediction the simple
heuristic of always taking it in one direction and not taking it in the
other direction, I seem to remember that being the case for some of the
early pipelined microprocessors. I believe it was called static branch
prediction compared to the more modern dynamic branch prediction.

The simple heuristic I remember was to assume that backward branches would
be more likely to be taken than not (on the grounds that they were
probably loops) while forward ones would more likely not be taken (I guess
as an excuse for not disturbing the pipeline too much).
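
A minimal Python sketch of that static rule. The addresses and the trace
are invented for illustration and are not measured data; `predict_taken`
and `accuracy` are hypothetical helper names.

```python
def predict_taken(branch_addr, target_addr):
    """Static rule: backward branches taken, forward branches not taken."""
    return target_addr < branch_addr


def accuracy(trace):
    """trace: iterable of (branch_addr, target_addr, actually_taken)."""
    trace = list(trace)
    hits = sum(predict_taken(b, t) == taken for b, t, taken in trace)
    return hits / len(trace)


# Synthetic trace: a loop-closing branch taken 9 times before falling
# through, and a forward branch taken 4 times out of 10.
loop_branch = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
forward_branch = [(0x108, 0x118, False)] * 6 + [(0x108, 0x118, True)] * 4
print(accuracy(loop_branch + forward_branch))   # 0.75 for this made-up trace
```

The predictor needs no state at all, only a comparison of the target
against the branch address (in hardware, the sign of the displacement).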

From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue May 20 00:01:18 2025

On Mon, 19 May 2025 23:41:15 +0300, Michael S wrote:

What about PPC604? It had more limited OoO resources than the 3
processors you mentioned above, esp. a smaller number of reservation
stations, but it most certainly had reorder buffers, 16 of them.
So, by your own definitions, it should be called the first single-chip
full-blown CPU.

Was it a PowerPC 604-based Apple Mac that was the first PC to exceed the then-current US Department of Defense threshold for the definition of a “supercomputer”? I think it might have been 1 gigaFLOPS at the time. (Or
is that too high for the time?)

That meant it was subject to export restrictions. I remember Apple making
a lot of publicity about it at the time.

Of course, the threshold was raised soon after.

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue May 20 00:30:28 2025

On Tue, 20 May 2025 0:04:03 +0000, Lawrence D'Oliveiro wrote:

On Mon, 19 May 2025 19:09:12 +0000, Ze wrote:

Wasn't one of the earliest forms of branch prediction the simple
heuristic of always taking it in one direction and not taking it in the
other direction, I seem to remember that being the case for some of the
early pipelined microprocessors. I believe it was called static branch
prediction compared to the more modern dynamic branch prediction.

The simple heuristic I remember was to assume that backward branches
would be more likely to be taken than not (on the grounds that they were
probably loops) while forward ones would more likely not be taken (I
guess as an excuse for not disturbing the pipeline too much).

CDC 7600 used this scheme. Backwards taken, forwards not-taken.
Was about 70% accurate for essentially zero storage and 1 (or few)
gates.

This scheme might have been limited in scope (backwards into the
instruction stack was predicted taken, farther than the stack was
predicted not-taken); I don't remember exactly.

From Scott Lurndal@21:1/5 to [email protected] on Tue May 20 13:52:17 2025

[email protected] (MitchAlsup1) writes:

On Tue, 20 May 2025 0:04:03 +0000, Lawrence D'Oliveiro wrote:

On Mon, 19 May 2025 19:09:12 +0000, Ze wrote:

Wasn't one of the earliest forms of branch prediction the simple
heuristic of always taking it in one direction and not taking it in the
other direction, I seem to remember that being the case for some of the
early pipelined microprocessors. I believe it was called static branch
prediction compared to the more modern dynamic branch prediction.

The simple heuristic I remember was to assume that backward branches
would be more likely to be taken than not (on the grounds that they were
probably loops) while forward ones would more likely not be taken (I
guess as an excuse for not disturbing the pipeline too much).

CDC 7600 used this scheme. Backwards taken, forwards not-taken.
Was about 70% accurate for essentially zero storage and 1 (or few)
gates.

Burroughs B4900 re-wrote the branch opcode on each branch to reflect
the last two taken vs. not-taken choices. There were four opcodes
for each type of branch - taken/taken, taken/not-taken, not-taken/taken
and not-taken/not-taken.
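
A rough Python sketch of that opcode-rewriting idea. The mnemonics
(`BR_TT` etc.) are invented here, and the prediction policy in
`predict_taken` is purely an assumption, since the description above only
says what history the four opcodes record, not how it was used.

```python
# First letter: older outcome, second letter: most recent outcome
# (T = taken, N = not taken). The mnemonics are made up for this sketch.
def rewrite_opcode(opcode, taken):
    """Return the opcode variant to store back after the branch executes."""
    most_recent = opcode[-1]
    return "BR_" + most_recent + ("T" if taken else "N")


def predict_taken(opcode):
    # Assumed policy (not from the post): predict taken unless both
    # recorded outcomes were not-taken.
    return opcode != "BR_NN"


opcode = "BR_NN"
seen = []
for outcome in (True, True, False, False):
    opcode = rewrite_opcode(opcode, outcome)
    seen.append(opcode)
print(seen)
```

The history lives in the instruction stream itself, so no separate
predictor table is needed, at the price of writing to code on every branch.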

From Anton Ertl@21:1/5 to Michael S on Tue May 20 21:21:07 2025

Michael S <[email protected]> writes:

On Mon, 19 May 2025 06:22:42 GMT
[email protected] (Anton Ertl) wrote:

The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
microprocessors which have full-blown OoO execution.

What about PPC604? It had more limited OoO resources than the 3
processors you mentioned above, esp. a smaller number of reservation
stations, but it most certainly had reorder buffers, 16 of them.
So, by your own definitions, it should be called the first single-chip
full-blown CPU.

Yes. The OoO nature with ROB is explained in <https://arstechnica.com/articles/paedia/cpu/ppc-1.ars/6>.

Somehow that did not register with me earlier (even though a colleague
had a Mac with a PPC 604e IIRC). I guess it's because Apple Marketing
is low on technical details, and if Motorola emphasized this aspect,
that did not pass the filters of the press. Also, IIRC the
performance was not so exceptional that it would direct a spotlight at
the underlying technology, whereas the Pentium Pro with its surprising
SPECint win certainly did. Finally, the successors of the 604 (in
particular, the PPC 7450) did not progress much further with OoO
execution and still had only mild OoO capabilities at a time when the
Pentium 4 already had a 128-entry ROB (and other structure sizes to
match). So given the lack of ambition in the 7450, I did not even
think about the possibility that the 604 might have been the first
microprocessor with OoO execution.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From George Neuner@21:1/5 to Anton Ertl on Wed May 21 12:52:57 2025

On Mon, 19 May 2025 17:46:45 GMT, [email protected]
(Anton Ertl) wrote:

John Levine <[email protected]> writes:

The 360/91 had primitive branch prediction in "loop mode". It had an
eight doubleword instruction queue (which it confusingly called a stack.)
If a program did a backward branch of less than eight doublewords, it'd
stop prefetching and execute out of the queue until the program fell or
branched out.

The 68010 had a similar feature (with a smaller buffer), but I don't
think one would call it branch prediction. In any case, I meant
speculative execution based on branch prediction (but did not write it
that way), and the 360/91 did not do speculative execution AFAIK.

- anton

Most DSPs have some kind of "loop buffer" from which they can execute
without fetching code from memory.

From Stefan Monnier@21:1/5 to All on Wed May 21 13:14:35 2025

Most DSPs have some kind of "loop buffer" from which they can execute
without fetching code from memory.

And Mitch's My 66000 `VEC` instruction takes the idea a step further.

Stefan

From moi@21:1/5 to George Neuner on Wed May 21 18:47:21 2025

On 21/05/2025 17:52, George Neuner wrote:

On Mon, 19 May 2025 17:46:45 GMT, [email protected]
(Anton Ertl) wrote:

John Levine <[email protected]> writes:

The 360/91 had primitive branch prediction in "loop mode". It had an
eight doubleword instruction queue (which it confusingly called a stack.)
If a program did a backward branch of less than eight doublewords, it'd
stop prefetching and execute out of the queue until the program fell or
branched out.

The 68010 had a similar feature (with a smaller buffer), but I don't
think one would call it branch prediction. In any case, I meant
speculative execution based on branch prediction (but did not write it
that way), and the 360/91 did not do speculative execution AFAIK.

- anton

Most DSPs have some kind of "loop buffer" from which they can execute
without fetching code from memory.

The Ferranti Atlas 2 and the EE KDF9 are both prior art.

--
Bill F.

From Thomas Koenig@21:1/5 to Anton Ertl on Thu May 29 19:02:11 2025

Anton Ertl <[email protected]> wrote:

quadibloc <[email protected]> writes:

Eventually, IBM caught up with the Control
Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
it with cache in the 360/195. From the Pentium II onwards, that's the
way computers are made nowadays.

Pipelining and caches are already used on the MIPS R2000 in 1986, and
the 486 in 1989.

Or the 801. That may have been the first machine to have
separate I- and D-caches (was it?)

From MitchAlsup1@21:1/5 to Thomas Koenig on Thu May 29 20:06:21 2025

On Thu, 29 May 2025 19:02:11 +0000, Thomas Koenig wrote:

Anton Ertl <[email protected]> wrote:

quadibloc <[email protected]> writes:

Eventually, IBM caught up with the Control
Data 6600 by perfecting pipelining in the IBM 360/91,

At the cost of about 3× the number of gates and power along with
a 60% increase in the clock rate (60ns versus 100ns). This advantage
vanished about the time of first /91 deliveries with CDC 7600 going
to a ~27ns clock along with pipelining and concurrent calculation.

and then combining it with cache in the 360/195.

A last gasp for leadership in Big number crunching for IBM.

From the Pentium II onwards, that's the way computers are made nowadays.

Once everyone can afford the gates to make pipeline staging latches
it is the natural way to design. Prior to this point, the designers
were more focused on "getting it onto a single die" than getting
the highest possible performance--often limited by the speed of
the external interface more than calculations inside.

Pipelining and caches are already used on the MIPS R2000 in 1986, and
the 486 in 1989.

Or the 801. That may have been the first machine to have
separate I- and D-caches (was it?)

Without disagreeing with the above::

MIPS R2000 (and R3000) had a unified cache--read twice per cycle on
clock high and clock low. R3000 was faster in writing (STs) to the
cache than R2000. Tablewalks in SW via a big hash table.

Mc68010 had a "loop buffer" of a couple handful of instructions.
Mc68020 had 256B instruction cache, no TLB
Mc68030 had 256B I$, 256B D$ and ~32E TLB, tablewalks in HW

Mc88100 had 16KB I$ with 64E TLB, 16KB D$ with 64E TLB, tablewalks
in HW.

The CDC 6600 had a multi-word instruction stack, and the 7600 had a
significantly larger instruction stack with backward branch prediction.
Base+Bounds memory protection on the 6600. Context switch in ~16 cycles
by writing out current state while reading in new state.

Many machines overlapped Fetch-DECODE with EXECUTE-WRITEBACK all the
way back to the beginning as a 2 stage pipeline. This, alone, makes the
point where pipelining "took over" difficult to judge.

From Lawrence D'Oliveiro@21:1/5 to All on Thu May 29 22:20:07 2025

On Thu, 29 May 2025 20:06:21 +0000, MitchAlsup1 wrote:

quadibloc <[email protected]> writes:

Eventually, IBM caught up with the Control Data 6600 by perfecting
pipelining in the IBM 360/91,

At the cost of about 3× the number of gates and power along with a 60% increase in the clock rate (60ns versus 100ns). This advantage vanished
about the time of first /91 deliveries with CDC 7600 going to a ~27ns
clock along with pipelining and concurrent calculation.

Like I said, part of IBM’s tradition of overpromising and
underdelivering.

But it served its purpose, that of dissuading customers from buying
the CDC product.

Mc68010 had a "loop buffer" of a couple handful of instructions.
Mc68020 had 256B instruction cache no TLB
Mc68030 had 256B I$ 256B D$ and ~32E TLB tablewalks in HW

As I recall, the ’030 wasn’t that much of an advance over the ’020.
But the 68040 was a major step forward. And the 68060 wasn’t too bad,
either. But by that time the major customer (Apple) had lost interest.
I think it was used in some Amiga machines.

From David Schultz@21:1/5 to All on Thu May 29 18:36:49 2025

On 5/29/25 3:06 PM, MitchAlsup1 wrote:

Mc68010 had a "loop buffer" of a couple handful of instructions.

Not exactly. In the very specific case of the decrement and branch on
condition it could lock up its prefetch queue (two words) and
instruction register. Since the dbcc instruction was two words, this
meant it only worked with single word instructions.

Faster but very limited.

--
http://davesrocketworks.com
David Schultz
"The cheaper the crook, the gaudier the patter." - Sam Spade

From Michael S@21:1/5 to Anton Ertl on Fri May 30 13:28:39 2025

On Tue, 20 May 2025 21:21:07 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Mon, 19 May 2025 06:22:42 GMT
[email protected] (Anton Ertl) wrote:

The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
microprocessors which have full-blown OoO execution.

What about PPC604? It had more limited OoO resources than the 3
processors you mentioned above, esp. a smaller number of reservation
stations, but it most certainly had reorder buffers, 16 of them.
So, by your own definitions, it should be called the first
single-chip full-blown CPU.

Yes. The OoO nature with ROB is explained in <https://arstechnica.com/articles/paedia/cpu/ppc-1.ars/6>.

Somehow that did not register with me earlier (even though a colleague
had a Mac with a PPC 604e IIRC). I guess it's because Apple Marketing
is low on technical details, and if Motorola emphasized this aspect,
that did not pass the filters of the press. Also, IIRC the
performance was not so exceptional that it would direct a spotlight at
the underlying technology, whereas the Pentium Pro with its surprising
SPECint win certainly did. Finally, the successors of the 604 (in
particular, the PPC 7450) did not progress much further with OoO
execution

From a uArch perspective, PPC/MPC 7xx and 7xxx are really successors of
the 603 rather than of the 604.

The thing closest to a microarchitectural successor of the 604 (via the
ill-fated 620) is POWER3, but that one was aimed at a completely
different market. An offspring that attempted to re-enter the PC
processor market was the PPC970 (a red-headed little brother of POWER4).
This foray was terminated by Steve Jobs (he always preferred Intel, but
until this millennium did not possess the political power to impose his
preferences on the technical team) after lasting for about 3 years.

and still had only mild OoO capabilities at a time when the
Pentium 4 already had a 128-entry ROB (and other structure sizes to
match). So given the lack of ambition in the 7450, I did not even
think about the possibility that the 604 might have been the first
microprocessor with OoO execution.

- anton

From Al Kossow@21:1/5 to All on Fri May 30 10:51:14 2025

Steve Jobs (he always preferred Intel but until this millennium did not
possess political power to impose his preferences on technical team)
lasting for about 3 years.

His hardware products at NeXT prove this is nonsense.
The last NeXT prototype that I saw in a Moto lab in Austin
used the 88110.

He was completely capable of forcing his will on Apple hardware
engineers. Project leads who disagreed were let go or put into
continuation engineering.

The switch was pragmatic and forced because of the weak PPC
roadmap, especially in the portable space.

From Scott Lurndal@21:1/5 to Al Kossow on Fri May 30 19:36:53 2025

Al Kossow <[email protected]> writes:

Steve Jobs (he always preferred Intel but until this millennium did not
possess political power to impose his preferences on technical team)
lasting for about 3 years.

His hardware products at NeXT prove this is nonsense.
The last NeXT prototype that I saw in a Moto lab in Austin
used the 88110.

He was completely capable of forcing his will on Apple hardware
engineers. Project leads who disagreed were let go or put into
continuation engineering.

The switch was pragmatic and forced because of the weak PPC
roadmap, especially in the portable space.

We (Unisys) had some systems designed around the 88100 in
that time frame. Apple's decision to go to PPC rather than
the 88110 caused us to evaluate all the current available
processors (SPARC, MIPS, x86, and PPC). For rather pragmatic
reasons (the target machine used the Intel Paragon backplane),
the Pentium Pro was the ultimate choice, used to build the
OPUS family of massively parallel (yet single-system image)
computer systems.

From Lawrence D'Oliveiro@21:1/5 to Al Kossow on Fri May 30 22:05:31 2025

On Fri, 30 May 2025 10:51:14 -0700, Al Kossow wrote:

Steve Jobs (he always preferred Intel but until this millennium did not
possess political power to impose his preferences on technical team)
lasting for about 3 years.

His hardware products at NeXT prove this is nonsense.

Also, the entire history of the development of the first-generation
Macintosh -- Motorola all the way, even after the switch from 68K to
PowerPC.

The switch was pragmatic and forced because of the weak PPC
roadmap, especially in the portable space.

That’s why the last-gasp PowerPC processor that was used in any Macintosh, the G5, came from IBM, not Motorola. I think the hope was that IBM would
step in where Motorola was faltering. But that hope didn’t last long.

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Fri May 30 23:01:14 2025

On Fri, 30 May 2025 22:05:31 +0000, Lawrence D'Oliveiro wrote:

On Fri, 30 May 2025 10:51:14 -0700, Al Kossow wrote:

Steve Jobs (he always preferred Intel but until this millennium did not
possess political power to impose his preferences on technical team)
lasting for about 3 years.

His hardware products at NeXT prove this is nonsense.

Also, the entire history of the development of the first-generation
Macintosh -- Motorola all the way, even after the switch from 68K to
PowerPC.

Steve had power over Murray Goldman: Goldman believed that Apple's
volume would pay for the fab, and thus the entire product line
consumed by Apple could be sold at marginal production cost.
Jobs knew nobody else would deliver product at this kind of cost
structure.

I have suspected Dell and Intel had/have a similar arrangement.

The switch was pragmatic and forced because of the weak PPC
roadmap, especially in the portable space.

That’s why the last-gasp PowerPC processor that was used in any Macintosh,
the G5, came from IBM, not Motorola. I think the hope was that IBM would
step in where Motorola was faltering. But that hope didn’t last long.


From Anton Ertl@21:1/5 to Scott Lurndal on Sat May 31 07:57:27 2025

[email protected] (Scott Lurndal) writes:

We (Unisys) had some systems designed around the 88100 in
that time frame. Apple's decision to go to PPC rather than
the 88110 caused us to evaluate all the current available
processors (SPARC, MIPS, x86, and PPC). For rather pragmatic
reasons (the target machine used the Intel Paragon backplane),
the Pentium Pro was the ultimate choice, used to build the
OPUS family of massively parallel (yet single-system image)
computer systems.

Likewise, Data General's AViiON line of Unix workstations and
servers was based on the 88100, and I worked with them in 1990 and
1991. When Motorola gave up the 88k line, DG gave up on Motorola and
switched to Intel. That worked for a while, but apparently they were
not successful enough with this line of business, and got bought by
EMC for DG's CLARiiON line of disk array storage products. So going
for Intel was no panacea, either. It worked well enough for Unisys to
survive, though.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Michael S on Sat May 31 08:10:14 2025

Michael S <[email protected]> writes:

On Tue, 20 May 2025 21:21:07 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

[PPC604]

Yes. The OoO nature with ROB is explained in
<https://arstechnica.com/articles/paedia/cpu/ppc-1.ars/6>.

Somehow that did not register with me earlier (even though a colleague
had a Mac with a PPC 604e IIRC). I guess it's because Apple Marketing
is low on technical details, and if Motorola emphasized this aspect,
that did not pass the filters of the press. Also, IIRC the
performance was not so exceptional that it would direct a spotlight at
the underlying technology, whereas the Pentium Pro with its surprising
SPECint win certainly did. Finally, the successors of the 604 (in
particular, the PPC 7450) did not progress much further with OoO
execution

From a uArch perspective, PPC/MPC 7xx and 7xxx are really successors of
the 603 rather than of the 604.

Looking at <https://arstechnica.com/features/2004/08/ppc-1/>, the 603e
is also a full-blown OoO CPU (it does not describe the 603 in enough
detail to establish that for the 603);
<https://en.wikipedia.org/wiki/PowerPC_600#PowerPC_603> says that the
603 is OoO, but does not give details. In any case, the 603 appeared
around the same time as the 604, so the 604 might be the first
full-blown OoO CPU even if the 603 also is OoO.

In any case, the 750 and 7450 are full-blown OoO machines, but still
with relatively small buffers: the 7450 (introduced 2001) has a
six-entry integer queue, and apparently only three of the entries can
be used for reordering (Chips and Cheese calls this a three-entry
scheduler preceded by a three-entry non-scheduling queue); similarly,
the vector side has a two-entry scheduler preceded by a two-entry
non-scheduling queue. I can't find information about the size of the
reorder buffer of the 7450, but it has 16 rename registers for the
GPRs and 16 rename registers for the vector registers, which is
indicative of its reordering capabilities.

By contrast, the AMD K7 (1999) has a 15-entry integer scheduler, a
36-entry FP scheduler, a 72-entry ROB, 88 FP registers (-8 for the 8
architectural 387 registers), and IIRC 72 integer registers (-8 for
the 8 architectural GPRs). I wonder why they needed 80 FP rename
registers if they could reorder only across 72 instructions. In any
case, this appeared earlier and had much more reordering capability
than the PPC 7450. I wonder why the 7450 designers chose to have so
little reorder capacity. Too much space spent on Altivec? Did they
have a design that was hard to scale to more entries? Power
consumption?
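
To make the rename-register arithmetic above concrete, here is a rough sketch of a toy model. It assumes (my simplification, not something any of the cited sources state) that every in-flight instruction occupies one ROB entry and writes exactly one destination register, so the number of instructions in flight is bounded by the smaller of the two resources. The function name `inflight_limit` is made up for illustration.

```python
from typing import Optional


def inflight_limit(rob_entries: Optional[int], rename_regs: Optional[int]) -> int:
    """Upper bound on in-flight register-writing instructions in the toy model.

    Assumption: each instruction needs one ROB entry and one rename register.
    A value of None means "size not known", so it imposes no bound here.
    """
    limits = [n for n in (rob_entries, rename_regs) if n is not None]
    if not limits:
        raise ValueError("at least one resource size must be known")
    return min(limits)


# K7 numbers from the post: physical registers minus 8 architectural ones.
k7_fp_rename = 88 - 8    # FP rename registers
k7_int_rename = 72 - 8   # integer rename registers (the 72 is "IIRC" in the post)
k7_rob = 72

k7_fp_limit = inflight_limit(k7_rob, k7_fp_rename)
k7_int_limit = inflight_limit(k7_rob, k7_int_rename)

# PPC 7450: ROB size not found, 16 GPR rename registers.
ppc7450_gpr_limit = inflight_limit(None, 16)

print("K7 FP:", k7_fp_limit, "unused FP rename regs:", k7_fp_rename - k7_fp_limit)
print("K7 int:", k7_int_limit)
print("7450 GPR:", ppc7450_gpr_limit)
```

In this model the K7 could never use the last 8 of its 80 FP rename registers, which is the puzzle raised above; real hardware may differ in ways the model ignores (for example, ROB entries that do not correspond one-to-one to register-writing instructions).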

An offspring that attempted to re-enter the PC processor market was the
PPC970 (a red-headed little brother of POWER4). This foray lasted about
3 years before it was terminated by Steve Jobs (he always preferred
Intel, but until this millennium did not possess the political power to
impose his preferences on the technical team).

The PPC970 was marketed by Steve Jobs, as usual, as the best thing
since sliced bread, but in my work, it was slower than contemporary
IA-32 and AMD64 systems:

All the numbers below are execution times in seconds, so lower means
faster:

From <https://www.complang.tuwien.ac.at/franz/latex-bench>

- PowerMac G5, 2000MHz PPC970, Gentoo Linux PPC64 1.47
- Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit) 0.76

From <https://cgit.git.savannah.gnu.org/cgit/gforth.git/plain/Benchres>

sieve  bubble  matrix  fib
0.279  0.411   0.183   0.519  0.7.0; PPC970 2GHz (PowerMac G5); gcc-4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
0.245  0.287   0.156   0.376  0.7.0; Pentium 4 Northwood 2.26GHz; gcc-2.95.4 20011002 (Debian prerelease)
0.216  0.268   0.112   0.340  0.7.0; K8 2GHz (Opteron 270); gcc-4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
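
For readers who prefer ratios to raw seconds, the following small sketch divides the PPC970 times quoted above by the times of the same-clocked AMD machines. The numbers are copied from the listings above; the dictionary layout and the helper name `slowdown` are just illustrative choices.

```python
# Execution times in seconds (lower is faster), copied from the post.
gforth_times = {
    "PPC970 2GHz": {"sieve": 0.279, "bubble": 0.411, "matrix": 0.183, "fib": 0.519},
    "Pentium 4 Northwood 2.26GHz": {"sieve": 0.245, "bubble": 0.287, "matrix": 0.156, "fib": 0.376},
    "K8 2GHz": {"sieve": 0.216, "bubble": 0.268, "matrix": 0.112, "fib": 0.340},
}
latex_times = {"PPC970 2GHz": 1.47, "Athlon 64 3200+ 2GHz": 0.76}


def slowdown(times, machine, baseline):
    """Per-benchmark ratio time(machine) / time(baseline); above 1 means slower."""
    return {bench: times[machine][bench] / times[baseline][bench]
            for bench in times[machine]}


ppc_vs_k8 = slowdown(gforth_times, "PPC970 2GHz", "K8 2GHz")
latex_ratio = latex_times["PPC970 2GHz"] / latex_times["Athlon 64 3200+ 2GHz"]

for bench, ratio in ppc_vs_k8.items():
    print(f"{bench:7s} PPC970 takes {ratio:.2f}x the K8 time")
print(f"latex   PPC970 takes {latex_ratio:.2f}x the Athlon 64 time")
```

At the same 2GHz clock, the PPC970 takes roughly 1.3 to 1.6 times as long as the K8 on the Gforth benchmarks and about 1.9 times as long on the LaTeX benchmark. Note that the compilers and operating systems differ between the rows, so these ratios are not a pure hardware comparison.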

Many years later I did some microbenchmarking
<https://www.complang.tuwien.ac.at/anton/undefined-div-bench/> that
involved division instructions. I was surprised by the slowness of
the PPC970 on this microbenchmark. Results in cycles per iteration
(lower is better):

ooomb  oooub
 41.9   41.9  PPC 7447A (iBook G4) ppc (32-bit) gcc-4.3.2
130.0  130.0  PPC 970 (PowerMac G5) ppc64 gcc-4.4.5

(but note that this compares 32-bit with 64-bit division).

As for the question of why Apple switched to Intel, it appeared pretty
clear to me: At the level where they were working, the M of AIM could
not keep up in the GHz race, the I in AIM finally managed to get the
GHz (that the performance was subpar did not register with me at the
time; Apple marketing worked:-) with the PPC970, but that was too
power-hungry for laptops. Probably both M and I demanded more money
from A in order to develop in the direction that A was asking them to.
A did not want to provide the money, so they developed for their
specific markets: M developed what the embedded market asked for, I
developed for servers and workstations. My guess is that Intel
already had the high-performance laptop CPUs that Apple needed,
because that was in their market, so they could offer Apple a better
deal, and that's how it went.

The irony is that P.A. Semi worked on the kind of CPUs that Apple
wanted, lost their prospective customer with this move by Apple, was
then bought by Apple in 2008, and their workforce became part of what
is now Apple Silicon, and worked on the chips that first powered
iPhones and later displaced Intel from Apple's laptop and desktop
computers.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From John Savard@21:1/5 to Terje Mathisen on Wed Jul 16 14:27:11 2025

On Mon, 19 May 2025 22:04:22 +0200, Terje Mathisen wrote:

quadibloc wrote:

The Pentium II (and Pentium Pro) also only used OoO for floating-point,
while the 68050 only used OoO for integers!

Huh???

The Pentium (all versions) had two pipes (u & v), both in-order, and
with severe limitations on which opcodes could run in v in parallel with
the primary opcode in the u pipe.

The P6/PentiumPro OTOH does true OoO for all instruction types.

John, you are usually much better informed!

I had read somewhere that the Pentium Pro and the Pentium II, like the
System/360 Model 91, were OoO only in their floating-point pipelines. If
that source was faulty, and better sources say differently, I'll need to
check on it.

John Savard


From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 18:27:38 2025

On Wed, 16 Jul 2025 14:27:11 +0000, John Savard wrote:

On Mon, 19 May 2025 22:04:22 +0200, Terje Mathisen wrote:

quadibloc wrote:

The Pentium II (and Pentium Pro) also only used OoO for floating-point,
while the 68050 only used OoO for integers!

Huh???

The Pentium (all versions) had two pipes (u & v), both in-order, and
with severe limitations on which opcodes could run in v in parallel with
the primary opcode in the u pipe.

The P6/PentiumPro OTOH does true OoO for all instruction types.

John, you are usually much better informed!

I had read somewhere that the Pentium Pro and the Pentium II, like the
System/360 Model 91, were OoO only in their floating-point pipelines. If
that source was faulty, and better sources say differently, I'll need to
check on it.

The Anderson papers indicate the /91 was just heavily pipelined in
the integer side.

I don't know about PPro in the integer section, but it was definitely
OoO in branches, the memory section, and in the FPU. So, I don't see
why they would not have had integer OoO.

John Savard


From Stefan Monnier@21:1/5 to All on Wed Jul 16 17:45:20 2025

I don't know about PPro in the integer section, but it was definitely
OoO in branches, the memory section, and in the FPU. So, I don't see
why they would not have had integer OoO.

For the /91 I can see some potential simplifications from keeping some
parts in-order, but for the PPro it seems to me that the requirement of
precise exceptions means that having some parts OoO and some parts
in-order wouldn't give much benefit (if any).

Stefan


From Lawrence D'Oliveiro@21:1/5 to All on Sat Jul 26 02:45:56 2025

On Wed, 16 Jul 2025 18:27:38 +0000, MitchAlsup1 wrote:

The Anderson papers indicate the /91 was just heavily pipelined in the integer side.

Not good enough to keep up with CDC?

After about two years of promising that they would blow CDC out of the
water ...


From John Savard@21:1/5 to Lawrence D'Oliveiro on Thu Jul 31 20:38:38 2025

On Sat, 26 Jul 2025 02:45:56 +0000, Lawrence D'Oliveiro wrote:

Not good enough to keep up with CDC?

After about two years of promising that they would blow CDC out of the
water ...

The IBM System/360 Model 91 wasn't even good enough to keep up with the
Model 85.

However, IBM still realized that OoO was useful, even if it delivered less
than the promised improvement in performance. So they went on to the
Model 195 which added cache to the Model 91 design. That did work well
enough that *I think* it actually did out-perform the CDC machines of the
time.

Even if it didn't, it performed well, and could have been considered a
superior alternative - the CDC 6600 had reliability problems, I remember
reading. So it would only have had to come close to the 7600 or whatever
CDC had at the time in such a situation.

John Savard


From Michael S@21:1/5 to John Savard on Fri Aug 1 15:02:19 2025

On Thu, 31 Jul 2025 20:38:38 -0000 (UTC)
John Savard <[email protected]d> wrote:

On Sat, 26 Jul 2025 02:45:56 +0000, Lawrence D'Oliveiro wrote:

Not good enough to keep up with CDC?

After about two years of promising that they would blow CDC out of
the water ...

The IBM System/360 Model 91 wasn't even good enough to keep up with
the Model 85.

However, IBM still realized that OoO was useful, even if it delivered
less than the promised improvement in performance. So they went on to
the Model 195 which added cache to the Model 91 design. That did work
well enough that *I think* it actually did out-perform the CDC
machines of the time.

From what I see in Wikipedia, it looks like all "number-crunching
oriented" S/360 models, i.e. 85, 91 and 195, were failures from a
business POV, even if to slightly different degrees (85 less bad).
Maybe the S/370 Model 195 was more successful; I was not able to find
info about the number of units shipped.

But, then again, the CDC 7600, despite its excellent performance, was
significantly less successful commercially than the 6600. So maybe it
was just a bad era for that type of machine.

Even if it didn't, it performed well, and could have been considered a
superior alternative - the CDC 6600 had reliability problems, I
remember reading. So it would only have had to come close to the 7600
or whatever CDC had at the time in such a situation.

John Savard


From John Levine@21:1/5 to All on Fri Aug 1 15:44:52 2025

According to Michael S <[email protected]>:

From what I see in Wikipedia, it looks like all "number-crunching
oriented" S/360 models, i.e. 85, 91 and 195, were failures from a
business POV, even if to slightly different degrees (85 less bad).
Maybe the S/370 Model 195 was more successful; I was not able to find
info about the number of units shipped.

Neither can I but I don't think it was very many.

The /91 was a very unbalanced machine. For general computing
like compilers it was about the same speed as the /85, but
for floating point codes it was twice as fast or more depending
on how well the code was tuned to the /91.

The IBM history book says the /85 was a technical success largely
due to the cache but didn't sell well, partly due to poor economic
conditions, partly because customers wanted something faster and
cheaper built using integrated circuits.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

