• OoO execution (was: The Seymour Cray Era of Supercomputers)

    From Anton Ertl@21:1/5 to quadibloc on Mon May 19 06:22:42 2025
    quadibloc <[email protected]> writes:
    Eventually, IBM caught up with the Control
    Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
    it with cache in the 360/195. From the Pentium II onwards, that's the
    way computers are made nowadays.

    Pipelining and caches are already used on the MIPS R2000 in 1986, and
    the 486 in 1989.

    You are probably thinking of OoO Execution, where people usually write
    as if the Tomasulo algorithm of the 360/91 had implemented the modern
    concept of OoO execution. But the 360/91 only did OoO for FP, did not
    support branch prediction, had imprecise exceptions, and the Tomasulo
    algorithm was used primarily as a workaround for the dearth of FP
    registers in the S/360.

    The innovation that made OoO execution generally usable rather than a
    publicity stunt like the 360/91 is the reorder buffer (ROB), which
    allows instructions to be retired in order, and speculatively
    "executed" instructions to be cancelled after an exception or branch
    misprediction.

    The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
    1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
    microprocessors which have full-blown OoO execution.
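
    The role of the reorder buffer described above can be illustrated with
    a minimal sketch. This is a toy model, not any real processor's design:
    the class and field names are hypothetical, and "fault" stands in for
    either an exception or a detected branch misprediction.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    dest: str            # architectural register the instruction writes
    value: int = 0
    done: bool = False   # result produced (possibly out of order)
    fault: bool = False  # exception or misprediction detected


class ReorderBuffer:
    """Toy ROB: results arrive in any order, state commits in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.regs = {}          # committed architectural state

    def dispatch(self, dest):
        entry = RobEntry(dest)
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, fault=False):
        entry.value, entry.done, entry.fault = value, True, fault

    def retire(self):
        """Commit finished instructions from the head; stop at the first
        unfinished one. On a fault, discard all younger instructions."""
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.fault:
                self.entries.clear()  # cancel speculative work
                return head
            self.regs[head.dest] = head.value
        return None
```

    In this model an instruction that finishes early cannot change the
    committed registers until everything older has retired, which is what
    makes exceptions precise and speculation cancellable.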

    But as someone pointed out to me, IBM has implemented OoO execution
    between the 370/195 and the Pentium Pro: The ES/9000 models 900 and
    820 (shipping from September 1991) "were the first models with
    out-of-order execution since the System/370-195 of 1973. However
    unlike the old S/360-91-derived systems, the models 900 and 820 had
    full out-of-order execution for both integer and floating-point units,
    with precise exception handling, and a fully superscalar pipeline."
    <https://en.wikipedia.org/wiki/IBM_System/390#ES/9000>. So apparently
    they had a ROB, and AFAIK were the first machines to have one. These
    models also had a branch target buffer; the article does not mention
    branch prediction proper, but given a ROB and a branch target buffer,
    it would be surprising if they did not predict branches.

    So who came up with the concept of the ROB? I recently looked at one
    of the HPS papers (Hwu, Patt, Shebanow on a High Performance Substrate
    for the VAX from the mid-late 80s) again, and there was no ROB in that
    paper. I did not check whether their later papers had it. So
    apparently ROBs were not known in the mid-1980s in academia, and in
    1991 there was hardware with a ROB commercially available, and a few
    years later it appeared in microprocessors.

    I wonder how early and how much IBM talked about their ES/9000 OoO
    implementation and features; that may have inspired the teams at
    Intel, HP and SGI. Or maybe there was an earlier source that inspired
    them all, but only in 1995/1996 was the number of transistors on a
    chip enough to do OoO on a microprocessor.

    Ironically, in the transition to CMOS (i.e., microprocessors) IBM
    mainframe processors regressed to in-order (and, I think,
    single-issue) execution, though with higher clock rates, and in the
    early 2000s they looked pretty outdated to me. In the meantime they
    have progressed to OoO again AFAIK.

    Back to OoO: it's interesting that Tomasulo and the 360/91 are
    mentioned often, but the ROB and its inventor(s?), which are at least
    as important for the success of OoO execution, aren't.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to Anton Ertl on Mon May 19 17:10:51 2025
    It appears that Anton Ertl <[email protected]> said:
    quadibloc <[email protected]> writes:
    Eventually, IBM caught up with the Control
    Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
    it with cache in the 360/195. From the Pentium II onwards, that's the
    way computers are made nowadays.

    Pipelining and caches are already used on the MIPS R2000 in 1986, and
    the 486 in 1989.

    You are probably thinking of OoO Execution, where people usually write
    as if the Tomasulo algorithm of the 360/91 had implemented the modern
    concept of OoO execution. But the 360/91 only did OoO for FP, did not
    support branch prediction, had imprecise exceptions, and the Tomasulo
    algorithm was used primarily as a workaround for the dearth of FP
    registers in the S/360.

    The 360/91 had primitive branch prediction in "loop mode". It had an
    eight-doubleword instruction queue (which it confusingly called a
    stack). If a program did a backward branch of less than eight
    doublewords, it'd stop prefetching and execute out of the queue until
    the program fell or branched out. It was occasionally worth tweaking
    assembly code to get a loop to start on a doubleword boundary (the
    CNOP assembler op) so it'd fit and run in loop mode.
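
    A rough sketch of why the alignment tweak could matter. The exact
    hardware condition is not given here, so this assumes that the whole
    loop, from the branch target through the last byte of the backward
    branch, must lie within eight aligned doublewords. The function name
    and the 4-byte default branch length are illustrative assumptions.

```python
DOUBLEWORD = 8         # bytes per doubleword
QUEUE_DOUBLEWORDS = 8  # size of the instruction queue ("stack")


def fits_loop_mode(branch_addr, target_addr, branch_len=4):
    """Return True if a backward branch at branch_addr to target_addr
    spans at most QUEUE_DOUBLEWORDS aligned doublewords (assumed rule)."""
    if target_addr > branch_addr:
        return False  # forward branch: not a loop
    first = target_addr // DOUBLEWORD
    last = (branch_addr + branch_len - 1) // DOUBLEWORD
    return last - first + 1 <= QUEUE_DOUBLEWORDS


# A 64-byte loop fits when it starts on a doubleword boundary ...
aligned = fits_loop_mode(branch_addr=0x103C, target_addr=0x1000)
# ... but the same loop shifted by 4 bytes straddles nine doublewords.
shifted = fits_loop_mode(branch_addr=0x1040, target_addr=0x1004)
```

    Under this assumed rule, padding in front of the loop (as CNOP does)
    is what turns the second case into the first.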

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Anton Ertl@21:1/5 to John Levine on Mon May 19 17:46:45 2025
    John Levine <[email protected]> writes:
    The 360/91 had primitive branch prediction in "loop mode". It had an
    eight-doubleword instruction queue (which it confusingly called a
    stack). If a program did a backward branch of less than eight
    doublewords, it'd stop prefetching and execute out of the queue until
    the program fell or branched out.

    The 68010 had a similar feature (with a smaller buffer), but I don't
    think one would call it branch prediction. In any case, I meant
    speculative execution based on branch prediction (but did not write it
    that way), and the 360/91 did not do speculative execution AFAIK.

    - anton
  • From EricP@21:1/5 to Anton Ertl on Mon May 19 14:33:47 2025
    Anton Ertl wrote:
    quadibloc <[email protected]> writes:
    Eventually, IBM caught up with the Control
    Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
    it with cache in the 360/195. From the Pentium II onwards, that's the
    way computers are made nowadays.

    Pipelining and caches are already used on the MIPS R2000 in 1986, and
    the 486 in 1989.

    You are probably thinking of OoO Execution, where people usually write
    as if the Tomasulo algorithm of the 360/91 had implemented the modern
    concept of OoO execution. But the 360/91 only did OoO for FP, did not
    support branch prediction, had imprecise exceptions, and the Tomasulo
    algorithm was used primarily as a workaround for the dearth of FP
    registers in the S/360.

    The innovation that made OoO execution generally usable rather than a
    publicity stunt like the 360/91 is the reorder buffer (ROB), which
    allows instructions to be retired in order, and speculatively
    "executed" instructions to be cancelled after an exception or branch
    misprediction.

    The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
    1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
    microprocessors which have full-blown OoO execution.

    But as someone pointed out to me, IBM has implemented OoO execution
    between the 370/195 and the Pentium Pro: The ES/9000 models 900 and
    820 (shipping from September 1991) "were the first models with
    out-of-order execution since the System/370-195 of 1973. However
    unlike the old S/360-91-derived systems, the models 900 and 820 had
    full out-of-order execution for both integer and floating-point units,
    with precise exception handling, and a fully superscalar pipeline."
    <https://en.wikipedia.org/wiki/IBM_System/390#ES/9000>. So apparently
    they had a ROB, and AFAIK were the first machines to have one. These
    models also had a branch target buffer; the article does not mention
    branch prediction proper, but given a ROB and a branch target buffer,
    it would be surprising if they did not predict branches.

    So who came up with the concept of the ROB? I recently looked at one
    of the HPS papers (Hwu, Patt, Shebanow on a High Performance Substrate
    for the VAX from the mid-late 80s) again, and there was no ROB in that
    paper. I did not check whether their later papers had it. So
    apparently ROBs were not known in the mid-1980s in academia, and in
    1991 there was hardware with a ROB commercially available, and a few
    years later it appeared in microprocessors.

    There were a number of papers that circled around the various ideas.
    "Decoupled Access Execute Computer Architectures" uses queues to link
    the hardware modules together.
    "Implementing Precise Interrupts in Pipelined Processors" first mentions
    the ROB but has no renamer and only limited OoO ability.
    HPS has rename, reservation stations, and multiple FUs but no ROB.

    I don't know in what machine all the pieces came together at once,
    but it looks like about 1986 they figured out how to use multiple
    pipelines AND rename AND future file AND a ROB AND reservation
    stations AND multiple function units AND forwarding buses.

    Decoupled Access Execute Computer Architectures,
    James E. Smith, 1982

    Instruction Issue Logic in Pipelined Supercomputers
    Shlomo Weiss, James E. Smith, 1984

    Implementing Precise Interrupts in Pipelined Processors,
    James E. Smith, A. R. Pleszkun, 1985

    HPS - A New Microarchitecture Rationale And Introduction,
    Yale N. Patt, Wen-mei Hwu, and Michael Shebanow, 1985

    I wonder how early and how much IBM talked about their ES/9000 OoO
    implementation and features; that may have inspired the teams at
    Intel, HP and SGI. Or maybe there was an earlier source that inspired
    them all, but only in 1995/1996 was the number of transistors on a
    chip enough to do OoO on a microprocessor.

    Ironically, in the transition to CMOS (i.e., microprocessors) IBM
    mainframe processors regressed to in-order (and, I think,
    single-issue) execution, though with higher clock rates, and in the
    early 2000s they looked pretty outdated to me. In the meantime they
    have progressed to OoO again AFAIK.

    Back to OoO: it's interesting that Tomasulo and the 360/91 are
    mentioned often, but the ROB and its inventor(s?), which are at least
    as important for the success of OoO execution, aren't.

    - anton

  • From Ze@21:1/5 to All on Mon May 19 19:09:12 2025
    Wasn't one of the earliest forms of branch prediction the simple
    heuristic of always taking it in one direction and not taking it in the
    other direction? I seem to remember that being the case for some of the
    early pipelined microprocessors. I believe it was called static branch
    prediction, as opposed to the more modern dynamic branch prediction.

    Nicholas (Nick) King

  • From Terje Mathisen@21:1/5 to quadibloc on Mon May 19 22:04:22 2025
    quadibloc wrote:
    On Mon, 19 May 2025 6:22:42 +0000, Anton Ertl wrote:

    You are probably thinking of OoO Execution, where people usually write
    as if the Tomasulo algorithm of the 360/91 had implemented the modern
    concept of OoO execution.  But the 360/91 only did OoO for FP, did not
    support branch prediction, had imprecise exceptions, and the Tomasulo
    algorithm was used primarily as a workaround for the dearth of FP
    registers in the S/360.

    Yes, I was thinking of OoO execution, as opposed to other forms of
    pipelining - basic pipelining was used in the 7094 II and even the 6502.

    The Pentium II (and Pentium Pro) also only used OoO for floating-point,
    while the 68050 only used OoO for integers!

    Huh???

    The Pentium (all versions) had two pipes (u & v), both in-order, and
    with severe limitations on which opcodes could run in v in parallel with
    the primary opcode in the u pipe.

    The P6/PentiumPro OTOH does true OoO for all instruction types.

    John, you are usually much better informed!

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Mon May 19 23:27:05 2025
    On Mon, 19 May 2025 22:04:22 +0200
    Terje Mathisen <[email protected]> wrote:


    John, you are usually much better informed!


    I don't think so. John is as uninformed as usual.
    I think he is repeating this particular bit of nonsense about the PPro
    for at least the 3rd time, and every single time he was corrected.

  • From Michael S@21:1/5 to Anton Ertl on Mon May 19 23:41:15 2025
    On Mon, 19 May 2025 06:22:42 GMT
    [email protected] (Anton Ertl) wrote:


    The innovation that made OoO execution generally usable rather than a
    publicity stunt like the 360/91 is the reorder buffer (ROB), which
    allows instructions to be retired in order, and speculatively
    "executed" instructions to be cancelled after an exception or branch
    misprediction.

    The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
    1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
    microprocessors which have full-blown OoO execution.


    What about PPC604? It had more limited OoO resources than the 3
    processors you mentioned above, esp. fewer reservation stations,
    but it most certainly had reorder buffers, 16 of them.
    So, by your own definitions, it should be called the first single-chip
    full-blown OoO CPU.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue May 20 00:04:03 2025
    On Mon, 19 May 2025 19:09:12 +0000, Ze wrote:

    Wasn't one of the earliest forms of branch prediction the simple
    heuristic of always taking it in one direction and not taking it in the
    other direction? I seem to remember that being the case for some of the
    early pipelined microprocessors. I believe it was called static branch
    prediction, as opposed to the more modern dynamic branch prediction.

    The simple heuristic I remember was to assume that backward branches would
    be more likely to be taken than not (on the grounds that they were
    probably loops) while forward ones would more likely not be taken (I guess
    as an excuse for not disturbing the pipeline too much).
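
    That heuristic (often called "backward taken, forward not taken")
    needs nothing but the sign of the branch displacement. A minimal
    sketch follows; the function names and the trace format of
    (branch address, target address, actually taken) tuples are
    illustrative assumptions, not taken from any particular machine.

```python
def predict_taken(branch_addr, target_addr):
    """Static prediction: backward branches are assumed to be loops."""
    return target_addr < branch_addr


def accuracy(trace):
    """Fraction of correct predictions over (branch, target, taken) tuples."""
    hits = sum(predict_taken(b, t) == taken for b, t, taken in trace)
    return hits / len(trace)


# A loop-closing branch executed 10 times: taken 9 times, then falls through.
loop_trace = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
loop_accuracy = accuracy(loop_trace)
```

    The only misprediction in this made-up trace is the final loop exit;
    real programs, with data-dependent forward branches, will do worse.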

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue May 20 00:01:18 2025
    On Mon, 19 May 2025 23:41:15 +0300, Michael S wrote:

    What about PPC604? It had more limited OoO resources than the 3
    processors you mentioned above, esp. fewer reservation stations,
    but it most certainly had reorder buffers, 16 of them.
    So, by your own definitions, it should be called the first single-chip
    full-blown OoO CPU.

    Was it a PowerPC 604-based Apple Mac that was the first PC to exceed the
    then-current US Department of Defense threshold for the definition of a
    “supercomputer”? I think it might have been 1 gigaFLOPS at the time. (Or
    is that too high for the time?)

    That meant it was subject to export restrictions. I remember Apple making
    a lot of publicity about it at the time.

    Of course, the threshold was raised soon after.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue May 20 00:30:28 2025
    On Tue, 20 May 2025 0:04:03 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 19 May 2025 19:09:12 +0000, Ze wrote:

    Wasn't one of the earliest forms of branch prediction the simple
    heuristic of always taking it in one direction and not taking it in the
    other direction? I seem to remember that being the case for some of the
    early pipelined microprocessors. I believe it was called static branch
    prediction, as opposed to the more modern dynamic branch prediction.

    The simple heuristic I remember was to assume that backward branches
    would be more likely to be taken than not (on the grounds that they
    were probably loops) while forward ones would more likely not be taken
    (I guess as an excuse for not disturbing the pipeline too much).

    CDC 7600 used this scheme. Backwards taken, forwards not-taken.
    Was about 70% accurate for essentially zero storage and 1 (or few)
    gates.

    This scheme might have been limited in scope (backwards into the
    instruction stack was predicted taken, farther than the stack was
    predicted not-taken); I don't remember exactly.

  • From Scott Lurndal@21:1/5 to [email protected] on Tue May 20 13:52:17 2025
    [email protected] (MitchAlsup1) writes:
    On Tue, 20 May 2025 0:04:03 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 19 May 2025 19:09:12 +0000, Ze wrote:

    Wasn't one of the earliest forms of branch prediction the simple
    heuristic of always taking it in one direction and not taking it in the
    other direction? I seem to remember that being the case for some of the
    early pipelined microprocessors. I believe it was called static branch
    prediction, as opposed to the more modern dynamic branch prediction.

    The simple heuristic I remember was to assume that backward branches
    would be more likely to be taken than not (on the grounds that they
    were probably loops) while forward ones would more likely not be taken
    (I guess as an excuse for not disturbing the pipeline too much).

    CDC 7600 used this scheme. Backwards taken, forwards not-taken.
    Was about 70% accurate for essentially zero storage and 1 (or few)
    gates.

    Burroughs B4900 re-wrote the branch opcode on each branch to reflect
    the last two taken vs. not-taken choices. There were four opcodes
    for each type of branch - taken/taken, taken/not-taken, not-taken/taken
    and not-taken/not-taken.
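
    A small sketch of that idea: the branch carries its own two-outcome
    history, encoded as one of four opcode variants. How the B4900 turned
    the two recorded outcomes into a prediction is not stated above, so
    the predict() rule here (follow the most recent outcome) and the
    older/newer ordering of the pair are assumptions, as are all names.

```python
VARIANT_NAMES = {True: "taken", False: "not-taken"}


class SelfRecordingBranch:
    """Branch whose 'opcode' encodes its last two outcomes (older, newer)."""

    def __init__(self):
        self.history = (False, False)

    def variant(self):
        older, newer = self.history
        return f"{VARIANT_NAMES[older]}/{VARIANT_NAMES[newer]}"

    def predict(self):
        # Assumed policy: predict the same direction as the last outcome.
        return self.history[1]

    def execute(self, taken):
        # "Rewrite the opcode": shift in the newest outcome.
        self.history = (self.history[1], taken)


branch = SelfRecordingBranch()
for outcome in (True, True, False):
    branch.execute(outcome)
```

    After the three executions shown, the branch is in its
    taken/not-taken variant; the storage cost is zero because the
    history lives in the instruction itself.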

  • From Anton Ertl@21:1/5 to Michael S on Tue May 20 21:21:07 2025
    Michael S <[email protected]> writes:
    On Mon, 19 May 2025 06:22:42 GMT
    [email protected] (Anton Ertl) wrote:
    The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
    1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
    microprocessors which have full-blown OoO execution.


    What about PPC604? It had more limited OoO resources than the 3
    processors you mentioned above, esp. fewer reservation stations,
    but it most certainly had reorder buffers, 16 of them.
    So, by your own definitions, it should be called the first single-chip
    full-blown OoO CPU.

    Yes. The OoO nature with ROB is explained in <https://arstechnica.com/articles/paedia/cpu/ppc-1.ars/6>.

    Somehow that did not register with me earlier (even though a colleague
    had a Mac with a PPC 604e IIRC). I guess it's because Apple Marketing
    is low on technical details, and if Motorola emphasized this aspect,
    that did not pass the filters of the press. Also, IIRC the
    performance was not so exceptional that it would direct a spotlight at
    the underlying technology, whereas the Pentium Pro with its surprising
    SPECint win certainly did. Finally, the successors of the 604 (in
    particular, the PPC 7450) did not progress much further with OoO
    execution and still had only mild OoO capabilities at a time when the
    Pentium 4 already had a 128-entry ROB (and other structure sizes to
    match). So given the lack of ambition in the 7450, I did not even
    think about the possibility that the 604 might have been the first
    microprocessor with OoO execution.

    - anton
  • From George Neuner@21:1/5 to Anton Ertl on Wed May 21 12:52:57 2025
    On Mon, 19 May 2025 17:46:45 GMT, [email protected]
    (Anton Ertl) wrote:

    John Levine <[email protected]> writes:
    The 360/91 had primitive branch prediction in "loop mode". It had an
    eight-doubleword instruction queue (which it confusingly called a
    stack). If a program did a backward branch of less than eight
    doublewords, it'd stop prefetching and execute out of the queue until
    the program fell or branched out.

    The 68010 had a similar feature (with a smaller buffer), but I don't
    think one would call it branch prediction. In any case, I meant
    speculative execution based on branch prediction (but did not write it
    that way), and the 360/91 did not do speculative execution AFAIK.

    - anton

    Most DSPs have some kind of "loop buffer" from which they can execute
    without fetching code from memory.

  • From Stefan Monnier@21:1/5 to All on Wed May 21 13:14:35 2025
    Most DSPs have some kind of "loop buffer" from which they can execute
    without fetching code from memory.

    And Mitch's My 66000 `VEC` instruction takes the idea a step further.


    Stefan

  • From moi@21:1/5 to George Neuner on Wed May 21 18:47:21 2025
    On 21/05/2025 17:52, George Neuner wrote:
    On Mon, 19 May 2025 17:46:45 GMT, [email protected]
    (Anton Ertl) wrote:

    John Levine <[email protected]> writes:
    The 360/91 had primitive branch prediction in "loop mode". It had an
    eight-doubleword instruction queue (which it confusingly called a
    stack). If a program did a backward branch of less than eight
    doublewords, it'd stop prefetching and execute out of the queue until
    the program fell or branched out.

    The 68010 had a similar feature (with a smaller buffer), but I don't
    think one would call it branch prediction. In any case, I meant
    speculative execution based on branch prediction (but did not write it
    that way), and the 360/91 did not do speculative execution AFAIK.

    - anton

    Most DSPs have some kind of "loop buffer" from which they can execute
    without fetching code from memory.

    The Ferranti Atlas 2 and the EE KDF9 are both prior art.

    --
    Bill F.

  • From Thomas Koenig@21:1/5 to Anton Ertl on Thu May 29 19:02:11 2025
    Anton Ertl <[email protected]> schrieb:
    quadibloc <[email protected]> writes:
    Eventually, IBM caught up with the Control
    Data 6600 by perfecting pipelining in the IBM 360/91, and then combining
    it with cache in the 360/195. From the Pentium II onwards, that's the
    way computers are made nowadays.

    Pipelining and caches are already used on the MIPS R2000 in 1986, and
    the 486 in 1989.

    Or the 801. That may have been the first machine to have
    separate I- and D-caches (was it?)

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Thu May 29 20:06:21 2025
    On Thu, 29 May 2025 19:02:11 +0000, Thomas Koenig wrote:

    Anton Ertl <[email protected]> schrieb:
    quadibloc <[email protected]> writes:

    Eventually, IBM caught up with the Control
    Data 6600 by perfecting pipelining in the IBM 360/91,

    At the cost of about 3× the number of gates and power along with
    a 60% increase in the clock rate (60ns versus 100ns). This advantage
    vanished about the time of first /91 deliveries with CDC 7600 going
    to a ~27ns clock along with pipelining and concurrent calculation.

    and then combining it with cache in the 360/195.

    A last gasp for leadership in Big number crunching for IBM.

    From the Pentium II onwards, that's the
    way computers are made nowadays.

    Once everyone can afford the gates to make pipeline staging latches
    it is the natural way to design. Prior to this point, the designers
    were more focused on "getting it onto a single die" than on getting
    the highest possible performance--often limited by the speed of
    the external interface more than calculations inside.

    Pipelining and caches are already used on the MIPS R2000 in 1986, and
    the 486 in 1989.

    Or the 801. That may have been the first machine to have
    separate I- and D-caches (was it?)

    Without disagreeing with the above::

    MIPS R2000 (and R3000) had a unified cache--read twice per cycle on
    clock high and clock low. R3000 was faster in writing (STs) to the
    cache than R2000. Tablewalks in SW via a big hash table.

    Mc68010 had a "loop buffer" of a couple handful of instructions.
    Mc68020 had 256B instruction cache no TLB
    Mc68030 had 256B I$ 256B D$ and ~32E TLB tablewalks in HW

    Mc88100 had 16KB I$ with 64E TLB 16KB D$ with 64E TLB tablewalks
    in HW.

    CDC 6600 had a multi-word instruction stack; the 7600 had a
    significantly larger instruction stack with backward branch prediction.
    The 6600 had base+bounds memory protection, and a context switch in
    ~16 cycles by writing out current state while reading in new state.

    Many machines, all the way back to the beginning, overlapped
    fetch-decode with execute-writeback as a 2-stage pipeline. This, alone,
    makes the point where pipelining "took over" difficult to judge.

  • From Lawrence D'Oliveiro@21:1/5 to All on Thu May 29 22:20:07 2025
    On Thu, 29 May 2025 20:06:21 +0000, MitchAlsup1 wrote:

    quadibloc <[email protected]> writes:

    Eventually, IBM caught up with the Control Data 6600 by perfecting
    pipelining in the IBM 360/91,

    At the cost of about 3× the number of gates and power along with a 60%
    increase in the clock rate (60ns versus 100ns). This advantage vanished
    about the time of first /91 deliveries with CDC 7600 going to a ~27ns
    clock along with pipelining and concurrent calculation.

    Like I said, part of IBM’s tradition of overpromising and
    underdelivering.

    But it served its purpose, that of dissuading customers from buying
    the CDC product.

    Mc68010 had a "loop buffer" of a couple handful of instructions.
    Mc68020 had 256B instruction cache no TLB
    Mc68030 had 256B I$ 256B D$ and ~32E TLB tablewalks in HW

    As I recall, the ’030 wasn’t that much of an advance over the ’020.
    But the 68040 was a major step forward. And the 68060 wasn’t too bad,
    either. But by that time the major customer (Apple) had lost interest.
    I think it was used in some Amiga machines.

  • From David Schultz@21:1/5 to All on Thu May 29 18:36:49 2025
    On 5/29/25 3:06 PM, MitchAlsup1 wrote:
    Mc68010 had a "loop buffer" of a couple handful of instructions.

    Not exactly. In the very specific case of the decrement and branch on
    condition it could lock up its prefetch queue (two words) and
    instruction register. Since the dbcc instruction was two words, this
    meant it only worked with single word instructions.

    Faster but very limited.


    --
    http://davesrocketworks.com
    David Schultz
    "The cheaper the crook, the gaudier the patter." - Sam Spade

  • From Michael S@21:1/5 to Anton Ertl on Fri May 30 13:28:39 2025
    On Tue, 20 May 2025 21:21:07 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Mon, 19 May 2025 06:22:42 GMT
    [email protected] (Anton Ertl) wrote:
    The Pentium Pro (introduced 1995-11-01), HP PA-8000 (introduced
    1995-11-02), and MIPS R10000 (introduced 1996-01) are the first
    microprocessors which have full-blown OoO execution.


    What about PPC604? It had more limited OoO resources than the 3
    processors you mentioned above, esp. fewer reservation stations,
    but it most certainly had reorder buffers, 16 of them.
    So, by your own definitions, it should be called the first
    single-chip full-blown OoO CPU.

    Yes. The OoO nature with ROB is explained in <https://arstechnica.com/articles/paedia/cpu/ppc-1.ars/6>.

    Somehow that did not register with me earlier (even though a colleague
    had a Mac with a PPC 604e IIRC). I guess it's because Apple Marketing
    is low on technical details, and if Motorola emphasized this aspect,
    that did not pass the filters of the press. Also, IIRC the
    performance was not so exceptional that it would direct a spotlight at
    the underlying technology, whereas the Pentium Pro with its surprising
    SPECint win certainly did. Finally, the successors of the 604 (in
    particular, the PPC 7450) did not progress much further with OoO
    execution

    From a uArch perspective, PPC/MPC 7xx and 7xxx are really successors of
    the 603 rather than of the 604.

    The thing closest to a microarchitectural successor of the 604 (via the
    ill-fated 620) is POWER3, but that one was aimed at a completely
    different market. An offspring that attempted to re-enter the PC
    processor market was the PPC970 (a red-headed little brother of POWER4).
    This foray was terminated by Steve Jobs (he always preferred Intel, but
    until this millennium did not possess the political power to impose his
    preferences on the technical team) after lasting about 3 years.

    and still had only mild OoO capabilities at a time when the
    Pentium 4 already had a 128-entry ROB (and other structure sizes to
    match). So given the lack of ambition in the 7450, I did not even
    think about the possibility that the 604 might have been the first
    microprocessor with OoO execution.

    - anton

  • From Al Kossow@21:1/5 to All on Fri May 30 10:51:14 2025
    Steve Jobs (he always preferred Intel, but until this millennium did
    not possess the political power to impose his preferences on the
    technical team) lasting for about 3 years.

    His hardware products at NeXT prove this is nonsense.
    The last NeXT prototype that I saw in a Moto lab in Austin
    used the 88110.

    He was completely capable of forcing his will on Apple hardware
    engineers. Project leads who disagreed were let go or put into
    continuation engineering.

    The switch was pragmatic and forced because of the weak PPC
    roadmap, especially in the portable space.

  • From Scott Lurndal@21:1/5 to Al Kossow on Fri May 30 19:36:53 2025
    Al Kossow <[email protected]> writes:
    Steve Jobs (he always preferred Intel, but until this millennium did
    not possess the political power to impose his preferences on the
    technical team) lasting for about 3 years.

    His hardware products at NeXT prove this is nonsense.
    The last NeXT prototype that I saw in a Moto lab in Austin
    used the 88110.

    He was completely capable of forcing his will on Apple hardware
    engineers. Project leads who disagreed were let go or put into
    continuation engineering.

    The switch was pragmatic and forced because of the weak PPC
    roadmap, especially in the portable space.

    We (Unisys) had some systems designed around the 88100 in
    that time frame. Apple's decision to go to PPC rather than
    the 88110 caused us to evaluate all the current available
    processors (SPARC, MIPS, x86, and PPC). For rather pragmatic
    reasons (the target machine used the Intel Paragon backplane),
    the Pentium Pro was the ultimate choice, used to build the
    OPUS family of massively parallel (yet single-system image)
    computer systems.

  • From Lawrence D'Oliveiro@21:1/5 to Al Kossow on Fri May 30 22:05:31 2025
    On Fri, 30 May 2025 10:51:14 -0700, Al Kossow wrote:

    Steve Jobs (he always preferred Intel but until this millennium did not
    possess the political power to impose his preferences on the technical team)
    lasting for about 3 years.

    His hardware products at NeXT prove this is nonsense.

    Also, the entire history of the development of the first-generation
    Macintosh -- Motorola all the way, even after the switch from 68K to
    PowerPC.

    The switch was pragmatic and forced because of the weak PPC
    roadmap, especially in the portable space.

    That’s why the last-gasp PowerPC processor that was used in any
    Macintosh, the G5, came from IBM, not Motorola. I think the hope was
    that IBM would step in where Motorola was faltering. But that hope
    didn’t last long.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Fri May 30 23:01:14 2025
    On Fri, 30 May 2025 22:05:31 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 30 May 2025 10:51:14 -0700, Al Kossow wrote:

    Steve Jobs (he always preferred Intel but until this millennium did not
    possess the political power to impose his preferences on the technical team)
    lasting for about 3 years.

    His hardware products at NeXT prove this is nonsense.

    Also, the entire history of the development of the first-generation
    Macintosh -- Motorola all the way, even after the switch from 68K to
    PowerPC.

    Steve had power over Murray Goldman: Goldman believed that Apple
    volume would pay for the fab, and thus the entire product line
    consumed by Apple could be sold at marginal production costs.
    Jobs knew nobody else would deliver product at this kind of cost
    structure.

    I have suspected Dell and Intel had/have a similar arrangement.

    The switch was pragmatic and forced because of the weak PPC
    roadmap, especially in the portable space.

    That’s why the last-gasp PowerPC processor that was used in any
    Macintosh, the G5, came from IBM, not Motorola. I think the hope was
    that IBM would step in where Motorola was faltering. But that hope
    didn’t last long.

  • From Anton Ertl@21:1/5 to Scott Lurndal on Sat May 31 07:57:27 2025
    [email protected] (Scott Lurndal) writes:
    We (Unisys) had some systems designed around the 88100 in
    that time frame. Apple's decision to go to PPC rather than
    the 88110 caused us to evaluate all the current available
    processors (SPARC, MIPS, x86, and PPC). For rather pragmatic
    reasons (the target machine used the Intel Paragon backplane),
    the Pentium Pro was the ultimate choice, used to build the
    OPUS family of massively parallel (yet single-system image)
    computer systems.

    Likewise, Data General's AViiON line of Unix workstations and
    servers was based on the 88100, and I worked with them in 1990 and
    1991. When Motorola gave up the 88k line, DG gave up Motorola and
    switched to Intel. That worked for a while, but apparently they were
    not successful enough with this line of business, and got bought by
    EMC for DG's CLARiiON line of disk array storage products. So going
    for Intel was no panacea, either. It worked well enough for Unisys
    to survive, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Michael S on Sat May 31 08:10:14 2025
    Michael S <[email protected]> writes:
    On Tue, 20 May 2025 21:21:07 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    [PPC604]
    Yes. The OoO nature with ROB is explained in
    <https://arstechnica.com/articles/paedia/cpu/ppc-1.ars/6>.

    Somehow that did not register with me earlier (even though a colleague
    had a Mac with a PPC 604e IIRC). I guess it's because Apple Marketing
    is low on technical details, and if Motorola emphasized this aspect,
    that did not pass the filters of the press. Also, IIRC the
    performance was not so exceptional that it would direct a spotlight at
    the underlying technology, whereas the Pentium Pro with its surprising
    SPECint win certainly did. Finally, the successors of the 604 (in
    particular, the PPC 7450) did not progress much further with OoO
    execution

    From a uArch perspective, PPC/MPC 7xx and 7xxx are really successors of
    the 603 rather than of the 604.

    Looking at <https://arstechnica.com/features/2004/08/ppc-1/>, the 603e
    is also a full-blown OoO CPU (it does not describe the 603 in enough
    detail to establish that for the 603);
    <https://en.wikipedia.org/wiki/PowerPC_600#PowerPC_603> says that the
    603 is OoO, but does not give details. In any case, the 603 appeared
    around the same time as the 604, so the 604 might be the first
    full-blown OoO CPU even if the 603 also is OoO.

    In any case, the 750 and 7450 are full-blown OoO machines, but still
    with relatively small buffers: the 7450 (introduced 2001) has a
    six-entry integer queue, and apparently only three of the entries can
    be used for reordering (Chips and Cheese calls this a three-entry
    scheduler preceded by a three-entry non-scheduling queue); similarly,
    the vector side has a two-entry scheduler preceded by a two-entry
    non-scheduling queue. I can't find information about the size of the
    reorder buffer of the 7450, but it has 16 rename registers for the
    GPRs and 16 rename registers for the vector registers, which is
    indicative of its reordering capabilities.

    By contrast, the AMD K7 (1999) has a 15-entry integer scheduler, a
    36-entry FP scheduler, a 72-entry ROB, 88 FP registers (-8 for the 8
    architectural 387 registers), and IIRC 72 integer registers (-8 for
    the 8 architectural GPRs). I wonder why they needed 80 FP rename
    registers if they could reorder only across 72 instructions. In any
    case, this appeared earlier and had much more reordering capability
    than the PPC 7450. I wonder why the 7450 designers chose to have so
    little reorder capacity. Too much space spent on Altivec? Did they
    have a design that was hard to scale for more entries? Power
    consumption?
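
    The register arithmetic in the K7 comparison can be checked with a
    short Python sketch. The figures are the ones quoted in the post (the
    integer register count is an "IIRC" value there); the helper name
    rename_registers is purely illustrative and not taken from any AMD
    documentation.

```python
# Figures as quoted in the post; the integer register count is an "IIRC" value.
K7_ROB_ENTRIES = 72
K7_FP_PHYSICAL = 88
K7_INT_PHYSICAL = 72
ARCHITECTURAL = 8  # 8 x87 registers on the FP side, 8 GPRs on the integer side


def rename_registers(physical: int, architectural: int = ARCHITECTURAL) -> int:
    """Registers left for renaming once the architectural state is held."""
    return physical - architectural


fp_rename = rename_registers(K7_FP_PHYSICAL)
int_rename = rename_registers(K7_INT_PHYSICAL)
fp_surplus = fp_rename - K7_ROB_ENTRIES  # rename registers the ROB cannot cover

print(f"FP rename registers:      {fp_rename}")
print(f"Integer rename registers: {int_rename}")
print(f"FP rename registers beyond ROB capacity: {fp_surplus}")
```

    With these inputs the FP side has more rename registers than the ROB
    has entries, which is the mismatch the post is wondering about.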

    An offspring that attempted to re-enter the PC processor market was
    the PPC970 (a red-headed little brother of POWER4). This foray was
    terminated by Steve Jobs (he always preferred Intel but until this
    millennium did not possess the political power to impose his
    preferences on the technical team) after lasting for about 3 years.

    The PPC970 was marketed by Steve Jobs, as usual, as the best thing
    since sliced bread, but in my work, it was slower than contemporary
    IA-32 and AMD64 systems:

    All the numbers below are execution times in seconds, so lower means
    faster:

    From <https://www.complang.tuwien.ac.at/franz/latex-bench>

    - PowerMac G5, 2000MHz PPC970, Gentoo Linux PPC64 1.47
    - Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit) 0.76

    From <https://cgit.git.savannah.gnu.org/cgit/gforth.git/plain/Benchres>

    sieve bubble matrix   fib
    0.279  0.411  0.183 0.519 0.7.0; PPC970 2GHz (PowerMac G5); gcc-4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
    0.245  0.287  0.156 0.376 0.7.0; Pentium 4 Northwood 2.26GHz; gcc-2.95.4 20011002 (Debian prerelease)
    0.216  0.268  0.112 0.340 0.7.0; K8 2GHz (Opteron 270); gcc-4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
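
    To make the gap in the Benchres excerpt easier to read, the following
    Python sketch expresses each machine's time as a multiple of the K8's
    time. The numbers are copied from the lines above; the choice of the
    K8 as baseline and the names TIMES and slowdown are illustrative
    assumptions. Note that the three rows were built with different gcc
    versions, so the ratios are not a pure hardware comparison.

```python
# Times in seconds, copied from the Benchres excerpt (lower is faster).
BENCHMARKS = ("sieve", "bubble", "matrix", "fib")
TIMES = {
    "PPC970 2GHz": (0.279, 0.411, 0.183, 0.519),
    "Pentium 4 Northwood 2.26GHz": (0.245, 0.287, 0.156, 0.376),
    "K8 2GHz": (0.216, 0.268, 0.112, 0.340),
}


def slowdown(machine: str, baseline: str = "K8 2GHz") -> dict:
    """Run time of `machine` divided by run time of `baseline`, per benchmark."""
    return {
        name: t / b
        for name, t, b in zip(BENCHMARKS, TIMES[machine], TIMES[baseline])
    }


for machine in TIMES:
    ratios = slowdown(machine)
    line = "  ".join(f"{name}={ratio:.2f}x" for name, ratio in ratios.items())
    print(f"{machine:30s} {line}")
```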

    Many years later I did some microbenchmarking
    <https://www.complang.tuwien.ac.at/anton/undefined-div-bench/> that
    involved division instructions. I was surprised by the slowness of
    the PPC970 on this microbenchmark. Results in cycles per iteration
    (lower is better):

    ooomb oooub
     41.9  41.9 PPC 7447A (iBook G4) ppc (32-bit) gcc-4.3.2
    130.0 130.0 PPC 970 (PowerMac G5) ppc64 gcc-4.4.5

    (but note that this compares 32-bit with 64-bit division).

    As for the question of why Apple switched to Intel, it appeared pretty
    clear to me: At the level where they were working, the M of AIM could
    not keep up in the GHz race, the I in AIM finally managed to get the
    GHz (that the performance was subpar did not register with me at the
    time; Apple marketing worked:-) with the PPC970, but that was too
    power-hungry for laptops. Probably both M and I demanded more money
    from A in order to develop in the direction that A was asking them to.
    A did not want to provide the money, so they developed for their
    specific markets: M developed what the embedded market asked for, I
    developed for servers and workstations. My guess is that Intel
    already had the high-performance laptop CPUs that Apple needed,
    because that was in their market, so they could offer Apple a better
    deal, and that's how it went.

    The irony is that P.A. Semi worked on the kind of CPUs that Apple
    wanted, lost their prospective customer with this move by Apple, was
    then bought by Apple in 2008, and their workforce became part of what
    is now Apple Silicon, and worked on the chips that first powered
    iPhones and later displaced Intel from Apple's laptop and desktop
    computers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Savard@21:1/5 to Terje Mathisen on Wed Jul 16 14:27:11 2025
    On Mon, 19 May 2025 22:04:22 +0200, Terje Mathisen wrote:
    quadibloc wrote:

    The Pentium II (and Pentium Pro) also only used OoO for floating-point,
    while the 68050 only used OoO for integers!

    Huh???

    The Pentium (all versions) had two pipes (u & v), both in-order, and
    with severe limitations on which opcodes could run in v in parallel with
    the primary opcode in the u pipe.

    The P6/PentiumPro OTOH does true OoO for all instruction types.

    John, you are usually much better informed!

    I had read somewhere that the Pentium Pro and the Pentium II, like the
    System/360 Model 91, were OoO only in their floating-point pipelines. If
    that source was faulty, and better sources say differently, I'll need to
    check on it.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 18:27:38 2025
    On Wed, 16 Jul 2025 14:27:11 +0000, John Savard wrote:

    On Mon, 19 May 2025 22:04:22 +0200, Terje Mathisen wrote:
    quadibloc wrote:

    The Pentium II (and Pentium Pro) also only used OoO for floating-point,
    while the 68050 only used OoO for integers!

    Huh???

    The Pentium (all versions) had two pipes (u & v), both in-order, and
    with severe limitations on which opcodes could run in v in parallel with
    the primary opcode in the u pipe.

    The P6/PentiumPro OTOH does true OoO for all instruction types.

    John, you are usually much better informed!

    I had read somewhere that the Pentium Pro and the Pentium II, like the
    System/360 Model 91, were OoO only in their floating-point pipelines. If
    that source was faulty, and better sources say differently, I'll need to
    check on it.

    The Anderson papers indicate the /91 was just heavily pipelined in
    the integer side.

    I don't know about PPro in the integer section, but it was definitely
    OoO in branches, the memory section, and in the FPU. So, I don't see
    why they would not have had integer OoO.


    John Savard

  • From Stefan Monnier@21:1/5 to All on Wed Jul 16 17:45:20 2025
    I don't know about PPro in the integer section, but it was definitely
    OoO in branches, the memory section, and in the FPU. So, I don't see
    why they would not have had integer OoO.

    For the /91 I can see some potential simplifications to keep some parts
    in-order, but for the PPro it seems to me that the requirements of
    precise exceptions make it so having some parts OoO and some parts
    in-order wouldn't give much benefit (if any).
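
    As a rough illustration of why precise exceptions tie all instruction
    types to one retirement mechanism, here is a toy reorder buffer in
    Python. It is only a sketch under simplifying assumptions (no
    register renaming, no timing, invented class and method names) and
    does not model the Pentium Pro or any other real design:
    instructions may complete in any order, but they retire strictly
    oldest-first, and a faulting instruction squashes itself and
    everything younger.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    done: bool = False
    faulted: bool = False


class ReorderBuffer:
    """Toy ROB: out-of-order completion, in-order retirement."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left

    def dispatch(self, name):
        entry = Entry(name)
        self.entries.append(entry)
        return entry

    def complete(self, entry, faulted=False):
        entry.done = True
        entry.faulted = faulted

    def retire(self):
        """Retire finished instructions from the head, oldest first.

        Returns (retired, squashed). A fault at the head squashes the
        faulting instruction and all younger ones, finished or not.
        """
        retired = []
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.faulted:
                squashed = [head.name] + [e.name for e in self.entries]
                self.entries.clear()
                return retired, squashed
            retired.append(head.name)
        return retired, []


rob = ReorderBuffer()
load, fdiv, add, store = (rob.dispatch(n) for n in ("load", "fdiv", "add", "store"))

rob.complete(add)    # younger integer op finishes before the older load
rob.complete(store)
first = rob.retire()  # nothing retires: the oldest entry is still pending

rob.complete(load)
rob.complete(fdiv, faulted=True)
second = rob.retire()  # load retires; fdiv faults and squashes add and store

print(first)
print(second)
```

    In this model the already finished integer add is discarded because
    an older FP instruction faulted, so integer and FP work end up going
    through the same retirement path either way.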


    Stefan

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Jul 26 02:45:56 2025
    On Wed, 16 Jul 2025 18:27:38 +0000, MitchAlsup1 wrote:

    The Anderson papers indicate the /91 was just heavily pipelined in the integer side.

    Not good enough to keep up with CDC?

    After about two years of promising that they would blow CDC out of the
    water ...

  • From John Savard@21:1/5 to Lawrence D'Oliveiro on Thu Jul 31 20:38:38 2025
    On Sat, 26 Jul 2025 02:45:56 +0000, Lawrence D'Oliveiro wrote:

    Not good enough to keep up with CDC?

    After about two years of promising that they would blow CDC out of the
    water ...

    The IBM System/360 Model 91 wasn't even good enough to keep up with the
    Model 85.

    However, IBM still realized that OoO was useful, even if it delivered less
    than the promised improvement in performance. So they went on to the
    Model 195 which added cache to the Model 91 design. That did work well
    enough that *I think* it actually did out-perform the CDC machines of the
    time.

    Even if it didn't, it performed well, and could have been considered a
    superior alternative - the CDC 6600 had reliability problems, I remember
    reading. So it would only have had to come close to the 7600 or whatever
    CDC had at the time in such a situation.

    John Savard

  • From Michael S@21:1/5 to John Savard on Fri Aug 1 15:02:19 2025
    On Thu, 31 Jul 2025 20:38:38 -0000 (UTC)
    John Savard <[email protected]d> wrote:

    On Sat, 26 Jul 2025 02:45:56 +0000, Lawrence D'Oliveiro wrote:

    Not good enough to keep up with CDC?

    After about two years of promising that they would blow CDC out of
    the water ...

    The IBM System/360 Model 91 wasn't even good enough to keep up with
    the Model 85.

    However, IBM still realized that OoO was useful, even if it delivered
    less than the promised improvement in performance. So they went on to
    the Model 195 which added cache to the Model 91 design. That did work
    well enough that *I think* it actually did out-perform the CDC
    machines of the time.


    From what I see in Wikipedia, it looks like all "number-crunching
    oriented" S/360 Models, i.e. 85, 91 and 195, were failures from a
    business POV, even if to slightly different degrees (85 less bad).
    Maybe the S/370 Model 195 was more successful; I was not able to find
    info about the number of units shipped.

    But, then again, the CDC 7600, despite its excellent performance, was
    significantly less successful commercially than the 6600. So maybe it
    was just a bad era for that type of machine.

    Even if it didn't, it performed well, and could have been considered a
    superior alternative - the CDC 6600 had reliability problems, I
    remember reading. So it would only have had to come close to the 7600
    or whatever CDC had at the time in such a situation.

    John Savard

  • From John Levine@21:1/5 to All on Fri Aug 1 15:44:52 2025
    According to Michael S <[email protected]>:
    From what I see in Wikipedia, it looks like all "number-crunching
    oriented" S/360 Models, i.e. 85, 91 and 195, were failures from a
    business POV, even if to slightly different degrees (85 less bad).
    Maybe the S/370 Model 195 was more successful; I was not able to find
    info about the number of units shipped.

    Neither can I but I don't think it was very many.

    The /91 was a very unbalanced machine. For general computing
    like compilers it was about the same speed as the /85, but
    for floating point codes it was twice as fast or more depending
    on how well the code was tuned to the /91.

    The IBM history book says the /85 was a technical success largely
    due to the cache but didn't sell well, partly due to poor economic
    conditions, partly because customers wanted something faster and
    cheaper built using integrated circuits.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
