• Instruction Tracing

    From Lawrence D'Oliveiro@21:1/5 to All on Sat Aug 10 06:20:51 2024
    In the early days of the spread of RISC (i.e. the 1980s), much was made of
    the analysis of the dynamic execution profiles of actual compiled programs
    to see what machine instructions they most frequently used. This then
    became the rationale for optimizing common instructions, and even omitting
    ones that were not so often used.

    One thing these instruction traces would frequently report is that integer
    multiply and divide instructions were not so common, and so could be
    omitted and emulated in software, with minimal impact on overall
    performance. We saw this design decision taken in the early versions of
    Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

    Later, it seems, the CPU designers realized that instruction traces were
    not the final word on performance measurements, and started to include
    hardware integer multiply and divide instructions.

    (ROMP was also one of those RISC architectures that had delayed branches,
    along with MIPS, HP-PA and I think SPARC as well.)

    I have heard it said that the RT PC was a poor advertisement for the
    benefits of RISC, and the joke was made that “RT” stood for “Reduced
    Technology”.

    Later, of course, IBM more than made good this deficiency with its second
    take on RISC, in the form of the POWER architecture, which is still a
    performance leader to this day.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sat Aug 10 10:18:02 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    One thing these instruction traces would frequently report is that integer
    multiply and divide instructions were not so common, and so could be
    omitted and emulated in software, with minimal impact on overall
    performance. We saw this design decision taken in the early versions of
    Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

    Alpha and IA-64 have no integer division. IIRC IA-64 has no FP
    division.

    One interesting aspect of RISC-V is that they put multiplication and
    division in the same extension (which is included in RV64G, i.e., the
    General version of RISC-V).

    Later, it seems, the CPU designers realized that instruction traces were
    not the final word on performance measurements, and started to include
    hardware integer multiply and divide instructions.

    When you invest more hardware to increase performance per cycle, at
    one point the best return on investment is to have multiplication and
    division instructions. What is interesting is that the multipliers
    have then soon been fully pipelined. Or, as Mitch Alsup reports, in
    cases where that was cheaper, have two half-pipelined multipliers.
    Apparently there are enough applications that require a huge number of
    multiplications; my guess is that the NSA won't tell us what they are.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Aug 10 18:25:39 2024
    On Sat, 10 Aug 2024 6:20:51 +0000, Lawrence D'Oliveiro wrote:

    In the early days of the spread of RISC (i.e. the 1980s), much was made of
    the analysis of the dynamic execution profiles of actual compiled programs
    to see what machine instructions they most frequently used. This then
    became the rationale for optimizing common instructions, and even omitting
    ones that were not so often used.

    One thing these instruction traces would frequently report is that integer
    multiply and divide instructions were not so common, and so could be
    omitted and emulated in software, with minimal impact on overall
    performance. We saw this design decision taken in the early versions of
    Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

    One of the reasons I like unified register files is that one HAS TO
    implement FP MUL and if one has FMUL then one has easy access to IMUL.
    Same for FDIV. FCMP is only 12 gates different from ICMP, and one need
    not consume OpCode space with FP LDs and STs.

    {{I give MIPS (the company) a pass, here, because their FPU was
    on a different chip than the integer stuff.}}

    Later, it seems, the CPU designers realized that instruction traces were
    not the final word on performance measurements, and started to include
    hardware integer multiply and divide instructions.

    AMD had a HW trace unit which would spew instruction and data addresses
    and branch directions to ½ of main memory. This would trace across
    user<->OS boundaries so simulation could include everything the chip
    was doing--including the damage OS excursions did to caches and TLBs.
    While there I extended this to the interconnect and DRAM Bank control.

    We had over 1,000 files, 4GB each using about 12-bits per average
    instruction capturing SPECint, SPECfp, database, TCPIP, server
    workloads,... Using the "server farm" we could run all of them
    "overnight".

    What we can all agree upon is that user-level instruction tracing is
    insufficient for capturing what the chip will be doing, but that
    capturing all the instructions across all the privilege levels is
    sufficient.

    (ROMP was also one of those RISC architectures that had delayed branches,
    along with MIPS, HP-PA and I think SPARC as well.)

    Everybody makes mistakes.

    I have heard it said that the RT PC was a poor advertisement for the
    benefits of RISC, and the joke was made that “RT” stood for “Reduced
    Technology”.

    People who make mistakes are laughed at--and deservedly so.

    Later, of course, IBM more than made good this deficiency with its second
    take on RISC, in the form of the POWER architecture, which is still a
    performance leader to this day.

    Think about how much more market share POWER would have if they had
    not crippled it the first go around ?!?!!

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Aug 10 18:33:36 2024
    On Sat, 10 Aug 2024 10:18:02 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    One thing these instruction traces would frequently report is that integer
    multiply and divide instructions were not so common, and so could be
    omitted and emulated in software, with minimal impact on overall
    performance. We saw this design decision taken in the early versions of
    Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

    Alpha and IA-64 have no integer division. IIRC IA-64 has no FP
    division.

    "Stupid is as stupid does" -- Forrest Gump.
    {{and applicable, too}}

    One interesting aspect of RISC-V is that they put multiplication and
    division in the same extension (which is included in RV64G, i.e., the
    General version of RISC-V).

    Later, it seems, the CPU designers realized that instruction traces were
    not the final word on performance measurements, and started to include
    hardware integer multiply and divide instructions.

    When you invest more hardware to increase performance per cycle, at
    one point the best return on investment is to have multiplication and
    division instructions. What is interesting is that the multipliers
    have then soon been fully pipelined.

    The MUL unit of Mc88100 was fully pipelined (1985). Integer multiply was
    3 cycles, single was 4 cycles, double was 7 IIRC.

    Or, as Mitch Alsup reports, in
    cases where that was cheaper, have two half-pipelined multipliers.

    When the multiplier tree delay is greater than 1 cycle, it becomes
    cheaper to have 2×½ multipliers without a stage delay than to have
    1 multiplier with 4096 flip-flops in the middle. Where cheaper is
    smaller and consumes less power.

    Apparently there are enough applications that require a huge number of
    multiplications; my guess is that the NSA won't tell us what they are.

    AES is greatly sped up with a carry-less multiplication, all one has to
    do is to deactivate the majority gate in the CAS cell (which adds no
    gates of delay or area.)
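
    As a rough software illustration of what "carry-less" means here (this
    is only a bit-level model, not a description of any particular CAS cell
    or multiplier tree), the sketch below puts an ordinary shift-and-add
    multiply next to the same loop with the add replaced by XOR, so that no
    carries propagate. The function names are made up for this example.

```python
def mul_shift_add(a: int, b: int) -> int:
    """Ordinary shift-and-add multiply of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result += a   # partial products are summed with carries
        a <<= 1
        b >>= 1
    return result


def clmul(a: int, b: int) -> int:
    """Carry-less multiply: same loop, but partial products are XORed."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # XOR is addition with the carry suppressed
        a <<= 1
        b >>= 1
    return result


if __name__ == "__main__":
    print(bin(mul_shift_add(0b11, 0b11)))  # 3 * 3 with carries
    print(bin(clmul(0b11, 0b11)))          # same operands, carries dropped
```

    The two loops differ only in the accumulate step, which is the software
    analogue of switching off carry generation in the adder cells.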

    - anton

  • From MitchAlsup1@21:1/5 to John Dallman on Sat Aug 10 19:54:50 2024
    On Sat, 10 Aug 2024 19:41:00 +0000, John Dallman wrote:

    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    IIRC IA-64 has no FP division.

    You recall correctly. It has an "approximation to reciprocal" instruction,
    which gives you about 8 bits of precision, and then requires the compiler
    to generate Newton-Raphson sequences. Intel's manual, 2010 edition, says
    this is advantageous because users can generate only the precision they
    need. Writing Itanium assembler for customised precision? Not many people
    would have wanted to do that in 2001, let alone 2010.

    In, I think, 1996, my employers had visitors from Intel trying to
    persuade us to adopt their C/C++ compiler for IA-32. They had been able
    to speed up one of our competitors' code by a factor of two, and hoped to
    do the same for us.

    They failed. We already had that factor of two, which was "ordinary
    compiler optimisation." That competitor had some rather odd coding
    standards at the time, which meant most compilers failed if asked to
    optimise their code. Someone from Intel had stayed at their site for most
    of a year, reporting the bugs and getting them fixed until Intel's
    compiler could optimise the code.

    While visiting us, Intel asked what may have been a significant question
    about the mixture of floating-point arithmetic instructions we used. We
    didn't have precise figures, but were sure that we used at least as many
    square roots as divides. IA-64 does square roots like divides, with a
    starter approximation and Newton-Raphson sequences. Slowly, because the
    N-R instructions all depend on the previous instruction, and can't be run
    in parallel.

    Newton-Raphson has 2 dependent multiplies in a dependent loop.
    Goldschmidt is a rearrangement of N-R such that the multiplies are
    independent with loop-to-loop dependencies. The way IA-64 did
    them it was 8 cycles per loop. Had they been done in function
    unit sequencing, instead, the loop would have only been 4 cycles.
    Converting to Goldschmidt it would have only been 2.

    Goldschmidt does not correct for arithmetic anomalies, whereas
    N-R does. Thus IEEE accurate Goldschmidt iterators use N-R as
    their last iteration.
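
    The difference in dependency structure can be sketched in a few lines.
    This is a minimal model in ordinary double-precision arithmetic, not
    the IA-64 instruction sequence: seed_reciprocal is a hypothetical
    stand-in for an "approximation to reciprocal" instruction (about 8
    bits), the iteration counts are arbitrary, and no IEEE-correct final
    rounding step is attempted.

```python
import math


def seed_reciprocal(b: float, bits: int = 8) -> float:
    """Hypothetical ~8-bit estimate of 1/b (stand-in for a hardware seed)."""
    m, e = math.frexp(b)              # b = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    r = round((1.0 / m) * scale) / scale
    return math.ldexp(r, -e)


def newton_raphson_div(a: float, b: float, iterations: int = 3) -> float:
    """a/b via Newton-Raphson refinement of the reciprocal of b."""
    x = seed_reciprocal(b)
    for _ in range(iterations):
        # Two multiplies, the second depending on the first.
        x = x * (2.0 - b * x)
    return a * x


def goldschmidt_div(a: float, b: float, iterations: int = 3) -> float:
    """a/b via Goldschmidt: scale numerator and denominator until d -> 1."""
    f = seed_reciprocal(b)
    n, d = a * f, b * f
    for _ in range(iterations):
        f = 2.0 - d
        # These two multiplies are independent of each other;
        # only the next loop iteration depends on them.
        n = n * f
        d = d * f
    return n


if __name__ == "__main__":
    print(newton_raphson_div(355.0, 113.0), goldschmidt_div(355.0, 113.0))
```

    In the Newton-Raphson loop each multiply waits on the previous one; in
    the Goldschmidt loop the two multiplies can issue together, which is
    the rearrangement described above.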

    John

  • From John Dallman@21:1/5 to Anton Ertl on Sat Aug 10 20:41:00 2024
    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    IIRC IA-64 has no FP division.

    You recall correctly. It has an "approximation to reciprocal" instruction,
    which gives you about 8 bits of precision, and then requires the compiler
    to generate Newton-Raphson sequences. Intel's manual, 2010 edition, says
    this is advantageous because users can generate only the precision they
    need. Writing Itanium assembler for customised precision? Not many people
    would have wanted to do that in 2001, let alone 2010.

    In, I think, 1996, my employers had visitors from Intel trying to
    persuade us to adopt their C/C++ compiler for IA-32. They had been able
    to speed up one of our competitors' code by a factor of two, and hoped to
    do the same for us.

    They failed. We already had that factor of two, which was "ordinary
    compiler optimisation." That competitor had some rather odd coding
    standards at the time, which meant most compilers failed if asked to
    optimise their code. Someone from Intel had stayed at their site for most
    of a year, reporting the bugs and getting them fixed until Intel's
    compiler could optimise the code.

    While visiting us, Intel asked what may have been a significant question
    about the mixture of floating-point arithmetic instructions we used. We
    didn't have precise figures, but were sure that we used at least as many
    square roots as divides. IA-64 does square roots like divides, with a
    starter approximation and Newton-Raphson sequences. Slowly, because the
    N-R instructions all depend on the previous instruction, and can't be run
    in parallel.

    John

  • From MitchAlsup1@21:1/5 to BGB on Sat Aug 10 23:48:28 2024
    On Sat, 10 Aug 2024 21:34:47 +0000, BGB wrote:


    My rough ranking of instruction probabilities (descending probability, *):
    Load/Store (Constant Displacement, ~30%);
    Branch (~ 14% of ops);
    ALU, ADD/SUB/AND/OR (~ 13%);
    Load/Store (Register Indexed, ~10%);
    Compare and Test (~ 6%);
    Integer Shift (~ 4%);
    Register Move (~ 3%);
    Sign/Zero Extension (~ 3%);
    ALU, XOR (~ 2%);
    Multiply (~ 2%);
    ...

    *: Crude estimate based on categorizing the dynamic execution
    probabilities (which are per-instruction rather than by category).

    Meanwhile, DIV and friends are generally closer to 0.05% or so...
    You can leave them out and hardly anyone will notice.

    The literature from the CRAY-1 era indicated that big number-crunching
    applications use FDIV about ¼ as often as FMUL, IDIV not so much.
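
    A back-of-the-envelope way to see why a ~0.05% dynamic frequency makes
    DIV a candidate for software emulation while a ~2% multiply is not.
    The frequencies come from the list above; the emulation costs (100 and
    40 cycles) and the assumption that every other instruction takes one
    cycle are purely hypothetical numbers chosen for illustration.

```python
def relative_runtime(freq: float, emulated_cost: float,
                     hardware_cost: float = 1.0) -> float:
    """Runtime with software emulation relative to a hardware instruction.

    freq          -- dynamic fraction of executed instructions of this kind
    emulated_cost -- assumed cycles per operation when emulated in software
    hardware_cost -- assumed cycles per operation when done in hardware
    All other instructions are assumed to cost one cycle each.
    """
    with_emulation = (1.0 - freq) + freq * emulated_cost
    with_hardware = (1.0 - freq) + freq * hardware_cost
    return with_emulation / with_hardware


if __name__ == "__main__":
    # DIV at ~0.05% of the mix, assumed 100-cycle software routine.
    print(relative_runtime(0.0005, 100.0))
    # Multiply at ~2% of the mix, assumed 40-cycle software routine.
    print(relative_runtime(0.02, 40.0))
```

    Under these assumed costs the divide case costs a few percent while
    the multiply case costs tens of percent; different cost assumptions
    will move both numbers.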

    For the most part, something like RISC-V makes sense, except that
    omitting Indexed Load/Store is basically akin to shooting oneself in the
    foot (and does result in a significant increase in the amount of Shift
    and ADD instructions used).


    With RISC-V, one may see ~ 25% Load/Store followed by ~ 20% ADD and 15%
    Shift, ...

    If you add the number of indexed LD/STs in your list above with shifts,
    you can find all those missing RISC-V shift instructions.

    Some of this is because ADD and Shift end up over-represented by their
    need to be used in compound operations (indexed load/store and sign/zero
    extension).

    RISC-V 16-bit smash::

    SLLI Rt,Rs,48
    SRAI Rt,Rt,48

    My 66000

    SLA Rt,Rs,<16:0>

    Where RISC-V uses the shifter at 48-bits twice, My 66000 only uses
    the masking part of the shifter.
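
    For readers who want to check the equivalence, here is a small model
    of a 16-bit sign extension on a 64-bit register done both ways: as the
    shift-left-48 / arithmetic-shift-right-48 pair, and as a single
    extract-and-sign-extend of the low 16 bits. The helper names are
    invented for this sketch, and it models only the arithmetic result,
    not the encoding or datapath of either ISA.

```python
MASK64 = (1 << 64) - 1


def to_signed64(x: int) -> int:
    """Interpret the low 64 bits of x as a two's-complement value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def sext16_via_shifts(x: int) -> int:
    """Shift left by 48, then arithmetic shift right by 48."""
    t = (x << 48) & MASK64          # logical left shift in a 64-bit register
    return to_signed64(t) >> 48     # Python's >> on a negative int is arithmetic


def sext16_via_mask(x: int) -> int:
    """Keep the low 16 bits and replicate bit 15 upward in one step."""
    low = x & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


if __name__ == "__main__":
    for value in (0x7FFF, 0x8000, 0xFFFF, 0x12348000):
        print(hex(value), sext16_via_shifts(value), sext16_via_mask(value))
```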

    <snip>

    Meanwhile, I am once again reminded of an annoying edge case bug in my
    Verilog implementation:
    If a TLB Miss happens on an inter-ISA branch, it can leave the CPU core
    in an inconsistent state.

    Woops...

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Aug 10 23:20:44 2024
    On Sat, 10 Aug 2024 17:49:42 -0500, BGB wrote:

    Meanwhile, saw a video recently where someone had ported Doom to a 233
    MHz PowerPC (running Windows NT4) machine and, its performance was not
    good...

    Not obvious is what combination of factors conspired to cause Doom to
    apparently run at single-digit framerates.

    Windows NT never meant for games?

  • From John Levine@21:1/5 to [email protected] on Sun Aug 11 01:57:10 2024
    It appears that Lawrence D'Oliveiro <[email protected]d> said:
    (ROMP was also one of those RISC architectures that had delayed branches,
    along with MIPS, HP-PA and I think SPARC as well.)

    I have heard it said that the RT PC was a poor advertisement for the
    benefits of RISC, and the joke was made that “RT” stood for “Reduced
    Technology”.

    I worked on AIX for the RT/PC. It was a pretty reasonable chip for the
    time, but it suffered greatly from internal IBM political fights which
    made it too little too late. AIX ran on top of a bloated virtual
    machine which made the whole thing too slow. There was a skunkworks port
    of BSD that was supposed to be a lot better.

    As far as the delayed branches and such, they made sense in the narrow
    time window when it was too expensive to put a cache on a workstation
    but that time came and went by the time the RT shipped.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From OrangeFish@21:1/5 to John Levine on Sun Aug 11 09:44:27 2024
    On 2024-08-10 21:57, John Levine wrote:
    It appears that Lawrence D'Oliveiro <[email protected]d> said:
    (ROMP was also one of those RISC architectures that had delayed branches,
    along with MIPS, HP-PA and I think SPARC as well.)

    I have heard it said that the RT PC was a poor advertisement for the
    benefits of RISC, and the joke was made that “RT” stood for “Reduced
    Technology”.

    I worked on AIX for the RT/PC. It was a pretty reasonable chip for the
    time, but it suffered greatly from internal IBM political fights which
    made it too little too late. AIX ran on top of a bloated virtual
    machine which made the whole thing too slow. There was a skunkworks port
    of BSD that was supposed to be a lot better.

    As far as the delayed branches and such, they made sense in the narrow
    time window when it was too expensive to put a cache on a workstation
    but that time came and went by the time the RT shipped.

    A long time ago, I heard (or maybe read) that the original ROMP was
    chopped in half (the FP stuff was removed) by orders of marketing for
    some sort of h/w word-processor. When that product bombed and the
    workstation market blossomed, the engineers "bolted" the FP stuff back
    on. I cannot find the source. Is there any truth to this?

    OF

  • From Anton Ertl@21:1/5 to John Levine on Sun Aug 11 14:44:38 2024
    John Levine <[email protected]> writes:
    As far as the delayed branches and such, they made sense in the narrow
    time window when it was too expensive to put a cache on a workstation
    but that time came and went by the time the RT shipped.

    Delayed branches were put in the first commercial generation of RISCs
    (except ARM), which all shipped with caches (except ARM). Delayed
    branches are a natural consequence of the 5-stage (Or, in the 88100
    case, four-stage) pipeline.

    IIRC ARM used a 3-stage implementation for the ARM1/2, which may be a
    consequence of them rejecting delayed branches; and they did not have
    caches, so they could not have made use of the higher clock rate that
    a longer pipeline could have afforded. So it seems that the connection
    between cache and delayed branches, if there is any, is the opposite
    of what you suggest.

    Delayed branches provided a speedup on these early 5-stage
    implementations. They also provided a big headache for more
    sophisticated implementations, and therefore soon fell out of favour.
    Power (IIRC) and Alpha don't have delayed branches.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From George Neuner@21:1/5 to BGB on Sun Aug 11 13:27:52 2024
    On Sun, 11 Aug 2024 02:41:01 -0500, BGB <[email protected]> wrote:

    On 8/10/2024 6:20 PM, Lawrence D'Oliveiro wrote:
    On Sat, 10 Aug 2024 17:49:42 -0500, BGB wrote:

    Meanwhile, saw a video recently where someone had ported Doom to a 233
    MHz PowerPC (running Windows NT4) machine and, its performance was not
    good...

    Not obvious is what combination of factors conspired to cause Doom to
    apparently run at single-digit framerates.

    Windows NT never meant for games?

    Windows GDI isn't fast, but it isn't usually *that* slow either.
    WinQuake and Quake2 both used GDI to good effect.

    If Windows GDI performance is seriously broken, this is a different
    issue from merely "not meant for games".

    GDI was fine for most anything *except* video games.

    If you really needed maximum performance, you used DirectX ... and
    yes, DirectX was available on NT.

  • From John Levine@21:1/5 to All on Sun Aug 11 17:55:40 2024
    According to OrangeFish <[email protected]d>:
    As far as the delayed branches and such, they made sense in the narrow
    time window when it was too expensive to put a cache on a workstation
    but that time came and went by the time the RT shipped.

    A long time ago, I heard (or maybe read) that the original ROMP was
    chopped in half (the FP stuff was removed) by orders of marketing for
    some sort of h/w word-processor. When that product bombed and the
    workstation market blossomed, the engineers "bolted" the FP stuff back
    on. I cannot find the source. Is there any truth to this?

    I also heard that the ROMP was originally intended for some sort of
    word processor from the Office Products division (the O in ROMP) and
    was repurposed into a workstation.

    The RT's floating point was an add-on card with a Natl Semi FPU
    that appeared at high memory addresses. I would be surprised if
    any of the ROMP's predecessors had hardware FP. The 801 didn't
    and it would make no sense for word processing.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Aug 11 21:09:02 2024
    On Sun, 11 Aug 2024 14:44:38 +0000, Anton Ertl wrote:

    John Levine <[email protected]> writes:
    As far as the delayed branches and such, they made sense in the narrow
    time window when it was too expensive to put a cache on a workstation
    but that time came and went by the time the RT shipped.

    Delayed branches were put in the first commercial generation of RISCs
    (except ARM), which all shipped with caches (except ARM). Delayed
    branches are a natural consequence of the 5-stage (Or, in the 88100
    case, four-stage) pipeline.

    Delayed branches are wonderful to the pipeline, very much less so for
    the architecture overall, as it makes wide issue "all that much harder".
    It was truly a pain in the ass on the Mc88120, a 6-wide machine.

    Neither nullification nor inverse nullification helped much, and both
    hurt at wide issue, too. At least Mc88100 had a bit to indicate
    the delay slot was not being used.

    Looking back, I wish we had not been forced to do them--I think many
    of the 1st generation architects wish similarly. Delayed branches
    were supposed to bring a 16% gain in performance. After looking at
    the utility rates (slightly less than 50% useful instructions, with
    a fill rate slightly over 70%), they only brought 8%-ish.
    {{A useful instruction is useful in both taken and non-taken paths.}}

    IIRC ARM used a 3-stage implementation for the ARM1/2, which may be a
    consequence of them rejecting delayed branches; and they did not have
    caches, so they could not have made use of the higher clock rate that
    a longer pipeline could have afforded. So it seems that the connection
    between cache and delayed branches, if there is any, is the opposite
    of what you suggest.

    Delayed branches provided a speedup on these early 5-stage
    implementations. They also provided a big headache for more
    sophisticated implementations, and therefore soon fell out of favour.

    Much like virtual caches...

    The only thing that has persisted is LDs being longer than 2 cycles.
    Squashing {forward, ADD, SRAM, LDalign} into 2 cycles is proving
    to be a frequency headache in the simpler RISC-V implementations
    even now. With wires getting slower and gates getting faster, that
    trade-off is getting worse. Many of the Intel x86s use 4 cycle LDs.
    {the cost of frequency is efficiency}

    Power (IIRC) and Alpha don't have delayed branches.

    None of the modern RISCs have them either.

    - anton

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Aug 11 23:07:03 2024
    On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:

    Power (IIRC) and Alpha don't have delayed branches.

    Not only does POWER not have delayed branches, but I recall the IBM folks
    claiming in the initial publicity that branches could often execute in
    zero clock cycles--that is, fully overlapped with surrounding
    instructions.

    POWER was also “superscalar” (being able to execute more than one
    operation per clock cycle) right from the beginning. Not sure if other
    RISC architectures of the time were like that. I don’t think Alpha was:
    one thing I remember from its early descriptions was its use of very high
    clock speeds. That seemed to me to be the opposite of “(at least) one
    instruction per clock cycle”, which I thought was supposed to be one of
    the defining features of RISC.

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Aug 11 23:08:18 2024
    On Sun, 11 Aug 2024 21:09:02 +0000, MitchAlsup1 wrote:

    Delayed branches are wonderful to the pipeline ...

    They also persisted for a few years longer in DSP architectures, until the
    capabilities of these were (mostly) subsumed into general-purpose CPUs.

  • From Josh Vanderhoof@21:1/5 to BGB on Sun Aug 11 21:08:44 2024
    BGB <[email protected]> writes:

    Meanwhile, saw a video recently where someone had ported Doom to a 233
    MHz PowerPC (running Windows NT4) machine and, its performance was not
    good...

    Not obvious is what combination of factors conspired to cause Doom to
    apparently run at single-digit framerates.

    Video mentioned that it was drawing using GDI calls, but this by
    itself wouldn't explain the level of slowness seen in the video.

    Like, presumably, this would require around 90% + of the clock cycles
    going into overhead, which seems a bit much.


    Reference:
    https://www.youtube.com/watch?v=LAkSJ-HqKw8

    I think NT4 was before CreateDIBSection existed. Probably much slower
    without that.

  • From Lynn Wheeler@21:1/5 to John Levine on Sun Aug 11 16:58:03 2024
    John Levine <[email protected]> writes:
    I also heard that the ROMP was originally intended for some sort of
    word processor from the Office Products division (the O in ROMP) and
    was repurposed into a workstation.

    late 70s, circa 1980 ... there were efforts to convert myriad of CISC
    microprocessors to 801/RISC microprocessors ... Iliad for low/mid range
    370s (4361/4381 following on to 4331/4341), ROMP (research/office
    products) for Displaywriter follow-on (with CP.r operating system
    implemented in PL.8), also AS/400 follow-on to S/38. For various reasons
    these efforts floundered and some of the 801/RISC engineers left IBM for
    other vendors.

    I helped with white paper that showed that nearly whole 370 could be
    implemented directly in circuits (much more efficient than microcode) for
    4361/4381. AS/400 returned to CISC microprocessor. The follow-on to
    Displaywriter was canceled (most of that market moving to IBM/PC and
    other personal computers).

    Austin group decided to pivot ROMP to Unix workstation market and got
    the company that had done AT&T UNIX port to IBM/PC as PC/IX to do one
    for ROMP (AIX, possibly "Austin IX" for PC/RT). They also had to do
    something with the 200 or so PL.8 programmers and decided to use them to
    implement an "abstract" virtual machine as "VRM" ... telling the company
    doing the UNIX port that it would be much more efficient and timely for
    them to implement to the VRM interface (rather than bare hardware).
    Besides other issues with that claim, it introduced new problem that new
    device drivers had to be done twice, one in "C" for the unix/AIX layer
    and then in "PL.8" for the VRM.

    Palo Alto was working on a port of UCB BSD to 370 and got redirected to
    port to the PC/RT ... they demonstrated that they did the BSD port to
    ROMP directly ... with much less effort than either the VRM
    implementation or the AIX implementation ... released as "AOS".

    trivia: early 80s 1) IBM Los Gatos lab was working on single chip "Blue
    Iliad", 1st 32bit 801/RISC, really hot, single large chip that never
    quite came to fruition and 2) IBM Boeblingen lab had done ROMAN, 3chip
    370 implementation (with performance of 370/168). I had proposal to see
    how many chips I could cram into single rack (either "Blue Iliad" or
    ROMAN or combination of both).

    While AS/400 1st reverted to CISC chip ... later in the 90s, out of the
    Somerset (AIM, apple, ibm, motorola) single-chip power/pc ... they got a
    801/RISC chip to move to.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 08:10:41 2024
    Lawrence D'Oliveiro wrote:
    On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:

    Power (IIRC) and Alpha don't have delayed branches.

    Not only does POWER not have delayed branches, but I recall the IBM folks
    claiming in the initial publicity that branches could often execute in
    zero clock cycles--that is, fully overlapped with surrounding
    instructions.

    Afair, the original POWER had 3 chips, with branches in a separate unit
    from integer/logic ops, right?

    The idea, as presented in that month's BYTE magazine was that the entire
    latency of transferring comparison flags over to the branch unit, select
    the corresponding direction and then transmit the resulting IP back to
    the fetch unit would happen fast enough that those offchip latencies
    would not matter.

    It also had multiple (8?) sets of compare result flags in order to avoid
    making them a speed limiter.

    POWER was also “superscalar” (being able to execute more than one
    operation per clock cycle) right from the beginning. Not sure if other
    RISC architectures of the time were like that. I don’t think Alpha was:
    one thing I remember from its early descriptions was its use of very high
    clock speeds. That seemed to me to be the opposite of “(at least) one
    instruction per clock cycle”, which I thought was supposed to be one of
    the defining features of RISC.

    Yeah, the R part was intended to make latency a single cycle for _most_
    instructions.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 05:29:29 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:

    Power (IIRC) and Alpha don't have delayed branches.

    Not only does POWER not have delayed branches, but I recall the IBM folks
    claiming in the initial publicity that branches could often execute in
    zero clock cycles--that is, fully overlapped with surrounding
    instructions.

    Yes. Power has a sophisticated branch unit for its time.

    POWER was also "superscalar" (being able to execute more than one
    operation per clock cycle) right from the beginning. Not sure if other
    RISC architectures of the time were like that. I don’t think Alpha was:
    one thing I remember from its early descriptions was its use of very high
    clock speeds.

    Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1
    load/store unit, don't remember if the branch unit could run in
    parallel to the ALU; I think not). And it has very high clock speeds
    for its time; it appeared with 150MHz while the competition was like
    50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not
    superscalar), or, for Power, 62.5MHz in the POWER1++. But POWER1
    (without ++) preceded the 21064 by 2 years.

    That seemed to me to be the opposite of "(at least) one
    instruction per clock cycle", which I thought was supposed to be one of
    the defining features of RISC.

    A peak (i.e., guaranteed not to be exceeded) performance of 1 IPC was
    a goal in early RISCs. I guess you could construct a program that has
    1 IPC on the MIPS R2000 and the first SPARC, but for useful code the
    IPC on these early RISCs is lower. Likewise for the 21064, the peak
    performance is 2 IPC, but performance on useful code is usually lower.
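
    To make the clock-versus-IPC trade-off concrete, here is a minimal
    sketch in Python. The function name and the sustained IPC values are
    hypothetical, chosen only for illustration; only the clock rates
    (150MHz and 62.5MHz) come from the figures above. It also assumes
    both machines execute the same number of instructions for the same
    work, which is not true across different ISAs.

```python
def relative_speed(clock_a_mhz, ipc_a, clock_b_mhz, ipc_b):
    """Speed of machine A relative to machine B.

    Instructions per second = clock rate * sustained IPC, so the ratio of
    the two products is the speedup, assuming equal instruction counts.
    """
    return (clock_a_mhz * ipc_a) / (clock_b_mhz * ipc_b)


# Equal sustained IPC (hypothetical): the speedup equals the clock ratio.
same_ipc = relative_speed(150, 1.0, 62.5, 1.0)

# Hypothetical case where the faster-clocked part sustains half the IPC:
# the speedup is then only half the clock ratio.
half_ipc = relative_speed(150, 0.5, 62.5, 1.0)

print(f"equal IPC: {same_ipc:.2f}x, half IPC: {half_ipc:.2f}x")
```

    A clock advantage only turns into the same speed advantage if the
    sustained IPC holds up, which is why neither clock rate nor peak IPC
    alone says much.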

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Aug 12 06:33:17 2024
    On Mon, 12 Aug 2024 05:29:29 GMT, Anton Ertl wrote:

    Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1 load/store unit, don't remember if the branch unit could run in parallel
    to the ALU; I think not). And it has very high clock speeds for its
    time; it appeared with 150MHz while the competition was like 50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not superscalar), or,
    for Power, 62.5MHz in the POWER1++. But POWER1 (without ++) preceded
    the 21064 by 2 years.

    But in spite of having, say, 2½ times the clock speed of POWER, Alpha was
    not 2½ times faster, was it?

  • From Michael S@21:1/5 to Josh Vanderhoof on Mon Aug 12 10:56:07 2024
    On Sun, 11 Aug 2024 21:08:44 -0400
    Josh Vanderhoof <[email protected]> wrote:

    BGB <[email protected]> writes:

    Meanwhile, saw a video recently where someone had ported Doom to a
    233 MHz PowerPC (running Windows NT4) machine and, its performance
    was not good...

    Not obvious is what combination of factors conspired to cause Doom
    to apparently run at single-digit framerates.

    Video mentioned that it was drawing using GDI calls, but this by
    itself wouldn't explain the level of slowness seen in the video.

    Like, presumably, this would require around 90% + of the clock
    cycles going into overhead, which seems a bit much.


    Reference:
    https://www.youtube.com/watch?v=LAkSJ-HqKw8

    I think NT4 was before CreateDIBSection existed. Probably much slower without that.

    NT4 has CreateDIBSection.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 11:09:18 2024
    On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Mon, 12 Aug 2024 05:29:29 GMT, Anton Ertl wrote:

    Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1 load/store unit, don't remember if the branch unit could run in
    parallel to the ALU; I think not). And it has very high clock
    speeds for its time; it appeared with 150MHz while the competition
    was like 50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not superscalar), or, for Power, 62.5MHz in the POWER1++. But POWER1
    (without ++) preceded the 21064 by 2 years.

    But in spite of having, say, 2½ times the clock speed of POWER, Alpha
    was not 2½ times faster, was it?

    Of course not.
    But the Alpha EV4 was a single chip, versus multiple chips in POWER1 or
    the 3 chips of contemporary PA-RISC.
    A more relevant comparison is EV4 vs IBM RSC:
    https://en.wikipedia.org/wiki/RISC_Single_Chip
    I think that EV4 was 3-5 times faster than RSC.

    Back in 1992-1993 I was not impressed by the speed of the RS/6000 model
    220 relative to i486 PCs. Admittedly, the 220 was running a much heavier
    software stack.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Aug 12 08:42:51 2024
    On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:

    On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    But in spite of having, say, 2½ times the clock speed of POWER, Alpha
    was not 2½ times faster, was it?

    Of course not.

    That’s what I mean: it took several clock cycles per instruction, contrary
    to just about every other RISC architecture.

  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Aug 12 11:10:45 2024
    Terje Mathisen <[email protected]> writes:
    Lawrence D'Oliveiro wrote:
    On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:

    Power (IIRC) and Alpha don't have delayed branches.

    Not only does POWER not have delayed branches, but I recall the IBM folks
    claiming in the initial publicity that branches could often execute in
    zero clock cycles--that is, fully overlapped with surrounding
    instructions.

    Afair, the original POWER had 3 chips, with branches in a separate unit
    from integer/logic ops, right?

    Looking at <https://en.wikipedia.org/wiki/POWER1>, the RIOS-1
    configuration has 9 chips: ICU, FPU, FXU (integer unit), SCU (storage
    control), 4xDCU (data cache), I/O Unit. The RIOS-9 configuration has
    only 2 DCUs (7 chips total).

    It also had multiple (8?) sets of compare result flags in order to avoid
    making them a speed limiter.

    I wonder how much that is used. There is only one carry bit.

    Yeah, the R part was intended to make latency a single cycle for _most_
    instructions.

    It was mainly meant to increase the *throughput* to one instruction
    per cycle; that includes instructions like loads that have a
    latency > 1 cycle.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 18:14:53 2024
    On Mon, 12 Aug 2024 08:42:51 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:

    On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    But in spite of having, say, 2½ times the clock speed of POWER,
    Alpha was not 2½ times faster, was it?

    Of course not.

    That’s what I mean: it took several clock cycles per instruction,
    contrary to just about every other RISC architecture.

    On EV4, simple ALU instructions took 1 cycle, both for throughput and
    for latency.
    Shifts and conditional moves had a latency of 2 and a throughput of 1.
    The integer multiplier was not pipelined, but a few other RISCs also
    had it non-pipelined. Latency of the integer multiplier was 19-21 cycles.
    On the FP side, both FADD and FMUL were fully pipelined (T=1) and had a
    latency of 6 cycles.
    L1D cache hits were fully pipelined (T=1) and had a latency of 3 cycles.

    So, as long as code/data fit in the L1 cache, EV4 IPC was not
    far behind the competition. Relative to MIPS R4K, maybe even ahead.

    Of course, cache misses were relatively more expensive than for much
    lower clocked competitors. DEC's solution to that was a wide and fast
    system bus.
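
    As a rough illustration of how those latency and throughput figures
    play out, here is a minimal sketch in Python. The function names and
    the idealized cycle model (a single unit, no stalls, no cache misses)
    are assumptions for illustration only; the latencies are the EV4
    numbers listed above.

```python
def dependent_chain_cycles(n, latency):
    """Cycles for n operations where each one needs the previous result.

    Nothing can overlap, so the latencies simply add up.
    """
    return n * latency


def pipelined_stream_cycles(n, latency, issue_interval=1):
    """Cycles for n independent operations on one functional unit.

    A new operation starts every issue_interval cycles; the last one
    started still needs the full latency to complete.
    """
    if n == 0:
        return 0
    return (n - 1) * issue_interval + latency


# FADD/FMUL: latency 6, fully pipelined (T=1).
fp_dependent = dependent_chain_cycles(100, 6)
fp_independent = pipelined_stream_cycles(100, 6)

# Integer multiply: not pipelined, so a new one can start only when the
# previous one is done (issue interval == latency, taking the low end, 19).
mul_independent = pipelined_stream_cycles(100, 19, issue_interval=19)

print(fp_dependent, fp_independent, mul_independent)
```

    In this model 100 dependent FADDs take 600 cycles, while 100
    independent ones take 105, i.e. close to one per cycle despite the
    6-cycle latency. The non-pipelined multiplier gets no such benefit.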

  • From MitchAlsup1@21:1/5 to Michael S on Mon Aug 12 17:32:09 2024
    On Mon, 12 Aug 2024 15:14:53 +0000, Michael S wrote:

    On Mon, 12 Aug 2024 08:42:51 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:

    On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    But in spite of having, say, 2½ times the clock speed of POWER,
    Alpha was not 2½ times faster, was it?

    Of course not.

    That’s what I mean: it took several clock cycles per instruction,
    contrary to just about every other RISC architecture.

    On EV4, simple ALU instructions took 1 cycle, both for throughput and
    for latency.
    Shifts and conditional moves had a latency of 2 and a throughput of 1.
    The integer multiplier was not pipelined, but a few other RISCs also
    had it non-pipelined.

    Mc88100 had a pipelined multiplier; you could start an int mul
    every cycle, or a single mul every cycle, or a double mul every 4
    cycles.

    Latency of integer multiplier was 19-21 cycles.

    3 cycles for Mc88100

    On FP side both FADD and FMUL were fully pipelined (T=1) and had
    latency of 6 cycles.
    L1D cache hits were fully pipelined (T=1) and had latency of 3 cycles.

    So, as long as code/data fit in the L1 cache, EV4 IPC was not
    far behind the competition. Relative to MIPS R4K, maybe even ahead.

    Of course, cache misses were relatively more expensive than for much
    lower clocked competitors. DEC's solution to that was a wide and fast
    system bus.
