Forum: >>> Magnum BBS <<<

Instruction Tracing

From Lawrence D'Oliveiro@21:1/5 to All on Sat Aug 10 06:20:51 2024

In the early days of the spread of RISC (i.e. the 1980s), much was made of
the analysis of the dynamic execution profiles of actual compiled programs
to see what machine instructions they most frequently used. This then
became the rationale for optimizing common instructions, and even omitting
ones that were not so often used.

One thing these instruction traces would frequently report is that integer multiply and divide instructions were not so common, and so could be
omitted and emulated in software, with minimal impact on overall
performance. We saw this design decision taken in the early versions of
Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

Later, it seems, the CPU designers realized that instruction traces were
not the final word on performance measurements, and started to include
hardware integer multiply and divide instructions.

(ROMP was also one of those RISC architectures that had delayed branches,
along with MIPS, HP-PA and I think SPARC as well.)

I have heard it said that the RT PC was a poor advertisement for the
benefits of RISC, and the joke was made that “RT” stood for “Reduced Technology”.

Later, of course, IBM more than made good this deficiency with its second
take on RISC, in the form of the POWER architecture, which is still a performance leader to this day.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sat Aug 10 10:18:02 2024

Lawrence D'Oliveiro <[email protected]d> writes:

One thing these instruction traces would frequently report is that integer
multiply and divide instructions were not so common, and so could be
omitted and emulated in software, with minimal impact on overall
performance. We saw this design decision taken in the early versions of
Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

Alpha and IA-64 have no integer division. IIRC IA-64 has no FP
division.

One interesting aspect of RISC-V is that they put multiplication and
division in the same extension (which is included in RV64G, i.e., the
General version of RISC-V).

Later, it seems, the CPU designers realized that instruction traces were
not the final word on performance measurements, and started to include
hardware integer multiply and divide instructions.

When you invest more hardware to increase performance per cycle, at
one point the best return on investment is to have multiplication and
division instructions. What is interesting is that the multipliers
have then soon been fully pipelined. Or, as Mitch Alsup reports, in
cases where that was cheaper, have two half-pipelined multipliers.
Apparently there are enough applications that require a huge number of multiplications; my guess is that the NSA won't tell us what they are.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Aug 10 18:25:39 2024

On Sat, 10 Aug 2024 6:20:51 +0000, Lawrence D'Oliveiro wrote:

In the early days of the spread of RISC (i.e. the 1980s), much was made of
the analysis of the dynamic execution profiles of actual compiled programs
to see what machine instructions they most frequently used. This then
became the rationale for optimizing common instructions, and even omitting
ones that were not so often used.

One thing these instruction traces would frequently report is that integer
multiply and divide instructions were not so common, and so could be
omitted and emulated in software, with minimal impact on overall
performance. We saw this design decision taken in the early versions of
Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

One of the reasons I like unified register files is that one HAS TO
implement FP MUL and if one has FMUL then one has easy access to IMUL.
Same for FDIV. FCMP is only 12 gates different than ICMP, and one needs
not consume OpCode space with FP LDs and STs.

{{I give MIPS (the company) a pass, here, because their FPU was
on a different chip than the integer stuff.}}

Later, it seems, the CPU designers realized that instruction traces were
not the final word on performance measurements, and started to include hardware integer multiply and divide instructions.

AMD had a HW trace unit which would spew instruction and data addresses
and branch directions to ½ of main memory. This would trace across
user<->OS boundaries so simulation could include everything the chip
was doing--including the damage OS excursions did to caches and TLBs.
While there I extended this to the interconnect and DRAM Bank control.

We had over 1,000 files, 4GB each using about 12-bits per average
instruction capturing SPECint, SPECfp, database, TCPIP, server
workloads,... Using the "server farm" we could run all of them
"overnight".

What we can all agree upon is that user-only instruction tracing is
insufficient for capturing what the chip will be doing, but that capturing
all the instructions across all the privilege levels is sufficient.

(ROMP was also one of those RISC architectures that had delayed branches,
along with MIPS, HP-PA and I think SPARC as well.)

Everybody makes mistakes.

I have heard it said that the RT PC was a poor advertisement for the
benefits of RISC, and the joke was made that “RT” stood for “Reduced Technology”.

People who make mistakes are laughed at--and deservedly so.

Later, of course, IBM more than made good this deficiency with its second
take on RISC, in the form of the POWER architecture, which is still a
performance leader to this day.

Think about how much more market share POWER would have if they had
not crippled it the first go around ?!?!!

From MitchAlsup1@21:1/5 to Anton Ertl on Sat Aug 10 18:33:36 2024

On Sat, 10 Aug 2024 10:18:02 +0000, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

One thing these instruction traces would frequently report is that integer
multiply and divide instructions were not so common, and so could be
omitted and emulated in software, with minimal impact on overall
performance. We saw this design decision taken in the early versions of
Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.

Alpha and IA-64 have no integer division. IIRC IA-64 has no FP
division.

"Stupid is as stupid does" (Forrest Gump).
{{and applicable, too}}

One interesting aspect of RISC-V is that they put multiplication and
division in the same extension (which is included in RV64G, i.e., the
General version of RISC-V).

Later, it seems, the CPU designers realized that instruction traces were
not the final word on performance measurements, and started to include
hardware integer multiply and divide instructions.

When you invest more hardware to increase performance per cycle, at
one point the best return on investment is to have multiplication and
division instructions. What is interesting is that the multipliers
have then soon been fully pipelined.

The MUL unit of Mc88100 was fully pipelined (1985). Integer multiply was
3 cycles, single was 4 cycles, double was 7 IIRC.

Or, as Mitch Alsup reports, in
cases where that was cheaper, have two half-pipelined multipliers.

When the multiplier tree delay is greater than 1 cycle, it becomes
cheaper to have 2×½ multipliers without a stage delay than to have
1 multiplier with 4096 flip-flops in the middle. Where cheaper is
smaller and consumes less power.
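
A toy issue model may make the throughput argument concrete. It is only a sketch under my own assumptions: "half-pipelined" is read as "a unit can accept a new multiply every second cycle", at most one multiply is issued per cycle, and the function name `issue_cycles` is made up for the illustration. It says nothing about area or power.

```python
def issue_cycles(num_ops, units, initiation_interval):
    """Return the start cycle of each multiply in a simple in-order model.

    Assumptions (illustrative only): each unit accepts a new operation
    every `initiation_interval` cycles, and at most one operation is
    issued per cycle, always to the unit that becomes free first.
    """
    next_free = [0] * units          # cycle at which each unit can accept work
    starts = []
    cycle = 0                        # earliest cycle the next op may issue
    for _ in range(num_ops):
        unit = min(range(units), key=lambda i: next_free[i])
        start = max(cycle, next_free[unit])
        starts.append(start)
        next_free[unit] = start + initiation_interval
        cycle = start + 1
    return starts


# One fully pipelined multiplier: a new multiply every cycle.
full = issue_cycles(8, units=1, initiation_interval=1)

# Two half-pipelined multipliers: each takes a new multiply every 2nd cycle.
halves = issue_cycles(8, units=2, initiation_interval=2)

# A single half-pipelined multiplier, for comparison.
single_half = issue_cycles(8, units=1, initiation_interval=2)

print(full)
print(halves)
print(single_half)
```

In this model the two half-pipelined units sustain the same one-multiply-per-cycle issue rate as the single fully pipelined unit, while a lone half-pipelined unit manages only every other cycle.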

Apparently there are enough applications that require a huge number of multiplications; my guess is that the NSA won't tell us what they are.

AES is greatly sped up with a carry-less multiplication, all one has to
do is to deactivate the majority gate in the CAS cell (which adds no
gates of delay or area.)
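
For anyone who has not met carry-less multiplication: it is ordinary shift-and-add multiplication with the adds replaced by XOR, so no carries propagate between bit positions. The sketch below is a plain software model (the name `clmul` is mine), not a description of any particular hardware cell. As far as I know, the place this operation usually shows up around AES is the GHASH step of AES-GCM, and also in CRC computation.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply of two non-negative integers.

    Partial products are combined with XOR instead of ADD, which is
    multiplication of polynomials over GF(2).
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift     # XOR in the shifted partial product
        b >>= 1
        shift += 1
    return result


# (x + 1) * (x + 1) = x^2 + 1 over GF(2): the middle terms cancel.
print(bin(clmul(0b11, 0b11)))
# Ordinary multiplication keeps the carry.
print(bin(0b11 * 0b11))
```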

- anton

From MitchAlsup1@21:1/5 to John Dallman on Sat Aug 10 19:54:50 2024

On Sat, 10 Aug 2024 19:41:00 +0000, John Dallman wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

IIRC IA-64 has no FP division.

You recall correctly. It has an "approximation to reciprocal" instruction,
which gives you about 8 bits of precision, and then requires the compiler
to generate Newton-Raphson sequences. Intel's manual, 2010 edition, says
this is advantageous because users can generate only the precision they
need. Writing Itanium assembler for customised precision? Not many people
would have wanted to do that in 2001, let alone 2010.

In, I think, 1996, my employers had visitors from Intel trying to
persuade us to adopt their C/C++ compiler for IA-32. They had been able
to speed up one of our competitors' code by a factor of two, and hoped to
do the same for us.

They failed. We already had that factor of two, which was "ordinary
compiler optimisation." That competitor had some rather odd coding
standards at the time, which meant most compilers failed if asked to
optimise their code. Someone from Intel had stayed at their site for most
of a year, reporting the bugs and getting them fixed until Intel's
compiler could optimise the code.

While visiting us, Intel asked what may have been a significant question
about the mixture of floating-point arithmetic instructions we used. We
didn't have precise figures, but were sure that we used at least as many
square roots as divides. IA-64 does square roots like divides, with a
starter approximation and Newton-Raphson sequences. Slowly, because the
N-R instructions all depend on the previous instruction, and can't be run
in parallel.

Newton-Raphson has 2 dependent multiplies in a dependent loop.
Goldschmidt is a rearrangement of N-R such that the multiplies are
independent with loop-to-loop dependencies. The way IA-64 did
them it was 8 cycles per loop. Had they been done in function
unit sequencing, instead, the loop would have only been 4 cycles.
Converting to Goldschmidt it would have only been 2.

Goldschmidt does not correct for arithmetic anomalies, whereas
N-R does. Thus IEEE accurate Goldschmidt iterators use N-R as
their last iteration.
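
To show the dependency structure being described, here is a rough floating-point sketch of both iterations. It is an illustration under assumptions, not IA-64 code: `rough_reciprocal` is a made-up stand-in for the hardware seed (it cheats by using a real divide and then throwing away all but about 8 bits), and the cycle counts quoted above are not modelled at all.

```python
def rough_reciprocal(d: float, bits: int = 8) -> float:
    """Stand-in for a hardware reciprocal seed: 1/d kept to ~`bits` bits.

    Assumption for illustration only; real hardware would use a table.
    """
    scale = 2 ** bits
    return round((1.0 / d) * scale) / scale


def newton_raphson_div(n: float, d: float, iterations: int = 3) -> float:
    """n/d via Newton-Raphson refinement of the reciprocal of d."""
    x = rough_reciprocal(d)
    for _ in range(iterations):
        t = d * x              # multiply 1
        x = x * (2.0 - t)      # multiply 2 needs the result of multiply 1
    return n * x


def goldschmidt_div(n: float, d: float, iterations: int = 3) -> float:
    """n/d via Goldschmidt: scale numerator and denominator together."""
    seed = rough_reciprocal(d)
    num = n * seed
    den = d * seed
    for _ in range(iterations):
        f = 2.0 - den
        num = num * f          # these two multiplies do not depend on
        den = den * f          # each other, only on the previous iteration
    return num


print(newton_raphson_div(7.0, 1.5))
print(goldschmidt_div(7.0, 1.5))
```

Each iteration roughly doubles the number of correct bits in both versions. The difference is that the N-R loop body is a chain of two multiplies, while the two Goldschmidt multiplies can be issued side by side. The N-R form recomputes its error term from the original `d` every time, which is the self-correcting property mentioned above; the Goldschmidt form lets rounding errors in `num` and `den` accumulate.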

John

From John Dallman@21:1/5 to Anton Ertl on Sat Aug 10 20:41:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

IIRC IA-64 has no FP division.

You recall correctly. It has an "approximation to reciprocal" instruction, which gives you about 8 bits of precision, and then requires the compiler
to generate Newton-Raphson sequences. Intel's manual, 2010 edition, says
this is advantageous because users can generate only the precision they
need. Writing Itanium assembler for customised precision? Not many people
would have wanted to do that in 2001, let alone 2010.

In, I think, 1996, my employers had visitors from Intel trying to
persuade us to adopt their C/C++ compiler for IA-32. They had been able
to speed up one of our competitors' code by a factor of two, and hoped to
do the same for us.

They failed. We already had that factor of two, which was "ordinary
compiler optimisation." That competitor had some rather odd coding
standards at the time, which meant most compilers failed if asked to
optimise their code. Someone from Intel had stayed at their site for most
of a year, reporting the bugs and getting them fixed until Intel's
compiler could optimise the code.

While visiting us, Intel asked what may have been a significant question
about the mixture of floating-point arithmetic instructions we used. We
didn't have precise figures, but were sure that we used at least as many
square roots as divides. IA-64 does square roots like divides, with a
starter approximation and Newton-Raphson sequences. Slowly, because the
N-R instructions all depend on the previous instruction, and can't be run
in parallel.

John

From MitchAlsup1@21:1/5 to BGB on Sat Aug 10 23:48:28 2024

On Sat, 10 Aug 2024 21:34:47 +0000, BGB wrote:

My rough ranking of instruction probabilities (descending probability,
*):
Load/Store (Constant Displacement, ~30%);
Branch (~ 14% of ops);
ALU, ADD/SUB/AND/OR (~ 13%);
Load/Store (Register Indexed, ~10%);
Compare and Test (~ 6%);
Integer Shift (~ 4%);
Register Move (~ 3%);
Sign/Zero Extension (~ 3%);
ALU, XOR (~ 2%);
Multiply (~ 2%);
...

*: Crude estimate based on categorizing the dynamic execution
probabilities (which are per-instruction rather than by category).

Meanwhile, DIV and friends are generally closer to 0.05% or so...
You can leave them out and hardly anyone will notice.

The literature from the CRAY-1 era indicated big number crunching
applications use FDIV about ¼ as often as FMUL, IDIV not so much.

For the most part, something like RISC-V makes sense, except that
omitting Indexed Load/Store is basically akin to shooting oneself in the
foot (and does result in a significant increase in the amount of Shift
and ADD instructions used).

With RISC-V, one may see ~ 25% Load/Store followed by ~ 20% ADD and 15% Shift, ...

If you add the number of indexed LD/STs in your list above with shifts,
you can find all those missing RISC-V shift instructions.

Some of this is because ADD and Shift end up over-represented by their
need to be used in compound operations (indexed load/store and sign/zero extension).

RISC-V 16-bit smash::

SLLI Rt,Rs,48
SRAI Rt,Rt,48

My 66000

SLA Rt,Rs,<16:0>

Where RISC-V uses the shifter at 48-bits twice, My 66000 only uses
the masking part of the shifter.
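
A small model of the two approaches, for anyone who wants to check the bit patterns. The 64-bit helpers `sll64` and `sra64` are written for this illustration only, and treating the My 66000 form as a single signed extract of the low 16 bits is my reading of the `<16:0>` notation, not something taken from a manual.

```python
MASK64 = (1 << 64) - 1


def sll64(x: int, sh: int) -> int:
    """Logical shift left on a 64-bit register value."""
    return (x << sh) & MASK64


def sra64(x: int, sh: int) -> int:
    """Arithmetic shift right on a 64-bit register value."""
    if x & (1 << 63):          # reinterpret the bit pattern as signed
        x -= 1 << 64
    return (x >> sh) & MASK64  # Python's >> on negatives is arithmetic


def sext16_via_shifts(x: int) -> int:
    """Two full-width shifts: left by 48, then arithmetic right by 48."""
    return sra64(sll64(x, 48), 48)


def sext16_via_extract(x: int) -> int:
    """One signed field extract: take bits 15..0 and sign-extend them."""
    low = x & 0xFFFF
    if low & 0x8000:
        low -= 1 << 16
    return low & MASK64


print(hex(sext16_via_shifts(0x12348001)))
print(hex(sext16_via_extract(0x12348001)))
```

Both produce the same register value; the first spends two shifter passes to get there.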

<snip>

Meanwhile, I am once again reminded of an annoying edge case bug in my Verilog implementation:
If a TLB Miss happens on an inter-ISA branch, it can leave the CPU core
in an inconsistent state.

Woops...

From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Aug 10 23:20:44 2024

On Sat, 10 Aug 2024 17:49:42 -0500, BGB wrote:

Meanwhile, saw a video recently where someone had ported Doom to a 233
MHz PowerPC (running Windows NT4) machine and, its performance was not good...

Not obvious is what combination of factors conspired to cause Doom to apparently run at single-digit framerates.

Windows NT never meant for games?

From John Levine@21:1/5 to [email protected] on Sun Aug 11 01:57:10 2024

It appears that Lawrence D'Oliveiro <[email protected]d> said:

(ROMP was also one of those RISC architectures that had delayed branches,
along with MIPS, HP-PA and I think SPARC as well.)

I have heard it said that the RT PC was a poor advertisement for the
benefits of RISC, and the joke was made that “RT” stood for “Reduced
Technology”.

I worked on AIX for the RT/PC. It was a pretty reasonable chip for the
time, but it suffered greatly from internal IBM political fights which
made it too little too late. AIX ran on top of a bloated virtual
machine which made the whole thing too slow. There was a skunkworks port
of BSD that was supposed to be a lot better.

As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From OrangeFish@21:1/5 to John Levine on Sun Aug 11 09:44:27 2024

On 2024-08-10 21:57, John Levine wrote:

It appears that Lawrence D'Oliveiro <[email protected]d> said:

(ROMP was also one of those RISC architectures that had delayed branches,
along with MIPS, HP-PA and I think SPARC as well.)

I have heard it said that the RT PC was a poor advertisement for the
benefits of RISC, and the joke was made that “RT” stood for “Reduced
Technology”.

I worked on AIX for the RT/PC. It was a pretty reasonable chip for the
time, but it suffered greatly from internal IBM political fights which
made it too little too late. AIX ran on top of a bloated virtual
machine which made the whole thing too slow. There was a skunkworks port
of BSD that was supposed to be a lot better.

As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.

A long time ago, I heard (or maybe read) that the original ROMP was
chopped in half (the FP stuff was removed) by orders of marketing for
some sort of h/w word-processor. When that product bombed and the
workstation market blossomed, the engineers "bolted" the FP stuff back
on. I cannot find the source. Is there any truth to this?

OF

From Anton Ertl@21:1/5 to John Levine on Sun Aug 11 14:44:38 2024

John Levine <[email protected]> writes:

As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.

Delayed branches were put in the first commercial generation of RISCs
(except ARM), which all shipped with caches (except ARM). Delayed
branches are a natural consequence of the 5-stage (Or, in the 88100
case, four-stage) pipeline.

IIRC ARM used a 3-stage implementation for the ARM1/2, which may be a consequence of them rejecting delayed branches; and they did not have
caches, so they could not have made use of the higher clock rate that
a longer pipeline could have afforded. So it seems that the connection
between cache and delayed branches, if there is any, is the opposite
of what you suggest.

Delayed branches provided a speedup on these early 5-stage
implementations. They also provided a big headache for more
sophisticated implementations, and therefore soon fell out of favour.
Power (IIRC) and Alpha don't have delayed branches.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From George Neuner@21:1/5 to BGB on Sun Aug 11 13:27:52 2024

On Sun, 11 Aug 2024 02:41:01 -0500, BGB <[email protected]> wrote:

On 8/10/2024 6:20 PM, Lawrence D'Oliveiro wrote:

On Sat, 10 Aug 2024 17:49:42 -0500, BGB wrote:

Meanwhile, saw a video recently where someone had ported Doom to a 233
MHz PowerPC (running Windows NT4) machine and, its performance was not
good...

Not obvious is what combination of factors conspired to cause Doom to
apparently run at single-digit framerates.

Windows NT never meant for games?

Windows GDI isn't fast, but it isn't usually *that* slow either.
WinQuake and Quake2 both used GDI to good effect.

If Windows GDI performance is seriously broken, this is a different
issue from merely "not meant for games".

GDI was fine for most anything *except* video games.

If you really needed maximum performance, you used DirectX ... and
yes, DirectX was available on NT.

From John Levine@21:1/5 to All on Sun Aug 11 17:55:40 2024

According to OrangeFish <[email protected]d>:

As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.

A long time ago, I heard (or maybe read) that the original ROMP was
chopped in half (the FP stuff was removed) by orders of marketing for
some sort of h/w word-processor. When that product bombed and the
workstation market blossomed, the engineers "bolted" the FP stuff back
on. I cannot find the source. Is there any truth to this?

I also heard that the ROMP was originally intended for some sort of
word processor from the Office Products division (the O in ROMP) and
was repurposed into a workstation.

The RT's floating point was an add-on card with a Natl Semi FPU
that appeared at high memory addresses. I would be surprised if
any of the ROMP's predecessors had hardware FP. The 801 didn't
and it would make no sense for word processing.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From MitchAlsup1@21:1/5 to Anton Ertl on Sun Aug 11 21:09:02 2024

On Sun, 11 Aug 2024 14:44:38 +0000, Anton Ertl wrote:

John Levine <[email protected]> writes:

As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.

Delayed branches were put in the first commercial generation of RISCs
(except ARM), which all shipped with caches (except ARM). Delayed
branches are a natural consequence of the 5-stage (Or, in the 88100
case, four-stage) pipeline.

Delayed branches are wonderful to the pipeline, very much less so for
the architecture overall as it makes wide issue "all that much harder".
It was truly a pain in the ass on Mc88120, a 6-wide machine.

Neither nullification nor inverse nullification helped much and both
hurt at wide issue, too. At least Mc88100 had a bit to indicate
the delay slot was not being used.

Looking back, I wish we had not been forced to do them--I think many
of the 1st generation architects wish similarly. Delayed branches
were supposed to bring a 16% gain in performance. After looking at
the utility rates (slightly less than 50% useful instructions, with
something slightly over 70% fill rate), they only brought 8%-ish.
{{A useful instruction is useful in both taken and non-taken paths.}}
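
A back-of-envelope check of those figures, with the caveat that the post does not say exactly how the 50% and the 70% combine. The sketch below tries two readings; which one matches the original measurement is my guess, and the variable names are made up for the illustration.

```python
promised_gain = 0.16     # gain if every delay slot did useful work
useful_fraction = 0.50   # "slightly less than 50% useful instructions"
fill_rate = 0.70         # "slightly over 70% fill rate"

# Reading A (assumption): about half of all delay slots hold a useful
# instruction, so about half of the promised gain is realised.
gain_a = promised_gain * useful_fraction

# Reading B (assumption): 70% of slots are filled, and only half of the
# filled ones are useful.
gain_b = promised_gain * fill_rate * useful_fraction

print(f"reading A: {gain_a:.1%}")
print(f"reading B: {gain_b:.1%}")
```

Reading A lands on the 8%-ish figure quoted above; reading B comes out lower, so A looks like the intended meaning.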

IIRC ARM used a 3-stage implementation for the ARM1/2, which may be a consequence of them rejecting delayed branches; and they did not have
caches, so they could not have made use of the higher clock rate that
a longer pipeline could have afforded. So it seems that the connection
between cache and delayed branches, if there is any, is the opposite
of what you suggest.

Delayed branches provided a speedup on these early 5-stage
implementations. They also provided a big headache for more
sophisticated implementations, and therefore soon fell out of favour.

Much like virtual caches...

The only thing that has persisted is LDs being longer than 2 cycles.
Squashing {forward, ADD, SRAM, LDalign} into 2 cycles is proving
to be a frequency headache in the simpler RISC-V implementations
even now. With wires getting slower and gates getting faster, that
trade off is getting worse. Many of the Intel x86s use 4 cycle LDs.
{the cost of frequency is efficiency}

Power (IIRC) and Alpha don't have delayed branches.

None of the modern RISCs have them either.

- anton

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Aug 11 23:07:03 2024

On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:

Power (IIRC) and Alpha don't have delayed branches.

Not only does POWER not have delayed branches, but I recall the IBM folks claiming in the initial publicity that branches could often execute in
zero clock cycles--that is, fully overlapped with surrounding
instructions.

POWER was also “superscalar” (being able to execute more than one
operation per clock cycle) right from the beginning. Not sure if other
RISC architectures of the time were like that. I don’t think Alpha was:
one thing I remember from its early descriptions was its use of very high
clock speeds. That seemed to me to be the opposite of “(at least) one instruction per clock cycle”, which I thought was supposed to be one of
the defining features of RISC.

From Lawrence D'Oliveiro@21:1/5 to All on Sun Aug 11 23:08:18 2024

On Sun, 11 Aug 2024 21:09:02 +0000, MitchAlsup1 wrote:

Delayed branches are wonderful to the pipeline ...

They also persisted for a few years longer in DSP architectures, until the capabilities of these were (mostly) subsumed into general-purpose CPUs.

From Josh Vanderhoof@21:1/5 to BGB on Sun Aug 11 21:08:44 2024

BGB <[email protected]> writes:

Meanwhile, saw a video recently where someone had ported Doom to a 233
MHz PowerPC (running Windows NT4) machine and, its performance was not good...

Not obvious is what combination of factors conspired to cause Doom to apparently run at single-digit framerates.

Video mentioned that it was drawing using GDI calls, but this by
itself wouldn't explain the level of slowness seen in the video.

Like, presumably, this would require around 90% + of the clock cycles
going into overhead, which seems a bit much.

Reference:
https://www.youtube.com/watch?v=LAkSJ-HqKw8

I think NT4 was before CreateDIBSection existed. Probably much slower
without that.

From Lynn Wheeler@21:1/5 to John Levine on Sun Aug 11 16:58:03 2024

John Levine <[email protected]> writes:

I also heard that the ROMP was originally intended for some sort of
word processor from the Office Products division (the O in ROMP) and
was repurposed into a workstation.

late 70s, circa 1980 ... there were efforts to convert myriad of CISC microprocessors to 801/RISC microprocessors ... Iliad for low/mid range
370s (4361/4381 following on to 4331/4341), ROMP (research/office
products) for Displaywriter follow-on (with CP.r operating system
implemented in PL.8), also AS/400 follow-on to S/38. For various reasons
these efforts floundered and some of the 801/RISC engineers left IBM for
other vendors.

I helped with white paper that showed that nearly whole 370 could be
implemented directly in circuits (much more efficient than microcode) for
4361/4381. AS/400 returned to CISC microprocessor. The follow-on to
displaywriter was canceled (most of that market moving to IBM/PC and
other personal computers).

Austin group decided to pivot ROMP to Unix workstation market and got
the company that had done AT&T UNIX port to IBM/PC as PC/IX to do one
for ROMP (AIX, possibly "Austin IX" for PC/RT). They also had to do
something with the 200 or so PL.8 programmers and decided to use them to implement an "abstract" virtual machine as "VRM" ... telling the company
doing the UNIX port that it would be much more efficient and timely for
them to implement to the VRM interface (rather than bare hardware).
Besides other issues with that claim, it introduced new problem that new
device drivers had to be done twice, one in "C" for the unix/AIX layer
and then in "PL.8" for the VRM.

Palo Alto was working on a port of UCB BSD to 370 and got redirected to
port to the PC/RT ... they demonstrated that they did the BSD port to
ROMP directly ... with much less effort than either the VRM
implementation or the AIX implementation ... released as "AOS".

trivia: early 80s 1) IBM Los Gatos lab was working on single chip "Blue
Iliad", 1st 32bit 801/RISC, really hot, single large chip that never
quite came to fruition and 2) IBM Boeblingen lab had done ROMAN, 3chip
370 implementation (with performance of 370/168). I had proposal to see
how many chips I could cram into single rack (either "Blue Iliad" or
ROMAN or combination of both).

While AS/400 1st reverted to CISC chip ... later in the 90s, out of the Somerset (AIM, apple, ibm, motorola) single-chip power/pc ... they got a 801/RISC chip to move to.

--
virtualization experience starting Jan1968, online at home since Mar1970

From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 08:10:41 2024

Lawrence D'Oliveiro wrote:

On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:

Power (IIRC) and Alpha don't have delayed branches.

Not only does POWER not have delayed branches, but I recall the IBM folks claiming in the initial publicity that branches could often execute in
zero clock cycles--that is, fully overlapped with surrounding
instructions.

Afair, the original POWER had 3 chips, with branches in a separate unit
from integer/logic ops, right?

The idea, as presented in that month's BYTE magazine was that the entire latency of transferring comparison flags over to the branch unit, select
the corresponding direction and then transmit the resulting IP back to
the fetch unit would happen fast enough that those offchip latencies
would not matter.

It also had multiple (8?) sets of compare result flags in order to avoid
making them a speed limiter.

POWER was also “superscalar” (being able to execute more than one operation per clock cycle) right from the beginning. Not sure if other
RISC architectures of the time were like that. I don’t think Alpha was:
one thing I remember from its early descriptions was its use of very high clock speeds. That seemed to me to be the opposite of “(at least) one instruction per clock cycle”, which I thought was supposed to be one of
the defining features of RISC.

Yeah, the R part was intended to make latency a single cycle for _most_ instructions.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 05:29:29 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:

Power (IIRC) and Alpha don't have delayed branches.

Not only does POWER not have delayed branches, but I recall the IBM folks
claiming in the initial publicity that branches could often execute in
zero clock cycles--that is, fully overlapped with surrounding
instructions.

Yes. Power has a sophisticated branch unit for its time.

POWER was also "superscalar" (being able to execute more than one
operation per clock cycle) right from the beginning. Not sure if other
RISC architectures of the time were like that. I don’t think Alpha was:
one thing I remember from its early descriptions was its use of very high
clock speeds.

Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1
load/store unit, don't remember if the branch unit could run in
parallel to the ALU; I think not). And it has very high clock speeds
for its time; it appeared with 150MHz while the competition was like
50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not
superscalar), or, for Power, 62.5MHz in the POWER1++. But POWER1
(without ++) preceded the 21064 by 2 years.

That seemed to me to be the opposite of "(at least) one
instruction per clock cycle", which I thought was supposed to be one of
the defining features of RISC.

A peak (i.e., guaranteed not to be exceeded) performance of 1 IPC was
a goal in early RISCs. I guess you could construct a program that has
1 IPC on the MIPS R2000 and the first SPARC, but for useful code the
IPC on these early RISC's is lower. Likewise for the 21064, the peak performance is 2 IPC, but performance on useful code is usually lower.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Aug 12 06:33:17 2024

On Mon, 12 Aug 2024 05:29:29 GMT, Anton Ertl wrote:

Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1 load/store unit, don't remember if the branch unit could run in parallel
to the ALU; I think not). And it has very high clock speeds for its
time; it appeared with 150MHz while the competition was like 50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not superscalar), or,
for Power, 62.5MHz in the POWER1++. But POWER1 (without ++) preceded
the 21064 by 2 years.

But in spite of having, say, 2½ times the clock speed of POWER, Alpha was
not 2½ times faster, was it?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Josh Vanderhoof on Mon Aug 12 10:56:07 2024

On Sun, 11 Aug 2024 21:08:44 -0400
Josh Vanderhoof <[email protected]> wrote:

BGB <[email protected]> writes:

Meanwhile, saw a video recently where someone had ported Doom to a
233 MHz PowerPC machine (running Windows NT4), and its performance
was not good...

Not obvious is what combination of factors conspired to cause Doom
to apparently run at single-digit framerates.

Video mentioned that it was drawing using GDI calls, but this by
itself wouldn't explain the level of slowness seen in the video.

Like, presumably, this would require around 90% + of the clock
cycles going into overhead, which seems a bit much.

Reference:
https://www.youtube.com/watch?v=LAkSJ-HqKw8

I think NT4 was before CreateDIBSection existed. Probably much slower without that.

NT4 has CreateDIBSection.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 11:09:18 2024

On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Mon, 12 Aug 2024 05:29:29 GMT, Anton Ertl wrote:

Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1 load/store unit, don't remember if the branch unit could run in
parallel to the ALU; I think not). And it has very high clock
speeds for its time; it appeared with 150MHz while the competition
was like 50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not superscalar), or, for Power, 62.5MHz in the POWER1++. But POWER1
(without ++) preceded the 21064 by 2 years.

But in spite of having, say, 2½ times the clock speed of POWER, Alpha
was not 2½ times faster, was it?

Of course not.
But the Alpha EV4 was a single chip vs. multiple chips in POWER1 or the
3 chips of contemporary PA-RISC.
A more relevant comparison is EV4 vs. IBM RSC: https://en.wikipedia.org/wiki/RISC_Single_Chip
I think that EV4 was 3-5 times faster than RSC.

Back in 1992-1993 I was not impressed by the speed of the RS/6000 model 220
relative to i486 PCs. Frankly, the 220 was running a much heavier software
stack.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Aug 12 08:42:51 2024

On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:

On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

But in spite of having, say, 2½ times the clock speed of POWER, Alpha
was not 2½ times faster, was it?

Of course not.

That’s what I mean: it took several clock cycles per instruction, contrary
to just about every other RISC architecture.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Terje Mathisen on Mon Aug 12 11:10:45 2024

Terje Mathisen <[email protected]> writes:

Lawrence D'Oliveiro wrote:

On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:
Power (IIRC) and Alpha don't have delayed branches.

Not only does POWER not have delayed branches, but I recall the IBM folks
claiming in the initial publicity that branches could often execute in
zero clock cycles--that is, fully overlapped with surrounding
instructions.

Afair, the original POWER had 3 chips, with branches in a separate unit
from integer/logic ops, right?

Looking at <https://en.wikipedia.org/wiki/POWER1>, the RIOS-1
configuration has 9 chips: ICU, FPU, FXU (integer unit), SCU (storage
control), 4xDCU (data cache), I/O Unit. The RIOS-9 configuration has
only 2 DCUs (7 chips total).

It also had multiple (8?) sets of compare result flags in order to avoid
making them a speed limiter.

I wonder how much that is used. There is only one carry bit.

Yeah, the R part was intended to make latency a single cycle for _most_
instructions.

It was mainly meant to increase the *throughput* to one instruction
per cycle; that includes instructions like loads that have a
latency > 1 cycle.
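
A minimal sketch of that distinction, under the simplifying assumption of an ideal in-order pipeline with no stalls other than the ones modelled (the function names are hypothetical): independent operations overlap, so total time is dominated by the issue rate, while a chain of dependent operations pays the full latency every time.

```python
def independent_ops_cycles(n_ops: int, latency: int, issue_interval: int = 1) -> int:
    """Cycles until the last of n_ops independent operations completes.

    A new operation starts every issue_interval cycles, and each one
    takes latency cycles from start to result.
    """
    if n_ops <= 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops operations where each needs the previous result."""
    if n_ops <= 0:
        return 0
    return n_ops * latency


# 100 loads with a latency of 3 cycles, one issued per cycle.
overlapped = independent_ops_cycles(100, latency=3)
# The same 100 loads when each address depends on the previous load.
serialized = dependent_chain_cycles(100, latency=3)
```

In the overlapped case the throughput approaches one instruction per cycle even though no single load finishes in one cycle.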

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Aug 12 18:14:53 2024

On Mon, 12 Aug 2024 08:42:51 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:

On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

But in spite of having, say, 2½ times the clock speed of POWER,
Alpha was not 2½ times faster, was it?

Of course not.

That’s what I mean: it took several clock cycles per instruction,
contrary to just about every other RISC architecture.

On EV4, simple ALU instructions took 1 cycle, both for throughput and
for latency.
Shifts and conditional moves had a latency of 2 and a throughput of 1.
The integer multiplier was not pipelined, but a few other RISCs also had
non-pipelined multipliers. Latency of the integer multiplier was 19-21 cycles.
On the FP side, both FADD and FMUL were fully pipelined (T=1) and had a
latency of 6 cycles.
L1D cache hits were fully pipelined (T=1) and had a latency of 3 cycles.

So, as long as code/data fit in the L1 cache, EV4 IPC was not
far behind the competition. Relative to the MIPS R4K, it may even have
been ahead.

Of course, cache misses were relatively more expensive than for the much
lower-clocked competitors. DEC's solution to that was a wide and fast
system bus.
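
To see why a miss costs more at a higher clock: for a fixed memory latency in nanoseconds, the penalty measured in CPU cycles scales with the clock frequency. The sketch below is only illustrative. The 200 ns memory latency is a made-up figure, not a number for any of the machines discussed, and the function name is hypothetical.

```python
import math


def miss_penalty_cycles(memory_latency_ns: float, clock_mhz: float) -> int:
    """CPU cycles lost to one miss, rounded up to whole cycles.

    cycles = latency_ns / cycle_time_ns, with cycle_time_ns = 1000 / clock_mhz.
    """
    return math.ceil(memory_latency_ns * clock_mhz / 1000)


ASSUMED_MEMORY_LATENCY_NS = 200  # hypothetical, for illustration only

# The same memory latency seen from a 150 MHz and a 50 MHz CPU.
penalty_150mhz = miss_penalty_cycles(ASSUMED_MEMORY_LATENCY_NS, 150)
penalty_50mhz = miss_penalty_cycles(ASSUMED_MEMORY_LATENCY_NS, 50)
```

With the same memory behind both, the 150 MHz part gives up three times as many cycles per miss as the 50 MHz part.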

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Mon Aug 12 17:32:09 2024

On Mon, 12 Aug 2024 15:14:53 +0000, Michael S wrote:

On Mon, 12 Aug 2024 08:42:51 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:

On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

But in spite of having, say, 2½ times the clock speed of POWER,
Alpha was not 2½ times faster, was it?

Of course not.

That’s what I mean: it took several clock cycles per instruction,
contrary to just about every other RISC architecture.

On EV4, simple ALU instructions took 1 cycle, both for throughput and
for latency.
Shifts and conditional moves had a latency of 2 and a throughput of 1.
The integer multiplier was not pipelined, but a few other RISCs also had
non-pipelined multipliers.

Mc88100 had a pipelined multiplier; you could start an int mul
every cycle, or a single mul every cycle, or a double mul every 4
cycles.

Latency of integer multiplier was 19-21 cycles.

3 cycles for Mc88100

On FP side both FADD and FMUL were fully pipelined (T=1) and had
latency of 6 cycles.
L1D cache hits were fully pipelined (T=1) and had latency of 3 cycles.

So, as long as code/data fit in the L1 cache, EV4 IPC was not
far behind the competition. Relative to the MIPS R4K, it may even have
been ahead.

Of course, cache misses were relatively more expensive than for the much
lower-clocked competitors. DEC's solution to that was a wide and fast
system bus.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
