In the early days of the spread of RISC (i.e. the 1980s), much was made of
the analysis of the dynamic execution profiles of actual compiled programs
to see what machine instructions they most frequently used. This then
became the rationale for optimizing common instructions, and even omitting
ones that were not so often used.
One thing these instruction traces would frequently report is that integer
multiply and divide instructions were not so common, and so could be
omitted and emulated in software, with minimal impact on overall
performance. We saw this design decision taken in the early versions of
Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.
Later, it seems, the CPU designers realized that instruction traces were
not the final word on performance measurements, and started to include
hardware integer multiply and divide instructions.
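To make "emulated in software" concrete, here is a minimal sketch of the classic shift-and-add multiply and restoring divide that a runtime library might fall back on when the hardware has no such instructions. It is not taken from any actual SPARC or ROMP library; the function names and the 32-bit default width are assumptions for illustration.

```python
def soft_mul(a: int, b: int, bits: int = 32) -> int:
    """Unsigned shift-and-add multiply, result truncated to `bits` bits."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    result = 0
    while b:
        if b & 1:                      # add the shifted multiplicand for each set bit
            result = (result + a) & mask
        a = (a << 1) & mask            # next power-of-two multiple
        b >>= 1
    return result


def soft_divmod(n: int, d: int, bits: int = 32) -> tuple:
    """Unsigned restoring division: produces one quotient bit per iteration."""
    mask = (1 << bits) - 1
    n &= mask
    d &= mask
    if d == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # bring down the next dividend bit
        remainder = (remainder << 1) | ((n >> i) & 1)
        if remainder >= d:             # trial subtraction succeeds
            remainder -= d
            quotient |= 1 << i
    return quotient, remainder
```

The divide loop always runs `bits` iterations, each with a compare and a conditional subtract, which gives a feel for why a program that divides often would notice the missing instruction.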
(ROMP was also one of those RISC architectures that had delayed branches,
along with MIPS, HP-PA and I think SPARC as well.)
I have heard it said that the RT PC was a poor advertisement for the
benefits of RISC, and the joke was made that “RT” stood for “Reduced
Technology”.
Later, of course, IBM more than made good this deficiency with its second
take on RISC, in the form of the POWER architecture, which is still a
performance leader to this day.
Lawrence D'Oliveiro <[email protected]d> writes:
One thing these instruction traces would frequently report is that integer
multiply and divide instructions were not so common, and so could be
omitted and emulated in software, with minimal impact on overall
performance. We saw this design decision taken in the early versions of
Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.
Alpha and IA-64 have no integer division. IIRC IA-64 has no FP
division.
One interesting aspect of RISC-V is that they put multiplication and
division in the same extension (which is included in RV64G, i.e., the
General version of RISC-V).
Later, it seems, the CPU designers realized that instruction traces were
not the final word on performance measurements, and started to include
hardware integer multiply and divide instructions.
When you invest more hardware to increase performance per cycle, at
some point the best return on investment is to have multiplication and
division instructions. What is interesting is that the multipliers
then soon became fully pipelined. Or, as Mitch Alsup reports, in
cases where that was cheaper, designs had two half-pipelined multipliers.
Apparently there are enough applications that require a huge number of
multiplications; my guess is that the NSA won't tell us what they are.
- anton
In article <[email protected]>, [email protected] (Anton Ertl) wrote:
IIRC IA-64 has no FP division.
You recall correctly. It has an "approximation to reciprocal" instruction,
which gives you about 8 bits of precision, and then requires the compiler
to generate Newton-Raphson sequences. Intel's manual, 2010 edition, says
this is advantageous because users can generate only the precision they
need. Writing Itanium assembler for customised precision? Not many people
would have wanted to do that in 2001, let alone 2010.
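As a rough illustration of the scheme described above, the following sketch refines a reciprocal seed of about 8 bits with Newton-Raphson steps, x' = x * (2 - d * x), each of which roughly doubles the number of correct bits. The helper `approx_recip` is hypothetical: it fakes the hardware seed by rounding the true reciprocal to 8 significant bits. Real IA-64 sequences use fused multiply-add and handle special cases, which this does not.

```python
import math


def approx_recip(d: float, bits: int = 8) -> float:
    """Hypothetical stand-in for a hardware reciprocal seed (~`bits` bits)."""
    m, e = math.frexp(1.0 / d)         # 1/d = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)


def nr_divide(a: float, d: float, steps: int = 3) -> float:
    """Compute a / d as a * (1/d), refining the seed with Newton-Raphson."""
    x = approx_recip(d)
    for _ in range(steps):
        x = x * (2.0 - d * x)          # error is roughly squared each step
    return a * x
```

With an 8-bit seed, three steps (8, 16, 32, 64 bits) are enough for double precision in this model, while a caller that only needs about 16 bits could stop after one. That is the "generate only the precision you need" argument in miniature.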
In, I think, 1996, my employers had visitors from Intel trying to
persuade us to adopt their C/C++ compiler for IA-32. They had been able
to speed up one of our competitors' code by a factor of two, and hoped to
do the same for us.
They failed. We already had that factor of two, which was "ordinary
compiler optimisation." That competitor had some rather odd coding
standards at the time, which meant most compilers failed if asked to
optimise their code. Someone from Intel had stayed at their site for most
of a year, reporting the bugs and getting them fixed until Intel's
compiler could optimise the code.
While visiting us, Intel asked what may have been a significant question about the mixture of floating-point arithmetic instructions we used. We didn't have precise figures, but were sure that we used at least as many square roots as divides. IA-64 does square roots like divides, with a
starter approximation and Newton-Raphson sequences. Slowly, because the
N-R instructions all depend on the previous instruction, and can't be
run in parallel.
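The serial dependence can be seen in a small sketch. It assumes the starter approximation is of 1/sqrt(a) and uses the usual iteration y' = y * (1.5 - 0.5 * a * y * y); the seed helper is hypothetical and simply rounds the true value to 8 bits. The second return value counts the operations that each have to wait for the previous result.

```python
import math


def approx_rsqrt(a: float, bits: int = 8) -> float:
    """Hypothetical stand-in for a hardware 1/sqrt(a) seed (~`bits` bits)."""
    m, e = math.frexp(1.0 / math.sqrt(a))
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)


def nr_sqrt(a: float, steps: int = 3):
    """Return (sqrt(a), length of the dependent operation chain)."""
    h = 0.5 * a                 # independent of the iteration, computed once
    y = approx_rsqrt(a)
    depth = 0
    for _ in range(steps):
        t = y * y               # needs y from the previous iteration
        t = h * t               # needs t
        t = 1.5 - t             # needs t
        y = y * t               # needs t; the next iteration needs this y
        depth += 4
    return a * y, depth + 1     # final multiply turns 1/sqrt(a) into sqrt(a)
```

In this model a single square root is a chain of 13 dependent operations, so extra functional units do not shorten it; only independent square roots can be overlapped.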
John
IIRC IA-64 has no FP division.
My rough ranking of instruction probabilities (descending probability, *):
Load/Store (Constant Displacement, ~30%);
Branch (~ 14% of ops);
ALU, ADD/SUB/AND/OR (~ 13%);
Load/Store (Register Indexed, ~10%);
Compare and Test (~ 6%);
Integer Shift (~ 4%);
Register Move (~ 3%);
Sign/Zero Extension (~ 3%);
ALU, XOR (~ 2%);
Multiply (~ 2%);
...
*: Crude estimate based on categorizing the dynamic execution
probabilities (which are per-instruction rather than by category).
Meanwhile, DIV and friends are generally closer to 0.05% or so...
You can leave them out and hardly anyone will notice.
For the most part, something like RISC-V makes sense, except that
omitting Indexed Load/Store is basically akin to shooting oneself in the
foot (and does result in a significant increase in the number of Shift
and ADD instructions used).
With RISC-V, one may see ~ 25% Load/Store followed by ~ 20% ADD and 15% Shift, ...
Some of this is because ADD and Shift end up over-represented by their
need to be used in compound operations (indexed load/store and sign/zero extension).
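A toy model of why this happens: with a register-indexed addressing mode an array element load is one instruction, while with only base+displacement loads the same access needs a shift and an add first. The function names and the dictionary-as-memory are assumptions for illustration, and the instruction counts are for a plain base ISA without any address-generation extension.

```python
def load_indexed(mem: dict, base: int, index: int, scale_log2: int):
    """ISA with register-indexed load: address formed inside the load."""
    value = mem[base + (index << scale_log2)]   # 1: indexed load
    return value, 1


def load_base_only(mem: dict, base: int, index: int, scale_log2: int):
    """ISA with only base+displacement loads: shift, add, then load."""
    t = index << scale_log2                     # 1: shift left (scale index)
    t = base + t                                # 2: add base address
    value = mem[t]                              # 3: load with zero displacement
    return value, 3


# Byte-addressed toy memory holding eight 4-byte words at 0x1000.
memory = {0x1000 + 4 * i: i * i for i in range(8)}
```

Both functions return the same value for the same element; only the instruction count differs, and the two extra instructions land in the Shift and ADD categories.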
Meanwhile, I am once again reminded of an annoying edge case bug in my Verilog implementation:
If a TLB Miss happens on an inter-ISA branch, it can leave the CPU core
in an inconsistent state.
Meanwhile, saw a video recently where someone had ported Doom to a 233
MHz PowerPC (running Windows NT4) machine, and its performance was not good...
Not obvious is what combination of factors conspired to cause Doom to apparently run at single-digit framerates.
It appears that Lawrence D'Oliveiro <[email protected]d> said:
(ROMP was also one of those RISC architectures that had delayed branches,
along with MIPS, HP-PA and I think SPARC as well.)
I have heard it said that the RT PC was a poor advertisement for the
benefits of RISC, and the joke was made that “RT” stood for “Reduced
Technology”.
I worked on AIX for the RT/PC. It was a pretty reasonable chip for the
time, but it suffered greatly from internal IBM political fights which
made it too little too late. AIX ran on top of a bloated virtual
machine which made the whole thing too slow. There was a skunkworks port
of BSD that was supposed to be a lot better.
As for the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation,
but that time came and went by the time the RT shipped.
On 8/10/2024 6:20 PM, Lawrence D'Oliveiro wrote:
On Sat, 10 Aug 2024 17:49:42 -0500, BGB wrote:
Meanwhile, saw a video recently where someone had ported Doom to a 233
MHz PowerPC (running Windows NT4) machine and, its performance was not
good...
Not obvious is what combination of factors conspired to cause Doom to
apparently run at single-digit framerates.
Windows NT never meant for games?
Windows GDI isn't fast, but it isn't usually *that* slow either.
WinQuake and Quake2 both used GDI to good effect.
If Windows GDI performance is seriously broken, this is a different
issue from merely "not meant for games".
As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.
A long time ago, I heard (or maybe read) that the original ROMP was
chopped in half (the FP stuff was removed) by orders of marketing for
some sort of h/w word-processor. When that product bombed and the
workstation market blossomed, the engineers "bolted" the FP stuff back
on. I cannot find the source. Is there any truth to this?
John Levine <[email protected]> writes:
As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.
Delayed branches were put in the first commercial generation of RISCs
(except ARM), which all shipped with caches (except ARM). Delayed
branches are a natural consequence of the 5-stage (or, in the 88100
case, four-stage) pipeline.
IIRC ARM used a 3-stage implementation for the ARM1/2, which may be a
consequence of them rejecting delayed branches; and they did not have
caches, so they could not have made use of the higher clock rate that
a longer pipeline could have afforded. So it seems that the connection
between cache and delayed branches, if there is any, is the opposite
of what you suggest.
Delayed branches provided a speedup on these early 5-stage
implementations. They also provided a big headache for more
sophisticated implementations, and therefore soon fell out of favour.
Power (IIRC) and Alpha don't have delayed branches.
- anton
Power (IIRC) and Alpha don't have delayed branches.
Delayed branches are wonderful to the pipeline ...
Meanwhile, saw a video recently where someone had ported Doom to a 233
MHz PowerPC (running Windows NT4) machine and, its performance was not good...
Not obvious is what combination of factors conspired to cause Doom to apparently run at single-digit framerates.
Video mentioned that it was drawing using GDI calls, but this by
itself wouldn't explain the level of slowness seen in the video.
Like, presumably, this would require around 90% + of the clock cycles
going into overhead, which seems a bit much.
Reference:
https://www.youtube.com/watch?v=LAkSJ-HqKw8
I also heard that the ROMP was originally intended for some sort of
word processor from the Office Products division (the O in ROMP) and
was repurposed into a workstation.
On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:
Power (IIRC) and Alpha don't have delayed branches.
Not only does POWER not have delayed branches, but I recall the IBM folks claiming in the initial publicity that branches could often execute in
zero clock cycles--that is, fully overlapped with surrounding
instructions.
POWER was also “superscalar” (being able to execute more than one operation per clock cycle) right from the beginning. Not sure if other
RISC architectures of the time were like that. I don’t think Alpha was:
one thing I remember from its early descriptions was its use of very high clock speeds. That seemed to me to be the opposite of “(at least) one instruction per clock cycle”, which I thought was supposed to be one of
the defining features of RISC.
On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:
Power (IIRC) and Alpha don't have delayed branches.
Not only does POWER not have delayed branches, but I recall the IBM folks
claiming in the initial publicity that branches could often execute in
zero clock cycles--that is, fully overlapped with surrounding
instructions.
POWER was also "superscalar" (being able to execute more than one
operation per clock cycle) right from the beginning. Not sure if other
RISC architectures of the time were like that. I don’t think Alpha was:
one thing I remember from its early descriptions was its use of very high
clock speeds.
That seemed to me to be the opposite of "(at least) one
instruction per clock cycle", which I thought was supposed to be one of
the defining features of RISC.
Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1 load/store unit, don't remember if the branch unit could run in parallel
to the ALU; I think not). And it has very high clock speeds for its
time; it appeared with 150MHz while the competition was like 50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not superscalar), or,
for Power, 62.5MHz in the POWER1++. But POWER1 (without ++) preceded
the 21064 by 2 years.
BGB <[email protected]> writes:
Meanwhile, saw a video recently where someone had ported Doom to a
233 MHz PowerPC (running Windows NT4) machine and, its performance
was not good...
Not obvious is what combination of factors conspired to cause Doom
to apparently run at single-digit framerates.
Video mentioned that it was drawing using GDI calls, but this by
itself wouldn't explain the level of slowness seen in the video.
Like, presumably, this would require around 90% + of the clock
cycles going into overhead, which seems a bit much.
Reference:
https://www.youtube.com/watch?v=LAkSJ-HqKw8
I think NT4 was before CreateDIBSection existed. Probably much slower without that.
On Mon, 12 Aug 2024 05:29:29 GMT, Anton Ertl wrote:
Already the 21064 is two-wide superscalar (1 integer unit, 1 FPU, 1 load/store unit, don't remember if the branch unit could run in
parallel to the ALU; I think not). And it has very high clock
speeds for its time; it appeared with 150MHz while the competition
was like 50MHz (SuperSPARC, superscalar) to 100MHz (MIPS R4000, not superscalar), or, for Power, 62.5MHz in the POWER1++. But POWER1
(without ++) preceded the 21064 by 2 years.
But in spite of having, say, 2½ times the clock speed of POWER, Alpha
was not 2½ times faster, was it?
On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
But in spite of having, say, 2½ times the clock speed of POWER, Alpha
was not 2½ times faster, was it?
Of course not.
Lawrence D'Oliveiro wrote:
On Sun, 11 Aug 2024 14:44:38 GMT, Anton Ertl wrote:
Power (IIRC) and Alpha don't have delayed branches.
Not only does POWER not have delayed branches, but I recall the IBM folks
claiming in the initial publicity that branches could often execute in
zero clock cycles--that is, fully overlapped with surrounding
instructions.
Afair, the original POWER had 3 chips, with branches in a separate unit
from integer/logic ops, right?
It also had multiple (8?) sets of compare result flags in order to avoid
making them a speed limiter.
Yeah, the R part was intended to make latency a single cycle for _most_
instructions.
On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:
On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
But in spite of having, say, 2½ times the clock speed of POWER,
Alpha was not 2½ times faster, was it?
Of course not.
That’s what I mean: it took several clock cycles per instruction,
contrary to just about every other RISC architecture.
On Mon, 12 Aug 2024 08:42:51 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Mon, 12 Aug 2024 11:09:18 +0300, Michael S wrote:
On Mon, 12 Aug 2024 06:33:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
But in spite of having, say, 2½ times the clock speed of POWER,
Alpha was not 2½ times faster, was it?
Of course not.
That’s what I mean: it took several clock cycles per instruction,
contrary to just about every other RISC architecture.
On EV4, simple ALU instructions took 1 cycle, both for throughput and
for latency.
Shifts and conditional moves had a latency of 2 and a throughput of 1.
The integer multiplier was not pipelined, but a few other RISCs also had
non-pipelined multipliers.
Latency of the integer multiplier was 19-21 cycles.
On the FP side, both FADD and FMUL were fully pipelined (T=1) and had a
latency of 6 cycles.
L1D cache hits were fully pipelined (T=1) and had a latency of 3 cycles.
So, as long as code/data fit in the L1 cache, EV4 IPC was not
far behind the competition. Relative to the MIPS R4K, it was maybe even ahead.
Of course, cache misses were relatively more expensive than for much
lower-clocked competitors. DEC's solution to that was a wide and fast
system bus.