Forum: >>> Magnum BBS <<<

88xxx or PPC

From Brett@21:1/5 to [email protected] on Sat Mar 2 23:28:41 2024

MitchAlsup1 <[email protected]> wrote:

Paul A. Clayton wrote:

On 1/25/24 10:22 AM, Anton Ertl wrote:
[snip]

I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes to a
functional unit earlier than an architecturally earlier instruction).
Execution can overlap.

What about a skewed pipeline? A simple skewed pipeline that
statically assigned operations to a pipeline-stage/execution unit
has been called in-order (in what I have read). A "second-chance"
pipeline (where many operations can dynamically choose the
pipeline stage based on operand availability) involves dynamic
scheduling (so would seem to fall into out-of-order), but
counterflow pipelines ("Counterflow Pipeline Processor
Architecture", Robert F. Sproull et al., 1994) — which are more
extreme in some ways than pipelines that have two stages in which
operations can start — are stated to have "No overtaking.
Instructions must stay in program order in the instruction
pipeline.", which sounds "in-order" (and the paper was written by
people working at Sun Microsystems).

(I thought counterflow pipelines were weird. Simplifying
communication makes sense, but ...)

I get the impression that early PowerPC out-of-order execution
implementations were really very similar to using the forwarding
network for out-of-order completion (with in-order writeback). If
I recall correctly, renaming was done by appending a version to
the architectural register name and operands would be captured as
soon as they were available rather than passing along the pipeline
with forwarding until the writeback stage.

This sounds more like Mc 88110 rather than PPC 620.

PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
die area. Other things may have been jettisoned at this shrink of design
point. The 620 was originally targeted to be equal to Mc 88120 which was
a 6-wide GBOoO machine full Tomasulo with precise exceptions and 4 external
busses named {Data Out, Data In, Address Out, Address In}

Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops

Smart externals could use Data In to send the CPU data before it knew it needed it. That data could be code or data.

So which was better, your baby the 88xxx or PPC?
Pick your metric: die size, performance, heat, upgrade path, other.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Brett on Sat Mar 2 23:51:20 2024

Brett wrote:

MitchAlsup1 <[email protected]> wrote:

This sounds more like Mc 88110 rather than PPC 620.

PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
die area. Other things may have been jettisoned at this shrink of design
point. The 620 was originally targeted to be equal to Mc 88120 which was
a 6-wide GBOoO machine full Tomasulo with precise exceptions and 4 external
busses named {Data Out, Data In, Address Out, Address In}

Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops

Smart externals could use Data In to send the CPU data before it knew it
needed it. That data could be code or data.

So which was better, your baby the 88xxx or PPC?
Pick your metric: die size, performance, heat, upgrade path, other.

I am yet to see a CPU with the performance we got on the 88120 simulator.
In 1992 we were getting 2.0 IPC on things like XLISP and 5.9 IPC on
Matrix300.


From Anton Ertl@21:1/5 to [email protected] on Sun Mar 3 08:21:58 2024

[email protected] (MitchAlsup1) writes:

Brett wrote:

So which was better, your baby the 88xxx or PPC?
Pick your metric: die size, performance, heat, upgrade path, other.

If the metric is availability, the PPC 620 was weak (only produced in
small numbers, because of disappointing performance for when it
finally appeared), but the 88120 is weaker (never produced).

I am yet to see a CPU with the performance we got on the 88120 simulator.
In 1992 we were getting 2.0 IPC on things like XLISP and 5.9 IPC on
Matrix300.

I have recently measured actual hardware, using performance counters
on a number of Forth programs running on Gforth (matrix is based on
intmm in the Stanford integer benchmarks, not on matrix300). You can
find the results in

https://www.complang.tuwien.ac.at/anton/tmp/select-ipc-uarch.eps
https://www.complang.tuwien.ac.at/anton/tmp/opt-ipc-uarch.eps

The first graphic shows less hardware, but makes the benchmarks
identifiable; it also shows the results without (left of the two bars
for each benchmark) and with (right) the optimization the paper is
about; the optimization also reduces the instruction count, so the
results usually are still faster even in those cases where the
optimization reduces IPC.

The second graphic contains only the data with the optimization, it
shows more microarchitectures, and sorts the results for each
microarchitecture by IPC (making the benchmarks unidentifiable), but
it shows the differences between the microarchitectures nicely. It is
remarkable that Goldencove/Raptorcove not only have a large IPC, but
also very short cycles (clocks up to 6GHz in current products; for the
Core i3-1315U on which this was measured, Intel promises 4.5GHz, but
the NUC that I measured just delivers 3.8GHz; but that is still plenty
of speed for a middle-end low-power product).

One interesting case is the K8 (released 2003, although this one is
from the 2005 incarnation Athlon X2 4400+), in which Mitch Alsup also
had a hand. We can see how far we have come since those days.

I don't think that it's plausible that the 88120, which would have
appeared in the mid-1990s, would perform as well or better than
goldencove on this workload. My guess is that it would have had to
undergo a silicon diet like the PPC620, probably even more so, because
it was to appear earlier, which probably would have meant fewer
transistors, which would have reduced the matrix300 IPC, and probably
to a lesser extent, the Xlisp IPC.

Also, the question is how fast the result would clock. The 88110 was
available in 1992 at 50MHz, in the same year as the 200MHz 21064.
When would the 88120 have been available at what clock rate? The
PPC620 was available in 1997 at up to 150MHz, while the Pentium II
Klamath was available in 1997 at clock rates up to 300MHz, and the
(in-order) 21164a was available at 600MHz; my guess is that the 21164a
could also produce good matrix300 numbers.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to [email protected] on Sun Mar 3 16:55:33 2024

On Sat, 2 Mar 2024 23:51:20 +0000
[email protected] (MitchAlsup1) wrote:

Brett wrote:

MitchAlsup1 <[email protected]> wrote:

This sounds more like Mc 88110 rather than PPC 620.

PPC was shrunk from 6-wide to 4-wide in order to fit in the
acceptable die area. Other things may have been jettisoned at this
shrink of design point. The 620 was originally targeted to be
equal to Mc 88120 which was a 6-wide GBOoO machine full Tomasulo
with precise exceptions and 4 external busses named {Data Out,
Data In, Address Out, Address In}

Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops

Smart externals could use Data In to send the CPU data before it
knew it needed it. That data could be code or data.

So which was better, your baby the 88xxx or PPC?
Pick your metric: die size, performance, heat, upgrade path, other.

I am yet to see a CPU with the performance we got on the 88120
simulator. In 1992 we were getting 2.0 IPC on things like XLISP and
5.9 IPC on Matrix300.

I can't find information about Matrix300.
It seems to be part of SPEC89 FP suite, but spec.org does not provide
info about anything older than SPEC92.
Can you tell me what exactly it does?


From Anton Ertl@21:1/5 to Michael S on Sun Mar 3 16:33:45 2024

Michael S <[email protected]> writes:

I can't find information about Matrix300.
It seems to be part of SPEC89 FP suite, but spec.org does not provide
info about anything older than SPEC92.
Can you tell me what exactly does it do?

It's 300x300 FP matrix multiply (not sure if single or double). There
was a company that had a tool (famous at the time, but I don't
remember the name) that could transform the original source code into
a cache-blocked variant, which typically ran at the limits imposed by
the FUs. Eventually everyone used that tool in their compiler to get
good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
SPEC92.
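
For illustration only, a minimal sketch of what a cache-blocked variant
of a plain matrix multiply can look like. This is not the output of
that tool, which I cannot reproduce here; the name matmul_blocked and
the tile size TILE are assumptions, and a real tile size would have to
be tuned to the cache at hand. It assumes row-major storage with a
being n x m, b being m x p and c being n x p.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tile edge; a real value would be tuned to the cache size. */
#define TILE 32

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* c (n x p) = a (n x m) * b (m x p), row-major, computed tile by tile
   so that each tile of b and c is reused while it is still cached. */
void matmul_blocked(const double a[], const double b[], double c[],
                    size_t m, size_t n, size_t p)
{
    memset(c, 0, n * p * sizeof(double));
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_size(ii + TILE, n);
        for (size_t kk = 0; kk < m; kk += TILE) {
            size_t k_end = min_size(kk + TILE, m);
            for (size_t jj = 0; jj < p; jj += TILE) {
                size_t j_end = min_size(jj + TILE, p);
                /* Multiply one tile of a by one tile of b. */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double r = a[i * m + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * p + j] += r * b[k * p + j];
                    }
            }
        }
    }
}
```

The arithmetic is the same as in the unblocked loop nest; only the
order in which the elements are visited changes.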

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From MitchAlsup1@21:1/5 to Anton Ertl on Sun Mar 3 18:26:10 2024

Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

I don't think that it's plausible that the 88120, which would have
appeared in the mid-1990s would perform as well or better than
goldencove on this workload. My guess is that it would have had to
undergo a silicon diet like the PPC620, probably even more so, because
it was to appear earlier, which probably would have meant less
transistors, which would have reduced the matrix300 IPC, and probably
to a lesser amount, the Xlisp IPC.

Also, the question is how fast the result would clock. The 88110 was
available in 1992 at 50MHz, in the same year as the 200MHz 21064.
When would the 88120 have been available at what clock rate? The

We were on schedule when I left in 1992 to be in silicon by 1995.
We were targeting 100 MHz.
We did not get far enough to know the die-size*.

(*) Motorola was offering us 0.5µ BiCMOS. SPICE simulations indicated
that the wire attached to the emitters was going to be subject to
crystal migration due to current density. We (our project) were only
using the bipolars in sense amplifiers and not in normal gates.

PPC620 was available in 1997 at up to 150MHz, while the Pentium II
Klamath was available in 1997 at clock rates up to 300MHz, and the
(in-order) 21164a was available at 600MHz; my guess is that the 21164a
could also produce good matrix300 numbers.

- anton


From MitchAlsup1@21:1/5 to Michael S on Sun Mar 3 18:18:07 2024

Michael S wrote:

On Sat, 2 Mar 2024 23:51:20 +0000
[email protected] (MitchAlsup1) wrote:

Brett wrote:

MitchAlsup1 <[email protected]> wrote:

This sounds more like Mc 88110 rather than PPC 620.

PPC was shrunk from 6-wide to 4-wide in order to fit in the
acceptable die area. Other things may have been jettisoned at this
shrink of design point. The 620 was originally targeted to be
equal to Mc 88120 which was a 6-wide GBOoO machine full Tomasulo
with precise exceptions and 4 external busses named {Data Out,
Data In, Address Out, Address In}

Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops

Smart externals could use Data In to send the CPU data before it
knew it needed it. That data could be code or data.

So which was better, your baby the 88xxx or PPC?
Pick your metric: die size, performance, heat, upgrade path, other.

I am yet to see a CPU with the performance we got on the 88120
simulator. In 1992 we were getting 2.0 IPC on things like XLISP and
5.9 IPC on Matrix300.

I can't find information about Matrix300.
It seems to be part of SPEC89 FP suite, but spec.org does not provide
info about anything older than SPEC92.
Can you tell me what exactly does it do?

It runs DGEMM for 300×300 sized matrixes and was designed as a Great Big
number crunching thingamajig.

It fell out of favor because by 1995 it fit in L2 cache sizes.


From Michael S@21:1/5 to Anton Ertl on Sun Mar 3 20:30:52 2024

On Sun, 03 Mar 2024 16:33:45 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

I can't find information about Matrix300.
It seems to be part of SPEC89 FP suite, but spec.org does not provide
info about anything older than SPEC92.
Can you tell me what exactly does it do?

It's 300x300 FP matrix multiply (not sure if single or double). There
was a company that had a tool (famous at the time, but I don't
remember the name) that could transform the original source code into
a cache-blocked variant, which typically ran at the limits imposed by
the FUs. Eventually everyone used that tool in their compiler to get
good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
SPEC92.

- anton

So, in today's world it would be something like "How fast can you do
DGEMM with 7 out of your 8 [SIMD] hands tied behind your back?"
Or, maybe, more than 7 if your variant of AMX supports double
precision.
The challenge is funny, but the answer is not particularly useful.
But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.


From Anton Ertl@21:1/5 to Michael S on Sun Mar 3 22:22:37 2024

Michael S <[email protected]> writes:

But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.

Those were the days before SIMD, so IPC told you a little about
FLOPS/Hz. I think, though, that you look at it from the other end.
You are asking: Is that a number I want to know for evaluating DGEMM
performance on the 88120? But Mitch Alsup and the other people were
probably thinking: this uarchitecture can never exceed 6 IPC, so
getting 5.9 IPC on an actual SPEC CPU89 benchmark is pretty good. And
matrix300 with its mixture of FP adds, FP muls, loads, stores, address
arithmetic, and control, i.e., making relatively balanced use of many
FUs, is a pretty good benchmark for getting high numbers of this kind.

I guess that the 4-wide 21164 also showed close to 4 IPC on
matrix300, while its 2 integer units would limit it to much lower
performance on, say, intmm.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From MitchAlsup1@21:1/5 to Michael S on Sun Mar 3 22:41:03 2024

Michael S wrote:

On Sun, 03 Mar 2024 16:33:45 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

I can't find information about Matrix300.
It seems to be part of SPEC89 FP suite, but spec.org does not provide
info about anything older than SPEC92.
Can you tell me what exactly does it do?

It's 300x300 FP matrix multiply (not sure if single or double). There
was a company that had a tool (famous at the time, but I don't
remember the name) that could transform the original source code into
a cache-blocked variant, which typically ran at the limits imposed by
the FUs. Eventually everyone used that tool in their compiler to get
good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
SPEC92.

- anton

So, in today's world it would be something like "How fast can you do
DGEMM with 7 out of your 8 [SIMD] hands tied behind your back?"
Or, may be, more than 7 if your variant of AMX supports double
precision.

xGEMM supports transposes of the input matrixes; D stands for Double
precision, S stands for Single precision. Matrix300 used DGEMM.

SIMD would only support 2 of the 8 calls to DGEMM where the transposes
are not {'N', 'n', 'T', 't', 'C', or 'c'}. SIMD would do nothing for
the 6 transpose calls. It should also be noted that the transposed
matrixes have significantly worse cache behavior than the non-transposed
version as each access is to a different cache line.

The problem is that typical SIMD does not support the kinds of
transposes xGEMM performs. That is, the problem is not the transposes;
it is naïve SIMD.

So, postulate that one can SIMD the non-transposed loop and gain 4×.
The other 3 loops get 1× for an overall gain of <less than> 25%;
where the "less than" is due to the cache and TLB effects.
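
As a rough check of that figure, here is a minimal sketch of the
arithmetic. It assumes that the four loops each take an equal share of
the scalar run time (that split is an assumption made here, not a
measurement) and it ignores the cache and TLB effects; overall_speedup
is a hypothetical helper name.

```c
#include <stddef.h>

/* Overall speedup when part i takes fraction[i] of the original run
   time and is sped up by factor[i]. The fractions should sum to 1. */
double overall_speedup(const double fraction[], const double factor[],
                       size_t parts)
{
    double new_time = 0.0;
    for (size_t i = 0; i < parts; i++)
        new_time += fraction[i] / factor[i];
    return 1.0 / new_time;
}
```

With four equal fractions of 0.25 and factors {4, 1, 1, 1}, the new
run time is 0.0625 + 0.75 = 0.8125 of the old one, so the speedup is
about 1.23, which is a gain somewhat below 25%.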

A TLB with as few as 24 entries FA gets 100% hit rate in the
non-transpose case, and poor performance on any (all) of the transpose
cases; in the dual transpose case, the TLB takes a miss every other
cache access. Here a 256-entry DM TLB gets 100% hit rate where a
64-entry FA TLB is getting close to zero hit rate.

The challenge is, funny, but the answer not particularly useful.

xGEMM is likely the second most used GB-math number crunching
subroutine in use--FFT <flavors> being the most used.

But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.

Correct, and this illustrates how times have changed. In 1985 Matrix300
would use * and + as separate instructions. The major loop consists of
2 LDs, 1*, 1+, 1 ST, and an ADD-CMP-BC which could be distributed over
the 4-way unrolled loop (in source code). The Mc 88110 compiler would
produce a 24 instruction loop (non-transposed) and the Mc 88120
simulator would run this loop (including DRAM accesses (cache misses
and cache victims), and TLB table-walking) in 4 cycles. Today, FMAC is
<fused into> 1 instruction, saving instruction count and calculation
latency (8->5) allowing Matrix300 to fit in a 78-instruction execution
window instead of a 96-instruction EW. {{But you still have to count
FMAC as 2 FLOPs.}}

On My 66000, using VEC-LOOP, the instruction count goes down again to 5
(from 6) per loop since LOOP is performing ADD-CMP-BC in a single
instruction and in a single clock.


From Anton Ertl@21:1/5 to [email protected] on Mon Mar 4 08:52:09 2024

[email protected] (MitchAlsup1) writes:

SIMD would only support 2 of the 8 calls to DGEMM where the transposes
are not {'N', 'n', 'T', 't', 'C', or 'c'}. SIMD would do nothing for
the 6 transpose calls.

The following version is good for the non-transposed matrices (it's
not DGEMM, but the difference to DGEMM is left as exercise):

#include <stddef.h>
#include <string.h>

/* c (n x p) = a (n x m) * b (m x p), all row-major */
void matmul(double a[], double b[], double c[], size_t m, size_t n, size_t p)
{
  size_t i,j,k;
  memset(c,0,n*p*sizeof(double));
  for (i=0; i<n; i++)
    for (k=0; k<m; k++)
      for (j=0; j<p; j++)
        c[i*p+j] += a[i*m+k]*b[k*p+j];
}

But the loops are interchangeable, and the naive i,j,k order is bad
for SIMD as well as introducing a recurrence on c[i*p+j] in the inner
loop. Given the amount of parallelism inherent in matrix
multiplication, I would be surprised if transposing some or all of the
involved matrices ruled out every loop order or other transformation
that would enable SIMD. In the extreme case, you just transpose an
appropriate input matrix at the start, or the result at the end, at
O(n^2) effort (for n*n matrices), while the matrix multiply itself
takes O(n^3) effort.
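
A minimal sketch of that extreme case, for a product where the first
operand is supplied transposed. The names transpose and matmul_at_b and
the use of a malloc'ed scratch buffer are assumptions made for
illustration; this is not the DGEMM interface.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* dst (cols x rows) = transpose of src (rows x cols), row-major. */
static void transpose(const double src[], double dst[],
                      size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* c (n x p) = A * b, where A (n x m) is supplied transposed as
   at (m x n) and b is m x p. Returns 0 on success, -1 if the scratch
   buffer cannot be allocated. */
int matmul_at_b(const double at[], const double b[], double c[],
                size_t m, size_t n, size_t p)
{
    size_t count = n * m;
    double *a = malloc((count ? count : 1) * sizeof *a);
    if (a == NULL)
        return -1;
    transpose(at, a, m, n);              /* one-off copy: m*n steps */
    memset(c, 0, n * p * sizeof(double));
    for (size_t i = 0; i < n; i++)       /* multiply: n*m*p steps   */
        for (size_t k = 0; k < m; k++) {
            double r = a[i * m + k];
            for (size_t j = 0; j < p; j++)
                c[i * p + j] += r * b[k * p + j];
        }
    free(a);
    return 0;
}
```

The inner loop still walks b and c with unit stride, so it keeps the
access pattern of the non-transposed version.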

But of course, for the 88120 that was a non-issue, because it did not
have SIMD.

It should also be noted that the transposed
matrixes have significantly worse cache behavior than the non-transposed
version as each access is to a different cache line.

That was fixed by the cache-blocking transformation that everybody
used, and which resulted in the elimination of matrix300 from SPEC92.

It was not the cache sizes, which could easily have been addressed by
modifying matrix300 into, say, matrix2000.

My 66000, using VEC-LOOP the instruction count goes down again to 5 (from
6) per loop since LOOP is performing ADD-CMP-BC in a single instruction
and in a single clock.

If the programmer (or compiler) for My66000 does not process the
elements in the favoured order, it will not perform particularly well
for arbitrary transpositions, either. I don't think that you want to
perform these program transformations in hardware, do you?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to Anton Ertl on Mon Mar 4 17:14:57 2024

On Sun, 03 Mar 2024 22:22:37 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.

I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
on matrix300, while its 2 integer units would limit it to much lower
performance on, say, intmm.

- anton

I don't know about the specific case of matrix300 and what transformations
are allowed by SPEC rules and what not, but if I were tasked with
writing generic DGEMM for Alpha 21164 with maximal performance on
non-small relatively square matrices as a main goal, then I'd start
with something like this:

// innermost_loop_3x6 - multiply 3 rows of A[][]
// by 6 columns of B[][] assuming C-language order
void innermost_loop_3x6(
  const double* A, int lda,
  const double* B, int ldb,
  double* C, int ldc,
  int n)
{
  const double* A0 = A;
  const double* A1 = A0 + lda;
  const double* A2 = A1 + lda;
  // 3x6 block of accumulators, kept in registers
  double acc00 = 0, acc10 = 0, acc20 = 0;
  double acc01 = 0, acc11 = 0, acc21 = 0;
  double acc02 = 0, acc12 = 0, acc22 = 0;
  double acc03 = 0, acc13 = 0, acc23 = 0;
  double acc04 = 0, acc14 = 0, acc24 = 0;
  double acc05 = 0, acc15 = 0, acc25 = 0;

  for (int i = 0; i < n; ++i) {
    double a0 = A0[i];
    double a1 = A1[i];
    double a2 = A2[i];
    double b;
    b = B[0]; acc00 += a0 * b; acc10 += a1 * b; acc20 += a2 * b;
    b = B[1]; acc01 += a0 * b; acc11 += a1 * b; acc21 += a2 * b;
    b = B[2]; acc02 += a0 * b; acc12 += a1 * b; acc22 += a2 * b;
    b = B[3]; acc03 += a0 * b; acc13 += a1 * b; acc23 += a2 * b;
    b = B[4]; acc04 += a0 * b; acc14 += a1 * b; acc24 += a2 * b;
    b = B[5]; acc05 += a0 * b; acc15 += a1 * b; acc25 += a2 * b;
    B += ldb; // next row of B
  }

  // add the accumulated 3x6 block into C
  double* C0 = C;
  double* C1 = C0 + ldc;
  double* C2 = C1 + ldc;
  C0[0] += acc00; C1[0] += acc10; C2[0] += acc20;
  C0[1] += acc01; C1[1] += acc11; C2[1] += acc21;
  C0[2] += acc02; C1[2] += acc12; C2[2] += acc22;
  C0[3] += acc03; C1[3] += acc13; C2[3] += acc23;
  C0[4] += acc04; C1[4] += acc14; C2[4] += acc24;
  C0[5] += acc05; C1[5] += acc15; C2[5] += acc25;
}

The loop consists of 9 loads, 4 pointer updates,
1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
DP adds. 51 total instructions.
Ideally, it will run in 18 clocks, for IPC = 2.83. Realistically, on
real hardware with cache misses etc. it will take 20-23 clocks and IPC
would be proportionally lower.
What is my point? My point is that I expect a "medium-IPC" kernel like
the above to achieve higher FLOPS (== better performance) than a simpler,
smaller kernel with IPC in excess of 3.5.
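
For anyone who wants to redo the arithmetic, a minimal sketch using the
instruction counts listed above. The struct loop_mix and the helper
names are hypothetical and exist only for this illustration.

```c
/* Instruction mix of one loop iteration. */
struct loop_mix {
    int loads;
    int int_ops;   /* pointer updates plus the counter decrement */
    int branches;
    int fp_mul;
    int fp_add;
};

/* Total instructions per iteration. */
int total_instructions(const struct loop_mix *mix)
{
    return mix->loads + mix->int_ops + mix->branches
         + mix->fp_mul + mix->fp_add;
}

/* Instructions per cycle if one iteration takes `cycles` clocks. */
double ipc(const struct loop_mix *mix, int cycles)
{
    return (double)total_instructions(mix) / cycles;
}

/* Floating-point operations per cycle for the same iteration. */
double flops_per_cycle(const struct loop_mix *mix, int cycles)
{
    return (double)(mix->fp_mul + mix->fp_add) / cycles;
}
```

With the mix {9, 5, 1, 18, 18} this gives 51 instructions, 51/18 = 2.83
IPC and 36/18 = 2 FLOPs per clock in the ideal case, and 51/23 = 2.22
IPC at the pessimistic end of the 20-23 clock estimate.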


From Anton Ertl@21:1/5 to Michael S on Mon Mar 4 18:18:35 2024

Michael S <[email protected]> writes:

On Sun, 03 Mar 2024 22:22:37 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.

I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
on matrix300, while its 2 integer units would limit it to much lower
performance on, say, intmm.

- anton

I don't know about specific case of matrix300 and what transformations
are allowed by SPEC rules and what not, but if I were tasked with
writing generic DGEMM for Alpha 21164 with maximal performance on
non-small relatively square matrices as a main goal, then I'd start
with something like this:

// innermost_loop_3x6 - multiply 3 rows of A[][]
// by 6 columns of B[][] assuming C-language order

...

The loop consists of 9 loads, 4 pointer updates,
1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
DP adds. 51 total instructions.
Ideally, it will run in 18 clocks, for IPC = 2.83.

Given that starting 18 FP multiplies and 18 FP additions takes 18
cycles, that is optimal. But you unrolled more than is necessary to
achieve 2 FLOPC (FP operations per cycle). With less unrolling, you
could have achieved the same 2 FLOPC and of course you would see higher
IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
loop.

What is my point? My point is that I expect "medium-IPC" kernel like
above to achieve higher FLOPS (== better performance) then simpler,
smaller kernel with IPC in excess of 3.5.

These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPC at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.

But yes, 5.9 IPC on matrix300 shows little about the matrix multiply
performance. Still, I think that the point is that there are many
hurdles that might result in a lower IPC (for code where only 6 IPC
means 2 FLOPC); the fact that they achieved 5.9 indicates that they
managed to lower the hurdles a lot. True, it would be better if they
could have shown it with code where 6 IPC is more meaningful.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From MitchAlsup1@21:1/5 to Anton Ertl on Mon Mar 4 19:43:11 2024

Anton Ertl wrote:

Michael S <[email protected]> writes:

On Sun, 03 Mar 2024 22:22:37 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.

I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
on matrix300, while its 2 integer units would limit it to much lower
performance on, say, intmm.

- anton

I don't know about specific case of matrix300 and what transformations
are allowed by SPEC rules and what not, but if I were tasked with
writing generic DGEMM for Alpha 21164 with maximal performance on
non-small relatively square matrices as a main goal, then I'd start
with something like this:

// innermost_loop_3x6 - multiply 3 rows of A[][]
// by 6 columns of B[][] assuming C-language order

....

The loop consists of 9 loads, 4 pointer updates,
1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
DP adds. 51 total instructions.
Ideally, it will run in 18 clocks, for IPC = 2.83.

Given that starting 18 FP multiplies and 18 FP additions takes 18
cycles, that is optimal. But you unrolled more than is necessary to
achieve 2FlOPC (FP operations per cycle). With less unrolling, you
could have achieved the same 2FLOPC and of course you would see higher
IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
loop.

What is my point? My point is that I expect "medium-IPC" kernel like
above to achieve higher FLOPS (== better performance) then simpler,
smaller kernel with IPC in excess of 3.5.

These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.

But yes, 5.9 IPC on matrix300 shows little about the matrix multiply
performance. Still, I think that the point is that there are many
hurdles that might result in a lower IPC (for code where only 6IPC
means 2FLOPC), the fact that they achieved 5.9 indicates that they
managed to lower the hurdles a lot; true, it would be better if they
could have shown it with code where 6IPC is more meaningful.

The processor for which that IPC was stated had a 16KB L1 DM Cache
with 4 banks used twice per cycle and a 16 byte line, so DGEMM was
<essentially> always taking cache misses (every other cycle). Most
of the performance of the overall design came down to 3 things:
  The DRAM memory system, which could start 2 new accesses every cycle
  (1 RD, 1 WT);
  The zero cycle branch mispredict repair; and
  The short 6-cycle pipeline from Fetch to Retire.

And most of this centered around what we called the conditional
cache--a RoB for memory <if you will>--a place where ST could be
placed and LDs could access but would not get written to L1 if
the instruction packet could not retire (mispredict or exception).

No processor today is doing any of these (well maybe My 66600...)
The BOOM RISC-V processor has a 7 stage front-end and 3 cycle branch
mispredict repair...

- anton


From Michael S@21:1/5 to Anton Ertl on Tue Mar 5 00:06:21 2024

On Mon, 04 Mar 2024 18:18:35 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Sun, 03 Mar 2024 22:22:37 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

But even in that not particularly useful answer IPC appears to be
the least useful part. Far worse than FLOPS/Hz.

I guess that the 21164 also showed close to 4 IPC on the 4-wide
21164 on matrix300, while its 2 integer units would limit it to
much lower performance on, say, intmm.

- anton

I don't know about specific case of matrix300 and what
transformations are allowed by SPEC rules and what not, but if I
were tasked with writing generic DGEMM for Alpha 21164 with maximal
performance on non-small relatively square matrices as a main goal,
then I'd start with something like this:

// innermost_loop_3x6 - multiply 3 rows of A[][]
// by 6 columns of B[][] assuming C-language order

...

The loop consists of 9 loads, 4 pointer updates,
1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
DP adds. 51 total instructions.
Ideally, it will run in 18 clocks, for IPC = 2.83.

Given that starting 18 FP multiplies and 18 FP additions takes 18
cycles, that is optimal. But you unrolled more than is necessary to
achieve 2FlOPC (FP operations per cycle). With less unrolling, you
could have achieved the same 2FLOPC and of course you would see higher
IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
loop.

What is my point? My point is that I expect "medium-IPC" kernel like
above to achieve higher FLOPS (== better performance) then simpler,
smaller kernel with IPC in excess of 3.5.

These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.

But yes, 5.9 IPC on matrix300 shows little about the matrix multiply
performance. Still, I think that the point is that there are many
hurdles that might result in a lower IPC (for code where only 6IPC
means 2FLOPC), the fact that they achieved 5.9 indicates that they
managed to lower the hurdles a lot; true, it would be better if they
could have shown it with code where 6IPC is more meaningful.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Tue Mar 5 00:18:33 2024

On Mon, 04 Mar 2024 18:18:35 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Sun, 03 Mar 2024 22:22:37 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

But even in that not particularly useful answer IPC appears to be
the least useful part. Far worse than FLOPS/Hz.

I guess that the 21164 also showed close to 4 IPC on the 4-wide
21164 on matrix300, while its 2 integer units would limit it to
much lower performance on, say, intmm.

- anton

I don't know about the specific case of matrix300 and which
transformations are allowed by SPEC rules and which are not, but if I
were tasked with writing a generic DGEMM for the Alpha 21164 with maximal
performance on non-small, relatively square matrices as the main goal,
then I'd start with something like this:

// main_loop_3x6 - multiply 3 rows of A[][]
// by 6 columns of B[][] assuming C-language order

...

The loop consists of 9 loads, 4 pointer updates,
1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
DP adds. 51 total instructions.
Ideally, it will run in 18 clocks, for IPC = 2.83.

Given that starting 18 FP multiplies and 18 FP additions takes 18
cycles, that is optimal. But you unrolled more than is necessary to
achieve 2 FLOPC (FP operations per cycle). With less unrolling, you
could have achieved the same 2 FLOPC and of course you would see higher
IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
loop.

What is my point? My point is that I expect a "medium-IPC" kernel like
the one above to achieve higher FLOPS (== better performance) than a simpler,
smaller kernel with IPC in excess of 3.5.

These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPC at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.

In the '90s, CPUs had other reasons to minimize the # of instructions and
esp. the # of load instructions per task. E.g. too few banks in the L1D
cache, so a cache that in theory supports two accesses per clock is in
practice closer to 1. E.g. very few hits served under miss. E.g. low
associativity. E.g. theoretically 4-wide instruction Fetch/Decode that
in practice delivers 4 decoded instructions only when all the inner planets
of the solar system are aligned.
According to my understanding, the 21164, being a speed racer, suffered from
those sorts of problems more than most competitors.

But yes, 5.9 IPC on matrix300 shows little about the matrix multiply
performance. Still, I think that the point is that there are many
hurdles that might result in a lower IPC (for code where only 6 IPC
means 2 FLOPC); the fact that they achieved 5.9 indicates that they
managed to lower the hurdles a lot. True, it would be better if they
could have shown it with code where 6 IPC is more meaningful.

- anton

From MitchAlsup1@21:1/5 to Michael S on Tue Mar 5 00:05:45 2024

Michael S wrote:

On Mon, 04 Mar 2024 18:18:35 GMT
[email protected] (Anton Ertl) wrote:

These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.

In the '90s, CPUs had other reasons to minimize the # of instructions and

Everyone always has excellent reasons to minimize the number of instructions.

Over in CISC-land, it takes fewer instructions to get the job done.
Over in RISC-land, the instructions run in fewer nanoseconds.

The critical term in performance is::

seconds/program = instructions/program × cycles/instruction × seconds/cycle

RISCs tend to get instructions/program wrong and cycles/instruction right, while
CISCs tend to get instructions/program right and cycles/instruction wrong.

I happen to believe that between RISC and CISC is a realm where one needs
fewer instructions but sacrifices essentially nothing in the frequency department.

My 66000 tends to have only 10% more instructions than VAX, while RISC-V
tends to have 50% more instructions than VAX--My 66000 needs only 71% of
the instruction count of RISC-V.
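
As a rough illustration of the performance equation and the instruction-count comparison above, here is a minimal Python sketch. The function name, the 1e9 baseline instruction count, and the equal CPI and clock period are hypothetical values chosen only to make the arithmetic visible. Note that the rounded 10% and 50% figures give a ratio of about 73%, close to the quoted 71%.

```python
def seconds_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """seconds/program = instructions/program x cycles/instruction x seconds/cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


# Instruction counts relative to VAX, using the rounded figures from the post.
vax = 1.00
my66000 = 1.10 * vax
riscv = 1.50 * vax
ratio = my66000 / riscv
print(f"My 66000 / RISC-V instruction count: {ratio:.0%}")

# Hypothetical: identical CPI and clock period for both machines, so the
# run-time ratio collapses to the instruction-count ratio.
baseline = 1e9  # made-up VAX instruction count
t_riscv = seconds_per_program(riscv * baseline, 1.0, 1e-9)
t_my66000 = seconds_per_program(my66000 * baseline, 1.0, 1e-9)
print(f"run-time ratio at equal CPI and clock: {t_my66000 / t_riscv:.2f}")
```

In practice the three factors are not independent, which is the point of the RISC versus CISC remark: improving one term often costs something in another.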

esp. the # of load instructions per task. E.g. too few banks in L1D
cache, so the cache that in theory supports two accesses per clock in practice is closer to 1.

CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density. So, CISCs tend
to run into the cache banking wall at 2 IPC, while RISCs delay that wall
until 3 IPC.
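
One way to read the "banking wall" numbers, sketched in Python: if the data cache effectively sustains about one access per cycle (an assumption, borrowed from the "in practice is closer to 1" remark), the IPC ceiling is that rate divided by the memory reference density. The function name and the midpoint densities are mine, not from the post.

```python
def ipc_wall(memory_reference_density, cache_accesses_per_cycle=1.0):
    """IPC at which memory references alone saturate the data cache.

    Assumes every memory-referencing instruction makes one cache access and
    that nothing else limits throughput.
    """
    return cache_accesses_per_cycle / memory_reference_density


# Midpoints of the quoted ranges: 45%-50% for CISC, 30%-35% for RISC.
for label, density in [("CISC", 0.475), ("RISC", 0.325)]:
    print(label, round(ipc_wall(density), 2))
```

Under that assumption the ceilings come out near 2 IPC and 3 IPC, which is consistent with the figures quoted above.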

E.g. very few hits served under miss.

Accesses are correlated, so this is to be expected. The real question is whether you can still perform with miss under miss !! Even if you don't
take hits, you can still get the next request out "there" sooner. Sooner
saves latency.

E.g. low associativity.

Associativity costs power and area.

E.g. theoretically 4-wide instruction Fetch/Decode that
in practice delivers 4 decoded instructions only when all inner planets
in solar system are aligned.

A 4-wide instruction fetch yields only 2.5 instructions per random access.
This is just std math:: (1+2+3+4)/4 = 2.5
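
The same average can be written out for any fetch width. A small sketch, assuming the entry point (for example a branch target) is uniformly distributed over the slots of an aligned fetch block; `expected_fetch` is a name made up for this example.

```python
from fractions import Fraction


def expected_fetch(width=4):
    """Average useful instructions per fetch for a uniformly random entry slot.

    Entering an aligned block at slot s (0-based) leaves width - s
    instructions before the end of the block.
    """
    return Fraction(sum(width - s for s in range(width)), width)


print(float(expected_fetch(4)))
```

For a width of 4 this is (4+3+2+1)/4, the same sum as in the post, and in general it is (width + 1) / 2.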

But access to good predication means up to 1/3rd of all short branches can
be avoided. Few ISAs have access to "good" predication. Here, a good solution
for predication drives the random 4-wide fetch access to deliver 3.25
instructions per fetch; a 50% increase.

According to my understanding, 21164 being speed racer suffered from
that sort of problems more than most competitors.

Because it was wider and faster, it was more dependent on "everything
working well all the time", and the fact that it ran at a higher frequency
than others means all its bad cache behavior got multiplied by the
latency to deeper levels of the memory hierarchy !

From Stephen Fuld@21:1/5 to All on Mon Mar 4 17:37:48 2024

On 3/4/2024 4:05 PM, MitchAlsup1 wrote:

Michael S wrote:

On Mon, 04 Mar 2024 18:18:35 GMT
[email protected] (Anton Ertl) wrote:

These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.

In the '90s, CPUs had other reasons to minimize the # of instructions and

Everyone always has excellent reasons to minimize the number of instructions.

Over in CISC-land, it takes fewer instructions to get the job done.
Over in RISC-land, the instructions run in fewer nanoseconds.

The critical term in performance is::

seconds/program = instructions/program × cycles/instruction × seconds/cycle

RISCs tend to get instructions/program wrong and cycles/instruction
right, while
CISCs tend to get instructions/program right and cycles/instruction wrong.

I happen to believe that between RISC and CISC is a realm where one needs fewer instructions but sacrifices essentially nothing in the frequency department.

My 66000 tends to have only 10% more instructions than VAX while RISC-V
tends to have 50% more instructions than VAX--My 66000 needs only 71% of
the instruction count of RISC-V.

esp. the # of load instructions per task. E.g. too few banks in L1D
cache, so the cache that in theory supports two accesses per clock in
practice is closer to 1.

CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.

If those percentages are number of loads & stores divided by total
instruction count, isn't this just a restatement of your previous point
that CISCs need fewer instructions to do the job? i.e. the *time*
between loads or stores is the same for RISCs and CISCs?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 02:33:21 2024

Stephen Fuld wrote:

On 3/4/2024 4:05 PM, MitchAlsup1 wrote:

Michael S wrote:

On Mon, 04 Mar 2024 18:18:35 GMT
[email protected] (Anton Ertl) wrote:

These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPC at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.

In the '90s, CPUs had other reasons to minimize the # of instructions and

Everyone always has excellent reasons to minimize the number of
instructions.

Over in CISC-land, it takes fewer instructions to get the job done.
Over in RISC-land, the instructions run in fewer nanoseconds.

The critical term in performance is::

seconds/program = instructions/program × cycles/instruction × seconds/cycle

RISCs tend to get instructions/program wrong and cycles/instruction
right, while
CISCs tend to get instructions/program right and cycles/instruction wrong.

I happen to believe that between RISC and CISC is a realm where one needs
fewer instructions but sacrifices essentially nothing in the frequency
department.

My 66000 tends to have only 10% more instructions than VAX while RISC-V
tends to have 50% more instructions than VAX--My 66000 needs only 71% of
the instruction count of RISC-V.

esp. the # of load instructions per task. E.g. too few banks in L1D
cache, so the cache that in theory supports two accesses per clock in
practice is closer to 1.

CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.

If those percentages are number of loads & stores divided by total instruction count, isn't this just a restatement of your previous point
that CISCs need fewer instructions to do the job? i.e. the *time*
between loads or stores is the same for RISCs and CISCs?

Not when you include various other facts::
CISCs tend to have fewer registers
CISCs tend to have LD-OPs and some have LD-OP-STs
Both of the above give the compiler the illusion that inbound memory
references are less expensive than a typical LD because you get the
LD, and you don't have to waste a precious register. Thus there are
more memory references--but RISC compilers have taught us that more
registers beats LD-OPs--pipeline designers have taught us that thinner
pipelines perform better--both stand against LD-OPs and LD-OP-STs.

VAX went so far as to allow any operand and any result to be memory
{Most of us now believe this was a massive overstep.}

CISCs really do perform more memory references--not by as much as the
above statistics imply, but significantly more memory references.

From Stephen Fuld@21:1/5 to All on Mon Mar 4 20:50:18 2024

On 3/4/2024 6:33 PM, MitchAlsup1 wrote:

Stephen Fuld wrote:

On 3/4/2024 4:05 PM, MitchAlsup1 wrote:

snip

CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.

If those percentages are number of loads & stores divided by total
instruction count, isn't this just a restatement of your previous
point that CISCs need fewer instructions to do the job? i.e. the
*time* between loads or stores is the same for RISCs and CISCs?

Not when you include various other facts::
CISCs tend to have fewer registers
CISCs tend to have LD-OPs and some have LD-OP-STs
Both of the above give the compiler the illusion that inbound memory
references are less expensive than a typical LD because you get the
LD, and you don't have to waste a precious register. Thus there are
more memory references--but RISC compilers have taught us that more
registers beats LD-OPs--pipeline designers have taught us that thinner
pipelines perform better--both stand against LD-OPs and LD-OP-STs.

VAX went so far as to allow any operand and any result to be memory
{Most of us now believe this was a massive overstep.}

CISCs really do perform more memory references--not by as much as the
above statistics imply, but significantly more memory references.

Interesting. Thanks.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Scott Lurndal@21:1/5 to [email protected] on Tue Mar 5 14:48:31 2024

[email protected] (MitchAlsup1) writes:

Stephen Fuld wrote:

On 3/4/2024 4:05 PM, MitchAlsup1 wrote:

CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.

If those percentages are number of loads & stores divided by total
instruction count, isn't this just a restatement of your previous point
that CISCs need fewer instructions to do the job? i.e. the *time*
between loads or stores is the same for RISCs and CISCs?

Not when you include various other facts::
CISCs tend to have fewer registers

How much of that is because active CISC architectures
are forty or fifty years old?

Would a modern, designed from scratch, CISC architecture
still restrict the number of registers?

If memory access ever becomes as fast as register access,
all bets will be off...

From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 5 15:44:29 2024

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Stephen Fuld wrote:

On 3/4/2024 4:05 PM, MitchAlsup1 wrote:

CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.

If those percentages are number of loads & stores divided by total
instruction count, isn't this just a restatement of your previous point
that CISCs need fewer instructions to do the job? i.e. the *time*
between loads or stores is the same for RISCs and CISCs?

Not when you include various other facts::
CISCs tend to have fewer registers

How much of that is because active CISC architectures
are forty or fifty years old?

The problems of encoding remain as relevant today as 50 years ago.
But the set of things one wants the ISA to do is larger today than 50 years ago.
Those encodings with LD-OPs are pretty much restricted to 16 registers
(16 base registers) and here you still have OpCode mapping difficulties.

If you give up on LD-OPs to gain register count, you are already in the RISC-camp.

Would a modern, designed from scratch, CISC architecture
still restrict the number of registers?

I wanted to do a 64-bit VAX minus the indirect address modes and
give it 32 registers. Never got around to it though.

My experience with My 66000 indicates one can get vanishingly close
to VAX instruction density (and count) with a RISC ISA done right.

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

From Scott Lurndal@21:1/5 to [email protected] on Tue Mar 5 16:22:24 2024

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

There were a number of historic implementations where the
registers were actually stored in low memory.

From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 5 17:33:23 2024

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

There were a number of historic implementations where the
registers were actually stored in low memory.

Yes, but that is making registers as slow as memory,
not making memory as fast as registers.

From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 5 17:57:14 2024

[email protected] (Scott Lurndal) writes:

Would a modern, designed from scratch, CISC architecture
still restrict the number of registers?

Designed from scratch? AVX-512 supports 32 xmm/ymm/zmm registers.
Intel APX will support 32 GPRs.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to [email protected] on Tue Mar 5 18:10:06 2024

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

There were a number of historic implementations where the
registers were actually stored in low memory.

Yes, but that is making registers as slow as memory,
not making memory as fast as registers.

With something like SRAM, with less than 1ns latency, then. With that
you might completely eliminate registers and do everything
memory-to-memory. The small gain lost by eliminating
registers will likely be offset by fewer instructions
to execute. Sufficient for most desktop users, surely.

That's assuming SRAM can scale at reasonable cost due
to some future technology breakthrough (or some other
future non-volatile memory technology with favorable
access timing).

From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 5 19:28:01 2024

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

There were a number of historic implementations where the
registers were actually stored in low memory.

Yes, but that is making registers as slow as memory,
not making memory as fast as registers.

With something like SRAM, with less than 1ns latency, then. With that
you might completely eliminate registers and do everything
memory-to-memory. The small gain lost by eliminating
registers will likely be offset by fewer instructions
to execute. Sufficient for most desktop users, surely.

Register access takes about 200ps. While on-die SRAM can also cycle at
200ps, you still have to pay the latency to route address bits
from AGEN to the SRAM arrays, and route the data back,
while one can access a register and run through the
forwarding logic in the same cycle. So, you are essentially
comparing something that costs ½* cycle to one that costs
1¼ cycles.

(*) maybe ¾-cycle on a GB physical RF.

That's assuming SRAM can scale at reasonable cost due
to some future technology breakthrough (or some other
future non-volatile memory technology with favorable
access timing).

I have lived in FAB environments where I was told "SRAM
will approach DRAM densities* in a generation or two", too.
Still yet to happen.

(*) They thought they had to keep the DRAM capacitor above
a certain amount of fF to retain the 8ms (-16ms) refresh
rates. What they really had to do was control leakage !

From Michael S@21:1/5 to [email protected] on Tue Mar 5 22:40:37 2024

On Tue, 5 Mar 2024 19:28:01 +0000
[email protected] (MitchAlsup1) wrote:

I have lived in FAB environments where I was told "SRAM
will approach DRAM densities* in a generation or two", too.
Still yet to happen.

(*) They thought they had to keep the DRAM capacitor above
a certain amount of fF to retain the 8ms (-16ms) refresh
rates. What they really had to do was control leakage !

According to what I read on the RWT forum, in the last 3-4 years the tables
have turned completely - the SRAM-to-DRAM area ratio is growing rather than
shrinking. Slowly, of course; nowadays everything is slow.

From MitchAlsup1@21:1/5 to Michael S on Tue Mar 5 21:06:40 2024

Michael S wrote:

On Tue, 5 Mar 2024 19:28:01 +0000
[email protected] (MitchAlsup1) wrote:

I have lived in FAB environments where I was told "SRAM
will approach DRAM densities* in a generation or two", too.
Still yet to happen.

(*) They thought they had to keep the DRAM capacitor above
a certain amount of fF to retain the 8ms (-16ms) refresh
rates. What they really had to do was control leakage !

According to what I read on the RWT forum, in the last 3-4 years the tables
have turned completely - the SRAM-to-DRAM area ratio is growing rather than
shrinking. Slowly, of course; nowadays everything is slow.

Yes, partially this is the FinFET effect, where it is more difficult,
area-wise, to build the 6T SRAM cell than in planar,
whereas with more metal layers, it is not so difficult to build
above-transistor capacitors {stacks of vias and minimum metal
sq lambda pads}.

From Michael S@21:1/5 to Scott Lurndal on Tue Mar 5 22:34:39 2024

On Tue, 05 Mar 2024 18:10:06 GMT
[email protected] (Scott Lurndal) wrote:

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

There were a number of historic implementations where the
registers were actually stored in low memory.

Yes, but that is making registers as slow as memory,
not making memory as fast as registers.

With something like SRAM, with less than 1ns latency, then. With that
you might completely eliminate registers and do everything
memory-to-memory. The small gain lost by eliminating
registers will likely be offset by fewer instructions
to execute. Sufficient for most desktop users, surely.

That's assuming SRAM can scale at reasonable cost due
to some future technology breakthrough

1ns latency means the size of your SRAM array is 256 KB at best.
More realistically 128 KB.
The only possible breakthrough I can see on this front is 3D
working much better than anticipated (and better than it works
for 3D NAND flash). But even that can improve capacity by a factor of 200
at best, more realistically by a factor of 100.
So, in the best possible future scenario, given every benefit of the doubt,
a 1ns latency requirement limits the size of your SRAM to ~50 MB.
Make it at least 5 ns instead of 1 and then you can start dreaming.
Still more for the benefit of your grandchildren than for yourself.
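
The capacity estimate above is plain multiplication. A quick Python check, where the helper name is hypothetical and the units are assumed to be binary (1 KB = 1024 bytes):

```python
KB = 1024
MB = 1024 * KB


def scaled_capacity(base_bytes, stacking_factor):
    """Capacity of a latency-limited array after an assumed 3D scaling factor."""
    return base_bytes * stacking_factor


best = scaled_capacity(256 * KB, 200)        # best-case array, best-case 3D gain
realistic = scaled_capacity(128 * KB, 100)   # the "more realistic" pair of figures
print(best / MB, realistic / MB)
```

The best case works out to 50 MB, as stated. Combining the two "more realistic" figures gives 12.5 MB.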

(or some other
future non-volatile memory technology with favorable
access timing).

From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Mar 6 20:00:26 2024

Paul A. Clayton wrote:

On 3/5/24 10:44 AM, MitchAlsup1 wrote:

Scott Lurndal wrote:

[snip]

My 66000, if I understand correctly, has registers in memory; it
"merely" caches them in a faster storage.

That is sort of correct:: there is a single flip-flop that points at
all thread-state, and the RF is 4 cache lines of that state. It is
also true that the HW reads in new RFs and writes out old RFs as if
the RF were a cache, but this is also true of non-RF thread-state--
it is read in and written out too.

It seems that 64-bit stack-pointer-relative accesses could be roughly as fast by using
the offset as the index (each stack frame would be comparable to a
different thread register context; the tradeoffs of extra storage
for multiple stack frames ("multithreading" — alternating between
indexing up and indexing down would provide some utilization
flexibility with low indexing overhead) relative to pushing out
early frames (normal "context switch"); such a cache would
probably be limited in frame size cached.

Smells too much like register windows which never outperformed
the flat RF from MIPS. In any event, 50% of subroutines need no
stack <accesses> and those that do typically only store 3 registers
(for restore later).

An L2 register set that can only be accessed for one operand might
be somewhat similar to LD-OP.

In high speed designs, there are at least 2 cycles of delay from AGEN
to the L2 and 2 cycles of delay back. Even zero cycle access sees at
least 4 cycles of latency, 5 if you count AGEN.
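
The latency count above is just a sum of stages. A tiny sketch, with a made-up function name and the array access time left as a parameter so the "zero cycle access" case is explicit:

```python
def l2_round_trip(cycles_to_l2=2, cycles_back=2, array_access=0, agen=0):
    """Total cycles for an L2 access: optional AGEN, wire delay out, array, wire delay back."""
    return agen + cycles_to_l2 + array_access + cycles_back


print(l2_round_trip())        # wire delay only, zero-cycle array access
print(l2_round_trip(agen=1))  # same, counting the AGEN cycle
```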

From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Mar 6 19:53:40 2024

Paul A. Clayton wrote:

On 3/5/24 10:44 AM, MitchAlsup1 wrote:

Scott Lurndal wrote:

[snip]

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

There seem to be three aspects that lead to this conclusion: the
storage technology, the indexing method (including alignment and
extension), and the method of determining presence ("tagging").

Porting. SRAMs are single ported, Register files are multiported.

Any storage technology used for registers could be used for
memory. However, area and power costs must be justified by
benefit.

A theoretical cheap read/expensive write storage technology might
be more appropriate for storage that is not conventional
registers, potentially providing faster non-register storage (for
reads). (Since register writes can be buffered and elided, if the
buffering overhead in a different storage technology was low
enough expensive writes might not prevent use for registers. Yet
one might then view the buffers as "registers" and the 'backing
storage' as more memory-like.)

With no register renaming or sub-registers, the indexing method
for registers is trivial and only a 'ready' bit is needed. (For
old-style VLIW, even the ready bit is not needed.)

For general memory addressing, there is a more complex address
generation, the size of the operand will be variable (alignment
and extension — word-addressed memory would avoid this overhead☺),

Register access is by fixed bit pattern in the instruction,
Memory access is by performing arithmetic on operands to get the
address.

and tag comparison would be required. General memory addressing
also involves indirection through a register. (The instruction
pointer is available early in the pipeline, so IP-relative
accesses would not have the delay of reading an arbitrary
register. Register address values that are rarely updated or
usually updated by adding a constant [or replacing with a
constant] could be hoisted earlier in the pipeline.)

Register read, address generation, and tag comparison overheads
can be removed for offset addressing by using the base pointer as
the "tag" and the offset as the index. (e.g., "Knapsack: A Zero-
Cycle Memory Hierarchy Component", Todd M. Austin et al., 1993;
"Signature Buffer: Bridging Performance Gap between Registers and
Caches", Lu Peng et al., 2004) "Internal fragmentation" of
utilization increases the cost of such storage relative to the
benefit and offset addressing constrains generality.
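
To make the "base pointer as tag, offset as index" idea concrete, here is a toy Python model. It is not the design from either cited paper; the class name, the slot count, and the flush-on-new-base policy are all assumptions made for illustration.

```python
class OffsetIndexedBuffer:
    """Toy buffer: one base-pointer value is the tag, the offset picks the slot."""

    def __init__(self, slots=8, word_bytes=8):
        self.slots = slots
        self.word_bytes = word_bytes
        self.base_tag = None           # base pointer value currently covered
        self.data = [None] * slots     # None marks an empty slot

    def _slot(self, offset):
        """Return the slot for an aligned, in-range offset, else None."""
        if offset < 0 or offset % self.word_bytes:
            return None
        slot = offset // self.word_bytes
        return slot if slot < self.slots else None

    def load(self, base, offset):
        """Return (hit, value); no base + offset addition is needed to decide a hit."""
        slot = self._slot(offset)
        if slot is None or base != self.base_tag or self.data[slot] is None:
            return False, None
        return True, self.data[slot]

    def fill(self, base, offset, value):
        """Install a value; a different base pointer invalidates everything cached."""
        slot = self._slot(offset)
        if slot is None:
            return False
        if base != self.base_tag:
            self.base_tag = base
            self.data = [None] * self.slots
        self.data[slot] = value
        return True


buf = OffsetIndexedBuffer(slots=4)
buf.fill(0x7000, 8, 11)
print(buf.load(0x7000, 8), buf.load(0x7000, 16), buf.load(0x8000, 8))
```

Slots that are never touched by the current frame stay empty, which is the "internal fragmentation" cost mentioned above, and offsets beyond the covered range simply cannot be cached.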

Register renaming introduces some complexity for addressing
registers. A Register Alias Table lookup is a kind of "address
generation".

One would also desire a "memory" storage component to have larger
capacity. A larger capacity storage will be more expensive to
access and will favor denser (typically slower) storage.

My 66000, if I understand correctly, has registers in memory; it
"merely" caches them in a faster storage. It seems that 64-bit stack-pointer-relative accesses could be roughly as fast by using
the offset as the index (each stack frame would be comparable to a
different thread register context; the tradeoffs of extra storage
for multiple stack frames ("multithreading" — alternating between
indexing up and indexing down would provide some utilization
flexibility with low indexing overhead) relative to pushing out
early frames (normal "context switch"); such a cache would
probably be limited in frame size cached.

An L2 register set that can only be accessed for one operand might
be somewhat similar to LD-OP.

From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 9 04:17:04 2024

Paul A. Clayton wrote:

On 3/6/24 3:00 PM, MitchAlsup1 wrote:

Paul A. Clayton wrote:

[snip]

It seems that 64-bit
stack-pointer-relative accesses could be roughly as fast by using
the offset as the index (each stack frame would be comparable to a
different thread register context; the tradeoffs of extra storage
for multiple stack frames ("multithreading" — alternating between
indexing up and indexing down would provide some utilization
flexibility with low indexing overhead) relative to pushing out
early frames (normal "context switch"); such a cache would
probably be limited in frame size cached.

Smells too much like register windows which never outperformed
the flat RF from MIPS. In any event, 50% of subroutines need no
stack <accesses> and those that do typically only store 3 registers
(for restore later).

Register windows were intended to avoid save/restore overhead by
retaining values in registers with renaming. A stack cache is
meant to reduce the overhead of loads and stores to the stack —
not just preserving and restoring registers. A direct-mapped stack
cache is not entirely insane. A partial stack frame cache might
cache up to 256 bytes (e.g.) with alternating frames indexing with
inverted bits (to reduce interference) — one could even reserve a
chunk (e.g., 64 bytes) of a frame and not overlapped by limiting
offset cached to be smaller than the cache.
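
One possible reading of the "alternating frames indexing with inverted bits" scheme, sketched in Python. The 256-byte cache and 64-byte reserved chunk are the examples from the text; the function name, the use of frame depth parity, and the byte-granular index are my assumptions.

```python
def stack_cache_index(offset, frame_depth, cache_bytes=256, cached_limit=192):
    """Map a frame-relative offset to an index in a direct-mapped stack cache.

    Even-depth frames index upward from 0, odd-depth frames use the
    bit-inverted offset and so index downward from the top. Offsets at or
    beyond cached_limit are not cached. cache_bytes must be a power of two.
    """
    if not 0 <= offset < cached_limit:
        return None
    mask = cache_bytes - 1
    return offset ^ mask if frame_depth % 2 else offset


print(stack_cache_index(0, 0), stack_cache_index(0, 1), stack_cache_index(191, 1))
```

With these numbers, the first 64 bytes of an even-depth frame occupy indices 0 to 63, which an odd-depth frame can never touch because its indices run from 255 down to 64. That matches the idea of reserving a non-overlapped chunk by limiting the cached offset to less than the cache size.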

Such might be more useful than register windows, but that does
not mean that it is actually a good option.

If it is such a good option why has it not reached production ??

An L2 register set that can only be accessed for one operand
might be somewhat similar to LD-OP.

In high speed designs, there are at least 2 cycles of delay from AGEN
to the L2 and 2 cycles of delay back. Even zero cycle access sees at
least 4 cycles of latency, 5 if you count AGEN.

Presumably this is related to the storage technology used as well
as the capacity.

Purely wire delay due to the size of the L2 cache.

From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 9 04:14:35 2024

Paul A. Clayton wrote:

On 3/6/24 2:53 PM, MitchAlsup1 wrote:

Paul A. Clayton wrote:

On 3/5/24 10:44 AM, MitchAlsup1 wrote:

Scott Lurndal wrote:

[snip]

If memory access ever becomes as fast as register access,
all bets will be off...

It won't, and never has.

There seem to be three aspects that lead to this conclusion: the
storage technology, the indexing method (including alignment and
extension), and the method of determining presence ("tagging").

Porting. SRAMs are single ported, Register files are multiported.

Is this really a fundamental distinction?

Yes, actually it is.

If one uses SRAM to mean
merely Static (not-refreshed) RAM, then register files are also
SRAM. If one uses SRAM to mean classic 6-transistor SRAM cells,
then the 8-transistor cells used in one of Intel's Atom L1 caches
would not be SRAM.

Would it surprise you to know that in order to make such a dual-ported
SRAM cell "process tolerant"* the SRAM cell has to be at least
as big as if there were 2 independent SRAM cells ? That is:: if you
want a 2-ported SRAM, use 2 SRAM instances, read them independently,
but write both at the same time with the same value.

(*) under some process variations, the SRAM cell will lose data if
both read ports are used simultaneously--UNLESS the gain of the
central inverter-pair is increased. For cells with more than 2 ports
you get to a point where the cell cannot be written at some corners
of the process space (strong N-channels with weak P-channels.)
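
A behavioural sketch of the "use 2 SRAM instances" approach, assuming toy class names (`SinglePortSram`, `TwoReadSram`) that are not from the post. It models only the read/write behaviour, not the cell sizing or process corners discussed above.

```python
class SinglePortSram:
    """Toy single-ported array: one read or one write at a time."""

    def __init__(self, words):
        self.data = [0] * words

    def read(self, addr):
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value


class TwoReadSram:
    """Two read ports built from two identical single-ported copies."""

    def __init__(self, words):
        self.copy_a = SinglePortSram(words)
        self.copy_b = SinglePortSram(words)

    def write(self, addr, value):
        # Both copies are written at the same time with the same value,
        # so they always hold identical contents.
        self.copy_a.write(addr, value)
        self.copy_b.write(addr, value)

    def read2(self, addr_a, addr_b):
        # Each read port is served independently by its own copy.
        return self.copy_a.read(addr_a), self.copy_b.read(addr_b)


rf = TwoReadSram(8)
rf.write(3, 42)
rf.write(5, 7)
print(rf.read2(3, 5))
```

The storage cost is two full copies, which matches the claim that a robust dual-ported cell ends up at least as big as two independent cells.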

Transistor level design of Register Files is similarly fraught with
peril.

At some point, the number of select lines and the number of bus-
wires is big enough that you CAN hide the register file under the
wires. Transistor count goes up as 2+2+2×ports while wire goes up
by 2+selects×2×ports.

The storage technology is not strictly bound to its use.

In the abstract this is true enough.
In practice it is not.

(Obviously, high area/power per bit storage is biased to smaller
capacity and higher latency storage is biased to infrequent access
or prefetchable/throughput uses.)

[snip]

For general memory addressing, there is a more complex address
generation, the size of the operand will be variable (alignment
and extension — word-addressed memory would avoid this overhead☺),

Register access is by a fixed bit pattern in the instruction,
Memory access is by performing arithmetic on operands to get the
address.

As noted later, memory accesses can also be indexed by a fixed bit
pattern in the instruction. Determining whether a register ID bit
field is actually used may well require less decoding than
determining if an operation is a load based on stack pointer or
global pointer with an immediate offset, but the difference would
not seem to be that great. The offset size would probably also
have to be checked — the special cache would be unlikely to
support all offsets.

Predecoding on insertion into the instruction cache could cache
this usage information.

You cannot predecode if the instruction is not of fixed size, (or
if you do not add predecode bits ala Athlon, Opteron).

[snip]

Register read, address generation, and tag comparison overheads
can be removed for offset addressing by using the base pointer as
the "tag" and the offset as the index. (e.g., "Knapsack: A Zero-
Cycle Memory Hierarchy Component", Todd M. Austin et al., 1993;
"Signature Buffer: Bridging Performance Gap between Registers and
Caches", Lu Peng et al., 2004) "Internal fragmentation" of
utilization increases the cost of such storage relative to the
benefit and offset addressing constrains generality.
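
A minimal sketch of the base-as-tag, offset-as-index idea, assuming a single tracked base and a made-up class name (`OffsetIndexedCache`). It is not the design from either cited paper; it only shows why no address add or per-line tag compare is needed on a hit, and where the offset restriction comes from.

```python
class OffsetIndexedCache:
    """Toy cache for base+offset loads: base is the tag, offset is the index."""

    def __init__(self, size_bytes=256, word_bytes=8):
        self.word_bytes = word_bytes
        self.slots = [None] * (size_bytes // word_bytes)  # None = invalid
        self.base = None  # the one "tag" covering every slot

    def _index(self, offset):
        # Only aligned, non-negative offsets smaller than the cache are cached.
        if offset < 0 or offset % self.word_bytes:
            return None
        idx = offset // self.word_bytes
        return idx if idx < len(self.slots) else None

    def lookup(self, base, offset):
        """Return (hit, value); the offset indexes directly, with no add."""
        idx = self._index(offset)
        if idx is None or base != self.base or self.slots[idx] is None:
            return (False, None)
        return (True, self.slots[idx])

    def fill(self, base, offset, value):
        """Install a value; a new base invalidates everything cached so far."""
        idx = self._index(offset)
        if idx is None:
            return False
        if base != self.base:
            self.slots = [None] * len(self.slots)
            self.base = base
        self.slots[idx] = value
        return True


cache = OffsetIndexedCache()
cache.fill(0x7000, 16, 99)
print(cache.lookup(0x7000, 16))  # same base, cached offset
print(cache.lookup(0x7000, 24))  # same base, slot never filled
```

Slots that are never filled are the "internal fragmentation" mentioned above, and the refusal to cache large offsets is the generality constraint.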


From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun May 26 03:14:30 2024

Paul A. Clayton wrote:

On 3/8/24 11:14 PM, MitchAlsup1 wrote:

Even with My 66000's variable length instructions, most (by
frequency of occurrence) 32-bit immediates would be illegal
instructions and more significant 32-bit words in 64-bit
immediates would usually be illegal instructions, so one could
probably have highly accurate speculative predecode-on-fill.

Since the variable length decoder is only 32 gates (equivalent in
size to 3 1-bit flip-flops) one can simply attach said decoder
to every word of storage in the instruction buffer. And arrange
a tree of "If I get picked, here are my follow on instructions"

Now, once one has a unary pointer into the IB, one gets 2 inst
in 1 gate of delay, 4 in 2 gates, 8 in 3 gates,...until you
get eaten alive with wire delay.
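
One way to picture the doubling (2, 4, 8 instructions per added level) is the pointer-jumping sketch below. It is a software model under assumptions, not the actual circuit: `lengths[i]` stands for the output of the per-word length decoder (the instruction length in words if word i were an instruction start), and the function name is made up.

```python
def instruction_starts(lengths, start, levels):
    """Word indices of up to 2**levels instructions beginning at `start`.

    Every word speculatively decodes its own length; each merge level
    concatenates a word's known chain with the chain of the word that
    follows it, doubling how many instructions each word knows about.
    """
    n = len(lengths)
    # Index n is a sentinel meaning "past the end of the buffer".
    chains = [[i] for i in range(n)] + [[]]
    nxt = [min(i + lengths[i], n) for i in range(n)] + [n]
    for _ in range(levels):
        chains = [chains[i] + chains[nxt[i]] for i in range(n + 1)]
        nxt = [nxt[nxt[i]] for i in range(n + 1)]
    return chains[start]


# Words 2, 5 and 6 are immediates; their "lengths" are don't-care values.
lengths = [1, 2, 1, 1, 3, 1, 1, 1]
print(instruction_starts(lengths, start=0, levels=2))
```

Chains computed for words that turn out to be immediates are simply never selected, since only the chain at the real start pointer is read out.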

Thus, if length decoding is easy, predecoding (into some kind of
table) is unnecessary.

If branch prediction fetch ahead used instruction addresses
(rather than cache block addresses), a valid target prediction
would provide accurate predecode for the following instructions
and constrain the possible decodings for preceding instructions.

Mistakes in predecode that mistook an immediate 32-bit word for an
opcode-containing word might not be particularly painful.

Not when these are masked out by the actual decode selection tree.

Mistakenly "finding" a branch in predecode might not be that
painful even if predicted taken — similar to a false BTB hit
corrected in decode. Wrongly "finding" an optimizable load
instruction might waste resources and introduce a minor glitch in
decode (where the "instruction" has to be retranslated into an
immediate component).

It *feels* attractive to me to have predecode fill a BTB-like
structure to reduce redundant data storage. Filling the "BTB" with
less critical instruction data when there are few (immediate-
based) branches seems less hurtful than losing some taken branch
targets, though a parallel ordinary BTB (redundant storage) might
compensate. The BTB-like structure might hold more diverse
information that could benefit from early availability; e.g.,
loads from something like a "Knapsack Cache". (Even loads from a
more variable base might be sped by having a future file of two or
three such base addresses — or even just the least significant
bits — which could be accessed more quickly and earlier than the
general register file. Bases that are changed frequently with
dynamic values [not immediate addition] would rarely update the
future file fast enough to be useful. I think some x86
implementations did something similar by adding segment base and
displacement early in the pipeline.) More generally, it seems that
the instruction stream could be parsed and stored into components
with different tradeoffs in latency, capacity, etc.

I do not know if such "aggressive" predecode would be worthwhile
nor what in-memory format would best manage the tradeoffs of
density, parallelism, criticality, etc. or what "L1 cache" format
would be best (with added specialization/utilization tradeoffs).

It is a trade-off:: in a GBOoO design, adding a pipe stage cost
around 2% (in an LBIO design around 5%) so the predictor has to
buy more than 2% to "make the cut". It definitely would not make the
cut in the LBIO design, it may or may not make the cut in a GBOoO
design. What we can say is: that the GBOoO design has to have some
kind of branch prediction and not go so far as to assign it a name
or a class.


From Thomas Koenig@21:1/5 to Paul A. Clayton on Sun May 26 20:50:44 2024

Paul A. Clayton <[email protected]> schrieb:

One type of predecode that has been commercially implemented (in a
POWER processor) was storing calculated branch insets rather than
offsets.

What is a branch inset?


From MitchAlsup1@21:1/5 to Thomas Koenig on Sun May 26 22:06:37 2024

Thomas Koenig wrote:

Paul A. Clayton <[email protected]> schrieb:

One type of predecode that has been commercially implemented (in a
POWER processor) was storing calculated branch insets rather than
offsets.

What is a branch inset?

Low order bits of the target virtual address, so you can access ICache
prior to the adder completing AGEN.
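
A small sketch of that idea, assuming a 12-bit I-cache index and hypothetical function names. The low-order target bits depend only on the low-order bits of the branch address and the offset, so they can be computed once at predecode and stored in place of the offset's low bits.

```python
INDEX_BITS = 12  # assumed number of low address bits used to index the I-cache


def branch_inset(branch_pc, offset, index_bits=INDEX_BITS):
    """Low-order bits of the branch target, computed at predecode time."""
    mask = (1 << index_bits) - 1
    return (branch_pc + offset) & mask


def inset_from_low_bits(branch_pc, offset, index_bits=INDEX_BITS):
    """Same result using only the low bits of each input (a narrow adder)."""
    mask = (1 << index_bits) - 1
    return ((branch_pc & mask) + (offset & mask)) & mask


pc = 0x40001FF0
print(hex(branch_inset(pc, 0x20)))         # forward branch crossing a 4 KiB line
print(hex(branch_inset(0x40001010, -0x20)))  # backward branch
```

At fetch time the stored inset can index the I-cache directly while the full-width add for the remaining target bits finishes in parallel.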

