Paul A. Clayton wrote:
On 1/25/24 10:22 AM, Anton Ertl wrote:
[snip]
I think the commonly understood meaning is that
all instructions start their execution in-order (i.e., none goes to a
functional unit earlier than an architecturally earlier instruction).
Execution can overlap.
What about a skewed pipeline? A simple skewed pipeline that
statically assigned operations to a pipeline-stage/execution unit
has been called in-order (in what I have read). A "second-chance"
pipeline (where many operations can dynamically choose the
pipeline stage based on operand availability) involves dynamic
scheduling (so would seem to fall into out-of-order), but
counterflow pipelines ("Counterflow Pipeline Processor
Architecture", Robert F. Sproull et al., 1994) — which are more
extreme in some ways than pipelines that have two stages in which
operations can start — are stated to have "No overtaking.
Instructions must stay in program order in the instruction
pipeline.", which sounds "in-order" (and the paper was written by
people working at Sun Microsystems).
(I thought counterflow pipelines were weird. Simplifying
communication makes sense, but ...)
I get the impression that early PowerPC out-of-order execution
implementations were really very similar to using the forwarding
network for out-of-order completion (with in-order writeback). If
I recall correctly, renaming was done by appending a version to
the architectural register name and operands would be captured as
soon as they were available rather than passing along the pipeline
with forwarding until the writeback stage.
This sounds more like Mc 88110 rather than PPC 620.
PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
die area. Other things may have been jettisoned at this shrink of design
point. The 620 was originally targeted to be equal to Mc 88120, which was
a 6-wide GBOoO machine, full Tomasulo with precise exceptions and 4 external
busses named {Data Out, Data In, Address Out, Address In}.
Address Out was used for cache misses to bring data to the CPU
Data Out was used for cache victims to send data to DRAM
Data In was used by arriving DRAM data
Address In was used for arriving Snoops
Smart externals could use Data In to send the CPU data before it knew it
needed it. That data could be code or data.
Brett wrote:
So which was better, your baby the 88xxx or PPC?
Pick your metric: die size, performance, heat, upgrade path, other.
I am yet to see a CPU with the performance we got on the 88120 simulator.
In 1992 we were getting 2.0 IPC on things like XLISP and 5.9 IPC on Matrix300.
[email protected] (MitchAlsup1) writes:
I don't think that it's plausible that the 88120, which would have
appeared in the mid-1990s, would perform as well as or better than
goldencove on this workload. My guess is that it would have had to
undergo a silicon diet like the PPC620, probably even more so, because
it was to appear earlier, which probably would have meant fewer
transistors, which would have reduced the matrix300 IPC and, probably
to a lesser extent, the Xlisp IPC.
Also, the question is how fast the result would clock. The 88110 was
available in 1992 at 50MHz, in the same year as the 200MHz 21064.
When would the 88120 have been available at what clock rate? The
PPC620 was available in 1997 at up to 150MHz, while the Pentium II
Klamath was available in 1997 at clock rates up to 300MHz, and the
(in-order) 21164a was available at 600MHz; my guess is that the 21164a
could also produce good matrix300 numbers.
- anton
Michael S <[email protected]> writes:
I can't find information about Matrix300.
It seems to be part of SPEC89 FP suite, but spec.org does not provide
info about anything older than SPEC92.
Can you tell me what exactly does it do?
It's 300x300 FP matrix multiply (not sure if single or double). There
was a company that had a tool (famous at the time, but I don't
remember the name) that could transform the original source code into
a cache-blocked variant, which typically ran at the limits imposed by
the FUs. Eventually everyone used that tool in their compiler to get
good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
SPEC92.
- anton
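For readers who have not seen the transformation: below is a minimal sketch of what "cache-blocked" means for a matrix multiply. It is not the matrix300 source and not the output of the tool mentioned above; the function names and the block size `bs` are made up for illustration, and a real blocked kernel would pick the block size from the cache geometry.

```python
def matmul_naive(a, b, n):
    """Textbook i-j-k matrix multiply on n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, n, bs=4):
    """Same product, computed tile by tile (bs is a hypothetical block size).

    Each (ii, kk, jj) step only touches a bs x bs tile of a, b and c, so the
    working set of the inner loops stays small enough to sit in cache.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += aik * row_b[j]
    return c
```

Both functions do the same multiplies and adds; only the order in which memory is visited changes.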
On Sun, 03 Mar 2024 16:33:45 GMT
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
I can't find information about Matrix300.
It seems to be part of SPEC89 FP suite, but spec.org does not provide
info about anything older than SPEC92.
Can you tell me what exactly does it do?
It's 300x300 FP matrix multiply (not sure if single or double). There
was a company that had a tool (famous at the time, but I don't
remember the name) that could transform the original source code into
a cache-blocked variant, which typically ran at the limits imposed by
the FUs. Eventually everyone used that tool in their compiler to get
good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
SPEC92.
- anton
So, in today's world it would be something like "How fast can you do
DGEMM with 7 out of your 8 [SIMD] hands tied behind your back?"
Or maybe more than 7, if your variant of AMX supports double
precision.
The challenge is funny, but the answer is not particularly useful.
But even in that not particularly useful answer, IPC appears to be the
least useful part. Far worse than FLOPS/Hz.
SIMD would only support 2 of the 8 calls to DGEMM where the transposes
are not {'N', 'n', 'T', 't', 'C', or 'c'}. SIMD would do nothing for
the 6 transpose calls.
It should also be noted that the transposed
matrices have significantly worse cache behavior than the non-transposed
version, as each access is to a different cache line.
In My 66000, using VEC-LOOP, the instruction count goes down again to 5 (from
6) per loop, since LOOP performs ADD-CMP-BC in a single instruction
and in a single clock.
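A rough way to see the cache-line remark above: count how many distinct lines a walk along one row or one column of a 300x300 matrix of doubles touches. The 64-byte line size, row-major layout, and start address 0 are assumptions for this sketch only; nothing in the thread says which machine or line size is meant.

```python
def lines_touched(n, stride_elems, start_elem=0, elem_bytes=8, line_bytes=64):
    """Number of distinct cache lines hit by n accesses at a fixed element stride.

    elem_bytes=8 models double precision; line_bytes=64 is an assumed line size.
    """
    return len({
        ((start_elem + i * stride_elems) * elem_bytes) // line_bytes
        for i in range(n)
    })


N = 300
row_walk = lines_touched(N, 1)   # unit stride: consecutive elements share lines
col_walk = lines_touched(N, N)   # stride of one full row: a new line per access
```

With these assumptions the column walk lands on a different line for every one of its 300 accesses, while the row walk reuses each line for several elements.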
Michael S <[email protected]> writes:
But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.
I guess that the 4-wide 21164 also showed close to 4 IPC
on matrix300, while its 2 integer units would limit it to much lower
performance on, say, intmm.
- anton
On Sun, 03 Mar 2024 22:22:37 GMT
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.
I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
on matrix300, while its 2 integer units would limit it to much lower
performance on, say, intmm.
- anton
I don't know about the specific case of matrix300 and what transformations
are allowed by SPEC rules and what not, but if I were tasked with
writing generic DGEMM for Alpha 21164 with maximal performance on
non-small relatively square matrices as the main goal, then I'd start
with something like that:
// main_loop_3x6 - multiply 3 rows of A[][]
// by 6 columns of B[][] assuming C-language order
The loop consists of 9 loads, 4 pointer updates,
1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
DP adds. 51 total instructions.
Ideally, it will run in 18 clocks, for IPC = 2.83.
What is my point? My point is that I expect a "medium-IPC" kernel like
the one above to achieve higher FLOPS (== better performance) than a simpler,
smaller kernel with IPC in excess of 3.5.
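The kernel body itself did not survive in this copy of the post, so here is a hypothetical reconstruction of the 3x6 idea, plus the instruction accounting from the text. The function name is borrowed from the comment; everything else (argument names, list-of-lists layout) is an assumption, and the real thing would of course be Alpha assembly or C, not Python.

```python
def main_loop_3x6(a_rows, b, col0, k_len):
    """Accumulate a 3x6 tile: 3 rows of A times 6 columns of B starting at col0.

    Per k step this needs 3 values of A and 6 values of B (9 loads) and
    performs 18 multiplies and 18 adds, matching the counts in the post.
    """
    acc = [[0.0] * 6 for _ in range(3)]
    for k in range(k_len):
        a_vals = [a_rows[r][k] for r in range(3)]   # 3 loads
        b_vals = b[k][col0:col0 + 6]                # 6 loads
        for r in range(3):
            for c in range(6):
                acc[r][c] += a_vals[r] * b_vals[c]  # 1 multiply + 1 add
    return acc


# Instruction mix per loop iteration, copied from the post.
INSTRUCTION_MIX = {
    "loads": 9,
    "pointer updates": 4,
    "counter decrement": 1,
    "conditional branch": 1,
    "DP multiplies": 18,
    "DP adds": 18,
}
TOTAL_INSTRUCTIONS = sum(INSTRUCTION_MIX.values())
IDEAL_CYCLES = 18                      # one multiply and one add issued per clock
IPC = TOTAL_INSTRUCTIONS / IDEAL_CYCLES
FLOPC = (INSTRUCTION_MIX["DP multiplies"] + INSTRUCTION_MIX["DP adds"]) / IDEAL_CYCLES
```

The accounting reproduces the 51 instructions and IPC of about 2.83 quoted above, at 2 FP operations per clock.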
Michael S <[email protected]> writes:
On Sun, 03 Mar 2024 22:22:37 GMT
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
But even in that not particularly useful answer IPC appears to be the
least useful part. Far worse than FLOPS/Hz.
I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
on matrix300, while its 2 integer units would limit it to much lower
performance on, say, intmm.
- anton
I don't know about the specific case of matrix300 and what transformations
are allowed by SPEC rules and what not, but if I were tasked with
writing generic DGEMM for Alpha 21164 with maximal performance on
non-small relatively square matrices as the main goal, then I'd start
with something like that:
// main_loop_3x6 - multiply 3 rows of A[][]
// by 6 columns of B[][] assuming C-language order
The loop consists of 9 loads, 4 pointer updates,
1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
DP adds. 51 total instructions.
Ideally, it will run in 18 clocks, for IPC = 2.83.
Given that starting 18 FP multiplies and 18 FP additions takes 18
cycles, that is optimal. But you unrolled more than is necessary to
achieve 2 FLOPC (FP operations per cycle). With less unrolling, you
could have achieved the same 2 FLOPC and of course you would see higher
IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
loop.
What is my point? My point is that I expect a "medium-IPC" kernel like
the one above to achieve higher FLOPS (== better performance) than a simpler,
smaller kernel with IPC in excess of 3.5.
These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPC at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.
But yes, 5.9 IPC on matrix300 shows little about the matrix multiply
performance. Still, I think that the point is that there are many
hurdles that might result in a lower IPC (for code where only 6 IPC
means 2 FLOPC), and the fact that they achieved 5.9 indicates that they
managed to lower the hurdles a lot; true, it would be better if they
could have shown it with code where 6 IPC is more meaningful.
- anton
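To make the unrolling argument concrete, here is a toy throughput model. It is only a sketch under assumptions pieced together from this thread: a 4-wide machine, one FP multiply plus one FP add per cycle (hence the 2 FLOPC ceiling), and two integer pipes that handle loads, pointer updates, the counter and the branch. The rule "one pointer per row of A plus one for B" is my generalization of the "4 pointer updates" in the 3x6 kernel, not something anyone stated.

```python
import math


def kernel_model(rows, cols, issue_width=4, int_pipes=2):
    """Idealized per-iteration numbers for a rows x cols register-blocked kernel."""
    flops = 2 * rows * cols                      # one multiply and one add per tile element
    non_fp = (rows + cols) + (rows + 1) + 1 + 1  # loads, pointer updates, decrement, branch
    total = flops + non_fp
    cycles = max(
        rows * cols,                     # FP pipes: one mul + one add per cycle
        math.ceil(non_fp / int_pipes),   # assumed limit of the two integer pipes
        math.ceil(total / issue_width),  # assumed limit of 4-wide issue
    )
    return {
        "instructions": total,
        "cycles": cycles,
        "ipc": total / cycles,
        "flopc": flops / cycles,
        "non_fp_per_cycle": non_fp / cycles,
    }


big_tile = kernel_model(3, 6)    # the 3x6 kernel from the earlier post
small_tile = kernel_model(2, 3)  # a less-unrolled variant
```

In this model both tiles reach 2 FLOPC, but the 2x3 tile does it at a noticeably higher IPC because it carries more non-FP instructions per FP operation. The 3x6 case reproduces the 0.83 non-flop instructions/cycle figure quoted above.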
On Mon, 04 Mar 2024 18:18:35 GMT
[email protected] (Anton Ertl) wrote:
These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.
In the 90s, CPUs had other reasons to minimize the # of instructions and
esp. the # of load instructions per task. E.g. too few banks in L1D
cache, so the cache that in theory supports two accesses per clock in
practice is closer to 1.
E.g. very few hits served under miss.
E.g. low associativity.
E.g. theoretically 4-wide instruction Fetch/Decode that
in practice delivers 4 decoded instructions only when all the inner planets
of the solar system are aligned.
According to my understanding, the 21164, being a speed racer, suffered from
that sort of problem more than most competitors.
Michael S wrote:
On Mon, 04 Mar 2024 18:18:35 GMT
[email protected] (Anton Ertl) wrote:
These days, with power limits resulting in lower clocks for programs
that do more work, yes, I guess that you will see better FLOPS from
variants that execute fewer instructions. But in the 90s, CPUs ran at
their rated clock rate no matter what, and a 21164 would run a variant
that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
whether that variant performs 0.83 non-flop instructions/cycle or 1.9
non-flop instructions/cycle.
In 90-x CPUs had other reasons to minimize the # of instructions and
Everyone always has excellent reasons to minimize the number of instructions.
Over in CISC-land, it takes fewer instructions to get the job done.
Over in RISC-land, the instructions run in fewer nanoseconds.
The critical term in performance is::
seconds/program = instructions/program × cycles/instruction × seconds/cycle
RISCs tend to get instructions/program wrong and cycles/instruction
right, while
CISCs tend to get instructions/program right and cycles/instruction wrong.
I happen to believe that between RISC and CISC is a realm where one needs
fewer instructions but sacrifices essentially nothing in the frequency
department.
My 66000 tends to have only 10% more instructions than VAX while RISC-V
tends to have 50% more instructions than VAX--My 66000 needs only 71%
of the instruction count of RISC-V.
esp. the # of load instructions per task. E.g. too few banks in L1D
cache, so the cache that in theory supports two accesses per clock in
practice is closer to 1.
CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.
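The performance equation quoted above is easy to play with numerically. The sketch below just multiplies the three terms and works out the instruction-count ratio from the rounded "10% more" and "50% more" figures; the function and variable names are mine, and the normalization to VAX = 1.0 is only for convenience.

```python
def seconds_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """seconds/program = instructions/program x cycles/instruction x seconds/cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


# Relative instruction counts, normalized to VAX = 1.0 (rounded figures from the post).
VAX = 1.00
MY66000 = 1.10   # "only 10% more instructions than VAX"
RISCV = 1.50     # "50% more instructions than VAX"

# Instruction count of My 66000 relative to RISC-V, from the rounded figures.
ratio = MY66000 / RISCV

# With equal cycles/instruction and equal cycle time, run time scales with the same ratio.
time_ratio = (seconds_per_program(MY66000, 1.0, 1e-9)
              / seconds_per_program(RISCV, 1.0, 1e-9))
```

The rounded percentages give a ratio of roughly 0.73, in the same ballpark as the 71% stated; presumably the 71% comes from unrounded counts.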
Stephen Fuld wrote:
On 3/4/2024 4:05 PM, MitchAlsup1 wrote:
CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.
If those percentages are number of loads & stores divided by total
instruction count, isn't this just a restatement of your previous
point that CISCs need fewer instructions to do the job? i.e. the
*time* between loads or stores is the same for RISCs and CISCs?
Not when you include various other facts::
CISCs tend to have fewer registers
CISCs tend to have LD-OPs and some have LD-OP-STs
Both of the above give the compiler the illusion that inbound memory
references are less expensive than a typical LD because you get the
LD, and you don't have to waste a precious register. Thus there are
more memory references--but RISC compilers have taught us that more
registers beats LD-OPs--pipeline designers have taught us that thinner
pipelines perform better--both stand against LD-OPs and LD-OP-STs.
VAX went so far as to allow any operand and any result to be memory
{Most of us now believe this was a massive overstep.}
CISCs really do perform more memory references--not by as much as the
above statistics imply, but significantly more memory references.
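Stephen's normalization question can be put in numbers. The sketch below multiplies a relative instruction count by a memory reference density to get memory references per unit of work. Big caveat: it borrows the "RISC-V needs 50% more instructions than VAX" figure from earlier in the thread as a stand-in for RISC vs CISC in general, which is my assumption and not something either poster claimed.

```python
def refs_per_task(relative_instruction_count, memory_reference_density):
    """Memory references per task = instructions per task x fraction that touch memory."""
    return relative_instruction_count * memory_reference_density


# CISC normalized to 1.0 instructions per task; RISC assumed to need 1.5x (borrowed figure).
cisc_low, cisc_high = refs_per_task(1.0, 0.45), refs_per_task(1.0, 0.50)
risc_low, risc_high = refs_per_task(1.5, 0.30), refs_per_task(1.5, 0.35)
```

Under that borrowed ratio the two ranges overlap almost completely, which is the effect the question is getting at. It does not settle the claim that CISCs still perform more references per task; that would need measured counts rather than these rounded densities.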
[email protected] (MitchAlsup1) writes:
Stephen Fuld wrote:
On 3/4/2024 4:05 PM, MitchAlsup1 wrote:
CISCs generally have a 45%-50% memory reference density, while
RISCs generally have a 30%-35% memory reference density.
If those percentages are number of loads & stores divided by total
instruction count, isn't this just a restatement of your previous point
that CISCs need fewer instructions to do the job? i.e. the *time*
between loads or stores is the same for RISCs and CISCs?
Not when you include various other facts::
CISCs tend to have fewer registers
How much of that is because active CISC architectures
are forty or fifty years old?
Would a modern, designed from scratch, CISC architecture
still restrict the number of registers?
If memory access ever becomes as fast as register access,
all bets will be off...
Scott Lurndal wrote:
If memory access ever becomes as fast a register access,
all bets will be off...
It won't, and never has.
[email protected] (MitchAlsup1) writes:
Scott Lurndal wrote:
If memory access ever becomes as fast a register access,
all bets will be off...
It won't, and never has.
There were a number of historic implementations where the
registers were actually stored in low memory.
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Scott Lurndal wrote:
If memory access ever becomes as fast a register access,
all bets will be off...
It won't, and never has.
There were a number of historic implementations where the
registers were actually stored in low memory.
Yes, but that is making registers as slow as memory,
not making memory as fast as registers.
[email protected] (MitchAlsup1) writes:
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
Scott Lurndal wrote:
If memory access ever becomes as fast a register access,
all bets will be off...
It won't, and never has.
There were a number of historic implementations where the
registers were actually stored in low memory.
Yes, but that is making registers as slow as memory,
not making memory as fast as registers.
With something like SRAM, with less than 1ns latency, then. With that
you might completely eliminate registers and do everything
memory-to-memory. The small gain lost by eliminating
registers will likely be offset by fewer instructions
to execute. Sufficient for most desktop users, surely.
That's assuming SRAM can scale at reasonable cost due
to some future technology breakthrough (or some other
future non-volatile memory technology with favorable
access timing).
I have lived in FAB environments where I was told "SRAM
will approach DRAM densities* in a generation or two", too.
Still yet to happen.
(*) They thought they had to keep the DRAM capacitor above
a certain amount of fF to retain the 8ms (-16ms) refresh
rates. What they really had to do was control leakage!
On Tue, 5 Mar 2024 19:28:01 +0000
[email protected] (MitchAlsup1) wrote:
I have lived in FAB environments where I was told "SRAM
will approach DRAM densities* in a generation or two, too.
Still, yet to happen.
(*) They thought they had to keep the DRAM capacitor above
a certain amount of fF to retain the 8ms (-16ms) refresh
rates. What they really had to do was control leakage !
According to what I read on the RWT forum, in the last 3-4 years the tables
have turned completely: the SRAM-to-DRAM area ratio is growing rather than
shrinking. Slowly, of course; nowadays everything is slow.
On 3/5/24 10:44 AM, MitchAlsup1 wrote:
Scott Lurndal wrote:[snip]
If memory access ever becomes as fast a register access,
all bets will be off...
It won't, and never has.
There seem to be three aspects that lead to this conclusion: the
storage technology, the indexing method (including alignment and
extension), and the method of determining presence ("tagging").
Any storage technology used for registers could be used for
memory. However, area and power costs must be justified by
benefit.
A theoretical cheap read/expensive write storage technology might
be more appropriate for storage that is not conventional
registers, potentially providing faster non-register storage (for
reads). (Since register writes can be buffered and elided, if the
buffering overhead in a different storage technology was low
enough expensive writes might not prevent use for registers. Yet
one might then view the buffers as "registers" and the 'backing
storage' as more memory-like.)
With no register renaming or sub-registers, the indexing method
for registers is trivial and only a 'ready' bit is needed. (For
old-style VLIW, even the ready bit is not needed.)
For general memory addressing, there is a more complex address
generation, the size of the operand will be variable (alignment
and extension — word-addressed memory would avoid this overhead☺),
and tag comparison would be required. General memory addressing
also involves indirection through a register. (The instruction
pointer is available early in the pipeline, so IP-relative
accesses would not have the delay of reading an arbitrary
register. Register address values that are rarely updated or
usually updated by adding a constant [or replacing with a
constant] could be hoisted earlier in the pipeline.)
Register read, address generation, and tag comparison overheads
can be removed for offset addressing by using the base pointer as
the "tag" and the offset as the index. (e.g., "Knapsack: A Zero-
Cycle Memory Hierarchy Component", Todd M. Austin et al., 1993;
"Signature Buffer: Bridging Performance Gap between Registers and
Caches", Lu Peng et al., 2004) "Internal fragmentation" of
utilization increases the cost of such storage relative to the
benefit and offset addressing constrains generality.
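To illustrate the "base pointer as tag, offset as index" idea in the paragraph above, here is a toy model. It is not the Knapsack or Signature Buffer design; the class name, slot count and word size are all made up, and real hardware would more likely match on the base register's identity than compare pointer values.

```python
class BaseTaggedStore:
    """Toy store tagged by one base pointer and indexed directly by the offset."""

    def __init__(self, slots=16, word_bytes=8):
        self.slots = slots
        self.word_bytes = word_bytes
        self.base = None                  # the single "tag"
        self.valid = [False] * slots
        self.data = [0] * slots

    def _index(self, offset):
        """Map an offset to a slot, or None if misaligned or out of range."""
        if offset % self.word_bytes or not (0 <= offset < self.slots * self.word_bytes):
            return None
        return offset // self.word_bytes

    def set_base(self, base):
        """Retarget the store (e.g. on a new stack frame); old contents become invalid."""
        if base != self.base:
            self.base = base
            self.valid = [False] * self.slots

    def store(self, base, offset, value):
        """Return True if the write was captured, False if it must go to the normal cache."""
        idx = self._index(offset)
        if base != self.base or idx is None:
            return False
        self.data[idx] = value
        self.valid[idx] = True
        return True

    def load(self, base, offset):
        """Return the value on a hit, or None on a miss (fall back to the normal cache)."""
        idx = self._index(offset)
        if base != self.base or idx is None or not self.valid[idx]:
            return None
        return self.data[idx]
```

A hit needs no address add and no tag compare against the full address, only a base match and a direct index. The unused slots of a small frame are the "internal fragmentation" cost mentioned above.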
Register renaming introduces some complexity for addressing
registers. A Register Alias Table lookup is a kind of "address
generation".
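
To illustrate why a Register Alias Table lookup resembles address generation, here is a minimal sketch in which the architectural register number is translated to a physical register index before the storage is accessed. The class and method names are hypothetical, and freeing of physical registers at retirement is deliberately not modeled.

```python
class RegisterAliasTable:
    """Hypothetical RAT: maps architectural register numbers to physical ones."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def lookup(self, arch_reg):
        # Source operands: one table read yields the physical "address".
        return self.map[arch_reg]

    def rename_dest(self, arch_reg):
        # Destination operands: allocate a fresh physical register.
        if not self.free:
            raise RuntimeError("no free physical register; rename would stall")
        phys = self.free.pop(0)
        self.map[arch_reg] = phys
        return phys
```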
One would also desire a "memory" storage component to have larger
capacity. A larger capacity storage will be more expensive to
access and will favor denser (typically slower) storage.
My 66000, if I understand correctly, has registers in memory; it
"merely" caches them in a faster storage. It seems that 64-bit
stack-pointer-relative accesses could be roughly as fast by using
the offset as the index (each stack frame would be comparable to a
different thread register context; the tradeoffs of extra storage
for multiple stack frames ("multithreading" — alternating between
indexing up and indexing down would provide some utilization
flexibility with low indexing overhead) relative to pushing out
early frames (normal "context switch")); such a cache would
probably be limited in frame size cached.
An L2 register set that can only be accessed for one operand might
be somewhat similar to LD-OP.
On 3/6/24 3:00 PM, MitchAlsup1 wrote:
Paul A. Clayton wrote:[snip]
It seems that 64-bit
stack-pointer-relative accesses could be roughly as fast by using
the offset as the index (each stack frame would be comparable to a
different thread register context; the tradeoffs of extra storage
for multiple stack frames ("multithreading" — alternating between
indexing up and indexing down would provide some utilization
flexibility with low indexing overhead) relative to pushing out
early frames (normal "context switch")); such a cache would
probably be limited in frame size cached.
Smells too much like register windows which never outperformed
the flat RF from MIPS. In any event, 50% of subroutines need no
stack <accesses> and those that do typically only store 3 registers
(for restore later).
Register windows were intended to avoid save/restore overhead by
retaining values in registers with renaming. A stack cache is
meant to reduce the overhead of loads and stores to the stack —
not just preserving and restoring registers. A direct-mapped stack
cache is not entirely insane. A partial stack frame cache might
cache up to 256 bytes (e.g.) with alternating frames indexing with
inverted bits (to reduce interference); one could even reserve a
chunk (e.g., 64 bytes) of each frame that is never overlapped, by
limiting the cached offsets to be smaller than the cache.
Such might be more useful than register windows, but that does
not mean that it is actually a good option.
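
A minimal sketch of the inverted-bit indexing for alternating frames, using the 256-byte cache and 64-byte reserved chunk from the example above. The 8-byte word granularity, the use of call-depth parity to pick the direction, and the function name are assumptions.

```python
WORD_BYTES = 8
CACHE_WORDS = 32                  # 256-byte cache
RESERVED_WORDS = 8                # 64-byte chunk per frame with no overlap
MAX_CACHED_WORDS = CACHE_WORDS - RESERVED_WORDS

def stack_cache_index(offset, frame_parity):
    """Map a stack-pointer-relative byte offset to a cache word index.

    Even frames index upward from 0; odd frames invert the index bits and
    so index downward from the top. Returns None if the access is not
    eligible (unaligned, negative, or beyond the cached part of the frame).
    """
    if offset < 0 or offset % WORD_BYTES != 0:
        return None
    word = offset // WORD_BYTES
    if word >= MAX_CACHED_WORDS:
        return None
    if frame_parity & 1:
        return word ^ (CACHE_WORDS - 1)   # invert all index bits
    return word
```

With these numbers, even frames occupy words 0 to 23 and odd frames words 31 down to 8, so the lowest-offset 64 bytes of each frame never collide with the neighboring frame.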
An L2 register set that can only be accessed for one operand
might be somewhat similar to LD-OP.
In high speed designs, there are at least 2 cycles of delay from AGEN
to the L2 and 2 cycles of delay back. Even zero cycle access sees at
least 4 cycles of latency, 5 if you count AGEN.
Presumably this is related to the storage technology used as well
as the capacity.
On 3/6/24 2:53 PM, MitchAlsup1 wrote:
Paul A. Clayton wrote:
On 3/5/24 10:44 AM, MitchAlsup1 wrote:
Scott Lurndal wrote:[snip]
If memory access ever becomes as fast a register access,
all bets will be off...
It won't, and never has.
There seem to be three aspects that lead to this conclusion: the
storage technology, the indexing method (including alignment and
extension), and the method of determining presence ("tagging").
Porting. SRAMs are single-ported; register files are multiported.
Is this really a fundamental distinction? If one uses SRAM to mean
merely Static (not-refreshed) RAM, then register files are also
SRAM. If one uses SRAM to mean classic 6-transistor SRAM cells,
then the 8-transistor cells used in one of Intel's Atom L1 caches
would not be SRAM.
The storage technology is not strictly bound to its use.
(Obviously, high area/power per bit storage is biased to smaller
capacity and higher latency storage is biased to infrequent access
or prefetchable/throughput uses.)
[snip]
For general memory addressing, there is a more complex address
generation, the size of the operand will be variable (alignment
and extension — word-addressed memory would avoid this overhead☺),
Register access is by fixed bit pattern in the instruction;
memory access is by performing arithmetic on operands to get the
address.
As noted later, memory accesses can also be indexed by a fixed bit
pattern in the instruction. Determining whether a register ID bit
field is actually used may well require less decoding than
determining if an operation is a load based on stack pointer or
global pointer with an immediate offset, but the difference would
not seem to be that great. The offset size would probably also
have to be checked — the special cache would be unlikely to
support all offsets.
Predecoding on insertion into the instruction cache could cache
this usage information.
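
A minimal sketch of caching that usage information at instruction-cache fill time. The 32-bit encoding (opcode in bits 31..26, base register in bits 20..16, signed 16-bit immediate), the opcode value, the register numbers, and the offset limit are all invented for illustration; this is not My 66000 or any real ISA.

```python
OP_LOAD = 0x20            # hypothetical load opcode
SP, GP = 31, 30           # hypothetical stack/global pointer register numbers
MAX_OFFSET = 256          # hypothetical range covered by the special cache

def encode(opcode, rd, base, imm):
    """Build a word in the made-up format (helper for the example only)."""
    return (opcode << 26) | (rd << 21) | (base << 16) | (imm & 0xFFFF)

def predecode_word(word):
    """Return True if the word is a load the special cache could serve."""
    opcode = (word >> 26) & 0x3F
    base = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:                      # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode == OP_LOAD and base in (SP, GP) and 0 <= imm < MAX_OFFSET

def fill_line(words):
    """On fill, store one predecode bit alongside each instruction word."""
    return [(w, predecode_word(w)) for w in words]
```

The offset check is part of the predecode bit, so decode later needs only that one bit rather than repeating the opcode, base, and range comparisons.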
[snip]
Register read, address generation, and tag comparison overheads
can be removed for offset addressing by using the base pointer as
the "tag" and the offset as the index. (e.g., "Knapsack: A Zero-
Cycle Memory Hierarchy Component", Todd M. Austin et al., 1993;
"Signature Buffer: Bridging Performance Gap between Registers and
Caches", Lu Peng et al., 2004) "Internal fragmentation" of
utilization increases the cost of such storage relative to the
benefit and offset addressing constrains generality.
On 3/8/24 11:14 PM, MitchAlsup1 wrote:
Even with My 66000's variable length instructions, most (by
frequency of occurrence) 32-bit immediates would be illegal
instructions and more significant 32-bit words in 64-bit
immediates would usually be illegal instructions, so one could
probably have highly accurate speculative predecode-on-fill.
If branch prediction fetch ahead used instruction addresses
(rather than cache block addresses), a valid target prediction
would provide accurate predecode for the following instructions
and constrain the possible decodings for preceding instructions.
Mistakes in predecode that mistook an immediate 32-bit word for an
opcode-containing word might not be particularly painful.
Mistakenly "finding" a branch in predecode might not be that
painful even if predicted taken — similar to a false BTB hit
corrected in decode. Wrongly "finding" an optimizable load
instruction might waste resources and introduce a minor glitch in
decode (where the "instruction" has to be retranslated into an
immediate component).
It *feels* attractive to me to have predecode fill a BTB-like
structure to reduce redundant data storage. Filling the "BTB" with
less critical instruction data when there are few (immediate-
based) branches seems less hurtful than losing some taken branch
targets, though a parallel ordinary BTB (redundant storage) might
compensate. The BTB-like structure might hold more diverse
information that could benefit from early availability; e.g.,
loads from something like a "Knapsack Cache". (Even loads from a
more variable base might be sped by having a future file of two or
three such base addresses — or even just the least significant
bits — which could be accessed more quickly and earlier than the
general register file. Bases that are changed frequently with
dynamic values [not immediate addition] would rarely update the
future file fast enough to be useful. I think some x86
implementations did something similar by adding segment base and
displacement early in the pipeline.) More generally, it seems that
the instruction stream could be parsed and stored into components
with different tradeoffs in latency, capacity, etc.
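
A minimal sketch of the small future file of base addresses suggested above, assuming that immediate additions can be applied early while a dynamic update makes the entry unusable until the new value is known. All names and the update policy are hypothetical.

```python
MASK64 = (1 << 64) - 1

class BaseFutureFile:
    """Hypothetical early copy of two or three base registers."""

    def __init__(self, tracked):
        # None means the value is not currently known early in the pipeline.
        self.values = {reg: None for reg in tracked}

    def set_known(self, reg, value):
        # Value supplied later, e.g. when the general register file is written.
        if reg in self.values:
            self.values[reg] = value & MASK64

    def add_immediate(self, reg, imm):
        # Immediate additions can be tracked without waiting for execution.
        if self.values.get(reg) is not None:
            self.values[reg] = (self.values[reg] + imm) & MASK64

    def dynamic_write(self, reg):
        # A write with a dynamic value invalidates the early copy.
        if reg in self.values:
            self.values[reg] = None

    def early_address(self, reg, offset):
        # Returns None when the access must take the normal AGEN path.
        value = self.values.get(reg)
        return None if value is None else (value + offset) & MASK64
```

A base that is frequently rewritten with dynamic values would spend most of its time in the unknown state, which matches the observation above that such bases would rarely benefit.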
I do not know if such "aggressive" predecode would be worthwhile
nor what in-memory format would best manage the tradeoffs of
density, parallelism, criticality, etc. or what "L1 cache" format
would be best (with added specialization/utilization tradeoffs).
One type of predecode that has been commercially implemented (in a
POWER processor) was storing calculated branch insets rather than
offsets.
Paul A. Clayton wrote:
One type of predecode that has been commercially implemented (in a
POWER processor) was storing calculated branch insets rather than
offsets.
What is a branch inset?