The only good arguments I have heard wrt big architectural register
files has to do with things like Register-Windows and/or optimizing
CALL/RET interface.
But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the
same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.
I think you can get similar benefits with "cache-line sized" memory
operations that load/store several registers at a time (assuming you
have good enough store-to-load forwarding). Or even fold those
loads&stores into some kind of CALL/RET instructions, which can let you
start the control-flow part of the CALL before the stores, and similarly
start the loads before the control flow part of the RET is done.
Stefan Monnier <[email protected]> writes:
> The only good arguments I have heard wrt big architectural register
> files has to do with things like Register-Windows and/or optimizing
> CALL/RET interface.
> But even there, it justifies only additional "second-class registers",
> i.e. where the set of immediately addressable registers can still be the
> same size as usual (e.g. 16 or 32), but you can quickly push some of
> those to some kind of "stack" and then pull them back in.
Not efficiently. You would have to wait until the last instruction
has written back its result, then make the switch, and only then start
reading registers from instructions behind the SAVE/RESTORE
instruction. Each SAVE and each RESTORE would cost several cycles
even on an in-order machine. Not what the mechanism was designed for.
> I think you can get similar benefits with "cache-line sized" memory
> operations that load/store several registers at a time (assuming you
> have good enough store-to-load forwarding).
ARM A64's load pair and store pair instructions.
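A rough way to see what multi-register memory operations buy on the CALL/RET path is to count the instructions needed to spill a set of registers. The sketch below is only arithmetic: the 8-byte register width, the 64-byte cache line, and the count of 10 registers to preserve are assumptions for illustration, not figures from the posts above.

```python
import math

def memory_ops_to_save(num_regs: int, regs_per_op: int) -> int:
    """Stores (or loads) needed to spill num_regs registers
    when each memory operation moves regs_per_op registers."""
    return math.ceil(num_regs / regs_per_op)

# Assumptions (hypothetical, for illustration only).
REG_BYTES = 8        # 64-bit registers
LINE_BYTES = 64      # cache-line sized operation
CALLEE_SAVED = 10    # registers a function has to preserve

single_ops = memory_ops_to_save(CALLEE_SAVED, 1)
pair_ops = memory_ops_to_save(CALLEE_SAVED, 2)  # load-pair / store-pair style
line_ops = memory_ops_to_save(CALLEE_SAVED, LINE_BYTES // REG_BYTES)

print("one register per op :", single_ops)
print("two registers per op:", pair_ops)
print("one line per op     :", line_ops)
```

The count says nothing about latency; whether the wider operations actually help depends on the store-to-load forwarding mentioned in the quoted text.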
> of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
> 40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
> same reason you can put any kind of data in the data cache !!! The
> unified renamer ran out of registers a LOT less often than partitioned
> renamer.
IIUC, access to a size-40 register file is about 1/2 a cycle to access
while a size-160 regfile will probably be a full cycle.
I assume this is part of what makes the pipeline longer, and it also
makes the forwarding network more complex since there are more values
that can benefit from forwarding (i.e. where forwarding is needed to
avoid having to wait for write+read in the register file).
But in return for that, fewer values get read from the regfile, so you
need fewer read ports, right?
How do they deal with the massive number of PRF writes per cycle?
Do they try to "kill" those writes that can be determined to be useless
(because the remaining reads are serviced via the forwarding network
instead)?
Do they use banking?
Stefan
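The quoted claim that a unified renamer runs out of registers less often than a partitioned one can be illustrated with a toy model. This is a minimal sketch, not a model of K9: the workload generator is invented, and the only numbers taken from the thread are the 4 x 40 partitioning and the 160-entry total.

```python
import random

CLASSES = ("int", "fp", "simd", "mem")

def fits_partitioned(demand, per_class=40):
    """A partitioned pool stalls as soon as any one class overflows."""
    return all(d <= per_class for d in demand)

def fits_unified(demand, total=160):
    """A unified pool stalls only when the combined demand overflows."""
    return sum(demand) <= total

def count_stalls(samples):
    partitioned = sum(not fits_partitioned(d) for d in samples)
    unified = sum(not fits_unified(d) for d in samples)
    return partitioned, unified

# Hypothetical workload: in each sample one class is "hot" and the
# others are lightly used. The ranges are assumptions for illustration.
rng = random.Random(0)
samples = []
for _ in range(10_000):
    demand = [rng.randint(0, 30) for _ in CLASSES]
    demand[rng.randrange(len(CLASSES))] = rng.randint(20, 90)
    samples.append(tuple(demand))

partitioned_stalls, unified_stalls = count_stalls(samples)
print("partitioned stalls:", partitioned_stalls)
print("unified stalls    :", unified_stalls)
```

Whatever the workload, the unified pool can never stall more often in this model: if the total exceeds 160, at least one class must exceed 40. The model deliberately ignores the access-time cost of the larger file, which is the concern raised in the reply above.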
On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:
>> of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
>> 40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
>> same reason you can put any kind of data in the data cache !!! The
>> unified renamer ran out of registers a LOT less often than partitioned
>> renamer.
> IIUC, access to a size-40 register file is about 1/2 a cycle to access
> while a size-160 regfile will probably be a full cycle.
K9 was a high clock frequency design (5GHz in 65nm) so Rf decode
was 1 cycle (mostly wire delay), RF read-select was 1-cycle (all
wire delay) and RF read-out was 1 cycle 3/4 wire delay (16 ports).
Also note: We had to use Ampere's Law for wire propagation instead of
simple LRC--since edge speeds were faster than 3ps. {{All sorts of
stuff starts to break at these speeds.}}
> I assume this is part of what makes the pipeline longer, and it also
> makes the forwarding network more complex since there are more values
> that can benefit from forwarding (i.e. where forwarding is needed to
> avoid having to wait for write+read in the register file).
Forwarding 1 was mostly MUX delay and a bit of wire delay
Forwarding 2 was mostly wire delay after MUX.
> But in return for that, fewer values get read from the regfile, so you
> need fewer read ports, right?
This depends on exactly WHEN you know which registers you need to
read. We were using a value-free reservation station design.
OH and BTW the renamer was 22-renames per cycle.
> How do they deal with the massive number of PRF writes per cycle?
There were 8 results per cycle (max) and a RoB to absorb the OoOness,
so we could block-write registers to the RF.
> Do they try to "kill" those writes that can be determined to be useless
> (because the remaining reads are serviced via the forwarding network
> instead)?
Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
attempt this.
> Do they use banking?
We tried almost everything we could think of and then some.
> Stefan
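The timing figures in the post above can be turned into a back-of-the-envelope budget. The sketch below uses only the stated 5GHz clock, 8 gates per cycle, and three one-cycle RF stages (decode, read-select, read-out). Treating the cycle as evenly divided among gate delays is a simplifying assumption that ignores latch overhead and the wire delay the post emphasizes.

```python
def cycle_time_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds."""
    return 1000.0 / freq_ghz

def max_freq_ghz(gates_per_cycle: int, gate_delay_ps: float) -> float:
    """Clock rate if a cycle is exactly gates_per_cycle gate delays long."""
    return 1000.0 / (gates_per_cycle * gate_delay_ps)

FREQ_GHZ = 5.0        # from the post
GATES_PER_CYCLE = 8   # from the post
RF_READ_STAGES = 3    # decode, read-select, read-out at 1 cycle each

cycle = cycle_time_ps(FREQ_GHZ)
gate_budget = cycle / GATES_PER_CYCLE
rf_read_latency = RF_READ_STAGES * cycle

print(f"cycle time      : {cycle:.0f} ps")
print(f"per-gate budget : {gate_budget:.0f} ps")
print(f"RF read latency : {rf_read_latency:.0f} ps")

# Assumption: the same per-gate delay, with a longer cycle of 13 or 14 gates.
for gates in (13, 14):
    print(f"{gates} gates/cycle -> {max_freq_ghz(gates, gate_budget):.2f} GHz")
```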
MitchAlsup1 wrote:
> On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:
>>> of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
>>> 40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
>>> same reason you can put any kind of data in the data cache !!! The
>>> unified renamer ran out of registers a LOT less often than partitioned
>>> renamer.
>> IIUC, access to a size-40 register file is about 1/2 a cycle to access
>> while a size-160 regfile will probably be a full cycle.
> K9 was a high clock frequency design (5GHz in 65nm) so Rf decode
> was 1 cycle (mostly wire delay), RF read-select was 1-cycle (all
> wire delay) and RF read-out was 1 cycle 3/4 wire delay (16 ports).
So a 3-stage pipeline for PRF reads or writes?
One of my basic design assumptions is that PRF R/W are 1 cycle.
I was pondering the consequences of dealing with this, trying to
maintain back-to-back scheduling, or minimize how badly it breaks.
> Also note: We had to use Ampere's Law for wire propagation instead of
> simple LRC--since edge speeds were faster than 3ps. {{All sorts of
> stuff starts to break at these speeds.}}
>> I assume this is part of what makes the pipeline longer, and it also
>> makes the forwarding network more complex since there are more values
>> that can benefit from forwarding (i.e. where forwarding is needed to
>> avoid having to wait for write+read in the register file).
> Forwarding 1 was mostly MUX delay and a bit of wire delay
> Forwarding 2 was mostly wire delay after MUX.
>> But in return for that, fewer values get read from the regfile, so you
>> need fewer read ports, right?
> This depends on exactly WHEN you know which registers you need to
> read. We were using a value-free reservation station design.
> OH and BTW the renamer was 22-renames per cycle.
>> How do they deal with the massive number of PRF writes per cycle?
> There were 8 results per cycle (max) and a RoB to absorb the OoOness,
> so we could block-write registers to the RF.
>> Do they try to "kill" those writes that can be determined to be useless
>> (because the remaining reads are serviced via the forwarding network
>> instead)?
> Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
> attempt this.
>> Do they use banking?
> We tried almost everything we could think of and then some.
>> Stefan
Each of those PRF write pipeline stages can forward to its equivalent
or younger read stages. For 3 stages, 8R4W ports:
  stage 3 write S3W forwards to S3R, S2R, S1R = 3*8*4 = 96
  stage 2 write S2W forwards to S2R, S1R      = 2*8*4 = 64
  stage 1 write S1W forwards to S1R           = 1*8*4 = 32
  total                                       = 192 forwarding buses,
plus tag comparators, muxes, etc., just for dealing with the PRF
internal forwarding network.
And if those are SIMD registers then those are all very wide buses.
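The bus count above generalizes to any pipeline depth and port mix. A minimal sketch, assuming the same rule as in the post (write stage k forwards to read stages k down to 1); the function name is made up for illustration:

```python
def forwarding_buses(stages: int, read_ports: int, write_ports: int) -> int:
    """Count PRF-internal forwarding buses when write stage k can
    forward to its equivalent or younger read stages (k, k-1, ..., 1)."""
    total = 0
    for write_stage in range(1, stages + 1):
        reachable_read_stages = write_stage
        total += reachable_read_stages * read_ports * write_ports
    return total

# The case from the post: 3 stages, 8 read ports, 4 write ports.
print(forwarding_buses(3, 8, 4))   # 96 + 64 + 32 = 192

# Same rule with a single-cycle PRF, the design assumption stated earlier.
print(forwarding_buses(1, 8, 4))   # 32
```

In closed form this is `read_ports * write_ports * stages * (stages + 1) / 2`, so the bus count grows quadratically with pipeline depth.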
I would look to do things that avoid using the PRF whenever possible,
like using valued Reservation Stations if they can be read in the same
cycle rather than pulling from the PRF, so the FU schedulers don't have
to deal with the PRF pipeline latency (assuming that RS can be read in
1 cycle).
When operands come from the RS then there is no latency between the
scheduler picking a uOp and launching it for execution.
If operands come from the PRF then the FU schedulers have to issue
the register reads 3 cycles before they launch the uOp.
So each FU has a 3 cycle launch queue.
Except the PRF latency is variable if the value comes from its internal
pipeline forwarding network, so this needs more thought.
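The launch-queue idea can be sketched as a fixed-depth shift register between the scheduler's pick and the FU launch. This is a toy model under stated assumptions: the depth equals the PRF read latency, one uOp is picked per cycle at most, and the variable latency from the PRF's internal forwarding network (flagged above as an open problem) is not modeled. The function and uOp names are hypothetical.

```python
from collections import deque

def simulate_launch(picks, depth):
    """picks[i] is the uOp the scheduler picks in cycle i (None = no pick).
    depth is the operand read latency in cycles (0 models a valued RS).
    Returns a dict mapping each uOp to the cycle it launches."""
    pipe = deque([None] * depth)   # one slot per PRF read stage
    pending = list(picks)
    launched = {}
    cycle = 0
    while pending or any(slot is not None for slot in pipe):
        incoming = pending.pop(0) if pending else None
        if depth == 0:
            leaving = incoming     # operands are available in the pick cycle
        else:
            pipe.append(incoming)  # start the register read
            leaving = pipe.popleft()
        if leaving is not None:
            launched[leaving] = cycle
        cycle += 1
    return launched

picks = ["a", "b", None, "c"]
print("PRF operands, 3-cycle read:", simulate_launch(picks, 3))
print("valued RS, same-cycle read:", simulate_launch(picks, 0))
```

With a 3-cycle read every uOp launches three cycles after it is picked, so throughput is unchanged but the scheduler has to commit to its picks three cycles early. That commitment is what makes back-to-back scheduling of dependent uOps hard.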
But independent of that, I do miss Ivan's posts in this newsgroup, even
if they aren't about the Mill. I do hope he can find time to post at
least occasionally.
On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:
> But independent of that, I do miss Ivan's posts in this newsgroup, even
> if they aren't about the Mill. I do hope he can find time to post at
> least occasionally.
Although I agree, I am also satisfied as long as he is well and healthy.
If he can't waste time with USENET for now, that is all right with me.
But I am instead concerned if he is unable to find funding to make any
progress with the Mill, given that it appears to have been a very promising
project. That is much more important.
On 7/21/2025 8:45 AM, John Savard wrote:
> On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:
>> But independent of that, I do miss Ivan's posts in this newsgroup, even
>> if they aren't about the Mill. I do hope he can find time to post at
>> least occasionally.
> Although I agree, I am also satisfied as long as he is well and healthy.
> If he can't waste time with USENET for now, that is all right with me.
> But I am instead concerned if he is unable to find funding to make any
> progress with the Mill, given that it appears to have been a very promising
> project. That is much more important.
Based on the posts at the link I posted above, they are making progress,
albeit quite slowly. I understand the patents issue, as they require
real money. But I thought their model of doing work for a share of the
possible eventual profits, if any, would attract enough people to get
the work done. After all, there are lots of people who contribute to
many open source projects for no monetary return at all. And the Mill
needs only a few people. But apparently, I was wrong.
Stephen Fuld <[email protected]d> writes:
> On 7/21/2025 8:45 AM, John Savard wrote:
>> On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:
>>> But independent of that, I do miss Ivan's posts in this newsgroup, even
>>> if they aren't about the Mill. I do hope he can find time to post at
>>> least occasionally.
>> Although I agree, I am also satisfied as long as he is well and healthy.
>> If he can't waste time with USENET for now, that is all right with me.
>> But I am instead concerned if he is unable to find funding to make any
>> progress with the Mill, given that it appears to have been a very promising
>> project. That is much more important.
> Based on the posts at the link I posted above, they are making progress,
> albeit quite slowly. I understand the patents issue, as they require
> real money. But I thought their model of doing work for a share of the
> possible eventual profits, if any, would attract enough people to get
> the work done. After all, there are lots of people who contribute to
> many open source projects for no monetary return at all. And the Mill
> needs only a few people. But apparently, I was wrong.
It's easy to underestimate the resources required to bring a new
processor architecture to a point where it makes sense to build
a test chip. Then to optimize the design for the target node.
That's just the hardware side. Then there is the software
infrastructure (processor ABI, processor-specific library code, etc.).
Not to mention marketing and Hot Chips.
Looking at the webpage, the belt seems to have some characteristics
in common with stack-based architectures, bringing to mind Burroughs
large systems and the HP-3000.