Forum: >>> Magnum BBS <<<

Register windows (was: The Third Wish)

From Stefan Monnier@21:1/5 to All on Thu Jul 17 12:20:13 2025

The only good arguments I have heard wrt big architectural register
files have to do with things like register windows and/or optimizing
the CALL/RET interface.

But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the
same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.
IIRC the Mill actually had 2 categories of "second-class registers":
the stack and the scratch registers.

I think you can get similar benefits with "cache-line sized" memory
operations that load/store several registers at a time (assuming you
have good enough store-to-load forwarding). Or even fold those
loads&stores into some kind of CALL/RET instructions, which can let you
start the control-flow part of the CALL before the stores, and similarly
start the loads before the control flow part of the RET is done.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jul 17 17:38:35 2025

Stefan Monnier writes:

The only good arguments I have heard wrt big architectural register
files have to do with things like register windows and/or optimizing
the CALL/RET interface.

But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the
same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.

Not efficiently. You would have to wait until the last instruction
has written back its result, then make the switch, and only then start
reading registers from instructions behind the SAVE/RESTORE
instruction. Each SAVE and each RESTORE would cost several cycles
even on an in-order machine. Not what the mechanism was designed for.

I think you can get similar benefits with "cache-line sized" memory
operations that load/store several registers at a time (assuming you
have good enough store-to-load forwarding).

ARM A64's load pair and store pair instructions.

Or even fold those
loads&stores into some kind of CALL/RET instructions, which can let you
start the control-flow part of the CALL before the stores, and similarly
start the loads before the control flow part of the RET is done.

In an OoO machine with correct predictions (the usual case), control
flow often runs far ahead of functional-unit processing and retirement
(and only retirement is architectural execution). Any stores on the
predicted control flow will be speculatively performed as soon as
their source data is available, and the same goes for loads, with
(non)aliases being predicted. Plus really modern machines often can
achieve 0-cycle store-to-load forwarding. All of this makes
mechanisms like register windows and IA-64's register stack
unnecessary.
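
As a rough illustration of the store-to-load forwarding mentioned above, here is a minimal Python sketch of a store queue. It is a toy model, not a description of any real core: the class name `StoreQueue`, the dictionary-backed memory, and the exact-address matching are all assumptions made for the example.

```python
class StoreQueue:
    """Toy store queue: in-flight stores are kept in program order."""

    def __init__(self, memory):
        self.memory = memory   # committed memory: address -> value
        self.pending = []      # in-flight (address, value) pairs, oldest first

    def store(self, address, value):
        # A speculative store only enters the queue; memory is untouched.
        self.pending.append((address, value))

    def load(self, address):
        # Search youngest-first so the load sees the most recent older store.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value, "forwarded"
        return self.memory.get(address, 0), "memory"

    def retire_oldest(self):
        # Retirement makes the oldest store architecturally visible.
        address, value = self.pending.pop(0)
        self.memory[address] = value


def call_and_return_demo():
    """Callee saves two registers on entry and reloads them before returning."""
    sq = StoreQueue(memory={})
    sq.store(0x1000, 111)   # save register A (hypothetical stack slot)
    sq.store(0x1008, 222)   # save register B (hypothetical stack slot)
    # Both reloads are served from the queue before any store has retired.
    return [sq.load(0x1000), sq.load(0x1008)]
```

In this model the reloads at function exit never wait for the saves to reach memory, which is the property the argument relies on.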

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup


From Scott Lurndal@21:1/5 to Anton Ertl on Thu Jul 17 19:17:43 2025

Anton Ertl writes:

Stefan Monnier writes:

The only good arguments I have heard wrt big architectural register
files have to do with things like register windows and/or optimizing
the CALL/RET interface.

But even there, it justifies only additional "second-class registers",
i.e. where the set of immediately addressable registers can still be the
same size as usual (e.g. 16 or 32), but you can quickly push some of
those to some kind of "stack" and then pull them back in.

Not efficiently. You would have to wait until the last instruction
has written back its result, then make the switch, and only then start
reading registers from instructions behind the SAVE/RESTORE
instruction. Each SAVE and each RESTORE would cost several cycles
even on an in-order machine. Not what the mechanism was designed for.

I think you can get similar benefits with "cache-line sized" memory
operations that load/store several registers at a time (assuming you
have good enough store-to-load forwarding).

ARM A64's load pair and store pair instructions.

ARM A64 has (optional) 64-byte load/store instructions
(LD64/ST64), which store/load an entire cache line using 8 GPRs.


From Stefan Monnier@21:1/5 to All on Thu Jul 17 16:34:13 2025

of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
same reason you can put any kind of data in the data cache !!! The
unified renamer ran out of registers a LOT less often than partitioned renamer.
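
The quoted point about unified versus partitioned renaming can be shown with a small Python sketch. The pool sizes (4 x 40 versus 160) come from the quoted text; the demand figures and the function names are hypothetical and chosen only to illustrate the effect.

```python
def partitioned_fits(demand, per_class_limit):
    """Each class has its own pool; every class must fit within its own limit."""
    return all(count <= per_class_limit for count in demand.values())


def unified_fits(demand, total_limit):
    """All classes share one pool; only the total number in flight matters."""
    return sum(demand.values()) <= total_limit


# Hypothetical in-flight register demand for an integer-heavy code phase.
integer_heavy = {"integer": 90, "fp": 10, "simd": 5, "memory": 30}

partitioned_ok = partitioned_fits(integer_heavy, per_class_limit=40)
unified_ok = unified_fits(integer_heavy, total_limit=160)
```

With this made-up mix the partitioned scheme stalls on the integer pool even though 135 of the 160 registers would be enough in total, while the unified pool does not stall.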

IIUC, access to a size-40 register file is about 1/2 a cycle to access
while a size-160 regfile will probably be a full cycle.
I assume this is part of what makes the pipeline longer, and it also
makes the forwarding network more complex since there are more values
that can benefit from forwarding (i.e. where forwarding is needed to
avoid having to wait for write+read in the register file).
But in return for that, fewer values get read from the regfile, so you
need fewer read ports, right?

How do they deal with the massive number of PRF writes per cycle?
Do they try to "kill" those writes that can be determined to be useless
(because the remaining reads are serviced via the forwarding network
instead)? Do they use banking?

Stefan


From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jul 17 21:38:17 2025

On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:

of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
same reason you can put any kind of data in the data cache !!! The
unified renamer ran out of registers a LOT less often than partitioned
renamer.

IIUC, access to a size-40 register file is about 1/2 a cycle to access
while a size-160 regfile will probably be a full cycle.

K9 was a high clock frequency design (5GHz in 65nm) so RF decode
was 1 cycle (mostly wire delay), RF read-select was 1 cycle (all
wire delay) and RF read-out was 1 cycle, 3/4 of it wire delay (16 ports).

Also note: We had to use Ampere's Law for wire propagation instead of
simple LRC--since edge speeds were faster than 3ps. {{All sorts of
stuff starts to break at these speeds.}}

I assume this is part of what makes the pipeline longer, and it also
makes the forwarding network more complex since there are more values
that can benefit from forwarding (i.e. where forwarding is needed to
avoid having to wait for write+read in the register file).

Forwarding 1 was mostly MUX delay and a bit of wire delay
Forwarding 2 was mostly wire delay after MUX.

But in return for that, fewer values get read from the regfile, so you
need fewer read ports, right?

This depends on exactly WHEN you know which registers you need to
read. We were using a value-free reservation station design.

OH and BTW the renamer was 22-renames per cycle.

How do they deal with the massive number of PRF writes per cycle?

There were 8 results per cycle (max) and a RoB to absorb the OoOness,
so we could block-write registers to the RF.

Do they try to "kill" those writes that can be determined to be useless
(because the remaining reads are serviced via the forwarding network
instead)?

Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
attempt this.

Do they use banking?

We tried almost everything we could think of and then some.

Stefan


From EricP@21:1/5 to All on Sun Jul 20 11:47:04 2025

MitchAlsup1 wrote:

On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:

of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
same reason you can put any kind of data in the data cache !!! The
unified renamer ran out of registers a LOT less often than partitioned
renamer.

IIUC, access to a size-40 register file is about 1/2 a cycle to access
while a size-160 regfile will probably be a full cycle.

K9 was a high clock frequency design (5GHz in 65nm) so RF decode
was 1 cycle (mostly wire delay), RF read-select was 1 cycle (all
wire delay) and RF read-out was 1 cycle, 3/4 of it wire delay (16 ports).

So a 3-stage pipeline for PRF reads or writes?
One of my basic design assumptions is that PRF R/W are 1 cycle.
I was pondering the consequences of dealing with this, trying to
maintain back-to-back scheduling, or minimize how badly it breaks.

Also note: We had to use Ampere's Law for wire propagation instead of
simple LRC--since edge speeds were faster than 3ps. {{All sorts of
stuff starts to break at these speeds.}}

I assume this is part of what makes the pipeline longer, and it also
makes the forwarding network more complex since there are more values
that can benefit from forwarding (i.e. where forwarding is needed to
avoid having to wait for write+read in the register file).

Forwarding 1 was mostly MUX delay and a bit of wire delay
Forwarding 2 was mostly wire delay after MUX.

But in return for that, fewer values get read from the regfile, so you
need fewer read ports, right?

This depends on exactly WHEN you know which registers you need to
read. We were using a value-free reservation station design.

OH and BTW the renamer was 22-renames per cycle.

How do they deal with the massive number of PRF writes per cycle?

There were 8 results per cycle (max) and a RoB to absorb the OoOness,
so we could block-write registers to the RF.

Do they try to "kill" those writes that can be determined to be useless
(because the remaining reads are serviced via the forwarding network
instead)?

Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
attempt this.

Do they use banking?

We tried almost everything we could think of and then some.

Stefan

Each of those PRF write pipeline stages can forward to its equivalent
or younger read stages. For 3 stages, 8R4W ports,
stage 3 write S3W forwards to S3R, S2R, S1R = 3*8*4 = 96
stage 2 S2W to S2R, S1R = 2*8*4 = 64
stage 1 S1W to S1R = 8*4 = 32
= 192 forwarding buses, plus tag comparators, muxes, etc.
just for dealing with the PRF internal forwarding network.
And if those are SIMD registers then those are all very wide buses.
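
The bus count above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic as stated in the post (write stage k forwards to read stages 1 through k, one bus per read-port/write-port pair); the function name is made up for the example.

```python
def forwarding_buses(stages, read_ports, write_ports):
    """Count buses when write stage k forwards to read stages 1..k."""
    per_stage = {}
    for write_stage in range(stages, 0, -1):
        reachable_read_stages = write_stage
        per_stage[write_stage] = reachable_read_stages * read_ports * write_ports
    return per_stage, sum(per_stage.values())


# The 3-stage, 8R4W case from the post.
per_stage, total = forwarding_buses(stages=3, read_ports=8, write_ports=4)
```

For 3 stages and 8R4W ports this gives 96, 64 and 32 buses for write stages 3, 2 and 1, for a total of 192. Under this counting rule the total grows as stages * (stages + 1) / 2, so a deeper PRF pipeline gets expensive quickly.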

I would look to do things that avoid using the PRF whenever possible,
like using valued Reservation Stations if they can be read in same cycle
rather than pulling from PRF so the FU schedulers don't have to deal
with the PRF pipeline latency (assuming that RS can be read in 1 cycle).

When operands come from the RS then there is no latency between the
scheduler picking a uOp and launching it for execution.

If operands come from the PRF then the FU schedulers have to issue
the register reads 3 cycles before they launch the uOp.
So each FU has a 3 cycle launch queue.
Except the PRF latency is variable if the value comes from its internal
pipeline forwarding network so this needs more thought.


From MitchAlsup1@21:1/5 to EricP on Sun Jul 20 17:34:26 2025

On Sun, 20 Jul 2025 15:47:04 +0000, EricP wrote:

MitchAlsup1 wrote:

On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:

of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
same reason you can put any kind of data in the data cache !!! The
unified renamer ran out of registers a LOT less often than partitioned
renamer.

IIUC, access to a size-40 register file is about 1/2 a cycle to access
while a size-160 regfile will probably be a full cycle.

K9 was a high clock frequency design (5GHz in 65nm) so RF decode
was 1 cycle (mostly wire delay), RF read-select was 1 cycle (all
wire delay) and RF read-out was 1 cycle, 3/4 of it wire delay (16 ports).

So a 3-stage pipeline for PRF reads or writes?

Yes, but remember, crossing the data path in wire without touching
any gate (except your own buffering) was 1 full clock of delay.
The RF was as wide as the Data path was tall, so it was essentially
3 clocks long:: horizontal into decode, vertical select lines,
then horizontal readout.

One of my basic design assumptions is that PRF R/W are 1 cycle.
I was pondering the consequences of dealing with this, trying to
maintain back-to-back scheduling, or minimize how badly it breaks.

Converting from a 16 gate/cycle machine into a 8 gate per cycle
machine causes 1 pipeline stage to become 2.5 pipeline stages.
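
The post does not say where the extra half stage comes from. One possible reading, and it is only an assumption here, is that every stage pays a fixed overhead (latch, skew, setup) out of its gate budget, so halving the budget more than halves the useful logic per stage. The Python sketch below works through that model; the function name and the overhead values are hypothetical.

```python
from fractions import Fraction


def stage_multiplier(old_gates, new_gates, overhead):
    """How many new stages one old stage becomes, given fixed per-stage overhead."""
    useful_old = old_gates - overhead   # gate delays of real logic per old stage
    useful_new = new_gates - overhead   # gate delays of real logic per new stage
    if useful_new <= 0:
        raise ValueError("overhead consumes the whole cycle")
    return Fraction(useful_old) / Fraction(useful_new)


no_overhead = stage_multiplier(16, 8, 0)
with_overhead = stage_multiplier(16, 8, Fraction(8, 3))
```

With zero overhead one stage simply becomes 2. Under this assumed model, a per-stage overhead of 8/3 (about 2.7) gate delays is what yields the quoted factor of 2.5.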

Also note: We had to use Ampere's Law for wire propagation instead of
simple LRC--since edge speeds were faster than 3ps. {{All sorts of
stuff starts to break at these speeds.}}

I assume this is part of what makes the pipeline longer, and it also
makes the forwarding network more complex since there are more values
that can benefit from forwarding (i.e. where forwarding is needed to
avoid having to wait for write+read in the register file).

Forwarding 1 was mostly MUX delay and a bit of wire delay
Forwarding 2 was mostly wire delay after MUX.

But in return for that, fewer values get read from the regfile, so you
need fewer read ports, right?

This depends on exactly WHEN you know which registers you need to
read. We were using a value-free reservation station design.

OH and BTW the renamer was 22-renames per cycle.

How do they deal with the massive number of PRF writes per cycle?

There were 8 results per cycle (max) and a RoB to absorb the OoOness,
so we could block-write registers to the RF.

Do they try to "kill" those writes that can be determined to be useless
(because the remaining reads are serviced via the forwarding network
instead)?

Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
attempt this.

Do they use banking?

We tried almost everything we could think of and then some.

Stefan

Each of those PRF write pipeline stages can forward to its equivalent
or younger read stages. For 3 stages, 8R4W ports,
stage 3 write S3W forwards to S3R, S2R, S1R = 3*8*4 = 96
stage 2 S2W to S2R, S1R = 2*8*4 = 64
stage 1 S1W to S1R = 8*4 = 32
= 192 forwarding buses, plus tag comparators, muxes, etc.
just for dealing with the PRF internal forwarding network.
And if those are SIMD registers then those are all very wide buses.

I would look to do things that avoid using the PRF whenever possible,
like using valued Reservation Stations if they can be read in same cycle
rather than pulling from PRF so the FU schedulers don't have to deal
with the PRF pipeline latency (assuming that RS can be read in 1 cycle).

When operands come from the RS then there is no latency between the
scheduler picking a uOp and launching it for execution.

The RSs are short (16 entry) so tag-operand readout is 1 cycle.

If operands come from the PRF then the FU schedulers have to issue
the register reads 3 cycles before they launch the uOp.
So each FU has a 3 cycle launch queue.

Yep, it gets nasty.

Except the PRF latency is variable if the value comes from its internal
pipeline forwarding network so this needs more thought.


From Stephen Fuld@21:1/5 to All on Sun Jul 20 22:27:27 2025

On 7/18/2025 8:29 AM, Stefan Monnier wrote:

snipped comments on the Mill.

I know, from the information on the Mill website, that they are making
slow progress, limited by money for people and patent applications.

https://millcomputing.com/topic/yearly-ping-and-see-how-things-are-going-thread/

But independent of that, I do miss Ivan's posts in this newsgroup, even
if they aren't about the Mill. I do hope he can find time to post at
least occasionally.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From John Savard@21:1/5 to Stephen Fuld on Mon Jul 21 15:45:14 2025

On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

But independent of that, I do miss Ivan's posts in this newsgroup, even
if they aren't about the Mill. I do hope he can find time to post at
least occasionally.

Although I agree, I am also satisfied as long as he is well and healthy.

If he can't waste time with USENET for now, that is all right with me.

But I am instead concerned if he is unable to find funding to make any
progress with the Mill, given that it appears to have been a very promising
project. That is much more important.

John Savard


From Stephen Fuld@21:1/5 to John Savard on Mon Jul 21 12:05:14 2025

On 7/21/2025 8:45 AM, John Savard wrote:

On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

But independent of that, I do miss Ivan's posts in this newsgroup, even
if they aren't about the Mill. I do hope he can find time to post at
least occasionally.

Although I agree, I am also satisfied as long as he is well and healthy.

If he can't waste time with USENET for now, that is all right with me.

But I am instead concerned if he is unable to find funding to make any
progress with the Mill, given that it appears to have been a very promising
project. That is much more important.

Based on the posts at the link I posted above, they are making progress,
albeit quite slowly. I understand the patents issue, as they require
real money. But I thought their model of doing work for a share of the
possible eventual profits, if any, would attract enough people to get
the work done. After all, there are lots of people who contribute to
many open source projects for no monetary return at all. And the Mill
needs only a few people. But apparently, I was wrong.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jul 21 19:56:16 2025

Stephen Fuld writes:

On 7/21/2025 8:45 AM, John Savard wrote:

On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

But independent of that, I do miss Ivan's posts in this newsgroup, even
if they aren't about the Mill. I do hope he can find time to post at
least occasionally.

Although I agree, I am also satisfied as long as he is well and healthy.

If he can't waste time with USENET for now, that is all right with me.

But I am instead concerned if he is unable to find funding to make any
progress with the Mill, given that it appears to have been a very promising
project. That is much more important.

Based on the posts at the link I posted above, they are making progress,
albeit quite slowly. I understand the patents issue, as they require
real money. But I thought their model of doing work for a share of the
possible eventual profits, if any, would attract enough people to get
the work done. After all, there are lots of people who contribute to
many open source projects for no monetary return at all. And the Mill
needs only a few people. But apparently, I was wrong.

It's easy to underestimate the resources required to bring a new
processor architecture to a point where it makes sense to build
a test chip. Then to optimize the design for the target node.

That's just the hardware side. Then there is the software infrastructure
(processor ABI, processor-specific library code, etc). Not to mention
marketing and Hot Chips.

Looking at the webpage, the belt seems to have some characteristics
in common with stack-based architectures, bringing to mind Burroughs
large systems and the HP-3000.


From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jul 21 22:02:28 2025

On 7/21/2025 12:56 PM, Scott Lurndal wrote:

Stephen Fuld writes:

On 7/21/2025 8:45 AM, John Savard wrote:

On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

But independent of that, I do miss Ivan's posts in this newsgroup, even
if they aren't about the Mill. I do hope he can find time to post at
least occasionally.

Although I agree, I am also satisfied as long as he is well and healthy.

If he can't waste time with USENET for now, that is all right with me.

But I am instead concerned if he is unable to find funding to make any
progress with the Mill, given that it appears to have been a very promising
project. That is much more important.

Based on the posts at the link I posted above, they are making progress,
albeit quite slowly. I understand the patents issue, as they require
real money. But I thought their model of doing work for a share of the
possible eventual profits, if any, would attract enough people to get
the work done. After all, there are lots of people who contribute to
many open source projects for no monetary return at all. And the Mill
needs only a few people. But apparently, I was wrong.

It's easy to underestimate the resources required to bring a new
processor architecture to a point where it makes sense to build
a test chip. Then to optimize the design for the target node.

I get the impression from the kind of people that they are looking for,
that they are concentrating on the software side. They are working on
Verilog, but more on porting SW tools.

That's just the hardware side. Then there is the software infrastructure
(processor ABI, processor-specific library code, etc).

Yes. I think that is what they are concentrating on now.

Not to mention
marketing and Hot Chips.

No real marketing effort yet.

Looking at the webpage, the belt seems to have some characteristics
in common with stack-based architectures, bringing to mind Burroughs
large systems and the HP-3000.

IMHO, only sort of. The Burroughs large systems are true stack based
systems, with a real HW stack, etc. While, if you squint enough, the
Mill has sort of a stack, it has enough differences to be a totally
different thing.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

