• Register windows (was: The Third Wish)

    From Stefan Monnier@21:1/5 to All on Thu Jul 17 12:20:13 2025
    The only good arguments I have heard wrt big architectural register
    files has to do with things like Register-Windows and/or optimizing
    CALL/RET interface.

    But even there, it justifies only additional "second-class registers",
    i.e. where the set of immediately addressable registers can still be the
    same size as usual (e.g. 16 or 32), but you can quickly push some of
    those to some kind of "stack" and then pull them back in.
    IIRC the Mill had actually 2 categories of "second-class registers":
    the stack and the scratch registers.

    I think you can get similar benefits with "cache-line sized" memory
    operations that load/store several registers at a time (assuming you
    have good enough store-to-load forwarding). Or even fold those
    loads&stores into some kind of CALL/RET instructions, which can let you
    start the control-flow part of the CALL before the stores, and similarly
    start the loads before the control flow part of the RET is done.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jul 17 17:38:35 2025
    Stefan Monnier <[email protected]> writes:
    The only good arguments I have heard wrt big architectural register
    files has to do with things like Register-Windows and/or optimizing
    CALL/RET interface.

    But even there, it justifies only additional "second-class registers",
    i.e. where the set of immediately addressable registers can still be the
    same size as usual (e.g. 16 or 32), but you can quickly push some of
    those to some kind of "stack" and then pull them back in.

    Not efficiently. You would have to wait until the last instruction
    has written back its result, then make the switch, and only then start
    reading registers from instructions behind the SAVE/RESTORE
    instruction. Each SAVE and each RESTORE would cost several cycles
    even on an in-order machine. Not what the mechanism was designed for.

    I think you can get similar benefits with "cache-line sized" memory
    operations that load/store several registers at a time (assuming you
    have good enough store-to-load forwarding).

    ARM A64's load pair and store pair instructions.

    Or even fold those
    loads&stores into some kind of CALL/RET instructions, which can let you
    start the control-flow part of the CALL before the stores, and similarly
    start the loads before the control flow part of the RET is done.

    In an OoO machine with correct predictions (the usual case), control
    flow often runs far ahead of functional-unit processing and retirement
    (and only retirement is architectural execution). Any stores on the
    predicted control flow will be speculatively performed as soon as
    their source data is available, and the same goes for loads, with
    (non)aliases being predicted. Plus really modern machines often can
    achieve 0-cycle store-to-load forwarding. All of this makes
    mechanisms like register windows and IA-64's register stack
    unnecessary.
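
    As a rough illustration of the store-to-load forwarding argument, here
    is a toy model (not a description of any real core) in which registers
    spilled at a call are reloaded at the return straight from the store
    queue, without touching memory. The names StoreQueue and
    spill_and_reload, the 8-byte slot size, and the exact-address match are
    assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    addr: int   # byte address of the store (assumed 8-byte aligned)
    value: int  # data still waiting to be written to the cache


class StoreQueue:
    """Toy store queue: oldest entry first, youngest last."""

    def __init__(self):
        self.entries = []

    def store(self, addr, value):
        self.entries.append(StoreEntry(addr, value))

    def load(self, addr, memory):
        """Return (value, forwarded); the youngest matching store wins."""
        for entry in reversed(self.entries):
            if entry.addr == addr:
                return entry.value, True
        return memory.get(addr, 0), False


def spill_and_reload(regs, sp):
    """Spill registers at a call, reload them at the return.

    Returns the reloaded values and how many loads were forwarded.
    """
    sq = StoreQueue()
    memory = {}  # nothing has drained to memory yet in this model
    for i, value in enumerate(regs):
        sq.store(sp + 8 * i, value)
    reloaded = []
    forwarded = 0
    for i in range(len(regs)):
        value, hit = sq.load(sp + 8 * i, memory)
        reloaded.append(value)
        forwarded += int(hit)
    return reloaded, forwarded
```

    In this model every reload hits the store queue, which is the case the
    post describes for short call/return sequences; a real machine also has
    to predict aliasing and handle partial overlaps, which are left out.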

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Jul 17 19:17:43 2025
    [email protected] (Anton Ertl) writes:
    Stefan Monnier <[email protected]> writes:
    The only good arguments I have heard wrt big architectural register
    files has to do with things like Register-Windows and/or optimizing
    CALL/RET interface.

    But even there, it justifies only additional "second-class registers",
    i.e. where the set of immediately addressable registers can still be the
    same size as usual (e.g. 16 or 32), but you can quickly push some of
    those to some kind of "stack" and then pull them back in.

    Not efficiently. You would have to wait until the last instruction
    has written back its result, then make the switch, and only then start
    reading registers from instructions behind the SAVE/RESTORE
    instruction. Each SAVE and each RESTORE would cost several cycles
    even on an in-order machine. Not what the mechanism was designed for.

    I think you can get similar benefits with "cache-line sized" memory
    operations that load/store several registers at a time (assuming you
    have good enough store-to-load forwarding).

    ARM A64's load pair and store pair instructions.

    ARM A64 has (optional) 64-byte load/store instructions
    (LD64B/ST64B), which store/load an entire cache line using 8 GPRs.
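
    The "8 GPRs per cache line" figure is just 8 registers x 8 bytes = 64
    bytes. A small sketch of that data layout follows; it models only the
    packing, not the instruction semantics, and the function names and the
    little-endian byte order are assumptions of the sketch.

```python
import struct

CACHE_LINE_BYTES = 64
GPR_BYTES = 8
REGS_PER_LINE = CACHE_LINE_BYTES // GPR_BYTES  # 8 registers


def store_line(regs):
    """Pack 8 64-bit register values into one 64-byte line."""
    if len(regs) != REGS_PER_LINE:
        raise ValueError("expected exactly 8 register values")
    return struct.pack("<8Q", *regs)


def load_line(line):
    """Unpack one 64-byte line back into 8 64-bit register values."""
    if len(line) != CACHE_LINE_BYTES:
        raise ValueError("expected exactly 64 bytes")
    return list(struct.unpack("<8Q", line))
```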

  • From Stefan Monnier@21:1/5 to All on Thu Jul 17 16:34:13 2025
    of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
    40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
    same reason you can put any kind of data in the data cache !!! The
    unified renamer ran out of registers a LOT less often than partitioned
    renamer.

    IIUC, access to a size-40 register file is about 1/2 a cycle to access
    while a size-160 regfile will probably be a full cycle.
    I assume this is part of what makes the pipeline longer, and it also
    makes the forwarding network more complex since there are more values
    that can benefit from forwarding (i.e. where forwarding is needed to
    avoid having to wait for write+read in the register file).
    But in return for that, fewer values get read from the regfile, so you
    need fewer read ports, right?

    How do they deal with the massive number of PRF writes per cycle?
    Do they try to "kill" those writes that can be determined to be useless
    (because the remaining reads are serviced via the forwarding network
    instead)? Do they use banking?


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jul 17 21:38:17 2025
    On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:

    of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
    40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
    same reason you can put any kind of data in the data cache !!! The
    unified renamer ran out of registers a LOT less often than partitioned
    renamer.

    IIUC, access to a size-40 register file is about 1/2 a cycle to access
    while a size-160 regfile will probably be a full cycle.

    K9 was a high clock frequency design (5GHz in 65nm) so RF decode
    was 1 cycle (mostly wire delay), RF read-select was 1-cycle (all
    wire delay) and RF read-out was 1 cycle 3/4 wire delay (16 ports).

    Also note: We had to use Ampere's Law for wire propagation instead of
    simple LRC--since edge speeds were faster than 3ps. {{All sorts of
    stuff starts to break at these speeds.}}

    I assume this is part of what makes the pipeline longer, and it also
    makes the forwarding network more complex since there are more values
    that can benefit from forwarding (i.e. where forwarding is needed to
    avoid having to wait for write+read in the register file).

    Forwarding 1 was mostly MUX delay and a bit of wire delay
    Forwarding 2 was mostly wire delay after MUX.

    But in return for that, fewer values get read from the regfile, so you
    need fewer read ports, right?

    This depends on exactly WHEN you know which registers you need to
    read. We were using a value-free reservation station design.

    OH and BTW the renamer was 22-renames per cycle.

    How do they deal with the massive number of PRF writes per cycle?

    There were 8 results per cycle (max) and a RoB to absorb the OoOness,
    so we could block-write registers to the RF.

    Do they try to "kill" those writes that can be determined to be useless
    (because the remaining reads are serviced via the forwarding network
    instead)?

    Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
    attempt this.

    Do they use banking?

    We tried almost everything we could think of and then some.


    Stefan

  • From EricP@21:1/5 to All on Sun Jul 20 11:47:04 2025
    MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:

    of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
    40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
    same reason you can put any kind of data in the data cache !!! The
    unified renamer ran out of registers a LOT less often than partitioned
    renamer.

    IIUC, access to a size-40 register file is about 1/2 a cycle to access
    while a size-160 regfile will probably be a full cycle.

    K9 was a high clock frequency design (5GHz in 65nm) so RF decode
    was 1 cycle (mostly wire delay), RF read-select was 1-cycle (all
    wire delay) and RF read-out was 1 cycle 3/4 wire delay (16 ports).

    So a 3-stage pipeline for PRF reads or writes?
    One of my basic design assumptions is that PRF R/W are 1 cycle.
    I was pondering the consequences of dealing with this, trying to
    maintain back-to-back scheduling, or minimize how badly it breaks.

    Also note: We had to use Ampere's Law for wire propagation instead of
    simple LRC--since edge speeds were faster than 3ps. {{All sorts of
    stuff starts to break at these speeds.}}

    I assume this is part of what makes the pipeline longer, and it also
    makes the forwarding network more complex since there are more values
    that can benefit from forwarding (i.e. where forwarding is needed to
    avoid having to wait for write+read in the register file).

    Forwarding 1 was mostly MUX delay and a bit of wire delay
    Forwarding 2 was mostly wire delay after MUX.

    But in return for that, fewer values get read from the regfile, so you
    need fewer read ports, right?

    This depends on exactly WHEN you know which registers you need to
    read. We were using a value-free reservation station design.

    OH and BTW the renamer was 22-renames per cycle.

    How do they deal with the massive number of PRF writes per cycle?

    There were 8 results per cycle (max) and a RoB to absorb the OoOness,
    so we could block-write registers to the RF.

    Do they try to "kill" those writes that can be determined to be useless
    (because the remaining reads are serviced via the forwarding network
    instead)?

    Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
    attempt this.

    Do they use banking?

    We tried almost everything we could think of and then some.


    Stefan

    Each of those PRF write pipeline stages can forward to its equivalent
    or younger read stages. For 3 stages, 8R4W ports,
    stage 3 write S3W forwards to S3R, S2R, S1R = 3*8*4 = 96
    stage 2 S2W to S2R, S1R = 2*8*4 = 64
    stage 1 S1W to S1R = 8*4 = 32
    = 192 forwarding buses, plus tag comparators, muxes, etc.
    just for dealing with the PRF internal forwarding network.
    And if those are SIMD registers then those are all very wide buses.
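
    The bus count above can be reproduced with a few lines of Python. The
    function name and parameters are made up for this sketch; the counting
    rule (write stage k forwards to read stages k down to 1, one bus per
    write-port/read-port pair) is the one stated in the list above.

```python
def forwarding_buses(stages, read_ports, write_ports):
    """Buses per write stage, listed from the last write stage down to stage 1.

    Write stage k forwards to read stages k, k-1, ..., 1 (k read stages),
    and each (write port, read port) pair needs its own bus.
    """
    return [k * read_ports * write_ports for k in range(stages, 0, -1)]


per_stage = forwarding_buses(stages=3, read_ports=8, write_ports=4)
print(per_stage, sum(per_stage))  # prints: [96, 64, 32] 192
```

    In closed form this is R * W * n(n+1)/2, so the bus count grows
    quadratically with the number of PRF pipeline stages n.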

    I would look to do things that avoid using the PRF whenever possible,
    like using valued Reservation Stations if they can be read in same cycle
    rather than pulling from PRF so the FU schedulers don't have to deal
    with the PRF pipeline latency (assuming that RS can be read in 1 cycle).

    When operands come from the RS then there is no latency between the
    scheduler picking a uOp and launching it for execution.

    If operands come from the PRF then the FU schedulers have to issue
    the register reads 3 cycles before they launch the uOp.
    So each FU has a 3 cycle launch queue.
    Except the PRF latency is variable if the value comes from its internal
    pipeline forwarding network so this needs more thought.
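
    A minimal sketch of the fixed-latency case, assuming one uOp picked per
    cycle per FU and a constant 3-cycle PRF read: the launch queue behaves
    like a 3-slot delay line. The function name and the example uOp names
    are hypothetical, and the variable-latency forwarding case mentioned
    above is deliberately not modeled.

```python
from collections import deque


def simulate_launch_queue(picks, latency=3):
    """picks[c] is the uOp the scheduler picks in cycle c (or None).

    The register read is issued in the pick cycle and the uOp launches
    `latency` cycles later. Returns a list of (launch_cycle, uop) pairs.
    """
    queue = deque([None] * latency)  # one slot per PRF read stage
    launched = []
    # Keep clocking after the last pick until the queue has drained.
    for cycle, pick in enumerate(list(picks) + [None] * latency):
        ready = queue.popleft()      # uOp whose operands arrive this cycle
        if ready is not None:
            launched.append((cycle, ready))
        queue.append(pick)           # register read issued this cycle
    return launched


# Picks in cycles 0, 1 and 3 launch in cycles 3, 4 and 6.
print(simulate_launch_queue(["add", "mul", None, "ld"]))
```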

  • From MitchAlsup1@21:1/5 to EricP on Sun Jul 20 17:34:26 2025
    On Sun, 20 Jul 2025 15:47:04 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 20:34:13 +0000, Stefan Monnier wrote:

    of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
    40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
    same reason you can put any kind of data in the data cache !!! The
    unified renamer ran out of registers a LOT less often than partitioned
    renamer.

    IIUC, access to a size-40 register file is about 1/2 a cycle to access
    while a size-160 regfile will probably be a full cycle.

    K9 was a high clock frequency design (5GHz in 65nm) so RF decode
    was 1 cycle (mostly wire delay), RF read-select was 1-cycle (all
    wire delay) and RF read-out was 1 cycle 3/4 wire delay (16 ports).

    So a 3-stage pipeline for PRF reads or writes?

    Yes, but remember, crossing the data path in wire without touching
    any gate (except your own buffering) was 1 full clock of delay.
    The RF was as wide as the Data path was tall, so it was essentially
    3 clocks long:: horizontal into decode, vertical select lines,
    then horizontal readout.

    One of my basic design assumptions is that PRF R/W are 1 cycle.
    I was pondering the consequences of dealing with this, trying to
    maintain back-to-back scheduling, or minimize how badly it breaks.

    Converting from a 16 gate/cycle machine into a 8 gate per cycle
    machine causes 1 pipeline stage to become 2.5 pipeline stages.
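
    The post does not say where the 2.5 comes from. One possible reading,
    offered here purely as an assumption, is that each stage loses a fixed
    number of gate delays to latching and clock skew, so halving the cycle
    more than halves the useful logic per stage. The overhead values below
    are hypothetical and are not figures from the K9 design.

```python
def stage_multiplier(old_gates, new_gates, overhead):
    """How many new stages are needed to replace one old stage.

    Each stage spends `overhead` gate delays on latching and skew
    (a hypothetical parameter), leaving the rest for useful logic.
    """
    if new_gates <= overhead:
        raise ValueError("no time left for logic in the faster stage")
    return (old_gates - overhead) / (new_gates - overhead)


for overhead in (0.0, 2.0, 8 / 3, 3.0):
    ratio = stage_multiplier(16, 8, overhead)
    print(f"overhead={overhead:.2f} gates -> x{ratio:.2f} stages")
```

    With no overhead the ratio is exactly 2; under this model an overhead
    of roughly 2.7 gate delays per stage gives the 2.5 quoted above.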

    Also note: We had to use Ampere's Law for wire propagation instead of
    simple LRC--since edge speeds were faster than 3ps. {{All sorts of
    stuff starts to break at these speeds.}}

    I assume this is part of what makes the pipeline longer, and it also
    makes the forwarding network more complex since there are more values
    that can benefit from forwarding (i.e. where forwarding is needed to
    avoid having to wait for write+read in the register file).

    Forwarding 1 was mostly MUX delay and a bit of wire delay
    Forwarding 2 was mostly wire delay after MUX.

    But in return for that, fewer values get read from the regfile, so you
    need fewer read ports, right?

    This depends on exactly WHEN you know which registers you need to
    read. We were using a value-free reservation station design.

    OH and BTW the renamer was 22-renames per cycle.

    How do they deal with the massive number of PRF writes per cycle?

    There were 8 results per cycle (max) and a RoB to absorb the OoOness,
    so we could block-write registers to the RF.

    Do they try to "kill" those writes that can be determined to be useless
    (because the remaining reads are serviced via the forwarding network
    instead)?

    Not at 8 gates per cycle. Maybe at 13-14 gates per cycle you could
    attempt this.

    Do they use banking?

    We tried almost everything we could think of and then some.


    Stefan

    Each of those PRF write pipeline stages can forward to its equivalent
    or younger read stages. For 3 stages, 8R4W ports,
    stage 3 write S3W forwards to S3R, S2R, S1R = 3*8*4 = 96
    stage 2 S2W to S2R, S1R = 2*8*4 = 64
    stage 1 S1W to S1R = 8*4 = 32
    = 192 forwarding buses, plus tag comparators, muxes, etc.
    just for dealing with the PRF internal forwarding network.
    And if those are SIMD registers then those are all very wide buses.

    I would look to do things that avoid using the PRF whenever possible,
    like using valued Reservation Stations if they can be read in same cycle
    rather than pulling from PRF so the FU schedulers don't have to deal
    with the PRF pipeline latency (assuming that RS can be read in 1 cycle).

    When operands come from the RS then there is no latency between the
    scheduler picking a uOp and launching it for execution.

    The RSs are short (16 entry) so tag-operand readout is 1 cycle.

    If operands come from the PRF then the FU schedulers have to issue
    the register reads 3 cycles before they launch the uOp.
    So each FU has a 3 cycle launch queue.

    Yep, it gets nasty.

    Except the PRF latency is variable if the value comes from its internal
    pipeline forwarding network so this needs more thought.

  • From Stephen Fuld@21:1/5 to All on Sun Jul 20 22:27:27 2025
    On 7/18/2025 8:29 AM, Stefan Monnier wrote:

    snipped comments on the Mill.

    I know, from the information on the Mill website, that they are making
    slow progress, limited by money for people and patent applications.

    https://millcomputing.com/topic/yearly-ping-and-see-how-things-are-going-thread/

    But independent of that, I do miss Ivan's posts in this newsgroup, even
    if they aren't about the Mill. I do hope he can find time to post at
    least occasionally.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From John Savard@21:1/5 to Stephen Fuld on Mon Jul 21 15:45:14 2025
    On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

    But independent of that, I do miss Ivan's posts in this newsgroup, even
    if they aren't about the Mill. I do hope he can find time to post at
    least occasionally.

    Although I agree, I am also satisfied as long as he is well and healthy.

    If he can't waste time with USENET for now, that is all right with me.

    But I am instead concerned if he is unable to find funding to make any
    progress with the Mill, given that it appears to have been a very promising
    project. That is much more important.

    John Savard

  • From Stephen Fuld@21:1/5 to John Savard on Mon Jul 21 12:05:14 2025
    On 7/21/2025 8:45 AM, John Savard wrote:
    On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

    But independent of that, I do miss Ivan's posts in this newsgroup, even
    if they aren't about the Mill. I do hope he can find time to post at
    least occasionally.

    Although I agree, I am also satisfied as long as he is well and healthy.

    If he can't waste time with USENET for now, that is all right with me.

    But I am instead concerned if he is unable to find funding to make any
    progress with the Mill, given that it appears to have been a very promising
    project. That is much more important.

    Based on the posts at the link I posted above, they are making progress,
    albeit quite slowly. I understand the patents issue, as they require
    real money. But I thought their model of doing work for a share of the
    possible eventual profits, if any, would attract enough people to get
    the work done. After all, there are lots of people who contribute to
    many open source projects for no monetary return at all. And the Mill
    needs only a few people. But apparently, I was wrong.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jul 21 19:56:16 2025
    Stephen Fuld <[email protected]d> writes:
    On 7/21/2025 8:45 AM, John Savard wrote:
    On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

    But independent of that, I do miss Ivan's posts in this newsgroup, even
    if they aren't about the Mill. I do hope he can find time to post at
    least occasionally.

    Although I agree, I am also satisfied as long as he is well and healthy.

    If he can't waste time with USENET for now, that is all right with me.

    But I am instead concerned if he is unable to find funding to make any
    progress with the Mill, given that it appears to have been a very promising
    project. That is much more important.

    Based on the posts at the link I posted above, they are making progress,
    albeit quite slowly. I understand the patents issue, as they require
    real money. But I thought their model of doing work for a share of the
    possible eventual profits, if any, would attract enough people to get
    the work done. After all, there are lots of people who contribute to
    many open source projects for no monetary return at all. And the Mill
    needs only a few people. But apparently, I was wrong.

    It's easy to underestimate the resources required to bring a new
    processor architecture to a point where it makes sense to build
    a test chip. Then to optimize the design for the target node.

    That's just the hardware side. Then there is the software infrastructure
    (processor ABI, processor-specific library code, etc). Not to mention
    marketing and hotchips.

    Looking at the webpage, the belt seems to have some characteristics
    in common with stack-based architectures, bringing to mind Burroughs
    large systems and the HP-3000.

  • From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jul 21 22:02:28 2025
    On 7/21/2025 12:56 PM, Scott Lurndal wrote:
    Stephen Fuld <[email protected]d> writes:
    On 7/21/2025 8:45 AM, John Savard wrote:
    On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

    But independent of that, I do miss Ivan's posts in this newsgroup, even
    if they aren't about the Mill. I do hope he can find time to post at
    least occasionally.

    Although I agree, I am also satisfied as long as he is well and healthy.

    If he can't waste time with USENET for now, that is all right with me.

    But I am instead concerned if he is unable to find funding to make any
    progress with the Mill, given that it appears to have been a very promising
    project. That is much more important.

    Based on the posts at the link I posted above, they are making progress,
    albeit quite slowly. I understand the patents issue, as they require
    real money. But I thought their model of doing work for a share of the
    possible eventual profits, if any, would attract enough people to get
    the work done. After all, there are lots of people who contribute to
    many open source projects for no monetary return at all. And the Mill
    needs only a few people. But apparently, I was wrong.

    It's easy to underestimate the resources required to bring a new
    processor architecture to a point where it makes sense to build
    a test chip. Then to optimize the design for the target node.

    I get the impression from the kind of people that they are looking for,
    that they are concentrating on the software side. They are working on
    Verilog, but more on porting SW tools.



    That's just the hardware side. Then there is the software infrastructure
    (processor ABI, processor-specific library code, etc).

    Yes. I think that is what they are concentrating on now.

    Not to mention
    marketing and hotchips.

    No real marketing effort yet.



    Looking at the webpage, the belt seems to have some characteristics
    in common with stack-based architectures, bringing to mind Burroughs
    large systems and the HP-3000.

    IMHO, only sort of. The Burroughs large systems are true stack based
    systems, with a real HW stack, etc. While, if you squint enough, the
    Mill has sort of a stack, it has enough differences to be a totally
    different thing.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
