Forum: >>> Magnum BBS <<<

Banked register files

From Brett@21:1/5 to All on Mon Aug 19 21:46:07 2024

Banked register files, a mental exercise at expanding the register file.

With three operand RISC you have three 5 bit register specifiers
using 15 bits.

If instead you have eight banks of eight registers you have a 3 bit bank specifier and three 3 bit register specifiers for 12 bits.

Now the banks need to talk to each other and so you would add a bit to each register specifier to tell whether it uses the bank or the base registers,
for 72 registers total, not 64. So a 3 bit bank specifier and three 4 bit register specifiers for 15 bits, the same as a 32 register RISC chip.

Two operand plus 16 bit offset instructions would need to sacrifice one bit
of offset. Four operand instructions would save a bit.
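
The bit counts above can be checked with a few lines of Python. This is only a sketch: the function names and the choice to parameterize bank count and bank size are assumptions made for illustration, not part of the proposal.

```python
def flat_bits(operands: int, regs: int = 32) -> int:
    """Specifier bits for a flat register file (e.g. 32 regs -> 5 bits each)."""
    return operands * (regs - 1).bit_length()


def banked_bits(operands: int, banks: int = 8, regs_per_bank: int = 8,
                base_flag: bool = True) -> int:
    """One shared bank field plus one specifier per operand.

    base_flag adds one bit per specifier to select bank vs. base registers.
    """
    bank_field = (banks - 1).bit_length()
    spec = (regs_per_bank - 1).bit_length() + (1 if base_flag else 0)
    return bank_field + operands * spec


def total_registers(banks: int = 8, regs_per_bank: int = 8,
                    base_regs: int = 8) -> int:
    """Banked registers plus the separate base registers."""
    return banks * regs_per_bank + base_regs


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "operands:", flat_bits(n), "flat vs", banked_bits(n), "banked")
    print("registers:", total_registers())
```

With these defaults the three operand case comes out equal to the flat encoding, the two operand case costs one extra bit, and the four operand case saves one.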

As an extra bonus you now have another 3 bit field that could be another
source or destination if you are not using the bank register. But with only eight base registers it can look hard to pull off using 4 or 5 registers at once. But maybe not if most of the addressing is in the bank registers. The frame pointer would be in the base registers, as it loads the other
pointers.

The most general case for banked registers is loop unrolling. Eight
registers is not a lot so the first loop may use two banks, but now you
have 4 unrolls that are fairly trivial to set up.

Is this a good idea, maybe, maybe not. This is a mental exercise, it proves
I am mental. ;)

How does banked compare to high registers? Not as good.

Intel could pull off something like this to one up ARM. A new fixed width instruction set with a nice patent moat, and fits the x86 mindset.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Brett on Mon Aug 19 22:14:07 2024

On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

Banked register files, a mental exercise at expanding the register file.

With three operand RISC you have three 5 bit register
specifiers using 15 bits.

If instead you have eight banks of eight registers you have a 3 bit bank specifier and three 3 bit register specifiers for 12 bits.

Now the banks need to talk to each other and so you would add a bit to
each register specifier to tell whether it uses the bank or the base registers, for 72 registers total, not 64. So a 3 bit bank specifier
and three 4 bit register specifiers for 15 bits, the same as a 32
register RISC chip.

This covers 100% of instructions that smell like::

ADD R17,R17,R25

but covers 000% of the instructions that smell like::

ADD R7,R17,R25

I strongly suspect that it covers less than 50% of the 3-operand
instruction uses.
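
One way to make the coverage question concrete is to test whether all operands of an instruction fall in the same bank. The sketch below assumes a hypothetical mapping from flat register numbers to banks (bank = register number divided by bank size) and ignores the base-register override bits entirely. The bank size of 16 is chosen only so that R17 and R25 share a bank, which is a guess at how the example registers were meant to map.

```python
def same_bank(regs, bank_size=16):
    """True if every register number maps to one bank (hypothetical mapping)."""
    return len({r // bank_size for r in regs}) == 1


def coverage(instructions, bank_size=16):
    """Fraction of instructions whose operands all sit in a single bank."""
    if not instructions:
        return 0.0
    hits = sum(1 for regs in instructions if same_bank(regs, bank_size))
    return hits / len(instructions)


if __name__ == "__main__":
    print(same_bank((17, 17, 25)))   # ADD R17,R17,R25
    print(same_bank((7, 17, 25)))    # ADD R7,R17,R25
```

Running `coverage` over register triples taken from real compiler output would be the way to measure the suspected "less than 50%" figure; no such data is included here.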

Two operand plus 16 bit offset instructions would need to sacrifice one
bit of offset. Four operand instructions would save a bit.

As an extra bonus you now have another 3 bit field that could be another
source or destination if you are not using the bank register. But with
only eight base registers it can look hard to pull off using 4 or 5
registers at once. But maybe not if most of the addressing is in the bank
registers. The frame pointer would be in the base registers, as it loads
the other pointers.

I guess the real question at this point is how are the banks used when:

a) calling a subroutine
b) returning from a subroutine
c) calling a method
d) calling an external subroutine

e) dealing with {lower bound, upper bound, stride} for each dimension
of a multi-dimensional array

f) how does the scheme work when INT-RF != FP-RF ??

The most general case for banked registers is loop unrolling. Eight
registers is not a lot so the first loop may use two banks, but now you
have 4 unrolls that are fairly trivial to set up.

Unrolling has become less and less necessary with GBOoO implementations.
Producing fewer instructions to encode the whole loop is more important.

Is this a good idea, maybe, maybe not. This is a mental exercise, it
proves I am mental. ;)

Arguably less crazy than some other proposals.

Compiler people would have to be convinced to get on board as this
would disrupt their built-in idea that register files are monolithic.

How does banked compare to high registers? Not as good.

Intel could pull off something like this to one up ARM. A new fixed
width instruction set with a nice patent moat, and fits the x86 mindset.

What fits the x86 mindset is the::
MEM Rd,[Rbase+Rindex<<scale+LargeDisplacement]
address mode.


From Brett@21:1/5 to [email protected] on Mon Aug 19 23:23:11 2024

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

Banked register files, a mental exercise at expanding the register file.

With three operand RISC you have three 5 bit register
specifiers using 15 bits.

If instead you have eight banks of eight registers you have a 3 bit bank
specifier and three 3 bit register specifiers for 12 bits.

Now the banks need to talk to each other and so you would add a bit to
each register specifier to tell whether it uses the bank or the base
registers, for 72 registers total, not 64. So a 3 bit bank specifier
and three 4 bit register specifiers for 15 bits, the same as a 32
register RISC chip.

This covers 100% of instructions that smell like::

ADD R17,R17,R25

but covers 000% of the instructions that smell like::

ADD R7,R17,R25

I strongly suspect that it covers less than 50% of the 3-operand
instruction uses.

My description was bad. Let’s do an MC 68000 version: the base registers are mostly for addressing, and the banks are integer/float.

The compiler can handle this easily, simple dependency grouping and if you
need more than 8 registers you use the No base flag to total to the base registers. So you have two chains that total in two banks and both write to
the base registers where the last total of the two chains are added.

This saves a lot of no bank bits, only the result needs a no bank override.

Simple code only uses base registers, or base plus one bank.
Call and return parameters are in the base registers, spilling to a bank if
you need more.

Since only the result needs an override, you can do 4 banks of 16
registers.

Two operand plus 16 bit offset instructions would need to sacrifice one
bit of offset. Four operand instructions would save a bit.

As an extra bonus you now have another 3 bit field that could be another
source or destination if you are not using the bank register. But with
only eight base registers it can look hard to pull off using 4 or 5
registers at once. But maybe not if most of the addressing is in the bank
registers. The frame pointer would be in the base registers, as it loads
the other pointers.

I guess the real question at this point is how are the banks used when:

a) calling a subroutine
b) returning from a subroutine
c) calling a method
d) calling an external subroutine

e) dealing with {lower bound, upper bound, stride} for each dimension
of a multi-dimensional array

f) how does the scheme work when INT-RF != FP-RF ??

The most general case for banked registers is loop unrolling. Eight
registers is not a lot so the first loop may use two banks, but now you
have 4 unrolls that are fairly trivial to set up.

Unrolling has become less and less necessary with GBOoO implementations.
Producing fewer instructions to encode the whole loop is more important.

Is this a good idea, maybe, maybe not. This is a mental exercise, it
proves I am mental. ;)

Arguably less crazy than some other proposals.

Compiler people would have to be convinced to get on board as this
would disrupt their built-in idea that register files are monolithic.

How does banked compare to high registers? Not as good.

Intel could pull off something like this to one up ARM. A new fixed
width instruction set with a nice patent moat, and fits the x86 mindset.

What fits the x86 mindset is the::
MEM Rd,[Rbase+Rindex<<scale+LargeDisplacement]
address mode.


From MitchAlsup1@21:1/5 to Brett on Tue Aug 20 01:19:10 2024

On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

Banked register files, a mental exercise at expanding the register file.

With three operand RISC you have three 5 bit register
specifiers using 15 bits.

If instead you have eight banks of eight registers you have a 3 bit bank
specifier and three 3 bit register specifiers for 12 bits.

Now the banks need to talk to each other and so you would add a bit to
each register specifier to tell whether it uses the bank or the base
registers, for 72 registers total, not 64. So a 3 bit bank specifier
and three 4 bit register specifiers for 15 bits, the same as a 32
register RISC chip.

This covers 100% of instructions that smell like::

ADD R17,R17,R25

but covers 000% of the instructions that smell like::

ADD R7,R17,R25

I strongly suspect that it covers less than 50% of the 3-operand
instruction uses.

My description was bad. Let’s do an MC 68000 version: the base registers are mostly for addressing, and the banks are integer/float.

There is a reason this style fell out of fashion.
Not enough address registers at the same time one had not enough
data registers--whereas a flat 16-entry file had enough for either.

There is a reason CRAY-2 staging memory was not copied, too; and
it is mainly the same reason.

The compiler can handle this easily, simple dependency grouping and if
you need more than 8 registers you use the No base flag to total to the
base registers. So you have two chains that total in two banks and both
write to the base registers where the last total of the two chains are added.

This saves a lot of no bank bits, only the result needs a no bank
override.

Yes, 68K did have pretty good code density--{{Now if 68020 had NOT
gone and blown up the addressing modes...}}

Simple code only uses base registers, or base plus one bank.
Call and return parameters are in the base registers, spilling to a bank
if you need more.

Since only the result needs an override, you can do 4 banks of 16
registers.

Dream on.


From Brett@21:1/5 to [email protected] on Tue Aug 20 03:54:11 2024

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

Banked register files, a mental exercise at expanding the register file.

Four Banked register files, a mental exercise at expanding the register
file.

With three operand RISC you have three 5 bit register specifiers
using 15 bits.

If instead you have four banks of sixteen registers you have a 2 bit bank specifier and three 4 bit register specifiers with one override bit for the destination for 15 total bits, the same as a 32 register RISC chip. The override bit specifies bank zero for destination, 64 total registers.
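
A minimal sketch of how such a 15 bit operand field might be packed and unpacked follows. The field order, the names `Operands`, `pack`, `unpack` and `physical_regs`, and the flat numbering of the 64 registers are all assumptions made for illustration. The post only fixes the field widths and the meaning of the override bit.

```python
from typing import NamedTuple


class Operands(NamedTuple):
    bank: int          # 2 bit bank specifier, 0..3
    dst: int           # 4 bit destination specifier, 0..15
    src1: int          # 4 bit source specifier
    src2: int          # 4 bit source specifier
    dst_in_base: bool  # override bit: destination lives in bank zero


def pack(op: Operands) -> int:
    """Pack into 15 bits: [bank:2][override:1][dst:4][src1:4][src2:4]."""
    assert 0 <= op.bank < 4
    assert all(0 <= r < 16 for r in (op.dst, op.src1, op.src2))
    return ((op.bank << 13) | (int(op.dst_in_base) << 12)
            | (op.dst << 8) | (op.src1 << 4) | op.src2)


def unpack(bits: int) -> Operands:
    """Inverse of pack()."""
    return Operands(bank=(bits >> 13) & 0x3,
                    dst=(bits >> 8) & 0xF,
                    src1=(bits >> 4) & 0xF,
                    src2=bits & 0xF,
                    dst_in_base=bool((bits >> 12) & 1))


def physical_regs(op: Operands) -> tuple:
    """Map to flat register numbers 0..63 (bank * 16 + specifier)."""
    dst_bank = 0 if op.dst_in_base else op.bank
    return (dst_bank * 16 + op.dst,
            op.bank * 16 + op.src1,
            op.bank * 16 + op.src2)
```

Here both sources always come from the selected bank, and only the destination can be redirected to bank zero, which is how the two dependency chains described in the post would deliver their totals to the base registers.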

Two operand plus 16 bit offset instructions would need to sacrifice one bit
of offset. Four operand instructions would save two bits, quite useful.

Addressing can be done from any bank.

The compiler can handle large banks easily, simple dependency grouping and
if you need more than 16 registers for a single calculation you use the
Base override flag to total to the base registers. So you have two chains
that total in two banks and both write to the base registers where the last total of the two chains are added.

Call and return parameters are in the base registers.
Simple code only uses base registers, or base plus one bank.

Is this a good idea, I think so, but this is a mental exercise, it proves I
am mental. ;)

How does banked compare to high registers? Slightly better.
Intel could pull off something like this to one up ARM. A new fixed width instruction set with a nice patent moat, and fits the x86 mindset.

Yes you can do
Rd,[Rbase+Rindex<<scale+LargeDisplacement]
Large displacements would be in extension words like My 66000.
Nothing stops you from doing add from memory, besides being costly in
opcode bits and die size.


From mac@21:1/5 to All on Thu Aug 22 21:15:06 2024

The old DEC/Intel, later Netronome, IXP network processors had two banks of
128 registers each, and the source operands had to come from different
banks. Funky machine, lots of visible pipeline delays. Limited scratch
memory with 3-cycle latency and an explicit address register. DRAM was an
I/O device with asynchronous load/store.

Not for C programmers. An optimizing assembler handled register assignment
and filled pipeline delay slots.


From Brett@21:1/5 to Brett on Thu Aug 22 21:58:11 2024

Brett <[email protected]> wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

Banked register files, a mental exercise at expanding the register file.

Four Banked register files, a mental exercise at expanding the register
file.

With three operand RISC you have three 5 bit register specifiers using 15 bits.

If instead you have four banks of sixteen registers you have a 2 bit bank specifier and three 4 bit register specifiers with one override bit for the destination for 15 total bits, the same as a 32 register RISC chip. The override bit specifies bank zero for destination, 64 total registers.

Two operand plus 16 bit offset instructions would need to sacrifice one bit of offset. Four operand instructions would save two bits, quite useful.

Addressing can be done from any bank.

The compiler can handle large banks easily, simple dependency grouping and
if you need more than 16 registers for a single calculation you use the
Base override flag to total to the base registers. So you have two chains that total in two banks and both write to the base registers where the last total of the two chains are added.

Call and return parameters are in the base registers.
Simple code only uses base registers, or base plus one bank.

On the mental status, having multiple banks means you can have multiple ALU clusters and rename. You are no longer limited to 9 way rename and 12ish
way issue, but a multiple of that. The limits are load and store bandwidth,
and some added latency to coordinate. Lots of money would get piled into compilers to maximize even bank use.

Is this a good idea, I think so, but this is a mental exercise, it proves I am mental. ;)

How does banked compare to high registers? Slightly better.
Intel could pull off something like this to one up ARM. A new fixed width instruction set with a nice patent moat, and fits the x86 mindset.

Yes you can do
Rd,[Rbase+Rindex<<scale+LargeDisplacement]
Large displacements would be in extension words like My 66000.
Nothing stops you from doing add from memory, besides being costly in
opcode bits and die size.


From Brett@21:1/5 to Robert Finch on Sat Aug 24 17:27:47 2024

Robert Finch <[email protected]> wrote:

On 2024-08-22 5:58 p.m., Brett wrote:

Brett <[email protected]> wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

Banked register files, a mental exercise at expanding the register file.

Four Banked register files, a mental exercise at expanding the register
file.

With three operand RISC you have three 5 bit register specifiers
using 15 bits.

If instead you have four banks of sixteen registers you have a 2 bit bank
specifier and three 4 bit register specifiers with one override bit for the
destination for 15 total bits, the same as a 32 register RISC chip. The
override bit specifies bank zero for destination, 64 total registers.

Two operand plus 16 bit offset instructions would need to sacrifice one bit
of offset. Four operand instructions would save two bits, quite useful.

Addressing can be done from any bank.

The compiler can handle large banks easily, simple dependency grouping and
if you need more than 16 registers for a single calculation you use the
Base override flag to total to the base registers. So you have two chains
that total in two banks and both write to the base registers where the last
total of the two chains are added.

Call and return parameters are in the base registers.
Simple code only uses base registers, or base plus one bank.

On the mental status, having multiple banks means you can have multiple ALU
clusters and rename. You are no longer limited to 9 way rename and 12ish
way issue, but a multiple of that. The limits are load and store bandwidth,
and some added latency to coordinate. Lots of money would get piled into
compilers to maximize even bank use.

Is this a good idea, I think so, but this is a mental exercise, it proves I
am mental. ;)

How does banked compare to high registers? Slightly better.
Intel could pull off something like this to one up ARM. A new fixed width
instruction set with a nice patent moat, and fits the x86 mindset.

Yes you can do
Rd,[Rbase+Rindex<<scale+LargeDisplacement]
Large displacements would be in extension words like My 66000.
Nothing stops you from doing add from memory, besides being costly in
opcode bits and die size.

Using BRAMs usually allows for a lot more registers than make sense in
an architecture. Makes one wonder what to do with the extra registers.
The MOV instruction can be made to use more bits for the register spec allowing transfers between banks of registers. Since MOV needs only two register specs instead of three, there are more bits available.

FPGA Block Memory, had to look it up, not a hardware or embedded guy.

There is a popular embedded CPU with dual register files and dual
operations.
Not what I was going for, but there are possibilities there. On the high
end you crack the dual instructions and let them execute out of order in
the different bank pipes. This gives you 128 bit vector ops on a 64 bit cpu with multiple banks. This would be a completely different instruction set
from what I was proposing, but fits in the same encoding. You just need two types of load pair, etc.

Personally I would do the MIPS thing and make all registers 128 bits, but
this gives you 256 bit vectors of a sort with the banks.

I have been experimenting with the idea of having a smaller register
file so fewer encoding bits, and then making up for the small file by
having more dedicated registers. For instance, 16 regs with 2
independent link registers, eight condition code registers, and a stack pointer. That really gives over 20 registers, which might be enough for reasonable compiles.

I saw a design where there was an attempt to process basic blocks in
parallel silos feeding functional units. It made use of fewer registers
by holding data in pipeline registers instead of GPRs which it could do
since some of the data for a basic block never goes outside the block.


From Brett@21:1/5 to Brett on Mon Aug 26 21:10:48 2024

Brett <[email protected]> wrote:

Robert Finch <[email protected]> wrote:

On 2024-08-22 5:58 p.m., Brett wrote:

Brett <[email protected]> wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

Banked register files, a mental exercise at expanding the register file.

Four Banked register files, a mental exercise at expanding the register
file.

With three operand RISC you have three 5 bit register specifiers
using 15 bits.

If instead you have four banks of sixteen registers you have a 2 bit bank
specifier and three 4 bit register specifiers with one override bit for the
destination for 15 total bits, the same as a 32 register RISC chip. The
override bit specifies bank zero for destination, 64 total registers.

Two operand plus 16 bit offset instructions would need to sacrifice one bit
of offset. Four operand instructions would save two bits, quite useful.

Addressing can be done from any bank.

The compiler can handle large banks easily, simple dependency grouping and
if you need more than 16 registers for a single calculation you use the
Base override flag to total to the base registers. So you have two chains
that total in two banks and both write to the base registers where the last
total of the two chains are added.

Call and return parameters are in the base registers.
Simple code only uses base registers, or base plus one bank.

On the mental status, having multiple banks means you can have multiple ALU
clusters and rename. You are no longer limited to 9 way rename and 12ish
way issue, but a multiple of that. The limits are load and store bandwidth,
and some added latency to coordinate. Lots of money would get piled into
compilers to maximize even bank use.

Is this a good idea, I think so, but this is a mental exercise, it proves I
am mental. ;)

How does banked compare to high registers? Slightly better.
Intel could pull off something like this to one up ARM. A new fixed width
instruction set with a nice patent moat, and fits the x86 mindset.

Yes you can do
Rd,[Rbase+Rindex<<scale+LargeDisplacement]
Large displacements would be in extension words like My 66000.
Nothing stops you from doing add from memory, besides being costly in
opcode bits and die size.

Using BRAMs usually allows for a lot more registers than make sense in
an architecture. Makes one wonder what to do with the extra registers.
The MOV instruction can be made to use more bits for the register spec
allowing transfers between banks of registers. Since MOV needs only two
register specs instead of three, there are more bits available.

FPGA Block Memory, had to look it up, not a hardware or embedded guy.

There is a popular embedded CPU with dual register files and dual
operations.
Not what I was going for, but there are possibilities there. On the high
end you crack the dual instructions and let them execute out of order in
the different bank pipes. This gives you 128 bit vector ops on a 64 bit cpu with multiple banks. This would be a completely different instruction set from what I was proposing, but fits in the same encoding. You just need two types of load pair, etc.

Personally I would do the MIPS thing and make all registers 128 bits, but this gives you 256 bit vectors of a sort with the banks.

I have been experimenting with the idea of having a smaller register
file so fewer encoding bits, and then making up for the small file by
having more dedicated registers. For instance, 16 regs with 2
independent link registers, eight condition code registers, and a stack
pointer. That really gives over 20 registers, which might be enough for
reasonable compiles.

I saw a design where there was an attempt to process basic blocks in
parallel silos feeding functional units. It made use of fewer registers
by holding data in pipeline registers instead of GPRs which it could do
since some of the data for a basic block never goes outside the block.

No replies, so I figure y’all are under NDA. ;)

So I posted over on Real World Tech my prediction that Intel APX is not 32 general registers, but two separate banks of 16 registers with their own pipelines. ;)

Hilarious post. ;)


From MitchAlsup1@21:1/5 to Brett on Tue Aug 27 00:32:50 2024

On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

Brett <[email protected]> wrote:

Robert Finch <[email protected]> wrote:

On 2024-08-22 5:58 p.m., Brett wrote:

Brett <[email protected]> wrote:

MitchAlsup1 <[email protected]> wrote:

I saw a design where there was an attempt to process basic blocks in
parallel silos feeding functional units. It made use of fewer registers
by holding data in pipeline registers instead of GPRs which it could do
since some of the data for a basic block never goes outside the block.

No replies, so I figure y’all are under NDA. ;)

It has been well known since the mid 1990s that most loops end up with a single
or dual stream of self dependent instructions and few loop dependencies
{mostly the loop index itself}. This leads to instruction dependency
graphs (and execution times) that look like::

| LD |
| LD |
| FMUL |
| FADD |
| STA | | STD |
| ADD |
| CMP |
| BV |
------------------------------------------------------------
| LD |
| LD |
| FMUL |


From Brett@21:1/5 to [email protected] on Tue Aug 27 23:51:59 2024

MitchAlsup1 <[email protected]> wrote:

On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

Brett <[email protected]> wrote:

Robert Finch <[email protected]> wrote:

On 2024-08-22 5:58 p.m., Brett wrote:

Brett <[email protected]> wrote:

MitchAlsup1 <[email protected]> wrote:

I saw a design where there was an attempt to process basic blocks in
parallel silos feeding functional units. It made use of fewer registers
by holding data in pipeline registers instead of GPRs which it could do
since some of the data for a basic block never goes outside the block.

No replies, so I figure y’all are under NDA. ;)

It has been well known since the mid 1990s that most loops end up with a single
or dual stream of self dependent instructions and few loop dependencies
{mostly the loop index itself}. This leads to instruction dependency
graphs (and execution times) that look like::

| LD |
| LD |
| FMUL |
| FADD |
| STA | | STD |
| ADD |
| CMP |
| BV |
------------------------------------------------------------
| LD |
| LD |
| FMUL |

To even out the cluster load you would want the compiler to unroll once,
first bank second, then second bank first.

Can also be done without compiler by mapping the links on the second pass
of a loop.

I am assuming clusters or banks as naming and issue width continue growing.


From MitchAlsup1@21:1/5 to Brett on Wed Aug 28 16:44:39 2024

On Tue, 27 Aug 2024 23:51:59 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

Brett <[email protected]> wrote:

Robert Finch <[email protected]> wrote:

On 2024-08-22 5:58 p.m., Brett wrote:

Brett <[email protected]> wrote:

MitchAlsup1 <[email protected]> wrote:

I saw a design where there was an attempt to process basic blocks in
parallel silos feeding functional units. It made use of fewer registers
by holding data in pipeline registers instead of GPRs which it could do
since some of the data for a basic block never goes outside the block.

No replies, so I figure y’all are under NDA. ;)

It has been well known since the mid 1990s that most loops end up with a single
or dual stream of self dependent instructions and few loop dependencies
{mostly the loop index itself}. This leads to instruction dependency
graphs (and execution times) that look like::

| LD |
| LD |
| FMUL |
| FADD |
| STA | | STD |
| ADD |
| CMP |
| BV |
------------------------------------------------------------
| LD |
| LD |
| FMUL |

To even out the cluster load you would want the compiler to unroll once, first bank second, then second bank first.

Can also be done without compiler by mapping the links on the second
pass of a loop.

The above is done with simple reservation stations and no compiler work.

I am assuming clusters or banks as naming and issue width continue
growing.

Once you start doing reservation station machines, your 72-entry banked register file needs to have the RSs watch 72 results instead of just 32.


From Brett@21:1/5 to [email protected] on Thu Aug 29 23:31:56 2024

MitchAlsup1 <[email protected]> wrote:

On Tue, 27 Aug 2024 23:51:59 +0000, Brett wrote:

MitchAlsup1 <[email protected]> wrote:

On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

Brett <[email protected]> wrote:

Robert Finch <[email protected]> wrote:

On 2024-08-22 5:58 p.m., Brett wrote:

Brett <[email protected]> wrote:

MitchAlsup1 <[email protected]> wrote:

I saw a design where there was an attempt to process basic blocks in
parallel silos feeding functional units. It made use of fewer registers
by holding data in pipeline registers instead of GPRs which it could do
since some of the data for a basic block never goes outside the block.

No replies, so I figure y’all are under NDA. ;)

It has been well known since the mid 1990s that most loops end up with a single
or dual stream of self dependent instructions and few loop dependencies
{mostly the loop index itself}. This leads to instruction dependency
graphs (and execution times) that look like::

| LD |
| LD |
| FMUL |
| FADD |
| STA | | STD |
| ADD |
| CMP |
| BV |
------------------------------------------------------------
| LD |
| LD |
| FMUL |

To even out the cluster load you would want the compiler to unroll once,
first bank second, then second bank first.

Can also be done without compiler by mapping the links on the second
pass of a loop.

The above is done with simple reservation stations and no compiler work.

I am assuming clusters or banks as naming and issue width continue
growing.

Once you start doing reservation station machines, your 72-entry banked register file needs to have the RSs watch 72 results instead of just 32.

ALUs are cheap, so each bank has its own set.
You can forward and complete twice as many results.

The traditional problem of banking is a one cycle delay crossing banks. A compiler can fix that; a CPU cannot on first pass.

