Banked register files: a mental exercise in expanding the register file.
With a three-operand RISC you have three 5-bit register specifiers,
using 15 bits.
If instead you have eight banks of eight registers, you have a 3-bit bank
specifier and three 3-bit register specifiers, for 12 bits.
Now the banks need to talk to each other, so you would add a bit to
each register specifier to tell whether it uses the bank or the base
registers, for 72 registers total, not 64. So a 3-bit bank specifier
and three 4-bit register specifiers for 15 bits, the same as a 32
register RISC chip.
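The bit counting above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the post; the helper name `specifier_bits` and its parameters are made up for illustration and are not part of any real encoding.

```python
def specifier_bits(operands, reg_bits, bank_bits=0, flag_bits=0):
    """Register-specifier bits for one instruction format.

    operands  -- number of register operands in the instruction
    reg_bits  -- bits per register index
    bank_bits -- bits of the shared bank specifier (0 for a flat file)
    flag_bits -- per-operand bank/base select bits
    """
    return bank_bits + operands * (reg_bits + flag_bits)

# Flat 32-register RISC: three 5-bit specifiers.
flat_3op = specifier_bits(operands=3, reg_bits=5)

# Eight banks of eight, no way to reach the base registers.
banked_plain_3op = specifier_bits(operands=3, reg_bits=3, bank_bits=3)

# Eight banks of eight plus a bank/base bit on every operand.
banked_3op = specifier_bits(operands=3, reg_bits=3, bank_bits=3, flag_bits=1)

# Eight banks of eight registers plus eight base registers.
total_registers = 8 * 8 + 8

# Same comparison for two-operand and four-operand formats.
delta_2op = specifier_bits(2, 3, 3, 1) - specifier_bits(2, 5)  # extra bits needed
delta_4op = specifier_bits(4, 5) - specifier_bits(4, 3, 3, 1)  # bits saved

print(flat_3op, banked_plain_3op, banked_3op, total_registers)
print(delta_2op, delta_4op)
```

The two deltas correspond to the "sacrifice one bit of offset" and "save a bit" remarks for the two-operand and four-operand formats.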
Two-operand plus 16-bit offset instructions would need to sacrifice one
bit of offset. Four-operand instructions would save a bit.
As an extra bonus you now have another 3-bit field that could be another
source or destination if you are not using the bank register. But with
only eight base registers it can look hard to pull off using 4 or 5
registers at once. But maybe not if most of the addressing is in the
bank registers.
The frame pointer would be in the base registers, as it loads the other
pointers.
The most general case for banked registers is loop unrolling. Eight
registers is not a lot so the first loop may use two banks, but now you
have 4 unrolls that are fairly trivial to set up.
Is this a good idea? Maybe, maybe not. This is a mental exercise; it
proves I am mental. ;)
How does banked compare to high registers? Not as good.
Intel could pull off something like this to one up ARM. A new fixed
width instruction set with a nice patent moat, and fits the x86 mindset.
On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:
> Banked register files, a mental exercise at expanding the register file.
> With three operand RISC you have three 5 bit register
> specifiers using 15 bits.
> If instead you have eight banks of eight registers you have a 3 bit bank
> specifier and three 3 bit register specifiers for 12 bits.
> Now the banks need to talk to each other and so you would add a bit to
> each register specifier to tell whether it uses the bank or the base
> registers, for 72 registers total, not 64. So a 3 bit bank specifier
> and three 4 bit register specifiers for 15 bits, the same as a 32
> register RISC chip.
This covers 100% of instructions that smell like::
ADD R17,R17,R25
but covers 000% of the instructions that smell like::
ADD R7,R17,R25
I strongly suspect that it covers less than 50% of the 3-operand
instruction uses.
> Two operand plus 16 bit offset instructions would need to sacrifice one
> bit of offset. Four operand instructions would save a bit.
> As an extra bonus you now have another 3 bit field that could be another
> source or destination if you are not using the bank register. But with
> only eight base registers it can look hard to pull off using 4 or 5
> registers at once. But maybe not if most of the addressing is in the
> bank registers.
> The frame pointer would be in the base registers, as it loads the other
> pointers.
I guess the real question at this point is how are the banks used when:
a) calling a subroutine
b) returning from a subroutine
c) calling a method
d) calling an external subroutine
e) dealing with {lower bound, upper bound, stride} for each dimension
of a multi-dimensional array
f) how does the scheme work when INT-RF != FP-RF ??
> The most general case for banked registers is loop unrolling. Eight
> registers is not a lot so the first loop may use two banks, but now you
> have 4 unrolls that are fairly trivial to set up.
Unrolling has become less and less necessary with GBOoO implementations.
Producing fewer instructions to encode the whole loop is more important.
> Is this a good idea, maybe, maybe not. This is a mental exercise, it
> proves I am mental. ;)
Arguably less crazy than some other proposals.
Compiler people would have to be convinced to get on board as this
would disrupt their built-in idea that register files are monolithic.
> How does banked compare to high registers? Not as good.
> Intel could pull off something like this to one up ARM. A new fixed
> width instruction set with a nice patent moat, and fits the x86 mindset.
What fits the x86 mindset is the::
MEM Rd,[Rbase+Rindex<<scale+LargeDisplacement]
address mode.
MitchAlsup1 <[email protected]> wrote:
> On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:
>> Banked register files, a mental exercise at expanding the register file.
>> With three operand RISC you have three 5 bit register
>> specifiers using 15 bits.
>> If instead you have eight banks of eight registers you have a 3 bit bank
>> specifier and three 3 bit register specifiers for 12 bits.
>> Now the banks need to talk to each other and so you would add a bit to
>> each register specifier to tell whether it uses the bank or the base
>> registers, for 72 registers total, not 64. So a 3 bit bank specifier
>> and three 4 bit register specifiers for 15 bits, the same as a 32
>> register RISC chip.
> This covers 100% of instructions that smell like::
> ADD R17,R17,R25
> but covers 000% of the instructions that smell like::
> ADD R7,R17,R25
> I strongly suspect that it covers less than 50% of the 3-operand
> instruction uses.
My description was bad. Let’s do an MC 68000 version: the base registers
are mostly for addressing, and the banks are integer/float.
The compiler can handle this easily: simple dependency grouping, and if
you need more than 8 registers you use the No base flag to total to the
base registers. So you have two chains that total in two banks and both
write to the base registers, where the last totals of the two chains are
added.
This saves a lot of no bank bits, only the result needs a no bank
override.
Simple code only uses base registers, or base plus one bank.
Call and return parameters are in the base registers, spilling to a bank
if you need more.
Since only the result needs an override, you can do 4 banks of 16
registers.
MitchAlsup1 <[email protected]> wrote:
> On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:
>> MitchAlsup1 <[email protected]> wrote:
>>> On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:
>>>> Banked register files, a mental exercise at expanding the register file.
Four banked register files, a mental exercise at expanding the register
file.
With three operand RISC you have three 5 bit register specifiers using
15 bits.
If instead you have four banks of sixteen registers you have a 2 bit bank
specifier and three 4 bit register specifiers with one override bit for
the destination for 15 total bits, the same as a 32 register RISC chip.
The override bit specifies bank zero for destination, 64 total registers.
Two operand plus 16 bit offset instructions would need to sacrifice one
bit of offset. Four operand instructions would save two bits, quite
useful.
Addressing can be done from any bank.
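To make the four-bank format concrete, here is a rough Python sketch of how the 15 specifier bits might be packed and mapped to physical registers. The field order, the `Spec` type, and the function names are all assumptions made for illustration; the post only fixes the field widths (2-bit bank, three 4-bit registers, one destination override bit).

```python
from typing import NamedTuple


class Spec(NamedTuple):
    bank: int      # 2-bit bank specifier shared by all operands
    rd_base: bool  # override bit: destination goes to bank zero
    rd: int        # 4-bit destination index
    rs1: int       # 4-bit first source index
    rs2: int       # 4-bit second source index


def encode(s: Spec) -> int:
    """Pack the fields into 15 bits (hypothetical layout)."""
    assert 0 <= s.bank < 4
    assert all(0 <= r < 16 for r in (s.rd, s.rs1, s.rs2))
    return (s.bank << 13) | (int(s.rd_base) << 12) | (s.rd << 8) | (s.rs1 << 4) | s.rs2


def decode(word: int) -> Spec:
    """Inverse of encode()."""
    return Spec(
        bank=(word >> 13) & 0x3,
        rd_base=bool((word >> 12) & 0x1),
        rd=(word >> 8) & 0xF,
        rs1=(word >> 4) & 0xF,
        rs2=word & 0xF,
    )


def physical(s: Spec) -> tuple:
    """Map to flat register numbers 0..63; bank zero holds the base registers."""
    dest_bank = 0 if s.rd_base else s.bank
    return (dest_bank * 16 + s.rd, s.bank * 16 + s.rs1, s.bank * 16 + s.rs2)


example = Spec(bank=2, rd_base=True, rd=3, rs1=1, rs2=9)
word = encode(example)
print(word < (1 << 15), decode(word) == example, physical(example))
```

With the override bit set, the example reads two registers from bank 2 and writes its result into bank zero, which is the only cross-bank path this format offers.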
The compiler can handle large banks easily, simple dependency grouping and
if you need more than 16 registers for a single calculation you use the
Base override flag to total to the base registers. So you have two chains
that total in two banks and both write to the base registers where the
last totals of the two chains are added.
Call and return parameters are in the base registers.
Simple code only uses base registers, or base plus one bank.
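The "two chains that total in two banks" idea is the usual split reduction. A minimal Python sketch, assuming each accumulator would live in its own bank and the final add would write to a base register (the function name is made up):

```python
def two_chain_sum(values):
    """Sum a sequence using two independent dependency chains."""
    chain_a = 0  # accumulator that would sit in one bank
    chain_b = 0  # accumulator that would sit in another bank
    # Walk the input in pairs so the two chains never depend on each other.
    for i in range(0, len(values) - 1, 2):
        chain_a += values[i]
        chain_b += values[i + 1]
    # Odd-length input: fold the leftover element into the first chain.
    if len(values) % 2:
        chain_a += values[-1]
    # Final combine: the one result that would need the base override.
    return chain_a + chain_b


print(two_chain_sum([1, 2, 3, 4, 5]))
```

Only the last addition crosses banks, which is why a single override bit on the destination is enough for this pattern.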
Is this a good idea? I think so, but this is a mental exercise, it proves
I am mental. ;)
How does banked compare to high registers? Slightly better.
Intel could pull off something like this to one up ARM. A new fixed width
instruction set with a nice patent moat, and fits the x86 mindset.
Yes you can do
Rd,[Rbase+Rindex<<scale+LargeDisplacement]
Large displacements would be in extension words like My 66000.
Nothing stops you from doing add from memory, besides being costly in
opcode bits and die size.
On 2024-08-22 5:58 p.m., Brett wrote:
> Brett <[email protected]> wrote:
>> On the mental status, having multiple banks means you can have multiple
>> ALU clusters and rename. You are no longer limited to 9 way rename and
>> 12ish way issue, but a multiple of that. The limits are load and store
>> bandwidth, and some added latency to coordinate. Lots of money would get
>> piled into compilers to maximize even bank use.
Using BRAMs usually allows for a lot more registers than make sense in
an architecture. Makes one wonder what to do with the extra registers.
The MOV instruction can be made to use more bits for the register spec
allowing transfers between banks of registers. Since MOV needs only two
register specs instead of three, there are more bits available.
I have been experimenting with the idea of having a smaller register
file so fewer encoding bits, and then making up for the small file by
having more dedicated registers. For instance, 16 regs with 2
independent link registers, eight condition code registers, and a stack
pointer. That really gives over 20 registers, which might be enough for
reasonable compiles.
I saw a design where there was an attempt to process basic blocks in
parallel silos feeding functional units. It made use of fewer registers
by holding data in pipeline registers instead of GPRs which it could do
since some of the data for a basic block never goes outside the block.
Robert Finch <[email protected]> wrote:
> Using BRAMs usually allows for a lot more registers than make sense in
> an architecture. Makes one wonder what to do with the extra registers.
> The MOV instruction can be made to use more bits for the register spec
> allowing transfers between banks of registers. Since MOV needs only two
> register specs instead of three, there are more bits available.
FPGA Block Memory, had to look it up, not a hardware or embedded guy.
There is a popular embedded CPU with dual register files and dual
operations.
Not what I was going for, but there are possibilities there. On the high
end you crack the dual instructions and let them execute out of order in
the different bank pipes. This gives you 128 bit vector ops on a 64 bit
CPU with multiple banks. This would be a completely different instruction
set from what I was proposing, but fits in the same encoding. You just
need two types of load pair, etc.
Personally I would do the MIPS thing and make all registers 128 bits, but
this gives you 256 bit vectors of a sort with the banks.
Brett <[email protected]> wrote:
> Robert Finch <[email protected]> wrote:
>> I saw a design where there was an attempt to process basic blocks in
>> parallel silos feeding functional units. It made use of fewer registers
>> by holding data in pipeline registers instead of GPRs which it could do
>> since some of the data for a basic block never goes outside the block.
No replies, so I figure y’all are under NDA. ;)
On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:
> Brett <[email protected]> wrote:
>> Robert Finch <[email protected]> wrote:
>>> I saw a design where there was an attempt to process basic blocks in
>>> parallel silos feeding functional units. It made use of fewer registers
>>> by holding data in pipeline registers instead of GPRs which it could do
>>> since some of the data for a basic block never goes outside the block.
> No replies, so I figure y’all are under NDA. ;)
It has been well known since the mid 1990s that most loops end up with a
single or dual stream of self dependent instructions and few loop
dependencies {mostly the loop index itself}. This leads to instruction
dependency graphs (and execution times) that look like::
| LD |
| LD |
| FMUL |
| FADD |
| STA | | STD |
| ADD |
| CMP |
| BV |
------------------------------------------------------------
| LD |
| LD |
| FMUL |
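For readers who want a source-level picture, here is one loop body that would plausibly produce a graph of that shape. It is a guess: the post does not say which loop was meant, and the function name, the constant `k`, and the comment mapping to LD/FMUL/FADD/store are assumptions.

```python
def multiply_add_kernel(a, b, k):
    """out[i] = a[i] * b[i] + k, written out one operation per line."""
    out = [0.0] * len(a)
    i = 0
    while i < len(a):        # CMP and branch
        x = a[i]             # LD
        y = b[i]             # LD
        product = x * y      # FMUL, waits on both loads
        out[i] = product + k  # FADD, then the store
        i += 1               # ADD: the only loop-carried dependency
    return out


print(multiply_add_kernel([1.0, 2.0], [3.0, 4.0], 0.5))
```

Each iteration is a single chain of dependent operations, and only the index update links one iteration to the next, which is why successive iterations can overlap.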
MitchAlsup1 <[email protected]> wrote:
> It has been well known since mid 1990s that most loops end up with a
> single or dual stream of self dependent instructions and few loop
> dependencies {mostly the loop index itself}. This leads to instruction
> dependency graphs (and execution times) that look like::
> | LD |
> | LD |
> | FMUL |
> | FADD |
> | STA | | STD |
> | ADD |
> | CMP |
> | BV |
> ------------------------------------------------------------
> | LD |
> | LD |
> | FMUL |
To even out the cluster load you would want the compiler to unroll once, first bank second, then second bank first.
This can also be done without the compiler by mapping the links on the
second pass of a loop.
I am assuming clusters or banks as naming and issue width continue
growing.
On Tue, 27 Aug 2024 23:51:59 +0000, Brett wrote:
> To even out the cluster load you would want the compiler to unroll once,
> first bank second, then second bank first.
> Can also be done without compiler by mapping the links on the second
> pass of a loop.
The above is done with simple reservation stations and no compiler work.
> I am assuming clusters or banks as naming and issue width continue
> growing.
Once you start doing reservation station machines, your 72-entry banked register file needs to have the RSs watch 72 results instead of just 32.