• Banked register files

    From Brett@21:1/5 to All on Mon Aug 19 21:46:07 2024
    Banked register files, a mental exercise at expanding the register file.

    With three operand RISC you have three 5 bit register specifiers
    using 15 bits.

    If instead you have eight banks of eight registers you have a 3 bit bank specifier and three 3 bit register specifiers for 12 bits.

    Now the banks need to talk to each other and so you would add a bit to each register specifier to tell whether it uses the bank or the base registers,
    for 72 registers total, not 64. So a 3 bit bank specifier and three 4 bit register specifiers for 15 bits, the same as a 32 register RISC chip.

    Two operand plus 16 bit offset instructions would need to sacrifice one bit
    of offset. Four operand instructions would save a bit.
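
    The bit counts above can be checked with a short sketch. The function
    names are made up for illustration, and it assumes eight base registers
    alongside the eight banks of eight, which is how the 72 comes out.

```python
# Hypothetical helpers; only the bit counts come from the post above.

def flat_bits(operands, regs=32):
    """Specifier bits for a flat register file (e.g. 32-register RISC)."""
    bits_per_spec = (regs - 1).bit_length()
    return operands * bits_per_spec


def banked_bits(operands, banks=8, regs_per_bank=8):
    """One shared bank field, plus one register field per operand.
    Each register field carries an extra bank-or-base select bit."""
    bank_field = (banks - 1).bit_length()
    reg_field = (regs_per_bank - 1).bit_length() + 1
    return bank_field + operands * reg_field


def total_registers(banks=8, regs_per_bank=8):
    """Banked registers plus one extra set of base registers (assumed
    to be the same size as one bank)."""
    return banks * regs_per_bank + regs_per_bank


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "operands: flat", flat_bits(n), "banked", banked_bits(n))
    print("registers:", total_registers())
```

    Two operands cost one bit more than the flat encoding, three operands
    break even, and four operands save one bit.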

    As an extra bonus you now have another 3 bit field that could be another
    source or destination if you are not using the bank register. But with only eight base registers it can look hard to pull off using 4 or 5 registers at once. But maybe not if most of the addressing is in the bank registers. The frame pointer would be in the base registers, as it loads the other
    pointers.

    The most general case for banked registers is loop unrolling. Eight
    registers is not a lot so the first loop may use two banks, but now you
    have 4 unrolls that are fairly trivial to set up.

    Is this a good idea, maybe, maybe not. This is a mental exercise, it proves
    I am mental. ;)

    How does banked compare to high registers? Not as good.

    Intel could pull off something like this to one up ARM. A new fixed width instruction set with a nice patent moat, and fits the x86 mindset.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Mon Aug 19 22:14:07 2024
    On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

    Banked register files, a mental exercise at expanding the register file.

    With three operand RISC you have three 5 bit register
    specifiers using 15 bits.

    If instead you have eight banks of eight registers you have a 3 bit bank specifier and three 3 bit register specifiers for 12 bits.

    Now the banks need to talk to each other and so you would add a bit to
    each register specifier to tell whether it uses the bank or the base registers, for 72 registers total, not 64. So a 3 bit bank specifier
    and three 4 bit register specifiers for 15 bits, the same as a 32
    register RISC chip.

    This covers 100% of instructions that smell like::

    ADD R17,R17,R25

    but covers 000% of the instructions that smell like::

    ADD R7,R17,R25

    I strongly suspect that it covers less than 50% of the 3-operand
    instruction uses.

    Two operand plus 16 bit offset instructions would need to sacrifice one
    bit of offset. Four operand instructions would save a bit.

    As an extra bonus you now have another 3 bit field that could be another
    source or destination if you are not using the bank register. But with
    only eight base registers it can look hard to pull off using 4 or 5
    registers at once. But maybe not if most of the addressing is in the bank
    registers. The frame pointer would be in the base registers, as it loads
    the other pointers.

    I guess the real question at this point is how are the banks used when:

    a) calling a subroutine
    b) returning from a subroutine
    c) calling a method
    d) calling an external subroutine

    e) dealing with {lower bound, upper bound, stride} for each dimension
    of a multi-dimensional array

    f) how does the scheme work when INT-RF != FP-RF ??

    The most general case for banked registers is loop unrolling. Eight
    registers is not a lot so the first loop may use two banks, but now you
    have 4 unrolls that are fairly trivial to set up.

    Unrolling has become less and less necessary with GBOoO implementations.
    Producing fewer instructions to encode the whole loop is more important.

    Is this a good idea, maybe, maybe not. This is a mental exercise, it
    proves I am mental. ;)

    Arguably less crazy than some other proposals.

    Compiler people would have to be convinced to get on board as this
    would disrupt their built-in idea that register files are monolithic.

    How does banked compare to high registers? Not as good.

    Intel could pull off something like this to one up ARM. A new fixed
    width instruction set with a nice patent moat, and fits the x86 mindset.

    What fits the x86 mindset is the::
    MEM Rd,[Rbase+Rindex<<scale+LargeDisplacement]
    address mode.

  • From Brett@21:1/5 to [email protected] on Mon Aug 19 23:23:11 2024
    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

    Banked register files, a mental exercise at expanding the register file.

    With three operand RISC you have three 5 bit register
    specifiers using 15 bits.

    If instead you have eight banks of eight registers you have a 3 bit bank
    specifier and three 3 bit register specifiers for 12 bits.

    Now the banks need to talk to each other and so you would add a bit to
    each register specifier to tell whether it uses the bank or the base
    registers, for 72 registers total, not 64. So a 3 bit bank specifier
    and three 4 bit register specifiers for 15 bits, the same as a 32
    register RISC chip.

    This covers 100% of instructions that smell like::

    ADD R17,R17,R25

    but covers 000% of the instructions that smell like::

    ADD R7,R17,R25

    I strongly suspect that it covers less than 50% of the 3-operand
    instruction uses.

    My description was bad. Let’s do an MC 68000 version: the base registers are
    mostly for addressing, and the banks are integer/float.

    The compiler can handle this easily with simple dependency grouping. If you
    need more than 8 registers, you use the No base flag to total to the base
    registers. So you have two chains that total in two banks, and both write to
    the base registers, where the final totals of the two chains are added.

    This saves a lot of no bank bits, only the result needs a no bank override.

    Simple code only uses base registers, or base plus one bank.
    Call and return parameters are in the base registers, spilling to a bank if
    you need more.

    Since only the result needs an override, you can do 4 banks of 16
    registers.

    Two operand plus 16 bit offset instructions would need to sacrifice one
    bit of offset. Four operand instructions would save a bit.

    As an extra bonus you now have another 3 bit field that could be another
    source or destination if you are not using the bank register. But with
    only eight base registers it can look hard to pull off using 4 or 5
    registers at once. But maybe not if most of the addressing is in the bank
    registers. The frame pointer would be in the base registers, as it loads
    the other pointers.

    I guess the real question at this point is how are the banks used when:

    a) calling a subroutine
    b) returning from a subroutine
    c) calling a method
    d) calling an external subroutine

    e) dealing with {lower bound, upper bound, stride} for each dimension
    of a multi-dimensional array

    f) how does the scheme work when INT-RF != FP-RF ??

    The most general case for banked registers is loop unrolling. Eight
    registers is not a lot so the first loop may use two banks, but now you
    have 4 unrolls that are fairly trivial to set up.

    Unrolling has become less and less necessary with GBOoO implementations.
    Producing fewer instructions to encode the whole loop is more important.

    Is this a good idea, maybe, maybe not. This is a mental exercise, it
    proves I am mental. ;)

    Arguably less crazy than some other proposals.

    Compiler people would have to be convinced to get on board as this
    would disrupt their built-in idea that register files are monolithic.

    How does banked compare to high registers? Not as good.

    Intel could pull off something like this to one up ARM. A new fixed
    width instruction set with a nice patent moat, and fits the x86 mindset.

    What fits the x86 mindset is the::
    MEM Rd,[Rbase+Rindex<<scale+LargeDisplacement]
    address mode.


  • From MitchAlsup1@21:1/5 to Brett on Tue Aug 20 01:19:10 2024
    On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

    Banked register files, a mental exercise at expanding the register file.

    With three operand RISC you have three 5 bit register
    specifiers using 15 bits.

    If instead you have eight banks of eight registers you have a 3 bit bank
    specifier and three 3 bit register specifiers for 12 bits.

    Now the banks need to talk to each other and so you would add a bit to
    each register specifier to tell whether it uses the bank or the base
    registers, for 72 registers total, not 64. So a 3 bit bank specifier
    and three 4 bit register specifiers for 15 bits, the same as a 32
    register RISC chip.

    This covers 100% of instructions that smell like::

    ADD R17,R17,R25

    but covers 000% of the instructions that smell like::

    ADD R7,R17,R25

    I strongly suspect that it covers less than 50% of the 3-operand
    instruction uses.

    My description was bad. Let’s do an MC 68000 version: the base registers are
    mostly for addressing, and the banks are integer/float.

    There is a reason this style fell out of fashion.
    Not enough address registers at the same time one had not enough
    data registers--whereas a flat 16-entry file had enough for either.

    There is a reason CRAY-2 staging memory was not copied, too; and
    it is mainly the same reason.


    The compiler can handle this easily with simple dependency grouping. If you
    need more than 8 registers, you use the No base flag to total to the base
    registers. So you have two chains that total in two banks, and both write to
    the base registers, where the final totals of the two chains are added.

    This saves a lot of no bank bits, only the result needs a no bank
    override.

    Yes, 68K did have pretty good code density--{{Now if 68020 had NOT
    gone and blown up the addressing modes...}}

    Simple code only uses base registers, or base plus one bank.
    Call and return parameters are in the base registers, spilling to a bank
    if you need more.

    Since only the result needs an override, you can do 4 banks of 16
    registers.

    Dream on.

  • From Brett@21:1/5 to [email protected] on Tue Aug 20 03:54:11 2024
    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

    Banked register files, a mental exercise at expanding the register file.

    Four Banked register files, a mental exercise at expanding the register
    file.

    With three operand RISC you have three 5 bit register specifiers
    using 15 bits.

    If instead you have four banks of sixteen registers you have a 2 bit bank specifier and three 4 bit register specifiers with one override bit for the destination for 15 total bits, the same as a 32 register RISC chip. The override bit specifies bank zero for destination, 64 total registers.
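
    A minimal sketch of how such a 15 bit operand field might be packed and
    unpacked. The field order (bank, override, destination, two sources) and
    the flat numbering bank * 16 + register are assumptions for illustration;
    the post only fixes the field widths.

```python
# Assumed layout, most significant bit first:
#   [bank:2][override:1][rd:4][rs1:4][rs2:4]  = 15 bits
REGS_PER_BANK = 16


def encode(bank, override, rd, rs1, rs2):
    """Pack the operand fields into a 15 bit integer."""
    assert 0 <= bank < 4 and override in (0, 1)
    assert all(0 <= r < REGS_PER_BANK for r in (rd, rs1, rs2))
    return (bank << 13) | (override << 12) | (rd << 8) | (rs1 << 4) | rs2


def decode(word):
    """Return (dest, src1, src2) as flat register numbers 0..63.
    The override bit redirects only the destination to bank zero."""
    bank = (word >> 13) & 0x3
    override = (word >> 12) & 0x1
    rd = (word >> 8) & 0xF
    rs1 = (word >> 4) & 0xF
    rs2 = word & 0xF
    dest_bank = 0 if override else bank
    return (dest_bank * REGS_PER_BANK + rd,
            bank * REGS_PER_BANK + rs1,
            bank * REGS_PER_BANK + rs2)


if __name__ == "__main__":
    print(decode(encode(2, 0, 3, 5, 9)))  # all three operands in bank 2
    print(decode(encode(2, 1, 3, 5, 9)))  # destination forced to bank 0
```

    With this layout the sources always share the instruction's bank, and
    bank zero plays the role of the base registers.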

    Two operand plus 16 bit offset instructions would need to sacrifice one bit
    of offset. Four operand instructions would save two bits, quite useful.

    Addressing can be done from any bank.

    The compiler can handle large banks easily, simple dependency grouping and
    if you need more than 16 registers for a single calculation you use the
    Base override flag to total to the base registers. So you have two chains
    that total in two banks and both write to the base registers where the last total of the two chains are added.
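
    As a rough illustration of the two-chain idea, here is a toy model of a
    sum split into two independent dependency chains. The choice of banks 1
    and 2, the register slots, and the function name are arbitrary
    assumptions; a real compiler would allocate whatever is free.

```python
def banked_sum(values):
    """Sum values using two accumulator chains, one per bank, then
    combine the two totals in the base registers (bank 0)."""
    # Toy register file: 4 banks of 16 registers, bank 0 = base registers.
    regs = {bank: [0] * 16 for bank in range(4)}

    # Even elements feed the chain in bank 1, odd elements the chain in
    # bank 2, so neither chain depends on the other.
    for i, v in enumerate(values):
        bank = 1 if i % 2 == 0 else 2
        regs[bank][0] += v

    # Each chain's last operation targets a base register (the override
    # bit in the proposal); here that is modelled as a plain copy.
    regs[0][0] = regs[1][0]
    regs[0][1] = regs[2][0]

    # The final add runs entirely in the base registers.
    regs[0][2] = regs[0][0] + regs[0][1]
    return regs[0][2]


if __name__ == "__main__":
    print(banked_sum(range(10)))
```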

    Call and return parameters are in the base registers.
    Simple code only uses base registers, or base plus one bank.

    Is this a good idea? I think so, but this is a mental exercise, it proves I
    am mental. ;)

    How does banked compare to high registers? Slightly better.
    Intel could pull off something like this to one up ARM. A new fixed width instruction set with a nice patent moat, and fits the x86 mindset.

    Yes you can do
    Rd,[Rbase+Rindex<<scale+LargeDisplacement]
    Large displacements would be in extension words like My 66000.
    Nothing stops you from doing add from memory, besides being costly in
    opcode bits and die size.

  • From mac@21:1/5 to All on Thu Aug 22 21:15:06 2024
    The old DEC/Intel, later Netronome, IXP network processors had two banks of
    128 registers each, and the source operands had to come from different
    banks. Funky machine, lots of visible pipeline delays. Limited scratch
    memory with 3-cycle latency and explicit address register. DRAM was an I/O
    device with asynchronous load/store.

    Not for C programmers. An optimizing assembler handled register assignment
    and filled pipeline delay slots.

  • From Brett@21:1/5 to Brett on Thu Aug 22 21:58:11 2024
    Brett <[email protected]> wrote:
    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

    Banked register files, a mental exercise at expanding the register file.

    Four Banked register files, a mental exercise at expanding the register
    file.

    With three operand RISC you have three 5 bit register specifiers using 15 bits.

    If instead you have four banks of sixteen registers you have a 2 bit bank specifier and three 4 bit register specifiers with one override bit for the destination for 15 total bits, the same as a 32 register RISC chip. The override bit specifies bank zero for destination, 64 total registers.

    Two operand plus 16 bit offset instructions would need to sacrifice one bit of offset. Four operand instructions would save two bits, quite useful.

    Addressing can be done from any bank.

    The compiler can handle large banks easily, simple dependency grouping and
    if you need more than 16 registers for a single calculation you use the
    Base override flag to total to the base registers. So you have two chains that total in two banks and both write to the base registers where the last total of the two chains are added.

    Call and return parameters are in the base registers.
    Simple code only uses base registers, or base plus one bank.

    On the mental status, having multiple banks means you can have multiple ALU clusters and rename. You are no longer limited to 9 way rename and 12ish
    way issue, but a multiple of that. The limits are load and store bandwidth,
    and some added latency to coordinate. Lots of money would get piled into compilers to maximize even bank use.

    Is this a good idea? I think so, but this is a mental exercise, it proves I am mental. ;)

    How does banked compare to high registers? Slightly better.
    Intel could pull off something like this to one up ARM. A new fixed width instruction set with a nice patent moat, and fits the x86 mindset.

    Yes you can do
    Rd,[Rbase+Rindex<<scale+LargeDisplacement]
    Large displacements would be in extension words like My 66000.
    Nothing stops you from doing add from memory, besides being costly in
    opcode bits and die size.


  • From Brett@21:1/5 to Robert Finch on Sat Aug 24 17:27:47 2024
    Robert Finch <[email protected]> wrote:
    On 2024-08-22 5:58 p.m., Brett wrote:
    Brett <[email protected]> wrote:
    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

    Banked register files, a mental exercise at expanding the register file.


    Four Banked register files, a mental exercise at expanding the register
    file.

    With three operand RISC you have three 5 bit register specifiers
    using 15 bits.

    If instead you have four banks of sixteen registers you have a 2 bit bank
    specifier and three 4 bit register specifiers with one override bit for the
    destination for 15 total bits, the same as a 32 register RISC chip. The
    override bit specifies bank zero for destination, 64 total registers.

    Two operand plus 16 bit offset instructions would need to sacrifice one bit
    of offset. Four operand instructions would save two bits, quite useful.

    Addressing can be done from any bank.

    The compiler can handle large banks easily, simple dependency grouping and
    if you need more than 16 registers for a single calculation you use the
    Base override flag to total to the base registers. So you have two chains
    that total in two banks and both write to the base registers where the last
    total of the two chains are added.

    Call and return parameters are in the base registers.
    Simple code only uses base registers, or base plus one bank.

    On the mental status, having multiple banks means you can have multiple ALU
    clusters and rename. You are no longer limited to 9 way rename and 12ish
    way issue, but a multiple of that. The limits are load and store bandwidth,
    and some added latency to coordinate. Lots of money would get piled into
    compilers to maximize even bank use.

    Is this a good idea? I think so, but this is a mental exercise, it proves I
    am mental. ;)

    How does banked compare to high registers? Slightly better.
    Intel could pull off something like this to one up ARM. A new fixed width
    instruction set with a nice patent moat, and fits the x86 mindset.

    Yes you can do
    Rd,[Rbase+Rindex<<scale+LargeDisplacement]
    Large displacements would be in extension words like My 66000.
    Nothing stops you from doing add from memory, besides being costly in
    opcode bits and die size.




    Using BRAMs usually allows for a lot more registers than make sense in
    an architecture. Makes one wonder what to do with the extra registers.
    The MOV instruction can be made to use more bits for the register spec allowing transfers between banks of registers. Since MOV needs only two register specs instead of three, there are more bits available.

    FPGA Block Memory, had to look it up, not a hardware or embedded guy.

    There is a popular embedded CPU with dual register files and dual
    operations.
    Not what I was going for, but there are possibilities there. On the high
    end you crack the dual instructions and let them execute out of order in
    the different bank pipes. This gives you 128 bit vector ops on a 64 bit cpu with multiple banks. This would be a completely different instruction set
    from what I was proposing, but fits in the same encoding. You just need two types of load pair, etc.

    Personally I would do the MIPS thing and make all registers 128 bits, but
    this gives you 256 bit vectors of a sort with the banks.

    I have been experimenting with the idea of having a smaller register
    file so fewer encoding bits, and then making up for the small file by
    having more dedicated registers. For instance, 16 regs with 2
    independent link registers, eight condition code registers, and a stack
    pointer. That really gives over 20 registers, which might be enough for
    reasonable compiles.

    I saw a design where there was an attempt to process basic blocks in
    parallel silos feeding functional units. It made use of fewer registers
    by holding data in pipeline registers instead of GPRs which it could do
    since some of the data for a basic block never goes outside the block.

  • From Brett@21:1/5 to Brett on Mon Aug 26 21:10:48 2024
    Brett <[email protected]> wrote:
    Robert Finch <[email protected]> wrote:
    On 2024-08-22 5:58 p.m., Brett wrote:
    Brett <[email protected]> wrote:
    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 23:23:11 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Mon, 19 Aug 2024 21:46:07 +0000, Brett wrote:

    Banked register files, a mental exercise at expanding the register file.


    Four Banked register files, a mental exercise at expanding the register
    file.

    With three operand RISC you have three 5 bit register specifiers
    using 15 bits.

    If instead you have four banks of sixteen registers you have a 2 bit bank
    specifier and three 4 bit register specifiers with one override bit for the
    destination for 15 total bits, the same as a 32 register RISC chip. The
    override bit specifies bank zero for destination, 64 total registers.

    Two operand plus 16 bit offset instructions would need to sacrifice one bit
    of offset. Four operand instructions would save two bits, quite useful.

    Addressing can be done from any bank.

    The compiler can handle large banks easily, simple dependency grouping and
    if you need more than 16 registers for a single calculation you use the
    Base override flag to total to the base registers. So you have two chains
    that total in two banks and both write to the base registers where the last
    total of the two chains are added.

    Call and return parameters are in the base registers.
    Simple code only uses base registers, or base plus one bank.

    On the mental status, having multiple banks means you can have multiple ALU
    clusters and rename. You are no longer limited to 9 way rename and 12ish
    way issue, but a multiple of that. The limits are load and store bandwidth,
    and some added latency to coordinate. Lots of money would get piled into
    compilers to maximize even bank use.

    Is this a good idea? I think so, but this is a mental exercise, it proves I
    am mental. ;)

    How does banked compare to high registers? Slightly better.
    Intel could pull off something like this to one up ARM. A new fixed width
    instruction set with a nice patent moat, and fits the x86 mindset.

    Yes you can do
    Rd,[Rbase+Rindex<<scale+LargeDisplacement]
    Large displacements would be in extension words like My 66000.
    Nothing stops you from doing add from memory, besides being costly in
    opcode bits and die size.




    Using BRAMs usually allows for a lot more registers than make sense in
    an architecture. Makes one wonder what to do with the extra registers.
    The MOV instruction can be made to use more bits for the register spec
    allowing transfers between banks of registers. Since MOV needs only two
    register specs instead of three, there are more bits available.

    FPGA Block Memory, had to look it up, not a hardware or embedded guy.

    There is a popular embedded CPU with dual register files and dual
    operations.
    Not what I was going for, but there are possibilities there. On the high
    end you crack the dual instructions and let them execute out of order in
    the different bank pipes. This gives you 128 bit vector ops on a 64 bit cpu with multiple banks. This would be a completely different instruction set from what I was proposing, but fits in the same encoding. You just need two types of load pair, etc.

    Personally I would do the MIPS thing and make all registers 128 bits, but this gives you 256 bit vectors of a sort with the banks.

    I have been experimenting with the idea of having a smaller register
    file so fewer encoding bits, and then making up for the small file by
    having more dedicated registers. For instance, 16 regs with 2
    independent link registers, eight condition code registers, and a stack
    pointer. That really gives over 20 registers, which might be enough for
    reasonable compiles.

    I saw a design where there was an attempt to process basic blocks in
    parallel silos feeding functional units. It made use of fewer registers
    by holding data in pipeline registers instead of GPRs which it could do
    since some of the data for a basic block never goes outside the block.

    No replies, so I figure y’all are under NDA. ;)

    So I posted over on Real World Tech my prediction that Intel APX is not 32 general registers, but two separate banks of 16 registers with their own pipelines. ;)

    Hilarious post. ;)

  • From MitchAlsup1@21:1/5 to Brett on Tue Aug 27 00:32:50 2024
    On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

    Brett <[email protected]> wrote:
    Robert Finch <[email protected]> wrote:
    On 2024-08-22 5:58 p.m., Brett wrote:
    Brett <[email protected]> wrote:
    MitchAlsup1 <[email protected]> wrote:

    I saw a design where there was an attempt to process basic blocks in
    parallel silos feeding functional units. It made use of fewer registers
    by holding data in pipeline registers instead of GPRs which it could do
    since some of the data for a basic block never goes outside the block.

    No replies, so I figure y’all are under NDA. ;)

    It has been well known since the mid 1990s that most loops end up with a
    single or dual stream of self dependent instructions and few loop
    dependencies {mostly the loop index itself}. This leads to instruction
    dependency graphs (and execution times) that look like::

    | LD |
    | LD |
    | FMUL |
    | FADD |
    | STA | | STD |
    | ADD |
    | CMP |
    | BV |
    ------------------------------------------------------------
    | LD |
    | LD |
    | FMUL |

  • From Brett@21:1/5 to [email protected] on Tue Aug 27 23:51:59 2024
    MitchAlsup1 <[email protected]> wrote:
    On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

    Brett <[email protected]> wrote:
    Robert Finch <[email protected]> wrote:
    On 2024-08-22 5:58 p.m., Brett wrote:
    Brett <[email protected]> wrote:
    MitchAlsup1 <[email protected]> wrote:

    I saw a design where there was an attempt to process basic blocks in
    parallel silos feeding functional units. It made use of fewer registers
    by holding data in pipeline registers instead of GPRs which it could do
    since some of the data for a basic block never goes outside the block.

    No replies, so I figure y’all are under NDA. ;)

    It has been well known since the mid 1990s that most loops end up with a
    single or dual stream of self dependent instructions and few loop
    dependencies {mostly the loop index itself}. This leads to instruction
    dependency graphs (and execution times) that look like::

    | LD |
    | LD |
    | FMUL |
    | FADD |
    | STA | | STD |
    | ADD |
    | CMP |
    | BV |
    ------------------------------------------------------------
    | LD |
    | LD |
    | FMUL |


    To even out the cluster load you would want the compiler to unroll once,
    first bank second, then second bank first.

    Can also be done without compiler by mapping the links on the second pass
    of a loop.

    I am assuming clusters or banks as naming and issue width continue growing.

  • From MitchAlsup1@21:1/5 to Brett on Wed Aug 28 16:44:39 2024
    On Tue, 27 Aug 2024 23:51:59 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

    Brett <[email protected]> wrote:
    Robert Finch <[email protected]> wrote:
    On 2024-08-22 5:58 p.m., Brett wrote:
    Brett <[email protected]> wrote:
    MitchAlsup1 <[email protected]> wrote:

    I saw a design where there was an attempt to process basic blocks in
    parallel silos feeding functional units. It made use of fewer registers
    by holding data in pipeline registers instead of GPRs which it could do
    since some of the data for a basic block never goes outside the block.

    No replies, so I figure y’all are under NDA. ;)

    It has been well known since the mid 1990s that most loops end up with a
    single or dual stream of self dependent instructions and few loop
    dependencies {mostly the loop index itself}. This leads to instruction
    dependency graphs (and execution times) that look like::

    | LD |
    | LD |
    | FMUL |
    | FADD |
    | STA | | STD |
    | ADD |
    | CMP |
    | BV |
    ------------------------------------------------------------
    | LD |
    | LD |
    | FMUL |


    To even out the cluster load you would want the compiler to unroll once, first bank second, then second bank first.

    Can also be done without compiler by mapping the links on the second
    pass of a loop.

    The above is done with simple reservation stations and no compiler work.

    I am assuming clusters or banks as naming and issue width continue
    growing.

    Once you start doing reservation station machines, your 72-entry banked register file needs to have the RSs watch 72 results instead of just 32.

  • From Brett@21:1/5 to [email protected] on Thu Aug 29 23:31:56 2024
    MitchAlsup1 <[email protected]> wrote:
    On Tue, 27 Aug 2024 23:51:59 +0000, Brett wrote:

    MitchAlsup1 <[email protected]> wrote:
    On Mon, 26 Aug 2024 21:10:48 +0000, Brett wrote:

    Brett <[email protected]> wrote:
    Robert Finch <[email protected]> wrote:
    On 2024-08-22 5:58 p.m., Brett wrote:
    Brett <[email protected]> wrote:
    MitchAlsup1 <[email protected]> wrote:

    I saw a design where there was an attempt to process basic blocks in
    parallel silos feeding functional units. It made use of fewer registers
    by holding data in pipeline registers instead of GPRs which it could do
    since some of the data for a basic block never goes outside the block.

    No replies, so I figure y’all are under NDA. ;)

    It has been well known since the mid 1990s that most loops end up with a
    single or dual stream of self dependent instructions and few loop
    dependencies {mostly the loop index itself}. This leads to instruction
    dependency graphs (and execution times) that look like::

    | LD |
    | LD |
    | FMUL |
    | FADD |
    | STA | | STD |
    | ADD |
    | CMP |
    | BV |
    ------------------------------------------------------------
    | LD |
    | LD |
    | FMUL |


    To even out the cluster load you would want the compiler to unroll once,
    first bank second, then second bank first.

    Can also be done without compiler by mapping the links on the second
    pass of a loop.

    The above is done with simple reservation stations and no compiler work.

    I am assuming clusters or banks as naming and issue width continue
    growing.

    Once you start doing reservation station machines, your 72-entry banked register file needs to have the RSs watch 72 results instead of just 32.


    ALUs are cheap, so each bank has its own set.
    You can forward and complete twice as many results.

    The traditional problem of banking is a one cycle delay crossing banks. A
    compiler can fix that; a CPU cannot on the first pass.
