• Re: Tonights Tradeoff - Background Execution Buffers

    From MitchAlsup1@21:1/5 to Robert Finch on Tue Sep 24 20:38:52 2024
    On Tue, 24 Sep 2024 20:03:29 +0000, Robert Finch wrote:

    Under construction: Q+ background execution buffers for the block memory
    operations. For instance, a block store operation can be executed in the
    background while other instructions are executing. Store operations are
    issued when the MEM unit is not busy. Background instructions continue
    to execute even when interrupts occur. The background operations may be
    useful for initializing blocks of memory that are not needed right away.
    When the operation is issued, a handle for the buffer is returned in the
    destination register so that the status of the operation may be queried,
    or the operation cancelled.
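
A toy software model of the handle-based scheme described above may help make it concrete. This is only a sketch: the buffer count, the state names, and every function name are assumptions made for illustration, not Q+'s actual interface. Operand values are captured at issue, one store is performed per cycle in which the MEM unit is idle, and the handle is used to query or cancel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a small pool of background block-store buffers. */
enum { NUM_BUFFERS = 4 };
typedef enum { BUF_FREE, BUF_ACTIVE, BUF_DONE, BUF_CANCELLED } buf_state;

typedef struct {
    buf_state state;
    uint64_t *dst;        /* next word to write, captured at issue */
    uint64_t  value;      /* fill value, captured at issue */
    size_t    remaining;  /* words still to be written */
} bg_buffer;

static bg_buffer buffers[NUM_BUFFERS];  /* zero-initialized: all BUF_FREE */

/* Issue a block store. Returns a handle (buffer index), or -1 if no
   buffer is free. */
int bg_store_issue(uint64_t *dst, uint64_t value, size_t words) {
    for (int h = 0; h < NUM_BUFFERS; h++) {
        if (buffers[h].state == BUF_FREE) {
            buffers[h] = (bg_buffer){
                .state = words ? BUF_ACTIVE : BUF_DONE,
                .dst = dst, .value = value, .remaining = words
            };
            return h;
        }
    }
    return -1;
}

/* Model one cycle: at most one background store, and only when the
   MEM unit is not busy with foreground work. */
void bg_step(bool mem_unit_busy) {
    if (mem_unit_busy) return;
    for (int h = 0; h < NUM_BUFFERS; h++) {
        bg_buffer *b = &buffers[h];
        if (b->state != BUF_ACTIVE) continue;
        *b->dst++ = b->value;
        if (--b->remaining == 0) b->state = BUF_DONE;
        return;
    }
}

/* Query the state of an operation through its handle. */
buf_state bg_status(int h) { return buffers[h].state; }

/* Cancel an operation that has not finished yet. */
void bg_cancel(int h) {
    if (buffers[h].state == BUF_ACTIVE) buffers[h].state = BUF_CANCELLED;
}

/* Return a finished or cancelled buffer to the pool. */
void bg_release(int h) { buffers[h].state = BUF_FREE; }
```

In this model the issuing code keeps running after `bg_store_issue` returns, and only needs `bg_status` when it is about to touch the memory being initialized.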

    This is how My 66000 performs:: LDM, STM, ENTER, EXIT, MM, and MS.
    Addresses are AGENED and then a state machine over in the memory
    unit performs the required steps. {{Not usefully different than the
    divider performing the individual steps of division.}} While the
    unit performs its duties, other units can be fed and complete
    other instructions.

    You just have to mark the affected registers to prevent hazards.
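
A minimal sketch of the "mark the affected registers" idea, written as a busy-bit scoreboard for a 32-register file. The type and function names are hypothetical and are not taken from My 66000 or Q+.

```c
#include <stdbool.h>
#include <stdint.h>

/* One busy bit per register: bit r set means register r is still owned
   by an in-flight multi-cycle operation (hypothetical model). */
typedef struct { uint32_t busy; } scoreboard;

/* Called when a long-running instruction (e.g. a load-multiple) issues. */
void sb_mark(scoreboard *sb, uint32_t regs) { sb->busy |= regs; }

/* Called as the state machine finishes with each register. */
void sb_release(scoreboard *sb, uint32_t regs) { sb->busy &= ~regs; }

/* A later instruction may issue only if none of its source or
   destination registers is still marked. */
bool sb_can_issue(const scoreboard *sb, uint32_t srcs, uint32_t dsts) {
    return (sb->busy & (srcs | dsts)) == 0;
}
```

Instructions that touch none of the marked registers pass `sb_can_issue` immediately, which is what lets other units keep working while the memory unit steps through its sequence.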

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Sep 26 14:11:09 2024
    On Thu, 26 Sep 2024 8:13:12 +0000, Robert Finch wrote:

    On 2024-09-24 4:38 p.m., MitchAlsup1 wrote:
    On Tue, 24 Sep 2024 20:03:29 +0000, Robert Finch wrote:

    Under construction: Q+ background execution buffers for the block memory
    operations. For instance, a block store operation can be executed in the
    background while other instructions are executing. Store operations are
    issued when the MEM unit is not busy. Background instructions continue
    to execute even when interrupts occur. The background operations may be
    useful for initializing blocks of memory that are not needed right away.
    When the operation is issued, a handle for the buffer is returned in the
    destination register so that the status of the operation may be queried,
    or the operation cancelled.

    This is how My 66000 performs:: LDM, STM, ENTER, EXIT, MM, and MS.
    Addresses are AGENED and then a state machine over in the memory
    unit performs the required steps. {{Not usefully different than the
    divider performing the individual steps of division.}} While the
    unit performs its duties, other units can be fed and complete
    other instructions.

    You just have to mark the affected registers to prevent hazards.

    Q+ releases the registers right away, so things can continue on.
    Q+ captures the register values at issue then does not modify the
    registers. Did not want an instruction with three updates happening. It
    keeps track of its own values. In theory anyway. Have not got to testing
    it yet. A status operation might be used to query the final operation results.

    Altering Q+ to use 64-bit instructions and 256 registers instead of
    supporting a vector instruction set. Two pipeline stages can then be
    removed, and it is a simpler design. Code density will decrease <200%.
    Relying on software to assign registers for vectors.

    Also adding a predicate field to instructions. Branches are horrendously
    slow in this simple implementation. It may be faster to predicate a
    dozen instructions.

    The depth of predication should follow from FETCH: if FETCH will "get
    there" by the time the branch "resolves", that number of instructions
    should be predicated.
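
One way to read this rule of thumb is as a simple product. The sketch below assumes two hypothetical parameters that do not appear in the thread: `fetch_width` (instructions fetched per cycle) and `resolve_cycles` (cycles from fetching the branch until it resolves).

```c
#include <stdbool.h>

/* Number of instructions FETCH delivers before the branch resolves.
   Both parameters are assumptions made for this sketch. */
unsigned predication_depth(unsigned fetch_width, unsigned resolve_cycles) {
    return fetch_width * resolve_cycles;
}

/* Predicate a skipped-over block when FETCH would have reached its end
   anyway by the time the branch outcome is known. */
bool should_predicate(unsigned skipped_instructions,
                      unsigned fetch_width, unsigned resolve_cycles) {
    return skipped_instructions
           <= predication_depth(fetch_width, resolve_cycles);
}
```

With a one-wide fetch and a twelve-cycle resolve time, for example, this gives a depth of twelve instructions, in line with the "dozen instructions" mentioned above.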

  • From Anton Ertl@21:1/5 to Robert Finch on Fri Oct 4 06:19:31 2024
    Robert Finch <[email protected]> writes:
    Today I am wondering how many predicate registers are enough. Scanning
    webpages reveals a variety. The Itanium has 64 predicates, but they are
    used for modulo loops and rotated. Rotating registers are Itanium's method
    of register renaming, so it needs more visible registers. In a classic
    superscalar design with a RAT where registers are renamed, it seems like
    64 would be far too many.

    Would it? Zen5 has 192 flags registers <https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/09/hc2024_zen5_spec_uplift.png?ssl=1>,
    and I assume that means it has 192 C, 192 V, and 192 NZP registers
    (physical), for one architectural flags register.

    I cannot see the compiler making use of very many predicate registers
    simultaneously.

    Maybe not, but what are the alternatives:

    1) Have one flags register, like AMD64 and ARM A32, T32, and A64, or
    the carry flag of Power and 88K, and the flags result of most Power
    instructions. Then the compiler typically only knows that other
    instructions will overwrite that register, and is forced to consume
    the flag right away. This leads to bad code generation, as shown in
    <[email protected]>:

    |E.g., in
    |<[email protected]> we see that gcc-5.3.0
    |compiles
    |
    | cf = _addcarry_u64(cf, src1[1], src2[1], &dst[1]);
    | cf = _addcarry_u64(cf, src1[2], src2[2], &dst[2]);
    |
    |into
    |
    |   d:  48 8b 42 08     mov    0x8(%rdx),%rax
    |  11:  41 80 c1 ff     add    $0xff,%r9b
    |  15:  49 13 40 08     adc    0x8(%r8),%rax
    |  19:  41 0f 92 c1     setb   %r9b
    |  1d:  48 89 41 08     mov    %rax,0x8(%rcx)
    |  21:  48 8b 42 10     mov    0x10(%rdx),%rax
    |  25:  41 80 c1 ff     add    $0xff,%r9b
    |  29:  49 13 40 10     adc    0x10(%r8),%rax
    |  2d:  41 0f 92 c1     setb   %r9b
    |  31:  48 89 41 10     mov    %rax,0x10(%rcx)
    |
    |Here gcc reifies the carry bit in a GPR (r9b) with the instructions at
    |19 and 2d, and also converts it from a GPR into a carry flag in 11 and
    |25. This shows that the compiler does not trust itself to preserve
    |the carry flag from one adc to the next.
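
For reference, the carry that is being moved between the flag and `%r9b` can be written out in portable C, where it lives in an ordinary variable from the start. This is a sketch of the multi-word addition itself, not a claim about what any particular compiler generates for it; the function name `add_words` is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two n-word numbers stored least-significant word first.
   Returns the carry out of the most significant word (0 or 1). */
unsigned add_words(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                   size_t n) {
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + b[i];   /* wraps modulo 2^64 */
        unsigned c1 = s < a[i];      /* carry out of a[i] + b[i] */
        uint64_t t  = s + carry;
        unsigned c2 = t < s;         /* carry out of adding carry-in */
        dst[i] = t;
        carry = c1 | c2;             /* at most one of the two is set */
    }
    return carry;
}
```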

    2) Have multiple flags registers, like IA-64. The compiler will
    certainly be able to deal with that, but extra instructions are needed
    for generating the flags.

    3) Use the GPRs for flags. This also often requires additional
    instructions for generating the flags, as in MIPS, 88K, or RISC-V
    (with quite a bit of difference between the MIPS/Alpha/RISC-V
    approach and the 88K approach). This disadvantage is often mitigated
    by having compare-and-branch instructions or instructions that branch
    on certain properties of a register's content.

    4) Keep the flags results along with GPRs: have carry and overflow as
    bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
    The advantage is that you do not have to track the flags separately
    (and, in case of AMD64, track each of C, O, and NZP separately), but
    instead can use the RAT that is already there for the GPRs. You can
    find a preliminary paper on that on <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
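
A small software model of option 4, assuming a register is a 64-bit value plus two extra bits. The struct layout and function names are hypothetical; only the roles of the bits (carry and overflow as extra bits, N as bit 63, Z derived from bits 0-63) follow the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a GPR that carries its own flags. */
typedef struct {
    uint64_t value;     /* bits 0-63 */
    bool     carry;     /* extra bit: unsigned carry out */
    bool     overflow;  /* extra bit: signed overflow */
} xreg;

/* Wrap a plain value as a register with cleared extra bits. */
xreg xreg_from(uint64_t v) {
    xreg r = { v, false, false };
    return r;
}

/* Add: the flags land in the destination register, so only one
   result (and one rename) is produced. Incoming flags are ignored. */
xreg xadd(xreg a, xreg b) {
    xreg r;
    r.value    = a.value + b.value;
    r.carry    = r.value < a.value;
    /* signed overflow: operands share a sign that the result lacks */
    r.overflow = ((~(a.value ^ b.value) & (a.value ^ r.value)) >> 63) & 1;
    return r;
}

/* N and Z need no storage: they are functions of bits 0-63. */
bool xreg_negative(xreg r) { return (r.value >> 63) & 1; }
bool xreg_zero(xreg r)     { return r.value == 0; }
```

A conditional branch in such a design would name the register whose flags it tests, rather than an implicit flags register.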

    Since they are not used simultaneously, and register
    renaming is in effect, there should not be a great need for predicate
    registers.

    You need to preserve one instance for every recovery point, i.e.,
    every instruction that branches or can trap and that has not yet
    been committed. You also need to preserve one instance if there is
    any consumer that has not yet proceeded through execution. The
    simplest way to satisfy both requirements is to just preserve any
    flags result until the generating instruction retires. And if most
    instructions generate flags, that means a lot of instances of the
    flags. There is a reason why Zen5 has 192.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Robert Finch on Sat Oct 5 09:43:09 2024
    Robert Finch <[email protected]> writes:
    On 2024-10-04 2:19 a.m., Anton Ertl wrote:
    4) Keep the flags results along with GPRs: have carry and overflow as
    bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
    The advantage is that you do not have to track the flags separately
    (and, in case of AMD64, track each of C, O, and NZP separately), but
    instead can use the RAT that is already there for the GPRs. You can
    find a preliminary paper on that on
    <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
    ...
    One solution, not mentioned in your article, is to support arithmetic
    with two bits less than the number of bits a register can support, so
    that the carry and overflow can be stored. On a 64-bit machine have all
    operations use only 62 bits. It would solve the issue of how to load or
    store the carry and overflow bits associated with a register.

    Yes, that's a solution, but the question is how well existing software
    would react to having no int64_t (and equivalent types, such as long
    long), but instead an int62_t (or maybe int63_t, if the 64th bit is
    used for both signed and unsigned overflow, by having separate signed
    and unsigned addition etc.). I expect that such an architecture would
    have low acceptance. By contrast, in my paper I suggest an addition
    to existing 64-bit architectures that has fewer of the disadvantages
    of the widely-used condition-code-register approach, but still has a
    few of them.

    Sometimes
    arithmetic is performed with fewer bits, as for pointer representation.
    I wonder if pointer masking could somehow be involved. It may be useful
    to have a bit indicating the presence of a pointer. Also thinking of how
    to track a binary point position for fixed point arithmetic. Perhaps
    using the whole upper byte of a register for status/control bits would work.

    There are some extensions for AMD64 in that direction.

    It may be possible with Q+ to support a second destination register
    which is in a subset of the GPRs. For example, one of eight registers
    could be specified to hold the carry/overflow status. That effectively
    ties up a second ALU though, as an extra write port is needed for the
    instruction.

    Needing only one write port is an advantage of my approach.

    - anton

  • From MitchAlsup1@21:1/5 to Robert Finch on Sat Oct 5 23:02:28 2024
    On Fri, 4 Oct 2024 4:04:20 +0000, Robert Finch wrote:

    Today I am wondering how many predicate registers are enough.

    Realistically 3 as long as they are orthogonal to each other and to
    code.

    Scanning webpages reveals a variety. The Itanium has 64 predicates, but
    they are used for modulo loops and rotated. Rotating registers are
    Itanium's method of register renaming, so it needs more visible
    registers. In a classic superscalar design with a RAT where registers
    are renamed, it seems like 64 would be far too many. Cray had eight
    vector mask registers.

    In the ECL logic CRAY used, an 8:1 multiplexer costs no more than a
    2:1 multiplexer in power and gate count.

    I think
    the RISC-V Hwacha has 16 if I looked at the diagram correctly.
    I cannot see the compiler making use of very many predicate registers
    simultaneously. Since they are not used simultaneously, and register
    renaming is in effect, there should not be a great need for predicate
    registers.

    With the orthogonality mentioned above, 3 covers a tree with 8 branches
    or a 3-deep if()then{}else{}.
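
The arithmetic behind "3 covers 8" can be sketched as follows. The guard encoding here (a mask of which predicates matter plus the values they must have) is a hypothetical one chosen for illustration; it is not the My 66000 or Q+ encoding.

```c
#include <stdbool.h>

/* Hypothetical guard: 'mask' selects which of the 3 predicates an
   instruction depends on, 'want' gives the values they must have. */
typedef struct { unsigned mask; unsigned want; } guard;

/* An instruction executes when every selected predicate matches. */
bool guard_passes(guard g, unsigned preds) {
    return (preds & g.mask) == (g.want & g.mask);
}

/* Guard for leaf number 'leaf' (0..7) of a full 3-level decision
   tree: all three predicates are tested. */
guard leaf_guard(unsigned leaf) {
    guard g = { 0x7u, leaf & 0x7u };
    return g;
}

/* Count how many of the 8 leaves execute for a given predicate state. */
unsigned active_leaves(unsigned preds) {
    unsigned n = 0;
    for (unsigned leaf = 0; leaf < 8; leaf++)
        n += guard_passes(leaf_guard(leaf), preds);
    return n;
}
```

Whatever values the three predicates take, exactly one of the eight leaf guards passes, while a guard with a narrower mask (for example only the outermost predicate) covers a whole subtree.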

    Suppose one wants predicated logic in a loop with the predicate being
    set outside of the loop.

    1 decision per predicate,

    It may be desirable to have several blocks of
    logic predicated by different predicates in the loop. It is likely
    desirable to have more than one predicate then.
    Reserved four bits in the instruction for predicates. Do not want to
    waste bits though. Using a 64-bit instruction.

  • From Scott Lurndal@21:1/5 to Robert Finch on Wed Oct 9 14:43:37 2024
    Robert Finch <[email protected]> writes:
    On 2024-10-05 5:43 a.m., Anton Ertl wrote:

    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50.

    Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression.

    Completely unacceptable.

    I spent several days back in 1990 fixing a temporary register
    bug in PCC caused by a complex expression. The expression
    was generated by cfront, so there was no way to "simplify"
    the expression.

  • From MitchAlsup1@21:1/5 to Robert Finch on Wed Oct 9 16:19:41 2024
    On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:


    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50. Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression. Remnants of
    the ‘expression too complex’ error in BASIC.

    Both completely unacceptable, and in your case completely unnecessary.
    In 967 subroutines I read out of My 66000 LLVM compile, I only have
    3 cases of spill-fill, and that is with only 32 registers with
    universal constants.

    Of the RISC-V code I read alongside with 32+32 registers, I counted 8.

    With those statistics and 256 registers, if you can't get to essentially
    0 spill-fill the problem is not with your architecture but with your
    compiler.

  • From Anton Ertl@21:1/5 to All on Sun Oct 13 16:43:53 2024
    Robert Finch <[email protected]> writes:

    [Context: carry and overflow in GPRs
    <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>]

    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50. Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression. Remnants of
    the ‘expression too complex’ error in BASIC. So, there are no spills or
    reloads during expression processing.

    The first question is how carry and overflow are represented in the
    programming language.

    Currently there are programming languages with growable integers, and
    overflow is needed short-term for that, so spilling the overflow bit
    is probably not necessary for that (and indeed, the one overflow bit
    of AMD64 or ARM A64 that is not preserved across calls is good enough
    for that).

    For dealing with multiple-precision integers (e.g., when the growable
    integers actually grow to more than one word), typically library
    routines are used, but sure, one could also have a programming
    language that computes with multi-precision integers and then is
    compiled into either loops over the individual words of these numbers,
    or it unrolls these loops (if the length is known in advance). Yes,
    if you run out of registers there, you may want to spill and refill a
    register, including its carry bit. But that should be rare, so if
    it's an expensive operation, we can live with it.

    What we have now is things like the GNU C extension

    bool __builtin_add_overflow (type1 a, type2 b, type3 *res);

    This produces two different results, the return value, and res. With
    the kind of architecture I have in mind, these two results could be
    allocated into the same register. If at some point the register has
    to be spilled, the two results can be stored into different memory
    locations, and on refill they will land in different GPRs unless the
    compiler writer really puts a lot more work in than is merited (I
    don't expect many spills and refills).
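
A short usage sketch of the builtin, to show the two results in question. `__builtin_add_overflow` is a GCC/Clang extension; the wrapper name `checked_sum` is made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum an array of int64_t. *total receives the wrapped sum; the return
   value reports whether any addition overflowed. */
bool checked_sum(const int64_t *v, size_t n, int64_t *total) {
    int64_t acc = 0;
    bool overflowed = false;
    for (size_t i = 0; i < n; i++) {
        /* Two results per step: the bool return value and acc. */
        if (__builtin_add_overflow(acc, v[i], &acc))
            overflowed = true;
    }
    *total = acc;
    return overflowed;
}
```

On a conventional architecture the overflow result and `acc` occupy separate locations; in the scheme discussed here both could sit in one register until a spill forces them apart.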

    I think the storeextra / loadextra
    registers used during context switching would work okay. But in Q+ there
    are 256 regs which require eight storeextra / loadextra registers. I
    think the storeextra / loadextra registers could be hidden in the
    context save and restore hardware. Not even requiring access via CSRs or
    whatever.

    Yes. In my paper I wanted to spell out an implementation that does
    not look like I am ignoring some hard problems and shoving them over to
    the implementor. If a computer architect wants to pick my idea up,
    they are welcome to implement context-switching in any way they deem
    appropriate.

    I suppose context loads and stores could be done in blocks of
    32 registers. An issue is that the load extra needs to be done before
    registers are loaded.

    Maybe, with 256 GPRs, you would use 8 storeextra and 8 loadextra
    registers, each one associated with 32 registers. This avoids having
    to make the whole process a sequential operation working on 32-GPR
    blocks. Just store all 256 GPRs, sync (to get the storeextra
    registers up-to-date), then store the 8 storeextra registers. For
    context load, load the 8 loadextra registers, sync (so the loads of
    the loadextra registers are finished), then the 256 GPRs.
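
The ordering described above can be modelled in software as follows. This is a sketch under stated assumptions: each GPR carries 2 extra bits, so 32 registers fill one 64-bit extra register, and the bit placement within an extra register is chosen here for illustration. The type and function names are hypothetical.

```c
#include <stdint.h>

enum { NUM_GPRS = 256, GROUP = 32, NUM_EXTRA = NUM_GPRS / GROUP };

/* Hypothetical model of a GPR with 2 extra bits (carry, overflow). */
typedef struct { uint64_t value; unsigned flags; } gpr;

/* Memory image of a saved context. */
typedef struct {
    uint64_t values[NUM_GPRS];
    uint64_t extra[NUM_EXTRA];
} saved_context;

/* Save: store all GPRs (each store deposits its 2 extra bits in the
   storeextra register of its group), then store the storeextra regs. */
void context_save(const gpr regs[NUM_GPRS], saved_context *out) {
    uint64_t storeextra[NUM_EXTRA] = {0};
    for (int r = 0; r < NUM_GPRS; r++) {
        out->values[r] = regs[r].value;
        storeextra[r / GROUP] |=
            (uint64_t)(regs[r].flags & 3u) << (2 * (r % GROUP));
    }
    /* "sync" point: storeextra is now up to date */
    for (int k = 0; k < NUM_EXTRA; k++)
        out->extra[k] = storeextra[k];
}

/* Load: fetch the loadextra registers first, then the GPRs, each of
   which picks up its 2 extra bits from its group's loadextra register. */
void context_load(gpr regs[NUM_GPRS], const saved_context *in) {
    uint64_t loadextra[NUM_EXTRA];
    for (int k = 0; k < NUM_EXTRA; k++)
        loadextra[k] = in->extra[k];
    /* "sync" point: loadextra values have arrived */
    for (int r = 0; r < NUM_GPRS; r++) {
        regs[r].value = in->values[r];
        regs[r].flags =
            (unsigned)(loadextra[r / GROUP] >> (2 * (r % GROUP))) & 3u;
    }
}
```

Because each group of 32 registers has its own extra register, the 256 stores (or loads) do not have to be serialized block by block; only the two sync points impose an order.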

    Or alternatively just have 8 extra registers that are used for both
    context stores and context loads. Then you cannot use the same sync
    for both storing and loading, but you may prefer a little more
    context-switch overhead to needing 16 extra registers.

    Another thought is to store additional info such as a CRC check of the
    register file on context save and restore.

    Typically ECC memory and something similar in bus protocols achieve
    what I guess you want to achieve with the CRC checks.

    - anton
