• Re: Misc: Preliminary (actual) performance: BJX2 vs RV64

    From MitchAlsup1@21:1/5 to BGB on Sun Jan 21 21:22:50 2024
    BGB wrote:

    I have now gotten around to fully implementing the ability to boot BJX2
    into RISC-V mode.

    Though, this part wasn't the hard-part, rather, more, porting most of
    TestKern to be able to build on RISC-V (some parts are still stubbed
    out, so using it as a kernel in RV Mode will not yet be possible, but
    got enough ported at least to be able to run programs "bare metal" in
    RV64 Mode).

    Both are using more or less the same C library (TestKern + modified
    PDPCLIB).

    For the BJX2 side, things are compiled with BGBCC.
    For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).

    This allows more accurate comparison than, say, on paper analysis or
    comparing results between different emulators.


    So, first program tested was Doom, with preliminary results (average
    framerate):
    RV -O3 18.1
    RV -Os 15.5
    XG2 21.6
    This is from running the first 3 demos and stopping at the same spot.

    Both give "similar" MIPs values, but the mix differs:
    BJX2: Dominated by memory Load/Store followed by branches;
    RISC-V: Dominated by ALU operations (particularly ADD and Shift).
    Load/Store, and Branches, are a little down the list.

    RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
    despite having fewer GPRs.


    Meanwhile, ADD and SLLI seem to be the top two instructions used in
    RISC-V (I will still continue to blame the lack of register-indexed
    load/store on this one...).
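
    As a rough illustration of why ADD and SLLI rank so high without
    register-indexed addressing, the C sketch below spells out the address
    arithmetic for a 32-bit array load. The names load_elem and demo_table are
    hypothetical, and the instructions in the comments are the sequence a
    compiler typically emits for base RV64I in this pattern, not output
    captured from the GCC build discussed in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample data for the usage note below. */
static const int32_t demo_table[4] = {10, 20, 30, 40};

/* Load base[i] with the scaling and the add made explicit.
   Without an indexed addressing mode each step is its own instruction. */
static int32_t load_elem(const int32_t *base, size_t i)
{
    uintptr_t scaled = (uintptr_t)i << 2;        /* slli t0, a1, 2  (i * 4)     */
    uintptr_t addr = (uintptr_t)base + scaled;   /* add  t0, a0, t0             */
    return *(const int32_t *)addr;               /* lw   a0, 0(t0)              */
}
```

    With an indexed load, the shift and the add fold into the memory
    instruction itself, so load_elem(demo_table, 2) would cost one operation
    instead of three.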

    It does seem to suffer more from spending a higher percentage of its
    time with interlocks, particularly with ALU operations (doesn't seem
    like a great situation to have 2-cycle latency on ADD and Shift
    instructions...).

    You might be the first person with a RISC-V that has 2 cycle ADDs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sun Jan 21 23:40:16 2024
    BGB wrote:

    On 1/21/2024 3:22 PM, MitchAlsup1 wrote:
    BGB wrote:

    I have now gotten around to fully implementing the ability to boot
    BJX2 into RISC-V mode.

    Though, this part wasn't the hard-part, rather, more, porting most of
    TestKern to be able to build on RISC-V (some parts are still stubbed
    out, so using it as a kernel in RV Mode will not yet be possible, but
    got enough ported at least to be able to run programs "bare metal" in
    RV64 Mode).

    Both are using more or less the same C library (TestKern + modified
    PDPCLIB).

    For the BJX2 side, things are compiled with BGBCC.
       For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).

    This allows more accurate comparison than, say, on paper analysis or
    comparing results between different emulators.


    So, first program tested was Doom, with preliminary results (average
    framerate):
       RV -O3  18.1
       RV -Os  15.5
       XG2     21.6
    This is from running the first 3 demos and stopping at the same spot.

    Both give "similar" MIPs values, but the mix differs:
       BJX2: Dominated by memory Load/Store followed by branches;
       RISC-V: Dominated by ALU operations (particularly ADD and Shift).
         Load/Store, and Branches, are a little down the list.

    RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
    despite having fewer GPRs.


    Meanwhile, ADD and SLLI seem to be the top two instructions used in
    RISC-V (I will still continue to blame the lack of register-indexed
    load/store on this one...).

    It does seem to suffer more from spending a higher percentage of its
    time with interlocks, particularly with ALU operations (doesn't seem
    like a great situation to have 2-cycle latency on ADD and Shift
    instructions...).

    You might be the first person with a RISC-V that has 2 cycle ADDs.

    Yeah, and probably not an ideal situation for RISC-V, as seemingly it is
    one of the most common instructions:
    MV Xd, Xs
    LI Xd, Imm12s

    ADDI Xd, Xs, 0
    ADDI Xd, X0, Imm12s
    ....

    For move they could use OR Rd,Rs,#0 or do you have 2 cycle logicals ??

    Shift sees a lot of use as well, as it is also used for both indexed
    addressing, and for performing sign and zero extension.

    Say:
       j=(short)i;
    Being, say:
       SLLI X11, X10, 16
       SRAI X11, X11, 16

    Which I do in 1 instruction
        SLL  R11,R10,<16,0>
    {Extract the lower 16 bits at offset 0}
    I started calling this a Smash -- Smash this long into a short.
    This is what happens when shifts are subset of bit manipulation
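
    A minimal C sketch of the two forms being compared, assuming a 64-bit
    register. The function names are hypothetical. Note that on a 64-bit
    register the shift count that keeps the low 16 bits is 48; a count of 16
    matches a 32-bit register width. The right shift of a negative value is
    implementation-defined in C and is assumed here to be arithmetic, as it is
    on mainstream compilers.

```c
#include <stdint.h>

/* Shift-pair sign extension of the low 16 bits of a 64-bit value.
   The left shift is done unsigned to avoid undefined behaviour. */
static int64_t sext16_shift_pair(int64_t x)
{
    uint64_t up = (uint64_t)x << 48;   /* move bit 15 up into bit 63        */
    return (int64_t)up >> 48;          /* arithmetic shift copies it back   */
}

/* The same narrowing written as the cast from the post: j = (short)i;
   (out-of-range conversion is implementation-defined; wraps in practice). */
static int64_t sext16_cast(int64_t x)
{
    return (int16_t)x;
}
```

    Both functions agree on every input under those assumptions; a single
    extract instruction does the work of the two shifts.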

    As opposed to having dedicated instructions for a lot of these cases (as
    in BJX2).

    See; mine are not dedicated, they just as easily perform

        struct { long i : 17,
                      j : 9,
                      k : 3,
                     ...      } st;
        short s = st.k;

        SLL     Rs,Rst,<3,26>
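
    The <width,offset> pair can be modelled in C as a generic field extract.
    This is a sketch under assumptions: extract_u and extract_s are
    hypothetical helpers, the field k is assumed to sit at bit offset
    17 + 9 = 26 (bit-field layout is implementation-defined in C), and the
    signed variant assumes an arithmetic right shift.

```c
#include <stdint.h>

/* Zero-extending extract of `width` bits starting at bit `offset`.
   Requires 1 <= width and offset + width <= 64. */
static uint64_t extract_u(uint64_t v, unsigned width, unsigned offset)
{
    uint64_t up = v << (64 - width - offset);   /* drop bits above the field */
    return up >> (64 - width);                  /* drop bits below, zero-fill */
}

/* Sign-extending extract with the same <width,offset> arguments. */
static int64_t extract_s(uint64_t v, unsigned width, unsigned offset)
{
    uint64_t up = v << (64 - width - offset);
    return (int64_t)up >> (64 - width);         /* assumed arithmetic shift */
}
```

    Under these assumptions extract_s(word, 3, 26) reads st.k and
    extract_s(word, 16, 0) is the (short) narrowing, so one parameterised
    operation covers both cases.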

    Oh well...

    I have the director of Northern Telecom circa 1984 to thank for this. I
    BLEW the 88K implementation by putting the two 5-bit fields back to back
    and using the 16-bit immediate encoding, wasting bits and tying my hands
    into the future at the same time. My 66000 has essentially the same instrs;
    but the immediate form is XOM7 and uses a 12-bit immediate field. When
    this pattern is decoded, the two 5-bit fields are routed onto the Rs2
    operand bus at position<37..32> and position<5..0>. No 32-bit or smaller
    data value (replacing the immediate) can access the extract functionality
    and the 64-bitters that can are limited to putting SANE bit patterns
    there when they do. The lower field is limited to 0..63, the upper one
    to 0..64, and all intermediate bits are checked for zeros.

  • From MitchAlsup1@21:1/5 to BGB on Mon Jan 22 19:01:35 2024
    BGB wrote:

    On 1/21/2024 5:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    Shift sees a lot of use as well, as it is also used for both indexed
    addressing, and for performing sign and zero extension.

    Say:
       j=(short)i;
    Being, say:
       SLLI X11, X10, 16
       SRAI X11, X11, 16

    Which I do in 1 instruction
        SLL  R11,R10,<16,0>
    {Extract the lower 16 bits at offset 0}
    I started calling this a Smash -- Smash this long into a short.
    This is what happens when shifts are subset of bit manipulation
    As opposed to having dedicated instructions for a lot of these cases
    (as in BJX2).

    See; mine are not dedicated, they just as easily perform

        struct { long i : 17,
                      j : 9,
                      k : 3,
                     ...      } st;
        short s = st.k;

        SLL     Rs,Rst,<3,26>


    Possibly, if one has a big enough immediate field to encode it.

    It is 12-bits, 2×6-bit fields.

    Could have made sense as a use for the 12-bit Immed fields in RISC-V,
    but it can be noted that they did not do so (and chose instead to use
    pairs of shifts).

  • From Thomas Koenig@21:1/5 to BGB on Mon Jan 22 18:46:41 2024
    BGB <[email protected]> schrieb:

    All of the ALU ops are 2-cycle at present.

    You're imitating POWER, are you? :-)

  • From MitchAlsup1@21:1/5 to BGB on Mon Jan 22 20:15:00 2024
    BGB wrote:

    On 1/22/2024 1:01 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 1/21/2024 5:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    Shift sees a lot of use as well, as it is also used for both indexed
    addressing, and for performing sign and zero extension.

    Say:
       j=(short)i;
    Being, say:
       SLLI X11, X10, 16
       SRAI X11, X11, 16

    Which I do in 1 instruction
         SLL  R11,R10,<16,0>
    {Extract the lower 16 bits at offset 0}
    I started calling this a Smash -- Smash this long into a short.
    This is what happens when shifts are subset of bit manipulation
    As opposed to having dedicated instructions for a lot of these cases
    (as in BJX2).

    See; mine are not dedicated, they just as easily perform

         struct { long i : 17,
                       j : 9,
                       k : 3,
                      ...      } st;
         short s = st.k;

         SLL     Rs,Rst,<3,26>


    Possibly, if one has a big enough immediate field to encode it.

    It is 12-bits, 2×6-bit fields.


    Yes, but 12-bits was bigger than the 9-bit fields I was originally
    using, or the Imm5 encodings in some other contexts.

    Granted, XG2 expands these to 10 and 6 bits.

    Or, could use a Jumbo encoding, or, ...

    In most cases, having EXT{S/U}.{B/W/L} works well enough, and deals with
    all of the common cases (and is faster than using a pair of shifts,
    particularly when these shifts each have a 2 cycle latency...).

    And here we have the classical chicken and egg problem.

    Bit fields are not as fast as {B,H,W,D} so few people use them;
    Bit fields are not well supported in ISA so few compilers optimize them;
    EVEN if they are ideal for the situation at hand.

    When the HW cost of properly supporting them is essentially free !!

  • From Anton Ertl@21:1/5 to BGB on Mon Jan 22 22:09:46 2024
    BGB <[email protected]> writes:
    Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
    spec had said to use "ORI" for these.

    What makes you think so? According to <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
    page 13:

    |ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler |pseudo-instruction.

    And on page 76:

    |C.LI expands into addi rd, x0, imm[5:0]

    C.LI is a separate instruction. I did not find anything about a
    non-compact LI, but given how C.LI expands (why does the ISA manual
    actually specify that?), I expect that LI is a pseudo-instruction that
    is actually "addi rd, x0, imm".
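
    For reference, a small C sketch of how MV and a 12-bit LI reduce to the
    same ADDI encoding. The I-type field layout and the OP-IMM opcode
    (0010011) come from the RISC-V base ISA; the function names are
    hypothetical, and a full LI pseudo-instruction for wider constants needs
    more than this single instruction.

```c
#include <stdint.h>

/* Encode ADDI rd, rs1, imm12 (I-type, funct3 = 000, opcode = 0x13). */
static uint32_t encode_addi(unsigned rd, unsigned rs1, int32_t imm12)
{
    return ((uint32_t)imm12 & 0xFFFu) << 20   /* imm[11:0] in bits 31..20 */
         | (rs1 & 31u) << 15                  /* rs1 in bits 19..15       */
         | (rd & 31u) << 7                    /* rd in bits 11..7         */
         | 0x13u;                             /* OP-IMM, funct3 = 0       */
}

/* MV rd, rs  ==  ADDI rd, rs, 0 */
static uint32_t encode_mv(unsigned rd, unsigned rs)
{
    return encode_addi(rd, rs, 0);
}

/* LI rd, imm (12-bit signed case only)  ==  ADDI rd, x0, imm */
static uint32_t encode_li12(unsigned rd, int32_t imm12)
{
    return encode_addi(rd, 0, imm12);
}
```

    For example, encode_mv(10, 11) yields 0x00058513, the encoding of
    "mv a0, a1".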

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to BGB-Alt on Tue Jan 23 00:49:11 2024
    BGB-Alt wrote:

    On 1/22/2024 4:09 PM, Anton Ertl wrote:
    BGB <[email protected]> writes:
    Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
    spec had said to use "ORI" for these.

    What makes you think so? According to
    <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
    page 13:

    |ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
    |pseudo-instruction.

    And on page 76:

    |C.LI expands into addi rd, x0, imm[5:0]

    C.LI is a separate instruction. I did not find anything about a
    non-compact LI, but given how C.LI expands (why does the ISA manual
    actually specify that?), I expect that LI is a pseudo-instruction that
    is actually "addi rd, x0, imm".


    OK.

    I had thought when I had looked it up, that it had said that these
    mapped to ORI.

    But, if it is ADDI, then GCC is behaving according to the spec.
    Either way, the end-result is the same in this case.

    In theory, could hack over these in the decoder by
    detecting/special-casing things when the immediate is 0 (to map these
    over to the MOV logic).
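
    The decoder special case described here could look roughly like the
    following C sketch. It only classifies an instruction word; the enum and
    function names are hypothetical, and how the MOV case is then routed in
    the actual BJX2 decoder is not something the thread specifies.

```c
#include <stdint.h>

enum addi_kind { NOT_ADDI, ADDI_MOV, ADDI_LI, ADDI_GENERAL };

/* Classify a 32-bit RISC-V instruction word.
   ADDI is opcode 0x13 with funct3 = 000. */
static enum addi_kind classify_addi(uint32_t insn)
{
    unsigned opcode = insn & 0x7Fu;
    unsigned funct3 = (insn >> 12) & 7u;
    unsigned rs1    = (insn >> 15) & 31u;
    unsigned imm    = insn >> 20;          /* raw 12-bit immediate field */

    if (opcode != 0x13u || funct3 != 0)
        return NOT_ADDI;
    if (rs1 == 0)
        return ADDI_LI;       /* addi rd, x0, imm : constant load      */
    if (imm == 0)
        return ADDI_MOV;      /* addi rd, rs, 0   : register move      */
    return ADDI_GENERAL;      /* a real add, goes through the ALU path */
}
```

    Only the ADDI_GENERAL case needs the adder, so the other two could avoid
    the 2-cycle ALU latency.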

    I made My 66000 have a MOV OpCode for a particular reason::
    {MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
    network--if your FUs are designed to put up with this as inputs.

    It would have not been "that hard" to just special case decode,
    but who is going to do this when the opcode set includes FP and
    SIMD that needs these ??

  • From MitchAlsup@21:1/5 to BGB-Alt on Thu Feb 1 03:01:36 2024
    BGB-Alt wrote:

    On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    <snip>

    Also it would appear as-if the scheduling is assuming 1-cycle ALU and
    2-cycle load, vs 2-cycle ALU and 3-cycle load.

    So, at least part of the problem is that GCC is generating code that is
    not ideal for my pipeline.
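
    The effect of a latency mismatch can be sketched with a toy in-order,
    single-issue model in C. This is an illustration under simplifying
    assumptions (full forwarding, one instruction issued per cycle, a result
    usable `latency` cycles after issue); struct op, run_cycles and demo_seq
    are hypothetical names, not part of the emulator mentioned in the thread.

```c
#include <stddef.h>

/* One instruction: destination and two source registers (0 = x0 / unused). */
struct op { int dst, src1, src2, is_load; };

/* Cycles needed to issue all ops when each waits for its sources. */
static unsigned long run_cycles(const struct op *ops, size_t n,
                                unsigned alu_lat, unsigned load_lat)
{
    unsigned long ready[32] = {0};   /* cycle at which each register is usable */
    unsigned long cycle = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long issue = cycle;
        if (ready[ops[i].src1] > issue) issue = ready[ops[i].src1];  /* interlock */
        if (ready[ops[i].src2] > issue) issue = ready[ops[i].src2];  /* interlock */
        if (ops[i].dst != 0)                                         /* x0 is never written */
            ready[ops[i].dst] = issue + (ops[i].is_load ? load_lat : alu_lat);
        cycle = issue + 1;
    }
    return cycle;
}

/* A dependent chain like the shift pair discussed earlier in the thread. */
static const struct op demo_seq[3] = {
    {11, 10, 0, 0},    /* slli x11, x10, 16  */
    {11, 11, 0, 0},    /* srai x11, x11, 16  */
    {12, 11, 13, 0},   /* add  x12, x11, x13 */
};
```

    In this model run_cycles(demo_seq, 3, 1, 2) takes 3 cycles, while
    run_cycles(demo_seq, 3, 2, 3) takes 5: code scheduled for 1-cycle ALU
    results pays one stall per dependent pair on a 2-cycle ALU.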

    Captain Obvious strikes again.

    Tried modeling what happens if RV64 had superscalar (in my emulator),
    and the interlock issue gets worse, as it then jumps up to around 23%-26%
    interlock penalty (mostly eating any gains that superscalar would
    bring). Where, it seems that superscalar (according to my CPU's rules)
    would bundle around 10-15% of the RV64 ops with '-O3' (or, around 8-12%
    with '-Os').

    You are running into the reasons CPU designers went OoO after the 2-wide
    in-order machine generation.

    On the other hand, disabling WEX in BJX2 causes interlock penalties to
    drop. So, it still maintains a performance advantage over RV, as the
    drop in MIPs score is smaller.

    Your compiler is tuned to your pipeline.
    But how do you tune your compiler to EVERY conceivable pipeline ??

    Otherwise, had started work on trying to get RV64G support working, as
    this would support a wider variety of programs than RV64IMA.



    In another experiment, had added logic to fold && and || operators to
    use bitwise arithmetic for logical expressions (in certain cases).
      If both the LHS and RHS represent logical expressions with no side
        effects;
      If the LHS and RHS are not "too expensive" according to a cost
        heuristic (past a certain size, it is cheaper to use short-circuit
        branching rather than ALU operations).

    Internally, this added various pseudo operators to the compiler:
      &&&, |||: Logical AND / OR expressed as bitwise.
      !& : !(a&b)
      !!&: !(!(a&b)), Normal TEST operator, with a logic result.
        Exists to be distinct from normal bitwise AND.
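
    At the source level the transformation amounts to the following, shown as
    a hedged C sketch with hypothetical function names. It is valid only
    because each operand is a side-effect-free comparison yielding 0 or 1; a
    guard such as p != NULL && *p > 0 must keep its short-circuit form.

```c
#include <stdbool.h>

/* Short-circuit form: up to three conditional branches. */
static bool cond_short_circuit(int x)
{
    return (x > 10 && x < 12) || x == 52;
}

/* Folded form: every comparison is evaluated, results combined with
   bitwise AND / OR, leaving a single value to branch on. */
static bool cond_bitwise(int x)
{
    return ((x > 10) & (x < 12)) | (x == 52);
}
```

    Both functions return the same value for every x; the folded form trades
    branches for a few extra ALU operations, which is why a cost heuristic is
    needed.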

    For the inexpensive cases, PRED was designed to handle the && and ||
    of HLLs.

  • From MitchAlsup@21:1/5 to Robert Finch on Thu Feb 1 03:10:42 2024
    Robert Finch wrote:

    On 2024-01-31 6:19 p.m., BGB-Alt wrote:
    On 1/22/2024 6:49 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    <snip>

    Did partly compensate for the code-size increase by adding some
    experimental 3R CMPxx ops:
      CMPQEQ, CMPQNE, CMPQGT, CMPQGE

    Currently only available in 64-bit forms, which can handle signed and
    unsigned 32-bit values along with signed 64-bit values (unsigned 64-bit
    would require a 3R CMPQHI instruction, and is less likely to be used as
    often).

    Where:
      CMPQEQ Rs, Rt, Rn
      CMPQNE Rs, Rt, Rn
      CMPQGT Rs, Rt, Rn
      CMPQGE Rs, Rt, Rn
    Does:
      Rn = (Rs == Rt);
      Rn = (Rs != Rt);
      Rn = (Rs >  Rt);
      Rn = (Rs >= Rt);
    Where, < and <= can be done by flipping the arguments.
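
    A C model of these semantics, as a sketch with hypothetical function
    names. It also shows why one signed 64-bit compare covers unsigned 32-bit
    operands: once zero-extended, they are non-negative 64-bit values, so
    signed and unsigned ordering agree. Full unsigned 64-bit values do not
    have this property, which is the case that would need CMPQHI.

```c
#include <stdint.h>

/* Rn = (Rs == Rt), Rn = (Rs > Rt), Rn = (Rs >= Rt) on signed 64-bit inputs. */
static int64_t cmpqeq(int64_t rs, int64_t rt) { return rs == rt; }
static int64_t cmpqgt(int64_t rs, int64_t rt) { return rs >  rt; }
static int64_t cmpqge(int64_t rs, int64_t rt) { return rs >= rt; }

/* "<" and "<=" come from swapping the operands. */
static int64_t cmpqlt(int64_t rs, int64_t rt) { return cmpqgt(rt, rs); }
static int64_t cmpqle(int64_t rs, int64_t rt) { return cmpqge(rt, rs); }

/* Unsigned 32-bit compare via the signed 64-bit op: zero-extend first. */
static int64_t cmp_u32_gt(uint32_t a, uint32_t b)
{
    return cmpqgt((int64_t)a, (int64_t)b);
}
```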

    These instructions are also called 'set' instructions in some
    architectures. Useful enough to include IMO. Q+ calls the 'ZSxx' for
    zero or set (from the MMIX CPU) so they are not confused with
    instructions that only set, which are called 'Sxx' instructions. I think
    the Itanium calls them CMPxx instructions. I have been experimenting
    with the option of having them cumulate values like the Itanium does.
    Needs more opcode bits though.

    Q+ has
    Rt = (Ra==Rb) ? Rc : 0; // ZSEQ
    Rt = (Ra==Rb) ? Imm8 : 0;
    Rt = (Ra==Rb) ? Rc : Rt; // SEQ
    Rt = (Ra==Rb) ? Imm8 : Rt;
    Plus other ops besides ==

    My 66000 has compare instructions that generate a bit-vector of output
    conditions:: one for all forms of integer, and one for FP. In the case of
    FP, it generates a set bit when NaN comparisons should go to the
    else-clause and a different bit when that same comparison should deliver
    NaNs to the then-clause. This enables the compiler to flip the
    then-else-clauses when it chooses to do so.

    In addition, the integer version has range comparisons
    (0 <[=] Rs1 <[=] Rs2) for array limit comparisons. Any Byte or Any Half
    and Either Word comparisons can be added later should anyone choose, but
    it looks like for now VVM supersedes these needs.

    Thus I have 1 integer CMP and one FP CMP instruction rather than a
    multitude.

    If you want True/False, you can extract the bit you want::

        CMP     Rt,Rs1,Rs3
        SLL     Rd,Rt,<1,EQ>     // {0, +1}
        SLLs    Re,Rt,<1,EQ>     // {0, -1}
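
    The compare-then-extract idea can be sketched in C as below. The bit
    positions in the enum are hypothetical placeholders (the thread does not
    give the real My 66000 assignments), as are the function names; the point
    is only that a 1-bit unsigned extract yields {0, +1} and a 1-bit signed
    extract yields {0, -1}.

```c
#include <stdint.h>

/* Hypothetical bit assignments for the condition vector. */
enum { BIT_EQ = 0, BIT_NE = 1, BIT_GT = 2, BIT_GE = 3, BIT_LT = 4, BIT_LE = 5 };

/* One compare produces every integer relation at once. */
static uint64_t cmp_vector(int64_t a, int64_t b)
{
    return (uint64_t)(a == b) << BIT_EQ
         | (uint64_t)(a != b) << BIT_NE
         | (uint64_t)(a >  b) << BIT_GT
         | (uint64_t)(a >= b) << BIT_GE
         | (uint64_t)(a <  b) << BIT_LT
         | (uint64_t)(a <= b) << BIT_LE;
}

/* 1-bit zero-extending extract: result is 0 or +1. */
static uint64_t pick_unsigned(uint64_t vec, unsigned bit)
{
    return (vec >> bit) & 1u;
}

/* 1-bit sign-extending extract: result is 0 or -1 (an all-ones mask). */
static int64_t pick_signed(uint64_t vec, unsigned bit)
{
    return -(int64_t)((vec >> bit) & 1u);
}
```

    The {0, -1} form is convenient as a mask, for example to select between
    two values with AND / OR instead of a branch.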


    The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if it
    will use the expansion to Imm6u or Imm6s in XG2 mode). Currently these
    have a comparably lower hit rate.

    It is less clear if the "better" fallback case is to load a constant
    into a register and use the 3R CMPxx ops, or to fall-back to the
    original CMPxx+MOVT/MOVNT.

    At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
    (though, the 3R CMPxx fallback is likely to be better when the value
    falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).

    ...



  • From MitchAlsup@21:1/5 to BGB on Fri Feb 2 19:39:47 2024
    BGB wrote:

    On 2/1/2024 5:04 AM, Robert Finch wrote:

    integer main(integer argc, char* argv[])
    begin
        integer x;

        for (x = 1; x < 10; x++) begin
            if (argc > 10 and argc < 12 or argc==52)
                puts("Hello World!\n");
        end
    end

        .sdreg    29
    _main:
      enter 2,32
      ldo s1,32[fp]
    ; for (x = 1; x < 10; x++) begin
      ldi s0,1
      ldi t1,10
      bge s0,t1,.00039
    .00038:
    ; if (argc > 10 and argc < 12 or argc==52)
      zsgt t1,s1,10,1
      zslt t2,s1,12,1
      zseq t3,s1,52,1
      and_or t0,t1,t2,t3
      beqz t0,.00041
    ; puts("Hello World!\n");
      sub sp,sp,8
      lda t0,_main.00016[gp]
      orm t0,_main.00016
      sto t0,0[sp]
      bsr _puts
    .00041:
    .00040:
      ldi t1,10
      iblt s0,t1,.00038
    .00039:
    .00037:
      leave 2,16
        .type    _main,@function
        .size    _main,$-_main


    Hmm...

    Possible I guess, but 4R ALU ops isn't something my CPU can do as-is,
    and I am not sure it would be used enough to make it worthwhile.


    Though, did go and try a different strategy:
    I noted while skimming the SiFive S76 docs that it specified some
    constraints on the timing of various ops. Memory Load timing depended on
    what was being loaded, as did ALU timing.

    This gave me an idea.


    I could add a "fast path" to the L1 cache where, if the memory access
    satisfied certain requirements, it would be reduced to 2 cycle latency:
      Aligned-Only, 32 or 64 bit Load;
      Normal RAM access (not MMIO or similar);
      Does not trigger a "read-after-write" dependency;
      ...
    This case allowing for cheaper memory access logic which doesn't kill
    the timing (if the result is forwarded directly to the pipeline).
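
    The listed conditions can be written as a single predicate. The C sketch
    below covers only the three conditions named above (the elided ones are
    not guessed at); the function and parameter names are hypothetical and the
    MMIO and hazard flags are assumed to be computed elsewhere in the
    pipeline.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a load may take the 2-cycle L1 fast path. */
static bool l1_fast_path_ok(uint64_t addr, unsigned size_bytes,
                            bool is_mmio, bool raw_hazard)
{
    /* Only 32-bit or 64-bit loads qualify. */
    bool size_ok = (size_bytes == 4 || size_bytes == 8);
    /* Naturally aligned: low address bits below the access size are zero. */
    bool aligned = size_ok && (addr & (uint64_t)(size_bytes - 1)) == 0;
    /* Normal RAM only, and no pending store to the same location. */
    return aligned && !is_mmio && !raw_hazard;
}
```

    Anything that fails the predicate simply keeps the existing 3-cycle path,
    so the fast path never has to be correct for the hard cases.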

    The above was a question posed to me while interviewing with HP in 1988.

    The right answer is:: "Do nothing that harms the frequency of the
    pipeline". {{Which you may or may not be doing to yourself}}

    The second right answer is:: "Do nothing that adds 1 to the exponent
    of test vector complexity". {{Which you invariably are doing to yourself}}

    Basically, in this case, the L1D$ has an alternate output that is
    directed to EX2 with a flag that encodes whether the value is valid. It
    does not replace the logic in EX3, mostly because (unless something has
    gone terribly wrong), both should always give the same output value.


    Also an alternate "fast case ALU", which reduces ALU to 1-cycle for a
    few common cases:
      ADD{S/U}L, SUB{S/U}L
      ADD/SUB if the input values fall safely into signed 32-bit range.
        Currently +/- 2^30, as this can't overflow the signed 32-bit.
        Skips 64-bit mostly because low-latency 64-bit ADD is harder.
      AND/OR/XOR
        These handle full 64-bit though.

    Currently, ignores all the other operations, and currently applies only
    to Lane 1. As with Load, it doesn't modify the logic in EX2 mostly
    because both should always produce the same result.
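
    The range check behind the fast ADD/SUB case can be sketched in C. This
    assumes the "+/- 2^30" window is the half-open interval [-2^30, 2^30); the
    exact bounds used in the hardware are not stated, and the function names
    are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when v lies in [-2^30, 2^30). */
static bool in_fast_range(int64_t v)
{
    return v >= -(INT64_C(1) << 30) && v < (INT64_C(1) << 30);
}

/* The 1-cycle path applies only if both inputs are in range. Their sum
   (or difference) then lies strictly inside the signed 32-bit range, so a
   narrow 32-bit adder gives the same result as the full 64-bit one. */
static bool fast_add_applies(int64_t a, int64_t b)
{
    return in_fast_range(a) && in_fast_range(b);
}
```

    The check looks only at the upper bits of each operand, which keeps it off
    the adder's critical path.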


    ....
