• 88xxx or PPC

    From Brett@21:1/5 to [email protected] on Sat Mar 2 23:28:41 2024
    MitchAlsup1 <[email protected]> wrote:
    Paul A. Clayton wrote:

    On 1/25/24 10:22 AM, Anton Ertl wrote:
    [snip]
    I think the commonly understood meaning is that
    all instructions start their execution in-order (i.e., none goes to a
    functional unit earlier than an architecturally earlier instruction).
    Execution can overlap.

    What about a skewed pipeline? A simple skewed pipeline that
    statically assigned operations to a pipeline-stage/execution unit
    has been called in-order (in what I have read). A "second-chance"
    pipeline (where many operations can dynamically choose the
    pipeline stage based on operand availability) involves dynamic
    scheduling (so would seem to fall into out-of-order), but
    counterflow pipelines ("Counterflow Pipeline Processor
    Architecture", Robert F. Sproull et al., 1994) — which are more
    extreme in some ways than pipelines that have two stages in which
    operations can start — are stated to have "No overtaking.
    Instructions must stay in program order in the instruction
    pipeline.", which sounds "in-order" (and the paper was written by
    people working at Sun Microsystems).

    (I thought counterflow pipelines were weird. Simplifying
    communication makes sense, but ...)

    I get the impression that early PowerPC out-of-order execution
    implementations were really very similar to using the forwarding
    network for out-of-order completion (with in-order writeback). If
    I recall correctly, renaming was done by appending a version to
    the architectural register name and operands would be captured as
    soon as they were available rather than passing along the pipeline
    with forwarding until the writeback stage.

    This sounds more like Mc 88110 rather than PPC 620.

    PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
    die area. Other things may have been jettisoned at this shrink of design
    point. The 620 was originally targeted to be equal to Mc 88120 which was
    a 6-wide GBOoO machine full Tomasulo with precise exceptions and 4 external
    busses named {Data Out, Data In, Address Out, Address In}

    Address Out was used for cache misses to bring data to the CPU
    Data Out was used for cache victims to send data to DRAM
    Data In was used by arriving DRAM data
    Address In was used for arriving Snoops

    Smart externals could use Data In to send the CPU data before it knew it
    needed it. That data could be code or data.


    So which was better, your baby the 88xxx or PPC?
    Pick your metric: die size, performance, heat, upgrade path, other.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Mar 2 23:51:20 2024
    Brett wrote:

    MitchAlsup1 <[email protected]> wrote:


    This sounds more like Mc 88110 rather than PPC 620.

    PPC was shrunk from 6-wide to 4-wide in order to fit in the acceptable
    die area. Other things may have been jettisoned at this shrink of design
    point. The 620 was originally targeted to be equal to Mc 88120 which was
    a 6-wide GBOoO machine full Tomasulo with precise exceptions and 4 external
    busses named {Data Out, Data In, Address Out, Address In}

    Address Out was used for cache misses to bring data to the CPU
    Data Out was used for cache victims to send data to DRAM
    Data In was used by arriving DRAM data
    Address In was used for arriving Snoops

    Smart externals could use Data In to send the CPU data before it knew it
    needed it. That data could be code or data.


    So which was better, your baby the 88xxx or PPC?
    Pick your metric: die size, performance, heat, upgrade path, other.


    I am yet to see a CPU with the performance we got on the 88120 simulator.
    In 1992 we were getting 2.0 IPC on things like XLISP and 5.9 IPC on
    Matrix300.

  • From Anton Ertl@21:1/5 to [email protected] on Sun Mar 3 08:21:58 2024
    [email protected] (MitchAlsup1) writes:
    Brett wrote:
    So which was better, your baby the 88xxx or PPC?
    Pick your metric: die size, performance, heat, upgrade path, other.

    If the metric is availability, the PPC 620 was weak (only produced in
    small numbers, because its performance was disappointing by the time
    it finally appeared), but the 88120 is weaker (never produced).

    I am yet to see a CPU with the performance we got on the 88120 simulator.
    In 1992 we were getting 2.0 IPC on things like XLISP and 5.9 IPC on
    Matrix300.

    I have recently measured actual hardware, using performance counters
    on a number of Forth programs running on Gforth (matrix is based on
    intmm in the Stanford integer benchmarks, not on matrix300). You can
    find the results in

    https://www.complang.tuwien.ac.at/anton/tmp/select-ipc-uarch.eps
    https://www.complang.tuwien.ac.at/anton/tmp/opt-ipc-uarch.eps

    The first graphic shows fewer microarchitectures, but makes the
    benchmarks identifiable; it also shows the results without (left of
    two bars for one benchmark) and with (right) the optimization the paper
    is about; the optimization also reduces instructions, so the results
    usually are still faster even in those cases where the optimization
    reduces IPC.

    The second graphic contains only the data with the optimization; it
    shows more microarchitectures, and sorts the results for each
    microarchitecture by IPC (making the benchmarks unidentifiable), but
    it shows the differences between the microarchitectures nicely.
    It is remarkable that Goldencove/Raptorcove not only have a large IPC, but
    also very short cycles (clocks up to 6GHz in current products; for the
    Core i3-1315U on which this was measured, Intel promises 4.5GHz, but
    the NUC that I measured just delivers 3.8GHz; but that is still plenty
    of speed for a middle-end low-power product).

    One interesting case is the K8 (released 2003, although this one is
    from the 2005 incarnation Athlon X2 4400+), in which Mitch Alsup also
    had a hand. We can see how far we have come since those days.

    I don't think that it's plausible that the 88120, which would have
    appeared in the mid-1990s, would perform as well as or better than
    Goldencove on this workload. My guess is that it would have had to
    undergo a silicon diet like the PPC620, probably even more so, because
    it was to appear earlier, which probably would have meant fewer
    transistors, which would have reduced the matrix300 IPC and, probably
    to a lesser extent, the Xlisp IPC.

    Also, the question is how fast the result would clock. The 88110 was
    available in 1992 at 50MHz, in the same year as the 200MHz 21064.
    When would the 88120 have been available at what clock rate? The
    PPC620 was available in 1997 at up to 150MHz, while the Pentium II
    Klamath was available in 1997 at clock rates up to 300MHz, and the
    (in-order) 21164a was available at 600MHz; my guess is that the 21164a
    could also produce good matrix300 numbers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to [email protected] on Sun Mar 3 16:55:33 2024
    On Sat, 2 Mar 2024 23:51:20 +0000
    [email protected] (MitchAlsup1) wrote:

    Brett wrote:

    MitchAlsup1 <[email protected]> wrote:


    This sounds more like Mc 88110 rather than PPC 620.

    PPC was shrunk from 6-wide to 4-wide in order to fit in the
    acceptable die area. Other things may have been jettisoned at this
    shrink of design point. The 620 was originally targeted to be
    equal to Mc 88120 which was a 6-wide GBOoO machine full Tomasulo
    with precise exceptions and 4 external busses named {Data Out,
    Data In, Address Out, Address In}

    Address Out was used for cache misses to bring data to the CPU
    Data Out was used for cache victims to send data to DRAM
    Data In was used by arriving DRAM data
    Address In was used for arriving Snoops

    Smart externals could use Data In to send the CPU data before it
    knew it needed it. That data could be code or data.


    So which was better, your baby the 88xxx or PPC?
    Pick your metric: die size, performance, heat, upgrade path, other.



    I am yet to see a CPU with the performance we got on the 88120
    simulator. In 1992 we were getting 2.0 IPC on things like XLISP and
    5.9 IPC on Matrix300.

    I can't find information about Matrix300.
    It seems to be part of SPEC89 FP suite, but spec.org does not provide
    info about anything older than SPEC92.
    Can you tell me what exactly does it do?

  • From Anton Ertl@21:1/5 to Michael S on Sun Mar 3 16:33:45 2024
    Michael S <[email protected]> writes:
    I can't find information about Matrix300.
    It seems to be part of SPEC89 FP suite, but spec.org does not provide
    info about anything older than SPEC92.
    Can you tell me what exactly does it do?

    It's 300x300 FP matrix multiply (not sure if single or double). There
    was a company that had a tool (famous at the time, but I don't
    remember the name) that could transform the original source code into
    a cache-blocked variant, which typically ran at the limits imposed by
    the FUs. Eventually everyone used that tool in their compiler to get
    good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
    SPEC92.
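
For readers who have not seen the transformation, here is a minimal sketch of what a cache-blocked (tiled) matrix multiply looks like. It is not the output of the tool mentioned above; the function name `matmul_blocked` and the tile edge `BS` are assumptions chosen for illustration, and the matrices are taken to be square and row-major.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tile edge; a real compiler would derive it from the cache size. */
enum { BS = 32 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* c (n x n) = a (n x n) * b (n x n), row-major, processed tile by tile so
   that each BS x BS tile of b is reused while it is still in the cache. */
void matmul_blocked(const double *a, const double *b, double *c, size_t n)
{
    assert(a && b && c);
    memset(c, 0, n * n * sizeof(double));
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* multiply one tile of a by one tile of b */
                for (size_t i = ii; i < min_sz(ii + BS, n); i++)
                    for (size_t k = kk; k < min_sz(kk + BS, n); k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + BS, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

The arithmetic is the same as in the untiled loop nest; only the order in which the elements are visited changes, which is why the transformation could be applied mechanically.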

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Mar 3 18:26:10 2024
    Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:

    I don't think that it's plausible that the 88120, which would have
    appeared in the mid-1990s, would perform as well as or better than
    Goldencove on this workload. My guess is that it would have had to
    undergo a silicon diet like the PPC620, probably even more so, because
    it was to appear earlier, which probably would have meant fewer
    transistors, which would have reduced the matrix300 IPC and, probably
    to a lesser extent, the Xlisp IPC.

    Also, the question is how fast the result would clock. The 88110 was
    available in 1992 at 50MHz, in the same year as the 200MHz 21064.
    When would the 88120 have been available at what clock rate? The

    We were on schedule when I left in 1992 to be in silicon by 1995.
    We were targeting 100 MHz.
    We did not get far enough to know the die-size*.

    (*) Motorola was offering us 0.5µ BiCMOS. SPICE simulations indicated
    that the wire attached to the emitters was going to be susceptible to
    crystal migration due to current density. We (our project) were only
    using the bipolars in sense amplifiers and not in normal gates.

    PPC620 was available in 1997 at up to 150MHz, while the Pentium II
    Klamath was available in 1997 at clock rates up to 300MHz, and the
    (in-order) 21164a was available at 600MHz; my guess is that the 21164a
    could also produce good matrix300 numbers.

    - anton

  • From MitchAlsup1@21:1/5 to Michael S on Sun Mar 3 18:18:07 2024
    Michael S wrote:

    On Sat, 2 Mar 2024 23:51:20 +0000
    [email protected] (MitchAlsup1) wrote:

    Brett wrote:

    MitchAlsup1 <[email protected]> wrote:


    This sounds more like Mc 88110 rather than PPC 620.

    PPC was shrunk from 6-wide to 4-wide in order to fit in the
    acceptable die area. Other things may have been jettisoned at this
    shrink of design point. The 620 was originally targeted to be
    equal to Mc 88120 which was a 6-wide GBOoO machine full Tomasulo
    with precise exceptions and 4 external busses named {Data Out,
    Data In, Address Out, Address In}

    Address Out was used for cache misses to bring data to the CPU
    Data Out was used for cache victims to send data to DRAM
    Data In was used by arriving DRAM data
    Address In was used for arriving Snoops

    Smart externals could use Data In to send the CPU data before it
    knew it needed it. That data could be code or data.


    So which was better, your baby the 88xxx or PPC?
    Pick your metric: die size, performance, heat, upgrade path, other.



    I am yet to see a CPU with the performance we got on the 88120
    simulator. In 1992 we were getting 2.0 IPC on things like XLISP and
    5.9 IPC on Matrix300.

    I can't find information about Matrix300.
    It seems to be part of SPEC89 FP suite, but spec.org does not provide
    info about anything older than SPEC92.
    Can you tell me what exactly does it do?

    It runs DGEMM for 300×300 sized matrixes and was designed as a Great Big
    number crunching thingamajig.

    It fell out of favor because by 1995 it fit in L2 cache sizes.

  • From Michael S@21:1/5 to Anton Ertl on Sun Mar 3 20:30:52 2024
    On Sun, 03 Mar 2024 16:33:45 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    I can't find information about Matrix300.
    It seems to be part of SPEC89 FP suite, but spec.org does not provide
    info about anything older than SPEC92.
    Can you tell me what exactly does it do?

    It's 300x300 FP matrix multiply (not sure if single or double). There
    was a company that had a tool (famous at the time, but I don't
    remember the name) that could transform the original source code into
    a cache-blocked variant, which typically ran at the limits imposed by
    the FUs. Eventually everyone used that tool in their compiler to get
    good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
    SPEC92.

    - anton

    So, in today's world it would be something like "How fast can you do
    DGEMM with 7 out of your 8 [SIMD] hands tied behind your back?"
    Or, maybe, more than 7 if your variant of AMX supports double
    precision.
    The challenge is funny, but the answer is not particularly useful.
    But even in that not particularly useful answer, IPC appears to be the
    least useful part. Far worse than FLOPS/Hz.

  • From Anton Ertl@21:1/5 to Michael S on Sun Mar 3 22:22:37 2024
    Michael S <[email protected]> writes:
    But even in that not particularly useful answer IPC appears to be the
    least useful part. Far worse than FLOPS/Hz.

    Those were the days before SIMD, so IPC told you a little about
    FLOPS/Hz. I think, though, that you look at it from the other end.
    You are asking: Is that a number I want to know for evaluating DGEMM
    performance on the 88120? But Mitch Alsup and the other people were
    probably thinking: this uarchitecture can never exceed 6 IPC, so
    getting 5.9 IPC on an actual SPEC CPU89 benchmark is pretty good. And
    matrix300 with its mixture of FP adds, FP muls, loads, stores, address
    arithmetic, and control, i.e., making relatively balanced use of many
    FUs, is a pretty good benchmark for getting high numbers of this kind.

    I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
    on matrix300, while its 2 integer units would limit it to much lower
    performance on, say, intmm.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Michael S on Sun Mar 3 22:41:03 2024
    Michael S wrote:

    On Sun, 03 Mar 2024 16:33:45 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    I can't find information about Matrix300.
    It seems to be part of SPEC89 FP suite, but spec.org does not provide
    info about anything older than SPEC92.
    Can you tell me what exactly does it do?

    It's 300x300 FP matrix multiply (not sure if single or double). There
    was a company that had a tool (famous at the time, but I don't
    remember the name) that could transform the original source code into
    a cache-blocked variant, which typically ran at the limits imposed by
    the FUs. Eventually everyone used that tool in their compiler to get
    good SPEC89 results. As a consequence, SPEC eliminated matrix300 in
    SPEC92.

    - anton

    So, in today's world it would be something like "How fast can you do
    DGEMM with 7 out of your 8 [SIMD] hands tied behind your back?"
    Or, maybe, more than 7 if your variant of AMX supports double
    precision.

    xGEMM supports transposes of the input matrixes; D stands for Double
    precision, S stands for Single precision. Matrix300 used DGEMM.

    SIMD would only support 2 of the 8 calls to DGEMM where the transposes
    are not {'N', 'n', 'T', 't', 'C', or 'c'}. SIMD would do nothing for
    the 6 transpose calls. It should also be noted that the transposed
    matrixes have significantly worse cache behavior than the non-transposed
    version as each access is to a different cache line.
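
To make the access patterns concrete, here is a minimal sketch of a multiply with per-operand transpose flags. It is not the BLAS DGEMM interface: the name `gemm_square`, the integer flags, the square shape and the row-major layout are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Element (r, c) of op(M) for a square n x n row-major matrix:
   op(M) = M when trans is 0, op(M) = M transposed otherwise. */
static double op_elem(const double *m, size_t n, int trans, size_t r, size_t c)
{
    return trans ? m[c * n + r] : m[r * n + c];
}

/* c = op(a) * op(b) for square n x n row-major matrices. */
void gemm_square(int trans_a, int trans_b, size_t n,
                 const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += op_elem(a, n, trans_a, i, k) * op_elem(b, n, trans_b, k, j);
            c[i * n + j] = sum;
        }
}
```

For a fixed (i, j), the k loop reads `a[i*n+k]` contiguously when `trans_a` is 0 but reads `a[k*n+i]`, a stride of n doubles, when `trans_a` is nonzero; the opposite holds for `b`. A strided walk touches a new cache line on every access once the stride reaches the line size.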

    The problem is that typical SIMD does not support the kinds of transposes
    xGEMM performs. That is, the problem is not the transposes; the problem
    is naïve SIMD.

    So, postulate that one can SIMD the non-transposed loop and gain 4×.
    The other 3 loops get 1× for an overall gain of <less than> 25%;
    where the "less than" is due to the cache and TLB effects.
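
The "less than 25%" figure can be checked with a small Amdahl's-law calculation. This is a sketch under the assumption that the four loops take equal time before SIMD and that cache and TLB effects are ignored; the function name `overall_speedup` is made up for the example.

```c
#include <assert.h>

/* Overall speedup when `fast` of `total` equally long loops are sped up by
   `factor` and the remaining loops run at their old speed (Amdahl's law). */
double overall_speedup(int total, int fast, double factor)
{
    assert(total > 0 && fast >= 0 && fast <= total && factor > 0.0);
    double old_time = (double)total;
    double new_time = (double)(total - fast) + (double)fast / factor;
    return old_time / new_time;
}
```

With `overall_speedup(4, 1, 4.0)` the time drops from 4 units to 3.25, a speedup of about 1.23×, which is already under 25% before any cache or TLB penalty is counted.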

    A TLB with as few as 24 entries FA gets 100% hit rate in the
    non-transpose case, and poor performance on any (all) of the transpose
    cases; in the dual transpose case, the TLB takes a miss every other
    cache access. Here a 256-entry DM TLB gets 100% hit rate where a
    64-entry FA TLB is getting close to zero hit rate.

    The challenge is funny, but the answer is not particularly useful.

    xGEMM is likely the second most used GB-math number crunching
    subroutine in use--FFT <flavors> being the most used.

    But even in that not particularly useful answer IPC appears to be the
    least useful part. Far worse than FLOPS/Hz.

    Correct, and this illustrates how times have changed. In 1985 Matrix300
    would use * and + as separate instructions. The major loop consists of
    2 LDs, 1 *, 1 +, 1 ST, and an ADD-CMP-BC which could be distributed over
    the 4-way unrolled loop (in source code). The Mc 88110 compiler would
    produce a 24-instruction loop (non-transposed) and the Mc 88120
    simulator would run this loop (including DRAM accesses for cache misses
    and cache victims, and TLB table-walking) in 4 cycles. Today, FMAC is
    <fused into> 1 instruction, saving instruction count and calculation
    latency (8->5), allowing Matrix300 to fit in a 78-instruction execution
    window instead of a 96-instruction EW. {{But you still have to count
    FMAC as 2 FLOPs.}}
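
As an illustration of the instruction mix described above, here is a sketch of a 4-way unrolled loop with 2 loads, 1 multiply, 1 add and 1 store per element. It assumes the inner loop has the DAXPY shape `y[j] += a * x[j]`; the thread does not show the actual Matrix300 source, and the name `daxpy_unrolled4` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y[j] += a * x[j], unrolled by 4 in source.  Each element costs 2 loads
   (x[j], y[j]), 1 multiply, 1 add and 1 store; the index add, compare and
   branch are paid once per 4 elements.  In this sketch n must be a
   multiple of 4. */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    assert(n % 4 == 0);
    for (size_t j = 0; j < n; j += 4) {
        y[j]     += a * x[j];
        y[j + 1] += a * x[j + 1];
        y[j + 2] += a * x[j + 2];
        y[j + 3] += a * x[j + 3];
    }
}
```

By this count the four unrolled elements account for 20 memory and floating-point instructions, with the loop overhead shared among them. The post's figure of 24 instructions in 4 cycles corresponds to 6 instructions per cycle, the peak of a 6-wide machine.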

    On My 66000, using VEC-LOOP, the instruction count goes down again to 5 (from
    6) per loop since LOOP is performing ADD-CMP-BC in a single instruction
    and in a single clock.

  • From Anton Ertl@21:1/5 to [email protected] on Mon Mar 4 08:52:09 2024
    [email protected] (MitchAlsup1) writes:
    SIMD would only support 2 of the 8 calls to DGEMM where the transposes
    are not {'N', 'n', 'T', 't', 'C', or 'c'}. SIMD would do nothing for
    the 6 transpose calls.

    The following version is good for the non-transposed matrices (it's
    not DGEMM, but the difference to DGEMM is left as exercise):

    #include <stddef.h>
    #include <string.h>

    /* c (n x p) = a (n x m) * b (m x p), all row-major */
    void matmul(double a[], double b[], double c[], size_t m, size_t n, size_t p)
    {
      size_t i,j,k;
      memset(c,0,n*p*sizeof(double));
      for (i=0; i<n; i++)
        for (k=0; k<m; k++)
          for (j=0; j<p; j++)
            c[i*p+j] += a[i*m+k]*b[k*p+j];
    }

    But the loops are interchangeable, and the naive i,j,k order is bad
    for SIMD as well as introducing a recurrence on c[i*p+j] in the inner
    loop. Given the amount of parallelism inherent in matrix
    multiplication, I would be surprised if transposing some or all
    of the involved matrices ruled out every loop order or other
    transformation that allows SIMD. In the extreme case, you just
    transpose an appropriate input matrix at the start, or the result at
    the end, at O(n^2) effort (for n*n matrices), while the matrix
    multiply itself takes O(n^3) effort.
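
The "extreme case" can be sketched as follows: if B is only available in transposed form, undo the transposition once and then run the SIMD-friendly i,k,j loop unchanged. The function names and the scratch-buffer approach are assumptions made for illustration; a tuned library would more likely pick a different loop order than copy the matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* dst (cols x rows) = transpose of src (rows x cols); O(rows*cols) work. */
void transpose(const double *src, double *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* c (n x p) = a (n x m) * B, where B (m x p) is supplied transposed as
   bt (p x m).  Returns 0 on success, -1 if the scratch buffer cannot be
   allocated. */
int matmul_b_transposed(const double *a, const double *bt, double *c,
                        size_t n, size_t m, size_t p)
{
    assert(a && bt && c);
    double *b = malloc(m * p * sizeof *b);
    if (b == NULL)
        return -1;
    transpose(bt, b, p, m);            /* undo the transposition once */
    memset(c, 0, n * p * sizeof *c);
    for (size_t i = 0; i < n; i++)     /* same i,k,j order as matmul above */
        for (size_t k = 0; k < m; k++)
            for (size_t j = 0; j < p; j++)
                c[i * p + j] += a[i * m + k] * b[k * p + j];
    free(b);
    return 0;
}
```

The copy costs m*p element moves, while the multiply performs n*m*p multiply-adds, so for large square matrices the transposition is a small fraction of the total work.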

    But of course, for the 88120 that was a non-issue, because it did not
    have SIMD.

    It should also be noted that the transposed
    matrixes have significantly worse cache behavior than the non-transposed
    version as each access is to a different cache line.

    That was fixed by the cache-blocking transformation that everybody
    used, and which resulted in the elimination of matrix300 from SPEC92.

    It was not the cache sizes, which could easily have been addressed by
    modifying matrix300 into, say, matrix2000.

    My 66000, using VEC-LOOP the instruction count goes down again to 5 (from
    6) per loop since LOOP is performing ADD-CMP-BC in a single instruction
    and in a single clock.

    If the programmer (or compiler) for My66000 does not process the
    elements in the favoured order, it will not perform particularly well
    for arbitrary transpositions, either. I don't think that you want to
    perform these program transformations in hardware, do you?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Mon Mar 4 17:14:57 2024
    On Sun, 03 Mar 2024 22:22:37 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    But even in that not particularly useful answer IPC appears to be the
    least useful part. Far worse than FLOPS/Hz.


    I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
    on matrix300, while its 2 integer units would limit it to much lower
    performance on, say, intmm.

    - anton

    I don't know about specific case of matrix300 and what transformations
    are allowed by SPEC rules and what not, but if I were tasked with
    writing generic DGEMM for Alpha 21164 with maximal performance on
    non-small relatively square matrices as a main goal, then I'd start
    with something like that:

    // innermost_loop_3x6 - multiply 3 rows of A[][]
    // by 6 columns of B[][] assuming C-language order
    void innermost_loop_3x6(
      const double* A, int lda,
      const double* B, int ldb,
      double* C, int ldc,
      int n)
    {
      const double* A0 = A;
      const double* A1 = A0 + lda;
      const double* A2 = A1 + lda;
      double acc00 = 0, acc10 = 0, acc20 = 0;
      double acc01 = 0, acc11 = 0, acc21 = 0;
      double acc02 = 0, acc12 = 0, acc22 = 0;
      double acc03 = 0, acc13 = 0, acc23 = 0;
      double acc04 = 0, acc14 = 0, acc24 = 0;
      double acc05 = 0, acc15 = 0, acc25 = 0;

      for (int i = 0; i < n; ++i) {
        double a0 = A0[i];
        double a1 = A1[i];
        double a2 = A2[i];
        double b;
        b = B[0]; acc00 += a0 * b; acc10 += a1 * b; acc20 += a2 * b;
        b = B[1]; acc01 += a0 * b; acc11 += a1 * b; acc21 += a2 * b;
        b = B[2]; acc02 += a0 * b; acc12 += a1 * b; acc22 += a2 * b;
        b = B[3]; acc03 += a0 * b; acc13 += a1 * b; acc23 += a2 * b;
        b = B[4]; acc04 += a0 * b; acc14 += a1 * b; acc24 += a2 * b;
        b = B[5]; acc05 += a0 * b; acc15 += a1 * b; acc25 += a2 * b;
        B += ldb;
      }

      double* C0 = C;
      double* C1 = C0 + ldc;
      double* C2 = C1 + ldc;
      C0[0] += acc00; C1[0] += acc10; C2[0] += acc20;
      C0[1] += acc01; C1[1] += acc11; C2[1] += acc21;
      C0[2] += acc02; C1[2] += acc12; C2[2] += acc22;
      C0[3] += acc03; C1[3] += acc13; C2[3] += acc23;
      C0[4] += acc04; C1[4] += acc14; C2[4] += acc24;
      C0[5] += acc05; C1[5] += acc15; C2[5] += acc25;
    }

    The loop consists of 9 loads, 4 pointer updates,
    1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
    DP adds. 51 total instructions.
    Ideally, it will run in 18 clocks, for IPC = 2.83. Realistically, on
    real hardware with cache misses etc. it will take 20-23 clocks and IPC
    would be proportionally lower.
    What is my point? My point is that I expect a "medium-IPC" kernel like
    the one above to achieve higher FLOPS (== better performance) than a
    simpler, smaller kernel with IPC in excess of 3.5.
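
The instruction count and the IPC figures above can be reproduced with a few lines of arithmetic. The struct and function names below are made up for this sketch; the per-iteration counts are the ones given in the post.

```c
#include <assert.h>

/* Per-iteration instruction mix of a loop body. */
struct loop_mix {
    int loads, ptr_updates, counter_ops, branches, fmuls, fadds;
};

int total_instructions(struct loop_mix m)
{
    return m.loads + m.ptr_updates + m.counter_ops + m.branches
         + m.fmuls + m.fadds;
}

/* Instructions per cycle if one iteration takes `cycles` clocks. */
double ipc(struct loop_mix m, int cycles)
{
    assert(cycles > 0);
    return (double)total_instructions(m) / cycles;
}

/* Floating-point operations per cycle for the same iteration time. */
double flops_per_cycle(struct loop_mix m, int cycles)
{
    assert(cycles > 0);
    return (double)(m.fmuls + m.fadds) / cycles;
}
```

For the mix `{9, 4, 1, 1, 18, 18}` this gives 51 instructions, IPC 2.83 and 2 FLOPs per cycle at 18 clocks. At 20 to 23 clocks the IPC falls to between 2.55 and about 2.22, and the FLOPs per cycle to between 1.8 and about 1.57.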

  • From Anton Ertl@21:1/5 to Michael S on Mon Mar 4 18:18:35 2024
    Michael S <[email protected]> writes:
    On Sun, 03 Mar 2024 22:22:37 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    But even in that not particularly useful answer IPC appears to be the
    least useful part. Far worse than FLOPS/Hz.


    I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
    on matrix300, while its 2 integer units would limit it to much lower
    performance on, say, intmm.

    - anton

    I don't know about specific case of matrix300 and what transformations
    are allowed by SPEC rules and what not, but if I were tasked with
    writing generic DGEMM for Alpha 21164 with maximal performance on
    non-small relatively square matrices as a main goal, then I'd start
    with something like that:

    // main_loop_3x6 - multiply 3 rows of A[][]
    // by 6 columns of B[][] assuming C-language order
    ...
    The loop consists of 9 loads, 4 pointer updates,
    1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
    DP adds. 51 total instructions.
    Ideally, it will run in 18 clocks, for IPC = 2.83.

    Given that starting 18 FP multiplies and 18 FP additions takes 18
    cycles, that is optimal. But you unrolled more than is necessary to
    achieve 2 FLOPC (FP operations per cycle). With less unrolling, you
    could have achieved the same 2 FLOPC and of course you would see higher
    IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
    loop.

    What is my point? My point is that I expect "medium-IPC" kernel like
    above to achieve higher FLOPS (== better performance) then simpler,
    smaller kernel with IPC in excess of 3.5.

    These days, with power limits resulting in lower clocks for programs
    that do more work, yes, I guess that you will see better FLOPS from
    variants that execute fewer instructions. But in the 90s, CPUs ran at
    their rated clock rate no matter what, and a 21164 would run a variant
    that does 2 FLOPC at the same speed as any other 2 FLOPC variant,
    whether that variant performs 0.83 non-flop instructions/cycle or 1.9
    non-flop instructions/cycle.

    But yes, 5.9 IPC on matrix300 shows little about the matrix multiply
    performance. Still, I think that the point is that there are many
    hurdles that might result in a lower IPC (for code where only 6 IPC
    means 2 FLOPC); the fact that they achieved 5.9 indicates that they
    managed to lower the hurdles a lot. True, it would be better if they
    could have shown it with code where 6 IPC is more meaningful.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Mar 4 19:43:11 2024
    Anton Ertl wrote:

    Michael S <[email protected]> writes:
    On Sun, 03 Mar 2024 22:22:37 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    But even in that not particularly useful answer IPC appears to be the
    least useful part. Far worse than FLOPS/Hz.


    I guess that the 21164 also showed close to 4 IPC on the 4-wide 21164
    on matrix300, while its 2 integer units would limit it to much lower
    performance on, say, intmm.

    - anton

    I don't know about specific case of matrix300 and what transformations
    are allowed by SPEC rules and what not, but if I were tasked with
    writing generic DGEMM for Alpha 21164 with maximal performance on
    non-small relatively square matrices as a main goal, then I'd start
    with something like that:

    // main_loop_3x6 - multiply 3 rows of A[][]
    // by 6 columns of B[][] assuming C-language order
    ....
    The loop consists of 9 loads, 4 pointer updates,
    1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
    DP adds. 51 total instructions.
    Ideally, it will run in 18 clocks, for IPC = 2.83.

    Given that starting 18 FP multiplies and 18 FP additions takes 18
    cycles, that is optimal. But you unrolled more than is necessary to
    achieve 2FlOPC (FP operations per cycle). With less unrolling, you
    could have achieved the same 2FLOPC and of course you would see higher
    IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
    loop.

    What is my point? My point is that I expect "medium-IPC" kernel like
    above to achieve higher FLOPS (== better performance) then simpler,
    smaller kernel with IPC in excess of 3.5.

    These days, with power limits resulting in lower clocks for programs
    that do more work, yes, I guess that you will see better FLOPS from
    variants that execute fewer instructions. But in the 90s, CPUs ran at
    their rated clock rate no matter what, and a 21164 would run a variant
    that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
    whether that variant performs 0.83 non-flop instructions/cycle or 1.9
    non-flop instructions/cycle.

    But yes, 5.9 IPC on matrix300 shows little about the matrix multiply
    performance. Still, I think that the point is that there are many
    hurdles that might result in a lower IPC (for code where only 6IPC
    means 2FLOPC), the fact that they achieved 5.9 indicates that they
    managed to lower the hurdles a lot; true, it would be better if they
    could have shown it with code where 6IPC is more meaningful.

    The processor for which that IPC was stated had a 16KB L1 DM Cache
    with 4 banks used twice per cycle and a 16 byte line, so DGEMM was
    <essentially> always taking cache misses (every other cycle). Most
    of the performance of the overall design came down to 3 things:

    The DRAM memory system, which could start 2 new accesses every cycle
    (1 RD, 1 WT);
    The zero-cycle branch mispredict repair; and
    The short 6-cycle pipeline from Fetch to Retire.
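
The miss rate follows from the line size: a 16-byte line holds two 8-byte doubles, so a stream of consecutive doubles that is not already cached misses on every other access. The toy model below shows only that effect for a 16KB direct-mapped cache; the banking and the two accesses per cycle are not modelled, and the function name is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    LINE_BYTES  = 16,
    CACHE_BYTES = 16 * 1024,
    NUM_LINES   = CACHE_BYTES / LINE_BYTES
};

/* Count misses for n_doubles sequential 8-byte loads starting at byte
   address `base`, in an initially empty direct-mapped cache. */
size_t count_stream_misses(uint64_t base, size_t n_doubles)
{
    uint64_t tag[NUM_LINES];
    unsigned char valid[NUM_LINES] = {0};
    size_t misses = 0;

    assert(base % 8 == 0);
    for (size_t i = 0; i < n_doubles; i++) {
        uint64_t line = (base + 8 * (uint64_t)i) / LINE_BYTES;
        size_t idx = (size_t)(line % NUM_LINES);
        if (!valid[idx] || tag[idx] != line) {   /* miss: fill the line */
            valid[idx] = 1;
            tag[idx] = line;
            misses++;
        }
    }
    return misses;
}
```

Starting at a line-aligned address, `count_stream_misses(0, 1000)` reports 500 misses, that is, one miss per two accesses.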

    And most of this centered around what we called the conditional
    cache--a RoB for memory <if you will>--a place where ST could be
    placed and LDs could access but would not get written to L1 if
    the instruction packet could not retire (mispredict or exception).
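
The behaviour described above (stores held aside, visible to later loads, written to L1 only on retirement) can be sketched in software. The thread does not describe the actual 88120 structure in detail, so everything here is a toy model under assumptions: the sizes, the names, and the use of a flat array in place of the L1 cache are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MEM_WORDS = 64, CC_ENTRIES = 8 };   /* hypothetical sizes */

struct cond_cache {
    double mem[MEM_WORDS];      /* stands in for the L1 data cache */
    size_t addr[CC_ENTRIES];    /* pending (unretired) stores, oldest first */
    double data[CC_ENTRIES];
    size_t count;
};

/* Speculative store: held in the buffer, memory is left untouched.
   Returns -1 when the buffer is full (the caller would have to stall). */
int cc_store(struct cond_cache *cc, size_t addr, double value)
{
    assert(addr < MEM_WORDS);
    if (cc->count == CC_ENTRIES)
        return -1;
    cc->addr[cc->count] = addr;
    cc->data[cc->count] = value;
    cc->count++;
    return 0;
}

/* Load: the youngest matching pending store wins, else read memory. */
double cc_load(const struct cond_cache *cc, size_t addr)
{
    assert(addr < MEM_WORDS);
    for (size_t i = cc->count; i-- > 0; )
        if (cc->addr[i] == addr)
            return cc->data[i];
    return cc->mem[addr];
}

/* Packet retires: pending stores are written to memory in program order. */
void cc_retire(struct cond_cache *cc)
{
    for (size_t i = 0; i < cc->count; i++)
        cc->mem[cc->addr[i]] = cc->data[i];
    cc->count = 0;
}

/* Mispredict or exception: pending stores vanish without touching memory. */
void cc_squash(struct cond_cache *cc)
{
    cc->count = 0;
}
```

In this model a squash is just a counter reset, which gives some intuition for how repair can be cheap when speculative stores never reach the cache.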

    No processor today is doing any of these (well maybe My 66000...)
    The BOOM RISC-V processor has a 7 stage front-end and 3 cycle branch
    mispredict repair...

    - anton

  • From Michael S@21:1/5 to Anton Ertl on Tue Mar 5 00:06:21 2024
    On Mon, 04 Mar 2024 18:18:35 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 03 Mar 2024 22:22:37 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    But even in that not particularly useful answer IPC appears to be
    the least useful part. Far worse than FLOPS/Hz.


    I guess that the 21164 also showed close to 4 IPC on the 4-wide
    21164 on matrix300, while its 2 integer units would limit it to
    much lower performance on, say, intmm.

    - anton

    I don't know about specific case of matrix300 and what
    transformations are allowed by SPEC rules and what not, but if I
    were tasked with writing generic DGEMM for Alpha 21164 with maximal
    performance on non-small relatively square matrices as a main goal,
    then I'd start with something like that:

    // main_loop_3x6 - multiply 3 rows of A[][]
    // by 6 columns of B[][] assuming C-language order
    ...
    The loop consists of 9 loads, 4 pointer updates,
    1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
    DP adds. 51 total instructions.
    Ideally, it will run in 18 clocks, for IPC = 2.83.
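
    The instruction counts above can be reproduced with a small calculation.
    This is a sketch under assumptions: the function name is invented, the
    load count is assumed to be one element per row of A plus one per column
    of B, and the cycle count assumes one FP multiply and one FP add can
    start per clock. The 4 pointer updates, the decrement and the branch are
    taken from the post as given.

```python
def block_kernel_counts(rows, cols, pointer_updates=4):
    """Per-iteration instruction mix for a rows x cols register-blocked inner loop."""
    loads = rows + cols                  # assumed: one element per A row, one per B column
    fmuls = rows * cols                  # one multiply per accumulator
    fadds = rows * cols                  # one add per accumulator
    overhead = pointer_updates + 1 + 1   # pointer updates, counter decrement, branch
    total = loads + fmuls + fadds + overhead
    cycles = max(fmuls, fadds)           # assumed: 1 FP multiply + 1 FP add start per clock
    return {
        "loads": loads,
        "total": total,
        "cycles": cycles,
        "ipc": total / cycles,
        "flopc": (fmuls + fadds) / cycles,
        "non_flop_per_cycle": (loads + overhead) / cycles,
    }


counts = block_kernel_counts(3, 6)
print(counts)
```

    For the 3x6 block this gives 51 instructions in 18 clocks, of which 15
    are not floating-point operations.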

    Given that starting 18 FP multiplies and 18 FP additions takes 18
    cycles, that is optimal. But you unrolled more than is necessary to
    achieve 2FLOPC (FP operations per cycle). With less unrolling, you
    could have achieved the same 2FLOPC and of course you would see higher
    IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
    loop.

    What is my point? My point is that I expect a "medium-IPC" kernel like the one
    above to achieve higher FLOPS (== better performance) than a simpler,
    smaller kernel with IPC in excess of 3.5.

    These days, with power limits resulting in lower clocks for programs
    that do more work, yes, I guess that you will see better FLOPS from
    variants that execute fewer instructions. But in the 90s, CPUs ran at
    their rated clock rate no matter what, and a 21164 would run a variant
    that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
    whether that variant performs 0.83 non-flop instructions/cycle or 1.9 non-flop instructions/cycle.

    But yes, 5.9 IPC on matrix300 shows little about the matrix multiply performance. Still, I think that the point is that there are many
    hurdles that might result in a lower IPC (for code where only 6IPC
    means 2FLOPC), the fact that they achieved 5.9 indicates that they
    managed to lower the hurdles a lot; true, it would be better if they
    could have shown it with code where 6IPC is more meaningful.

    - anton

  • From Michael S@21:1/5 to Anton Ertl on Tue Mar 5 00:18:33 2024
    On Mon, 04 Mar 2024 18:18:35 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 03 Mar 2024 22:22:37 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    But even in that not particularly useful answer IPC appears to be
    the least useful part. Far worse than FLOPS/Hz.


    I guess that the 21164 also showed close to 4 IPC on the 4-wide
    21164 on matrix300, while its 2 integer units would limit it to
    much lower performance on, say, intmm.

    - anton

    I don't know about specific case of matrix300 and what
    transformations are allowed by SPEC rules and what not, but if I
    were tasked with writing generic DGEMM for Alpha 21164 with maximal
    performance on non-small relatively square matrices as a main goal,
    then I'd start with something like that:

    // main_loop_3x6 - multiply 3 rows of A[][]
    // by 6 columns of B[][] assuming C-language order
    ...
    The loop consists of 9 loads, 4 pointer updates,
    1 counter decrement, 1 conditional branch, 18 DP multiplies and 18
    DP adds. 51 total instructions.
    Ideally, it will run in 18 clocks, for IPC = 2.83.

    Given that starting 18 FP multiplies and 18 FP additions takes 18
    cycles, that is optimal. But you unrolled more than is necessary to
    achieve 2FLOPC (FP operations per cycle). With less unrolling, you
    could have achieved the same 2FLOPC and of course you would see higher
    IPC. And as Mitch Alsup explains, his 5.9 IPC was for a non-unrolled
    loop.

    What is my point? My point is that I expect a "medium-IPC" kernel like the one
    above to achieve higher FLOPS (== better performance) than a simpler,
    smaller kernel with IPC in excess of 3.5.

    These days, with power limits resulting in lower clocks for programs
    that do more work, yes, I guess that you will see better FLOPS from
    variants that execute fewer instructions. But in the 90s, CPUs ran at
    their rated clock rate no matter what, and a 21164 would run a variant
    that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
    whether that variant performs 0.83 non-flop instructions/cycle or 1.9 non-flop instructions/cycle.


    In the 90s, CPUs had other reasons to minimize the # of instructions and
    esp. the # of load instructions per task. E.g. too few banks in L1D
    cache, so the cache that in theory supports two accesses per clock in
    practice is closer to 1. E.g. very few hits served under miss. E.g. low
    associativity. E.g. theoretically 4-wide instruction Fetch/Decode that
    in practice delivers 4 decoded instructions only when all inner planets
    in the solar system are aligned.
    According to my understanding, the 21164, being a speed racer, suffered from
    that sort of problem more than most competitors.


    But yes, 5.9 IPC on matrix300 shows little about the matrix multiply performance. Still, I think that the point is that there are many
    hurdles that might result in a lower IPC (for code where only 6IPC
    means 2FLOPC), the fact that they achieved 5.9 indicates that they
    managed to lower the hurdles a lot; true, it would be better if they
    could have shown it with code where 6IPC is more meaningful.

    - anton

  • From MitchAlsup1@21:1/5 to Michael S on Tue Mar 5 00:05:45 2024
    Michael S wrote:

    On Mon, 04 Mar 2024 18:18:35 GMT
    [email protected] (Anton Ertl) wrote:

    These days, with power limits resulting in lower clocks for programs
    that do more work, yes, I guess that you will see better FLOPS from
    variants that execute fewer instructions. But in the 90s, CPUs ran at
    their rated clock rate no matter what, and a 21164 would run a variant
    that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
    whether that variant performs 0.83 non-flop instructions/cycle or 1.9
    non-flop instructions/cycle.


    In the 90s, CPUs had other reasons to minimize the # of instructions and

    Everyone always has excellent reasons to minimize the number of instructions.

    Over in CISC-land, it takes fewer instructions to get the job done.
    Over in RISC-land, the instructions run in fewer nanoseconds.

    The critical term in performance is::

    seconds/program = instructions/program × cycles/instruction × seconds/cycle

    RISCs tend to get instructions/program wrong and cycles/instruction right, while
    CISCs tend to get instructions/program right and cycles/instruction wrong.
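
    The three-term product above is easy to play with numerically. The
    sketch below is only an illustration: the function name is invented and
    the two machines are hypothetical, with numbers made up to show the
    trade-off rather than measured on any real CPU.

```python
def seconds_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """seconds/program = instructions/program x cycles/instruction x seconds/cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


# Hypothetical machines at the same 1 ns cycle time (invented numbers).
cisc_like = seconds_per_program(1.0e9, 2.0, 1.0e-9)   # fewer instructions, worse CPI
risc_like = seconds_per_program(1.5e9, 1.0, 1.0e-9)   # more instructions, better CPI

print(f"CISC-like: {cisc_like:.2f} s, RISC-like: {risc_like:.2f} s")
```

    With these made-up inputs the machine with 50% more instructions still
    finishes first, because its cycles/instruction term is half as large.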

    I happen to believe that between RISC and CISC is a realm where one needs
    fewer instructions but sacrifices essentially nothing in the frequency department.

    My 66000 tends to have only 10% more instructions than VAX while RISC-V
    tends to have 50% more instructions than VAX--My 66000 needs only 71% of
    the instruction count of RISC-V.
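
    As a quick check of how the two VAX-relative figures combine, here is a
    minimal sketch. It uses only the rounded 10% and 50% figures from the
    post; the function name is invented for illustration.

```python
def relative_instruction_count(a_vs_base, b_vs_base):
    """Instruction count of ISA a relative to ISA b, given both relative to a common base."""
    return a_vs_base / b_vs_base


my66000_vs_vax = 1.10   # "only 10% more instructions than VAX"
riscv_vs_vax = 1.50     # "50% more instructions than VAX"

ratio = relative_instruction_count(my66000_vs_vax, riscv_vs_vax)
print(f"My 66000 / RISC-V instruction count: {ratio:.1%}")
```

    The rounded inputs alone give a little over 73%, in the same
    neighborhood as the 71% quoted; the difference may simply come from
    unrounded measurements.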

    esp. the # of load instructions per task. E.g. too few banks in L1D
    cache, so the cache that in theory supports two accesses per clock in practice is closer to 1.

    CISCs generally have a 45%-50% memory reference density, while
    RISCs generally have a 30%-35% memory reference density. So, CISCs tend
    to run into the cache banking wall at 2 IPC, while RISCs delay that wall
    until 3 IPC.
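
    One way to see where the 2 IPC and 3 IPC walls come from is to divide
    the cache bandwidth by the memory reference density. The sketch below
    assumes an effective bandwidth of one access per cycle (the "in practice
    is closer to 1" figure quoted above) and uses the midpoints of the two
    density ranges; the function name is invented.

```python
def banking_wall_ipc(mem_ref_density, accesses_per_cycle=1.0):
    """IPC at which memory references alone saturate the cache ports."""
    return accesses_per_cycle / mem_ref_density


cisc_wall = banking_wall_ipc(0.475)   # midpoint of the 45%-50% range
risc_wall = banking_wall_ipc(0.325)   # midpoint of the 30%-35% range

print(f"CISC wall: {cisc_wall:.2f} IPC, RISC wall: {risc_wall:.2f} IPC")
```

    Under that one-access-per-cycle assumption the walls land at roughly
    2.1 and 3.1 IPC, matching the figures in the post.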

    E.g. very few hits served under miss.

    Accesses are correlated, so this is to be expected. The real question is whether you can still perform with miss under miss !! Even if you don't
    take hits, you can still get the next request out "there" sooner. Sooner
    saves latency.

    E.g. low associativity.

    Associativity costs power and area.

    E.g. theoretically 4-wide instruction Fetch/Decode that
    in practice delivers 4 decoded instructions only when all inner planets
    in solar system are aligned.

    A 4-wide instruction fetch yields only 2.5 instructions per random access.
    This is just std math:: (1+2+3+4)/4 = 2.5
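
    The same average can be computed for any fetch width. A minimal sketch,
    assuming the fetch block is aligned, the entry point is uniformly random
    within it, and no taken branch cuts the block short; the function name is
    invented.

```python
from fractions import Fraction


def average_fetch_yield(width):
    """Mean instructions delivered per fetch for a uniformly random entry slot."""
    # Entering at slot s of an aligned width-wide block delivers width - s instructions.
    yields = [width - slot for slot in range(width)]
    return Fraction(sum(yields), width)


avg4 = average_fetch_yield(4)
print(avg4, float(avg4))
```

    In general the result is (width + 1) / 2, so widening the fetch only
    helps by half an instruction per extra slot on random entry.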

    But access to good predication means up to 1/3rd of all short branches can
    be avoided. Few ISAs have access to "good" predication. Here, a good solution
    for predication drives the random 4-wide fetch access to deliver 3.25
    instructions per fetch; a 50% increase.

    According to my understanding, 21164 being speed racer suffered from
    that sort of problems more than most competitors.

    Because it was wider and faster, it was more dependent on "everything
    working well all the time", and the fact that it ran at a higher frequency
    than others meant all its bad cache behavior got multiplied by the
    latency to deeper levels of the memory hierarchy!

  • From Stephen Fuld@21:1/5 to All on Mon Mar 4 17:37:48 2024
    On 3/4/2024 4:05 PM, MitchAlsup1 wrote:
    Michael S wrote:

    On Mon, 04 Mar 2024 18:18:35 GMT
    [email protected] (Anton Ertl) wrote:

    These days, with power limits resulting in lower clocks for programs
    that do more work, yes, I guess that you will see better FLOPS from
    variants that execute fewer instructions.  But in the 90s, CPUs ran at
    their rated clock rate no matter what, and a 21164 would run a variant
    that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
    whether that variant performs 0.83 non-flop instructions/cycle or 1.9
    non-flop instructions/cycle.


    In the 90s, CPUs had other reasons to minimize the # of instructions and

    Everyone always has excellent reasons to minimize the number of instructions.

    Over in CISC-land, it takes fewer instructions to get the job done.
    Over in RISC-land, the instructions run in fewer nanoseconds.

    The critical term in performance is::

    seconds/program = instructions/program × cycles/instruction × seconds/cycle

    RISCs tend to get instructions/program wrong and cycles/instruction
    right, while
    CISCs tend to get instructions/program right and cycles/instruction wrong.

    I happen to believe that between RISC and CISC is a realm where one needs fewer instructions but sacrifices essentially nothing in the frequency department.

    My 66000 tends to have only 10% more instructions than VAX while RISC-V
    tends to have 50% more instructions than VAX--My 66000 needs only 71%
    of the instruction count of RISC-V.

    esp. the # of load instructions per task. E.g. too few banks in L1D
    cache, so the cache that in theory supports two accesses per clock in
    practice is closer to 1.

    CISCs generally have a 45%-50% memory reference density, while
    RISCs generally have a 30%-35% memory reference density.

    If those percentages are number of loads & stores divided by total
    instruction count, isn't this just a restatement of your previous point
    that CISCs need fewer instructions to do the job? i.e. the *time*
    between loads or stores is the same for RISCs and CISCs?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 02:33:21 2024
    Stephen Fuld wrote:

    On 3/4/2024 4:05 PM, MitchAlsup1 wrote:
    Michael S wrote:

    On Mon, 04 Mar 2024 18:18:35 GMT
    [email protected] (Anton Ertl) wrote:

    These days, with power limits resulting in lower clocks for programs
    that do more work, yes, I guess that you will see better FLOPS from
    variants that execute fewer instructions.  But in the 90s, CPUs ran at
    their rated clock rate no matter what, and a 21164 would run a variant
    that does 2 FLOPc at the same speed as any other 2 FLOPC variant,
    whether that variant performs 0.83 non-flop instructions/cycle or 1.9
    non-flop instructions/cycle.


    In the 90s, CPUs had other reasons to minimize the # of instructions and

    Everyone always has excellent reasons to minimize the number of
    instructions.

    Over in CISC-land, it takes fewer instructions to get the job done.
    Over in RISC-land, the instructions run in fewer nanoseconds.

    The critical term in performance is::

    seconds/program = instructions/program × cycles/instruction × seconds/cycle

    RISCs tend to get instructions/program wrong and cycles/instruction
    right, while
    CISCs tend to get instructions/program right and cycles/instruction wrong.

    I happen to believe that between RISC and CISC is a realm where one needs
    fewer instructions but sacrifices essentially nothing in the frequency
    department.

    My 66000 tends to have only 10% more instructions than VAX while RISC-V
    tends to have 50% more instructions than VAX--My 66000 needs only 71%
    of the instruction count of RISC-V.

    esp. the # of load instructions per task. E.g. too few banks in L1D
    cache, so the cache that in theory supports two accesses per clock in
    practice is closer to 1.

    CISCs generally have a 45%-50% memory reference density, while
    RISCs generally have a 30%-35% memory reference density.

    If those percentages are number of loads & stores divided by total instruction count, isn't this just a restatement of your previous point
    that CISCs need fewer instructions to do the job? i.e. the *time*
    between loads or stores is the same for RISCs and CISCs?


    Not when you include various other facts::
    CISCs tend to have fewer registers
    CISCs tend to have LD-OPs and some have LD-OP-STs
    Both of the above give the compiler the illusion that inbound memory
    references are less expensive than a typical LD because you get the
    LD, and you don't have to waste a precious register. Thus there are
    more memory references--but RISC compilers have taught us that more
    registers beats LD-OPs--pipeline designers have taught us that thinner
    pipelines perform better--both stand against LD-OPs and LD-OP-STs.

    VAX went so far as to allow any operand and any result to be memory
    {Most of us now believe this was a massive overstep.}

    CISCs really do perform more memory references--not by as much as the
    above statistics imply, but significantly more memory references.

  • From Stephen Fuld@21:1/5 to All on Mon Mar 4 20:50:18 2024
    On 3/4/2024 6:33 PM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/4/2024 4:05 PM, MitchAlsup1 wrote:

    snip

    CISCs generally have a 45%-50% memory reference density, while
    RISCs generally have a 30%-35% memory reference density.

    If those percentages are number of loads & stores divided by total
    instruction count, isn't this just a restatement of your previous
    point that CISCs need fewer instructions to do the job?  i.e. the
    *time* between loads or stores is the same for RISCs and CISCs?


    Not when you include various other facts::
    CISCs tend to have fewer registers
    CISCs tend to have LD-OPs and some have LD-OP-STs
    Both of the above give the compiler the illusion that inbound memory references are less expensive than a typical LD because you get the
    LD, and you don't have to waste a precious register. Thus there are
    more memory references--but RISC compilers have taught us that more
    registers beats LD-OPs--pipeline designers have taught us that thinner
    pipelines perform better--both stand against LD-OPs and LD-OP-STs.

    VAX went so far as to allow any operand and any result to be memory
    {Most of us now believe this was a massive overstep.}

    CISCs really do perform more memory references--not by as much as the
    above statistics imply, but significantly more memory references.

    Interesting. Thanks.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to [email protected] on Tue Mar 5 14:48:31 2024
    [email protected] (MitchAlsup1) writes:
    Stephen Fuld wrote:

    On 3/4/2024 4:05 PM, MitchAlsup1 wrote:


    CISCs generally have a 45%-50% memory reference density, while
    RISCs generally have a 30%-35% memory reference density.

    If those percentages are number of loads & stores divided by total
    instruction count, isn't this just a restatement of your previous point
    that CISCs need fewer instructions to do the job? i.e. the *time*
    between loads or stores is the same for RISCs and CISCs?


    Not when you include various other facts::
    CISCs tend to have fewer registers

    How much of that is because active CISC architectures
    are forty or fifty years old?

    Would a modern, designed from scratch, CISC architecture
    still restrict the number of registers?

    If memory access ever becomes as fast as register access,
    all bets will be off...

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 5 15:44:29 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Stephen Fuld wrote:

    On 3/4/2024 4:05 PM, MitchAlsup1 wrote:


    CISCs generally have a 45%-50% memory reference density, while
    RISCs generally have a 30%-35% memory reference density.

    If those percentages are number of loads & stores divided by total
    instruction count, isn't this just a restatement of your previous point
    that CISCs need fewer instructions to do the job? i.e. the *time*
    between loads or stores is the same for RISCs and CISCs?


    Not when you include various other facts::
    CISCs tend to have fewer registers

    How much of that is because active CISC architectures
    are forty or fifty years old?

    The problems of encoding remain as relevant today as 50 years ago.
    But the things one wants the ISA to do are larger today than 50 YA.
    Those encodings with LD-OPs are pretty much restricted to 16 registers
    (16 base registers) and here you still have OpCode mapping difficulties.

    If you give up on LD-OPs to gain register count, you are already in the RISC-camp.

    Would a modern, designed from scratch, CISC architecture
    still restrict the number of registers?

    I wanted to do a 64-bit VAX minus the indirect address modes and
    give it 32 registers. Never got around to it though.

    My experience with My 66000 indicates one can get vanishingly close
    to VAX instruction density (and count) with a RISC ISA done right.

    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

  • From Scott Lurndal@21:1/5 to [email protected] on Tue Mar 5 16:22:24 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

    There were a number of historic implementations where the
    registers were actually stored in low memory.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 5 17:33:23 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

    There were a number of historic implementations where the
    registers were actually stored in low memory.


    Yes, but that is making registers as slow as memory,
    not making memory as fast as registers.

  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 5 17:57:14 2024
    [email protected] (Scott Lurndal) writes:
    Would a modern, designed from scratch, CISC architecture
    still restrict the number of registers?

    Designed from scratch? AVX-512 supports 32 xmm/ymm/zmm registers.
    Intel APX will support 32 GPRs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to [email protected] on Tue Mar 5 18:10:06 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

    There were a number of historic implementations where the
    registers were actually stored in low memory.


    Yes, but that is making registers as slow as memory,
    not making memory as fast as registers.

    With something like SRAM, with less than 1ns latency, then. With that
    you might completely eliminate registers and do everything
    memory-to-memory. The small gain lost by eliminating
    registers will likely be offset by fewer instructions
    to execute. Sufficient for most desktop users, surely.

    That's assuming SRAM can scale at reasonable cost due
    to some future technology breakthrough (or some other
    future non-volatile memory technology with favorable
    access timing).

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 5 19:28:01 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

    There were a number of historic implementations where the
    registers were actually stored in low memory.


    Yes, but that is making registers as slow as memory,
    not making memory as fast as registers.

    With something like SRAM, with less than 1ns latency, then. With that
    you might completely eliminate registers and do everything
    memory-to-memory. The small gain lost by eliminating
    registers will likely be offset by fewer instructions
    to execute. Sufficient for most desktop users, surely.

    Registers are 200ps. While on-die SRAM can cycle at 200ps,
    you still have to pay the latency to route address bits
    from AGEN to the SRAM arrays and route the data back,
    whereas one can access a register and run through the
    forwarding logic in the same cycle. So, you are essentially
    comparing something that costs ½* cycle to one that costs
    1¼ cycles.

    (*) maybe ¾-cycle on a GB physical RF.

    That's assuming SRAM can scale at reasonable cost due
    to some future technology breakthrough (or some other
    future non-volatile memory technology with favorable
    access timing).

    I have lived in FAB environments where I was told "SRAM
    will approach DRAM densities* in a generation or two", too.
    Still yet to happen.

    (*) They thought they had to keep the DRAM capacitor above
    a certain amount of fF to retain the 8ms (-16ms) refresh
    rates. What they really had to do was control leakage !

  • From Michael S@21:1/5 to [email protected] on Tue Mar 5 22:40:37 2024
    On Tue, 5 Mar 2024 19:28:01 +0000
    [email protected] (MitchAlsup1) wrote:

    I have lived in FAB environments where I was told "SRAM
    will approach DRAM densities* in a generation or two", too.
    Still yet to happen.

    (*) They thought they had to keep the DRAM capacitor above
    a certain amount of fF to retain the 8ms (-16ms) refresh
    rates. What they really had to do was control leakage !

    According to what I read on the RWT forum, in the last 3-4 years the tables
    have turned completely: the SRAM-to-DRAM area ratio is growing rather than
    shrinking. Slowly, of course; nowadays everything is slow.

  • From MitchAlsup1@21:1/5 to Michael S on Tue Mar 5 21:06:40 2024
    Michael S wrote:

    On Tue, 5 Mar 2024 19:28:01 +0000
    [email protected] (MitchAlsup1) wrote:

    I have lived in FAB environments where I was told "SRAM
    will approach DRAM densities* in a generation or two", too.
    Still yet to happen.

    (*) They thought they had to keep the DRAM capacitor above
    a certain amount of fF to retain the 8ms (-16ms) refresh
    rates. What they really had to do was control leakage !

    According to what I read on the RWT forum, in the last 3-4 years the tables
    have turned completely: the SRAM-to-DRAM area ratio is growing rather than
    shrinking. Slowly, of course; nowadays everything is slow.


    Yes, partially this is the FinFET effect, where it is harder, area-wise,
    to build the 6T SRAM cell than in planar,
    whereas with more metal layers it is not so difficult to build
    above-transistor capacitors {stacks of vias and minimum metal
    sq lambda pads}.

  • From Michael S@21:1/5 to Scott Lurndal on Tue Mar 5 22:34:39 2024
    On Tue, 05 Mar 2024 18:10:06 GMT
    [email protected] (Scott Lurndal) wrote:

    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

    There were a number of historic implementations where the
    registers were actually stored in low memory.


    Yes, but that is making registers as slow as memory,
    not making memory as fast as registers.

    With something like SRAM, with less than 1ns latency, then. With that
    you might completely eliminate registers and do everything
    memory-to-memory. The small gain lost by eliminating
    registers will likely be offset by fewer instructions
    to execute. Sufficient for most desktop users, surely.

    That's assuming SRAM can scale at reasonable cost due
    to some future technology breakthrough

    1ns latency means the size of your SRAM array is 256KB at best.
    More realistically 128 KB.
    The only possible breakthrough I can see at this front is 3D
    working much better than anticipated (and better than how well it works
    for 3D NAND flash). But even that can improve capacity by a factor of 200
    at best, more realistically by a factor of 100.
    So, in the best possible future scenario, given every benefit of the doubt,
    a 1ns latency requirement limits the size of your SRAM to ~50 MB.
    Make it at least 5 ns instead of 1 then you can start dreaming.
    Still more for benefit of your grandchildren rather than for yourself.
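
    The capacity estimate above is a straight multiplication, sketched here
    with the figures from this post. The function name is invented, and the
    post does not actually pair the "realistic" array size with the
    "realistic" stacking factor; that combination is an assumption made
    only for comparison.

```python
KIB_PER_MIB = 1024


def stacked_capacity_mib(planar_kib, stacking_factor):
    """Capacity in MiB of a 1 ns planar SRAM array scaled by a 3D stacking factor."""
    return planar_kib * stacking_factor / KIB_PER_MIB


best_case = stacked_capacity_mib(256, 200)    # 256 KB array, factor of 200
realistic = stacked_capacity_mib(128, 100)    # assumed pairing: 128 KB array, factor of 100

print(f"best case: {best_case} MiB, realistic: {realistic} MiB")
```

    The best case reproduces the ~50 MB figure; the more conservative
    pairing comes out four times smaller.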


    (or some other
    future non-volatile memory technology with favorable
    access timing).

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Mar 6 20:00:26 2024
    Paul A. Clayton wrote:

    On 3/5/24 10:44 AM, MitchAlsup1 wrote:
    Scott Lurndal wrote:
    [snip]

    My 66000, if I understand correctly, has registers in memory; it
    "merely" caches them in a faster storage.

    That is sort of correct:: there is a single flip-flop that points at
    all thread-state, and the RF is 4-cache-lines of that state. It is
    also true that the HW reads in new RFs and writes out old RFs as if
    the RF were a cache, but this is also true of non-RF thread-state--
    it is read in and written out too.

    It seems that 64-bit stack-pointer-relative accesses could be roughly as fast by using
    the offset as the index (each stack frame would be comparable to a
    different thread register context; the tradeoffs of extra storage
    for multiple stack frames ("multithreading" — alternating between
    indexing up and indexing down would provide some utilization
    flexibility with low indexing overhead) relative to pushing out
    early frames (normal "context switch"); such a cache would
    probably be limited in frame size cached.

    Smells too much like register windows which never outperformed
    the flat RF from MIPS. In any event, 50% of subroutines need no
    stack <accesses> and those that do typically only store 3 registers
    (for restore later).

    An L2 register set that can only be accessed for one operand might
    be somewhat similar to LD-OP.

    In high speed designs, there are at least 2 cycles of delay from AGEN
    to the L2 and 2 cycles of delay back. Even zero cycle access sees at
    least 4 cycles of latency, 5 if you count AGEN.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Mar 6 19:53:40 2024
    Paul A. Clayton wrote:

    On 3/5/24 10:44 AM, MitchAlsup1 wrote:
    Scott Lurndal wrote:
    [snip]
    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

    There seem to be three aspects that lead to this conclusion: the
    storage technology, the indexing method (including alignment and
    extension), and the method of determining presence ("tagging").

    Porting. SRAMs are single ported, Register files are multiported.

    Any storage technology used for registers could be used for
    memory. However, area and power costs must be justified by
    benefit.

    A theoretical cheap read/expensive write storage technology might
    be more appropriate for storage that is not conventional
    registers, potentially providing faster non-register storage (for
    reads). (Since register writes can be buffered and elided, if the
    buffering overhead in a different storage technology was low
    enough expensive writes might not prevent use for registers. Yet
    one might then view the buffers as "registers" and the 'backing
    storage' as more memory-like.)

    With no register renaming or sub-registers, the indexing method
    for registers is trivial and only a 'ready' bit is needed. (For
    old-style VLIW, even the ready bit is not needed.)

    For general memory addressing, there is a more complex address
    generation, the size of the operand will be variable (alignment
    and extension — word-addressed memory would avoid this overhead☺),

    Register access is by fixed bit pattern in the instruction,
    Memory access is by performing arithmetic on operands to get the
    address.

    and tag comparison would be required. General memory addressing
    also involves indirection through a register. (The instruction
    pointer is available early in the pipeline, so IP-relative
    accesses would not have the delay of reading an arbitrary
    register. Register address values that are rarely updated or
    usually updated by adding a constant [or replacing with a
    constant] could be hoisted earlier in the pipeline.)

    Register read, address generation, and tag comparison overheads
    can be removed for offset addressing by using the base pointer as
    the "tag" and the offset as the index. (e.g., "Knapsack: A Zero-
    Cycle Memory Hierarchy Component", Todd M. Austin et al., 1993;
    "Signature Buffer: Bridging Performance Gap between Registers and
    Caches", Lu Peng et al., 2004) "Internal fragmentation" of
    utilization increases the cost of such storage relative to the
    benefit and offset addressing constrains generality.
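
    The idea of using the base pointer as the "tag" and the offset as the
    index can be sketched as a toy data structure. This is not the design
    from either cited paper: the class, its methods, the 8-byte word size
    and the string register names are all assumptions made for illustration.

```python
class OffsetIndexedBuffer:
    """Toy model: the base register identity is the tag, the offset is the index."""

    def __init__(self, num_words, word_bytes=8):
        self.word_bytes = word_bytes
        self.base_reg = None                  # which base register the buffer tracks
        self.valid = [False] * num_words
        self.data = [0] * num_words

    def _index(self, base_reg, offset):
        # A hit needs the tracked base, an aligned offset, and an in-range slot.
        if base_reg != self.base_reg or offset % self.word_bytes:
            return None
        idx = offset // self.word_bytes
        return idx if 0 <= idx < len(self.data) else None

    def rebase(self, base_reg):
        # Tracking a new base (or a changed base value) invalidates every slot.
        self.base_reg = base_reg
        self.valid = [False] * len(self.valid)

    def write(self, base_reg, offset, value):
        idx = self._index(base_reg, offset)
        if idx is None:
            return False                      # not covered: use the normal cache path
        self.data[idx], self.valid[idx] = value, True
        return True

    def read(self, base_reg, offset):
        idx = self._index(base_reg, offset)
        if idx is None or not self.valid[idx]:
            return None                       # miss: fall back to the normal cache path
        return self.data[idx]


buf = OffsetIndexedBuffer(num_words=4)
buf.rebase("sp")
buf.write("sp", 16, 42)
hit = buf.read("sp", 16)       # same base, offset selects the slot directly
miss = buf.read("gp", 16)      # different base register: no match
```

    No address arithmetic or tag array lookup is needed on a hit, but
    slots for offsets the program never touches sit empty, which is the
    internal fragmentation cost mentioned above.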

    Register renaming introduces some complexity for addressing
    registers. A Register Alias Table lookup is a kind of "address
    generation".

    One would also desire a "memory" storage component to have larger
    capacity. A larger capacity storage will be more expensive to
    access and will favor denser (typically slower) storage.

    My 66000, if I understand correctly, has registers in memory; it
    "merely" caches them in a faster storage. It seems that 64-bit
    stack-pointer-relative accesses could be roughly as fast by using
    the offset as the index (each stack frame would be comparable to a
    different thread register context; the tradeoffs of extra storage
    for multiple stack frames ("multithreading" — alternating between
    indexing up and indexing down would provide some utilization
    flexibility with low indexing overhead) relative to pushing out
    early frames (normal "context switch"); such a cache would
    probably be limited in frame size cached.

    An L2 register set that can only be accessed for one operand might
    be somewhat similar to LD-OP.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 9 04:17:04 2024
    Paul A. Clayton wrote:

    On 3/6/24 3:00 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    [snip]
                                               It seems that 64-bit
    stack-pointer-relative accesses could be roughly as fast by using
    the offset as the index (each stack frame would be comparable to a
    different thread register context; the tradeoffs of extra storage
    for multiple stack frames ("multithreading" — alternating between
    indexing up and indexing down would provide some utilization
    flexibility with low indexing overhead) relative to pushing out
    early frames (normal "context switch"); such a cache would
    probably be limited in frame size cached.

    Smells too much like register windows which never outperformed
    the flat RF from MIPS. In any event, 50% of subroutines need no
    stack <accesses> and those that do typically only store 3 registers
    (for restore later).

    Register windows were intended to avoid save/restore overhead by
    retaining values in registers with renaming. A stack cache is
    meant to reduce the overhead of loads and stores to the stack —
    not just preserving and restoring registers. A direct-mapped stack
    cache is not entirely insane. A partial stack frame cache might
    cache up to 256 bytes (e.g.) with alternating frames indexing with
    inverted bits (to reduce interference) — one could even reserve a
    chunk (e.g., 64 bytes) of a frame that is not overlapped, by limiting
    the cached offset to be smaller than the cache.
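
    A minimal sketch of the indexing described in the paragraph above,
    under assumed parameters: a 256-byte cache of 8-byte words, offsets
    cached only below 192 bytes, and a one-bit frame parity that selects
    inverted index bits. The function name and all constants are
    hypothetical.

```python
CACHE_BYTES = 256            # assumed cache capacity
WORD_BYTES = 8               # assumed access granularity
CACHED_OFFSET_LIMIT = 192    # assumed: offsets at or above this are not cached
ENTRIES = CACHE_BYTES // WORD_BYTES


def stack_cache_index(offset, frame_parity):
    """Return the cache entry for a stack-pointer-relative offset, or None."""
    if offset < 0 or offset >= CACHED_OFFSET_LIMIT or offset % WORD_BYTES:
        return None
    idx = offset // WORD_BYTES
    if frame_parity & 1:
        idx ^= ENTRIES - 1   # alternate frames index with inverted bits
    return idx


even = {stack_cache_index(o, 0) for o in range(0, CACHED_OFFSET_LIMIT, WORD_BYTES)}
odd = {stack_cache_index(o, 1) for o in range(0, CACHED_OFFSET_LIMIT, WORD_BYTES)}
print(sorted(even - odd), sorted(odd - even))
```

    With these numbers each frame keeps eight entries (64 bytes) that the
    neighbouring frame can never index, because the cached offsets stop
    short of the cache size.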

    Such might be more useful than register windows, but that does
    not mean that it is actually a good option.

    If it is such a good option why has it not reached production ??

    An L2 register set that can only be accessed for one operand
    might be somewhat similar to LD-OP.

    In high speed designs, there are at least 2 cycles of delay from AGEN
    to the L2 and 2 cycles of delay back. Even zero cycle access sees at
    least 4 cycles of latency, 5 if you count AGEN.

    Presumably this is related to the storage technology used as well
    as the capacity.

    Purely wire delay due to the size of the L2 cache.



  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sat Mar 9 04:14:35 2024
    Paul A. Clayton wrote:

    On 3/6/24 2:53 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:

    On 3/5/24 10:44 AM, MitchAlsup1 wrote:
    Scott Lurndal wrote:
    [snip]
    If memory access ever becomes as fast as register access,
    all bets will be off...

    It won't, and never has.

    There seem to be three aspects that lead to this conclusion: the
    storage technology, the indexing method (including alignment and
    extension), and the method of determining presence ("tagging").

    Porting. SRAMs are single ported, Register files are multiported.

    Is this really a fundamental distinction?

    Yes, actually it is.

    If one uses SRAM to mean
    merely Static (not-refreshed) RAM, then register files are also
    SRAM. If one uses SRAM to mean classic 6-transistor SRAM cells,
    then the 8-transistor cells used in one of Intel's Atom L1 caches
    would not be SRAM.

    Would it surprise you to know that in order to make such a dual-ported
    SRAM cell "process tolerant"* the SRAM cell has to be at least
    as big as if there were 2 independent SRAM cells? That is:: if you
    want a 2-ported SRAM, use 2 SRAM instances, read them independently,
    but write both at the same time with the same value.

    (*) under some process variations, the SRAM cell will lose data if
    both read ports are used simultaneously--UNLESS the gain of the
    central inverter-pair is increased. For cells with more than 2 ports
    you get to a point where the cell cannot be written at some corners
    of the process space (strong N-channels with weak P-channels.)
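
    The two-instance arrangement described above can be modelled
    behaviourally as follows. This is only a software sketch of the
    read/write discipline, not a circuit description; the class and method
    names are made up for the example.

```python
class TwoReadPortRam:
    """Two single-ported copies: each read port owns one copy, writes hit both."""

    def __init__(self, depth):
        self.copy_a = [0] * depth   # serves read port A
        self.copy_b = [0] * depth   # serves read port B

    def write(self, addr, value):
        # Both instances are written at the same time with the same value,
        # so the copies never diverge.
        self.copy_a[addr] = value
        self.copy_b[addr] = value

    def read2(self, addr_a, addr_b):
        # The two reads are independent because they touch different arrays.
        return self.copy_a[addr_a], self.copy_b[addr_b]


ram = TwoReadPortRam(8)
ram.write(3, 99)
ram.write(5, 7)
print(ram.read2(3, 5))
```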

    Transistor level design of Register Files is similarly fraught with
    peril.

    At some point, the number of select lines and the number of bus-
    wires is big enough that you CAN hide the register file under the
    wires. Transistor count goes up as 2+2+2×ports while wire goes up
    by 2+selects×2×ports.

    The storage technology is not strictly bound to its use.

    In the abstract this is true enough.
    In practice it is not.

    (Obviously, high area/power per bit storage is biased to smaller
    capacity and higher latency storage is biased to infrequent access
    or prefetchable/throughput uses.)

    [snip]
    For general memory addressing, there is a more complex address
    generation, the size of the operand will be variable (alignment
    and extension — word-addressed memory would avoid this overhead☺),

    Register access is by fixed bit pattern in the instruction,
    Memory access is by performing arithmetic on operands to get the
    address.

    As noted later, memory accesses can also be indexed by a fixed bit
    pattern in the instruction. Determining whether a register ID bit
    field is actually used may well require less decoding than
    determining if an operation is a load based on stack pointer or
    global pointer with an immediate offset, but the difference would
    not seem to be that great. The offset size would probably also
    have to be checked — the special cache would be unlikely to
    support all offsets.

    Predecoding on insertion into the instruction cache could cache
    this usage information.

    You cannot predecode if the instruction is not of fixed size, (or
    if you do not add predecode bits ala Athlon, Opteron).

    [snip]
    Register read, address generation, and tag comparison overheads
    can be removed for offset addressing by using the base pointer as
    the "tag" and the offset as the index. (e.g., "Knapsack: A Zero-
    Cycle Memory Hierarchy Component", Todd M. Austin et al., 1993;
    "Signature Buffer: Bridging Performance Gap between Registers and
    Caches", Lu Peng et al., 2004) "Internal fragmentation" of
    utilization increases the cost of such storage relative to the
    benefit and offset addressing constrains generality.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun May 26 03:14:30 2024
    Paul A. Clayton wrote:

    On 3/8/24 11:14 PM, MitchAlsup1 wrote:


    Even with My 66000's variable length instructions, most (by
    frequency of occurrence) 32-bit immediates would be illegal
    instructions and more significant 32-bit words in 64-bit
    immediates would usually be illegal instructions, so one could
    probably have highly accurate speculative predecode-on-fill.

    Since the variable length decoder is only 32 gates (equivalent in
    size to 3 1-bit flip-flops) one can simply attach said decoder
    to every word of storage in the instruction buffer. And arrange
    a tree of "If I get picked, here are my follow on instructions"

    Now, once one has a unary pointer into the IB, one gets 2 inst
    in 1 gate of delay, 4 in 2 gates, 8 in 3 gates,...until you
    get eaten alive with wire delay.
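
    The doubling described above (2 instructions after one level, 4 after
    two, 8 after three) resembles pointer doubling. A small software
    sketch, assuming every word of the buffer has already been given a
    speculative length in words by its own local decoder; the function
    names and the example lengths are invented for illustration.

```python
def follow_on_tables(lengths, levels):
    """tables[k][i] = word index reached 2**k instructions after a start at i."""
    n = len(lengths)
    # Level 0: "if I get picked, here is my follow-on instruction".
    hop = [min(i + lengths[i], n) for i in range(n)] + [n]   # n means past the end
    tables = [hop]
    for _ in range(levels - 1):
        prev = tables[-1]
        tables.append([prev[prev[i]] for i in range(n + 1)])
    return tables


def instruction_starts(lengths, start, levels):
    """Starting words found after `levels` doubling steps (up to 2**levels)."""
    n = len(lengths)
    starts = [start]
    for table in follow_on_tables(lengths, levels):
        # Each level doubles the number of known starting points.
        starts = starts + [table[s] for s in starts]
    return [s for s in starts if s < n]


# Lengths at words 2, 5 and 6 are garbage guesses (they are not real starts).
lengths = [1, 2, 9, 1, 3, 7, 7, 1, 1, 1]
print(instruction_starts(lengths, 0, 2))
print(instruction_starts(lengths, 0, 3))
```

    The bogus lengths computed for non-starting words are never selected,
    which is why every word can be decoded speculatively in parallel.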

    Thus, if length decoding is easy, predecoding (into some kind of
    table) is unnecessary.

    If branch prediction fetch ahead used instruction addresses
    (rather than cache block addresses), a valid target prediction
    would provide accurate predecode for the following instructions
    and constrain the possible decodings for preceding instructions.

    Mistakes in predecode that mistook an immediate 32-bit word for an
    opcode-containing word might not be particularly painful.

    Not when these are masked out by the actual decode selection tree.

    Mistakenly "finding" a branch in predecode might not be that
    painful even if predicted taken — similar to a false BTB hit
    corrected in decode. Wrongly "finding" an optimizable load
    instruction might waste resources and introduce a minor glitch in
    decode (where the "instruction" has to be retranslated into an
    immediate component).

    It *feels* attractive to me to have predecode fill a BTB-like
    structure to reduce redundant data storage. Filling the "BTB" with
    less critical instruction data when there are few (immediate-
    based) branches seems less hurtful than losing some taken branch
    targets, though a parallel ordinary BTB (redundant storage) might
    compensate. The BTB-like structure might hold more diverse
    information that could benefit from early availability; e.g.,
    loads from something like a "Knapsack Cache". (Even loads from a
    more variable base might be sped by having a future file of two or
    three such base addresses — or even just the least significant
    bits — which could be accessed more quickly and earlier than the
    general register file. Bases that are changed frequently with
    dynamic values [not immediate addition] would rarely update the
    future file fast enough to be useful. I think some x86
    implementations did something similar by adding segment base and
    displacement early in the pipeline.) More generally, it seems that
    the instruction stream could be parsed and stored into components
    with different tradeoffs in latency, capacity, etc.

    I do not know if such "aggressive" predecode would be worthwhile
    nor what in-memory format would best manage the tradeoffs of
    density, parallelism, criticality, etc. or what "L1 cache" format
    would be best (with added specialization/utilization tradeoffs).

    It is a trade-off:: in a GBOoO design, adding a pipe stage cost
    around 2% (in an LBIO design around 5%) so the predictor has to
    buy more than 2% to "make the cut". It definitely would not make
    the cut in the LBIO design; it may or may not make the cut in a GBOoO
    design. What we can say is that the GBOoO design has to have some
    kind of branch prediction and not go so far as to assign it a name
    or a class.

  • From Thomas Koenig@21:1/5 to Paul A. Clayton on Sun May 26 20:50:44 2024
    Paul A. Clayton <[email protected]> schrieb:

    One type of predecode that has been commercially implemented (in a
    POWER processor) was storing calculated branch insets rather than
    offsets.

    What is a branch inset?

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun May 26 22:06:37 2024
    Thomas Koenig wrote:

    Paul A. Clayton <[email protected]> schrieb:

    One type of predecode that has been commercially implemented (in a
    POWER processor) was storing calculated branch insets rather than
    offsets.

    What is a branch inset?


    Low order bits of the target virtual address, so you can access ICache
    prior to the adder completing AGEN.
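
    A hedged sketch of what that answer describes: at predecode time the
    low-order bits of the branch target are computed once and stored in
    place of the raw offset, so fetch can index the ICache without waiting
    for the full-width add. The 12-bit inset width, the 64-byte line size,
    and the function names are assumptions for illustration, not details
    of any POWER implementation.

```python
INSET_BITS = 12                      # hypothetical width of the stored inset
INSET_MASK = (1 << INSET_BITS) - 1


def predecode_inset(branch_pc, displacement):
    """Done on instruction-cache fill: low-order bits of the target address."""
    return (branch_pc + displacement) & INSET_MASK


def fetch_index(inset, line_bytes=64):
    """Cache line selected directly from the stored inset, no adder needed."""
    return inset // line_bytes


def full_target(branch_pc, displacement):
    """The full-width add (AGEN) still runs, in parallel with the cache access."""
    return branch_pc + displacement


pc, disp = 0x10000FF0, 0x20
inset = predecode_inset(pc, disp)
print(hex(inset), fetch_index(inset), hex(full_target(pc, disp)))
```

    The stored inset is only valid for the address the branch was fetched
    from, since it folds the branch's own low-order address bits into the
    result.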
